This is an excerpt from the book Oracle Grid & Real Application Clusters.
I/O Fencing: An Exclusion Strategy
In some situations, leftover write operations from a failed database
instance reach the storage system after the recovery process has
started, for example when the cluster software has failed on a node
but the node itself is still running at the OS level. Because these
write operations are no longer in the proper serial order, they can
damage the consistency of the stored data. Therefore, when a cluster
node fails, the failed node must be fenced off from all shared disk
devices or disk groups. This methodology is called I/O fencing,
sometimes also known as disk fencing or failure fencing.
The main functions of I/O fencing are to prevent updates by failed
instances and to detect failures and prevent split brain in the
cluster. The Cluster Volume Manager, in association with the shared
storage unit, and the Cluster File System play a significant role in
preventing failed nodes from accessing shared devices.
For example, in Sun Cluster, disk fencing is performed through SCSI-2
reservations for dual-hosted SCSI devices and through SCSI-3 PR for
multi-hosted environments. VERITAS Advanced Cluster uses SCSI-3
persistent reservations to perform I/O fencing. On Linux clusters,
cluster file systems such as PolyServe and Sistina GFS can perform
I/O fencing using other methods, such as fabric fencing, which uses
the SAN access control mechanism.
SCSI-3 PR
SCSI-3 PR, which stands for Persistent Reservation, supports multiple
nodes accessing a device while simultaneously blocking access by
other nodes. SCSI-3 PR reservations are persistent across SCSI bus
resets and node reboots, and they also support multiple paths from a
host to a disk. SCSI-2 reservations, by contrast, are not persistent,
which means they do not survive node reboots.
SCSI-3 PR uses the concepts of registration and reservation. Each
participating system registers its own key with the SCSI-3 device,
and registered systems can then establish a reservation. With this
method, blocking write access is as simple as removing a system's
registration from the device. A system wishing to eject another
issues a Preempt and Abort command, which ejects the other node. Once
a node has been ejected, it has no registered key and therefore
cannot eject others. This method effectively avoids the split-brain
condition.
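The registration, reservation, and Preempt and Abort logic described above can be sketched as a small simulation. This is a toy model only: real systems issue PERSISTENT RESERVE IN/OUT commands to the device itself (for example via the sg_persist utility on Linux), and the class and method names below are invented for illustration.

```python
# Toy model of SCSI-3 Persistent Reservation fencing semantics.
# Not real SCSI; the device here is a Python object that tracks
# registered keys and a single reservation holder.

class PRDevice:
    """A shared disk that tracks registered keys and one reservation."""
    def __init__(self):
        self.registrations = set()   # keys currently registered
        self.reservation = None      # key holding the reservation

    def register(self, key):
        self.registrations.add(key)

    def reserve(self, key):
        # Only a registered system may establish the reservation.
        if key not in self.registrations:
            raise PermissionError("key not registered")
        if self.reservation is None:
            self.reservation = key

    def write(self, key):
        # A write is accepted only from a registered system.
        if key not in self.registrations:
            raise PermissionError("I/O fenced: no registration")
        return "ok"

    def preempt_and_abort(self, my_key, victim_key):
        # Preempt and Abort: remove the victim's registration (and its
        # reservation, if it held one), fencing off all its I/O.
        if my_key not in self.registrations:
            raise PermissionError("ejected nodes cannot eject others")
        self.registrations.discard(victim_key)
        if self.reservation == victim_key:
            self.reservation = my_key

disk = PRDevice()
disk.register("node1")
disk.register("node2")
disk.reserve("node1")

# node1 suspects node2 has failed and ejects it.
disk.preempt_and_abort("node1", "node2")

print(disk.write("node1"))               # node1 still has access
try:
    disk.write("node2")                  # node2 is fenced off
except PermissionError as e:
    print(e)
try:
    # An ejected node has no key, so it cannot eject others:
    disk.preempt_and_abort("node2", "node1")
except PermissionError as e:
    print(e)
```

The key point the model captures is the last one: because ejection removes the victim's key, the ejected node loses not only write access but also the ability to retaliate, which is what prevents a split-brain shootout.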
Another benefit of the SCSI-3 PR method is that because a node
registers the same key down each path, ejecting that single key
blocks all I/O paths from the node. SCSI-3 PR is implemented, for
example, by EMC Symmetrix, Sun T3, and Hitachi storage systems. A
SCSI-2 reservation, by contrast, works only on one path from one
host.
Arbitration through Quorum
Disks
With SCSI-2 reservations, the clusterware seeks to reserve a quorum
disk to break the tie when the cluster splits. A quorum disk is a
nominated device in the shared storage connected to the relevant
nodes. The reservation is enacted as a SCSI-2 ioctl: the node that is
granted the reservation causes the second node's attempt to fail. The
SCSI-2 reservation ioctl is part of the SCSI-2 command set and is
commonly implemented in most modern disk firmware. However, the
reservation is neither persistent, meaning it cannot survive reboots,
nor able to cope with multiple paths to the same device.
A quorum disk must be defined for a two-node cluster. This
arrangement enables any single node that obtains the vote of the
quorum disk to maintain a majority and continue as a viable cluster.
The clusterware forces the losing node out of the cluster.
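The quorum-disk tie-break can be sketched as a race for a single non-persistent reservation. Again, this is an illustrative model, not real clusterware: a real implementation issues a SCSI-2 reserve ioctl against the quorum device, and the names below are invented.

```python
# Toy model of quorum-disk arbitration in a split two-node cluster.
# The first node to reserve the quorum disk wins its vote; with one
# node vote plus the disk vote it holds a majority and survives.

class QuorumDisk:
    """Grants a reservation to the first caller only (SCSI-2 style)."""
    def __init__(self):
        self.owner = None

    def reserve(self, node):
        if self.owner is None:
            self.owner = node
            return True          # reservation granted
        return False             # already reserved: this node loses

def arbitrate(disk, nodes):
    """Each partitioned node races to reserve the quorum disk; only
    the winner remains a viable cluster member."""
    return [n for n in nodes if disk.reserve(n)]

disk = QuorumDisk()
# Split brain: both nodes are alive but cannot see each other,
# so each tries to reserve the quorum disk.
survivors = arbitrate(disk, ["node1", "node2"])
print(survivors)        # only the node that won the reservation
```

Because the reservation is not persistent, a reboot of the winning node releases it, which is why this scheme, unlike SCSI-3 PR, cannot fence a node across reboots or multiple paths.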