This is an excerpt from the book Oracle Grid & Real Application Clusters.
I/O Fencing: An Exclusion Strategy
In some situations, leftover write operations from a failed database
instance reach the storage system after the recovery process has
started, for example when the cluster software has failed on a node
but the node itself is still running at the OS level. Because these
write operations are no longer in the proper serial order, they can
damage the consistency of the stored data. Therefore, when a cluster
node fails, the failed node must be fenced off from all shared disk
devices or disk groups. This methodology is called I/O fencing,
sometimes also known as disk fencing or failure fencing.
The main functions of I/O fencing are to prevent updates by failed
instances and to detect failures and prevent split brain in the
cluster. The Cluster Volume Manager, in association with the shared
storage unit, and the Cluster File System play a significant role in
preventing failed nodes from accessing shared devices.
For example, in Sun Cluster, disk fencing is performed through SCSI-2
reservations for dual-hosted SCSI devices and through SCSI-3 PR for
multi-hosted environments. VERITAS Advanced Cluster uses SCSI-3
persistent reservations to perform I/O fencing. On Linux clusters,
cluster file systems such as PolyServe and Sistina GFS can perform
I/O fencing using other methods, such as fabric fencing, which uses
the SAN access control mechanism.
SCSI-3 PR
SCSI-3 PR, which stands for Persistent Reservation, supports multiple
nodes accessing a device while simultaneously blocking access by
other nodes. SCSI-3 PR reservations are persistent across SCSI bus
resets and node reboots, and they also support multiple paths from a
host to a disk. SCSI-2 reservations, by contrast, are not persistent,
which means they do not survive node reboots.
SCSI-3 PR uses the concepts of registration and reservation. Each
participating system registers its own key with the SCSI-3 device,
and registered systems can then establish a reservation. With this
method, blocking write access is as simple as removing a system's
registration from the device. A system wishing to eject another
issues a Preempt and Abort command, which ejects the other node. Once
a node has been ejected, it has no registered key and therefore
cannot eject others. This method effectively avoids the split-brain
condition.
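The registration, reservation, and Preempt and Abort logic described above can be sketched as a small simulation. This is a toy model only: real systems issue PERSISTENT RESERVE IN/OUT commands to the device itself (for example via the sg_persist utility on Linux), and the class and method names below are invented for illustration.

```python
# Toy model of SCSI-3 Persistent Reservation fencing semantics.
# Not real SCSI; the device here is a Python object that tracks
# registered keys and a single reservation holder.

class PRDevice:
    """A shared disk that tracks registered keys and one reservation."""
    def __init__(self):
        self.registrations = set()   # keys currently registered
        self.reservation = None      # key holding the reservation

    def register(self, key):
        self.registrations.add(key)

    def reserve(self, key):
        # Only a registered system may establish the reservation.
        if key not in self.registrations:
            raise PermissionError("key not registered")
        if self.reservation is None:
            self.reservation = key

    def write(self, key):
        # A write is accepted only from a registered system.
        if key not in self.registrations:
            raise PermissionError("I/O fenced: no registration")
        return "ok"

    def preempt_and_abort(self, my_key, victim_key):
        # Preempt and Abort: remove the victim's registration (and its
        # reservation, if it held one), fencing off all its I/O.
        if my_key not in self.registrations:
            raise PermissionError("ejected nodes cannot eject others")
        self.registrations.discard(victim_key)
        if self.reservation == victim_key:
            self.reservation = my_key

disk = PRDevice()
disk.register("node1")
disk.register("node2")
disk.reserve("node1")

# node1 suspects node2 has failed and ejects it.
disk.preempt_and_abort("node1", "node2")

print(disk.write("node1"))               # node1 still has access
try:
    disk.write("node2")                  # node2 is fenced off
except PermissionError as e:
    print(e)
try:
    # An ejected node has no key, so it cannot eject others:
    disk.preempt_and_abort("node2", "node1")
except PermissionError as e:
    print(e)
```

The key point the model captures is the last one: because ejection removes the victim's key, the ejected node loses not only write access but also the ability to retaliate, which is what prevents a split-brain shootout.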
Another benefit of the SCSI-3 PR method is that because a node
registers the same key down each path, ejecting that single key
blocks all I/O paths from the node. SCSI-3 PR is implemented, for
example, by EMC Symmetrix, Sun T3, and Hitachi storage systems. A
SCSI-2 reservation, by contrast, works only on one path from one
host.
Arbitration through Quorum
Disks
With SCSI-2 reservations, the clusterware seeks to reserve a quorum
disk to break the tie when the cluster splits. A quorum disk is a
nominated device in the shared storage connected to the relevant
nodes. The reservation is enacted as a SCSI-2 ioctl: the node that is
granted the reservation causes the second node's attempt to fail. The
SCSI-2 reservation ioctl is part of the SCSI-2 command set and is
commonly implemented in most modern disk firmware. However, the
reservation is neither persistent, meaning it cannot survive reboots,
nor able to cope with multiple paths to the same device.
A quorum disk must be defined for a two-node cluster. This
arrangement enables any single node that obtains the vote of the
quorum disk to maintain a majority and continue as a viable cluster.
The clusterware forces the losing node out of the cluster.
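The quorum-disk tie-break can be sketched as a race for a single non-persistent reservation. Again, this is an illustrative model, not real clusterware: a real implementation issues a SCSI-2 reserve ioctl against the quorum device, and the names below are invented.

```python
# Toy model of quorum-disk arbitration in a split two-node cluster.
# The first node to reserve the quorum disk wins its vote; with one
# node vote plus the disk vote it holds a majority and survives.

class QuorumDisk:
    """Grants a reservation to the first caller only (SCSI-2 style)."""
    def __init__(self):
        self.owner = None

    def reserve(self, node):
        if self.owner is None:
            self.owner = node
            return True          # reservation granted
        return False             # already reserved: this node loses

def arbitrate(disk, nodes):
    """Each partitioned node races to reserve the quorum disk; only
    the winner remains a viable cluster member."""
    return [n for n in nodes if disk.reserve(n)]

disk = QuorumDisk()
# Split brain: both nodes are alive but cannot see each other,
# so each tries to reserve the quorum disk.
survivors = arbitrate(disk, ["node1", "node2"])
print(survivors)        # only the node that won the reservation
```

Because the reservation is not persistent, a reboot of the winning node releases it, which is why this scheme, unlike SCSI-3 PR, cannot fence a node across reboots or multiple paths.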