Fabric Fencing

Oracle RAC Cluster Tips by Burleson Consulting

This is an excerpt from the bestselling book Oracle Grid & Real Application Clusters. To get immediate access to the code depot of working RAC scripts, buy it directly from the publisher and save more than 30%.

PolyServe Matrix Server, a cluster file system widely used on Linux clusters, implements its node exclusion strategy using the fabric-fencing approach. Matrix Server includes a Storage Control Layer that uses a SAN access control mechanism to arbitrate which servers have access to which storage resources. A node is excluded by turning off the Fibre Channel switch ports to which the offending node is attached.

Advantages of this approach:

* Isolates only the node's SAN access

* Permits non-SAN applications to continue to run

* No extra hardware is required
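The fabric-fencing idea described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: `FabricSwitch`, `StorageControlLayer`, and their methods are hypothetical names invented for illustration, not the PolyServe API, and the switch is modeled as a plain dictionary of port states.

```python
from dataclasses import dataclass, field


@dataclass
class FabricSwitch:
    """Toy model of a Fibre Channel switch: port number -> enabled flag."""
    ports: dict = field(default_factory=dict)

    def attach(self, port: int) -> None:
        self.ports[port] = True

    def disable(self, port: int) -> None:
        self.ports[port] = False

    def is_enabled(self, port: int) -> bool:
        return self.ports.get(port, False)


@dataclass
class StorageControlLayer:
    """Hypothetical arbiter that records which switch ports each node uses."""
    switch: FabricSwitch
    node_ports: dict = field(default_factory=dict)

    def register(self, node: str, ports: list) -> None:
        self.node_ports[node] = list(ports)
        for port in ports:
            self.switch.attach(port)

    def fence(self, node: str) -> list:
        """Disable every port the node is attached to; return those ports."""
        fenced = []
        for port in self.node_ports.get(node, []):
            self.switch.disable(port)
            fenced.append(port)
        return fenced

    def can_reach_storage(self, node: str) -> bool:
        return any(self.switch.is_enabled(p) for p in self.node_ports.get(node, []))


switch = FabricSwitch()
scl = StorageControlLayer(switch)
scl.register("node1", [1, 2])
scl.register("node2", [3, 4])

# node2 is judged unreliable: cut its SAN paths, leave the server running.
fenced_ports = scl.fence("node2")
```

In this model the fenced node loses its storage paths while the healthy node keeps its own, and nothing is powered off, which mirrors the advantages listed above.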

Exclusion with STOMITH approach

This method uses a network-controlled power switch to cut off a server's power supply when the server is no longer deemed a reliable member of the cluster. Some of the characteristics of this approach:

* It is highly disruptive to the problem node.

* It is universal - it operates on all resource types equally well, and simultaneously.

* It is very simple in concept and in practice.

* There are virtually no support problems or version interactions to complicate development, testing, and maintenance.

* Overall system availability is often helped by the reboot.

Linux clusters adopted this method early in their development, when Linux SCSI reserve/release support was immature and not consistently implemented.

However, there are certain issues with this method:

* Forcibly shutting down a node can cause data integrity problems.

* Nodes can shoot each other and shut down the entire cluster.

* A server that has been shot down cannot be accessed to diagnose the problem.

Sistina GFS supports multiple cascading I/O fencing methods, including manual fencing, network power control, and Fibre Channel switch zone control.
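Cascading fencing can be pictured as trying one method after another until one succeeds. The sketch below is an assumption-laden illustration, not GFS code: the function names are hypothetical, the three methods are stubs that only return a success flag, and the ordering (zone control first, manual intervention last) is an assumed policy rather than something the source specifies.

```python
from typing import Callable, List, Tuple

# A fencing method is a (name, callable) pair; the callable returns True on success.
FenceMethod = Tuple[str, Callable[[str], bool]]


def fence_with_cascade(node: str, methods: List[FenceMethod]) -> str:
    """Try each fencing method in order; return the name of the first that succeeds."""
    for name, method in methods:
        try:
            if method(node):
                return name
        except Exception:
            # Treat an error like a failed attempt and fall through to the next method.
            continue
    raise RuntimeError(f"all fencing methods failed for {node}")


def zone_control(node: str) -> bool:
    """Stub: pretend the Fibre Channel switch could not be reached."""
    return False


def power_control(node: str) -> bool:
    """Stub: pretend the network power switch cut power successfully."""
    return True


def manual(node: str) -> bool:
    """Stub: stands in for an operator confirming the node is down."""
    return True


used = fence_with_cascade(
    "node3",
    [
        ("fibre channel zone control", zone_control),
        ("network power control", power_control),
        ("manual", manual),
    ],
)
```

Here the zone-control stub fails, so the cascade falls through to power control and stops there; the manual step is never reached.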

Cache Coherency and Lock Management

One of the most critical features of a parallel database is its ability to control global concurrency of the data (pages or blocks) held in each node's cache. Because each node has its own local cache holding current data blocks, the status of those blocks and access to them must be controlled globally. Other nodes may need to access the same blocks concurrently, so blocks are moved between nodes frequently as they are needed. In addition, the status of the blocks in cache must be monitored effectively and accurately. Lock acquisition, lock release, and lock conversion must be performed rapidly. Low-latency, high-speed communication between the nodes is therefore an essential requirement.
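The three lock operations mentioned above (acquisition, release, and conversion) can be sketched with a deliberately simplified two-mode model. This is an illustrative assumption, not Oracle's lock manager: the `BlockLock` class is hypothetical, only shared (`S`) and exclusive (`X`) modes are modeled, and requests that cannot be granted are refused rather than queued.

```python
# Compatibility of a held mode with a requested mode: only S with S coexists.
COMPATIBLE = {
    ("S", "S"): True,
    ("S", "X"): False,
    ("X", "S"): False,
    ("X", "X"): False,
}


class BlockLock:
    """Toy global lock on one data block, tracking the mode granted to each node."""

    def __init__(self):
        self.granted = {}  # node name -> "S" or "X"

    def _compatible(self, node, mode):
        # A request is checked only against locks held by *other* nodes.
        return all(
            COMPATIBLE[(held, mode)]
            for other, held in self.granted.items()
            if other != node
        )

    def acquire(self, node, mode):
        if self._compatible(node, mode):
            self.granted[node] = mode
            return True
        return False

    def convert(self, node, mode):
        """Change the mode of a lock the node already holds."""
        if node not in self.granted:
            raise KeyError(f"{node} holds no lock to convert")
        return self.acquire(node, mode)

    def release(self, node):
        self.granted.pop(node, None)


lock = BlockLock()
first = lock.acquire("node1", "S")      # granted
second = lock.acquire("node2", "S")     # granted: shared locks coexist
blocked = lock.convert("node1", "X")    # refused while node2 still holds S
lock.release("node2")
upgraded = lock.convert("node1", "X")   # granted once node2 has released
```

In a real cluster each of these steps involves messages between nodes, which is why the interconnect latency noted above matters so much.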

A data block can be present in the database buffers of more than one node, so when an update occurs, all other buffered copies become obsolete. The global cache control mechanism invalidates these obsolete data blocks. Another important feature is the way in which the cache is reconfigured when a node fails. To maintain the integrity of data blocks, the failed instance's resources need to be taken over, or re-mastered, by an instance on another node.
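Invalidation and re-mastering can be illustrated with a toy directory of cached blocks. This is a minimal sketch, not how Oracle RAC implements its global cache: the `GlobalCacheDirectory` class is hypothetical, the first node to touch a block is assumed to master it, and a failed master's blocks are assumed to move to a surviving holder (or, if none exists, to the first surviving node in sorted order).

```python
from collections import defaultdict


class GlobalCacheDirectory:
    """Toy directory tracking which nodes cache each block and which node masters it."""

    def __init__(self, nodes):
        self.nodes = set(nodes)
        self.holders = defaultdict(set)  # block id -> nodes holding a current copy
        self.master = {}                 # block id -> node that masters the block

    def read(self, node, block):
        self.master.setdefault(block, node)
        self.holders[block].add(node)

    def update(self, node, block):
        """Node modifies the block; return the nodes whose copies became obsolete."""
        self.master.setdefault(block, node)
        invalidated = self.holders[block] - {node}
        self.holders[block] = {node}
        return invalidated

    def node_failed(self, failed):
        """Remove the failed node and re-master its blocks on a surviving node."""
        self.nodes.discard(failed)
        if not self.nodes:
            raise RuntimeError("no surviving nodes")
        for holders in self.holders.values():
            holders.discard(failed)
        remastered = {}
        for block, owner in list(self.master.items()):
            if owner == failed:
                candidates = sorted(self.holders[block] or self.nodes)
                self.master[block] = candidates[0]
                remastered[block] = candidates[0]
        return remastered


directory = GlobalCacheDirectory(["A", "B", "C"])
directory.read("A", 100)               # A caches block 100 and becomes its master
directory.read("B", 100)               # B caches a copy as well
stale = directory.update("B", 100)     # B updates: A's copy is now obsolete
moved = directory.node_failed("A")     # A fails: block 100 needs a new master
```

After the update only node B holds a current copy, and when node A fails, block 100 is re-mastered on B because B is the only surviving holder.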

This is an excerpt from the bestselling book Oracle Grid & Real Application Clusters, Rampant TechPress, by Mike Ault and Madhu Tumma.

http://www.rampant-books.com/book_2004_1_10g_grid.htm