Redundancy Features

Oracle RAC Cluster Tips by Burleson Consulting

This is an excerpt from the bestselling book Oracle Grid & Real Application Clusters. To get immediate access to the code depot of working RAC scripts, buy it directly from the publisher and save more than 30%.

Many features and options add redundancy at the server level. Taking advantage of them helps avoid both failures and degraded cluster performance in systems such as RAC. These features address different subsystems of the server, such as memory and processors. Redundant components such as fans, power supplies, and adapters can also provide higher availability, particularly when used with software that monitors the system and alerts the system administrators.

To make servers more reliable, use high-reliability components and follow best practices. The following sections examine some of the redundancy features that administrators need to focus on.

Dynamic Reconfiguration (DR) within a Server

DR is an operating environment feature that provides the ability to replace and reconfigure system hardware while the system is running. This feature is optional and can be implemented at the discretion of the system administrator. The main benefit of DR is that an administrator can add or replace hardware resources, such as CPUs, memory, and I/O interfaces, with little interruption of normal system operations. DR therefore helps to increase the overall uptime and availability of servers.

For example, DR is available for Sun system architectures that contain multiple system boards and use board slots that support hot plugging. DR is particularly well implemented on the Sun Fire 3800-6800 server series. DR operations are performed at attachment points, which DR can connect or disconnect. The Sun Fire series supports the following attachment points for dynamic reconfiguration:

* I/O Assembly (PCI / ePCI assemblies)

* CPU/Memory Boards

* cPCI cards

* System Memory

* CPUs

Predictive Failure Analysis (PFA)

Many server vendors provide a mechanism for anticipating system failures, called Predictive Failure Analysis (PFA). Servers tend to keep running until they fail, and there are often no clear signs that a failure is coming. If zero downtime is required, consider using predictive failure analysis technology. This technology can warn a DBA up to 48 hours before an imminent server failure, which leaves time to replace the suspect component or move the workload to another node. The analysis method and terminology differ from vendor to vendor, but most of the leading vendors provide PFA for their servers.
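The idea behind predictive failure analysis can be illustrated with a minimal sketch. It assumes a simple rule: raise a warning when the number of correctable errors seen within a sliding time window reaches a threshold. The class name, the window length, and the threshold are hypothetical and chosen only for illustration; real vendor implementations use their own proprietary criteria.

```python
from collections import deque


class PredictiveFailureMonitor:
    """Hypothetical monitor: warns when recent correctable errors reach a threshold."""

    def __init__(self, window_hours=24, threshold=10):
        self.window_hours = window_hours  # assumed length of the sliding window
        self.threshold = threshold        # assumed error count that triggers a warning
        self.events = deque()             # timestamps (in hours) of correctable errors

    def record_error(self, hour):
        """Record one correctable error and discard events outside the window."""
        self.events.append(hour)
        while self.events and self.events[0] <= hour - self.window_hours:
            self.events.popleft()

    def failure_predicted(self):
        """Return True when the error count inside the window reaches the threshold."""
        return len(self.events) >= self.threshold


# Usage: one isolated error, then three errors close together.
monitor = PredictiveFailureMonitor(window_hours=24, threshold=3)
warnings = []
for hour in (1, 30, 31, 32):
    monitor.record_error(hour)
    warnings.append(monitor.failure_predicted())
```

In this sketch the isolated error at hour 1 ages out of the window, so only the later cluster of errors triggers the warning.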

Error-Correcting Code (ECC) Memory

Error-correcting code (ECC) memory detects and corrects all single-bit errors without affecting the operation of the system. It also detects all double-bit errors and, depending on the implementation, corrects some of them. All error correction events are logged by the system.
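To show how single-bit correction and double-bit detection can work in principle, here is a small sketch of a textbook Hamming(7,4) code extended with an overall parity bit. This is only an illustration on 4 data bits; it is an assumption that it resembles what any particular server does, since real memory controllers use wider codes implemented in hardware. The function names are hypothetical.

```python
def encode(data):
    """Encode 4 data bits into an 8-bit codeword (Hamming(7,4) plus overall parity)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4                      # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4                      # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4                      # covers positions 4, 5, 6, 7
    word = [p1, p2, d1, p4, d2, d3, d4]    # positions 1..7
    p0 = 0
    for bit in word:
        p0 ^= bit                          # overall parity over the 7 bits
    return word + [p0]


def decode(codeword):
    """Return (data_bits, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    word = list(codeword[:7])
    syndrome = 0
    for position, bit in enumerate(word, start=1):
        if bit:
            syndrome ^= position           # nonzero syndrome names the bad position
    overall = 0
    for bit in codeword:
        overall ^= bit                     # 0 for a valid word or an even number of flips
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        if syndrome:
            word[syndrome - 1] ^= 1        # flip the single erroneous bit back
        status = "corrected"
    else:
        status = "uncorrectable"           # two flipped bits: detected, data not trusted
    return [word[2], word[4], word[5], word[6]], status


original = [1, 0, 1, 1]
stored = encode(original)

single = list(stored)
single[4] ^= 1                             # simulate one flipped bit
single_result = decode(single)

double = list(stored)
double[4] ^= 1
double[6] ^= 1                             # simulate two flipped bits
double_result = decode(double)
```

With one flipped bit the decoder restores the original data and reports a correction. With two flipped bits it reports the word as uncorrectable, and the returned data bits should not be trusted.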

IBM Chipkill memory is a good example. Chipkill ECC memory and automatic server restart features work together to minimize server downtime. The latest Chipkill memory technology, available in select IBM xSeries and Netfinity servers, protects the server against the failure of any single memory chip and against any number of multi-bit errors from any portion of a single memory chip.

As another example, Sun has adopted memory error correction code on all of its servers to minimize system downtime caused by faulty single inline memory modules (SIMMs) and dual inline memory modules (DIMMs).

Redundant Networking Components

To avoid network I/O channel failures, provide redundant physical elements in the path between the server and the network backbone, including network interface cards, cables, and patch panels.
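The benefit of duplicating the whole path can be estimated with standard availability arithmetic. The sketch below assumes that component failures are independent, that the components of one path are in series, and that redundant paths work in parallel. The availability figures and function names are hypothetical and used only for illustration.

```python
from math import prod


def path_availability(component_availabilities):
    """Availability of one path: every component in series must be working."""
    return prod(component_availabilities)


def redundant_availability(single_path, paths):
    """Availability of several independent, identical paths in parallel."""
    return 1 - (1 - single_path) ** paths


# Hypothetical figures for a NIC, a cable, and a patch panel.
one_path = path_availability([0.999, 0.9995, 0.9999])
two_paths = redundant_availability(one_path, 2)
```

Under these assumptions, a second independent path reduces the probability that the server is cut off from the backbone from `1 - one_path` to the square of that value.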

Hot Swap Power

In its simplest form, the server should have two built-in power supplies that share the load, each capable of powering the whole system. When one of the supplies fails, the surviving supply keeps the server running. An uninterruptible power supply (UPS) is another essential requirement.
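The sizing rule described above can be expressed as a quick check: after the loss of any one supply, the remaining supplies must still cover the full system load. The function name and the wattage figures below are hypothetical.

```python
def survives_single_failure(supply_watts, load_watts):
    """Return True if the remaining supplies can carry the load after any one fails."""
    if len(supply_watts) < 2:
        return False                       # a single supply offers no redundancy
    total = sum(supply_watts)
    return all(total - lost >= load_watts for lost in supply_watts)


# Two 750 W supplies: redundant for a 600 W load, but not for a 900 W load.
redundant = survives_single_failure([750, 750], 600)
overloaded = survives_single_failure([750, 750], 900)
```

The second case illustrates a common pitfall: two supplies that can only carry the load together share the load but provide no protection when one fails.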

Hot Swap Fans

A cooling fan failure will not bring a system down if the fan can be hot-swapped transparently. Most cooling-related issues are external to the system, such as keeping the computer room temperature stable and below the required limits. High temperatures and temperature fluctuations both stress electronic components.

This is an excerpt from the bestselling book Oracle Grid & Real Application Clusters, Rampant TechPress, by Mike Ault and Madhu Tumma.


http://www.rampant-books.com/book_2004_1_10g_grid.htm
