This is an excerpt from the book
Oracle Grid & Real Application Clusters.
Many features and options add
redundancy at the server level. Taking advantage of them helps avoid
failures and degraded cluster performance in systems such as RAC.
These features address different subsystems of the server, such as
memory and processors. Redundant components such as fans, power
supplies, and adapters also provide higher availability,
particularly when combined with software that monitors the hardware
and alerts the system administrators.
To make the servers more
reliable, use high-reliability components and follow system best
practices. The following sections examine the redundancy features
that administrators need to focus on.
Dynamic Reconfiguration (DR) within a Server
DR is an operating environment
feature that provides the ability to replace and reconfigure system
hardware while the system is running. This feature is optional and
can be implemented at the discretion of the system administrator.
The main benefit of DR is that an administrator can add or replace
hardware resources, such as CPUs, memory, and I/O interfaces, with
little interruption of normal system operations. The DR process
helps to increase the overall uptime and availability of servers.
For example, DR is available for
Sun system architectures that contain multiple system boards and use
board slots that support hot plugging. DR is particularly well
implemented on the Sun Fire 3800-6800 server series. DR operations
are performed at attachment points, which can be connected or
disconnected while the system is running. The Sun Fire series
supports the following attachment points for DR:
* I/O assemblies (PCI / ePCI assemblies)
* CPU/Memory boards
* cPCI cards
* System memory
* CPUs
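The connect and disconnect behavior of an attachment point can be sketched as a small state model. This is a hypothetical illustration only, not an interface to any Sun tool: the class name, method names, and the rule that a resource must be unconfigured before it is disconnected are assumptions made for the example, loosely borrowing the connected/configured vocabulary used by Solaris DR tooling.

```python
class AttachmentPoint:
    """Hypothetical model of a DR attachment point (for illustration only)."""

    def __init__(self, name):
        self.name = name
        self.connected = False   # is the hardware electrically attached?
        self.configured = False  # is the operating system using the hardware?

    def connect(self):
        # Attach the hardware to the running system.
        self.connected = True

    def configure(self):
        # The operating system can only use hardware that is connected.
        if not self.connected:
            raise RuntimeError(f"{self.name}: connect before configuring")
        self.configured = True

    def unconfigure(self):
        # Release the hardware from the operating system.
        self.configured = False

    def disconnect(self):
        # Refuse to detach hardware that the operating system still uses.
        if self.configured:
            raise RuntimeError(f"{self.name}: unconfigure before disconnecting")
        self.connected = False


def replace_board(point):
    """Walk one attachment point through a replace cycle while 'running'."""
    point.unconfigure()
    point.disconnect()
    # ... the board would be physically swapped at this step ...
    point.connect()
    point.configure()
    return point
```

The ordering checks are the point of the sketch: the system keeps running because a resource is released by the operating system before it is detached, and attached before it is put back into use.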
Predictive Failure Analysis (PFA)
Many server vendors provide a
mechanism to anticipate system failures, called Predictive Failure
Analysis (PFA). Servers often fail without clear warning signs. If
zero downtime is required, consider using predictive failure
analysis technology. According to vendors, this technology can warn
an administrator up to 48 hours in advance of an imminent failure,
which usually leaves enough time to prevent a disaster. The analysis
method and terminology differ from vendor to vendor, but most of the
leading vendors provide some form of PFA for their servers.
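Vendor PFA implementations are proprietary, so the following is only a minimal sketch of one common idea behind them: counting corrected (non-fatal) errors in a sliding time window and raising a warning when the rate crosses a threshold. The class name, the one-hour window, and the threshold of 10 are all assumptions chosen for illustration.

```python
from collections import deque


class PredictiveFailureMonitor:
    """Hypothetical sliding-window error counter (not a vendor algorithm)."""

    def __init__(self, window_seconds=3600.0, threshold=10):
        self.window_seconds = window_seconds  # how far back to look
        self.threshold = threshold            # errors in window that trigger a warning
        self._events = deque()                # timestamps of recent corrected errors

    def record_error(self, timestamp):
        """Record one corrected error; return True if a warning should be raised."""
        self._events.append(timestamp)
        # Discard events that have fallen out of the window.
        while self._events and timestamp - self._events[0] > self.window_seconds:
            self._events.popleft()
        return len(self._events) >= self.threshold


if __name__ == "__main__":
    monitor = PredictiveFailureMonitor(window_seconds=60.0, threshold=3)
    for t in (0.0, 10.0, 20.0):
        if monitor.record_error(t):
            print(f"warning at t={t}: component may be about to fail")
```

A component that logs a few isolated errors over a long period never triggers the warning, while a burst of errors in a short period does, which is the behavior an administrator wants from an early-warning signal.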
Error-Correcting Code (ECC) Memory
Error-correcting code (ECC)
memory detects and corrects all single-bit errors without affecting
the operation of the system. It also detects all double-bit errors
and, in some implementations, corrects some of them. All error
correction events are logged by the system.
IBM Chipkill memory is a good
example. Chipkill ECC memory and automatic server restart features
work together to minimize server downtime. With the Chipkill memory
technology available in select IBM xSeries and Netfinity servers,
the system is protected against the failure of any single memory
chip and against any number of multi-bit errors originating from a
single memory chip.
As another example, Sun has
adopted error-correcting code memory on all of its servers to
minimize system downtime caused by faulty single in-line memory
modules (SIMMs) and dual in-line memory modules (DIMMs).
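The "correct single-bit errors, detect double-bit errors" behavior described in this section can be illustrated with a toy code. The sketch below uses an extended Hamming code that protects 4 data bits with 4 check bits. Server memory uses much wider codes, and Chipkill uses a different, stronger scheme, so this is a teaching example rather than what any particular server implements. The function names are chosen for illustration.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit codeword (a list of 0/1 values)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8
    # Positions 1..7 use the classic Hamming(7,4) layout; data sits at 3, 5, 6, 7.
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # check bit for positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # check bit for positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # check bit for positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # overall parity over positions 1..7
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos        # XOR of the positions holding a 1
    overall = sum(code) % 2        # 0 when the total parity is still even
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd parity: a single bit flipped. The syndrome is its position
        # (a syndrome of 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even parity but non-zero syndrome: two bits flipped. Detect only.
        return None, "uncorrectable"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status
```

Flipping any one bit of a codeword returned by `encode` yields the original value with status `corrected`; flipping any two bits yields status `uncorrectable`, which corresponds to the event a real system would log and report rather than silently pass on.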
Redundant Networking Components
To avoid network I/O channel
failures, provide redundant physical elements along the entire path
between the server and the network backbone, including network
interface cards, cables, and patch panels.
Hot Swap Power
In its simplest form, the server
has two built-in power supplies that share the load, each capable of
powering the whole system. When one of the supplies fails, the
surviving supply keeps the server running. An uninterruptible power
supply (UPS) is another essential requirement.
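The benefit of a second supply can be estimated with a short calculation. The sketch below assumes that each supply can carry the full load on its own and that supplies fail independently of one another; the function name and the 0.99 availability figure are made up for illustration and are not measurements from any vendor.

```python
def parallel_availability(unit_availability, units=2):
    """Availability of a group that works as long as at least one unit works.

    Assumes every unit can carry the whole load and that failures are
    independent of each other.
    """
    if not 0.0 <= unit_availability <= 1.0:
        raise ValueError("unit_availability must be between 0 and 1")
    if units < 1:
        raise ValueError("units must be at least 1")
    # The group is down only when every unit is down at the same time.
    return 1.0 - (1.0 - unit_availability) ** units


if __name__ == "__main__":
    single = parallel_availability(0.99, units=1)  # hypothetical figure
    dual = parallel_availability(0.99, units=2)
    print(f"one supply:   {single:.4f}")
    print(f"two supplies: {dual:.4f}")
```

The independence assumption is the weak point: a utility outage takes down both supplies at once, which is why redundant supplies do not remove the need for a UPS.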
Hot Swap Fans
A cooling fan failure will not
bring a system down if the fan can be hot-swapped transparently.
Most cooling-related issues are external to the system, such as
keeping the computer room temperature stable and below the required
limits. High temperatures and temperature fluctuations both place
stress on electronic components.