Recovery in the RAC Environment

Oracle RAC Cluster Tips by Burleson Consulting

A RAC environment is subject to two basic types of failure: instance failure and media failure. Instance failure is the loss of one or more RAC instances, whether through node failure or connectivity failure. Media failure is the loss of one or more of the disk assets that store the database files themselves.

If a RAC database suffers instance failure, the first surviving instance that detects the failure performs instance recovery for all failed instances, using the failed instances' redo logs and its own SMON process. The redo logs for all RAC instances reside either on an OCFS shared disk asset or on raw devices that are visible to every other RAC instance. This allows any surviving node to perform recovery on behalf of a failed RAC node.

Recovery from the redo logs ensures that the changes made by committed transactions are preserved. Uncommitted transactions are rolled back and their resources released.

Some experts with more than a dozen years of experience with Oracle databases have yet to see an instance failure leave an Oracle database unrecoverable. Generally speaking, an instance failure, whether in RAC or in single-instance Oracle, requires no action from the DBA other than restarting the failed instance once the node becomes available again.

If, for some reason, the recovering instance cannot see all of the datafiles accessed by the failed instance, an error will be written to the alert log. To verify that all datafiles are available, the ALTER SYSTEM CHECK DATAFILES command can be used to validate proper access.
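The check can also be scripted. The following is a minimal Python sketch, assuming a DB-API style cursor (for example, one obtained from the python-oracledb driver) opened by a user with the ALTER SYSTEM privilege. The helper name check_datafiles is hypothetical, and the optional GLOBAL or LOCAL scope keyword should be verified against the SQL reference for your Oracle release.

```python
def check_datafiles(cursor, scope=None):
    """Issue ALTER SYSTEM CHECK DATAFILES through a DB-API cursor.

    cursor: any object with an execute(sql) method (assumed, not Oracle-specific).
    scope:  None, "GLOBAL", or "LOCAL" (assumed optional keyword; verify for your release).
    Returns the statement text that was executed.
    """
    statement = "ALTER SYSTEM CHECK DATAFILES"
    if scope is not None:
        keyword = scope.upper()
        if keyword not in ("GLOBAL", "LOCAL"):
            raise ValueError("scope must be 'GLOBAL', 'LOCAL', or None")
        statement += " " + keyword
    # Any inaccessible datafile is reported by the database, not by this helper.
    cursor.execute(statement)
    return statement
```

Any problems found are reported by the database itself, so the alert log should still be reviewed after the call.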

Instance recovery involves nine distinct steps. The Oracle manual lists only eight; the list below also counts the instance failure itself:

1. Normal RAC operation, all nodes are available.

2. One or more RAC instances fail.

3. Node failure is detected.

4. Global Cache Service (GCS) reconfigures to distribute resource management to the surviving instances.

5. The SMON process in the instance that first discovers the failed instance(s) reads the failed instances' redo logs to determine which blocks have to be recovered.

6. SMON issues requests for all of the blocks it needs to recover. Once all of these blocks have been made available to the SMON process doing the recovery, all other database blocks are available for normal processing.

7. Oracle performs roll forward recovery against the blocks, applying all changes recorded in the redo logs.

8. Once the redo has been applied, the undo records are applied, which rolls back uncommitted transactions.

9. Database is now fully available to surviving nodes.
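The roll forward and roll back phases in steps 7 and 8 can be illustrated with a toy model. The Python sketch below is not how Oracle implements recovery: the record layout (transaction id, block, before and after values), the function name recover, and the sample data are all assumptions made purely for illustration, and the model assumes that a block is changed by only one in-flight transaction at a time.

```python
from collections import namedtuple

# Hypothetical, simplified redo record: each change carries the block's
# value before and after the change.
Change = namedtuple("Change", "txn block before after")

def recover(blocks, redo, committed):
    """Roll forward every logged change, then roll back uncommitted ones."""
    recovered = dict(blocks)  # work on a copy of the on-disk block images
    # Roll forward: reapply every change in log order, committed or not.
    for change in redo:
        recovered[change.block] = change.after
    # Roll back: undo the changes of uncommitted transactions, newest first.
    for change in reversed(redo):
        if change.txn not in committed:
            recovered[change.block] = change.before
    return recovered

# Block images on disk at the time of the failure (illustrative values).
on_disk = {"A": 1, "B": 10}
redo_log = [
    Change("T1", "A", 1, 2),
    Change("T2", "B", 10, 20),
    Change("T1", "A", 2, 3),
]
result = recover(on_disk, redo_log, committed={"T1"})
```

In this example T1 committed before the failure and T2 did not, so block A keeps T1's final value while block B returns to its original value.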

Instance recovery is automatic. Apart from the performance hit on the surviving instances and the disconnection of users who were connected to the failed instance, recovery is invisible to the other instances. If RAC failover and transparent application failover (TAF) are properly configured, the only users who should notice a problem are those with in-flight transactions. The following listing shows what a surviving instance records in its alert log during a reconfiguration.

Sat Feb 15 16:39:09 2003
Reconfiguration started
List of nodes: 0,
 Global Resource Directory frozen
one node partition
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
 Resources and enqueues cleaned out
 Resources remastered 1977
 2381 GCS shadows traversed, 1 cancelled, 13 closed
 1026 GCS resources traversed, 0 cancelled
 3264 GCS resources on freelist, 4287 on array, 4287 allocated
 set master node info
 Submitted all remote-enqueue requests
 Update rdomain variables
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 2381 GCS shadows traversed, 0 replayed, 13 unopened
 Submitted all GCS remote-cache requests
 0 write requests issued in 2368 GCS resources
 2 PIs marked suspect, 0 flush PI msgs
Sat Feb 15 16:39:10 2003
Reconfiguration complete
 Post SMON to start 1st pass IR
Sat Feb 15 16:39:10 2003
Instance recovery: looking for dead threads
Sat Feb 15 16:39:10 2003
Beginning instance recovery of 1 threads
Sat Feb 15 16:39:10 2003
Started first pass scan
Sat Feb 15 16:39:11 2003
Completed first pass scan
 208 redo blocks read, 6 data blocks need recovery
Sat Feb 15 16:39:11 2003
Started recovery at
 Thread 2: logseq 26, block 14, scn 0.0
Recovery of Online Redo Log: Thread 2 Group 4 Seq 26 Reading mem 0
  Mem# 0 errs 0: /oracle/oradata/ault_rac/ault_rac_raw_rdo_2_2.log
Recovery of Online Redo Log: Thread 2 Group 3 Seq 27 Reading mem 0
  Mem# 0 errs 0: /oracle/oradata/ault_rac/ault_rac_raw_rdo_2_1.log
Sat Feb 15 16:39:12 2003
Completed redo application
Sat Feb 15 16:39:12 2003
Ended recovery at
 Thread 2: logseq 27, block 185, scn 0.5479311
 6 data blocks read, 8 data blocks written, 208 redo blocks read
Ending instance recovery of 1 threads
SMON: about to recover undo segment 11
SMON: mark undo segment 11 as available
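The figures in a listing like this one can be extracted programmatically. The following is a minimal Python sketch, assuming the alert log uses the timestamp and message formats shown in the listing above and that the system locale uses English day and month names. The function name summarize_recovery and the dictionary keys are illustrative only.

```python
import re
from datetime import datetime

TIMESTAMP_FORMAT = "%a %b %d %H:%M:%S %Y"  # e.g. "Sat Feb 15 16:39:09 2003"
FIRST_PASS = re.compile(r"(\d+) redo blocks read, (\d+) data blocks need recovery")

def summarize_recovery(log_text):
    """Summarize one reconfiguration and instance recovery episode."""
    summary = {}
    last_timestamp = None
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        try:
            # Timestamp lines apply to the messages that follow them.
            last_timestamp = datetime.strptime(line, TIMESTAMP_FORMAT)
            continue
        except ValueError:
            pass  # not a timestamp line; treat it as a message
        if line == "Reconfiguration started":
            summary["started"] = last_timestamp
        elif line.startswith("Ending instance recovery"):
            summary["ended"] = last_timestamp
        match = FIRST_PASS.match(line)
        if match:
            summary["redo_blocks_read"] = int(match.group(1))
            summary["blocks_needing_recovery"] = int(match.group(2))
    if summary.get("started") and summary.get("ended"):
        elapsed = summary["ended"] - summary["started"]
        summary["elapsed_seconds"] = elapsed.total_seconds()
    return summary
```

Run against the listing above, a summary of this kind makes it easy to compare how long reconfiguration plus recovery took across several failures.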

One word of caution: during testing for this listing, an instance could not be brought back up after the failure, which is a rare occurrence. A kill -9 was issued against the SMON process on AULTLINUX1, within the Linux/RAC/RAW environment. AULTLINUX2 continued to operate and recovered the failed instance; however, an attempted restart of the instance on AULTLINUX1 yielded a Linux Error: 24: Too Many Files Open error. This was actually caused by something blocking the SPFILE link. Once the instance was pointed to the proper SPFILE location during startup, it restarted with no problems.

 
   
Oracle Grid and Real Application Clusters

See working examples of Oracle Grid and RAC in the book Oracle Grid and Real Application Clusters.



Copyright © 1996 -  2017

All rights reserved by Burleson

Oracle ® is the registered trademark of Oracle Corporation.
