Friday, November 13, 2015

Oracle Rac

Oracle RAC Architecture 10g and 11g

RAC components
·         Shared Disk System
·         Oracle clusterware
·         Cluster interconnects
·         Oracle Kernel Components
Shared Disk System
Below are the three major type of shared storage which are using in RAC

Raw volumes: A raw logical volume is an area of physical and logical disk space that is under the direct control of an application such as database or partition rather than under the direct control of the operating system or a file system. This is the only available option in 9i.

Cluster File system: This option is not widely used and here the cluster file system such as Oracle Cluster file system (OCFS) for MS Windows and Linux holding the all datafiles of RAC database
Automatic Storage Management (ASM): Oracle recommended storage option which is optimized for cluster file system for Oracle database files introduced in Oracle 10g.

Oracle Clusterware
Clusterware is the mandatory piece of software which is required to run Oracle RAC option, it provides the basic clustering support at OS level and enables oracle software to run in cluster mode.
Clusterware enables nodes to communicate each other and forms the cluster that make the nodes work as a single logical server.

Clusterware is managed by cluster ready services (CRS) using the Oracle Cluster registry (OCR) and voting disk

Cluster Ready Services (CRS)
Oracle clusterware uses CRS for interaction between the OS and the Database.CRS have five components - namely Process Monitor daemon (OPROCd: obsolete in 11gR2),CRS daemon(CRSd),Oralce Custer Synchronization Service daemon(OCSSd),Event Volume Manager daemon(EVMd) and the Oracle Notification Service (ONS)

CRS is installed and run from ORA_CRS_HOME in 10g and GRID_HOME in 11g

Starting and Stopping CRS

crsctl stop crs      # Stops Oracle clusterware
crsctl start crs     # Starts Oracle clusterware
crsctl enable crs    # Enables Oracle clusterware
crsctl disable crs   # Disables Oracle clusterware

Note: These commands should be used in root user (Superuser)
Oracle Cluster Synchronization Services
OCSSd provides synchronization between the nodes. Failure of OCSSd causes the machine to reboot to avoid split-brain situation.OCSSd runs as oracle user.

Event Manager Process
This runs to daemon process EVMd. The daemon process spawns a permanent child process called evmlogger and generate the events when things happens.

Oracle Notification Services
Whenever state of cluster resource changes ONS process in each node will communicate with each other and exchange the high availability information.
Oracle Process Monitor
The oracle process monitor daemon(OPROCd) identifies the potential cluster node hang and rebooting the hang node.
Oracle Cluster Registry (OCR)
OCR uses the cluster registry to keep the configuration information. This should be a shared storage and should be accessible  to all the nodes in the cluster simultaneously.This shared storage is known as Oracle cluster registry (OCR)  

Oracle Logical Registry(11gR2)
Similar to OCR, introduces in 11gR2 but it only stores information about the local node. It is not shared by other nodes of cluster and used by OHASd while starting or joining a cluster.

Voting Disk
A voting disk is a shared diks that will be accessed by all the member of the nodes in the cluster.It is used as   central referece for all the nodes and keeps the heartbeat information between the nodes. If any of the node is unable to ping the voding disk,cluster immediatly recogonize the comminucation failure and envicts the node from the cluster.

Cluster interconnect
It is the communication path used by the cluster for the synchronization of resources and it is also used in some cases for transfer of data from one instance to another.Typically, the interconnect is a network connections that is dedicated to the server nodes of a cluster (thus is sometimes referred as private interconnect)

Single Client Access Name (SCAN)-11gR2
SCAN is a single network name that resolves in three different IPs registered in DNS or GNS.
Prior the 11g R2 if we add or remove a node from cluster it was required to changes the connection setting in client . By configuring the connection using SCAN name this problem is eliminated.

NewfeaturesinOracle9i/10g/11gRAC

Oracle Real Application Clusters New features


Oracle 9i RAC
·                     OPS (Oracle Parallel Server) was renamed as RAC
·                     CFS (Cluster File System) was supported
·                     OCFS (Oracle Cluster File System) for Linux and Windows
·                     watchdog timer replaced by hangcheck timer

Oracle 10g R1 RAC
·                     Cluster Manager replaced by CRS
·                     ASM introduced
·                     Concept of Services expanded
·                     ocrcheck introduced
·                     ocrdump introduced
·                     AWR was instance specific

Oracle 10g R2 RAC
·                     CRS was renamed as Clusterware
·                     asmcmd introduced
·                     CLUVFY introduced
·                     OCR and Voting disks can be mirrored
·                     Can use FAN/FCF with TAF for OCI and ODP.NET
·                      
o                  Business requirements software

o                  Tracking devices for vehicles

o                  Consolidate

o                  Consolidating

Oracle 11g R1 RAC

·      Oracle 11g RAC parallel upgrades - Oracle 11g have rolling upgrade features whereby RAC database can be upgraded without any downtime.
·   Hot patching - Zero downtime patch application.
·   Oracle RAC load balancing advisor - Starting from 10g R2 we have RAC load balancing advisor utility. 11g RACload balancing advisor is only available with clients who use .NET, ODBC, or the Oracle Call Interface (OCI).
·                     ADDM for RAC - Oracle has incorporated RAC into the automatic database diagnostic monitor, for cross-node advisories. The script addmrpt.sql run give report for single instance, will not report all instances in RAC, this is known as instance ADDM. But using the new package DBMS_ADDM, we can generate report for all instances of RAC, this known as database ADDM.
·                     Optimized RAC cache fusion protocols - moves on from the general cache fusion protocols in 10g to deal with specific scenarios where the protocols could be further optimized.
·                     Oracle 11g RAC Grid provisioning - The Oracle grid control provisioning pack allows us to "blow-out" a RAC node without the time-consuming install, using a pre-installed "footprint".

Oracle 11g R2 RAC

·                     We can store everything on the ASM. We can store OCR & voting files also on the ASM.
·                     ASMCA
·                     Single Client Access Name (SCAN) - eliminates the need to change tns entry when nodes are added to or removed from the Cluster. RAC instances register to SCAN listeners as remote listeners. SCAN is fully qualified name. Oracle recommends assigning 3 addresses to SCAN, which create three SCAN listeners.
·                     Clusterware components: crfmond, crflogd, GIPCD.
·                     AWR is consolidated for the database.
·                     11g Release 2 Real Application Cluster (RAC) has server pooling technologies so it’s easier to provision and manage database grids. This update is geared toward dynamically adjusting servers as corporations manage the ebb and flow between data requirements for datawarehousing and applications.
·                     By default, LOAD_BALANCE is ON.
·                     GSD (Global Service Deamon), gsdctl introduced.
·                     GPnP profile.
·                     Cluster information in an XML profile.
·                     Oracle RAC OneNode is a new option that makes it easier to consolidate databases that aren’t mission critical, but need redundancy.
·                     raconeinit - to convert database to RacOneNode.
·                     raconefix - to fix RacOneNode database in case of failure.
·                     racone2rac - to convert RacOneNode back to RAC.
·                     Oracle Restart - the feature of Oracle Grid Infrastructure's High Availability Services (HAS) to manage associated listeners, ASM instances and Oracle instances.
·                     Oracle Omotion - Oracle 11g release2 RAC introduces new feature called Oracle Omotion, an online migration utility. This Omotion utility will relocate the instance from one node to another, whenever instance failure happens.
·                     Omotion utility uses Database Area Network (DAN) to move Oracle instances. Database Area Network (DAN)technology helps seamless database relocation without losing transactions.
·                     Cluster Time Synchronization Service (CTSS) is a new feature in Oracle 11g R2 RAC, which is used to synchronize time across the nodes of the cluster. CTSS will be replacement of NTP protocol.
·                     Grid Naming Service (GNS) is a new service introduced in Oracle RAC 11g R2. With GNS, Oracle Clusterware (CRS) can manage Dynamic Host Configuration Protocol (DHCP) and DNS services for the dynamic node registration and configuration.
·                     Cluster interconnect: Used for data blocks, locks, messages, and SCN numbers.
·                     Oracle Local Registry (OLR) - From Oracle 11gR2 "Oracle Local Registry (OLR)" something new as part of Oracle Clusterware. OLR is node’s local repository, similar to OCR (but local) and is managed by OHASD. It pertains data of local node only and is not shared among other nodes.
·                     Multicasting is introduced in 11gR2 for private interconnect traffic.
·                     I/O fencing prevents updates by failed instances, and detecting failure and preventing split brain in cluster. When a cluster node fails, the failed node needs to be fenced off from all the shared disk devices or diskgroups. This methodology is called I/O Fencing, sometimes called Disk Fencing or failure fencing.
·                     Re-bootless node fencing (restart) - instead of fast re-booting the node, a graceful shutdown of the stack is attempted.
·                     Clusterware log directories: acfs*
·                     HAIP (IC VIP).
·                     Redundant interconnects: NIC bonding, HAIP.
·                     RAC background processes: DBRM – Database Resource Manager, PING – Response time agent.
·                     Virtual Oracle 11g RAC cluster - Oracle 11g RAC supports virtualization.



Oracle Kernel Components 
Set of additional background process in each instance is known as oracle kernel components in RAC environment.Since buffer and shared pool became global in RAC , special handling is required to manage the resources  to avoid conflicts and corruption.Additional background process (for RAC) and single instance background process works together and achieved this.
Global Cache and Global Enqueue Services
In Oracle RAC each instance will have its own cache but it is required for an instance to access the data blocks currently residing in  another instance cache.This management and data sharing is done by Global Cache services (GCS). Blocks other than data such as locks, enqueue details and shared across the instances are known as Global Enqueue Services (GES).
Global Resource Directory
In a cluster group, then centre repository formed by all the resources is known as Global Resource Directory (GRD).GRD is managed by GCS and GES.
Oracle RAC Background Processes
·         LMS       Global Cache Service Process
·         LMON   Global Enqueue Service Monitor
·         LMD      Global Enqueue service Daemon
·         LCK0     Instance Enqueue Process 
LMS: Global Cache Services Process
LMS- Lock Manager Server Process is used in Cache Fusion.It enebles consistent copies of blocks to be transfered from a holding instance's buffer cache to a requesting instance bufer cache with out a disk write under certian conditions.
It rollbacks any uncommitted transactions for any blocks that are being requested for a consistent read by the remote instance.
LMON:Global Enqueue Services Monitor
LMON-Lock Monitor Process is responsible to manage Global Enqueue Services (GES).It maintain consistency of GCS memory in case of any process death. LMON is also responsible for the cluster reconfiguration when an instance joins or leaves the cluster. It also check for the instance death and listens for local manages.

LMD: Global Enqueue Services Daemon
LMD-Lock Manager Daemon process manages Enqueue manager service requests for GCS. It also handles deadlock detection and remote resource requests.
LCK0:Instance Enqueue Process
LCK-Lock Process manages instance resource requests and cross-instance call operations for shared resources.
DIAG Process
Separate process using for diagnostic purpose, PMON will start incase this process dies.DIAG process will not affect any other background process.
                                  
High Availability and Clustering
When you have very critical systems that require to be online 24x7 then you need a HA solution (High Availability), you have to weigh up the risk associated with downtime against the cost of a solution. HA solutions are not cheap and they are not easy to manage. HA solutions need to be thoroughly tested as it may not be tested in the real world for months. I had a solution that run for almost a year before a hardware failure caused a failover, this is when your testing before hand comes into play.
As I said before HA comes with a price, and there are a number of HA technologies
  • Fault Tolerance - this technology protects you from hardware failures for example redundant PSU, etc
  • Disaster Recovery - this technology protects from operational issues such as a Data Center becoming unavailable
  • Disaster Tolerance - this technology is used to prepare for the above two, the most important of the three technologies.
Every company should plan for unplanned outages, this costs virtually nothing, knowing what to do in a DR situation is half the battle, in many companies people make excuses not to design a DR plan (it costs to much, we don't have the redundant hardware,etc). You cannot make these assumptions until you design a DR plan, the plan will highlight the risks and the costs that go with that risk, then you can make the decision on what you can and cannot afford, there is no excuse not to create DR plan.
Sometimes in large corporations you will hear the phrase five nines, this phrases means the availability of a system and what downtime (approx) is allowed, the table below highlights the uptime a system requires in order to achieve the five nines
% uptime
% Downtime
Downtime per year
Downtime per week
98
2
7.3 days
3 hours 22 minutes
99
1
3.65 days
1 hour 41 minutes
99.8
0.2
17 hours 30 minutes
20 minutes
99.9
0.1
8 hours 45 minutes
10 minutes
99.99
0.01
52.5 minutes
1 minute
99.999 (five nines)
0.001
5.25 minutes
6 seconds
To achieve the five nines your system is only allowed 5.25 minutes per year or 6 seconds per week, in some HA designs it may take 6 seconds to failover.
When looking for a solution you should try and build redundancy into your plan, this is the first step to a HA solution, for example
  • Make sure computer cabinets have dual power
  • Make sure servers have dual power supplies, dual network cards, redundant hard disks that can be mirrored
  • Make sure you use multipathing to the data disk which are usually on a SAN or NAS
  • Make sure that the server is connected to two different network switches
You are trying to eliminate as many Single Point Of Failures (SPOF's) as you can without increasing the costs. Most hardware today will have these redundancy features built in, but its up to you to make use of them.
HA comes in three/four favors
No-failover
This option usually uses the already built-in redundancy, failed disks and PSU can be replaced online, but if a major hardware was to fail then a system outage is unavoidable, the system will remain down until it is fixed.
This solution can be perfectly acceptable in some environments but at what price to the business, even in today's market QA/DEV systems cost money when not running, i am sure that your developers are quite happy to take the day off paid while you fix the system
Cluster
This is the jewel in the HA world, a cluster can be configure in a variety of favors, from minimal downtime while services are moved to a good nodes, to virtual zero downtime.
However a cluster solution does come with a heavy price tag, hardware, configuration and maintaining a cluster is expensive but if you business loses vast amounts of money if you system is down, then its worth it.
Cold failover
Many smaller companies use this solution, basically you have a additional server ready to take over a number of servers if one where to fail. I have used this technique myself, i create a number of scripts that can turn a cold standby server into any number of servers, if the original server is going to have a prolonged outage.
The problem with this solution is there is going to be downtime, especially if it takes a long time to get the standby server up to the same point in time as the failed server.
The advantage of this solution is that one additional server could cover a number of servers, even if it slight under powered to the original server, as long as it keeps the service running.
Hot failover
Many applications offer hot-standby servers, these servers are running along side the live system, data is applied to the hot-standby server periodically to keep it up to date, thus in a failover situation the server is almost ready to go.
The problem with this system is costs and manageability, also one server is usually dedicated to one application, thus you may have to have many hot-standby servers.
The advantage is that downtime is kept to a minimum, but there will be some downtime, generally the time it take to get the hot-standby server up todate, for example applying the last set of logs to a database.
Here is a summary table that shows the most command aspects of cold failover versus hot failover
Aspects
Cold Failover
Hot Failover
Scalability/number of nodes
Scalable limited to the capacity of a single node
As nodes can be added on demand, it provides infinite scalability. High number of nodes supported.
User interruption
Required up to a minimal extent. The failover operation can be scripted or automated to a certain extent
Not required, failover is automatic
Transparent failover of applications
Not Possible
Transparent application failover will be available where sessions can be transferred to another node without user interruption
Load Balancing
Not possible, only one server will be used
Incoming load can be balanced between both nodes
Usage of resources
Only one server at a time, the other server will be kept idle
Both the servers will be used
Failover time
More than minutes as the other system must be cold started
Less than a minute, typically in a few seconds.
Clustering
I have discussed clustering in my Tomcat and JBoss topics, so I will only touch on the subject lightly here. A cluster is a group of two or more interconnected nodes, that provide a service. The cluster provides a high level of fault tolerance, if a node were to become unavailable within the cluster the services are moved/restored to another working node, thus the end user should never know that a fault occurred.
Clusters can be setup to use a single node in the cluster or to load balance between the nodes, but the main object is to keep the service running, hence why you pay top dollar for this. One advantage of a cluster is that it is very scalable because additional nodes can be added or taken away (a node may need to be patched) without interrupting the service.
Clustering has come a long way, there are now three types of clustering architecture
Shared nothing
each node within the cluster is independent, they share nothing. An example of this may be web servers, you a have number of nodes within the cluster supplying the same web service. The content will be static thus there is no need to share disks, etc.
Shared disk only
each node will be attached or have access to the same set of disks. These disks will contain the data that is required by the service. One node will control the application and the disk and in the event of a that node fails, the other node will take control of both the application and the data. This means that one node will have to be on standby setting idle waiting to take over if required to do so.
A typical traditional Veritas Cluster and Sun Cluster would fit the bill here.
Shared everything
again all nodes will be attached or have access to the same set of disks, but this time each node can read/write to the disks concurrently. Normally there will be a piece of software that controls the reading and writing to the disks ensuring data integrity. To achieve this a cluster-wide filesystem is introduced, so that all nodes view the filesystem identically, the software then coordinates the sharing and updating of files, records and databases.
Oracle RAC and IBM HACMP would be good examples of this type of cluster
Oracle RAC History
The first Oracle cluster database was release with Oracle 6 for the digital VAX, this was the first cluster database on the market. With Oracle 6.2 Oracle Parallel Server (OPS) was born, which used Oracle's own DLM (Distributed Lock Manager). Oracle 7 used vendor-supplied clusterware but this was complex to setup and manage, Oracle 8 introduce a general lock manager and this was a direction for Oracle to create its own clusterware product. Oracle's lock manager is integrated with Oracle code with an additional layer called OSD (Operating System Dependent), this was soon integrated within the kernel and become known as IDLM (Integrated Distributed Lock Manager) in later Oracle versions. Oracle Real Application Clusters 9i (Oracle RAC) used the same IDLM and relied on external clusterware software (Sun Cluster, Veritas Cluster, etc).
Oracle Parallel Server Architecture
A Oracle parallel database consists of two or more nodes that own Oracle instances and share a disk array. Each node has its own SGA and its own redo logs, but the data files and control files are all shared to all instances. All data and controls are concurrently read and written by all instances, redo logs files on the other hand can be read by any instance but only written by the owning instance. Each instance has its own set of background processes.
The components of a OPS database are
  • Cluster Manager - OS Vendor specific
  • Distributed Lock Manager (DLM)
  • Cluster Interconnect
  • Shared Disk Array
The Cluster Group Services (CGS) has some OSD components (node monitor interface) and the rest is built in the kernel. CGS has a key repository used by the DLM for communication and network related activities. This layer provides the following
  • Internode messaging
  • Group member consistency
  • Cluster synchronization
  • Process grouping, registration and deregistration
The DLM is a integral part of OPS and the RAC stack. In older versions the DLM API module had to rely on external OS routines to check the status of a lock, this was done using UNIX sockets and pipes. With the new IDLM the data is in the SGA of each instance and requires only a serialized lookup using latches and/or enqueues and may require global coordination, the algorithm for which was built directly into the Oracle kernel. The IDLM job is to track every lock granted to a resource, memory structures required by the DLM are allocated out of the shared pool. The design of the DLM is such it can survive nodes failures in all but one node of the cluster.
A user must require a lock before it can operate on any resource, the Parallel Cache Management (PCM) coordinates and maintains data blocks exists within each data buffer cache (of an instance) so that data viewed or requested by users is never inconsistent or incoherent. The PCM ensures that only one instance in a cluster can modify a block at any given time, other instances have to wait until the lock is released.
DLM maintains information about all locks on a given resource, the DLM nominates one node to manage all relevant lock information for a resource, this node is referred to as the master node, lock mastering is distributed among all nodes. Using the IPC layer the DLM permits it to share the load of mastering resources, which means that a user can lock a resource on one node but actually end up communicating with the processes on another node.
In OPS 8i Oracle introduced Cache Fusion Stage 1, this introduced a new background process called the Block Server Process (BSP). The BSP main roles was to ship consistent read (CR) version(s) of a block(s) across an instance in a read/write contention scenario, this shipping is performed over a high speed interconnect. Cache Fusion Stage 2 in Oracle 9i and 10g, addresses some of the issues with Stage 1, in which both types of blocks (CR and CUR) can be transferred using the interconnect. Since 8i the introduction of the GV$ views meant that a DBA could view cluster-wide database and other statistics sitting on any node/instance of the cluster.
The limitations of OPS are
  • scalability is limited to the capacity of the node
  • you cannot easily add additional nodes to a OPS
  • OPS requires third-party clustering software adding to the expense and complexity.
  • OPS requires RAW partitions and these can be difficult to manage.
Oracle RAC addresses the limitation in OPS by extending Cache Fusion, and the dynamic lock mastering. Oracle 10g RAC also comes with its own integrated clusterware and storage management framework, removing all dependencies of a third-party clusterware product. The latest Oracle RAC offers
  • Availability - can be configured to have no SPOF even when running on low spec'ed hardware
  • Scalability - multiple servers in a cluster to manage a single database transparently (scale out), basically means adding additional server is easy.
  • Reliability - improved code and monitoring RAC has become very reliable
  • Affordability - you can use low cost hardware, however RAC is not going to come cheap.
  • Transparency - RAC looks and feels like a standard Oracle database to an application

RAC Architecture
Oracle Real Application clusters allows multiple instances to access a single database, the instances will be running on multiple nodes. In an standard Oracle configuration a database can only be mounted by one instance but in a RAC environment many instances can access a single database.
Oracle's RAC is heavy dependent on a efficient, high reliable high speed private network called the interconnect, make sure when designing a RAC system that you get the best that you can afford.
The table below describes the difference of a standard oracle database (single instance) an a RAC environment
Component
Single Instance Environment
RAC Environment
SGA
Instance has its own SGA
Each instance has its own SGA
Background processes
Instance has its own set of background processes
Each instance has its own set of background processes
Datafiles
Accessed by only one instance
Shared by all instances (shared storage)
Control Files
Accessed by only one instance
Shared by all instances (shared storage)
Online Redo Logfile
Dedicated for write/read to only one instance
Only one instance can write but other instances can read during recovery and archiving. If an instance is shutdown, log switches by other instances can force the idle instance redo logs to be archived
Archived Redo Logfile
Dedicated to the instance
Private to the instance but other instances will need access to all required archive logs during media recovery
Flash Recovery Log
Accessed by only one instance
Shared by all instances (shared storage)
Alert Log and Trace Files
Dedicated to the instance
Private to each instance, other instances never read or write to those files.
ORACLE_HOME
Multiple instances on the same server accessing different databases ca use the same executable files
Same as single instance plus can be placed on shared file system allowing a common ORACLE_HOME for all instances in a RAC environment.
RAC Components
The major components of a Oracle RAC system are
  • Shared disk system
  • Oracle Clusterware
  • Cluster Interconnects
  • Oracle Kernel Components
The below diagram describes the basic architecture of the Oracle RAC environment
Here are a list of processes running on a freshly installed RAC

Disk architecture
With today's SAN and NAS disk storage systems, sharing storage is fairly easy and is required for a RAC environment, you can use the below storage setups
  • SAN (Storage Area Networks) - generally using fibre to connect to the SAN
  • NAS ( Network Attached Storage) - generally using a network to connect to the NAS using either NFS, ISCSI
  • JBOD - direct attached storage, the old traditional way and still used by many companies as a cheap option
All of the above solutions can offer multi-pathing to reduce SPOFs within the RAC environment, there is no reason not to configure multi-pathing as the cost is cheap when adding additional paths to the disk because most of the expense is paid when out when configuring the first path, so an additional controller card and network/fibre cables is all that is need.
The last thing to think about is how to setup the underlining disk structure this is known as a raid level, there are about 12 different raid levels that I know off, here are the most common ones
raid 0 (Striping)
A number of disks are concatenated together to give the appearance of one very large disk.
Advantages
   Improved performance
   Can Create very large Volumes
Disadvantages
   Not highly available (if one disk fails, the volume fails)
raid 1 (Mirroring)
A single disk is mirrored by another disk, if one disk fails the system is unaffected as it can use its mirror.
Advantages
   Improved performance
   Highly Available (if one disk fails the mirror takes over)

Disadvantages
   Expensive (requires double the number of disks)
raid 5
Raid stands for Redundant Array of Inexpensive Disks, the disks are striped with parity across 3 or more disks, the parity is used in the event that one of the disks fails, the data on the failed disk is reconstructed by using the parity bit.
Advantages
   Improved performance (read only)
   Not expensive

Disadvantages
   Slow write operations (caused by having to create the parity bit)
There are many other raid levels that can be used with a particular hardware environment for example EMC storage uses the RAID-S, HP storage uses Auto RAID, so check with the manufacture for the best solution that will provide you with the best performance and resilience.
Once you have you storage attached to the servers, you have three choices on how to setup the disks
  • Raw Volumes - normally used for performance benefits, however they are hard to manage and backup
  • Cluster FileSystem - used to hold all the Oracle datafiles can be used by windows and linux, its not used widely
  • Automatic Storage Management (ASM) - Oracle choice of storage management, its a portable, dedicated and optimized cluster filesystem
I will only be discussing ASM, which i have already have a topic on called Automatic Storage Management.
Oracle Clusterware
Oracle Clusterware software is designed to run Oracle in a cluster mode, it can support you to 64 nodes, it can even be used with a vendor cluster like Sun Cluster.
The Clusterware software allows nodes to communicate with each other and forms the cluster that makes the nodes work as a single logical server. The software is run by the Cluster Ready Services (CRS) using the Oracle Cluster Registry (OCR) that records and maintains the cluster and node membership information and the voting disk which acts as a tiebreaker during communication failures. Consistent heartbeat information travels across the interconnect to the voting disk when the cluster is running.
The CRS has four components
  • OPROCd - Process Monitor Daemon
  • CRSd - CRS daemon, the failure of this daemon results in a node being reboot to avoid data corruption
  • OCSSd - Oracle Cluster Synchronization Service Daemon (updates the registry)
  • EVMd - Event Volume Manager Daemon
The OPROCd daemon provides the I/O fencing for the Oracle cluster, it uses the hangcheck timer or watchdog timer for the cluster integrity. It is locked into memory and runs as a realtime processes, failure of this daemon results in the node being rebooted. Fencing is used to protect the data, if a node were to have problems fencing presumes the worst and protects the data thus restarts the node in question, its better to be save than sorry.
The CRSd process manages resources such as starting and stopping the services and failover of the application resources, it also spawns separate processes to manage application resources. CRS manages the OCR and stores the current know state of the cluster, it requires a public, private and VIP interface in order to run. OCSSd provides synchronization services among nodes, it provides access to the node membership and enables basic cluster services, including cluster group services and locking, failure of this daemon causes the node to be rebooted to avoid split-brain situations.
The below functions are covered by the OCSSd
  • CSS provides basic Group Services Support, it is a distributed group membership system that allows applications to coordinate activities to archive a common result.
  • Group services use vendor clusterware group services when it is available.
  • Lock services provide the basic cluster-wide serialization locking functions, it uses the First In, First Out (FIFO) mechanism to manage locking
  • Node services uses OCR to store data and updates the information during reconfiguration, it also manages the OCR data which is static otherwise.
The last component is the Event Management Logger, which runs the EVMd process. The daemon spawns a processes called evmlogger and generates the events when things happen. The evmlogger spawns new children processes on demand and scans the callout directory to invoke callouts. Death of the EVMd daemon will not halt the instance and will be restarted.
Quick recap
CRS Process
Functionality
Failure of the Process
Run AS
OPROCd - Process Monitor
provides basic cluster integrity services
Node Restart
Root
EVMd - Event Management
spawns a child process event logger and generates callouts
Daemon automatically restarted, no node restart
Oracle
OCSSd - Cluster Synchronization Services
basic node membership, group services, basic locking
Node Restart
Oracle
CRSd - Cluster Ready Services
resource monitoring, failover and node recovery
Daemon restarted automatically, no node restart
root
The cluster-ready services (CRS) is a new component in 10g RAC, its is installed in a separate home directory called ORACLE_CRS_HOME. It is a mandatory component but can be used with a third party cluster (Veritas, Sun Cluster), by default it manages the node membership functionality along with managing regular RAC-related resources and services

RAC uses a membership scheme, thus any node wanting to join the cluster as to become a member. RAC can evict any member that it seems as a problem, its primary concern is protecting the data. You can add and remove nodes from the cluster and the membership increases or decrease, when network problems occur membership becomes the deciding factor on which part stays as the cluster and what nodes get evicted, the use of a voting disk is used which I will talk about later.
The resource management framework manage the resources to the cluster (disks, volumes), thus you can have only have one resource management framework per resource. Multiple frameworks are not supported as it can lead to undesirable affects.
The Oracle Cluster Ready Services (CRS) uses the registry to keep the cluster configuration, it should reside on a shared storage and accessible to all nodes within the cluster. This shared storage is known as the Oracle Cluster Registry (OCR) and its a major part of the cluster, it is automatically backed up (every 4 hours) the daemons plus you can manually back it up. The OCSSd uses the OCR extensively and writes the changes to the registry
The OCR keeps details of all resources and services, it stores name and value pairs of information such as resources that are used to manage the resource equivalents by the CRS stack. Resources with the CRS stack are components that are managed by CRS and have the information on the good/bad state and the callout scripts. The OCR is also used to supply bootstrap information ports, nodes, etc, it is a binary file.
The OCR is loaded as cache on each node, each node will update the cache then only one node is allowed to write the cache to the OCR file, the node is called the master. The Enterprise manager also uses the OCR cache, it should be at least 100MB in size. The CRS daemon will update the OCR about status of the nodes in the cluster during reconfigurations and failures.
The voting disk (or quorum disk) is shared by all nodes within the cluster, information about the cluster is constantly being written to the disk, this is know as the heartbeat. If for any reason a node cannot access the voting disk it is immediately evicted from the cluster, this protects the cluster from split-brains (the Instance Membership Recovery algorithm IMR is used to detect and resolve split-brains) as the voting disk decides what part is the really cluster. The voting disk manages the cluster membership and arbitrates the cluster ownership during communication failures between nodes. Voting is often confused with quorum the are similar but distinct, below details what each means

The only vote that counts is the quorum member vote, the quorum member vote defines the cluster. If a node or group of nodes cannot archive a quorum, they should not start any services because they risk conflicting with an established quorum.
The voting disk has to reside on shared storage, it is a a small file (20MB) that can be accessed by all nodes in the cluster. In Oracle 10g R1 you can have only one voting disk, but in R2 you can have upto 32 voting disks allowing you to eliminate any SPOF's.
The original Virtual IP in Oracle was Transparent Application Failover (TAF), this had limitations, this has now been replaced with cluster VIPs. The cluster VIPs will failover to working nodes if a node should fail, these public IPs are configured in DNS so that users can access them. The cluster VIPs are different from the cluster interconnect IP address and are only used to access the database.
The cluster interconnect is used to synchronize the resources of the RAC cluster, and also used to transfer some data from one instance to another. This interconnect should be private, highly available and fast with low latency, ideally they should be on a minimum private 1GB network. What ever hardware you are using the NIC should use multi-pathing (Linux - bonding, Solaris - IPMP). You can use crossover cables in a QA/DEV environment but it is not supported in a production environment, also crossover cables limit you to a two node cluster.
Oracle Kernel Components
The kernel components relate to the background processes, buffer cache and shared pool and managing the resources without conflicts and corruptions requires special handling.
In RAC as more than one instance is accessing the resource, the instances require better coordination at the resource management level. Each node will have its own set of buffers but will be able to request and receive data blocks currently held in another instance's cache. The management of data sharing and exchange is done by the Global Cache Services (GCS).
All the resources in the cluster group form a central repository called the Global Resource Directory (GRD), which is distributed. Each instance masters some set of resources and together all instances form the GRD. The resources are equally distributed among the nodes based on their weight. The GRD is managed by two services called Global Caches Services (GCS) and Global Enqueue Services (GES), together they form and manage the GRD. When a node leaves the cluster, the GRD portion of that instance needs to be redistributed to the surviving nodes, a similar action is performed when a new node joins.
RAC Background Processes
Each node has its own background processes and memory structures, there are additional processes than the norm to manage the shared resources, theses additional processes maintain cache coherency across the nodes.
Cache coherency is the technique of keeping multiple copies of a buffer consistent between different Oracle instances on different nodes. Global cache management ensures that access to a master copy of a data block in one buffer cache is coordinated with the copy of the block in another buffer cache.
The sequence of a operation would go as below
  1. When instance A needs a block of data to modify, it reads the bock from disk, before reading it must inform the GCS (DLM). GCS keeps track of the lock status of the data block by keeping an exclusive lock on it on behalf of instance A
  2. Now instance B wants to modify that same data block, it to must inform GCS, GCS will then request instance A to release the lock, thus GCS ensures that instance B gets the latest version of the data block (including instance A modifications) and then exclusively locks it on instance B behalf.
  3. At any one point in time, only one instance has the current copy of the block, thus keeping the integrity of the block.
GCS maintains data coherency and coordination by keeping track of all lock status of each block that can be read/written to by any nodes in the RAC. GCS is an in memory database that contains information about current locks on blocks and instances waiting to acquire locks. This is known as Parallel Cache Management (PCM). The Global Resource Manager (GRM) helps to coordinate and communicate the lock requests from Oracle processes between instances in the RAC. Each instance has a buffer cache in its SGA, to ensure that each RAC instance obtains the block that it needs to satisfy a query or transaction. RAC uses two processes the GCS and GES which maintain records of lock status of each data file and each cached block using a GRD.
So what is a resource, it is an identifiable entity, it basically has a name or a reference, it can be a area in memory, a disk file or an abstract entity. A resource can be owned or locked in various states (exclusive or shared). Any shared resource is lockable and if it is not shared no access conflict will occur.
A global resource is a resource that is visible to all the nodes within the cluster. Data buffer cache blocks are the most obvious and most heavily global resource, transaction enqueue's and database data structures are other examples. GCS handle data buffer cache blocks and GES handle all the non-data block resources.
All caches in the SGA are either global or local, dictionary and buffer caches are global, large and java pool buffer caches are local. Cache fusion is used to read the data buffer cache from another instance instead of getting the block from disk, thus cache fusion moves current copies of data blocks between instances (hence why you need a fast private network), GCS manages the block transfers between the instances.
Finally we get to the processes
Oracle RAC Daemons and Processes

LMSn
Lock Manager Server process - GCS
this is the cache fusion part and the most active process, it handles the consistent copies of blocks that are transferred between instances. It receives requests from LMD to perform lock requests. I rolls back any uncommitted transactions. There can be up to ten LMS processes running and can be started dynamically if demand requires it.
they manage lock manager service requests for GCS resources and send them to a service queue to be handled by the LMSn process. It also handles global deadlock detection and monitors for lock conversion timeouts.
as a performance gain you can increase this process priority to make sure CPU starvation does not occur
you can see the statistics of this daemon by looking at the view X$KJMSDP
LMON
Lock Monitor Process - GES
this process manages the GES, it maintains consistency of GCS memory structure in case of process death. It is also responsible for cluster reconfiguration and locks reconfiguration (node joining or leaving), it checks for instance deaths and listens for local messaging.
A detailed log file is created that tracks any reconfigurations that have happened.
LMD
Lock Manager Daemon - GES
this manages the enqueue manager service requests for the GCS. It also handles deadlock detention and remote resource requests from other instances.
you can see the statistics of this daemon by looking at the view X$KJMDDP
LCK0
Lock Process - GES
manages instance resource requests and cross-instance call operations for shared resources. It builds a list of invalid lock elements and validates lock elements during recovery.
DIAG
Diagnostic Daemon
This is a lightweight process, it uses the DIAG framework to monitor the health of the cluster. It captures information for later diagnosis in the event of failures. It will perform any necessary recovery if an operational hang is detected.





RAC Administration
I am only going to talk about RAC administration, if you need Oracle administration then see my Oracle section.
It is recommended that the spfile (binary parameter file) is shared between all nodes within the cluster, but it is possible that each instance can have its own spfile. The parameters can be grouped into three categories
Unique parameters
These parameters are unique to each instance, examples would be instance_namethread and undo_tablespace
Identical parameters
Parameters in this category must be the same for each instance, examples would be db_name and control_file
Neither unique or identical parameters
parameters that are not in any of the above, examples would be db_cache_size, large_pool_size, local_listener andgcs_servers_processes
The main unique parameters that you should know about are
  • instance_name - defines the name of the Oracle instance (default is the value of the oracle_sid variable)
  • instance_number - a unique number for each instance must be greater than 0 but smaller than the max_instance parameter
  • thread - specifies the set of redolog files to be used by the instance
  • undo_tablespace - specifies the name of the undo tablespace to be used by the instance
  • rollback_segments - you should use Automatic Undo Management
  • cluster_interconnects - use if only if Oracle has trouble not picking the correct interconnects
The identical unique parameters that you should know about are below you can use the below query to view all of them

       select name, isinstance_modifiable from v$parameter where isinstance_modifiable = 'false' order by name;
  • cluster_database - options are true or false, mounts the control file in either share (cluster) or exclusive mode, use false in the below cases
    • Converting from no archive log mode to archive log mode and vice versa
    • Enabling the flashback database feature
    • Performing a media recovery on a system table
    • Maintenance of a node
  • active_instance_count - used for primary/secondary RAC environments
  • cluster_database_instances - specifies the number of instances that will be accessing the database (set to maximum # of nodes)
  • dml_locks - specifies the number of DML locks for a particular instance (only change if you get ORA-00055 errors)
  • gc_files_to_locks - specify the number of global locks to a data file, changing this disables the Cache Fusion.
  • max_commit_propagation_delay - influences the mechanism Oracle uses to synchronize the SCN among all instances
  • instance_groups - specify multiple parallel query execution groups and assigns the current instance to those groups
  • parallel_instance_group - specifies the group of instances to be used for parallel query execution
  • gcs_server_processes - specify the number of lock manager server (LMS) background processes used by the instance for Cache Fusion
  • remote_listener - register the instance with listeners on remote nodes.
syntax for parameter file
<instance_name>.<parameter_name>=<parameter_value>

inst1.db_cache_size = 1000000
*.undo_management=auto
example
alter system set db_2k_cache_size=10m scope=spfile sid='inst1';

Note: use the sid option to specify a particular instance
Starting and Stopping Instances
The srvctl command is used to start/stop an instance, you can also use sqlplus to start and stop the instance
start all instances
srvctl start database -d <database> -o <option>

Note: starts listeners if not already running, you can use the -o option to specify startup/shutdown options, see below for options

force
open
mount
nomount
stop all instances
srvctl stop database -d <database> -o <option>

Note: the listeners are not stopped, you can use the -o option to specify startup/shutdown options, see below for options

immediate
abort
normal
transactional
start/stop particular instance
srvctl [start|stop] database -d <database> -i <instance>,<instance>
Undo Management
To recap on undo management you can see my undo section, instances in a RAC do not share undo, they each have a dedicated undo tablespace. Using the undo_tablespace parameter each instance can point to its own undo tablespace
undo tablespace
instance1.undo_tablespace=undo_tbs1
instance2.undo_tablespace=undo_tbs2
With todays Oracle you should be using automatic undo management, again I have a detailed discussion on AUM in my undo section.
Temporary Tablespace
I have already discussed temporary tablespace's, in a RAC environment you should setup a temporary tablespace group, this group is then used by all instances of the RAC. Each instance creates a temporary segment in the temporary tablespace it is using. If an instance is running a large sort, temporary segments can be reclaimed from segments from other instances in that tablespace.
useful views
gv$sort_segment - explore current and maximum sort segment usage statistics (check columns freed_extents, free_requests ,if they grow increase tablespace size)
gv$tempseg_usage - explore temporary segment usage details such as name, SQL, etc
v$tempfile - identify - temporary datafiles being used for the temporary tablespace
Redologs
I have already discussed redologs, in a RAC environment every instance has its own set of redologs. Each instance has exclusive write access to its own redologs, but each instance can read each others redologs, this is used for recovery. Redologs are located on the shared storage so that all instances can have access to each others redologs. The process is a little different to the standard Oracle when changing the archive mode
archive mode (RAC)
SQL> alter system set cluster_database=false scope=spfile sid='prod1';
srvctl stop database -d <database>
SQL> startup mount
SQL> alter database archivelog;
SQL> alter system set cluster_database=true scope=spfile sid='prod1';
SQL> shutdown;
srvctl start database -d prod
Flashback
Again I have already talked about flashback, there is no difference in RAC environment apart from the setting up
flashback (RAC)
## Make sure that the database is running in archive log mode
SQL> archive log list

## Setup the flashback
SQL> alter system set cluster_database=false scope=spfile sid='prod1';
SQL> alter system set DB_RECOVERY_FILE_DEST_SIZE=200M scope=spfile;
SQL> alter system set DB_RECOVERY_FILE_DEST='/ocfs2/flashback' scope=spfile;
srvctl stop database -p prod1
SQL> startup mount
SQL> alter database flashback on;
SQL> shutdown;
srvctl start database -p prod1
SRVCTL command
We have already come across the srvctl above, this command is called the server control utility. It can divided into two categories
  • Database configuration tasks
  • Database instance control tasks
Oracle stores database configuration in a repository, the configuration is stored in the Oracle Cluster Registry (OCR) that was created when RAC was installed, it will be located on the shared storage. Srvctl uses CRS to communicate and perform startup and shutdown commands on other nodes.
I suggest that you lookup the command but I will provide a few examples
display the registered databases
srvctl config database
status
srvctl status database -d <database
srvctl status instance -d <database> -i <instance>
srvctl status nodeapps -n <node>
srvctl status service -d <database>
srvctl status asm -n <node>
stopping/starting
srvctl stop database -d <database>
srvctl stop instance -d <database> -i <instance>,<instance>
srvctl stop service -d <database> [-s <service><service>] [-i <instance>,<instance>]
srvctl stop nodeapps -n <node>
srvctl stop asm -n <node>

srvctl start database -d <database>
srvctl start instance -d <database> -i <instance>,<instance>
srvctl start service -d <database> -s <service><service> -i <instance>,<instance>
srvctl start nodeapps -n <node>
srvctl start asm -n <node>
adding/removing
srvctl add database -d <database> -o <oracle_home>
srvctl add instance -d <database> -i <instance> -n <node>
srvctl add service -d <database> -s <service> -r <preferred_list>
srvctl add nodeapps -n <node> -o <oracle_home> -A <name|ip>/network
srvctl add asm -n <node> -i <asm_instance> -o <oracle_home>

srvctl remove database -d <database> -o <oracle_home>
srvctl remove instance -d <database> -i <instance> -n <node>
srvctl remove service -d <database> -s <service> -r <preferred_list>
srvctl remove nodeapps -n <node> -o <oracle_home> -A <name|ip>/network
srvctl asm remove -n <node>
Services
Services are used to manage the workload in Oracle RAC, the important features of services are
  • used to distribute the workload
  • can be configured to provide high availability
  • provide a transparent way to direct workload
The view v$services contains information about services that have been started on that instance, here is a list from a fresh RAC installation
http://www.datadisk.co.uk/images/rac/admin2.jpg
The table above is described below
  • Goal - allows you to define a service goal using service time, throughput or none
  • Connect Time Load Balancing Goal - listeners and mid-tier servers contain current information about service performance
  • Distributed Transaction Processing - used for distributed transactions
  • AQ_HA_Notifications - information about nodes being up or down will be sent to mid-tier servers via the advance queuing mechanism
  • Preferred and Available Instances - the preferred instances for a service, available ones are the backup instances
You can administer services using the following tools
  • DBCA
  • EM (Enterprise Manager)
  • DBMS_SERVICES
  • Server Control (srvctl)
Two services are created when the database is first installed, these services are running all the time and cannot be disabled.
  • sys$background - used by an instance's background processes only
  • sys$users - when users connect to the database without specifying a service they use this service
add
srvctl add service -d D01 -s BATCH_SERVICE -r node1,node2 -a node3

Note: the options are describe below

-d - database
-s - the service
-r - the service will running on the these nodes
-a - if nodes in the -r list are not running then run on this node
remove
srvctl remove service -d D01 -s BATCH_SERVICE
start
srvctl start service -d D01 -s BATCH_SERVICE
stop
srvctl stop service -d D01 -s BATCH_SERVICE
status
srvctl status service -d D10 -s BATCH_SERVICE
service (example)
## create the JOB class
BEGIN
  DBMS_SCHEDULER.create_job_class(
     job_class_name => 'BATCH_JOB_CLASS',
     service            => 'BATCH_SERVICE');
END;
/

## Grant the privileges to execute the job
grant execute on sys.batch_job_class to vallep;
## create a job associated with a job class
BEGIN
  DBMS_SCHDULER.create_job(
    job_name => 'my_user.batch_job_test',
   job_type => 'PLSQL_BLOCK',
    job_action => SYSTIMESTAMP'
    repeat_interval => 'FREQ=DAILY;',
    job_class => 'SYS.BATCH_JOB_CLASS',
    end_date => NULL,
    enabled => TRUE,
    comments => 'Test batch job to show RAC services');
END;
/

## assign a job class to an existing job
exec dbms_scheduler.set_attribute('MY_BATCH_JOB', 'JOB_CLASS', 'BATCH_JOB_CLASS');
Cluster Ready Services (CRS)
CRS is Oracle's clusterware software, you can use it with other third-party clusterware software, though it is not required (apart from HP True64).
CRS is start automatically when the server starts, you should only stop this service in the following situations
  • Applying a patch set to $ORA_CRS_HOME
  • O/S maintenance
  • Debugging CRS problems
CRS Administration

Starting
## Starting CRS using Oracle 10g R1
not possible
## Starting CRS using Oracle 10g R2
$ORA_CRS_HOME/bin/crsctl start crs
Stopping
## Stopping CRS using Oracle 10g R1
srvctl stop -d database <database>
srvctl stop asm -n <node>
srvctl stop nodeapps -n <node>
/etc/init.d/init.crs stop

## Stopping CRS using Oracle 10g R2
$ORA_CRS_HOME/bin/crsctl stop crs
disabling/enabling
## stop CRS restarting after a reboot, basically permanent over reboots

## Oracle 10g R1
/etc/init.d/init.crs [disable|enable]

## Oracle 10g R2
$ORA_CRS_HOME/bin/crsctl [disable|enable] crs
Checking
$ORA_CRS_HOME/bin/crsctl check crs
$ORA_CRS_HOME/bin/crsctl check evmd
$ORA_CRS_HOME/bin/crsctl check cssd
$ORA_CRS_HOME/bin/crsctl check crsd
$ORA_CRS_HOME/bin/crsctl check install -wait 600
Resource Applications (CRS Utilities)

Status
$ORA_CRS_HOME/bin/crs_stat
$ORA_CRS_HOME/bin/crs_stat -t
$ORA_CRS_HOME/bin/crs_stat -ls
$ORA_CRS_HOME/bin/crs_stat -p

Note:
-t more readable display
-ls permission listing
-p parameters
create profile
$ORA_CRS_HOME/bin/crs_profile
register/unregister application
$ORA_CRS_HOME/bin/crs_register
$ORA_CRS_HOME/bin/crs_unregister
Start/Stop an application
$ORA_CRS_HOME/bin/crs_start
$ORA_CRS_HOME/bin/crs_stop
Resource permissions
$ORA_CRS_HOME/bin/crs_getparam
$ORA_CRS_HOME/bin/crs_setparam
Relocate a resource
$ORA_CRS_HOME/bin/crs_relocate
Nodes

member number/name
olsnodes -n

Note: the olsnodes command is located in $ORA_CRS_HOME/bin
local node name
olsnodes -l
activates logging
olsnodes -g
Oracle Interfaces

Display
oifcfg getif
Delete
oicfg delig -global
Set
oicfg setif -global <interface name>/<subnet>:public
oicfg setif -global <interface name>/<subnet>:cluster_interconnect
Global Services Daemon Control

Starting
gsdctl start
Stopping
gsdctl stop
Status
gsdctl status
Cluster Configuration (clscfg is used during installation)

create a new configuration
clscfg -install

Note: the clscfg command is located in $ORA_CRS_HOME/bin
upgrade or downgrade and existing configuration
clscfg -upgrade
clscfg -downgrade
add or delete a node from the configuration
clscfg -add
clscfg -delete
create a special single-node configuration for ASM
clscfg -local
brief listing of terminology used in the other nodes
clscfg -concepts
used for tracing
clscfg -trace
Help
clscfg -h
Cluster Name Check

print cluster name
cemutlo -n

Note: in Oracle 9i the ulity was called "cemutls", the command is located in $ORA_CRS_HOME/bin
print the clusterware version
cemutlo -w

Note: in Oracle 9i the ulity was called "cemutls"
Node Scripts

Add Node
addnode.sh

Note: see 
adding and deleting nodes
Delete Node
deletenode.sh

Note: 
see adding and deleting nodes
Oracle Cluster Registry (OCR)
As you already know the OCR is the registry that contains information
  • Node list
  • Node membership mapping
  • Database instance, node and other mapping information
  • Characteristics of any third-party applications controlled by CRS
The file location is specified during the installation, the file pointer indicating the OCR device location is the ocr.loc, this can be in either of the following
  • linux - /etc/oracle
  • solaris - /var/opt/oracle
The file contents look something like below, this was taken from my installation
orc.loc
ocrconfig_loc=/u02/oradata/racdb/OCRFile
ocrmirrorconfig_loc=/u02/oradata/racdb/OCRFile_mirror
local_only=FALSE
OCR is import to the RAC environment and any problems must be immediately actioned, the command can be found in located in $ORA_CRS_HOME/bin
OCR Utilities

log file
$ORA_HOME/log/<hostname>/client/ocrconfig_<pid>.log
Checking
ocrcheck

Note: will return the OCR version, total space allocated, space used, free space, location of each device and the result of the integrity check
dump contents
ocrdump

Note: by default it dumps the contents into a file named OCRDUMPFILE in the current directory
export/import
ocrconfig -export <file>

ocrconfig -restore <file>
backup/restore
# show backups
ocrconfig -showbackup

# to change the location of the backup, you can even specify a ASM disk
ocrconfig -backuploc <path|+asm>

# perform a backup, will use the location specified by the -backuploc location
ocrconfig -manualbackup

# perform a restore
ocrconfig -restore <file>

# delete a backup
orcconfig -delete <file>

Note: there are many more option so see the ocrconfig man page
add/remove/replace
## add/relocate the ocrmirror file to the specified location
ocrconfig -replace ocrmirror '/ocfs2/ocr2.dbf'

## relocate an existing OCR file
ocrconfig -replace ocr '/ocfs1/ocr_new.dbf'

## remove the OCR or OCRMirror file
ocrconfig -replace ocr
ocrconfig -replace ocrmirror
Voting Disk
The voting disk as I mentioned in the architecture is used to resolve membership issues in the event of a partitioned cluster, the voting disk protects data integrity.
querying
crsctl query css votedisk
Adding
crsctl add css votedisk <file>
deleting
crsctl delete css votedisk <file>

RAC Backups and Recovery
Backups and recovery is very similar to a single instance database. This article covers only the specific issues that surround RAC backups and recovery, I have already written a article on standard Oraclebackups and recovery.
Backups can be different depending on the the size of the company
  • small company - may use tools such as tar, cpio, rsync
  • medium/large company - Veritas Netbackup, RMAN
  • Enterprise company - SAN mirroring with a backup option like Netbackup or RMAN
Oracle RAC can use all the above backup technologies, but Oracle prefers you to use RMAN oracle own backup solution.
Backup Basics
Oracle backups can be taken hot or cold, a backup will comprise of the following
  • Datafiles
  • Control Files
  • Archive redolog files
  • Parameter files (init.ora or SPFILE)
Databases have now grown to very large sizes well over a terabyte in size in some cases, thus tapes backups are not used in these cases but sophisticated disk mirroring have taken their place. RMAN can be used in either a tape or disk solution, it can even work with third-party solutions such as Veritas Netbackup.
In a Oracle RAC environment it is critical to make sure that all archive redolog files are located on shared storage, this is required when trying to recover the database, as you need access to all archive redologs. RMAN can use parallelism when recovering, the node that performs the recovery must have access to all archived redologs, however, during recovery only one node applies the archived logs as in a standard single instance configuration.

Oracle RAC also supports Oracle Data Guard, thus you can have a primary database configured as a RAC and a standby database also configured as a RAC.

Instance Recovery
In a RAC environment there are two types of recovery
  • Crash Recovery - means that all instances have failed, thus they all need to be recovered
  • Instance Recovery - means that one or more instances have failed. this instance can then be recovered by the surviving instances
Redo information generated by an instance is called a thread of redo. All log files for that instance belong to this thread, an online redolog file belongs to a group and the group belongs to a thread. Details about log group file and thread association details are stored in the control file. RAC databases have multiple threads of redo, each instance has one active thread, the threads are parallel timelines and together form a stream. A stream consists of all the threads of redo information ever recorded, the streams form the timeline of changes performed to the database.
Oracle records the changes made to a database, these are called change vectors. Each vector is a description of a single change, usually a single block. A redo record contains one or more change vectors and is located by its Redo Byte Address (RBA) and points to a specific location in the redolog file (or thread). It will consist of three components
  • log sequence number
  • block number within the log
  • byte number within the block
Checkpoints are the same in a RAC environment and a single instance environment, I have already discussed checkpoints, when a checkpoint needs to be triggered, Oracle will look for the thread checkpoint that has the lowest checkpoint SCN, all blocks in memory that contain changes made prior to this SCN across all instances must be written out to disk. I have discussed how to control recovery in my Oracle section and this applies to RAC as well.
Crash Recovery
Crash recovery is basically the same for a single instance and a RAC environment, I have a complete recovery section in my Oracle section, here is a note detailing the difference

For a single instance the following is the recovery process
  1. The on-disk block is the starting point for the recovery, Oracle will only consider the block on the disk so the recovery is simple. Crash recovery will automatically happen using the online redo logs that are current or active
  2. The starting point is the last full checkpoint. The starting point is provided by the control file and compared against the same information in the data file headers, only the changes need to be applied
  3. The block specified in the redolog is read into cache, if the block has the same timestamp as the redo record (SCN match) the redo is applied.
For a RAC instance the following is the recovery process
  1. A foreground process in a surviving instance detects an "invalid block lock" condition when a attempt is made to read a block into the buffer cache. This is an indication that an instance has failed (died)
  2. The foreground process sends a notification to instance system monitor (SMON) which begin to search for dead instances. SMON maintains a list of all the dead instances and invalid block locks. Once the recovery and cleanup has finished this list is updated.
  3. The death of another instance is detected if the current instance is able to acquire that instance's redo thread locks, which is usually held by an open and active instance.
Oracle RAC uses a two-pass recovery, because a data block could have been modified in any of the instances (dead or alive), so it needs to obtain the latest version of the dirty block and it uses PI (Past Image) and Block Written Record (BWR) to archive this in a quick and timely fashion.
Block Written Record (BRW)
The cache aging and incremental checkpoint system would write a number of blocks to disk, when the DBWR completes a data block write operation, it also adds a redo record that states the block has been written (data block address and SCN). DBWn can write block written records (BWRs) in batches, though in a lazy fashion. In RAC a BWR is written when an instance writes a block covered by a global resource or when it is told that its past image (PI) buffer it is holding is no longer necessary.
Past Image (PI)
This is was makes RAC cache fusion work, it eliminates the write/write contention problem that existed in the OPS database. A PI is a copy of a globally dirty block and is maintained in the database buffer cache, it can be created and saved when a dirty block is shipped across to another instance after setting the resource role to global. The GCS is responsible for informing an instance that its PI is no longer needed after another instance writes a newer (current) version of the same block. PI's are discarded when GCS posts all the holding instances that a new and consistent version of that particular block is now on disk.

I go into more details about PI's in my 
cache fusion section.
The first pass does not perform the actual recovery but merges and reads redo threads to create a hash table of the blocks that need recovery and that are not known to have been written back to the datafiles. The checkpoint SCN is need as a starting point for the recovery, all modified blocks are added to the recovery set (a organized hash table). A block will not be recovered if its BWR version is greater than the latest PI in any of the buffer caches.
The second pass SMON rereads the merged redo stream (by SCN) from all threads needing recovery, the redolog entries are then compared against a recovery set built in the first pass and any matches are applied to the in-memory buffers as in a single pass recovery. The buffer cache is flushed and the checkpoint SCN for each thread is updated upon successful completion.
Cache Fusion Recovery
I have a detailed section on cache fusion, this section covers the recovery, cache fusion is only used in RAC environments, as additional steps are required, such as GRD reconfiguration, internode communication, etc. There are two types of recovery
  • Crash Recovery - all instances have failed
  • Instance Recovery - one instance has failed
In both cases the threads from failed instances need to be merged, in a instance recovery SMON will perform the recovery where as in a crash recovery a foreground process performs the recovery.
The main features (advantages) of cache fusion recovery are
  • Recovery cost is proportional to the number of failures, not the total number of nodes
  • It eliminates disk reads of blocks that are present in a surviving instance's cache
  • It prunes recovery set based on the global resource lock state
  • The cluster is available after an initial log scan, even before recovery reads are complete
In cache fusion the starting point for recovery of a block is its most current PI version, this could be located on any of the surviving instances and multiple PI blocks of a particular buffer can exist.
Remastering is the term used that describes the operation whereby a node attempting recovery tries to own or master the resource(s) that were once mastered by another instance prior to the failure. When one instance leaves the cluster, the GRD of that instance needs to be redistributed to the surviving nodes. RAC uses an algorithm called lazy remastering to remaster only a minimal number of resources during a reconfiguration. The entire Parallel Cache Management (PCM) lock space remains invalid while the DLM and SMON complete the below steps
  1. IDLM master node discards locks that are held by dead instances, the space is reclaimed by this operation is used to remaster locks that are held by the surviving instance for which a dead instance was remastered
  2. SMON issues a message saying that it has acquired the necessary buffer locks to perform recovery
Lets look at an example on what happens during a remastering, lets presume the following
  • Instance A masters resources 1, 3, 5 and 7
  • Instance B masters resources 2, 4, 6, and 8
  • Instance C masters resources 9, 10, 11 and 12
Instance B is removed from the cluster, only the resources from instance B are evenly remastered across the surviving nodes (no resources on instances A and C are affected), this reduces the amount of work the RAC has to perform, likewise when a instance joins a cluster only minimum amount of resources are remastered to the new instance.
Before Remastering
After Remastering
http://www.datadisk.co.uk/images/rac/backups1.jpg
http://www.datadisk.co.uk/images/rac/backups2.jpg
You can control the remastering process with a number of parameters
_gcs_fast_config
enables fast reconfiguration for gcs locks (true|false)
_lm_master_weight
controls which instance will hold or (re)master more resources than others
_gcs_resources
controls the number of resources an instance will master at a time
you can also force a dynamic remastering (DRM) of an object using oradebug
force dynamic remastering (DRM)
## Obtain the OBJECT_ID form the below table
SQL> select * from v$gcspfmaster_info;
## Determine who masters it
SQL> oradebug setmypid
SQL> oradebug lkdebug -a <OBJECT_ID>

## Now remaster the resource
SQL> oradebug setmypid
SQL> oradebug lkdebug -m pkey <OBJECT_ID>
The steps of a GRD reconfiguration is as follows
  • Instance death is detected by the cluster manager
  • Request for PCM locks are frozen
  • Enqueues are reconfigured and made available
  • DLM recovery
  • GCS (PCM lock) is remastered
  • Pending writes and notifications are processed
  • I Pass recovery
    • The instance recovery (IR) lock is acquired by SMON
    • The recovery set is prepared and built, memory space is allocated in the SMON PGA
    • SMON acquires locks on buffers that need recovery
  • II Pass recovery
    • II pass recovery is initiated, database is partially available
    • Blocks are made available as they are recovered
    • The IR lock is released by SMON, recovery is then complete
    • The system is available
Graphically it looks like below


RAC Performance
I have already discussed basic Oracle tuning, in this section I will mainly dicuss Oracle RAC tuning. First lets review the best pratices of a Oracle design regarding the application and database
  • Optimize connection management, ensure that the middle tier and programs that connect to the database are efficent in connection management and do not log on or off repeatedly
  • Tune the SQL using the available tools such as ADDM and SQL Tuning Advisor
  • Ensure that applications use bind variablescursor_sharing was introduced to solve this problem
  • Use packages and procedures (because they are compiled) in place of anonymous PL/SQL blocks and big SQL statements
  • Use locally managed tablespaces and automatic segment space management to help performance and simplify database administration
  • Use automatic undo management and temporary tablespace to simplify administration and increase performance
  • Ensure you use large caching when using sequences, unless you cannot afford to lose sequence during a crash
  • Avoid using DDL in production, it increases invalidations of the already parsed SQL statements and they need to be recompiled
  • Partion tables and indexes to reduce index leaf contention (buffer busy global cr problems)
  • Optimize contention on data blocks (hot spots) by avoiding small tables with too many rows in a block
Now we can review RAC specific best practices
  • Consider using application partitioning (see below)
  • Consider restricting DML-intensive users to using one instance, thus reducing cache contention
  • Keep read-only tablespaces away from DML-intensive tablespaces, they only require minimum resources thus optimizing Cache Fusion performance
  • Avoid auditing in RAC, this causes more shared library cache locks
  • Use full tables scans sparingly, it causes the GCS to service lots of block requests, see table v$sysstat column "table scans (long tables)"
  • if the application uses lots of logins, increase the value of sys.audsess$ sequence
Partitioning Workload
Workload partitioning is a certian type of workload that is executed on an instance, that is partitioning allows users who access the same set of data to log on to the same instance. This limits the amount of data that is shared between instances thus saving resources used for messaging and Cache Fusion data block transfer.
You should consider the following when deciding to implement partitioning
  • If the CPU and private interconnects are of high performance then there is no need to to partition
  • Partitioning does add complexity, thus if you can increase CPU and the interconnect performance the better
  • Only partition if performance is betting impacted
  • Test both partitioning and non-partitioning to what difference it makes, then decide if partitioning is worth it
RAC Wait Events
An event is an operation or particular function that the Oracle kernel performs on behalf of a user or a Oracle background process, events have specific names like database event. Whenever a session has to wait for something, the wait time is tracked and charged to the event that was associated with that wait. Events that are associated with all such waits are known as wait events. The are a number of wait classes
  • Commit
  • Scheduler
  • Application
  • Configuration
  • User I/O
  • System I/O
  • Concurrency
  • Network
  • Administrative
  • Cluster
  • Idle
  • Other
There are over 800 different events spread across the above list, however you probably will only deal with about 50 or so that can improve performance.
When a session requests access to a data block it sends a request to the lock master for proper authorization, the request does not know if it will receive the block via Cache Fusion or a permission to read from the disk. Two placeholder events
  • global cache cr request (consistent read - cr)
  • global cache curr request (current - curr)
keep track of the time a session spends in this state. There are number of types of wait events regarding access to a data block
Wait Event
Contention type
Description
gc current block 2-way
write/write
an instance requests authorization for a block to be accessed in current mode to modify a block, the instance mastering the resource receives the request. The master has the current version of the block and sends the current copy of the block to the requestor via Cache Fusion and keeps a Past Image (.PI)
If you get this then do the following
  • Analyze the contention, segments in the "current blocks received" section of AWR
  • Use application partitioning scheme
  • Make sure the system has enough CPU power
  • Make sure the interconnect is as fast as possible
  • Ensure that socket send and receive buffers are configured correctly
gc current block 3-way
write/write
an instance requests authorization for a block to be accessed in current mode to modify a block, the instance mastering the resource receives the request and forwards it to the current holder of the block, asking it to relinquish ownership. The holding instance sends a copy of the current version of the block to the requestor via Cache Fusion and transfers the exclusive lock to the requesting instance. It also keeps a past Image (PI).
Use the above actions to increase the performance
gc current block 2-way
write/read
The difference with the one above is that this sends a copy of the block thus keeping the current copy.
gc current block 3-way
write/read
The difference with the one above is that this sends a copy of the block thus keeping the current copy.
gc current block busy
write/write
The requestor will eventually get the block via cache fusion but it is delayed due to one of the following
  • The block was being used by another session on another session
  • was delayed as the holding instance could not write the corresponding redo record immediately
If you get this then do the following
gc current buffer busy
local
This is the same as above (gc current block busy), the difference is that another session on the same instance also has requested the block (hence local contention)
gc current block congested
none
This is caused if heavy congestion on the GCS, thus CPU resources are stretched
Enqueue Tuning
Oracle RAC uses a queuing mechanism to ensure proper use of shared resources, it is called Global Enqueue Services (GES). Enqueue wait is the time spent by a session waiting for a shared resource, here are some examples of enqueues:
  • updating the control file (CF enqueue)
  • updating an individual row (TX enqueue)
  • exclusive lock on a table (TM enqueue)
Enqueues can be managed by the instance itself others are used globally, GES is responsible for coordinating the global resources. The formula used to calculate the number of enqueue resources is as below
                     GES Resources = DB_FILES + DML_LOCKS + ENQUEUE_RESOURCES + PROCESS + TRANSACTION x (1 + (N - 1)/N)

                      N = number of RAC instances
displaying enqueues stats
SQL> column current_utilization heading current
SQL> column max_utilization heading max_usage
SQL> column initial_allocation heading initial
SQL> column resource_limit format a23;

SQL> select * from v$resource_limit;
AWR and RAC
I have already discussed AWR in a single instance environment, so for a quick refresh take a look and come back here to see how you can use it in a RAC environment.
From a RAC point of view there are a number of RAC-specific sections that you need to look at in the AWR, in the report section is a AWR of my home RAC environment, you can view the whole report here.
RAC AWR Section
Report
Description
Number of Instances
lists the number of instances from the beginning and end of the AWR report
Instance global cache load profile
information about the interinstance cache fusion data block and messaging traffic, because my AWR report is lightweight here is a more heavy used RAC example
Global Cache Load Profile
~~~~~~~~~~~~~~~~~~~~~~~~~            Per Second        Per Transaction
                                                 ---------------          ---------------
Global Cache blocks received:       315.37                   12.82
Global Cache blocks served:          240.30                   9.67
GCS/GES messages received:          525.16                   20.81
GCS/GES messages sent:                765.32                   30.91

The first two statistics indicate the number of blocks transferred to or from this instance, thus if you are using a 8K block size

        Sent:        240 x 8,192 = 1966080 bytes/sec = 2.0 MB/sec
        Received:  315 x 8,192 = 2580480 bytes/sec = 2.6 MB/sec

to determine the amount of network traffic generated due to messaging you first need to find the average message size (this was 193 on my system)

    select sum(kjxmsize * (kjxmrcv + kjxmsnt + kjxmqsnt)) / sum((kjxmrcv + kjxmsnt + kjxmqsnt)) "avg Message size" from x$kjxm
        where kjxmrcv > 0 or kjxmsnt > 0 or kjxmqsnt > 0;

then calculate the amount of messaging traffic on this network

    193 (765 + 525) = 387000 = 0.4 MB

to calculate the total network traffic generated by cache fusion

     = 2.0 + 2.6 + 0.4 = 5 MBytes/sec
     = 5 x 8 = 40 Mbits/sec

The DBWR Fusion writes statistic indicates the number of times the local DBWR was forced to write a block to disk due to remote instances, this number should be low.
Glocal cache efficiency percentage
this section shows how the instance is getting all the data blocks it needs. The best order is the following
  • Local cache
  • Remote cache
  • Disk
The first two give the cache hit ratio for the instance, you are looking for a value less than 10%, if you are getting higher values then you may consider application partitioning.
GCS and GES - workload characteristics
this section contains timing statistics for global enqueue and global cache. As a general rule you are looking for
  • All timings related to CR (Consistent Read) processing block should be less than 10 msec
  • All timings related to CURRENT block processing should be less than 20 msec
Messaging statistics
The first section relates to sending a message and should be less than 1 second.
The second section details the breakup of direct and indirect messages, direct messages are sent by a instance foreground or the user processes to remote instances, indirect are messages that are not urgent and are pooled and sent.
Service statistics
shows the resources used by all the service instance supports
Service wait class statistics
summarizes waits in different categories for each service
Top 5 CR and current block segements
conatns the names of the top 5 contentious segments (table or index). If a table or index has a very high percentage of CR and Current block transfers you need to investigate. This is pretty much like a normal single instance.
Cluster Interconnect
As I stated above the interconnect it a critical part of the RAC, you must make sure that this is on the best hardware you can buy. You can confirm that the interconnect is being used in Oracle 9i and 10g by using the command oradebug to dump information out to a trace file, in Oracle 10g R2 the cluster interconnect is also contained in the alert.log file, you can view my information from here.
interconnect
SQL> oradebug setmypid
SQL> oradebug ipc
Note: look in the user_dump_dest directory, the trace will be there


Global Resource Directory (GRD)
The RAC environment includes many resources such as multiple versions of data block buffers in buffer caches in different modes, Oracle uses locking and queuing mechanisms to coordinate lock resources, data and interinstance data requests. Resources such as data blocks and locks must be synchronized between nodes as nodes within a cluster acquire and release ownership of them. The synchronization provided by the Global Resource Directory (GRD) maintains a cluster wide concurrency of the resources and in turn ensures the integrity of the shared data. Synchronization is also required for buffer cache management as it is divided into multiple caches, and each instance is responsible for managing its own local version of the buffer cache. Copies of data are exchanged between nodes, this sometimes is referred to as the global cache but in reality each nodes buffer cache is separate and copies of blocks are exchanged through traditional distributed locking mechanism.
Global Cache Services (GCS) maintain the cache coherency across buffer cache resources and Global Enqueue Services (GES) controls the resource management across the clusters non-buffer cache resources.
Cache Coherency
Cache coherency identifies the most up-to-date copy of a resource, also called the master copy, it uses a mechanism by which multiple copies of an object are keep consistent between Oracle instances. Parallel Cache Management (PCM) ensures that the master copy of a data block is stored in one buffer cache and consistent copies of the data block are stored in other buffer caches, the process LCKx is responsible for this task.
The lock and resource structures for instance locks reside in the GRD (also called the DLM), its a dedicated area within the shared pool. Details about the data blocks resources and cached versions are maintained by GCS. Additional details such as the location of the most current version, state of the buffer, role of the data block (local or global) and ownership are maintained by GES. Global cache together with GES form the GRD. Each instance maintains a part of the GRD in its SGA. The GCS and GES nominate one instance, this will become the resource master, to manage all information about a particular resource. Each instance knows which instance master is with which resource.
Resources and Enqueues
A resource is an identifiable entity, it has a name or reference. The referenced entity is usually a memory region, a disk file, a data block or an abstract entity. A resource can be owned or locked in various states (exclusive or shared), all resources are lockable. A global resource is visible throughout the cluster, thus a local resource can only be used by the instance at it is local too. Each resource can have a list of locks called the grant queue, that are currently granted to users. A convert queue is a queue of locks that are waiting to be converted to particular mode, this is the process of changing a lock from one mode to another, even a NULL is a lock. A resource has a lock value block (LVB). The Global Resource Manager (GRM) keeps the lock information valid and correct across the cluster.
Locks are placed on a resource grant or a convert queue, if the lock changes it moves between the queues. A lock leaves the convert queue under the following conditions
  • The process requests the lock termination (it remove the lock)
  • The process cancels the conversion, the lock is moved back to the grant queue in the previous mode
  • The requested mode is compatible with the most restrictive lock in the grant queue and with all the previous modes of the convert queue, and the lock is at the head of the convert queue
Convert requests are processed on a FIFO basis, the grant queue and convert queue are associated with each and every resource that is managed by the GES.
Enqueues are basically locks that support queuing mechanisms and that can be acquired in different modes. An enqueue can be held in exclusive mode by one process and others can hold a non-exclusive mode depending on the type. Enqueues are the same in RAC as they are in a single instance.
Global Enqueue Services (GES)
GES coordinates the requests of all global enqueues, it also deals with deadlocks and timeouts. There are two types of local locks, latches and enqueues, latches do not affect the cluster only the local instance, enqueues can affect both the cluster and the instance.
Enqueues are shared structures that serialize access to database resources, they support multiple modes and are held longer than latches, they protect persistent objects such as tables or library cache objects. Enqueues can use any of the following modes
Mode
Summary
Description
NULL
NULL
no access rights, a lock is held at this level to indicate that a process is interested in a resource
SS
SubShared
the resource can be read in an unprotected fashion other processes can read and write to the resource, the lock is also known as a row share lock
SX
Shared Exclusive
the resource can be read and written to in an unprotected fashion, this is also known as a RX (row exclusive) lock
S
Shared
a process cannot write to the resource but multiple processes can read it. This is the traditional share lock.
SSX
SubShared Exclusive
Only one process can hold a lock at this level, this makes sure that only processes can modify it at a time. Other processes can perform unprotected reads. This is also know as a SRX (shared row exclusive) table lock.
X
Exclusive
grants the holding process exclusive access to the resource, other processes cannot read or write to the resource. This is also the traditional exclusive lock.
Global Locks
Each node has information for a set of resources, Oracle uses a hashing algorithm to determine which nodes hold the directory tree information for the resource. Global locks are mainly of two types
  • Locks used by the GCS for buffer cache management, these are called PCM locks
  • Global locks (global enqueue) that Oracle synchronizes within a cluster to coordinate non-PCM resources, they protect the enqueue structures
An instance owns a global lock that protects a resource (i.e. data block or data dictionary entry) when the resource enters the instance's SGA.
GES locks control access to data files (not the data blocks) and control files and also serialize interinstance communication. They also control library caches and the dictionary cache. Examples of this are DDL, DML enqueue table locks, transaction enqueues and DDL locks or dictionary locks. The SCN and mount lock are global locks.
Transaction and row locks are the same as in a single instance database, the only difference is that the enqueues are global enqueues, take a look in locking for an in depth view on how Oracle locking works.
Messaging
The difference between RAC and a single instance messaging is that RAC uses the high speed interconnect and a single instance uses shared memory and semaphores, interrupts are used when one or more process want to use the processor in a multiple CPU architecture. GES uses messaging for interinstance communication, this is done by messages and asynchronous traps (ASTs). Both LMON and LMD use messages to communicate to other instances, the GRD is updated when locks are required. The messaging traffic can be viewed using the view V$GES_MISC.
A three-way lock message involves up to a maximum of three instances, Master instance (M), Holding instance (H) and the Requesting instance (R), the sequence is detailed below where requesting instance R is interested in block B1 from holding instance H. The resource is mastered in master instance M
http://www.datadisk.co.uk/images/rac/grd2.jpg
  1. Instance R gets the ownership information about a resource from the GRD, instance R then sends the message to the master instance M requesting access to the resource. This message is sent by a direct send as it is critical
  2. Instance M receives the message and forwards it to the holding instance H. This is also sent directly, this is known as a blocking asynchronous trap (BAST).
  3. Instance H sends the resource to instance R, using the interconnect, the resource is copied in instance R memory
  4. Once the lock handle is obtained on the resource instance R sends an acknowledgment to instance M. This message is queued as it is not critical, this is called acquisition asynchronous trap (AAST).
Because GES heavily rely's on messaging the interconnect must be of high quality (high performance , low latency), also the messages are kept small (128 bytes) to increase performance. The Traffic Controller (TRFC) is used to control the DLM traffic between the instances in the cluster, it uses buffering to accommodate large volumes of traffic. The TRFC keeps track of everything by using tickets (sequence numbers), there is a predefined pool of tickets this is dependent on the network send buffer size. A ticket is obtained before sending any messages, once sent the ticket is returned to the pool, LMS or LMD perform this. If there are no tickets then the message has to wait until a ticket is available. You can control the number of tickets and view them
system parameter
_lm_tickets
_lm_ticket_active_sendback (used for aggressive messaging)
ticket usage
select local_nid local, remote_nid remote, tckt_avail avail, tckt_limit limit, snd_q_len send_queue, tckt_wait waiting from v$ges_traffic_controller;
dump ticket information
SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug lkdebug -t
Note: the output can be viewed here
Global Cache Services (GCS)
GCS locks only protect data blocks in the global cache (also know as PCM locks), it can be acquired in share or exclusive mode. Each lock element can have the lock role set to either local (same as single instance) or global. When in global role three lock modes are possible, shared, exclusive and null. In global role mode you can read or write to the data block only as directed by the master instance of that resource. The lock and state information is held in the SGA and is maintained by GCS, these are called lock elements. It also holds a chain of cache buffer chains that are covered by the corresponding lock elements. These can be view via v$lock_element, the parameter _db_block_hash_buckets controls the number of hash buffer chain buckets.
GCS locks uses the following modes as stated above
Exclusive (X)
used during update or any DML operation, if another instance requires the block that has a exclusive lock it asks GES to request that he second instance disown the global lock
Shared (S)
used for select operations, reading of data does not require a instance to disown a global lock.
Null (N)
allows instances to keep a lock without any permission on the block(s). This mode is used so that locks need not be created and destroyed all the time, it just converts from one lock to another.
Lock roles are used by Cache Fusion, it can be either local or global, the resource is local if the block is dirty only in the local cache, it is global if the block is dirty in a remote cache or in several remote caches. A Past Image (PI) is kept by the instance when a block is shipped to another instance, the role is then changed to a global role, thus the PI represents the state of a dirty buffer. A node must keep a PI until it receives notification from the master that a write to disk has completed covering that version, the node will then log a block written record (BWR). I have already discussed PI and BWR in my backupsection.
When a new current block arrives, the previous PI remains untouched in case another node requires it. If there are a number of PI's that exist, they may or may not merge into a single PI, the master will determine this based on if the older PI's are required, a indeterminate number of PI's can exist.
In the local role only S and X modes are permitted, when requested by the master instance the holding instance serves a copy of the block to others. If the block is globally clean this instance lock role remains local. If the block is modified (dirty), a PI is retained and the lock becomes global. In the global lock role lock modes can be N, S and X, the block is global and it may even by dirty in any of the instances and the disk version may be obsolete. Interested parties can only modify the block using X mode, an instance cannot read from the disk as it may not be current, the holding instance can send copies to other instances when instructed by the master.
I have a complete detailed walkthough in my cache_fusion section, which will help you better to understand.
A lock element holds lock state information (converting, granting, etc). LEs are managed by the lock process to determine the mode of the locks, they also old a chain of cache buffers that are covered by the LE and allow the Oracle database to keep track of cache buffers that must be written to disk in a case a LE (mode) needs to be downgraded (X > N).
http://www.datadisk.co.uk/images/rac/grd3.jpg

LEs protect all the data blocks in the buffer cache, the list below describes the classes of the data block which are managed by the LEs using GCS locks (x$bh.class).
0
FREE
1
EXLCUR
2
SHRCUR
3
CR
4
READING
5
MRECOVERY
6
IRCOVERY
7
WRITING
8
PI
So putting this altogether you get the following, GCS manages PCM locks in the GRD, PCM locks manage the data blocks in the global cache. Data blocks are can be kept in any of the instances buffer cache (which is global), if not found then it can be read from disk by the requesting instance. The GCS monitors and maintains the list and mode of the blocks in all the instances. Each instance will master a number of resources, but a resource can only be mastered by one instance. GCS ensures cache coherency by requiring that instances acquire a lock before modifying or reading a database block. GCS locks are not row-level locks, row-level locks are used in conjunction with PCM locks. GCS lock ensures that they block is accessed by one instances then row-level locks manage the blocks at the row-level. If a block is modified all Past Images (PI) are no longer current and new copies are required to obtained.
Consistent read processing means that readers never block writers, as the same in a single instance. One parameter that can help is _db_block_max_cr_dba which limits the number of CR copies per DBA on the buffer cache. If too many CR requests arrive for a particular buffer, the holder can disown the lock on the buffer and write the buffer to the disk, thus the requestor can then read it from disk, especially if the requested block has a older SCN and needs to reconstruct it (known as CR fabrication). This is technically known as fairness downconvert, and the parameter _fairness_threshold can used to configure it.
The lightwork rule is involved when CR construction involves too much work and no current block or PI block is available in the cache for block cleanouts. The below can be used to view the number of times a downconvert occurs
downconvert
select cr_requests, light_works, data_requests, fairness_down_converts from v$cr_block_server;

Note: lower the _fairness_threshold if the ratio goes above 40%, set to 0 if the instance is a query only instance.
The GRD is a central repository for locks and resources, it is distributed across all nodes (not a single node), but only one instance masters a resource. The process of maintaining information about resources is called lock mastering or resource mastering. I spoke about lock remastering in my backup section.
Resource affinity allows the resource mastering of the frequently used resources on its local node, it uses dynamic resource mastering to move the location of the resource masters. Normally resource mastering only happens when a instance joins or leaves the RAC environment, as of Oracle 10g R2 mastering occurs at the object level which helps fine-grained object remastering. There are a number of parameters that can be used to dynamically remaster an object
_gc_affinity_time
specifies interval minutes for remastering
_gc_affinity_limit
defines the number of times a instance access the resource before remastering, setting to 0 disable remastering
_gc_affinity_minimum
defines the minimum number of times a instance access the resource before remastering
_lm_file_affinity
disables dynamic remastering for the objects belonging to those files
_lm_dynamic_remastering
enable or disable remastering
You should consult Oracle before changing any of the above parameters.
Cache Fusion
I mentioned above Cache Fusion in my GRD section, here I go into great detail on how it works, I will also provide a number of walk through examples on my RAC system.
Cache Fusion uses the most efficient communications as possible to limit the amount of traffic used on the interconnect, now you don't need this level of detail to administer a RAC environment but it sure helps to understand how RAC works when trying to diagnose problems. RAC appears to have one large buffer but this is not the case, in reality the buffer caches of each node remain separate, data blocks are shared through distributed locking and messagingoperations. RAC copies data blocks across the interconnect to other instances as it is more efficient than reading the disk, yes memory and networking together are faster than disk I/O.
Ping
The transfer of a data block from instances buffer cache to another instances buffer cache is know as a ping. As mentioned already when an instance requires a data block it sends the request to the lock master to obtain a lock in the desired mode, this process is known as blocking asynchronous trap (BAST). When an instance receives a BAST it downgrades the lock ASAP, however it might have to write the corresponding block to disk, this operation is known as disk ping or hard ping. Disk pings have been reduce in the later versions of RAC, thus relaying on block transfers more, however there will always be a small amount of disk pinging. In the newer versions of RAC when a BAST is received sending the block or downgrading the lock may be deferred by tens of milliseconds, this extra time allows the holding instance to complete an active transaction and mark the block header appropriately, this will eliminate any need for the receiving instance to check the status of the transaction immediately after receiving/reading a block. Checking the status of a transaction is an expensive operation that may require access (and pinging) to the related undo segment header and undo data blocks as well. The parameter _gc_defer_time can be used to define the duration by which an instance deferred downgrading a lock.
Past Image Blocks (PI)
In the GRD section I mentioned Past Images (PIs), basically they are copies of data blocks in the local buffer cache of an instance. When an instance sends a block it has recently modified to another instance, it preserves a copy of that block, marking as a PI. The PI is kept until that block is written to disk by the current owner of the block. When the block is written to disk and is known to have a global role, indicating the presents of PIs in other instances buffer caches, GCS informs the instance holding the PIs to discard the PIs. When a checkpoint is required it informs GCS of the write requirement, GCS is responsible for finding the most current block image and informing the instance holding that image to perform a block write. GCS then informs all holders of the global resource that they can release the buffers holding the PI copies of the block, allowing the global resource to be released. You can view the past image blocks present in the fixed table X$BH
PIs
select state, count(state) from X$BH group by state;


Note: the state column with 8 is the past images.
Cache Fusion I
Cache Fusion I is also know as consistent read server and was introduced in Oracle 8.1.5, it keeps a list of recent transactions that have changed a block.the original data contained in the block is preserved in the undo segment, which can be used to provide consistent read versions of the block.
In a single instance the following happens when reading a block
  • When a reader reads a recently modified block, it might find an active transaction in the block
  • The reader will need to read the undo segment header to decide whether the transaction has been committed or not
  • If the transaction is not committed, the process creates a consistent read (CR) version of the block in the buffer cache using the data in the block and the data stored in the undo segment
  • If the undo segment shows the transaction is committed, the process has to revisit the block and clean out the block (delay block cleanout) and generate the redo for the changes.
In an RAC environment if the process of reading the block is on an instance other than the one that modified the block, the reader will have to read the following blocks from the disk
  • data block to get the data and/or transaction ID and Undo Byte Address (UBA)
  • undo segment header block to find the last undo block used for the entire transaction
  • undo data block to get the actual record to construct a CR image
Before these blocks can be read the instance modifying the block will have to write those's blocks to disk, resulting in 6 I/O operations. In RAC the instance can construct a CR copy by hopefully using the above blocks that are still in memory and then sending the CR over the interconnect thus reducing 6 I/O operations.
As from Oracle 8 introduced a new background process called the Block Server Process makes the CR fabrication at the holders cache and ships the CR version of the block across the interconnect, the sequence is detailed in the table below
http://www.datadisk.co.uk/images/rac/cache_fusion1.jpg
  1. An instance sends a message to the lock manager requesting a shared lock on the block
  2. Following are the possibilities in the global cache
    • If there is no current user for the block, the lock manager grants the shared lock to the requesting instance
    • if the other instance has an exclusive lock on the block, the lock manager asks the owning instance to build a CR copy and ship it to the requesting instance. 
  3. Based on the result, either of the following can happen
    • if the lock is granted, the requesting instance reads the block from disk
    • The owning instance creates a CR version of the buffer in its own buffer cache and ships it to the requesting instance over the interconnect
  4. The owning instance also informs the lock manager and requesting instance that it has shipped the block
  5. The requesting instance has the locked granted, the lock manager updates the IDLM with the new holders of that resource
While making a CR copy, the holding instance may refuse to do so if
  • it does not find any of the blocks needed in its buffer cache, it will not perform a disk read to make a CR copy for another instance
  • It is repeatedly asked to send a CR copy of the same block, after sending the CR copies four times it will voluntarily relinquish the lock, write the block to the disk and let other instances get the block from the disk. The number of copies it will serve before doing so is governed by the parameter _fairness_threshold
Cache Fusion II
Read/Write contention was addressed in cache fusion I, cache fusion II addresses the write/write contention
http://www.datadisk.co.uk/images/rac/cache_fusion2.jpg
  1. An instance sends a message to the lock manager requesting an exclusive lock on the block
  2. Following are the possibilities in the global cache
    • If there is no current user for the block, the lock manager grants the exclusive lock to the requesting instance
    • if the other instance has an exclusive lock on the block, the lock manager asks the owning instance to release the lock 
  3. Based on the result, either of the following can happen
    • if the lock is granted, the requesting instance reads the block from disk
    • The owning instance sends the current block to the requesting instance via the interconnect, to guarantee recovery in the event of instance death, the owning instance writes all the redo records generated for the block to the online redolog file. It will keep a past image of the block and inform the master instance that it has sent the current block to the requesting instance
  4. The lock manager updates the resource directory (GRD) with the current holder of the block
Cache Fusion in Operation
A quick recap of GCS, a GCS resource can be local or global, if it is local it can be acted upon without consulting other instances, if it is global it cannot be acted upon without consulting or informing remote instances. GCS is used as a messaging agent to coordinate manipulation of a global resource. By default all resources are in NULL mode (remember null mode is used to convert from one type to another (share or exclusive)).
The table below denotes the different states of a resource
Mode/Role
Local
Global
Null (N)
NL
NG
Shared (S)
SL
SG
Exclusive (X)
XL
XG
States
SL
it can serve a copy of the block to other instances and it can read the block from disk, since the block is not modified there is no need to write to disk
XL
it has sole ownership and interest in that resource, it has exclusive right to modify the block, all changes to the blocks are in the local buffer cache and it can write the block to the disk. If another instance wants the block it can to come via the GCS
NL
used to protect consistent read block, if an instance wants it in X mode, the current instance will send the block to the requesting instance and downgrades its role to NL
SG
a block is present in one or more instances, an instance can read the read from disk and serve it to other instances
XG
a block can have one or more PIs, the instance with the XG role has the latest copy of the block and is the most likely candidate to write the block to the disk. GCS can ask the instance to write the block and serve it to other instances
NG
after discarding PIs when instructed to by GCS, the block is kept in the buffer cache with NG role, this serves only as the CR copy of the block.
Below are a number of common scenarios to help understand the following
  • reading from disk
  • reading from cache
  • getting the block from cache for update
  • performing an update on a block
  • performing an update on the same block
  • reading a block that was globally dirty
  • performing a rollback on a previously updated block
  • reading the block after commit
We will assume the following
  • Four RAC environment (Instances A, B, C and D)
  • Instance D is the master of the lock resource for the data block BL
  • We will only use one block and it will reside at SCN 987654
  • We will use a three-letter code for the lock states
    • first letter will indicate the lock mode - N = Null, S = Shared and X = Exclusive
    • second latter will indicate lock role - G = Global, L = Local
    • The third letter will indicate the PIs - 0 = no PIs, 1 = a PI of the bloc
for example a code of SL0 means a global shared lock with no past images (PIs)
Reading a block from disk
http://www.datadisk.co.uk/images/rac/cache_fusion3.jpg
instance C want to read the block it will request a lock in share mode from the master instance
  1. Instance C requests the block by sending a shared lock request to master D
  2. The block has never been read into the buffer cache of any instance and it is not locked. Master D grants the lock to instance C. The lock granted is SL0 (see above to work out three-letter code)
  3. Instance C reads the block from the shared disk into its buffer cache
  4. Instance C has the block in shard mode, the lock manager updates the resource directory.
Reading a block from the cache
http://www.datadisk.co.uk/images/rac/cache_fusion4.jpg
Carrying on from the above example, Instance B wants to read the same block that is cached in instance C buffer.
  1. Instance B sends a shared lock request to master instance D
  2. The lock master knows that the block may be available at instance C and sends a ping message to instance C
  3. Instance C sends the block to instance B via the interconnect, along with the block instance C indicates that instance B should take the current lock mode and role from instance C, instance C keeps a copy of the block
  4. Instance B sends a message to instance D that it has assumed the SL lock for the block. This message is not critical for the lock manager, thus the message is sent asynchronously
Getting a (Cached) clean block for update
http://www.datadisk.co.uk/images/rac/cache_fusion5.jpg
Carrying on from the above example, instance A wants to modify the same block that is already cached in instance B and C (block 987654)
  1. Instance A sends an exclusive lock request to master D
  2. The lock master knows that the block may be available at instance B in SCUR mode and at instance C in CR mode. it also sends a ping message to the shared lock holders. The most recent access was at instance B and instance D sends a BAST message to instance B
  3. Instance B sends the block to instance A via the interconnect and closes it shared lock. The block may still be in its buffer to be as CR, but all locks are released
  4. Instance A now has the exclusive lock on the block and sends an assume message to instance D, the lock is in XL0
  5. Instance A modifies the block in its buffer cache, the changes are not committed and thus the block has not been written to disk, thus the SCN remains at 987654
Getting a (Cached) modified block for update and commit
http://www.datadisk.co.uk/images/rac/cache_fusion6.jpg
Carrying on from the above example, instance C now wants to modify the block, if it tries to modify the same row it will have to wait until instance A either commits or rolls back. However in this case instance C wants to modify a different row in the same block.
  1. Instance C sends an exclusive lock request to master D
  2. The lock master knows that instance A holds an exclusive lock on the block and hence sends a ping message to instance A
  3. Instance A sends the dirty buffer to instance C via the interconnect, it downgrades the lock from XCR to NULL, it keeps a PI version of the block and disowns any lock on that buffer. Before shipping the block, Instance A has to create a PI image and flush any pending redo for the block change, the block mode on instance A is now NG1
  4. Instance C sends a message to instance D indicating it has the block in exclusive mode. The block role G indicates that the block is in global mode and if it needs to write the block to disk it must coordinate it with other instances that have past images (PIs) of that block. Instance C modifies the block and issues a commit, the SCN is now 987660.
Commit the previously modified block and select the data
http://www.datadisk.co.uk/images/rac/cache_fusion7.jpg
Carrying on from the above example, instance A now issues a commit to release the row level locks held by the transaction and flush the redo information to the redologs
  1. Instance A wants to commit the changes, commit operations do not require any synchronous modifications to the block
  2. The lock status remains the same as the previous state and change vectors for the commits are written to the redologs.
Write the dirty buffers to disk due to a checkpoint
http://www.datadisk.co.uk/images/rac/cache_fusion8.jpg
Carrying on from the above example, instance B writes the dirty blocks from the buffer cache due to a checkpoint (this is were it gets interesting and very clever)
  1. Instance B sends a write request to master D with the necessary SCN
  2. The master knows that the most recent copy of the block may be available at instance C and hence sends a message to instance C asking to write
  3. Instance C initiates a disk write and writes a BWR into the redolog file
  4. Instance C get the write notification that the write is complete
  5. Instance C notifies the master that the write is completed
  6. On receipt of the notification, instance D tells all PI holders to discard their PIs, and the lock at instance C writes the modified block to the disk
  7. All instances that have previously modified this block will also have to write a BWR. The write request by instance C has now been satisfied and instance C can now proceed with its checkpoint as usual
Master instance crashes
http://www.datadisk.co.uk/images/rac/cache_fusion9.jpg
Carrying on from the above example
  1. the master instance D crashes
  2. The Global Resource Directory is frozen momentarily and the resources held by master instance D will be equally distributed in the surviving nodes, also know as remastering (see remastering for more details).
Select the rows from Instance A
http://www.datadisk.co.uk/images/rac/cache_fusion10.jpg
Carrying on from the above example, now instance A queries the rows from that table to get the most recent data
  1. Instance A sends a shared lock to now the new master instance C
  2. Master C knows the most recent copy of the block may be in instance C and asks the holder to ship the CR block to instance A
  3. Instance C ships the CR block to instance A via the interconnect
The above sequence of events can be seen in the table below
Example
Operation on Node
Buffer Status
A
B
C
D
A
B
C
D
1
read block from disk
SCUR
2
read the block from cache
CR
SCUR
3
update the block
XCUR
CR
CR
4
update the same block
PI
CR
XCUR
5
commit the changes
PI
CR
XCUR
6
trigger checkpoint
CR
XCUR
7
instance crash
8
select the rows
CR
XCUR


RAC Troubleshooting
This is the one section what will be updated frequently as my experience with RAC grows, as RAC has been around for a while most problems can be resolve with a simple google lookup, but a basic understanding on where to look for the problem is required. In this section I will point you where to look for problems, every instance in the cluster has its own alert logs, which is where you would start to look. Alert logs contain startup and shutdown information, nodes joining and leaving the cluster, etc.
Here is my complete alert log file of my two node RAC starting up.
The cluster itself has a number of log files that can be examined to gain any insight of occurring problems, the table below describes the information that you may need of the CRS components
$ORA_CRS_HOME/crs/log
contains trace files for the CRS resources
$ORA_CRS_HOME/crs/init
contains trace files for the CRS daemon during startup, a good place to start
$ORA_CRS_HOME/css/log
contains cluster reconfigurations, missed check-ins, connects and disconnects from the client CSS listener. Look here to obtain when reboots occur
$ORA_CRS_HOME/css/init
contains core dumps from the cluster synchronization service daemon (OCSd)
$ORA_CRS_HOME/evm/log
log files for the event volume manager and eventlogger daemon
$ORA_CRS_HOME/evm/init
pid and lock files for EVM
$ORA_CRS_HOME/srvm/log
log files for Oracle Cluster Registry (OCR)
$ORA_CRS_HOME/log
log files for Oracle clusterware which contains diagnostic messages at the Oracle cluster level
As in a normal Oracle single instance environment, a RAC environment contains the standard RDBMS log files, these files are located by the parameter background_dest_dump. The most important of these are
$ORACLE_BASE/admin/udump
contains any trace file generated by a user process
$ORACLE_BASE/admin/cdump
contains core files that are generated due to a core dump in a user process
Now lets look at a two node startup and the sequence of events
First you must check that the RAC environment is using the connect interconnect, this can be done by either of the following
logfile
## The location of my alert log, yours may be different
/u01/app/oracle/admin/racdb/bdump/alert_racdb1.log
ifcfg command
oifcfg getif
table check
select inst_id, pub_ksxpia, picked_ksxpia, ip_ksxpia from x$ksxpia;
oradebug
SQL> oradebug setmypid
SQL> oradebug ipc

Note: check the trace file which can be located by the parameter user_dump_dest
system parameter
cluster_interconnects

Note: used to specify which address to use
When the instance starts up the Lock Monitor's (LMON) job is to register with the Node Monitor (NM) (see below table). Remember when a node joins or leaves the cluster the GRD undergoes a reconfiguration event, as seen in the logfile it is a seven step process (see below for more details on the seven step process).
The LMON trace file also has details about reconfigurations it also details the reason for the event
reconfiguation reason
description
1
means that the NM initiated the reconfiguration event, typical when a node joins or leaves a cluster
2
means that an instance has died
How does the RAC detect an instance death, every instance updates the control file with a heartbeat through its checkpoint (CKPT), if the heartbeat information is missing for x amount of time, the instance is considered to be dead and the Instance Membership Recovery (IMR) process initiates reconfiguration.
3
means communication failure of a node/s. Messages are sent across the interconnect if a message is not received in an amount of time then a communication failure is assumed by default UDP is used and can be unreliable so keep an eye on the logs if too many reconfigurations happen for reason 3.
Example of a reconfiguration, taken from the alert log.
Sat Mar 20 11:35:53 2010
Reconfiguration started (old inc 2, new inc 4)
List of nodes:
  0 1
  Global Resource Directory frozen
  * allocate domain 0, invalid = TRUE
  Communication channels reestablished
  Master broadcasted resource hash value bitmaps
  Non-local Process blocks cleaned out
Sat Mar 20 11:35:53 2010
  LMS 0: 0 GCS shadows cancelled, 0 closed
  Set master node info
  Submitted all remote-enqueue requests
  Dwn-cvts replayed, VALBLKs dubious
  All grantable enqueues granted
  Post SMON to start 1st pass IR
Sat Mar 20 11:35:53 2010
  LMS 0: 0 GCS shadows traversed, 3291 replayed
Sat Mar 20 11:35:53 2010
  Submitted all GCS remote-cache requests
  Post SMON to start 1st pass IR
  Fix write in gcs resources
Reconfiguration complete

Note: when a reconfiguration happens the GRD is frozen until the reconfiguration is completed
Confirm that the database has been started in cluster mode, the log file will state the following
cluster mode
Sat Mar 20 11:36:02 2010
Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE)
Completed: ALTER DATABASE MOUNT
Staring with 10g the SCN is broadcast across all nodes, the system will have to wait until all nodes have seen the commit SCN. You can change the board cast method using the system parameter_lgwr_async_broadcasts.
Lamport Algorithm
The lamport algorithm generates SCNs in parallel and they are assigned to transaction on a first come first served basis, this is different than a single instance environment, a broadcast method is used after a commit operation, this method is more CPU intensive as it has to broadcast the SCN for every commit, but he other nodes can see the committed SCN immediately.
The initialization parameter max_commit_propagation_delay limits the maximum delay allow for SCN propagation, by default it is 7 seconds. When set to less than 100 the broadcast on commit algorithm is used.
Disable/Enable Oracle RAC
There are times when you may wish to disable RAC, this feature can only be used in a Unix environment (no windows option).
Disable Oracle RAC (Unix only)
  1. Log in as Oracle in all nodes
  2. shutdown all instances using either normal or immediate option
  3. change to the working directory $ORACLE_HOME/lib
  4. run the below make command to relink the Oracle binaries without the RAC option (should take a few minutes)

        make -f ins_rdbms.mk rac_off
  5. Now relink the Oracle binaries

        make -f ins_rdbms.mk ioracle
Enable Oracle RAC (Unix only)
  1. Log in as Oracle in all nodes
  2. shutdown all instances using either normal or immediate option
  3. change to the working directory $ORACLE_HOME/lib
  4. run the below make command to relink the Oracle binaries without the RAC option (should take a few minutes)

        make -f ins_rdbms.mk rac_on
  5. Now relink the Oracle binaries

        make -f ins_rdbms.mk ioracle
Performance Issues
Oracle can suffer a number of different performance problems and can be categorized by the following
  • Hung Database
  • Hung Session(s)
  • Overall instance/database performance
  • Query Performance
A hung database is basically an internal deadlock between to processes, usually Oracle will detect the deadlock and rollback one of the processes, however if the situation occurs with the internal kernel-level resources (latches or pins), it is unable to automatically detect and resolve the deadlock, thus hanging the database. When this event occurs you must obtain dumps from each of the instances (3 dumps per instance in regular times), the trace files will be very large.
capture information
## Using alter session
SQL> alter session set max_dump_file_size = unlimited;
SQL> alter session set events 'immediate trace name systemstate level 10';
# using oradebug
SQL> select * from dual;
SQL> oradebug setmypid
SQL> unlimit
SQL> oradebug dump systemstate 10

# using oradebug from another instance
SQL> select * from dual;
SQL> oradebug setmypid
SQL> unlimit
SQL> oradebug -g all dump systemstate 10

Note: the select statement above is to avoid problems on pre 8 Oracle
SQLPlus - problems connecting
## If you get problems connecting with SQLPLUS use the command below
$ sqlplus -prelim
Enter user-name: / as sysdba
A severe performance problem can be mistaken for a hang, this usually happen because of contention problems, a systemstate dump is normally used to analyze this problem, however a systemstate dump taken a long time to complete, it also has a number of limitations
  • Reads the SGA in a dirty manner, so it may be inconsistent
  • Usually dumps a lot of information
  • does not identify interesting processes on which to perform additional dumps
  • can be a very expensive operation if you have a large SGA.
To overcome these limitations a new utility command was released with 8i called hanganalyze which provides clusterwide information in a RAC environment on a single shot.
sql method
alter session set events 'immediate trace hanganalyze level <level>';
oradebug
SQL> oradebug hanganalyze <level>

## Another way using oradebug
SQL> setmypid
SQL> setinst all
SQL> oradebug -g def hanganalyze <level>

Note: you will be told where the output will be dumped to
hanganalyze levels
1-2
only hanganalyze output, no process dump at all, click here for an example level 1 dump
3
Level 2 + Dump only processes thought to be in a hang (IN_HANG state)
4
Level 3 + Dump leaf nodes (blockers) in wait chains (LEAF, LEAF_NW, IGN_DMP state)
5
Level 4 + Dump all processes involved in wait chains (NLEAF state)
10
Dump all processes (IGN state)
The hanganalyze command uses internal kernel calls to determine whether a session is waiting for a resource and reports the relationship between blockers and waiters, systemdump is better but if you over whelmed try hanganalyze first.
Debugging Node Eviction
A node is evicted from the cluster after it kills itself because it is not able to service the application, this generally happens when you have communication problems. For eviction node problems look for ora-29740 errors in the alert log file and LMON trace files.
To understand eviction problems you need to now the basics of node membership and instance membership recovery (IMR) works. When a communication failure happens the heartbeat information in the control cannot happen, thus data corruption can happen. IMR will remove any nodes from the cluster that it deems as a problem, IMR will ensure that the larger part of the cluster will survive and kills any remaining nodes. IMR is part of the service offered by Cluster Group Services (CGS). LMON handles many of the CGS functionalities, this works at the cluster level and can work with 3rd party software (Sun Cluster, Veritas Cluster). The Node Monitor (NM) provides information about nodes and their health by registering and communicating with the Cluster Manager (CM). Node membership is represented as a bitmap in the GRD. LMON will let other nodes know of any changes in membership, for example if a node joins or leaves the cluster, the bitmap is rebuilt and communicated to all nodes.
Node registering (alert log)
lmon registered with NM - instance id 1 (internal mem no 0)
One thing to remember is that all nodes must be able to read from and write to the controlfile. CGS makes sure that members are valid, it uses a voting mechanism to check the validity of each member. I have already discussed the voting disk in my architecture section, as stated above memberships is held in a bitmap in the GRD, the CKPT process updates the controlfile every 3 seconds in an operation known as aheartbeat. It writes into a single block that is unique for each instance, thus intra-instance coordination is not required, this block is called the checkpoint progress record. You can see the controlfile records using the gv$controlfile_record_section view, all members attempt to obtain a lock on the controlfile record for updating, the instance that obtains the lock tallies the votes from all members, the group membership must conform to the decided (voted) membership before allowing the GCS/GES reconfiguration to proceed, the controlfile vote result is stored in the same block as the heartbeat in the control file checkpoint progress record.
A cluster reconfiguration is performed using 7 steps
  1. Name service is frozen, the CGS contains an internal database of all the members/instances in the cluster with all their configurations and servicing details.
  2. Lock database (IDLM) is frozen, this prevents processes from obtaining locks on resources that were mastered by the departing/dead instance
  3. Determination of membership and validation and IMR
  4. Bitmap rebuild takes place, instance name and uniqueness verification, GCS must synchronize the cluster to be sure that all members get the reconfiguration event and that they all see the same bitmap.
  5. Delete all dead instance entries and republish all names newly configured
  6. Unfreeze and release name service for use
  7. Hand over reconfiguration to GES/GCS
Debugging CRS and GSD
Oracle server management configuration tools include a diagnostic and tracing facility for verbose output for SRVCTL, GSD, GSDCTL or SRVCONFIG.
To capture diagnose following the below
  1. use vi to edit the gsd.sh/srvctl/srvconfig file in the $ORACLE_HOME/bin directory
  2. At the end of the file look for the below line

    exec $JRE -classpath $CLASSPATH oracle.ops.mgmt.daemon.OPSMDaemon $MY_OHOME 
  3. Add the following just before the -classpath in the exec $JRE line

    -DTRACING.ENABLED=true -DTRACING.LEVEL=2
  4. the string should look like this

    exec $JRE -DTRACING.ENABLED=true -DTRACING.LEVEL=2 -classpath...........
In Oracle database 10g setting the below variable accomplishes the same thing, set it to blank to remove the debugging
Enable tracing
 $ export SRVM_TRACE=true
Disable tracing
 $ export SRVM_TRACE=""

Adding or Deleting a Node
One of the jobs of a DBA is adding and removing nodes from a RAC environment when capacity demands, although you should add a node of a similar spec it is possible to add a node of a higher or lower spec.
The first stage is to configure the operating system and make sure any necessary drivers are installed, also make sure that the node can see the shared disks available to the existing RAC.
I am going to presume we have a two RAC environment already setup, and we are going to add a third node.
Pre-Install Checking
You used the Cluster Verification utility when installing the RAC environment, the tools check that the node has been properly prepared for a RAC deployment. You can run the command either from the new node or from any of the existing nodes in the cluster
pre-install check run from new node
runcluvfy.sh stage -pre crsinst -n rac1,rac2,rac3 -r 10gr2
pre-install check run from existing node
cluvfy stage -pre crsinst -n rac1,rac2,rac3 -r 10g2
Make sure that you fix any highlighted problems before continuing.
Install CRS
Cluster Ready Services (CRS) should be installed first, this allows the node to become part of the cluster. Adding the new node can be started from any of the existing nodes
  1. Log into any of the existing nodes as user oracle then run the below command, the script below starts the OUI GUI tool, hopefully the tool will already see the existing cluster and will fill in the details for you

      $ORA_RS_HOME/oui/bin/addnode.sh
  2. In the specify cluster nodes to add to installation screen, enter the new names for the public, private and virtual hosts
  3. Click next to see a summary page
  4. Click install, the installer will copy the files from the existing node to the new node. Once copied you will be asked to run orainstRoot.sh and root.sh as user root
  5. Run orainstRoot.sh and root.sh in the new and rootaddnode.sh in the node that you are running the installation from.

      
orainstRoot.sh
sets the Oracle inventory in the new node and set ownerships and permissions to the inventory
root.sh
checks whether the Oracle CRS stack is already configured in the new node, creates /etc/oracle directory and adds the relevant OCR keys to the cluster registry and it adds the daemon to CRS and starts CRS in the new node.
 rootaddnode.sh
configures the OCR registry to include the new nodes as part of the cluster
  1.  
  2. Click next to complete the installation. Now you need to configure Oracle Notification Services (ONS). The port can be identified by the below command

      cat $ORA_CRS_HOME/opmn/conf/ons.config
  3. Now run the ONS utility by supplying the <remote_port> number obtained above

      racgons add_config rac3:<remote_port>
Installing Oracle DB Software
Once the CRS has been installed and the new node is in the cluster, it is time to install the Oracle DB software. Again you can use any of the existing nodes to install the software.
  1. Log into any of the existing nodes as user oracle then run the below command, the script below starts the OUI GUI tool, hopefully the tool will already see the existing cluster and fill in the details for you

      $ORA_RS_HOME/oui/bin/addnode.sh
  2. Click next on the welcome screen to open the specify cluster nodes to add to installation screen, you should have a list of all the existing nodes in the cluster, select the new node and click next
  3. Check the summary page then click install to start the installation
  4. The files will be copied to the new node, the script will ask you to run run.sh on the new node, then click OK to finish off the installation
Configuring the Listener 

Now its time to configure the listener in the new node
  1. Login as user oracle, and set your DISPLAY environment variable, then start the Network Configuration Assistant

    $ORACLE_HOME/bin/netca
  2. Choose cluster management
  3. Choose listener
  4. Choose add
  5. Choose the the name as LISTENER
These steps will add a listener on rac3 as LISTENER_rac3
Create the Database Instance
Run the below to create the database instance on the new node
  1. Login as oracle on the new node, set the environment to database home and then run the database creation assistant (DBCA)

    $ORACLE_HOME/bin/dbca
  2. In the welcome screen choose oracle real application clusters database to create the instance and click next
  3. Choose instance management and click next
  4. Choose add instance and click next
  5. Select RACDB (or whatever name you gave you RAC environment) as the database and enter the SYSDBA and password, click next
  6. You should see a list of existing instances, click next and on the following screen enter ORARAC3 as the instance and choose RAC3 as the node name (substitute any of the above names for your environment naming convention)
  7. The database instance will now created, click next in the database storage screen., choose yes when asked to extend ASM
Removing a Node
Removing a node is similar to above but in reverse order
  1. Delete the instance on the node to be removed
  2. Clean up ASM
  3. Remove the listener from the node to be removed
  4. Remove the node from the database
  5. Remove the node from the clusterware
You can delete the instance by using the database creation assistant (DBCA), invoke the program choose the RAC database, choose instance management and then choose delete instance, enter the sysdba user and password then choose the instance to delete.
To clean up ASM follow the below steps
  1. From node 1 run the below command to stop ASM on the node to be removed

      srvctl stop asm -n rac3
      srvctl remove asm -n rac3
  2. Now run the following on the node to be removed

      cd $ORACLE_HOME/admin
      rm -rf +ASM

      cd $ORACLE_HOME/dbs
      rm -f *ASM*
  3. Check that /etc/oratab file has no ASM entries, if so remove them
Now remove the listener for the node to be removed
  1. Login as user oracle, and set your DISPLAY environment variable, then start the Network Configuration Assistant

    $ORACLE_HOME/bin/netca
  2. Choose cluster management
  3. Choose listener
  4. Choose Remove
  5. Choose the the name as LISTENER
Next we remove the node from the database
  1. Run the below script from the node to be removed

    cd $ORACLE_HOME/bin
    ./runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME "CLUSTER_NODES={rac3}" -local
    ./runInstaller
  2. Choose to deinstall products and select the dbhome
  3. Run the following from node 1

      cd $ORACLE_HOME/oui/bin
      ./runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME "CLUSTER_NODES={rac1,rac2,rac3}"
Lastly we remove the clusterware software
  1. Run the following from node 1, you obtain the port number from remoteport section in the ons.config file in $ORA_CRS_HOME/opmn/conf

      $CRS_HOME/bin/racgons remove_config rac3:6200
  2. Run the following from the node to be removed as user root

      cd $CRS_HOME/install
      ./rootdelete.sh
  3. Now run the following from node 1 as user root, obtain the node number first

      $CRS_HOME/bin/olsnodes -n
      cd $CRS_HOME/install
      ./rootdeletenode.sh rac3,3
  4. Now run the below from the node to be removed as user oracle

      cd $CRS_HOME/oui/bin
      ./runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME "CLUSTER_NODES={rac3}" CRS=TRUE -local
      ./runInstaller
  5. Choose to deinstall software and remove the CRS_HOME
  6. Run the following from node as user oracle

      cd $CRS_HOME/oui/bin
      ./runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME "CLUSTER_NODES={rac1,rac2,rac3}" CRS=TRUE
  7. Check that the node has been removed, the first should report "invalid node", the second you should not see any output and the last command you should only see nodes rac1 and rac2

      srvctl status nodeapps -n rac3
      crs_stat |grep -i rac3
      olsnodes -n

RAC Cheatsheet
This is a quick and dirty cheatsheet on Oracle RAC 10g, as my experience with RAC grows I will update this section, below is a beginners guide on the commands and information that you will require to administer Oracle RAC.
Acronyms
Acronyms
GCS
Global Cache Services
in memory database containing current locks and awaiting locks, also known as PCM
GES
Global Enqueue Services
coordinates the requests of all global enqueues uses the GCS, also known as non-PCM
GRD
Global Resource Directory
all resources available to the cluster (formed and managed by GCS and GES), see GRDfor more details
GRM
Global Resource Manager
helps to coordinate and communicate the locks requests between Oracle processes
GSD
Global Services Daemon
runs on each node with one GSD process per node. The GSD coordinates with the cluster manager to receive requests from clients such as the DBCA, EM, and the SRVCTL utility to execute administrative job tasks such as instance startup or shutdown. The GSD is not an Oracle instance background process and is therefore not started with the Oracle instance
PCM (IDLM)
Parallel Cache Management
formly know as (integrated) Distributed Lock Manager, its another name for GCS
Resource
n/a
it is a identifiable entity it basically has a name or a reference, it can be a area in memory, a disk file or an abstract entity
Resource (Global)
n/a
a resource that can be accessed by all the nodes within the cluster examples would be the following
  • Data Buffer Cache Block
  • Transaction Enqueue
  • Database Data Structures
LVB
Lock Value Block
contains a small amount of data regarding the lock
TRFC
Traffic Controller
controls the DLM traffic between instances (messaging tickets)
Files and Directories
Files and Directories
$ORA_CRS_HOME/cdata/<cluster_name>
OCR backups (default location)
$ORA_HOME/log/<hostname>/client/ocrconfig_<pid>.log
OCR command log file
$ORA_CRS_HOME/crs/log
contains trace files for the CRS resources
$ORA_CRS_HOME/crs/init
contains trace files for the CRS daemon during startup, a good place to start
$ORA_CRS_HOME/css/log
contains cluster reconfigurations, missed check-ins, connects and disconnects from the client CSS listener. Look here to obtain when reboots occur
$ORA_CRS_HOME/css/init
contains core dumps from the cluster synchronization service daemon (OCSd)
$ORA_CRS_HOME/evm/log
logfiles for the event volume manager and eventlogger daemon
$ORA_CRS_HOME/evm/init
pid and lock files for EVM
$ORA_CRS_HOME/srvm/log
logfiles for Oracle Cluster Registry (OCR)
$ORA_CRS_HOME/log
log fles for Oracle clusterware which contains diagnostic messages at the Oracle cluster level
Useful Views/Tables
GCS and Cache Fusion Diagnostics
v$cache
contains information about every cached block in the buffer cache
v$cache_transfer
contains information from the block headers in SGA that have been pinged at least once
v$instance_cache_transfer
contains information about the transfer of cache blocks through the interconnect
v$cr_block_server
contains statistics about CR block transfer across the instances
v$current_block_server
contains statistics about current block transfer across the instances
v$gc_element
contains one-to-one information for each global cache resource used by the buffer cache
GES diagnostics
v$lock
contains information about locks held within a database and outstanding requests for locks and latches
v$ges_blocking_enqueue
contains information about locks that are being blocked or blocking others and locks that are known to the lock manager
v$enqueue_statistics
contains details about enqueue statistics in the instance
v$resource_limits
display enqueue statistics
v$locked_object
contains information about DML locks acquired by different transactions in databases with their mode held
v$ges_statistics
contains miscellaneous statistics for GES
v$ges_enqueue
contains information about all locks known to the lock manager
v$ges_convert_local
contains information about all local GES operations
v$ges_convert_remote
contains information about all remote GES operations
v$ges_resource
contains information about all resources known to the lock manager
v$ges_misc
contains information about messaging traffic information
v$ges_traffic_controller
contains information about the message ticket usage
Dynamic Resource Remastering
v$hvmaster_info
contains information about current and previous master instances of GES resources in relation to hash value ID of resource
v$gcshvmaster_info
the same as above but globally
v$gcspfmaster_info
conatins information about current and previous masters about GCS resources belonging to files mapped to a particular master, including the number of times the resource has remastered
Cluster Interconnect
v$cluster_interconnects
contains information about interconnects that are being used for cluster communication
v$configured_interconnects
same as above but also contains interconnects that AC is aware off that are not being used
Miscellanous
v$service
services running on an instance
x$kjmsdp
display LMS daemon statistics
x$kjmddp
display LMD daemon statistics
Useful Parameters
Parameters
cluster_interconnects
specify a specific IP address to use for the inetrconnect
_gcs_fast_config
enables fast reconfiguration for gcs locks (true|false)
_lm_master_weight
controls which instance will hold or (re)master more resources than others
_gcs_resources
controls the number of resources an instance will master at a time
_lm_tickets
controls the number of message tickets
_lm_ticket_active_sendback
controls the number of message tickets (aggressive messaging)
_db_block_max_cr_dba
limits the number of CR copies per DBA on the buffer cache (see grd)
_fairness_threshold
used when too many CR requested arrive for a particular buffer and the block becomes disowned (see grd)
_gc_affinity_time
specifies interval minutes for reamstering
_gc_affinity_limit
defines the number of times a instance access the resource before remastering
_gc_affinity_minimum
defines the minimum number of times a instance access the resource before remastering
_lm_file_affinity
disables dynamic remastering for the objects belonging to those files
_lm_dynamic_remastering
enable or disable remastering
_gc_defer_time
define the time by which an instance deferred downgrading a lock (see Cache Fusion)
_lgwr_async_broadcast
change the SCN boardcast method (see troubleshooting)
Processes
http://www.datadisk.co.uk/images/rac/admin1.jpg
Oracle RAC Daemons and Processes
OPROCd
Process Monitor
provides basic cluster integrity services
EVMd
Event Management
spawns a child process event logger and generates callouts
OCSSd
Cluster Synchronization Services
basic node membership, group services, basic locking
CRSd
Cluster Ready Services
resource monitoring, failover and node recovery
LMSn
Lock Manager Server process - GCS
this is the cache fusion part, it handles the consistent copies of blocks that are tranferred between instances. It receives requests from LMD to perform lock requests. I rools back any uncommitted transactions. There can be upto ten LMS processes running and can be started dynamically if demand requires it.
they manage lock manager service requests for GCS resources and send them to a service queue to be handled by the LMSn process. It also handles global deadlock detection and monitors for lock conversion timeouts.
LMON
Lock Monitor Process - GES
this process manages the GES, it maintains consistency of GCS memory in case of process death. It is also responsible for cluster reconfiguration and locks reconfiguration (node joining or leaving), it checks for instance deaths and listens for local messaging.
A detailed log file is created that tracks any reconfigurations that have happened.
LMD
Lock Manager Daemon - GES
this manages the enqueue manager service requests for the GCS. It also handles deadlock detention and remote resource requests from other instances.
LCK0
Lock Process - GES
manages instance resource requests and cross-instance call operations for shared resources. It builds a list of invalid lock elements and validates lock elements during recovery.
DIAG
Diagnostic Daemon
This is a lightweight process, it uses the DIAG framework to monitor the healt of the cluster. It captures information for later diagnosis in the event of failures. It will perform any neccessary recovery if an operational hang is detected.
General Administration
Managing the Cluster
starting
/etc/init.d/init.crs start

crsctl start crs
stopping
/etc/init.d/init.crs stop

crsctl stop crs
enable/disable at boot time
/etc/init.d/init.crs enable
/etc/init.d/init.crs disable

crsctl enable crs
crsctl disable crs
Managing the database configuration with SRVCTL
start all instances
srvctl start database -d <database> -o <option>

Note: starts listeners if not already running, you can use the -o option to specify startup/shutdown options

force
open
mount
nomount
stop all instances
srvctl stop database -d <database> -o <option>

Note: the listeners are not stopped, you can use the -o option to specify startup/shutdown options

immediate
abort
normal
transactional
start/stop particular instance
srvctl [start|stop] database -d <database> -i <instance>,<instance>
display the registered databases
srvctl config database
status
srvctl status database -d <database>
srvctl status instance -d <database> -i <instance>,<instance>
srvctl status service -d <database>
srvctl status nodeapps -n <node>
srvctl status asm -n <node>
stopping/starting
srvctl stop database -d <database>
srvctl stop instance -d <database> -i <instance>,<instance>
srvctl stop service -d <database> -s <service>,<service> -i <instance>,<instance>
srvctl stop nodeapps -n <node>
srvctl stop asm -n <node>

srvctl start database -d <database>
srvctl start instance -d <database> -i <instance>,<instance>
srvctl start service -d <database> -s <service>,<service> -i <instance>,<instance>
srvctl start nodeapps -n <node>
srvctl start asm -n <node>
adding/removing
srvctl add database -d <database> -o <oracle_home>
srvctl add instance -d <database> -i <instance> -n <node>
srvctl add service -d <database> -s <service> -r <preferred_list>
srvctl add nodeapps -n <node> -o <oracle_home> -A <name|ip>/network
srvctl add asm -n <node> -i <asm_instance> -o <oracle_home>

srvctl remove database -d <database> -o <oracle_home>
srvctl remove instance -d <database> -i <instance> -n <node>
srvctl remove service -d <database> -s <service> -r <preferred_list>
srvctl remove nodeapps -n <node> -o <oracle_home> -A <name|ip>/network
srvctl asm remove -n <node>
OCR utilities
log file
$ORA_HOME/log/<hostname>/client/ocrconfig_<pid>.log
checking
ocrcheck

Note: will return the OCR version, total space allocated, space used, free space, location of each device and the result of the integrity check
dump contents
ocrdump -backupfile <file>

Note: by default it dumps the contents into a file named OCRDUMP in the current directory
export/import
ocrconfig -export <file>

ocrconfig -restore <file>
backup/restore
# show backups
ocrconfig -showbackup

# to change the location of the backup, you can even specify a ASM disk
ocrconfig -backuploc <path|+asm>

# perform a backup, will use the location specified by the -backuploc location
ocrconfig -manualbackup

# perform a restore
ocrconfig -restore <file>

# delete a backup
orcconfig -delete <file>

Note: there are many more option so see the ocrconfig man page
add/remove/replace
## add/relocate the ocrmirror file to the specified location
ocrconfig -replace ocrmirror '/ocfs2/ocr2.dbf'

## relocate an existing OCR file
ocrconfig -replace ocr '/ocfs1/ocr_new.dbf'

## remove the OCR or OCRMirror file
ocrconfig -replace ocr
ocrconfig -replace ocrmirror
CRS Administration
CRS Administration
starting
## Starting CRS using Oracle 10g R1
not possible
## Starting CRS using Oracle 10g R2
$ORA_CRS_HOME/bin/crsctl start crs
stopping
## Stopping CRS using Oracle 10g R1
srvctl stop database -d <database>
srvctl stop asm -n <node>
srvctl stop nodeapps -n <node>
/etc/init.d/init.crs stop

## Stopping CRS using Oracle 10g R2
$ORA_CRS_HOME/bin/crsctl stop crs
disabling/enabling
## use to stop CRS restarting after a reboot

## Oracle 10g R1
/etc/init.d/init.crs [disable|enable]

## Oracle 10g R2
$ORA_CRS_HOME/bin/crsctl [disable|enable] crs
checking
$ORA_CRS_HOME/bin/crsctl check crs
$ORA_CRS_HOME/bin/crsctl check evmd
$ORA_CRS_HOME/bin/crsctl check cssd
$ORA_CRS_HOME/bin/crsctl check crsd
$ORA_CRS_HOME/bin/crsctl check install -wait 600
Resource Applications (CRS Utilities)
status
$ORA_CRS_HOME/bin/crs_stat
create profile
$ORA_CRS_HOME/bin/crs_profile
register/unregister application
$ORA_CRS_HOME/bin/crs_register
$ORA_CRS_HOME/bin/crs_unregister
Start/Stop an application
$ORA_CRS_HOME/bin/crs_start
$ORA_CRS_HOME/bin/crs_stop
Resource permissions
$ORA_CRS_HOME/bin/crs_getparam
$ORA_CRS_HOME/bin/crs_setparam
Relocate a resource
$ORA_CRS_HOME/bin/crs_relocate
Nodes
member number/name
olsnodes -n
local node name
olsnodes -l
activates logging
olsnodes -g
Oracle Interfaces
display
oifcfg getif
delete
oicfg delig -global
set
oicfg setif -global <interface name>/<subnet>:public
oicfg setif -global <interface name>/<subnet>:cluster_interconnect
Global Services Daemon Control
starting
gsdctl start
stopping
gsdctl stop
status
gsdctl status
Cluster Configuration (clscfg is used during installation)
create a new configuration
clscfg -install
upgrade or downgrade and existing configuration
clscfg -upgrade
clscfg -downgrade
add or delete a node from the configuration
clscfg -add
clscfg -delete
create a special single-node configuration for ASM
clscfg -local
brief listing of terminology used in the other nodes
clscfg -concepts
used for tracing
clscfg -trace
help
clscfg -h
Cluster Name Check
print cluster name
cemulto -n

Note: in Oracle 9i the ulity was called "cemutls"
print the clusterware version
cemulto -w

Note: in Oracle 9i the ulity was called "cemutls"
Node Scripts
Add Node
addnode.sh

Note: see 
adding and deleting nodes
Delete Node
deletenode.sh

Note: 
see adding and deleting nodes
Enqueues
displaying statistics
SQL> column current_utilization heading current
SQL> column max_utilization heading max_usage
SQL> column initial_allocation heading initial
SQL> column resource_limit format a23;

SQL> select * from v$resource_limit;
Messaging (tickets)
ticket usage
select local_nid local, remote_nid remote, tckt_avail avail, tckt_limit limit, snd_q_len send_queue, tckt_wait waiting from v$ges_traffic_controller;
dump ticket information
SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug lkdebug -t
Lighwork Rule and Fairness Threshold
downconvert
select cr_requests, light_works, data_requests, fairness_down_converts from v$cr_block_server;

Note: lower the _fairness_threshold if the ratio goes above 40%, set to 0 if the instance is a query only instance.
Remastering
force dynamic remastering (DRM)
## Obtain the OBJECT_ID form the below table
SQL> select * from v$gcspfmaster_info;
## Determine who masters it
SQL> oradebug setmypid
SQL> oradebug lkdebug -a <OBJECT_ID>

## Now remaster the resource
SQL> oradebug setmypid
SQL> oradebug lkdebug -m pkey <OBJECT_ID>
GRD, SRVCTL, GSD and SRVCONFIG Tracing
Enable tracing
 $ export SRVM_TRACE=true
Disable tracing
 $ export SRVM_TRACE=""
Voting Disk
adding
crsctl add css votedisk <file>
deleting
crsctl delete css votedisk <file>
querying
crsctl query css votedisk





What is RAC?
RAC stands for Real Application cluster. It is a clustering solution from Oracle Corporation that ensures high availability of databases by providing instance failover, media failover features.

Mention the Oracle RAC software components:-
Oracle RAC is composed of two or more database instances. They are composed of Memory structures and background processes same as the single instance database. Oracle RAC instances use two processes
GES(Global Enqueue Service), GCS(Global Cache Service) that enable cache fusion.

Oracle RAC instances are composed of following background processes:
ACMS—Atomic Controlfile to Memory Service (ACMS)
GTX0-j—Global Transaction Process
LMON—Global Enqueue Service Monitor
LMD—Global Enqueue Service Daemon
LMS—Global Cache Service Process
LCK0—Instance Enqueue Process
RMSn—Oracle RAC Management Processes (RMSn)
RSMN—Remote Slave Monitor

What is GRD?
GRD stands for Global Resource Directory. The GES and GCS maintains records of the statuses of each datafile and each cached block using global resource directory. This process is referred to as cache fusion and helps in data integrity.

Give Details on Cache Fusion:-
Oracle RAC is composed of two or more instances. When a block of data is read from datafile by an instance within the cluster and another instance is in need of the same block,it is easy to get the block image from the insatnce which has the block in its SGA rather than reading from the disk. To enable inter instance communication Oracle RAC makes use of interconnects. The Global Enqueue Service(GES) monitors and Instance enqueue process manages the cache fusion.

Give Details on ACMS:-
ACMS stands for Atomic Controlfile Memory Service. In an Oracle RAC environment ACMS is an agent that ensures a distributed SGA memory update (ie) SGA updates are globally committed on success or globally aborted in event of a failure.

Give details on GTX0-j :-
The process provides transparent support for XA global transactions in a RAC environment. The database autotunes the number of these processes based on the workload of XA global transactions.

Give details on LMON:-
This process monitors global enques and resources across the cluster and performs global enqueue recovery operations. This is called as Global Enqueue Service Monitor.

Give details on LMD:-
This process is called as global enqueue service daemon. This process manages incoming remote resource requests within each instance.

Give details on LMS:-
This process is called as Global Cache service process. This process maintains statuses of datafiles and each cached block by recording information in a Global Resource Directory (GRD). This process also controls the flow of messages to remote instances and manages global data block access and transmits block images between the buffer caches of different instances.This processing is a part of cache fusion feature.

Give details on LCK0:-
This process is called as Instance enqueue process. This process manages non-cache fusion resource requests such as library and row cache requests.

Give details on RMSn:-
This process is called as Oracle RAC management process. These pocesses perform managability tasks for Oracle RAC. Tasks include creation of resources related Oracle RAC when new instances are added to the cluster.

Give details on RSMN:-
This process is called as Remote Slave Monitor. This process manages background slave process creation and communication on remote instances. This is a background slave process.This process performs tasks on behalf of a co-ordinating process running in another instance.

What components in RAC must reside in shared storage?
All datafiles, controlfiles, SPFIles, redo log files must reside on cluster-aware shred storage.

What is the significance of using cluster-aware shared storage in an Oracle RAC environment?
All instances of an Oracle RAC can access all the datafiles, control files, SPFILE's, redolog files when these files are hosted out of cluster-aware shared storage which are group of shared disks.

Give few examples for solutions that support cluster storage:-
ASM(automatic storage management), raw disk devices, network file system(NFS), OCFS2 and OCFS(Oracle Cluster Fie systems).

What is an interconnect network?
An interconnect network is a private network that connects all of the servers in a cluster.
The interconnect network uses a switch/multiple switches that only the nodes in the cluster can access.

How can we configure the cluster interconnect?
Configure User Datagram Protocol(UDP) on Gigabit ethernet for cluster interconnect.
On unix and linux systems we use UDP and RDS (Reliable data socket) protocols to be used by Oracle Clusterware. Windows clusters use the TCP protocol.

Can we use crossover cables with Oracle Clusterware interconnects?
No, crossover cables are not supported with Oracle Clusterware interconnects.

What is the use of cluster interconnect?
Cluster interconnect is used by the Cache fusion for inter instance communication.

How do users connect to database in an Oracle RAC environment?
Users can access a RAC database using a client/server configuration or through one or more middle tiers, with or without connection pooling. Users can use oracle services feature to connect to database.

What is the use of a service in Oracle RAC environment?
Applications should use the services feature to connect to the Oracle database.Services enable us to define rules and characteristics to control how users and applications connect to database instances.

What are the characteristics controlled by Oracle services feature?
The characteristics include a unique name, workload balancing and failover options, and high availability characteristics.

Which enable the load balancing of applications in RAC?
Oracle Net Services enable the load balancing of application connections across all of the instances in an Oracle RAC database.

What is a virtual IP address or VIP?
A virtual IP address or VIP is an alternate IP address that the client connections use instead of the standard public IP address. To configureVIP address, we need to reserve a spare IP address for each node, and the IP addresses must use the same subnet as the public network.

What is the use of VIP?
If a node fails, then the node's VIP address fails over to another node on which the VIP address can accept TCP connections but it cannot accept Oracle connections.

Give situations under which VIP address failover happens:-
VIP addresses failover happens when the node on which the VIP address runs fails, all interfaces for the VIP address fails, all interfaces for the VIP address are disconnected from the network.

What is the significance of VIP address failover?
When a VIP address failover happens, Clients that attempt to connect to the VIP address receive a rapid connection refused error .They don't have to wait for TCP connection timeout messages.

What are the administrative tools used for Oracle RAC environments?
Oracle RAC cluster can be administered as a single image using OEM (Enterprise Manager), SQL*PLUS, Servercontrol(SRVCTL), clusterverificationutility(cvu), DBCA, NETCA.

How do we verify that RAC instances are running?
Issue the following query from any one node connecting through SQL*PLUS.
$connect sys/sys as sysdba
SQL>select * from V$ACTIVE_INSTANCES;
The query gives the instance number under INST_NUMBER column, host_: instancename under  INST_NAME column.

What is FAN?
Fast application Notification as it abbreviates to FAN relates to the events related to instances, services and nodes. This is a notification mechanism that Oracle RAC uses to notify other processes about the configuration and service level information that includes service status changes such as, UP or DOWN events. Applications can respond to FAN events and take immediate action.

Where can we apply FAN UP and DOWN events?
FAN UP and FAN DOWN events can be applied to instances, services and nodes.

State the use of FAN events in case of a cluster configuration change?
During times of cluster configuration changes,Oracle RAC high availability framework publishes a FAN event immediately when a state change occurs in the cluster.So applications can receive FAN events and react immediately.This prevents applications from polling database and detecting a problem after such a state change.

Why should we have seperate homes for ASm instance?
It is a good practice to have ASM home seperate from the database home(ORACLE_HOME).This helps in upgrading and patching ASM and the Oracle database software independent of each other.Also,we can deinstall the Oracle database software independent of the ASM instance.

What is the advantage of using ASM?
Having ASM is the Oracle recommended storage option for RAC databases as the ASM maximizes performance by managing the storage configuration across the disks.ASM does this by distributing the database file across all of the available storage within our cluster database environment.

What is rolling upgrade?
It is a new ASM feature from Database 11g.ASM instances in Oracle database 11g release(from 11.1) can be upgraded or patched using rolling upgrade feature. This enables us to patch or upgrade ASM nodes in a clustered environment without affecting database availability. During a rolling upgrade we can maintain a functional cluster while one or more of the nodes in the cluster are running in different software versions.

Can rolling upgrade be used to upgrade from 10g to 11g database?
No, it can be used only for Oracle database 11g releases(from 11.1).

State the initialization parameters that must have same value for every instance in an Oracle RAC database:-
Some initialization parameters are critical at the database creation time and must have same values. Their value must be specified in SPFILE or PFILE for every instance. The list of parameters that must be identical on every instance are given below:
ACTIVE_INSTANCE_COUNT
ARCHIVE_LAG_TARGET
COMPATIBLE
CLUSTER_DATABASE
CLUSTER_DATABASE_INSTANCE
CONTROL_FILES
DB_BLOCK_SIZE
DB_DOMAIN
DB_FILES
DB_NAME
DB_RECOVERY_FILE_DEST
DB_RECOVERY_FILE_DEST_SIZE
DB_UNIQUE_NAME
INSTANCE_TYPE (RDBMS or ASM)
PARALLEL_MAX_SERVERS
REMOTE_LOGIN_PASSWORD_FILE
UNDO_MANAGEMENT

Can the DML_LOCKS and RESULT_CACHE_MAX_SIZE be identical on all instances?
These parameters can be identical on all instances only if these parameter values are set to zero.

What two parameters must be set at the time of starting up an ASM instance in a RAC  environment?
The parameters CLUSTER_DATABASE and INSTANCE_TYPE must be set.

Mention the components of Oracle clusterware:-
Oracle clusterware is made up of components like voting disk and Oracle Cluster Registry(OCR).

What is a CRS resource?
Oracle clusterware is used to manage high-availability operations in a cluster. Anything that Oracle Clusterware manages is known as a CRS resource.Some examples of CRS resources are database,an instance,a service,a listener,a VIP address,an application process etc.

What is the use of OCR?
Oracle clusterware manages CRS resources based on the configuration information of CRS resources stored in OCR(Oracle Cluster Registry).

How does a Oracle Clusterware manage CRS resources?
Oracle clusterware manages CRS resources based on the configuration information of CRS resources stored in OCR(Oracle Cluster Registry).

Name some Oracle clusterware tools and their uses?
OIFCFG - allocating and deallocating network interfaces
OCRCONFIG - Command-line tool for managing Oracle Cluster Registry
OCRDUMP - Identify the interconnect being used
CVU - Cluster verification utility to get status of CRS resources

What are the modes of deleting instances from ORacle Real Application cluster Databases?
We can delete instances using silent mode or interactive mode using DBCA(Database Configuration Assistant).

How do we remove ASM from a Oracle RAC environment?
We need to stop and delete the instance in the node first in interactive or silent mode.After that asm can be removed using srvctl tool as follows:
srvctl stop asm -n node_name
srvctl remove asm -n node_name
We can verify if ASM has been removed by issuing the following command:
srvctl config asm -n node_name

How do we verify that an instance has been removed from OCR after deleting an instance?
Issue the following srvctl command:
srvctl config database -d database_name
cd CRS_HOME/bin
./crs_stat

How do we verify an existing current backup of OCR?
We can verify the current backup of OCR using the following command :
Ocrconfig –showbackup

What are the performance views in an Oracle RAC environment?
We have v$ views that are instance specific. In addition we have GV$ views called as global views that has an INST_ID column of numeric data type.GV$ views obtain information from individual V$ views.


What are the types of connection load-balancing?
There are two types of connection load-balancing:server-side load balancing and clientside load balancing.

What is the differnece between server-side and client-side connection load balancing?
Client-side balancing happens at client side where load balancing is done using listener.In case of server-side load balancing listener uses a load-balancing advisory to redirect connections to the instance providing best service.
Give the usage of srvctl:-
srvctl start instance -d db_name -i "inst_name_list" [-o start_options]srvctl stop instance -d name -i "inst_name_list" [-o stop_options]srvctl stop instance -d orcl -i "orcl3,orcl4" -o immediatesrvctl start database -d name [-o start_options]srvctl stop database -d name
[-o stop_options]srvctl start database -d orcl -o mount
Top 4 Reasons for Node Reboot or Node Eviction in Real Application Cluster (RAC) Environment

Remote DBA usually face Node Reboots or Node Evictions in Real Application Cluster Environment. Node Reboot is performed by CRS to maintain consistency in Cluster environment by removing node which is facing some critical issue.

A critical problem could be a node not responding via a network heartbeat, a node not responding via a disk heartbeat, a hung , or a hung ocssd.binprocess etc. There could be many more reasons for node Eviction but Some of them are common and repetitive.Here, I am listing:-

Top 4 Reasons for Node Reboot or Node Eviction in Real Application Cluster (RAC) Environment:

Whenever, Database Administrator face Node Reboot issue, First thing to look at should be/var/log/message and 
OS Watcher logs of the Database Node which was rebooted.

var/log/messages will give you an actual picture of reboot:- Exact time of restart, status of resource like swap and RAM etc. 
1. High Load on Database Server: Out of 100 Issues, I have seen 70 to 80 time High load on the system was reason for Node Evictions, One common scenario is due to high load RAM and SWAP space of DB node got exhaust and system stops working and finally reboot.

So, Every time you see a node eviction start investigation with /var/log/messages and 
Analyze OS Watcher logs. Below is a situation when a Database Node was reboot due to high load.

/var/log/messages output from a Database Node just before Node eviction:

Apr 23 08:15:04 remotedb06 kernel: Node 0 DMA: 2*4kB 1*8kB 0*16kB 1*32kB 2*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15792kB
Apr 23 08:15:04 remotedb06 kernel: Node 0 DMA32: 150*4kB 277*8kB 229*16kB 186*32kB 51*64kB 65*128kB 82*256kB 13*512kB 3*1024kB 3*2048kB 78*4096kB = 380368kB
Apr 23 08:15:04 remotedb06 kernel: Node 0 Normal: 12362*4kB 58*8kB 0*16kB 0*32kB 0*64kB 0*128kB 2*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 52984kB
Apr 23 08:15:09 remotedb06 kernel: 83907 total pagecache pages
Apr 23 08:15:11 remotedb06 kernel: 39826 pages in swap cache
Apr 23 08:15:33 remotedb06 kernel: osysmond.bin: page allocation failure. order:4, mode:0xd0
From above message, we can see that this system has only 4kB free swap out of 24G swap space. This means system neither has RAM not SWAP for processing, which case a reboot. This picture is also clear from OS Watcher of system.
08:15:29     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s

08:15:29     all    0.67    0.00   82.60   16.60    0.00    0.06    0.00    0.07  42521.98
08:17:08     all    1.69    0.00   67.02   30.93    0.00    0.06    0.00    0.29  37316.55

How to avoid Node Reboot due to High Load ? 
The simple and best way to avoid this is use Oracle Database Resource Manager (DBRM). DBRM help to resolve this by allowing the database to have more control over how hardware resources and their allocation. DBA should setup Resource consumer group and Resource plan and should use them as per requirements. In Exadata system Exadata DBA can use IORM to setup resource allocation among multiple database instances.

2. Voting Disk not Reachable: One of the another reason for Node Reboot is clusterware is not able to access a minimum number of the voting files.When the node aborts for this reason, the node alert log will show CRS-1606 error.

Here is a scenario for voting disk not reachable:

2013-01-26 10:15:47.177
[cssd(3743)]CRS-1606:The number of voting files available, 1, is less than the minimum number of voting files required, 2, resulting in CSSD
termination to ensure data integrity; details at (:CSSNM00018:) in /u01/app/11.2.0/grid/log/apdbc76n1/cssd/ocssd.log
There could be two reasons for this issue:

    A. Connection to the voting disk is interrupted.
    B. if only one voting disk is in use and version is less than 11.2.0.3.4, hitting known bug  13869978.

How to Solve Voting Disk Outage ?

There could be many reasons for voting disk is not reachable, Here are few general approach for DBA to follow.
1. Use command "crsctl query css votedisk" on a node where clusterware is up to get a list of all the voting files.
2. Check that each node can access the devices underlying each voting file.
3. Check for permissions to each voting file/disk have not been changed.
4. Check OS, SAN, and storage logs for any errors from the time of the incident.
5. Apply fix for 13869978 if only one voting disk is in use. This is fixed in 11.2.0.3.4 patch set and above, and 11.2.0.4 and above

If any voting files or underlying devices are not currently accessible from any node, work with storage administrator and/or system administrator to resolve it at storage and/or OS level.

3. Missed Network Connection between Nodes: In technical term this is called as Missed Network Heartbeat (NHB). Whenever there is communication gap or no communication between nodes on private network (interconnect) due to network outage or some other reason. A node abort itself to avoid "split brain" situation. The most common (but not exclusive) cause of missed NHB is network problems communicating over the private interconnect.

Suggestion to troubleshoot Missed Network Heartbeat.
1. Check OS statistics from the evicted node from the time of the eviction. DBA can use OS Watcher to look at OS Stats at time of issue, check oswnetstat and oswprvtnet for network related issues.
2. Validate the interconnect network setup with the Help of Network administrator.
3. Check communication over the private network.
4. Check that the OS network settings are correct by running the RAC check tool.

4. Database Or ASM Instance Hang: Sometimes Database or ASM instance hang can cause Node reboot. In these case Database instance is hang and is terminated afterwards, which cause either reboot cluster or Node eviction. DBA should check alert log of Database and ASM instance for any hang situation which might cause this issue.

Database Alert log file entry for Database Hang Situation:

Wed Jun 05 03:25:01 2013
Sweep [inc][122970]: completed
System State dumped to trace file /u01/app/oracle/diag/rdbms/remotedb/remotedb2/incident/incdir_122970/remotedb2_m005_118200_i122970_a.trc
Wed Jun 05 03:25:33 2013
DSKM process appears to be hung. Initiating system state dump.
Incident details in: /u01/app/oracle/diag/rdbms/remotedb/remotedb2/incident/incdir_122897/remotedb2_pmon_9707_i122897.trc
Use ADRCI or Support Workbench to package the incident.
System State dumped to trace file /u01/app/oracle/diag/rdbms/remotedb/remotedb2/trace/remotedb2_diag_9759.trc
Instance terminated by PMON, pid = 9707
Wed Jun 05 03:45:58 2013
Starting ORACLE instance (normal)
At the same time resources at Cluster level start failing and node was evicted by itself. Real Application Cluster Log files
[/u01/app/11.2.0.3/grid/bin/oraagent.bin(96437)]CRS-5011:Check of resource "PCAB1DE" failed: details at "(:CLSN00007:)" in "/u01/app/11.2.0.3/grid/log/exs2db04/agent/crsd/oraagent_oracle/oraagent_oracle.log"
2013-06-05 03:27:06.649
Details at (:CRSAGF00113:) {0:6:29} in /u01/app/11.2.0.3/grid/log/exs2db04/agent/crsd/oraagent_oracle/oraagent_oracle.log.
 [cssd(67762)]CRS-1608:This node was evicted by node 3, exs2db03; details at (:CSSNM00005:) in /u01/app/11.2.0.3/grid/log/exs2db04/cssd/ocssd.log.
2013-06-05 03:37:26.647
[cssd(67762)]CRS-1608:This node was evicted by node 3, exs2db03; details at (:CSSNM00005:) in
this was not because of high load Since OS Watcher at this time doesn't show any load.
So, I believe this could be due to some bug in database.
In few of the cases, bugs could be the reason for node reboot, bug may be at Database level, ASM level or at Real Application Cluster level. Here, after initial investigation from Database Administrator side, DBA should open an SR with Oracle Support.


What is Oracle RAC Node Eviction?
One of the most common and complex issue is node eviction issue. A node is evicted from the cluster after it kills itself because it is not able to service the applications.
This generally happens during the communication failure between the instances, when the instance is not able to send heartbeat information to the control file and various other reasons.
Oracle Clusterware is designed to perform a node eviction by removing one or more nodes from the cluster if some critical problem is detected. A critical problem could be a node not responding via a network heartbeat, a node not responding via a disk heartbeat, a hung or severely degraded machine, or a hung ocssd.bin process. The purpose of this node eviction is to maintain the overall health of the cluster by removing bad members.
During failures, to avoid data corruption, the failing instance evicts itself from the cluster group. The node eviction process is reported as Oracle error ORA-29740 in the alert log and LMON trace files.

 

Oracle RAC: Node evictions & 11gR2 node eviction means restart of cluster stack not reboot of node

Cluster integrity and cluster membership will be governed by occsd (oracle cluster synchronization daemon) monitors the nodes using 2 communication channels:
- Private Interconnect  aka Network Heartbeat 
- Voting Disk based communication aka  Disk Heartbeat
Network heartbeat:-
Each node in the cluster is “pinged” every second
  • Nodes must respond in css_misscount time (defaults to 30 secs.) 
          – Reducing the css_misscount time is generally not supported
  • Network heartbeat failures will lead to node evictions
  • CSSD-log: 
        [date / time] [CSSD][1111902528] 
        clssnmPollingThread: node mynodename (5) at 75% heartbeat fatal, removal  in 6.7 sec
Disk Heartbeat:-
Each node in the cluster “pings” (r/w) the Voting Disk(s) every second
  • Nodes must receive a response in (long / short) diskTimeout time 
            – IF I/O errors indicate clear accessibility problems
    à timeout is irrelevant
  • Disk heartbeat failures will lead to node evictions
  • CSSD-log: … 
         [CSSD] [1115699552] >TRACE: clssnmReadDskHeartbeat:node(2) is down. rcfg(1) wrtcnt(1) LATS(63436584) Disk lastSeqNo(1)
Now, we know with above possibilities (network, disk heartbeat failures can lead to node eviction, but sometime when the server/occsd/resource request also makes the node get evicted which are extreme cases)
Why nodes should be evicted?
Evicting (fencing) nodes is a preventive measure (it’s a good thing)!
  • Nodes are evicted to prevent consequences of a split brain: 
        – Shared data must not be written by independently operating nodes 
        – The easiest way to prevent this is to forcibly remove a node from the cluster
How are nodes evicted? – STONITH 
Once it is determined that a node needs to be evicted,
  • A “kill request” is sent to the respective node(s) 
        – Using all (remaining) communication channels
  • A node (CSSD) is requested to “kill itself” – “STONITH like” 
        – “STONITH” foresees that a remote node kills the node to be evicted
EXAMPLE: Voting Disk Failure
Voting Disks and heartbeat communication is used to determine the node
  • In a 2 node cluster, the node with the lowest node number should survive
  • In a n-node cluster, the biggest sub-cluster should survive (votes based)
EXAMPLE: Network heartbeat failure
  • The network heartbeat between nodes has failed 
          – It is determined which nodes can still talk to each other 
          – A “kill request” is sent to the node(s) to be evicted
  • Using all (remaining) communication channels à Voting Disk(s)
  • A node is requested to “kill itself”; executer: typically CSSD
EXAMPLE: What if CSSD is stuck or server itself is not responding?
A node is requested to “kill itself”
  • BUT CSSD is “stuck” or “sick” (does not execute) – e.g.:
          – CSSD failed for some reason 
         – CSSD is not scheduled within a certain margin
OCSSDMONITOR (was: oprocd) will take over and execute
EXAMPLE: Cluster member (rac instance) can request a to kill another member (RAC Instance)
A cluster member (rac instance ) can request a kill another member in order to protect the data integrity, in such cases like control file progress record not written proper by the failure instance(read here) , then occsd tries to kill that member, if not possible its tries to evict the node.

11gR2 Changes –> Important, in 11GR2, the fencing (eviction) does not to reboot.
  • Until Oracle Clusterware 11.2.0.2, fencing (eviction) meant “re-boot”
  • With Oracle Clusterware 11.2.0.2, re-boots will be seen less, because: 
         – Re-boots affect applications that might run an a node, but are not protected 
         – Customer requirement: prevent a reboot, just stop the cluster – implemented…
How does this works?
With Oracle Clusterware 11.2.0.2, re-boots will be seen less: Instead of fast re-booting the node, a graceful shutdown of the cluster stack is attempted 

  • It starts with a failure – e.g. network heartbeat or interconnect failure
  • Then IO issuing processes are killed; it is made sure that no IO process remains 
         – For a RAC DB mainly the log writer and the database writer are of concern
  • Once all IO issuing processes are killed, remaining processes are stopped 
         – IF the check for a successful kill of the IO processes, fails → reboot
  • Once all remaining processes are stopped, the stack stops itself with a “restart flag”
  • OHASD will finally attempt to restart the stack after the graceful shutdown
   Exception to above:- 
  • IF the check for a successful kill of the IO processes fails → reboot
  • IF CSSD gets killed during the operation → reboot
  • IF cssdmonitor (oprocd replacement) is not scheduled → reboot
  • IF the stack cannot be shutdown in “short_disk_timeout”-seconds → reboot


IMPORTANT PARAMETER
By setting CLUSTER_DATABASE=FALSE, you are configuring your system to be a single-node database. This in turn enables you to 'MOUNT EXCLUSIVE,' allowing you to switch the archive mode for the instance. As you indicated, you need to set this parameter value back to TRUE to re-establish your RAC configuration after the archiving mode has been changed.

Managing Oracle Clusterware
Ongoing management of Oracle Clusterware is achieved by using the crsctl and srvctl command-line utilities installed under the Oracle Grid Infrastructure home directory. Oracle Clusterware components and resources can be monitored and managed from any node in the cluster by using crsctl. The srvctl utility provides similar monitoring and management capabilities for Oracle-related resources such as database instances and database services. Both utilities are provided with Oracle Clusterware. However, most crsctl commands are available only to clusterware administrators, whereas srvctl commands are available to other groups such as database administrators.
crsctl manages clusterware-related operations:
— Starting and stopping Oracle Clusterware
— Enabling and disabling Oracle Clusterware daemons
— Registering cluster resources
srvctl manages Oracle resource–related operations:
— Starting and stopping database instances and database services

Controlling Oracle Clusterware
The crsctl utility can be used to control Oracle Clusterware.
To start or stop Oracle Clusterware on a specific node:
# crsctl start crs
# crsctl stop crs
To enable or disable Oracle Clusterware on a specific
node:
# crsctl enable crs
# crsctl disable crs
Controlling Oracle Clusterware
When a node that contains Oracle Clusterware is started, the Oracle Clusterware wrapper
script is automatically started by the /etc/init.d/ohasd startup script. When the
crsctl utility is used to disable Cluster Ready Services (CRS) from automatically starting,
state information related to startup is placed in the SLCS_SRC control files, preventing
automatic startup on machine reboot. To check the status of CRS, use the following syntax:

# crsctl check crs

CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
You may have to manually control the Oracle Clusterware stack while applying patches or
during planned outages. You can stop Oracle Clusterware by using the crsctl stop crs
command and start it by using the crsctl start crs command.
Verifying the Status of Oracle Clusterware
The crsctl utility can be used to verify the status of Oracle Clusterware.
To determine the overall health on a specific node:
$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

To check the viability of Cluster Synchronization Services
(CSS) across nodes:
$ crsctl check cluster
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
Verifying the Status of Oracle Clusterware

The crsctl utility can be used to verify the status of Oracle Clusterware on specific nodes and across nodes. In contrast to the crsctl controlling commands that required the root access (shown in the previous slide), the check commands do not require root and may be executed by the Oracle Clusterware software owner. The overall health of the clusterware on a specific node can be obtained by using the crsctl check crs command. It is possible to target three of the individual daemons by using the crsctl check <daemon> command for the crsd, evmd, and cssd daemons only. These commands are processed only on the node on which they are executed. To check the viability of Cluster Synchronization Services (CSS) across all nodes, use the crsctl check cluster command. The output of the overall health check performed on a specific node is shown in the slide.

crsctl check crs <<-- for the local node

crsctl check cluster <<-- for remote nodes in the cluster
crs_start
This command sets all the applications or the specified applications ONLINE, and attempts to start the specified registered applications or application resources. You should have full admin privileges for this.

crs_stop

This command stops a resource on the specified node. Checking your crs_stat always helps to verify which application/resource we want to stop.
crs_stat –t
The crs_stat command gives you the names of the applications, which you might need to shut down some applications manually, in order to shut the whole cluster down and restart it, it will check the resources of RAC.
$ORA_CRS_HOME/bin/crs_stat -t
Note: We cannot stop the resources of RAC with cr_stop –all, it will give some errors, Therefore we need to stop our instance first.
Example:
srvctl stop instance -d fokerac -i fokerac1

srvctl stop instance -d fokerac -i fokerac2


Difference on using SRVCTL and CRSCTL commands:
Use SRVCTL to manage Oracle-supplied resources such as
Listener,
Instances,
Disk groups,
Networks.

Use CRSCTL for managing Oracle Clusterware and its resources.

Oracle strongly discourages directly manipulating Oracle-supplied resources (resources whose names begin with ora) using CRSCTL.  This could adversely impact the cluster configuration.

If resource name begins with ora then use SRVCTL
How to check the status of CRS on specific node?
[ora11g@racnode1 ~]$ crsctl check crs
Note: crsctl command will be present in GRID_INRA_HOME
How to check the status of services on all nodes:
[ora11g@racnode1 ~]$ srvctl status nodeapps
How to check the status of complete clusterware stack on all nodes:
[ora11g@racnode1 ~]$ crsctl status resource -t
How to check the status of clusterware servers:
[ora11g@racnode1 ~]$ crsctl status server -f      
How to stop csr on all nodes of clusterware:
[root@racnode1 bin]# ./crsctl stop cluster -all
How to stop crs on specific node:
[root@racnode1 bin]# ./crsctl stop crs
How to start crs on specific node:
[root@racnode1 bin]# ./crsctl start crs
How to check the status of specific Instance across cluster nodes:
[ora11g@racnode1 ~]$ srvctl status instance -d racdb -i racdb1
How to check the status of all instances across the cluster nodes:
[ora11g@racnode1 ~]$ srvctl status instance -d racdb -i racdb1,racdb2,racdb3,racdb4
How to start specific Instance across cluster nodes:
[ora11g@racnode1 ~]$ srvctl start instance -d racdb -i racdb4
How to check the status of database across cluster nodes:
[ora11g@racnode1 ~]$ srvctl status database -d racdb
How to stop cluster database:
[ora11g@racnode1 ~]$ srvctl stop database -d racdb
-- Start/ Stop database / instance services status:
From oracle user
1) Verify instance status
$ srvctl status instance -d EHISKOL -i EHISKOL2
Instance EHISKOL2 is running on node ekrac2
2) To Stop instance
$ srvctl stop instance -d EHISKOL -i EHISKOL2
3) To start instance
$ srvctl start instance -d EHISKOL -i EHISKOL2
-- If below message found, then instance services might be runnng.

PRCC-1015 : EHISKOL was already running on ekrac2
PRCR-1004 : Resource ora.ehiskol.db is already running
4) To stop all instance services across the nodes:
$ srvctl stop database -d EHISKOL
-- Start/ Stop SCAN listeners in RAC databases:
1) Verify SCAN listener status:
$ srvctl status scan_listener
SCAN Listener LISTENER_SCAN1 is enabled
SCAN listener LISTENER_SCAN1 is running on node ekrac2
SCAN Listener LISTENER_SCAN2 is enabled
SCAN listener LISTENER_SCAN2 is running on node ekrac1
SCAN Listener LISTENER_SCAN3 is enabled
SCAN listener LISTENER_SCAN3 is running on node ekrac1
$ srvctl status listener
Listener LISTENER is enabled
Listener LISTENER is running on node(s): ekrac2,ekrac1
Note: Assume 3 scan IP configured with DNS.
When single / default IP configured with SCAN listener, then status will be shown like below:
$ srvctl status scan_listener
SCAN Listener LISTENER_SCAN1 is enabled
SCAN listener LISTENER_SCAN1 is running on node host1
2) To stop the scan listener:

$ srvctl stop scan_listener
3) To start the scan listener
$ srvctl start scan_listener
1) Verify Cluster services / CRS status
   ( for single node)
# ./crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

for all nodes)
# crsctl check cluster -all
**************************************************************
ekrac1:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
2) To stop Cluster services / CRS (single node)

./crsctl stop crs


TOP 5 ISSUES WITH SCAN VIP AND LISTENER
               

ISSUE 1 :

SCAN VIP showing status UNKNOWN – CHECK TIMED OUT
       crsctl stat res –t

                 ora.scan1.vip 1 ONLINE UNKNOWN rac2 CHECK TIMED OUT
                 ora.scan2.vip 1 ONLINE ONLINE rac1
                 ora.scan3.vip 1 ONLINE ONLINE rac1

CAUSE : 

/ec/resolv.conf is not consistent across all nodes .

SOLUTION :

         Have /etc/resolv.conf consistent across all nodes
         Reduce the value for lookup timeout so that the total lookup time is less than the check timeout for VIP resource. E.g.  add to /etc/resolv.conf : options timeout :1 

ISSUE 2 :

What to check on SCAN VIP’s before installing GRID and what to check after installation .

BEFORE INSTALLATION
         /etc/resolv.conf must be consistent across all nodes
         SCAN should have the netmask already used by public interface
         “nslookup scan-name” should return name server and 3 ip addresses
         If tcp.validnode_checking=yes in GRID’s home sqlnet.ora, then TCP.INVITED_NODES will require the SCAN VIP’s and node                       VIP’s to be added to the Grid Infrastructure sqlnet.ora file.

AFTER INSTALLATION

         crsctl status resource –w ‘TYPE=ora.scan_vip.type’    ---  it should show LISTENER_SCAN<X> ONLINE
Also use,
         srvctl config scan
         srvctl config scan_listener
         cluvfy comp scan

ISSUE 3 :

Service not registered with SCAN listener after SCAN listener failover .

DESCRIPTION :

After SCAN VIP and SCAN listener failover, instance does not register with SCAN listener. It might happen for only 1 SCAN listener.
Client connection gets intermittent ORA-12514 TNS:listener does not currently know of service requested in connect descriptor.

SOLUTION :

For both above bugs , the work around is to unregister and register remote listener on the db instance which does not register to a SCAN listener , with the following steps :

show parameter remote_listener ;
alter system set remote_listener=’   ‘ ;
alter system register ;
alter system set remote_listener=’<scan>:<port>’ ;
alter system register ;

ISSUE 4 : 

SCAN VIP’s do not failover when public n/w is down . 

DESCRIPTION : 

SCAN VIP’s should failover to next node when public n/w is down. On 11.2.0.1 under some circumstances it is possible that SCAN VIP stays on wrong node. 

SOLUTION : 
Install last 11.2.0.1 PSU or 11.2.0.2/11.2.0.3  .

ISSUE 5 : 

SCAN listener troubles . 

CAUSE : 

Listener.ora exists in /etc or /var/opt/oracle .
Default port 1521 not used . 

SOLUTION: 

Move listener.ora to other location. It is fixed in 11.2.0.3 and last 11.2.0.2 PSU .
SCAN listener port changed to default 1521 after upgrade to 11.2.0.3 GI from lower release . 

Automatic OCR Backups
The Oracle Clusterware automatically creates OCR physical backups every four hours. At any one time, Oracle always retains the last 3 backup copies of the OCR that are 4 hours old. The CRSD process that creates these backups also creates and retains an OCR backup for each full day and at the end of each week. You cannot customize the backup frequencies or the number of OCR physical backup files that Oracle retains.
The default location for generating physical backups on UNIX-based systems is CRS_home/cdata/cluster_name where cluster_name is the name of your cluster (i.e. crs). Use the ocrconfig -showbackup command to view all current OCR physical backups that were taken from the master node:
[oracle@racnode1 ~]$ ocrconfig -showbackup

Restoring the OCR from an automatic physical backup is accomplished using the ocrconfig -restore command. Note that the CRS stack needs to be shutdown on all nodes in the cluster prior to running the restore operation:

ocrconfig -restore <backup_file_name>



The Master Node
As documented in Doc ID: 357262.1 on the My Oracle Support web site, the CRSD process only creates automatic OCR physical backups on one node in the cluster, which is the OCR master node. It does not create automatic backup copies on the other nodes; only from the OCR master node. If the master node fails, the OCR backups will be created from the new master node. You can determine which node in the cluster is the master node by examining the $ORA_CRS_HOME/log/<node_name>/cssd/ocssd.log file on any node in the cluster. In this log file, check for reconfiguration information (reconfiguration successful) after which you will see which node is the master and how many nodes are active in the cluster:
Node 1 - (racnode1)
[    CSSD]CLSS-3000: reconfiguration successful, incarnation 1 with 2 nodes
[    CSSD]CLSS-3001: local node number 1, master node number 1
Node 2 - (racnode2)
[    CSSD]CLSS-3000: reconfiguration successful, incarnation 1 with 2 nodes
[    CSSD]CLSS-3001: local node number 2, master node number 1
Another quick approach is to use either of the following methods:
Node 1 - (racnode1)
# grep -i "master node" $ORA_CRS_HOME/log/racnode?/cssd/ocssd.log | tail -1
[    CSSD]CLSS-3001: local node number 1, master node number 1
Node 2 - (racnode2)
# grep -i "master node" $ORA_CRS_HOME/log/racnode?/cssd/ocssd.log | tail -1
[    CSSD]CLSS-3001: local node number 2, master node number 1


# If not found in the ocssd.log, then look through all
# of the ocssd archives:
Node 1 - (racnode1)
# for x in 'ls -tr $ORA_CRS_HOME/log/racnode?/cssd/ocssd.*'
do grep -i "master node" $x; done | tail -1
[    CSSD]CLSS-3001: local node number 1, master node number 1
Node 2 - (racnode2)
# for x in 'ls -tr $ORA_CRS_HOME/log/racnode?/cssd/ocssd.*'
do grep -i "master node" $x; done | tail -1
[    CSSD]CLSS-3001: local node number 2, master node number 1


# The master node information is confirmed by the
# ocrconfig -showbackup command:
# ocrconfig -showbackup
racnode1     2009/09/29 13:05:22     /u01/app/crs/cdata/crs
racnode1     2009/09/29 09:05:22     /u01/app/crs/cdata/crs
racnode1     2009/09/29 05:05:22     /u01/app/crs/cdata/crs
racnode1     2009/09/28 05:05:21     /u01/app/crs/cdata/crs
racnode1     2009/09/22 05:05:13     /u01/app/crs/cdata/crs
Because of the importance of OCR information, Oracle recommends that you make copies of the automatically created backup files at least once a day from the master node to a different device from where the primary OCR resides. You can use any backup software to copy the automatically generated physical backup files to a stable backup location:


1.     Verify the OCR configuration by running the ocrcheck command:
2.     [root@racnode1 ~]# ocrcheck
3.     Status of Oracle Cluster Registry is as follows :
4.              Version                  :          2
5.              Total space (kbytes)     :     262120
6.              Used space (kbytes)      :       4668
7.              Available space (kbytes) :     257452
8.              ID                       :    1331197
9.              Device/File Name         : /u02/oradata/racdb/OCRFile
10.                                          Device/File integrity check succeeded <-- Primary OCR Restored
11.               Device/File Name         : /u02/oradata/racdb/OCRFile_mirror
12.                                          Device/File integrity check succeeded <-- Mirror OCR Restored
13.       
         Cluster registry integrity check succeeded
View Voting Disk Configuration Information
Use the crsctl utility to verify how many voting disks are configured for the cluster as well as their location. The the crsctl command can be run as either the oracle or root user account:
[oracle@racnode1 ~]$ crsctl query css votedisk
 0.     0    /u02/oradata/racdb/CSSFile
located 1 votedisk(s).
Add a Voting Disk
Adding or removing a voting disk from the cluster is a fairly straightforward process. Oracle Clusterware 10g Release 1 (10.1) only allowed for one voting disk while Oracle Oracle Clusterware 10g Release 2 (10.2) lifted this restriction to allow for 32 voting disks. Having multiple voting disks available to the cluster removes the voting disk as a single point of failure and eliminates the need to mirror them outside of Oracle Clusterware (i.e. RAID). The Oracle Universal Installer (OUI) allows you to configure either one or three voting disks during the installation of Oracle Clusterware. Having three voting disks available allows Oracle Clusterware (CRS) to continue operating uninterrupted when any one of the voting disks fail.
crsctl add css votedisk <path>
http://www.idevelopment.info/images/popup_dialog_information_mark.gif 
You must be logged in as the root user to run the crsctl command to add/remove voting disks.
Backing up the voting disk(s) is often performed on a regular basis by the DBA to guard the cluster against a single point of failure as the result of hardware failure or user error. Because the node membership information does not usually change, it is not a strict requirement that you back up the voting disk every day. At a minimum, however, your backup strategy should include procedures to back up all voting disks at the following times and make certain that the backups are stored in a secure location that is accessible from all nodes in the cluster in the event the voting disk(s) need to be restored:
  • After installing Oracle Clusterware
  • After adding nodes to or deleting nodes from the cluster
  • After performing voting disk add or delete operations
Oracle Clusterware 10g Release 1 (10.1) only allowed for one voting disk while Oracle Clusterware 10g Release 2 (10.2) lifted this restriction to allow for 32 voting disks. For high availability, Oracle recommends that Oracle Clusterware 10g R2 users configure multiple voting disks while keeping in mind that you must have an odd number of voting disks, such as three, five, and so on. To avoid simultaneous loss of multiple voting disks, each voting disk should be placed on a shared storage device that does not share any components (controller, interconnect, and so on) with the storage devices used for the other voting disks. If you define a single voting disk, then you should use external mirroring to provide redundancy.
To make a backup copy of the voting disk on UNIX/Linux, use the dd command:
dd if=<voting_disk_name> of=<backup_file_name> bs=<block_size>
Perform this operation on every voting disk where voting_disk_name is the name of the active voting disk (input file), backup_file_name is the name of the file to which you want to back up the voting disk contents (output file), and block_size is the value to set both the input and output block sizes. As a general rule on most platforms, including Linux and Sun, the block size for the ddcommand should be 4k to ensure that the backup of the voting disk gets complete blocks.
If your voting disk is stored on a raw device, use the device name in place of voting_disk_name. For example:
dd if=/dev/raw/raw3 of=/u03/crs_backup/votebackup/VotingDiskBackup.dmp bs=4k
Step by Step Cloning Database 10g RAC to RAC
Cloning RAC Database (RAC-RAC)
·         Install Oracle Clusterware on target server
·         Install Oracle Database Software on target server
·         Copy the Cold backup to target server shared location. (If you are using hotbackup then also copy the archive log files. We require archivelog files from both instances for e.g : 1st Instance = log_652206883_34730_1.arc 2nd Instance =log_652206883_35888_2.arc to recover the database)
·         Generate pfile from spfile on Production and transfer the file to target $ORACLE_HOME/dbs
·         Edit the pfile to replace the entries of PROD to TEST/UAT/DEV. Also remove the below entries from the pfile (initERPUAT.ora)
*.compatible='10.2.0' *.control_files='/uatdata/oracle/db/apps_st/data/cntrl01.dbf','/uatdata/oracle/db/apps_st/data/cntrl02.dbf','/uatdata/oracle/db/apps_st/data/cntrl03.dbf' *.db_name='ERPUAT' ERPUAT.instance_name='ERPUAT1' ERPUAT.local_listener='ERPUAT1_LOCAL' *.log_archive_dest_1='location=/uatarch/arch' *.log_archive_format='log_%r_%s_%t.arc' ERPUAT.remote_listener='ERPUAT_REMOTE' ERPUAT.service_names='ERPUAT' *.db_file_name_convert=(/oraproddata/oracle/db/apps_st/DATA,/uatdata/oracle/db/apps_st/DATA) *.log_file_name_convert=(/oraproddata/oracle/db/apps_st/DATA,/uatdata/oracle/db/apps_st/DATA) *._no_recovery_through_resetlogs=TRUE
·         Export the ORACLE_SID=ERPUAT and start the database in nomount state
·         Create the control file or rename the datafiles (change the location of datafiles) if you want to keep the db name same
Note : If using hotbackup, once the control file is created recover the database using the below command.Provide necessary archivelog files. To check which archivelog files to supply. Execute the below command on Source Database

No comments:

Post a Comment