Oracle RAC Architecture 10g and 11g
RAC components
- Shared Disk System
- Oracle Clusterware
- Cluster Interconnects
- Oracle Kernel Components
Shared Disk System
Below are the three major types of shared storage used in RAC.
Raw volumes: A raw logical volume is an area of physical or logical disk space that is under the direct control of an application (such as a database) rather than under the control of the operating system or a file system. This was the only available option in 9i.
Cluster File System: This option is not widely used; here a cluster file system, such as Oracle Cluster File System (OCFS) for MS Windows and Linux, holds all the datafiles of the RAC database.
Automatic Storage Management (ASM): Oracle's recommended storage option, introduced in Oracle 10g; it is a cluster-optimized storage framework for Oracle database files.
Oracle Clusterware
Clusterware is the mandatory piece of software required to run the Oracle RAC option; it provides the basic clustering support at OS level and enables the Oracle software to run in cluster mode.
Clusterware enables the nodes to communicate with each other and forms the cluster that makes the nodes work as a single logical server.
Clusterware is managed by Cluster Ready Services (CRS) using the Oracle Cluster Registry (OCR) and the voting disk.
Cluster Ready Services (CRS)
Oracle Clusterware uses CRS for interaction between the OS and the database. CRS has five components, namely the Process Monitor daemon (OPROCd, obsolete in 11gR2), the CRS daemon (CRSd), the Oracle Cluster Synchronization Services daemon (OCSSd), the Event Manager daemon (EVMd) and the Oracle Notification Service (ONS).
CRS is installed in, and runs from, ORA_CRS_HOME in 10g and GRID_HOME in 11g.
Starting and Stopping CRS
crsctl stop crs       # Stops Oracle Clusterware
crsctl start crs      # Starts Oracle Clusterware
crsctl enable crs     # Enables automatic startup of Oracle Clusterware on reboot
crsctl disable crs    # Disables automatic startup of Oracle Clusterware on reboot
Note: these commands should be run as the root user (superuser).
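After starting or stopping the stack it is worth verifying its state; the commands below are the standard checks (a sketch only - output and the exact syntax vary between 10g and 11gR2):

crsctl check crs      # overall health of the clusterware stack
crs_stat -t           # 10g: tabular status of all registered resources
crsctl stat res -t    # 11gR2 replacement for crs_stat -t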
Oracle Cluster Synchronization Services
OCSSd provides synchronization services between the nodes. Failure of OCSSd causes the machine to reboot in order to avoid a split-brain situation. OCSSd runs as the oracle user.
Event Manager Process
This runs as the daemon process EVMd. The daemon spawns a permanent child process called evmlogger and generates events when things happen.
Oracle Notification Services
Whenever the state of a cluster resource changes, the ONS processes on each node communicate with each other and exchange the high-availability information.
Oracle Process Monitor
The Oracle Process Monitor daemon (OPROCd) detects a potential cluster node hang and reboots the hung node.
Oracle Cluster Registry (OCR)
Oracle Clusterware uses a cluster registry to keep the configuration information. It must reside on shared storage and be accessible to all the nodes in the cluster simultaneously. This shared storage is known as the Oracle Cluster Registry (OCR).
Oracle Local Registry (OLR) - 11gR2
Similar to the OCR and introduced in 11gR2, but it stores information only about the local node. It is not shared by the other nodes of the cluster and is used by OHASd while starting or joining the cluster.
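On 11gR2 the OLR can be inspected much like the OCR; a quick check, assuming the -local flags of the 11gR2 utilities (run as root on the node in question):

ocrcheck -local               # reports version, space used and the OLR file location for this node
ocrdump -local olr.dump       # dumps the OLR contents to the file olr.dump for inspection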
Voting Disk
A voting disk is a shared disk that is accessed by all the member nodes of the cluster. It is used as a central reference for all the nodes and keeps the heartbeat information between the nodes. If any node is unable to ping the voting disk, the cluster immediately recognizes the communication failure and evicts the node from the cluster.
Cluster interconnect
It is the communication path used by the cluster for the synchronization of resources, and in some cases it is also used to transfer data from one instance to another. Typically, the interconnect is a network connection dedicated to the server nodes of a cluster (and is therefore sometimes referred to as the private interconnect).
Single Client Access Name (SCAN)-11gR2
SCAN is a single network name that resolves to three different IP addresses registered in DNS or GNS.
Prior to 11g R2, adding or removing a node from the cluster required changing the connection settings on every client. By configuring client connections to use the SCAN name this problem is eliminated.
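As a sketch, a client tnsnames.ora entry using SCAN only needs the SCAN name rather than the individual node VIPs (the host and service names below are made up):

RACDB =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = rac-scan.example.com)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = RACDB)
    )
  )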
New features in Oracle 9i/10g/11g RAC
Oracle Real Application Clusters New features
- OPS (Oracle Parallel Server) was renamed to RAC
- CFS (Cluster File System) was supported
- OCFS (Oracle Cluster File System) for Linux and Windows
- The watchdog timer was replaced by the hangcheck timer
- The Cluster Manager was replaced by CRS
- The concept of Services was expanded
- AWR was instance specific
- CRS was renamed to Clusterware
- OCR and voting disks can be mirrored
- Can use FAN/FCF with TAF for OCI and ODP.NET
- Oracle 11g RAC parallel upgrades - Oracle 11g has rolling upgrade features whereby a RAC database can be upgraded without any downtime.
- Hot patching - zero-downtime patch application.
- Oracle RAC load balancing advisor - starting from 10g R2 we have the RAC load balancing advisor utility. The 11g RAC load balancing advisor is only available with clients that use .NET, ODBC, or the Oracle Call Interface (OCI).
- ADDM for RAC - Oracle has incorporated RAC into the Automatic Database Diagnostic Monitor for cross-node advisories. The addmrpt.sql script reports on a single instance and will not report on all instances in the RAC; this is known as instance ADDM. Using the new DBMS_ADDM package we can generate a report for all instances of the RAC; this is known as database ADDM (see the sketch after this list).
- Optimized RAC cache fusion protocols - moves on from the general cache fusion protocols in 10g to deal with specific scenarios where the protocols could be further optimized.
- Oracle 11g RAC grid provisioning - the Oracle Grid Control provisioning pack allows us to "blow-out" a RAC node without the time-consuming install, using a pre-installed "footprint".
- ASMCA
- Single Client Access Name (SCAN) - eliminates the need to change the TNS entry when nodes are added to or removed from the cluster. RAC instances register with the SCAN listeners as remote listeners. SCAN is a fully qualified name. Oracle recommends assigning three addresses to SCAN, which creates three SCAN listeners.
- Clusterware components: crfmond, crflogd, GIPCD.
- AWR is consolidated for the database.
- 11g Release 2 Real Application Clusters (RAC) has server pooling technologies, so it is easier to provision and manage database grids. This update is geared toward dynamically adjusting servers as corporations manage the ebb and flow between data requirements for data warehousing and applications.
- By default, LOAD_BALANCE is ON.
- GPnP profile.
- Cluster information in an XML profile.
- Oracle RAC One Node is a new option that makes it easier to consolidate databases that aren't mission critical but need redundancy.
- raconefix - fixes a RAC One Node database in case of failure.
- racone2rac - converts RAC One Node back to RAC.
- Oracle Restart - the feature of Oracle Grid Infrastructure's High Availability Services (HAS) that manages associated listeners, ASM instances and Oracle instances.
- Oracle Omotion - Oracle 11g Release 2 RAC introduces a new feature called Oracle Omotion, an online migration utility. The Omotion utility relocates the instance from one node to another whenever an instance failure happens.
- The Omotion utility uses the Database Area Network (DAN) to move Oracle instances. Database Area Network (DAN) technology helps seamless database relocation without losing transactions.
- Cluster Time Synchronization Service (CTSS) is a new feature in Oracle 11g R2 RAC which is used to synchronize time across the nodes of the cluster. CTSS can act as a replacement for the NTP protocol.
- Grid Naming Service (GNS) is a new service introduced in Oracle RAC 11g R2. With GNS, Oracle Clusterware (CRS) can manage Dynamic Host Configuration Protocol (DHCP) and DNS services for dynamic node registration and configuration.
- Cluster interconnect: used for data blocks, locks, messages, and SCNs.
- Oracle Local Registry (OLR) - from Oracle 11gR2 the Oracle Local Registry (OLR) is a new part of Oracle Clusterware. The OLR is the node's local repository, similar to the OCR (but local) and managed by OHASD. It holds data for the local node only and is not shared among the other nodes.
- Multicasting is introduced in 11gR2 for private interconnect traffic.
- I/O fencing prevents updates by failed instances by detecting failure and preventing split brain in the cluster. When a cluster node fails, the failed node needs to be fenced off from all the shared disk devices or diskgroups. This methodology is called I/O fencing, sometimes called disk fencing or failure fencing.
- Re-bootless node fencing (restart) - instead of fast-rebooting the node, a graceful shutdown of the stack is attempted.
- Clusterware log directories: acfs*
- HAIP (IC VIP).
- Redundant interconnects: NIC bonding, HAIP.
- RAC background processes: DBRM - Database Resource Manager, PING - Response time agent.
- Virtual Oracle 11g RAC cluster - Oracle 11g RAC supports virtualization.
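A sketch of the database-wide ADDM mentioned in the list above (the task name and AWR snapshot IDs are made up; the DBMS_ADDM package is available from 11g and requires the Diagnostics Pack):

SQL> var tname varchar2(60)
SQL> begin
       :tname := 'RAC_DB_ADDM_100_110';
       dbms_addm.analyze_db(:tname, 100, 110);   -- analyse all instances between snapshots 100 and 110
     end;
     /
SQL> set long 1000000 pagesize 0
SQL> select dbms_addm.get_report(:tname) from dual;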
Oracle Kernel Components
The set of additional background processes in each instance is known as the Oracle kernel components in a RAC environment. Since the buffer cache and shared pool become global in RAC, special handling is required to manage the resources and avoid conflicts and corruption. The additional RAC background processes and the single-instance background processes work together to achieve this.
Global Cache and Global Enqueue Services
In Oracle RAC each instance has its own buffer cache, but an instance must be able to access data blocks currently residing in another instance's cache. This management and data sharing is done by Global Cache Services (GCS). Resources other than data blocks, such as locks and enqueue details, are shared across the instances by Global Enqueue Services (GES).
Global Resource Directory
In a cluster, the central repository formed by all the resources is known as the Global Resource Directory (GRD). The GRD is managed by GCS and GES.
Oracle RAC Background Processes
- LMS - Global Cache Service Process
- LMON - Global Enqueue Service Monitor
- LMD - Global Enqueue Service Daemon
- LCK0 - Instance Enqueue Process
LMS: Global Cache Services Process
LMS (Lock Manager Server process) is used in Cache Fusion. It enables consistent copies of blocks to be transferred from a holding instance's buffer cache to a requesting instance's buffer cache without a disk write, under certain conditions.
It rolls back any uncommitted transactions for blocks that are being requested for a consistent read by the remote instance.
LMON:Global Enqueue Services Monitor
LMON (Lock Monitor process) is responsible for managing Global Enqueue Services (GES). It maintains consistency of the GCS memory structures in case of process death. LMON is also responsible for cluster reconfiguration when an instance joins or leaves the cluster. It also checks for instance death and listens for local messages.
LMD: Global Enqueue Services Daemon
LMD (Lock Manager Daemon) manages enqueue manager service requests for the GCS. It also handles deadlock detection and remote resource requests.
LCK0:Instance Enqueue Process
LCK0 (Instance Enqueue process) manages instance resource requests and cross-instance call operations for shared resources.
DIAG Process
A separate process used for diagnostic purposes; PMON will restart it in case this process dies. The DIAG process does not affect any other background process.
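These RAC-specific background processes can be seen alongside the standard ones in the gv$bgprocess view; a sketch of such a query is below (the exact process names, e.g. LMD0 or LMS0/LMS1, vary slightly between versions):

select inst_id, name, description
from   gv$bgprocess
where  name like 'LM%' or name in ('LCK0','DIAG')
order  by inst_id, name;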
High Availability and
Clustering
When you have very critical systems that are required to be online 24x7 you need an HA (High Availability) solution, and you have to weigh up the risk associated with downtime against the cost of a solution. HA solutions are not cheap and they are not easy to manage. HA solutions need to be thoroughly tested, as they may not be exercised in the real world for months. I had a solution that ran for almost a year before a hardware failure caused a failover; this is when your testing beforehand comes into play.
As I said before, HA comes with a price, and there are a number of HA technologies:
- Fault Tolerance - this technology protects you from hardware failures, for example redundant PSUs, etc
- Disaster Recovery - this technology protects you from operational issues such as a data center becoming unavailable
- Disaster Tolerance - this technology is used to prepare for the above two, and is the most important of the three
Every company should plan for unplanned outages; this costs virtually nothing, and knowing what to do in a DR situation is half the battle. In many companies people make excuses not to design a DR plan (it costs too much, we don't have the redundant hardware, etc). You cannot make these assumptions until you design a DR plan: the plan will highlight the risks and the costs that go with those risks, and then you can make the decision on what you can and cannot afford. There is no excuse not to create a DR plan.
Sometimes in large corporations you will hear the phrase five nines; this phrase refers to the availability of a system and the (approximate) downtime that is allowed. The table below highlights the uptime a system requires in order to achieve five nines.
% uptime            | % downtime | Downtime per year   | Downtime per week
98                  | 2          | 7.3 days            | 3 hours 22 minutes
99                  | 1          | 3.65 days           | 1 hour 41 minutes
99.8                | 0.2        | 17 hours 30 minutes | 20 minutes
99.9                | 0.1        | 8 hours 45 minutes  | 10 minutes
99.99               | 0.01       | 52.5 minutes        | 1 minute
99.999 (five nines) | 0.001      | 5.25 minutes        | 6 seconds
To achieve five nines your system is only allowed 5.25 minutes of downtime per year, or 6 seconds per week; in some HA designs a failover alone may take 6 seconds.
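As a quick arithmetic check of those last two figures: a year has 525,600 minutes and a week has 10,080 minutes, so

525,600 x (1 - 0.99999) = 5.256, roughly 5.25 minutes of downtime per year
10,080 x (1 - 0.99999) = 0.1008 minutes, roughly 6 seconds of downtime per week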
When looking for a solution you should try to build redundancy into your plan; this is the first step towards an HA solution, for example:
- Make sure computer cabinets have dual power
- Make sure servers have dual power supplies, dual
network cards, redundant hard disks that can be mirrored
- Make sure you use multipathing to the data disk which
are usually on a SAN or NAS
- Make sure that the server is connected to two different
network switches
You are trying to eliminate as many single points of failure (SPOFs) as you can without increasing the costs. Most hardware today will have these redundancy features built in, but it's up to you to make use of them.
HA comes in three or four flavors:
No-failover
This option usually uses the already built-in redundancy: failed disks and PSUs can be replaced online, but if a major hardware component were to fail then a system outage is unavoidable, and the system will remain down until it is fixed. This solution can be perfectly acceptable in some environments, but at what price to the business? Even in today's market QA/DEV systems cost money when not running; I am sure your developers would be quite happy to take the day off, paid, while you fix the system.

Cluster
This is the jewel in the HA world; a cluster can be configured in a variety of flavors, from minimal downtime while services are moved to a good node, to virtually zero downtime. However, a cluster solution does come with a heavy price tag: hardware, configuration and maintaining a cluster are expensive, but if your business loses vast amounts of money when your system is down, then it's worth it.

Cold failover
Many smaller companies use this solution; basically you have an additional server ready to take over from a number of servers if one were to fail. I have used this technique myself: I created a number of scripts that can turn a cold standby server into any one of a number of servers if the original server is going to have a prolonged outage. The problem with this solution is that there is going to be downtime, especially if it takes a long time to get the standby server up to the same point in time as the failed server. The advantage of this solution is that one additional server can cover a number of servers, even if it is slightly under-powered compared to the original server, as long as it keeps the service running.

Hot failover
Many applications offer hot-standby servers; these servers run alongside the live system, and data is applied to the hot-standby server periodically to keep it up to date, thus in a failover situation the server is almost ready to go. The problem with this approach is cost and manageability; also, one server is usually dedicated to one application, so you may have to have many hot-standby servers. The advantage is that downtime is kept to a minimum, but there will be some downtime, generally the time it takes to get the hot-standby server up to date, for example applying the last set of logs to a database.
Here is a summary table that shows the most common aspects of cold failover versus hot failover.
Aspect | Cold failover | Hot failover
Scalability / number of nodes | Scalability limited to the capacity of a single node | As nodes can be added on demand, it provides near-infinite scalability; a high number of nodes is supported
User interruption | Required up to a minimal extent; the failover operation can be scripted or automated to a certain extent | Not required, failover is automatic
Transparent failover of applications | Not possible | Transparent application failover is available, where sessions can be transferred to another node without user interruption
Load balancing | Not possible, only one server will be used | Incoming load can be balanced between both nodes
Usage of resources | Only one server at a time, the other server is kept idle | Both servers are used
Failover time | More than minutes, as the other system must be cold started | Less than a minute, typically a few seconds
I have discussed clustering in my Tomcat and JBoss topics, so I will only touch on the subject lightly here. A cluster is a group of two or more interconnected nodes that provide a service. The cluster provides a high level of fault tolerance: if a node becomes unavailable within the cluster, the services are moved/restored to another working node, so the end user should never know that a fault occurred.
Clusters can be set up to use a single node in the cluster or to load balance between the nodes, but the main objective is to keep the service running, which is why you pay top dollar for this. One advantage of a cluster is that it is very scalable, because additional nodes can be added or taken away (a node may need to be patched) without interrupting the service.
Clustering has come a long way; there are now three types of clustering architecture:
Shared nothing
Each node within the cluster is independent; they share nothing. An example of this might be web servers: you have a number of nodes within the cluster supplying the same web service. The content is static, so there is no need to share disks, etc.

Shared disk only
Each node is attached to, or has access to, the same set of disks. These disks contain the data that is required by the service. One node controls the application and the disk, and in the event that that node fails, the other node takes control of both the application and the data. This means that one node has to be on standby, sitting idle, waiting to take over if required to do so.

Shared everything
Again, all nodes are attached to, or have access to, the same set of disks, but this time each node can read/write to the disks concurrently. Normally there is a piece of software that controls the reading and writing to the disks, ensuring data integrity. To achieve this a cluster-wide filesystem is introduced, so that all nodes view the filesystem identically; the software then coordinates the sharing and updating of files, records and databases. Oracle RAC and IBM HACMP would be good examples of this type of cluster.
The first Oracle cluster database was released with Oracle 6 for the Digital VAX; this was the first cluster database on the market. With Oracle 6.2 Oracle Parallel Server (OPS) was born, which used Oracle's own DLM (Distributed Lock Manager). Oracle 7 used vendor-supplied clusterware, but this was complex to set up and manage; Oracle 8 introduced a general lock manager, and this set the direction for Oracle to create its own clusterware product. Oracle's lock manager is integrated with the Oracle code via an additional layer called OSD (Operating System Dependent); this was soon integrated within the kernel and became known as IDLM (Integrated Distributed Lock Manager) in later Oracle versions. Oracle Real Application Clusters 9i (Oracle RAC) used the same IDLM and relied on external clusterware software (Sun Cluster, Veritas Cluster, etc).
An Oracle parallel database consists of two or more nodes that own Oracle instances and share a disk array. Each node has its own SGA and its own redo logs, but the data files and control files are shared by all instances. Data files and control files are concurrently read and written by all instances; redo log files, on the other hand, can be read by any instance but written only by the owning instance. Each instance has its own set of background processes.
The components of an OPS database are:
- Cluster Manager - OS Vendor specific
- Distributed Lock Manager (DLM)
- Cluster Interconnect
- Shared Disk Array
The Cluster Group Services (CGS) layer has some OSD components (the node monitor interface) and the rest is built into the kernel. CGS has a key repository used by the DLM for communication and network-related activities. This layer provides the following:
- Internode messaging
- Group member consistency
- Cluster synchronization
- Process grouping, registration and deregistration
The DLM is an integral part of OPS and the RAC stack. In older versions the DLM API module had to rely on external OS routines to check the status of a lock; this was done using UNIX sockets and pipes. With the new IDLM the data is in the SGA of each instance and requires only a serialized lookup using latches and/or enqueues, and may require global coordination, the algorithm for which was built directly into the Oracle kernel. The IDLM's job is to track every lock granted to a resource; the memory structures required by the DLM are allocated out of the shared pool. The design of the DLM is such that it can survive node failures in all but one node of the cluster.
A user must acquire a lock before it can operate on any resource. Parallel Cache Management (PCM) coordinates and maintains the data blocks that exist within each instance's buffer cache so that the data viewed or requested by users is never inconsistent or incoherent. The PCM ensures that only one instance in a cluster can modify a block at any given time; other instances have to wait until the lock is released.
The DLM maintains information about all locks on a given resource; it nominates one node to manage all relevant lock information for a resource, and this node is referred to as the master node. Lock mastering is distributed among all nodes. Using the IPC layer, the DLM shares the load of mastering resources, which means that a user can lock a resource on one node but actually end up communicating with processes on another node.
In OPS 8i Oracle introduced Cache Fusion Stage 1, which brought a new background process called the Block Server Process (BSP). The BSP's main role was to ship consistent read (CR) versions of blocks across instances in a read/write contention scenario; this shipping is performed over a high-speed interconnect. Cache Fusion Stage 2, in Oracle 9i and 10g, addresses some of the issues with Stage 1, in that both types of blocks (CR and CUR) can be transferred using the interconnect. Since 8i the introduction of the GV$ views has meant that a DBA can view cluster-wide database and other statistics sitting on any node/instance of the cluster.
The limitations of OPS
are
- scalability is limited to the capacity of the node
- you cannot easily add additional nodes to an OPS cluster
- OPS requires third-party clustering software adding to
the expense and complexity.
- OPS requires RAW partitions and these can be difficult
to manage.
Oracle RAC addresses the limitations of OPS by extending Cache Fusion and using dynamic lock mastering. Oracle 10g RAC also comes with its own integrated clusterware and storage management framework, removing all dependencies on a third-party clusterware product. The latest Oracle RAC offers
- Availability - can be configured to have no SPOF even when running on low-spec hardware
- Scalability - multiple servers in a cluster manage a single database transparently (scale out), which basically means adding additional servers is easy
- Reliability - with improved code and monitoring, RAC has become very reliable
- Affordability - you can use low-cost hardware, although RAC itself is not going to come cheap
- Transparency - RAC looks and feels like a standard Oracle database to an application
RAC Architecture
Oracle Real Application Clusters allows multiple instances to access a single database; the instances run on multiple nodes. In a standard Oracle configuration a database can be mounted by only one instance, but in a RAC environment many instances can access a single database.
Oracle's RAC is heavily dependent on an efficient, highly reliable, high-speed private network called the interconnect; make sure when designing a RAC system that you get the best that you can afford.
The table below describes the differences between a standard Oracle database (single instance) and a RAC environment.
Component | Single Instance Environment | RAC Environment
SGA | Instance has its own SGA | Each instance has its own SGA
Background processes | Instance has its own set of background processes | Each instance has its own set of background processes
Datafiles | Accessed by only one instance | Shared by all instances (shared storage)
Control files | Accessed by only one instance | Shared by all instances (shared storage)
Online redo logfile | Dedicated for write/read to only one instance | Only one instance can write, but other instances can read during recovery and archiving. If an instance is shut down, log switches by other instances can force the idle instance's redo logs to be archived
Archived redo logfile | Dedicated to the instance | Private to the instance, but other instances will need access to all required archive logs during media recovery
Flash recovery area | Accessed by only one instance | Shared by all instances (shared storage)
Alert log and trace files | Dedicated to the instance | Private to each instance; other instances never read or write to those files
ORACLE_HOME | Multiple instances on the same server accessing different databases can use the same executable files | Same as single instance, plus it can be placed on a shared file system, allowing a common ORACLE_HOME for all instances in a RAC environment
The major components of an Oracle RAC system are:
- Shared disk system
- Oracle Clusterware
- Cluster Interconnects
- Oracle Kernel Components
The below diagram describes
the basic architecture of the Oracle RAC environment
Here is a list of the processes running on a freshly installed RAC.
With today's SAN and NAS disk storage systems, sharing storage is fairly easy and is required for a RAC environment; you can use the storage setups below:
- SAN (Storage Area Networks) - generally using fibre to
connect to the SAN
- NAS (Network Attached Storage) - generally using a network to connect to the NAS using either NFS or iSCSI
- JBOD - direct attached storage, the old traditional way
and still used by many companies as a cheap option
All of the above solutions can offer multi-pathing to reduce SPOFs within the RAC environment, and there is no reason not to configure multi-pathing: the cost of adding additional paths to the disk is low because most of the expense is paid out when configuring the first path, so an additional controller card and network/fibre cables are all that is needed.
The last thing to think about is how to set up the underlying disk structure; this is known as a RAID level. There are about 12 different RAID levels that I know of; here are the most common ones:
raid 0 (Striping)
A number of disks are concatenated together to give the appearance of one very large disk.
Advantages: improved performance; can create very large volumes.
Disadvantages: not highly available (if one disk fails, the volume fails).

raid 1 (Mirroring)
A single disk is mirrored by another disk; if one disk fails the system is unaffected as it can use its mirror.
Advantages: improved performance; highly available (if one disk fails the mirror takes over).
Disadvantages: expensive (requires double the number of disks).

raid 5
RAID stands for Redundant Array of Inexpensive Disks; the disks are striped with parity across 3 or more disks, and the parity is used in the event that one of the disks fails; the data on the failed disk is reconstructed by using the parity bit.
Advantages: improved performance (read only); not expensive.
Disadvantages: slow write operations (caused by having to create the parity bit).
There are many other RAID levels that can be used with a particular hardware environment: for example EMC storage uses RAID-S and HP storage uses Auto RAID, so check with the manufacturer for the solution that will provide you with the best performance and resilience.
Once you have your storage attached to the servers, you have three choices on how to set up the disks:
- Raw Volumes - normally used for performance benefits, however they are hard to manage and back up
- Cluster Filesystem - used to hold all the Oracle datafiles; can be used by Windows and Linux, but it is not widely used
- Automatic Storage Management (ASM) - Oracle's choice of storage management; it is a portable, dedicated and optimized cluster filesystem
I will only be discussing ASM, which I already have a topic on, called Automatic Storage Management.
Oracle Clusterware software is designed to run Oracle in cluster mode; it can support up to 64 nodes and it can even be used with a vendor cluster like Sun Cluster.
The Clusterware software
allows nodes to communicate with each other and forms the cluster that makes
the nodes work as a single logical server. The software is run by the Cluster
Ready Services (CRS) using the Oracle Cluster Registry (OCR) that records and
maintains the cluster and node membership information and the voting disk which
acts as a tiebreaker during communication failures. Consistent heartbeat
information travels across the interconnect to the voting disk when the cluster
is running.
The CRS has four
components
- OPROCd - Process Monitor Daemon
- CRSd - CRS daemon; the failure of this daemon results in a node being rebooted to avoid data corruption
- OCSSd - Oracle Cluster Synchronization Service Daemon
(updates the registry)
- EVMd - Event Manager Daemon
The OPROCd daemon provides the I/O fencing for the Oracle cluster; it uses the hangcheck timer or watchdog timer for cluster integrity. It is locked into memory and runs as a realtime process; failure of this daemon results in the node being rebooted. Fencing is used to protect the data: if a node were to have problems, fencing presumes the worst and protects the data by restarting the node in question; it is better to be safe than sorry.
The CRSd process manages resources such as starting and stopping the services and failover of the application resources; it also spawns separate processes to manage application resources. CRS manages the OCR and stores the current known state of the cluster, and it requires a public, private and VIP interface in order to run. OCSSd provides synchronization services among nodes; it provides access to the node membership and enables basic cluster services, including cluster group services and locking. Failure of this daemon causes the node to be rebooted to avoid split-brain situations.
The below functions are
covered by the OCSSd
- CSS provides basic Group Services support; it is a distributed group membership system that allows applications to coordinate activities to achieve a common result.
- Group services use vendor clusterware group services
when it is available.
- Lock services provide the basic cluster-wide
serialization locking functions, it uses the First In, First Out (FIFO)
mechanism to manage locking
- Node services uses OCR to store data and updates the
information during reconfiguration, it also manages the OCR data which is
static otherwise.
The last component is the Event Management Logger, which runs the EVMd process. The daemon spawns a process called evmlogger and generates events when things happen. The evmlogger spawns new child processes on demand and scans the callout directory to invoke callouts. Death of the EVMd daemon will not halt the instance, and the daemon will be restarted automatically.
Quick recap
CRS Process | Functionality | Failure of the Process | Runs as
OPROCd - Process Monitor | provides basic cluster integrity services | node restart | root
EVMd - Event Management | spawns a child process event logger and generates callouts | daemon automatically restarted, no node restart | oracle
OCSSd - Cluster Synchronization Services | basic node membership, group services, basic locking | node restart | oracle
CRSd - Cluster Ready Services | resource monitoring, failover and node recovery | daemon automatically restarted, no node restart | root
Cluster Ready Services (CRS) is a new component in 10g RAC; it is installed in a separate home directory called ORACLE_CRS_HOME. It is a mandatory component, but it can be used with a third-party cluster (Veritas, Sun Cluster); by default it manages the node membership functionality along with managing regular RAC-related resources and services.
RAC uses a membership scheme, thus any node wanting to join the cluster has to become a member. RAC can evict any member that it sees as a problem; its primary concern is protecting the data. You can add and remove nodes from the cluster and the membership increases or decreases; when network problems occur, membership becomes the deciding factor on which part stays as the cluster and which nodes get evicted. A voting disk is used for this, which I will talk about later.
The resource management framework manages the resources of the cluster (disks, volumes), thus you can only have one resource management framework per resource. Multiple frameworks are not supported, as they can lead to undesirable effects.
The Oracle Cluster Ready Services (CRS) use the registry to keep the cluster configuration; it should reside on shared storage and be accessible to all nodes within the cluster. This shared storage is known as the Oracle Cluster Registry (OCR) and it is a major part of the cluster; it is automatically backed up by the daemons (every 4 hours) and you can also back it up manually. The OCSSd uses the OCR extensively and writes changes to the registry.
The OCR keeps details of all resources and services; it stores name and value pairs of information, such as resources, that are used to manage the resource equivalents by the CRS stack. Resources within the CRS stack are components that are managed by CRS and carry information on their good/bad state and the callout scripts. The OCR is also used to supply bootstrap information (ports, nodes, etc); it is a binary file.
The OCR is loaded as a cache on each node; each node updates the cache, but only one node is allowed to write the cache to the OCR file, and that node is called the master. Enterprise Manager also uses the OCR cache; the OCR should be at least 100MB in size. The CRS daemon updates the OCR with the status of the nodes in the cluster during reconfigurations and failures.
The voting disk (or quorum disk) is shared by all nodes within the cluster; information about the cluster is constantly being written to the disk, and this is known as the heartbeat. If for any reason a node cannot access the voting disk it is immediately evicted from the cluster; this protects the cluster from split-brains (the Instance Membership Recovery algorithm, IMR, is used to detect and resolve split-brains) as the voting disk decides which part is really the cluster. The voting disk manages the cluster membership and arbitrates cluster ownership during communication failures between nodes. Voting is often confused with quorum; they are similar but distinct, and the text below details what each means.
The only vote that counts is the quorum member vote; the quorum member vote defines the cluster. If a node or group of nodes cannot achieve a quorum, they should not start any services because they risk conflicting with an established quorum.
The voting disk has to reside on shared storage; it is a small file (about 20MB) that can be accessed by all nodes in the cluster. In Oracle 10g R1 you can have only one voting disk, but in R2 you can have up to 32 voting disks, allowing you to eliminate any SPOFs.
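To see which voting disks are configured, the css votedisk query can be run from $ORA_CRS_HOME/bin (additional voting disks are added or removed with the add/delete variants; in 10gR2 this is normally done with the stack down or with -force):

crsctl query css votedisk      # lists each configured voting disk and its path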
The original virtual IP mechanism in Oracle was Transparent Application Failover (TAF); this had limitations and has now been replaced with cluster VIPs. The cluster VIPs will fail over to working nodes if a node should fail; these public IPs are configured in DNS so that users can access them. The cluster VIPs are different from the cluster interconnect IP addresses and are only used to access the database.
The cluster interconnect is used to synchronize the resources of the RAC cluster, and also to transfer some data from one instance to another. This interconnect should be private, highly available and fast with low latency; ideally it should be on a dedicated private network of at least 1Gb. Whatever hardware you are using, the NICs should use multi-pathing (Linux - bonding, Solaris - IPMP). You can use crossover cables in a QA/DEV environment, but they are not supported in a production environment; crossover cables also limit you to a two-node cluster.
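To confirm which interfaces the cluster is actually using for public traffic and for the interconnect, the following checks can be used (a sketch; the output obviously depends on your configuration):

oifcfg getif                   # lists each interface and whether it is public or cluster_interconnect
SQL> select inst_id, name, ip_address from gv$cluster_interconnects order by inst_id;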
The kernel components relate to the background processes, buffer cache and shared pool; managing these resources without conflicts and corruption requires special handling.
In RAC, as more than one instance is accessing the same resources, the instances require better coordination at the resource management level. Each node will have its own set of buffers but will be able to request and receive data blocks currently held in another instance's cache. The management of data sharing and exchange is done by the Global Cache Services (GCS).
All the resources in the
cluster group form a central repository called the Global Resource Directory
(GRD), which is distributed. Each instance masters some set of resources and
together all instances form the GRD. The resources are equally distributed
among the nodes based on their weight. The GRD is managed by two services
called Global Caches Services (GCS) and Global Enqueue Services (GES), together
they form and manage the GRD. When a node leaves the cluster, the GRD portion
of that instance needs to be redistributed to the surviving nodes, a similar
action is performed when a new node joins.
Each node has its own background processes and memory structures; there are additional processes beyond the norm to manage the shared resources, and these additional processes maintain cache coherency across the nodes.
Cache coherency is the
technique of keeping multiple copies of a buffer consistent between different
Oracle instances on different nodes. Global cache management ensures that
access to a master copy of a data block in one buffer cache is coordinated with
the copy of the block in another buffer cache.
The sequence of an operation would go as below:
- When instance A needs a block of data to modify, it reads the block from disk; before reading it must inform the GCS (DLM). The GCS keeps track of the lock status of the data block by keeping an exclusive lock on it on behalf of instance A.
- Now instance B wants to modify that same data block; it too must inform the GCS. The GCS will then request instance A to release the lock; thus the GCS ensures that instance B gets the latest version of the data block (including instance A's modifications) and then exclusively locks it on instance B's behalf.
- At any one point in time, only one instance
has the current copy of the block, thus keeping the integrity of the
block.
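The effect of Cache Fusion can be observed from the global statistics; below is a sketch of a query showing how many blocks each instance has received over the interconnect (the statistic names are the 10g/11g ones and may differ slightly between versions):

select inst_id, name, value
from   gv$sysstat
where  name in ('gc cr blocks received', 'gc current blocks received')
order  by inst_id, name;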
GCS maintains data coherency and coordination by keeping track of the lock status of each block that can be read from or written to by any node in the RAC. GCS is an in-memory database that contains information about current locks on blocks and about instances waiting to acquire locks. This is known as Parallel Cache Management (PCM). The Global Resource Manager (GRM) helps to coordinate and communicate the lock requests from Oracle processes between instances in the RAC. Each instance has a buffer cache in its SGA, and to ensure that each RAC instance obtains the block that it needs to satisfy a query or transaction, RAC uses two services, the GCS and GES, which maintain records of the lock status of each data file and each cached block using the GRD.
So what is a resource? It is an identifiable entity: it basically has a name or a reference, and it can be an area in memory, a disk file or an abstract entity. A resource can be owned or locked in various states (exclusive or shared). Any shared resource is lockable, and if it is not shared no access conflict will occur.
A global resource is a resource that is visible to all the nodes within the cluster. Data buffer cache blocks are the most obvious and most heavily used global resource; transaction enqueues and database data structures are other examples. GCS handles data buffer cache blocks and GES handles all the non-data-block resources.
All caches in the SGA are either global or local: the dictionary and buffer caches are global; the large and Java pool caches are local. Cache Fusion is used to read a block from the buffer cache of another instance instead of getting the block from disk; thus Cache Fusion moves current copies of data blocks between instances (hence why you need a fast private network), and the GCS manages the block transfers between the instances.
Finally we get to the
processes
Oracle RAC Daemons and Processes

LMSn - Lock Manager Server process (GCS)
This is the Cache Fusion part and the most active process; it handles the consistent copies of blocks that are transferred between instances. It receives requests from LMD to perform lock requests. It rolls back any uncommitted transactions. There can be up to ten LMS processes running and they can be started dynamically if demand requires it.
They manage lock manager service requests for GCS resources and send them to a service queue to be handled by the LMSn process. It also handles global deadlock detection and monitors for lock conversion timeouts.
As a performance gain you can increase this process's priority to make sure CPU starvation does not occur.
You can see the statistics of this daemon by looking at the view X$KJMSDP.

LMON - Lock Monitor process (GES)
This process manages the GES; it maintains consistency of the GCS memory structures in case of process death. It is also responsible for cluster reconfiguration and lock reconfiguration (node joining or leaving); it checks for instance deaths and listens for local messaging.
A detailed log file is created that tracks any reconfigurations that have happened.

LMD - Lock Manager Daemon (GES)
This manages the enqueue manager service requests for the GCS. It also handles deadlock detection and remote resource requests from other instances.
You can see the statistics of this daemon by looking at the view X$KJMDDP.

LCK0 - Lock process (GES)
Manages instance resource requests and cross-instance call operations for shared resources. It builds a list of invalid lock elements and validates lock elements during recovery.

DIAG - Diagnostic daemon
This is a lightweight process; it uses the DIAG framework to monitor the health of the cluster. It captures information for later diagnosis in the event of failures. It will perform any necessary recovery if an operational hang is detected.
RAC Administration
I am only going to talk
about RAC administration, if you need Oracle administration then see my Oracle section.
It is recommended that
the spfile (binary parameter file) is shared between all nodes within the
cluster, but it is possible that each instance can have its own spfile. The
parameters can be grouped into three categories
Unique parameters
These parameters are unique to each instance; examples would be instance_name, thread and undo_tablespace.

Identical parameters
Parameters in this category must be the same for each instance; examples would be db_name and control_files.

Neither unique nor identical parameters
Parameters that are not in either of the above categories; examples would be db_cache_size, large_pool_size, local_listener and gcs_server_processes.
The main unique
parameters that you should know about are
- instance_name - defines the name of the Oracle instance (the default is the value of the ORACLE_SID variable)
- instance_number - a unique number for each instance; must be greater than 0 but not greater than the maximum number of instances (MAXINSTANCES)
- thread - specifies the set of redo log files to be used by the instance
- undo_tablespace - specifies the name of the undo tablespace to be used by the instance
- rollback_segments - you should use Automatic Undo Management instead
- cluster_interconnects - use only if Oracle has trouble picking the correct interconnect
The identical parameters that you should know about are below; you can use the following query to view all of them:
select name, isinstance_modifiable from v$parameter where isinstance_modifiable = 'FALSE' order by name;
- cluster_database - options are true or false; it mounts the control file in either shared (cluster) or exclusive mode. Use false in the following cases:
- Converting from noarchivelog mode to archivelog mode and vice versa
- Enabling the flashback database feature
- Performing media recovery on a system tablespace
- Maintenance of a node
- active_instance_count -
used for primary/secondary RAC environments
- cluster_database_instances - specifies the number of instances that will be
accessing the database (set to maximum # of nodes)
- dml_locks -
specifies the number of DML locks for a particular instance (only change
if you get ORA-00055 errors)
- gc_files_to_locks -
specify the number of global locks to a data file, changing this disables
the Cache Fusion.
- max_commit_propagation_delay - influences the mechanism Oracle uses to
synchronize the SCN among all instances
- instance_groups -
specify multiple parallel query execution groups and assigns the current
instance to those groups
- parallel_instance_group - specifies the group of instances to be used for
parallel query execution
- gcs_server_processes -
specify the number of lock manager server (LMS) background processes used
by the instance for Cache Fusion
- remote_listener -
register the instance with listeners on remote nodes.
syntax for parameter file:
<instance_name>.<parameter_name>=<parameter_value>
inst1.db_cache_size = 1000000
*.undo_management=auto

example:
alter system set db_2k_cache_size=10m scope=spfile sid='inst1';
Note: use the sid option to specify a particular instance.
The srvctl command
is used to start/stop an instance, you can also use sqlplus to
start and stop the instance
start all instances:
srvctl start database -d <database> -o <option>
Note: starts the listeners if not already running; you can use the -o option to specify startup options: force, open, mount, nomount

stop all instances:
srvctl stop database -d <database> -o <option>
Note: the listeners are not stopped; you can use the -o option to specify shutdown options: immediate, abort, normal, transactional

start/stop a particular instance:
srvctl [start|stop] instance -d <database> -i <instance>,<instance>
To recap on undo management you can see my undo section; instances in a RAC do not share undo, they each have a dedicated undo tablespace. Using the undo_tablespace parameter, each instance can point to its own undo tablespace:
undo tablespace:
instance1.undo_tablespace=undo_tbs1
instance2.undo_tablespace=undo_tbs2
With today's Oracle you should be using automatic undo management; again, I have a detailed discussion on AUM in my undo section.
I have already discussed temporary tablespaces; in a RAC environment you should set up a temporary tablespace group, and this group is then used by all instances of the RAC. Each instance creates a temporary segment in the temporary tablespace it is using. If an instance is running a large sort, temporary segments can be reclaimed from segments of other instances in that tablespace.
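A minimal sketch of creating such a group (the tablespace and group names are made up, and the tempfile clause assumes an ASM disk group called +DATA):

SQL> create temporary tablespace temp1 tempfile '+DATA' size 512m tablespace group temp_grp;
SQL> create temporary tablespace temp2 tempfile '+DATA' size 512m tablespace group temp_grp;
SQL> alter database default temporary tablespace temp_grp;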
useful views:
gv$sort_segment - explore current and maximum sort segment usage statistics (check the columns freed_extents and free_requests; if they grow, increase the tablespace size)
gv$tempseg_usage - explore temporary segment usage details such as name, SQL, etc
v$tempfile - identify temporary datafiles being used for the temporary tablespace
I have already discussed redo logs; in a RAC environment every instance has its own set of redo logs. Each instance has exclusive write access to its own redo logs, but each instance can read the other instances' redo logs, which is used for recovery. Redo logs are located on the shared storage so that all instances can access each other's redo logs. The process of changing the archive mode is a little different from standard Oracle:
archive mode (RAC):
SQL> alter system set cluster_database=false scope=spfile sid='prod1';
srvctl stop database -d <database>
SQL> startup mount
SQL> alter database archivelog;
SQL> alter system set cluster_database=true scope=spfile sid='prod1';
SQL> shutdown;
srvctl start database -d prod
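When a node (and therefore an instance) is added, the new instance also needs its own redo thread; a sketch of creating and enabling one (thread and group numbers, sizes and the +DATA disk group are hypothetical):

SQL> alter database add logfile thread 3 group 7 ('+DATA') size 512m, group 8 ('+DATA') size 512m;
SQL> alter database enable public thread 3;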
Again, I have already talked about flashback; there is no difference in a RAC environment apart from the setup:
flashback (RAC):
## make sure that the database is running in archive log mode
SQL> archive log list
## set up the flashback
SQL> alter system set cluster_database=false scope=spfile sid='prod1';
SQL> alter system set DB_RECOVERY_FILE_DEST_SIZE=200M scope=spfile;
SQL> alter system set DB_RECOVERY_FILE_DEST='/ocfs2/flashback' scope=spfile;
srvctl stop database -d prod1
SQL> startup mount
SQL> alter database flashback on;
SQL> shutdown;
srvctl start database -d prod1
We have already come across srvctl above; this command is called the server control utility. It can be divided into two categories:
- Database configuration tasks
- Database instance control tasks
Oracle stores database
configuration in a repository, the configuration is stored in the Oracle
Cluster Registry (OCR) that was created when RAC was installed, it will be
located on the shared storage. Srvctl uses CRS to communicate and perform startup and
shutdown commands on other nodes.
I suggest that you look up the command, but I will provide a few examples:
display the registered databases:
srvctl config database

status:
srvctl status database -d <database>
srvctl status instance -d <database> -i <instance>
srvctl status nodeapps -n <node>
srvctl status service -d <database>
srvctl status asm -n <node>

stopping/starting:
srvctl stop database -d <database>
srvctl stop instance -d <database> -i <instance>,<instance>
srvctl stop service -d <database> [-s <service>,<service>] [-i <instance>,<instance>]
srvctl stop nodeapps -n <node>
srvctl stop asm -n <node>
srvctl start database -d <database>
srvctl start instance -d <database> -i <instance>,<instance>
srvctl start service -d <database> -s <service>,<service> -i <instance>,<instance>
srvctl start nodeapps -n <node>
srvctl start asm -n <node>

adding/removing:
srvctl add database -d <database> -o <oracle_home>
srvctl add instance -d <database> -i <instance> -n <node>
srvctl add service -d <database> -s <service> -r <preferred_list>
srvctl add nodeapps -n <node> -o <oracle_home> -A <name|ip>/netmask
srvctl add asm -n <node> -i <asm_instance> -o <oracle_home>
srvctl remove database -d <database> -o <oracle_home>
srvctl remove instance -d <database> -i <instance> -n <node>
srvctl remove service -d <database> -s <service> -r <preferred_list>
srvctl remove nodeapps -n <node> -o <oracle_home> -A <name|ip>/netmask
srvctl remove asm -n <node>
Services are used to
manage the workload in Oracle RAC, the important features of services are
- used to distribute the workload
- can be configured to provide high availability
- provide a transparent way to direct workload
The view v$services contains
information about services that have been started on that instance, here is a
list from a fresh RAC installation
The table above is
described below
- Goal -
allows you to define a service goal using service time, throughput or none
- Connect Time Load Balancing Goal - listeners and mid-tier servers contain current information
about service performance
- Distributed Transaction Processing - used for distributed transactions
- AQ_HA_Notifications -
information about nodes being up or down will be sent to mid-tier servers
via the advance queuing mechanism
- Preferred and Available Instances - the preferred instances for a service,
available ones are the backup instances
You can administer
services using the following tools
- DBCA
- EM (Enterprise Manager)
- The DBMS_SERVICE package
- Server Control (srvctl)
Two services are created
when the database is first installed, these services are running all the time
and cannot be disabled.
- sys$background -
used by an instance's background processes only
- sys$users -
when users connect to the database without specifying a service they use
this service
add:
srvctl add service -d D01 -s BATCH_SERVICE -r node1,node2 -a node3
Note: the options are described below
-d - the database
-s - the service
-r - the preferred nodes the service will run on
-a - the available node to run on if the nodes in the -r list are not running
remove:
srvctl remove service -d D01 -s BATCH_SERVICE

start:
srvctl start service -d D01 -s BATCH_SERVICE

stop:
srvctl stop service -d D01 -s BATCH_SERVICE

status:
srvctl status service -d D01 -s BATCH_SERVICE
service (example):
## create the job class
BEGIN
  DBMS_SCHEDULER.create_job_class(
    job_class_name => 'BATCH_JOB_CLASS',
    service        => 'BATCH_SERVICE');
END;
/
## grant the privilege to execute jobs in the job class
grant execute on sys.BATCH_JOB_CLASS to vallep;
## create a job associated with the job class
BEGIN
  DBMS_SCHEDULER.create_job(
    job_name        => 'my_user.batch_job_test',
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'BEGIN NULL; END;',   -- placeholder block; the original example's action was truncated
    repeat_interval => 'FREQ=DAILY;',
    job_class       => 'SYS.BATCH_JOB_CLASS',
    end_date        => NULL,
    enabled         => TRUE,
    comments        => 'Test batch job to show RAC services');
END;
/
## assign a job class to an existing job
exec dbms_scheduler.set_attribute('MY_BATCH_JOB', 'JOB_CLASS', 'BATCH_JOB_CLASS');
CRS is Oracle's clusterware software; you can use it with other third-party clusterware software, though this is not required (apart from on HP Tru64).
CRS is started automatically when the server starts; you should only stop it in the following situations:
- Applying a patch set to $ORA_CRS_HOME
- O/S maintenance
- Debugging CRS problems
CRS Administration

Starting:
## starting CRS using Oracle 10g R1
not possible
## starting CRS using Oracle 10g R2
$ORA_CRS_HOME/bin/crsctl start crs

Stopping:
## stopping CRS using Oracle 10g R1
srvctl stop database -d <database>
srvctl stop asm -n <node>
srvctl stop nodeapps -n <node>
/etc/init.d/init.crs stop
## stopping CRS using Oracle 10g R2
$ORA_CRS_HOME/bin/crsctl stop crs

disabling/enabling:
## stop CRS restarting after a reboot (persistent across reboots)
## Oracle 10g R1
/etc/init.d/init.crs [disable|enable]
## Oracle 10g R2
$ORA_CRS_HOME/bin/crsctl [disable|enable] crs

Checking:
$ORA_CRS_HOME/bin/crsctl check crs
$ORA_CRS_HOME/bin/crsctl check evmd
$ORA_CRS_HOME/bin/crsctl check cssd
$ORA_CRS_HOME/bin/crsctl check crsd
$ORA_CRS_HOME/bin/crsctl check install -wait 600
Resource Applications (CRS Utilities)

Status:
$ORA_CRS_HOME/bin/crs_stat
$ORA_CRS_HOME/bin/crs_stat -t
$ORA_CRS_HOME/bin/crs_stat -ls
$ORA_CRS_HOME/bin/crs_stat -p
Note: -t gives a more readable display, -ls a permission listing, -p the parameters

create profile:
$ORA_CRS_HOME/bin/crs_profile

register/unregister application:
$ORA_CRS_HOME/bin/crs_register
$ORA_CRS_HOME/bin/crs_unregister

start/stop an application:
$ORA_CRS_HOME/bin/crs_start
$ORA_CRS_HOME/bin/crs_stop

resource permissions:
$ORA_CRS_HOME/bin/crs_getperm
$ORA_CRS_HOME/bin/crs_setperm

relocate a resource:
$ORA_CRS_HOME/bin/crs_relocate
Nodes
|
||
member number/name
|
olsnodes -n
Note: the olsnodes command is located in $ORA_CRS_HOME/bin |
|
local node name
|
olsnodes -l
|
|
activates logging
|
olsnodes -g
|
|
Oracle Interfaces
|
||
Display
|
oifcfg getif
|
|
Delete
|
oifcfg delif -global
|
|
Set
|
oifcfg setif -global <interface name>/<subnet>:public
oifcfg setif -global <interface name>/<subnet>:cluster_interconnect |
|
Global Services Daemon Control
|
||
Starting
|
gsdctl start
|
|
Stopping
|
gsdctl stop
|
|
Status
|
gsdctl status
|
|
Cluster Configuration (clscfg is used during installation)
|
||
create a new
configuration
|
clscfg -install
Note: the clscfg command is located in $ORA_CRS_HOME/bin |
|
upgrade or downgrade
an existing configuration
|
clscfg -upgrade
clscfg -downgrade |
|
add or delete a node
from the configuration
|
clscfg -add
clscfg -delete |
|
create a special
single-node configuration for ASM
|
clscfg -local
|
|
brief listing of
terminology used in the other nodes
|
clscfg -concepts
|
|
used for tracing
|
clscfg -trace
|
|
Help
|
clscfg -h
|
|
Cluster Name Check
|
||
print cluster name
|
cemutlo -n
Note: in Oracle 9i the utility was called "cemutls", the command is located in $ORA_CRS_HOME/bin |
|
print the
clusterware version
|
cemutlo -w
Note: in Oracle 9i the utility was called "cemutls" |
|
Node Scripts
|
||
Add Node
|
||
Delete Node
|
||
As you already know the
OCR is the registry that contains information such as
- Node list
- Node membership mapping
- Database instance, node and other mapping information
- Characteristics of any third-party applications
controlled by CRS
The file location is
specified during the installation, the file pointer indicating the OCR device
location is the ocr.loc, this can be in either of the following
- linux - /etc/oracle
- solaris - /var/opt/oracle
The file contents look
something like below, this was taken from my installation
ocr.loc
|
ocrconfig_loc=/u02/oradata/racdb/OCRFile
ocrmirrorconfig_loc=/u02/oradata/racdb/OCRFile_mirror
local_only=FALSE |
OCR is important to the RAC
environment and any problems must be actioned immediately, the commands can be
found in $ORA_CRS_HOME/bin
OCR Utilities
|
||
log file
|
$ORA_CRS_HOME/log/<hostname>/client/ocrconfig_<pid>.log
|
|
Checking
|
ocrcheck
Note: will return the OCR version, total space allocated, space used, free space, location of each device and the result of the integrity check |
|
dump contents
|
ocrdump
Note: by default it dumps the contents into a file named OCRDUMPFILE in the current directory |
|
export/import
|
ocrconfig -export <file>
ocrconfig -import <file> |
|
backup/restore
|
# show backups
ocrconfig -showbackup

# to change the location of the backup, you can even specify an ASM diskgroup
ocrconfig -backuploc <path|+asm>

# perform a backup, will use the location specified by the -backuploc option
ocrconfig -manualbackup

# perform a restore
ocrconfig -restore <file>

# delete a backup
ocrconfig -delete <file>

Note: there are many more options, see the ocrconfig man page |
|
add/remove/replace
|
## add/relocate the ocrmirror file to the specified location
ocrconfig -replace ocrmirror '/ocfs2/ocr2.dbf'

## relocate an existing OCR file
ocrconfig -replace ocr '/ocfs1/ocr_new.dbf'

## remove the OCR or OCRMirror file
ocrconfig -replace ocr
ocrconfig -replace ocrmirror |
|
The voting disk as I
mentioned in the architecture is used to resolve membership issues in
the event of a partitioned cluster, the voting disk protects data integrity.
querying
|
crsctl query css
votedisk
|
Adding
|
crsctl add css
votedisk <file>
|
deleting
|
crsctl delete css votedisk
<file>
|
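On 10g the voting disk can also be backed up with a straight dd copy, which is worth doing after adding or removing nodes; this is only a sketch, the paths and block size below are placeholders for your own locations:
## backup the voting disk (10g), paths are placeholders
dd if=/u02/oradata/racdb/CSSFile of=/backup/votedisk.bak bs=4k
## restore it the same way with if and of reversed (with the cluster stack down)
dd if=/backup/votedisk.bak of=/u02/oradata/racdb/CSSFile bs=4k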
RAC Backups and Recovery
Backups and recovery are
very similar to a single instance database. This article covers only the
specific issues that surround RAC backups and recovery, I have already written
an article on standard Oracle backups and recovery.
Backups can be different
depending on the size of the company
- small company - may use tools such as tar, cpio, rsync
- medium/large company - Veritas Netbackup, RMAN
- Enterprise company - SAN mirroring with a backup option
like Netbackup or RMAN
Oracle RAC can use all
the above backup technologies, but Oracle prefers you to use RMAN, Oracle's own
backup solution.
Oracle backups can be
taken hot or cold, a backup will comprise the following
- Datafiles
- Control Files
- Archive redolog files
- Parameter files (init.ora or SPFILE)
Databases have now grown
to very large sizes, well over a terabyte in some cases, thus tape
backups are often not used in these cases and sophisticated disk mirroring has taken
their place. RMAN can be used in either a tape or disk solution, it can even
work with third-party solutions such as Veritas NetBackup.
In an Oracle RAC
environment it is critical to make sure that all archive redolog files are
located on shared storage, this is required when trying to recover the
database, as you need access to all archive redologs. RMAN can use parallelism
when recovering, the node that performs the recovery must have access to all
archived redologs, however, during recovery only one node applies the archived
logs as in a standard single instance configuration.
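As a rough sketch of the RMAN approach (the connect strings and channel count below are placeholders, not a recommendation), channels can be allocated against different instances so that the backup workload is spread across the cluster:
RMAN> run {
  allocate channel ch1 device type disk connect 'sys/password@racdb1';
  allocate channel ch2 device type disk connect 'sys/password@racdb2';
  backup database plus archivelog delete input;
}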
Oracle RAC also supports
Oracle Data Guard, thus you can have a primary database configured as a RAC and
a standby database also configured as a RAC.
In a RAC environment
there are two types of recovery
- Crash Recovery -
means that all instances have failed, thus they all need to be recovered
- Instance Recovery -
means that one or more instances have failed, these instances can then be
recovered by the surviving instances
Redo information
generated by an instance is called a thread of redo. All log files
for that instance belong to this thread, an online redolog file belongs to a
group and the group belongs to a thread. Details about log group file and
thread association details are stored in the control file. RAC databases have
multiple threads of redo, each instance has one active thread, the threads are
parallel timelines and together form a stream. A stream consists of all the
threads of redo information ever recorded, the streams form the timeline of
changes performed to the database.
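You can see the thread-to-redo-group mapping from the data dictionary; a simple sketch using standard views:
## which redo log groups belong to which thread, and which thread each instance uses
SQL> select thread#, group#, sequence#, status from v$log order by thread#, group#;
SQL> select thread#, instance, enabled from v$thread;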
Oracle records the
changes made to a database, these are called change vectors. Each
vector is a description of a single change, usually a single block. A redo
record contains one or more change vectors and is located by its Redo
Byte Address (RBA) and points to a specific location in the redolog
file (or thread). It will consist of three components
- log sequence number
- block number within the log
- byte number within the block
Checkpoints are the same
in a RAC environment and a single instance environment, I have already
discussed checkpoints, when a checkpoint needs to be triggered,
Oracle will look for the thread checkpoint that has the lowest checkpoint SCN,
all blocks in memory that contain changes made prior to this SCN across all
instances must be written out to disk. I have discussed how to control recovery in my Oracle section and this applies to
RAC as well.
Crash recovery is
basically the same for a single instance and a RAC environment, I have a
complete recovery section in my Oracle section, here is a
note detailing the difference
For a single instance the following is the recovery process
- The on-disk block is the starting point for the
recovery, Oracle will only consider the block on the disk so the recovery
is simple. Crash recovery will automatically happen using the online redo
logs that are current or active
- The starting point is the last full checkpoint. The
starting point is provided by the control file and compared against the
same information in the data file headers, only the changes need to be
applied
- The block specified in the redolog is read into cache,
if the block has the same timestamp as the redo record (SCN match) the
redo is applied.
For a RAC instance the
following is the recovery process
- A foreground process in a surviving instance detects an
"invalid block lock" condition when a attempt is made to read a
block into the buffer cache. This is an indication that an instance has
failed (died)
- The foreground process sends a notification to instance
system monitor (SMON) which begins to search for dead instances. SMON maintains
a list of all the dead instances and invalid block locks. Once the
recovery and cleanup has finished this list is updated.
- The death of another instance is detected if the
current instance is able to acquire that instance's redo thread locks, which
are usually held by an open and active instance.
Oracle RAC uses a
two-pass recovery, because a data block could have been modified in any of the
instances (dead or alive), so it needs to obtain the latest version of the
dirty block and it uses PI (Past Image) and Block Written Record (BWR) to
achieve this in a quick and timely fashion.
Block Written Record
(BWR)
|
The cache aging and
incremental checkpoint system would write a number of blocks to disk, when
the DBWR completes a data block write operation, it also adds a redo record
that states the block has been written (data block address and SCN). DBWn can
write block written records (BWRs) in batches, though in a lazy fashion. In
RAC a BWR is written when an instance writes a block covered by a global resource
or when it is told that its past image (PI) buffer it is holding is no longer
necessary.
|
Past Image (PI)
|
This is what makes
RAC cache fusion work, it eliminates the write/write contention problem that
existed in the OPS database. A PI is a copy of a globally dirty block and is
maintained in the database buffer cache, it can be created and saved when a
dirty block is shipped across to another instance after setting the resource
role to global. The GCS is responsible for informing an instance that its PI
is no longer needed after another instance writes a newer (current) version
of the same block. PI's are discarded when GCS posts all the holding
instances that a new and consistent version of that particular block is now
on disk.
I go into more details about PI's in my cache fusion section. |
The first pass does
not perform the actual recovery but merges and reads redo threads to create a
hash table of the blocks that need recovery and that are not known to have been
written back to the datafiles. The checkpoint SCN is needed as a starting point
for the recovery, all modified blocks are added to the recovery set (an
organized hash table). A block will not be recovered if its BWR version is
greater than the latest PI in any of the buffer caches.
In the second pass SMON
rereads the merged redo stream (by SCN) from all threads needing recovery, the
redolog entries are then compared against a recovery set built in the first
pass and any matches are applied to the in-memory buffers as in a single pass
recovery. The buffer cache is flushed and the checkpoint SCN for each thread is
updated upon successful completion.
I have a detailed
section on cache fusion, this section covers the recovery, cache fusion
is only used in RAC environments, as additional steps are required, such as GRD
reconfiguration, internode communication, etc. There are two types of recovery
- Crash Recovery - all instances have failed
- Instance Recovery - one instance has failed
In both cases the
threads from failed instances need to be merged, in an instance recovery SMON
will perform the recovery whereas in a crash recovery a foreground process
performs the recovery.
The main features
(advantages) of cache fusion recovery are
- Recovery cost is proportional to the number of
failures, not the total number of nodes
- It eliminates disk reads of blocks that are present in
a surviving instance's cache
- It prunes recovery set based on the global resource
lock state
- The cluster is available after an initial log scan,
even before recovery reads are complete
In cache fusion the
starting point for recovery of a block is its most current PI version, this could
be located on any of the surviving instances and multiple PI blocks of a
particular buffer can exist.
Remastering is the term used that describes the operation
whereby a node attempting recovery tries to own or master the resource(s) that
were once mastered by another instance prior to the failure. When one instance
leaves the cluster, the GRD of that instance needs to be redistributed to the
surviving nodes. RAC uses an algorithm called lazy remastering to
remaster only a minimal number of resources during a reconfiguration. The
entire Parallel Cache Management (PCM) lock space remains invalid while the DLM
and SMON complete the below steps
- IDLM master node discards locks that are held by dead
instances, the space reclaimed by this operation is used to remaster
locks that are held by the surviving instances and were previously
mastered by a dead instance
- SMON issues a message saying that it has acquired the
necessary buffer locks to perform recovery
Let's look at an example
of what happens during remastering, let's presume the following
- Instance A masters resources 1, 3, 5 and 7
- Instance B masters resources 2, 4, 6, and 8
- Instance C masters resources 9, 10, 11 and 12
Instance B is removed
from the cluster, only the resources from instance B are evenly remastered
across the surviving nodes (no resources on instances A and C are affected),
this reduces the amount of work the RAC has to perform, likewise when an
instance joins a cluster only a minimum amount of resources are remastered to the
new instance.
Before Remastering
|
After Remastering
|
You can control the
remastering process with a number of parameters
_gcs_fast_config
|
enables fast
reconfiguration for gcs locks (true|false)
|
_lm_master_weight
|
controls which
instance will hold or (re)master more resources than others
|
_gcs_resources
|
controls the number
of resources an instance will master at a time
|
you can also force a
dynamic remastering (DRM) of an object using oradebug
force dynamic
remastering (DRM)
|
## Obtain the OBJECT_ID from the below view
SQL> select * from v$gcspfmaster_info;

## Determine who masters it
SQL> oradebug setmypid
SQL> oradebug lkdebug -a <OBJECT_ID>

## Now remaster the resource
SQL> oradebug setmypid
SQL> oradebug lkdebug -m pkey <OBJECT_ID> |
The steps of a GRD reconfiguration
are as follows
- Instance death is detected by the cluster manager
- Request for PCM locks are frozen
- Enqueues are reconfigured and made available
- DLM recovery
- GCS (PCM lock) is remastered
- Pending writes and notifications are processed
- I Pass recovery
- The instance recovery (IR) lock is acquired by SMON
- The recovery set is prepared and built, memory space
is allocated in the SMON PGA
- SMON acquires locks on buffers that need recovery
- II Pass recovery
- II pass recovery is initiated, database is partially
available
- Blocks are made available as they are recovered
- The IR lock is released by SMON, recovery is then
complete
- The system is available
Graphically it looks
like below
RAC Performance
I have already discussed
basic Oracle tuning, in this section I will mainly discuss Oracle
RAC tuning. First let's review the best practices of Oracle design regarding
the application and database
- Optimize connection management, ensure that the middle
tier and programs that connect to the database are efficient in connection
management and do not log on or off repeatedly
- Tune the SQL using the available tools such as ADDM and SQL Tuning Advisor
- Ensure that applications use bind variables, cursor_sharing was
introduced to solve this problem
- Use packages and
procedures (because
they are compiled) in place of anonymous PL/SQL blocks and big SQL
statements
- Use locally
managed tablespaces and automatic
segment space management to
help performance and simplify database administration
- Use automatic undo
management and temporary
tablespace to
simplify administration and increase performance
- Ensure you use a large cache value when using sequences (see the example after this list), unless you cannot afford to lose sequence numbers during a
crash
- Avoid using DDL in production, it increases
invalidations of the already parsed SQL statements and they need to be
recompiled
- Partition tables and indexes to reduce index leaf
contention (buffer busy global cr problems)
- Optimize contention on data blocks (hot spots) by
avoiding small tables with too many rows in a block
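For the sequence caching point above, a minimal sketch (the sequence name and cache size are only examples, size the cache to your own insert rate):
## create or alter an application sequence with a large cache and no ordering
SQL> create sequence my_app_seq cache 1000 noorder;
SQL> alter sequence my_app_seq cache 1000;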
Now we can review RAC
specific best practices
- Consider using application partitioning (see below)
- Consider restricting DML-intensive users to using one
instance, thus reducing cache contention
- Keep read-only tablespaces away from DML-intensive
tablespaces, they only require minimum resources thus optimizing Cache
Fusion performance
- Avoid auditing in RAC, this causes more shared library
cache locks
- Use full table scans sparingly, they cause the GCS to service lots of block requests, see
the v$sysstat column "table scans (long
tables)"
- if the application uses lots of logins, increase the
cache value of the sys.audses$ sequence
Workload partitioning is
a certain type of workload that is executed on an instance, that is,
partitioning allows users who access the same set of data to log on to the same
instance. This limits the amount of data that is shared between instances thus
saving resources used for messaging and Cache Fusion data block transfer.
You should consider the
following when deciding to implement partitioning
- If the CPU and private interconnect are of high
performance then there is no need to partition
- Partitioning does add complexity, thus if you can
increase CPU and interconnect performance instead, all the better
- Only partition if performance is being impacted
- Test both partitioning and non-partitioning to see what
difference it makes, then decide if partitioning is worth it
An event is
an operation or particular function that the Oracle kernel performs on behalf
of a user or an Oracle background process, events have specific names like database
event. Whenever a session has to wait for something, the wait time is
tracked and charged to the event that was associated with that wait. Events
that are associated with all such waits are known as wait events.
There are a number of wait classes
- Commit
- Scheduler
- Application
- Configuration
- User I/O
- System I/O
- Concurrency
- Network
- Administrative
- Cluster
- Idle
- Other
There are over 800
different events spread across the above list, however you probably will only
deal with about 50 or so that can improve performance.
When a session requests
access to a data block it sends a request to the lock master for proper
authorization, the request does not know if it will receive the block via Cache
Fusion or a permission to read from the disk. Two placeholder events
- global cache cr request (consistent read - cr)
- global cache curr request (current - curr)
keep track of the time a
session spends in this state. There are a number of types of wait events
regarding access to a data block
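To get a feel for which global cache events dominate on a system before drilling into the individual types below, the standard wait event views can be queried; a minimal sketch:
SQL> select inst_id, event, total_waits, time_waited
     from gv$system_event
     where event like 'gc%'
     order by time_waited desc;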
Wait Event
|
Contention type
|
Description
|
gc current block
2-way
|
write/write
|
an instance requests
authorization for a block to be accessed in current mode to modify a block,
the instance mastering the resource receives the request. The master has the
current version of the block and sends the current copy of the block to the
requestor via Cache Fusion and keeps a Past Image (PI)
If you get this then
do the following
|
gc current block
3-way
|
write/write
|
an instance requests
authorization for a block to be accessed in current mode to modify a block,
the instance mastering the resource receives the request and forwards it to
the current holder of the block, asking it to relinquish ownership. The
holding instance sends a copy of the current version of the block to the
requestor via Cache Fusion and transfers the exclusive lock to the requesting
instance. It also keeps a past Image (PI).
Use the above
actions to increase the performance
|
gc cr block
2-way
|
write/read
|
The difference with
the one above is that this sends a copy of the block thus keeping the current
copy.
|
gc cr block
3-way
|
write/read
|
The difference with
the one above is that this sends a copy of the block thus keeping the current
copy.
|
gc current block
busy
|
write/write
|
The requestor will
eventually get the block via cache fusion but it is delayed due to one of the
following
If you get this then
do the following
|
gc current buffer
busy
|
local
|
This is the same as
above (gc current block busy), the difference is that another session on the
same instance also has requested the block (hence local contention)
|
gc current block
congested
|
none
|
This is caused by
heavy congestion on the GCS, thus CPU resources are stretched
|
Oracle RAC uses a
queuing mechanism to ensure proper use of shared resources, it is called Global
Enqueue Services (GES). Enqueue wait is the time spent by a session waiting for
a shared resource, here are some examples of enqueues:
- updating the control file (CF enqueue)
- updating an individual row (TX enqueue)
- exclusive lock on a table (TM enqueue)
Some enqueues are managed
by the instance itself, others are used globally; GES is responsible for
coordinating the global resources. The formula used to calculate the number of
enqueue resources is as below
GES Resources = DB_FILES + DML_LOCKS + ENQUEUE_RESOURCES + PROCESSES + TRANSACTIONS x (1 + (N - 1)/N)

N = number of RAC instances
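As a rough worked example, reading the formula as written above and using purely illustrative parameter values for a two-node cluster (DB_FILES=200, DML_LOCKS=1000, ENQUEUE_RESOURCES=2000, PROCESSES=300, TRANSACTIONS=330, N=2):
GES Resources = 200 + 1000 + 2000 + 300 + 330 x (1 + (2 - 1)/2)
              = 3500 + 330 x 1.5
              = 3995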
displaying enqueues
stats
|
SQL> column current_utilization heading current
SQL> column max_utilization heading max_usage
SQL> column initial_allocation heading initial
SQL> column resource_limit format a23
SQL> select * from v$resource_limit; |
I have already
discussed AWR in a single instance environment, so for a
quick refresh take a look and come back here to see how you can use it in a RAC
environment.
From a RAC point of view
there are a number of RAC-specific sections that you need to look at in the
AWR, in the report section is an AWR report of my home RAC environment, you can view
the whole report here.
RAC AWR Section
|
Report
|
Description
|
Number of Instances
|
lists the number of
instances from the beginning and end of the AWR report
|
|
Instance global
cache load profile
|
information about
the interinstance cache fusion data block and messaging traffic; because
my AWR report is lightweight here is an example from a more heavily
used RAC system

Global Cache Load Profile
~~~~~~~~~~~~~~~~~~~~~~~~~            Per Second    Per Transaction
                                    -----------    ---------------
Global Cache blocks received:            315.37              12.82
Global Cache blocks served:              240.30               9.67
GCS/GES messages received:               525.16              20.81
GCS/GES messages sent:                   765.32              30.91

The first two statistics indicate the number of blocks transferred to or from this instance, thus if you are using an 8K block size

Sent:     240 x 8,192 = 1,966,080 bytes/sec = 2.0 MB/sec
Received: 315 x 8,192 = 2,580,480 bytes/sec = 2.6 MB/sec

to determine the amount of network traffic generated due to messaging you first need to find the average message size (this was 193 bytes on my system)

select sum(kjxmsize * (kjxmrcv + kjxmsnt + kjxmqsnt)) / sum((kjxmrcv + kjxmsnt + kjxmqsnt)) "avg Message size"
from x$kjxm
where kjxmrcv > 0 or kjxmsnt > 0 or kjxmqsnt > 0;

then calculate the amount of messaging traffic on this network

193 x (765 + 525) = 248,970 bytes/sec = ~0.25 MB/sec

to calculate the total network traffic generated by cache fusion

= 2.0 + 2.6 + 0.25 = ~5 MB/sec = 5 x 8 = 40 Mbit/sec

The DBWR Fusion writes statistic indicates the number of times the local DBWR was forced to write a block to disk due to remote instances, this number should be low. |
|
Global cache
efficiency percentages
|
this section shows
how the instance is getting all the data blocks it needs. The best order is
the following
The first two give
the cache hit ratio for the instance, you are looking for a
value less than 10%, if you are getting higher values then you may
consider application
partitioning.
|
|
GCS and GES -
workload characteristics
|
this section
contains timing statistics for global enqueue and global cache. As a general
rule you are looking for
|
|
Messaging statistics
|
The first section
relates to sending a message and should be less than 1 second.
The second section
details the breakup of direct and indirect messages, direct messages are sent
by an instance foreground or the user processes to remote instances, indirect
are messages that are not urgent and are pooled and sent.
|
|
Service statistics
|
shows the resources
used by all the services the instance supports
|
|
Service wait class
statistics
|
summarizes waits in
different categories for each service
|
|
Top 5 CR and current
block segments
|
contains the names of
the top 5 contentious segments (table or index). If a table or index has a
very high percentage of CR and Current block transfers you need to
investigate. This is pretty much like a normal single instance.
|
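Outside of AWR a similar picture is available from the segment-level statistics; a minimal sketch using the standard view (statistic names as they appear in 10g):
SQL> select owner, object_name, statistic_name, value
     from v$segment_statistics
     where statistic_name in ('gc cr blocks received', 'gc current blocks received')
     order by value desc;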
As I stated above the
interconnect is a critical part of the RAC, you must make sure that this is on
the best hardware you can buy. You can confirm that the interconnect is being
used in Oracle 9i and 10g by using the command oradebug to
dump information out to a trace file, in Oracle 10g R2 the cluster interconnect
is also recorded in the alert.log file.
interconnect
|
SQL> oradebug
setmypid
SQL> oradebug ipc
Note: look in the
user_dump_dest directory, the trace will be there
|
Global Resource
Directory (GRD)
The RAC environment includes
many resources such as multiple versions of data block buffers in buffer caches
in different modes, Oracle uses locking and queuing mechanisms to coordinate
lock resources, data and interinstance data requests. Resources such as data
blocks and locks must be synchronized between nodes as nodes within a cluster
acquire and release ownership of them. The synchronization provided by the
Global Resource Directory (GRD) maintains a cluster wide concurrency of the
resources and in turn ensures the integrity of the shared data. Synchronization
is also required for buffer cache management as it is divided into multiple
caches, and each instance is responsible for managing its own local version of
the buffer cache. Copies of data are exchanged between nodes, this sometimes is
referred to as the global cache but in reality each node's
buffer cache is separate and copies of blocks are exchanged through a traditional
distributed locking mechanism.
Global Cache Services
(GCS) maintain the cache coherency across buffer cache resources and Global
Enqueue Services (GES) controls the resource management across the clusters
non-buffer cache resources.
Cache coherency
identifies the most up-to-date copy of a resource, also called the master copy,
it uses a mechanism by which multiple copies of an object are kept consistent
between Oracle instances. Parallel Cache Management (PCM) ensures that the
master copy of a data block is stored in one buffer cache and consistent copies
of the data block are stored in other buffer caches, the process LCKx is responsible for this task.
The lock and resource
structures for instance locks reside in the GRD (also called the DLM), it's a
dedicated area within the shared pool. Details about the data blocks resources
and cached versions are maintained by GCS. Additional details such as the
location of the most current version, state of the buffer, role of the data
block (local or global) and ownership are maintained by GES. GCS
together with GES form the GRD. Each instance maintains a part of the GRD in
its SGA. The GCS and GES nominate one instance, this will become the resource
master, to manage all information about a particular resource. Each
instance knows which instance is the master of which resource.
A resource is an
identifiable entity, it has a name or reference. The referenced entity is
usually a memory region, a disk file, a data block or an abstract entity. A
resource can be owned or locked in various states (exclusive or shared), all
resources are lockable. A global resource is visible throughout the cluster,
whereas a local resource can only be used by the instance to which it is local. Each
resource can have a list of locks called the grant queue, that are
currently granted to users. A convert queue is a queue of
locks that are waiting to be converted to a particular mode, this is the process
of changing a lock from one mode to another, even a NULL is a lock. A resource
has a lock value block (LVB). The Global Resource Manager (GRM) keeps the lock
information valid and correct across the cluster.
Locks are placed on a
resource grant or a convert queue, if the lock changes it moves between the
queues. A lock leaves the convert queue under the following conditions
- The process requests the lock termination (it removes
the lock)
- The process cancels the conversion, the lock is moved
back to the grant queue in the previous mode
- The requested mode is compatible with the most
restrictive lock in the grant queue and with all the previous modes of the
convert queue, and the lock is at the head of the convert queue
Convert requests are
processed on a FIFO basis, the grant queue and convert queue are associated
with each and every resource that is managed by the GES.
Enqueues are basically
locks that support queuing mechanisms and that can be acquired in different
modes. An enqueue can be held in exclusive mode by one process and others can
hold a non-exclusive mode depending on the type. Enqueues are the same in RAC
as they are in a single instance.
GES coordinates the
requests of all global enqueues, it also deals with deadlocks and timeouts.
There are two types of local locks, latches and enqueues; latches do not affect the
cluster, only the local instance, whereas enqueues can affect both the cluster and the
instance.
Enqueues are shared
structures that serialize access to database resources, they support multiple
modes and are held longer than latches, they protect persistent objects such as
tables or library cache objects. Enqueues can use any of the following modes
Mode
|
Summary
|
Description
|
NULL
|
NULL
|
no access rights, a
lock is held at this level to indicate that a process is interested in a
resource
|
SS
|
SubShared
|
the resource can be
read in an unprotected fashion; other processes can read and write to the
resource. This lock is also known as a row share lock
|
SX
|
Shared Exclusive
|
the resource can be
read and written to in an unprotected fashion, this is also known as a RX
(row exclusive) lock
|
S
|
Shared
|
a process cannot
write to the resource but multiple processes can read it. This is the
traditional share lock.
|
SSX
|
SubShared Exclusive
|
Only one process can
hold a lock at this level, this makes sure that only one process can modify it
at a time. Other processes can perform unprotected reads. This is also known
as an SRX (shared row exclusive) table lock.
|
X
|
Exclusive
|
grants the holding
process exclusive access to the resource, other processes cannot read or
write to the resource. This is also the traditional exclusive lock.
|
Each node has
information for a set of resources, Oracle uses a hashing algorithm to
determine which nodes hold the directory tree information for the resource.
Global locks are mainly of two types
- Locks used by the GCS for buffer cache management,
these are called PCM locks
- Global locks (global enqueue) that Oracle synchronizes
within a cluster to coordinate non-PCM resources, they protect the enqueue
structures
An instance owns a
global lock that protects a resource (i.e. data block or data dictionary entry)
when the resource enters the instance's SGA.
GES locks control access
to data files (not the data blocks) and control files and also serialize
interinstance communication. They also control library caches and the
dictionary cache. Examples of this are DDL, DML enqueue table locks,
transaction enqueues and DDL locks or dictionary locks. The SCN and mount lock
are global locks.
Transaction and row
locks are the same as in a single instance database, the only difference is
that the enqueues are global enqueues, take a look in locking for an in depth view on how Oracle locking
works.
The difference between
RAC and a single instance messaging is that RAC uses the high speed
interconnect and a single instance uses shared memory and semaphores,
interrupts are used when one or more processes want to use the processor in a
multiple CPU architecture. GES uses messaging for interinstance communication,
this is done by messages and asynchronous traps (ASTs). Both LMON and LMD use messages to communicate to other
instances, the GRD is updated when locks are required. The messaging traffic
can be viewed using the view V$GES_MISC.
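For a quick look at the GES message counters on 10g you can also query the statistics view; a sketch, assuming the standard v$ges_statistics name/value layout:
SQL> select name, value from v$ges_statistics where name like '%messages%';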
A three-way lock message
involves up to a maximum of three instances, Master instance (M), Holding
instance (H) and the Requesting instance (R), the sequence is detailed below
where requesting instance R is interested in block B1 from holding instance H.
The resource is mastered in master instance M
|
Because GES relies
heavily on messaging, the interconnect must be of high quality (high performance,
low latency), also the messages are kept small (128 bytes) to increase
performance. The Traffic Controller (TRFC) is used to control the DLM traffic
between the instances in the cluster, it uses buffering to accommodate large
volumes of traffic. The TRFC keeps track of everything by using tickets
(sequence numbers), there is a predefined pool of tickets, this is dependent on
the network send buffer size. A ticket is obtained before sending any messages,
once sent the ticket is returned to the pool, LMS or LMD perform this. If there
are no tickets then the message has to wait until a ticket is available. You
can control the number of tickets and view them
system parameter
|
_lm_tickets
_lm_ticket_active_sendback (used for aggressive messaging) |
ticket usage
|
select local_nid
local, remote_nid remote, tckt_avail avail, tckt_limit limit, snd_q_len
send_queue, tckt_wait waiting from v$ges_traffic_controller;
|
dump ticket
information
|
SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug lkdebug -t |
GCS locks only protect
data blocks in the global cache (also known as PCM locks), they can be acquired in
share or exclusive mode. Each lock element can have the lock role set to either
local (same as single instance) or global. When in global role three lock modes
are possible, shared, exclusive and null. In global role mode you can read or
write to the data block only as directed by the master instance of that
resource. The lock and state information is held in the SGA and is maintained
by GCS, these are called lock elements. It also holds a chain of
cache buffer chains that are covered by the corresponding lock elements. These
can be viewed via v$lock_element, the parameter _db_block_hash_buckets controls
the number of hash buffer chain buckets.
GCS locks use the
following modes as stated above
Exclusive (X)
|
used during update
or any DML operation, if another instance requires the block that has an
exclusive lock it asks GES to request that the second instance disowns the
global lock
|
Shared (S)
|
used for select
operations, reading of data does not require an instance to disown a global
lock.
|
Null (N)
|
allows instances to
keep a lock without any permission on the block(s). This mode is used so that
locks need not be created and destroyed all the time, it just converts from
one lock to another.
|
Lock roles are used by
Cache Fusion, it can be either local or global, the resource is local if the
block is dirty only in the local cache, it is global if the block is dirty in a
remote cache or in several remote caches. A Past Image (PI) is kept by the instance when a block
is shipped to another instance, the role is then changed to a global role, thus
the PI represents the state of a dirty buffer. A node must keep a PI until it
receives notification from the master that a write to disk has completed
covering that version, the node will then log a block written record (BWR). I
have already discussed PI and BWR in my backup section.
When a new current block
arrives, the previous PI remains untouched in case another node requires it. If
there are a number of PIs that exist, they may or may not merge into a single
PI, the master will determine this based on whether the older PIs are required; an
indeterminate number of PIs can exist.
In the local role only S
and X modes are permitted, when requested by the master instance the holding
instance serves a copy of the block to others. If the block is globally clean
this instance lock role remains local. If the block is modified (dirty), a PI
is retained and the lock becomes global. In the global lock role lock modes can
be N, S and X, the block is global and it may even be dirty in any of the
instances and the disk version may be obsolete. Interested parties can only
modify the block using X mode, an instance cannot read from the disk as it may
not be current, the holding instance can send copies to other instances when
instructed by the master.
I have a complete
detailed walkthrough in my cache fusion section, which will help you understand
this better.
A lock element holds
lock state information (converting, granting, etc). LEs are managed by the lock
process to determine the mode of the locks, they also hold a chain of cache
buffers that are covered by the LE and allow the Oracle database to keep track
of cache buffers that must be written to disk in case an LE (mode) needs to be
downgraded (X > N).
LEs protect all the data blocks in the buffer cache, the list below describes the buffer states which are managed by the LEs using GCS locks (x$bh.state).
0 | FREE
1 | EXLCUR
2 | SHRCUR
3 | CR
4 | READING
5 | MRECOVERY
6 | IRECOVERY
7 | WRITING
8 | PI |
So putting this
altogether you get the following: GCS manages PCM locks in the GRD, PCM locks
manage the data blocks in the global cache. Data blocks can be kept in any
of the instances' buffer caches (which together form the global cache), if not found there a block can be
read from disk by the requesting instance. The GCS monitors and maintains the
list and mode of the blocks in all the instances. Each instance will master a number
of resources, but a resource can only be mastered by one instance. GCS ensures
cache coherency by requiring that instances acquire a lock before modifying or
reading a database block. GCS locks are not row-level locks, row-level locks
are used in conjunction with PCM locks. The GCS lock ensures that the block is
accessed by one instance, then row-level locks manage access at the
row level. If a block is modified all Past Images (PIs) are no longer current
and new copies need to be obtained.
Consistent read
processing means that readers never block writers, the same as in a single
instance. One parameter that can help is _db_block_max_cr_dba which limits the
number of CR copies per DBA in the buffer cache. If too many CR requests arrive
for a particular buffer, the holder can disown the lock on the buffer and write
the buffer to disk, the requestor can then read it from disk,
especially if the requested block has an older SCN and the requestor needs to reconstruct it
(known as CR fabrication). This is technically known as fairness
downconvert, and the parameter _fairness_threshold can be used to configure
it.
The lightwork rule is
involved when CR construction involves too much work and no current block or PI
block is available in the cache for block cleanouts. The below can be used to
view the number of times a downconvert occurs
downconvert
|
select cr_requests,
light_works, data_requests, fairness_down_converts from v$cr_block_server;
Note: lower the _fairness_threshold if the ratio goes above 40%, set to 0 if the instance is a query only instance. |
The GRD is a central
repository for locks and resources, it is distributed across all nodes (not a
single node), but only one instance masters a resource. The process of
maintaining information about resources is called lock mastering or resource
mastering. I spoke about lock remastering in my backup section.
Resource affinity allows
the resource mastering of the frequently used resources on its local node, it
uses dynamic resource mastering to move the location of the resource masters.
Normally resource mastering only happens when an instance joins or leaves the
RAC environment, as of Oracle 10g R2 mastering occurs at the object level which
helps fine-grained object remastering. There are a number of parameters that
can be used to dynamically remaster an object
_gc_affinity_time
|
specifies interval
minutes for remastering
|
_gc_affinity_limit
|
defines the number
of times an instance accesses the resource before remastering, setting it to 0
disables remastering
|
_gc_affinity_minimum
|
defines the minimum
number of times an instance accesses the resource before remastering
|
_lm_file_affinity
|
disables dynamic
remastering for the objects belonging to those files
|
_lm_dynamic_remastering
|
enable or disable
remastering
|
You should consult
Oracle before changing any of the above parameters.
Cache Fusion
I mentioned Cache
Fusion above in the GRD section, here I go into greater detail on how it works, I will
also provide a number of walk-through examples from my RAC system.
Cache Fusion uses the
most efficient communication possible to limit the amount of traffic used
on the interconnect, now you don't need this level of detail to administer a
RAC environment but it sure helps to understand how RAC works when trying to
diagnose problems. RAC appears to have one large buffer cache but this is not the
case, in reality the buffer caches of each node remain separate, data blocks
are shared through distributed locking and messaging operations. RAC copies data blocks across the
interconnect to other instances as it is more efficient than reading them from disk,
yes memory and networking together are faster than disk I/O.
The transfer of a data
block from one instance's buffer cache to another instance's buffer cache is known as
a ping. As mentioned already when an instance requires a
data block it sends the request to the lock master to obtain a lock in the
desired mode, this process is known as blocking
asynchronous trap (BAST). When an instance receives a BAST it
downgrades the lock ASAP, however it might have to write the corresponding
block to disk, this operation is known as disk ping or hard
ping. Disk pings have been reduced in the later versions of RAC, thus
relying on block transfers more, however there will always be a small amount
of disk pinging. In the newer versions of RAC when a BAST is received sending
the block or downgrading the lock may be deferred by tens of milliseconds, this
extra time allows the holding instance to complete an active transaction and
mark the block header appropriately, this will eliminate any need for the
receiving instance to check the status of the transaction immediately after
receiving/reading a block. Checking the status of a transaction is an expensive
operation that may require access (and pinging) to the related undo segment
header and undo data blocks as well. The parameter _gc_defer_time can be used
to define the duration by which an instance defers downgrading a lock.
In the GRD section I mentioned Past Images (PIs), basically
they are copies of data blocks in the local buffer cache of an instance. When
an instance sends a block it has recently modified to another instance, it
preserves a copy of that block, marking as a PI. The PI is kept until that
block is written to disk by the current owner of the block. When the block is
written to disk and is known to have a global role, indicating the presence of
PIs in other instances' buffer caches, GCS informs the instances holding the PIs
to discard the PIs. When a checkpoint is required it informs GCS of the write
requirement, GCS is responsible for finding the most current block image and
informing the instance holding that image to perform a block write. GCS then
informs all holders of the global resource that they can release the buffers
holding the PI copies of the block, allowing the global resource to be
released. You can view the past image blocks present in the fixed table X$BH
PIs
|
select state,
count(state) from X$BH group by state;
Note: the state column with 8 is the past images. |
Cache Fusion I is also
known as the consistent read server and was introduced in Oracle 8.1.5, it keeps a
list of recent transactions that have changed a block. The original data
contained in the block is preserved in the undo segment, which can be used to
provide consistent read versions of the block.
In a single instance the
following happens when reading a block
- When a reader reads a recently modified block, it might
find an active transaction in the block
- The reader will need to read the undo segment header to
decide whether the transaction has been committed or not
- If the transaction is not committed, the process
creates a consistent read (CR) version of the block in the buffer cache
using the data in the block and the data stored in the undo segment
- If the undo segment shows the transaction is committed,
the process has to revisit the block and clean out the block (delayed block
cleanout) and generate the redo for the
changes.
In a RAC environment if
the process reading the block is on an instance other than the one that
modified the block, the reader will have to read the following blocks from
disk
- data block to
get the data and/or transaction ID and Undo Byte Address (UBA)
- undo segment header block to find the last undo block used for the entire
transaction
- undo data block to
get the actual record to construct a CR image
Before these blocks can
be read the instance modifying the block will have to write those blocks to
disk, resulting in 6 I/O operations. In RAC the instance can construct a CR
copy, hopefully using the above blocks that are still in memory, and then
send the CR copy over the interconnect, thus avoiding the 6 I/O operations.
Oracle 8
introduced a new background process called the Block Server Process, which performs the
CR fabrication in the holder's cache and ships the CR version of the block
across the interconnect, the sequence is detailed in the table below
|
While making a CR copy,
the holding instance may refuse to do so if
- it does not find any of the blocks needed in its buffer
cache, it will not perform a disk read to make a CR copy for another
instance
- It is repeatedly asked to send a CR copy of the same
block, after sending the CR copies four times it will voluntarily
relinquish the lock, write the block to the disk and let other instances
get the block from the disk. The number of copies it will serve before
doing so is governed by the parameter _fairness_threshold
Read/Write contention
was addressed in cache fusion I, cache fusion II addresses the write/write
contention
|
A quick recap of GCS, a
GCS resource can be local or global, if it is
local it can be acted upon without consulting other instances, if it is global
it cannot be acted upon without consulting or informing remote instances. GCS
is used as a messaging agent to coordinate manipulation of a global resource.
By default all resources are in NULL mode (remember null mode is used to
convert from one type to another (share or exclusive)).
The table below denotes
the different states of a resource
Mode/Role
|
Local
|
Global
|
Null (N)
|
NL
|
NG
|
Shared (S)
|
SL
|
SG
|
Exclusive (X)
|
XL
|
XG
|
States
|
||
SL
|
it can serve a copy
of the block to other instances and it can read the block from disk, since
the block is not modified there is no need to write to disk
|
|
XL
|
it has sole
ownership and interest in that resource, it has exclusive right to modify the
block, all changes to the blocks are in the local buffer cache and it can
write the block to the disk. If another instance wants the block it has to
come via the GCS
|
|
NL
|
used to protect
consistent read block, if an instance wants it in X mode, the current
instance will send the block to the requesting instance and downgrades its
role to NL
|
|
SG
|
a block is present
in one or more instances, an instance can read the block from disk and serve
it to other instances
|
|
XG
|
a block can have one
or more PIs, the instance with the XG role has the latest copy of the block
and is the most likely candidate to write the block to the disk. GCS can ask
the instance to write the block and serve it to other instances
|
|
NG
|
after discarding PIs
when instructed to by GCS, the block is kept in the buffer cache with NG
role, this serves only as the CR copy of the block.
|
Below are a number of
common scenarios to help understand the following
- reading from disk
- reading from cache
- getting the block from cache for update
- performing an update on a block
- performing an update on the same block
- reading a block that was globally dirty
- performing a rollback on a previously updated block
- reading the block after commit
We will assume the
following
- A four-node RAC environment (instances A, B, C and D)
- Instance D is the master of the lock resource for the
data block BL
- We will only use one block and it will reside at SCN
987654
- We will use a three-letter code for the lock states
- first letter will indicate the lock mode - N = Null, S
= Shared and X = Exclusive
- second letter will indicate the lock role - G = Global, L
= Local
- The third letter will indicate the PIs - 0 = no PIs, 1
= a PI of the block
for example a code of
SL0 means a shared lock with a local role and no past images (PIs)
Reading a block from disk
|
|
instance C want to
read the block it will request a lock in share mode from the master instance
|
|
Reading a block from the cache
|
|
Carrying on from the
above example, Instance B wants to read the same block that
is cached in instance C buffer.
|
|
Getting a (Cached) clean block for update
|
|
Carrying on from the
above example, instance A wants to modify the same block that is already
cached in instance B and C (block 987654)
|
|
Getting a (Cached) modified block for update and commit
|
|
Carrying on from the
above example, instance C now wants to modify the block, if it tries to
modify the same row it will have to wait until instance A either commits or
rolls back. However in this case instance C wants to modify a different row
in the same block.
|
|
Commit the previously modified block and select the data
|
|
Carrying on from the
above example, instance A now issues a commit to release the row level locks
held by the transaction and flush the redo information to the redologs
|
|
Write the dirty buffers to disk due to a checkpoint
|
|
Carrying on from the
above example, instance B writes the dirty blocks from the buffer cache due
to a checkpoint (this is were it gets interesting and very clever)
|
|
Master instance crashes
|
|
Carrying on from the
above example
|
|
Select the rows from Instance A
|
|
Carrying on from the
above example, now instance A queries the rows from that table to get the
most recent data
|
The above sequence of
events can be seen in the table below
Example
|
Operation on Node
|
Buffer Status
|
||||||
A
|
B
|
C
|
D
|
A
|
B
|
C
|
D
|
|
1
|
read block from disk
|
SCUR
|
||||||
2
|
read the block from cache
|
CR
|
SCUR
|
|||||
3
|
update the block
|
XCUR
|
CR
|
CR
|
||||
4
|
update the same block
|
PI
|
CR
|
XCUR
|
||||
5
|
commit the changes
|
PI
|
CR
|
XCUR
|
||||
6
|
trigger checkpoint
|
CR
|
XCUR
|
|||||
7
|
instance crash
|
|||||||
8
|
select the rows
|
CR
|
XCUR
|
RAC Troubleshooting
This is the one section
that will be updated frequently as my experience with RAC grows, as RAC has
been around for a while most problems can be resolved with a simple Google
search, but a basic understanding of where to look for the problem is required.
In this section I will point you where to look for problems, every instance in
the cluster has its own alert logs, which is where you would start to look.
Alert logs contain startup and shutdown information, nodes joining and leaving
the cluster, etc.
The cluster itself has a
number of log files that can be examined to gain insight into occurring
problems, the table below describes the information that you may need from the
CRS components
$ORA_CRS_HOME/crs/log
|
contains trace files
for the CRS resources
|
$ORA_CRS_HOME/crs/init
|
contains trace files
for the CRS daemon during startup, a good place to start
|
$ORA_CRS_HOME/css/log
|
contains cluster
reconfigurations, missed check-ins, connects and disconnects from the client
CSS listener. Look here to determine when reboots occur
|
$ORA_CRS_HOME/css/init
|
|
$ORA_CRS_HOME/evm/log
|
log files for the
event volume manager and eventlogger daemon
|
$ORA_CRS_HOME/evm/init
|
pid and lock files
for EVM
|
$ORA_CRS_HOME/srvm/log
|
|
$ORA_CRS_HOME/log
|
log files for Oracle
clusterware which contains diagnostic messages at the Oracle cluster level
|
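From 10g R2 onwards there is also a clusterware-level alert log under $ORA_CRS_HOME/log/<hostname>, which is a good first place to look; the path below is how it appears on my 10g R2 system and may differ on yours:
# clusterware alert log (10g R2 onwards)
tail -f $ORA_CRS_HOME/log/<hostname>/alert<hostname>.log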
As in a normal Oracle
single instance environment, a RAC environment contains the standard RDBMS log
files, these files are located by the parameter background_dump_dest.
The most important of these are
$ORACLE_BASE/admin/udump
|
contains any trace
file generated by a user process
|
$ORACLE_BASE/admin/cdump
|
contains core files
that are generated due to a core dump in a user process
|
Now let's look at a two
node startup and the sequence of events.
First you must check
that the RAC environment is using the correct interconnect, this can be done by
any of the following
logfile
|
## The location of my
alert log, yours may be different
/u01/app/oracle/admin/racdb/bdump/alert_racdb1.log |
oifcfg command
|
oifcfg getif
|
table check
|
select inst_id,
pub_ksxpia, picked_ksxpia, ip_ksxpia from x$ksxpia;
|
oradebug
|
SQL> oradebug setmypid
SQL> oradebug ipc

Note: check the trace file which can be located by the parameter user_dump_dest |
system parameter
|
cluster_interconnects
Note: used to specify which address to use |
When the instance starts
up the Lock Monitor's (LMON) job is to register with the Node Monitor (NM) (see
below table). Remember when a node joins or leaves the cluster the GRD
undergoes a reconfiguration event, as seen in the logfile it is a seven step process (see below for more details on the seven
step process).
The LMON trace file also
has details about reconfigurations, it also details the reason for the event
reconfiguration reason
|
description
|
1
|
means that the NM
initiated the reconfiguration event, typical when a node joins or leaves a
cluster
|
2
|
means that an
instance has died
How does the RAC
detect an instance death, every instance updates the control file with a
heartbeat through its checkpoint (CKPT), if the heartbeat information is missing
for x amount of time, the instance is considered to be dead and the Instance
Membership Recovery (IMR) process initiates reconfiguration.
|
3
|
means a communication
failure of a node or nodes. Messages are sent across the interconnect; if a message
is not received within a set amount of time then a communication failure is assumed.
By default UDP is used, which can be unreliable, so keep an eye on the logs if
too many reconfigurations happen for reason 3.
|
Example of a
reconfiguration, taken from the alert log.
|
Sat Mar 20 11:35:53 2010
Reconfiguration started (old inc 2, new inc 4)
List of nodes: 0 1
Global Resource Directory frozen
* allocate domain 0, invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Sat Mar 20 11:35:53 2010
LMS 0: 0 GCS shadows cancelled, 0 closed
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Sat Mar 20 11:35:53 2010
LMS 0: 0 GCS shadows traversed, 3291 replayed
Sat Mar 20 11:35:53 2010
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete

Note: when a reconfiguration happens the GRD is frozen until the reconfiguration is completed |
Confirm that the
database has been started in cluster mode, the log file will state the
following
cluster mode
|
Sat Mar 20 11:36:02 2010
Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE)
Completed: ALTER DATABASE MOUNT |
Starting with 10g
the SCN is broadcast across all nodes, the system
will have to wait until all nodes have seen the commit SCN. You can change the
broadcast method using the system parameter _lgwr_async_broadcasts.
The Lamport algorithm
generates SCNs in parallel and they are assigned to transactions on a first come
first served basis, this is different from a single instance environment. The
broadcast method is used after a commit operation, this method is more CPU
intensive as it has to broadcast the SCN for every commit, but the other nodes
can see the committed SCN immediately.
The initialization
parameter max_commit_propagation_delay limits the maximum
delay allowed for SCN propagation; by default it is 7 seconds. When set to less
than 100 the broadcast-on-commit algorithm is used.
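For example (a hedged sketch only; the value 0 simply forces broadcast on commit), you can check and change the setting like this:
SQL> show parameter max_commit_propagation_delay
SQL> alter system set max_commit_propagation_delay = 0 scope=spfile sid='*';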
There are times when you
may wish to disable RAC, this feature can only be used in a
Unix environment (no windows option).
Disable Oracle RAC (Unix only)
|
|
Enable Oracle RAC (Unix only)
|
|
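The commands are left blank in the table above; as a hedged sketch, the commonly documented approach is to relink the Oracle binary with the RAC option switched off or on. Shut down all instances and listeners first and repeat on every node:
## disable RAC
$ cd $ORACLE_HOME/rdbms/lib
$ make -f ins_rdbms.mk rac_off
$ make -f ins_rdbms.mk ioracle
## enable RAC
$ cd $ORACLE_HOME/rdbms/lib
$ make -f ins_rdbms.mk rac_on
$ make -f ins_rdbms.mk ioracle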
Oracle can suffer a
number of different performance problems and can be categorized by the
following
- Hung Database
- Hung Session(s)
- Overall instance/database performance
- Query Performance
A hung database is
basically an internal deadlock between two processes; usually Oracle will detect
the deadlock and roll back one of the processes, however if the situation occurs
with internal kernel-level resources (latches or pins), it is unable to
automatically detect and resolve the deadlock, thus hanging the database. When
this event occurs you must obtain dumps from each of the instances (3 dumps per
instance at regular intervals), and the trace files will be very large.
capture information
|
## Using alter
session
SQL> alter session set max_dump_file_size = unlimited; SQL> alter session set events 'immediate trace name systemstate level 10';
# using oradebug
SQL> select * from dual; SQL> oradebug setmypid SQL> unlimit SQL> oradebug dump systemstate 10 # using oradebug from another instance SQL> select * from dual; SQL> oradebug setmypid SQL> unlimit SQL> oradebug -g all dump systemstate 10 Note: the select statement above is to avoid problems on pre 8 Oracle |
SQLPlus - problems
connecting
|
## If you get problems
connecting with SQLPLUS use the command below
$ sqlplus -prelim Enter user-name: / as sysdba |
A severe performance
problem can be mistaken for a hang; this usually happens because of contention
problems. A systemstate dump is normally used to analyze this problem, however
a systemstate dump takes a long time to complete and it also has a number of
limitations
- Reads the SGA in a dirty manner, so it may be
inconsistent
- Usually dumps a lot of information
- does not identify interesting processes on which to
perform additional dumps
- can be a very expensive operation if you have a large
SGA.
To overcome these
limitations a new utility command was released with 8i called hanganalyze which
provides clusterwide information in a RAC environment in a single shot.
sql method
|
alter session set
events 'immediate trace name hanganalyze level <level>';
|
oradebug
|
SQL> oradebug
hanganalyze <level>
## Another way using oradebug SQL> setmypid SQL> setinst all SQL> oradebug -g def hanganalyze <level> Note: you will be told where the output will be dumped to |
hanganalyze levels
|
|
1-2
|
|
3
|
Level 2 + Dump only
processes thought to be in a hang (IN_HANG state)
|
4
|
Level 3 + Dump leaf
nodes (blockers) in wait chains (LEAF, LEAF_NW, IGN_DMP state)
|
5
|
Level 4 + Dump all
processes involved in wait chains (NLEAF state)
|
10
|
Dump all processes
(IGN state)
|
The hanganalyze command
uses internal kernel calls to determine whether a session is waiting for a
resource and reports the relationship between blockers and waiters; a systemstate
dump is better, but if you are overwhelmed try hanganalyze first.
A node is evicted from
the cluster after it kills itself because it is not able to service the
applications; this generally happens when you have communication problems. For
node eviction problems look for ORA-29740 errors in the alert log file and LMON
trace files.
To understand eviction
problems you need to know the basics of how node membership and instance membership
recovery (IMR) work. When a communication failure happens the heartbeat
information in the controlfile cannot be updated, thus data corruption can happen. IMR
will remove any nodes from the cluster that it deems to be a problem; IMR will
ensure that the larger part of the cluster will survive and kills any remaining
nodes. IMR is part of the service offered by Cluster Group Services (CGS). LMON
handles many of the CGS functionalities, this works at the cluster level and
can work with 3rd party software (Sun Cluster, Veritas Cluster). The Node
Monitor (NM) provides information about nodes and their health by registering
and communicating with the Cluster Manager (CM). Node membership is represented
as a bitmap in the GRD. LMON will let other nodes know of any changes in membership, for
example if a node joins or leaves the cluster, the bitmap is rebuilt and
communicated to all nodes.
Node registering
(alert log)
|
lmon registered with
NM - instance id 1 (internal mem no 0)
|
One thing to remember is
that all nodes must be able to read from and write to the controlfile. CGS
makes sure that members are valid, it uses a voting mechanism to check the
validity of each member. I have already discussed the voting disk in my architecture section; as stated above, membership is
held in a bitmap in the GRD. The CKPT process updates the controlfile every 3
seconds in an operation known as a heartbeat. It writes into a single
block that is unique for each instance, thus intra-instance coordination is not
required, this block is called the checkpoint progress record. You
can see the controlfile records using the gv$controlfile_record_section view,
all members attempt to obtain a lock on the controlfile record for updating,
the instance that obtains the lock tallies the votes from all members, the
group membership must conform to the decided (voted) membership before allowing
the GCS/GES reconfiguration to proceed, the controlfile vote result is stored
in the same block as the heartbeat in the control file checkpoint progress
record.
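As an illustrative query (standard view columns, run from any instance), the checkpoint progress record section of the controlfile can be seen with:
SQL> select inst_id, type, record_size, records_total, records_used
     from gv$controlfile_record_section
     where type = 'CHECKPOINT PROGRESS';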
A cluster
reconfiguration is performed using 7 steps
- Name service is frozen, the CGS contains an internal
database of all the members/instances in the cluster with all their
configurations and servicing details.
- Lock database (IDLM) is frozen, this prevents processes
from obtaining locks on resources that were mastered by the departing/dead
instance
- Determination of membership and validation and IMR
- Bitmap rebuild takes place, instance name and
uniqueness verification, GCS must synchronize the cluster to be sure that
all members get the reconfiguration event and that they all see the same
bitmap.
- Delete all dead instance entries and republish all
names newly configured
- Unfreeze and release name service for use
- Hand over reconfiguration to GES/GCS
Oracle server management
configuration tools include a diagnostic and tracing facility for verbose
output for SRVCTL, GSD, GSDCTL or SRVCONFIG.
To capture diagnostic
output, follow the steps below
- use vi to edit the
gsd.sh/srvctl/srvconfig file in the $ORACLE_HOME/bin directory
- At the end of the file look for the below line
exec $JRE -classpath $CLASSPATH oracle.ops.mgmt.daemon.OPSMDaemon $MY_OHOME - Add the following just before the -classpath in the
exec $JRE line
-DTRACING.ENABLED=true -DTRACING.LEVEL=2 - the string should look like this
exec $JRE -DTRACING.ENABLED=true -DTRACING.LEVEL=2 -classpath...........
In Oracle database 10g
setting the below variable accomplishes the same thing, set it to blank to
remove the debugging
Enable tracing
|
$ export
SRVM_TRACE=true
|
Disable tracing
|
$ export
SRVM_TRACE=""
|
Adding or Deleting a
Node
One of the jobs of a DBA
is adding and removing nodes from a RAC environment when capacity demands it;
although you should add a node of a similar spec, it is possible to add a node
of a higher or lower spec.
The first stage is to
configure the operating system and make sure any necessary drivers are
installed, also make sure that the node can see the shared disks available to
the existing RAC.
I am going to presume we
have a two-node RAC environment already set up, and we are going to add a third node.
You used the Cluster
Verification utility when installing the RAC environment; the tool checks that
the node has been properly prepared for a RAC deployment. You can run the
command either from the new node or from any of the existing nodes in the
cluster
pre-install check
run from new node
|
runcluvfy.sh stage
-pre crsinst -n rac1,rac2,rac3 -r 10gr2
|
pre-install check
run from existing node
|
cluvfy stage -pre
crsinst -n rac1,rac2,rac3 -r 10gr2
|
Make sure that you fix
any highlighted problems before continuing.
Cluster Ready Services
(CRS) should be installed first, this allows the node to become part of the
cluster. Adding the new node can be started from any of the existing nodes
- Log into any of the existing nodes as user oracle then
run the below command, the script below starts the OUI GUI tool, hopefully
the tool will already see the existing cluster and will fill in the
details for you
$ORA_CRS_HOME/oui/bin/addnode.sh - In the specify cluster nodes to add to
installation screen, enter the new names for the public, private
and virtual hosts
- Click next to see a summary page
- Click install, the installer will copy the
files from the existing node to the new node. Once copied you will be asked
to run orainstRoot.sh and root.sh as
user root
- Run orainstRoot.sh and root.sh in
the new node and rootaddnode.sh in the node that you are
running the installation from.
orainstRoot.sh
|
sets the Oracle
inventory in the new node and set ownerships and permissions to the inventory
|
root.sh
|
checks whether the
Oracle CRS stack is already configured in the new node, creates /etc/oracle
directory and adds the relevant OCR keys to the cluster registry and it adds
the daemon to CRS and starts CRS in the new node.
|
rootaddnode.sh
|
configures the OCR
registry to include the new nodes as part of the cluster
|
- Click next to complete the
installation. Now you need to configure Oracle Notification Services
(ONS). The port can be identified by the below command
cat $ORA_CRS_HOME/opmn/conf/ons.config - Now run the ONS utility by supplying the
<remote_port> number obtained above
racgons add_config rac3:<remote_port>
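Optionally, and assuming the same node names used above, you can verify the clusterware installation at this point before adding the database software:
cluvfy stage -post crsinst -n rac1,rac2,rac3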
Once the CRS has been
installed and the new node is in the cluster, it is time to install the Oracle
DB software. Again you can use any of the existing nodes to install the
software.
- Log into any of the existing nodes as user oracle then
run the below command, the script below starts the OUI GUI tool, hopefully
the tool will already see the existing cluster and fill in the details for
you
$ORACLE_HOME/oui/bin/addnode.sh - Click next on the welcome screen to
open the specify cluster nodes to add to installation screen,
you should have a list of all the existing nodes in the cluster, select
the new node and click next
- Check the summary page then click install to
start the installation
- The files will be copied to the new node, the script
will ask you to run root.sh on the new node, then
click OK to finish off the installation
- Login as user oracle, and set your DISPLAY environment
variable, then start the Network Configuration Assistant
$ORACLE_HOME/bin/netca - Choose cluster management
- Choose listener
- Choose add
- Choose the name as LISTENER
These steps will add a
listener on rac3 as LISTENER_rac3
Run the below to create
the database instance on the new node
- Login as oracle on the new node, set
the environment to database home and then run the database creation
assistant (DBCA)
$ORACLE_HOME/bin/dbca - In the welcome screen choose oracle real
application clusters database to create the instance and
click next
- Choose instance management and
click next
- Choose add instance and click next
- Select RACDB (or whatever name you
gave your RAC environment) as the database and enter the SYSDBA user and
password, click next
- You should see a list of existing instances,
click next and on the following screen enter ORARAC3 as
the instance and choose RAC3 as the node name (substitute any of the above
names for your environment naming convention)
- The database instance will now be created, click next in
the database storage screen, choose yes when asked to
extend ASM
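DBCA normally registers the new instance with the clusterware for you; purely as a hedged sketch (using the example names above), the equivalent manual registration and a quick check would look like this:
srvctl add instance -d RACDB -i ORARAC3 -n rac3
srvctl start instance -d RACDB -i ORARAC3
srvctl status database -d RACDB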
Removing a node is similar
to above but in reverse order
- Delete the instance on the node to be removed
- Clean up ASM
- Remove the listener from the node to be removed
- Remove the node from the database
- Remove the node from the clusterware
You can delete the
instance by using the database creation assistant (DBCA), invoke the program
choose the RAC database, choose instance management and then
choose delete instance, enter the sysdba user and password then
choose the instance to delete.
To clean up ASM follow
the below steps
- From node 1 run the below command to stop ASM on the
node to be removed
srvctl stop asm -n rac3
srvctl remove asm -n rac3 - Now run the following on the node to be removed
cd $ORACLE_HOME/admin
rm -rf +ASM
cd $ORACLE_HOME/dbs
rm -f *ASM* - Check that /etc/oratab file has no ASM entries, if so
remove them
Now remove the listener
for the node to be removed
- Login as user oracle, and set your DISPLAY environment
variable, then start the Network Configuration Assistant
$ORACLE_HOME/bin/netca - Choose cluster management
- Choose listener
- Choose Remove
- Choose the name as LISTENER
Next we remove the node
from the database
- Run the below script from the node to be removed
cd $ORACLE_HOME/bin
./runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME "CLUSTER_NODES={rac3}" -local
./runInstaller - Choose to deinstall products and select the dbhome
- Run the following from node 1
cd $ORACLE_HOME/oui/bin
./runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME "CLUSTER_NODES={rac1,rac2,rac3}"
Lastly we remove the clusterware
software
- Run the following from node 1, you obtain the port
number from remoteport section in the ons.config file in
$ORA_CRS_HOME/opmn/conf
$CRS_HOME/bin/racgons remove_config rac3:6200 - Run the following from the node to be removed as user root
cd $CRS_HOME/install
./rootdelete.sh - Now run the following from node 1 as user root,
obtain the node number first
$CRS_HOME/bin/olsnodes -n
cd $CRS_HOME/install
./rootdeletenode.sh rac3,3 - Now run the below from the node to be removed as user oracle
cd $CRS_HOME/oui/bin
./runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME "CLUSTER_NODES={rac3}" CRS=TRUE -local
./runInstaller - Choose to deinstall software and remove the CRS_HOME
- Run the following from node 1 as user oracle
cd $CRS_HOME/oui/bin
./runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME "CLUSTER_NODES={rac1,rac2,rac3}" CRS=TRUE - Check that the node has been removed, the first should
report "invalid node", the second you should not see any output
and the last command you should only see nodes rac1 and rac2
srvctl status nodeapps -n rac3
crs_stat |grep -i rac3
olsnodes -n
RAC Cheatsheet
This is a quick and
dirty cheatsheet on Oracle RAC 10g, as my experience with RAC grows I will
update this section, below is a beginners guide on the commands and information
that you will require to administer Oracle RAC.
Acronyms
|
||
GCS
|
Global Cache
Services
|
in memory database
containing current locks and awaiting locks, also known as PCM
|
GES
|
Global Enqueue
Services
|
coordinates the
requests of all global enqueues uses the GCS, also known as non-PCM
|
GRD
|
Global Resource
Directory
|
|
GRM
|
Global Resource
Manager
|
helps to coordinate
and communicate the locks requests between Oracle processes
|
GSD
|
Global Services
Daemon
|
runs on each node
with one GSD process per node. The GSD coordinates with the cluster manager
to receive requests from clients such as the DBCA, EM, and the SRVCTL utility
to execute administrative job tasks such as instance startup or shutdown. The
GSD is not an Oracle instance background process and is therefore not started
with the Oracle instance
|
PCM (IDLM)
|
Parallel Cache
Management
|
formerly known as the
(Integrated) Distributed Lock Manager, it is another name for GCS
|
Resource
|
n/a
|
it is an identifiable
entity, it basically has a name or a reference; it can be an area in memory, a
disk file or an abstract entity
|
Resource (Global)
|
n/a
|
a resource that can
be accessed by all the nodes within the cluster examples would be the
following
|
LVB
|
Lock Value Block
|
contains a small
amount of data regarding the lock
|
TRFC
|
Traffic Controller
|
Files and Directories
|
|
$ORA_CRS_HOME/cdata/<cluster_name>
|
OCR backups (default
location)
|
$ORA_CRS_HOME/log/<hostname>/client/ocrconfig_<pid>.log
|
OCR command log file
|
$ORA_CRS_HOME/crs/log
|
contains trace files
for the CRS resources
|
$ORA_CRS_HOME/crs/init
|
contains trace files
for the CRS daemon during startup, a good place to start
|
$ORA_CRS_HOME/css/log
|
contains cluster
reconfigurations, missed check-ins, connects and disconnects from the client
CSS listener. Look here to determine when reboots occur
|
$ORA_CRS_HOME/css/init
|
|
$ORA_CRS_HOME/evm/log
|
logfiles for the
event manager (EVM) and the evmlogger daemon
|
$ORA_CRS_HOME/evm/init
|
pid and lock files
for EVM
|
$ORA_CRS_HOME/srvm/log
|
|
$ORA_CRS_HOME/log
|
log files for Oracle
clusterware which contains diagnostic messages at the Oracle cluster level
|
GCS and Cache Fusion Diagnostics
|
|
v$cache
|
contains information
about every cached block in the buffer cache
|
v$cache_transfer
|
contains information
from the block headers in SGA that have been pinged at least once
|
v$instance_cache_transfer
|
contains information
about the transfer of cache blocks through the interconnect
|
v$cr_block_server
|
contains statistics
about CR block transfer across the instances
|
v$current_block_server
|
contains statistics
about current block transfer across the instances
|
v$gc_element
|
contains one-to-one
information for each global cache resource used by the buffer cache
|
GES diagnostics
|
|
v$lock
|
contains information
about locks held within a database and outstanding requests for locks and latches
|
v$ges_blocking_enqueue
|
contains information
about locks that are being blocked or blocking others and locks that are
known to the lock manager
|
v$enqueue_statistics
|
contains details
about enqueue statistics in the instance
|
v$resource_limits
|
display enqueue
statistics
|
v$locked_object
|
contains information
about DML locks acquired by different transactions in databases with their
mode held
|
v$ges_statistics
|
contains
miscellaneous statistics for GES
|
v$ges_enqueue
|
contains information
about all locks known to the lock manager
|
v$ges_convert_local
|
contains information
about all local GES operations
|
v$ges_convert_remote
|
contains information
about all remote GES operations
|
v$ges_resource
|
contains information
about all resources known to the lock manager
|
v$ges_misc
|
contains information
about messaging traffic information
|
v$ges_traffic_controller
|
contains information
about the message ticket usage
|
Dynamic Resource Remastering
|
|
v$hvmaster_info
|
contains information
about current and previous master instances of GES resources in relation to
hash value ID of resource
|
v$gcshvmaster_info
|
the same as above
but globally
|
v$gcspfmaster_info
|
contains information
about current and previous masters about GCS resources belonging to files
mapped to a particular master, including the number of times the resource has
remastered
|
Cluster Interconnect
|
|
v$cluster_interconnects
|
contains information
about interconnects that are being used for cluster communication
|
v$configured_interconnects
|
same as above but
also contains interconnects that RAC is aware of that are not being used
|
Miscellaneous
|
|
v$service
|
services running on
an instance
|
x$kjmsdp
|
display LMS daemon
statistics
|
x$kjmddp
|
display LMD daemon
statistics
|
Parameters
|
|
cluster_interconnects
|
specifies a specific
IP address to use for the interconnect
|
_gcs_fast_config
|
enables fast
reconfiguration for gcs locks (true|false)
|
_lm_master_weight
|
controls which
instance will hold or (re)master more resources than others
|
_gcs_resources
|
controls the number
of resources an instance will master at a time
|
_lm_tickets
|
controls the number
of message tickets
|
_lm_ticket_active_sendback
|
controls the number
of message tickets (aggressive messaging)
|
_db_block_max_cr_dba
|
|
_fairness_threshold
|
used when too many
CR requests arrive for a particular buffer and the block becomes disowned
(see GRD)
|
_gc_affinity_time
|
specifies the interval
in minutes for remastering
|
_gc_affinity_limit
|
defines the number
of times an instance accesses the resource before remastering
|
_gc_affinity_minimum
|
defines the minimum
number of times an instance accesses the resource before remastering
|
_lm_file_affinity
|
disables dynamic
remastering for the objects belonging to those files
|
_lm_dynamic_remastering
|
enable or disable
remastering
|
_gc_defer_time
|
|
_lgwr_async_broadcast
|
Oracle RAC Daemons and Processes
|
||
OPROCd
|
Process Monitor
|
provides basic
cluster integrity services
|
EVMd
|
Event Management
|
spawns a child
process event logger and generates callouts
|
OCSSd
|
Cluster
Synchronization Services
|
basic node
membership, group services, basic locking
|
CRSd
|
Cluster Ready
Services
|
resource monitoring,
failover and node recovery
|
LMSn
|
Lock Manager Server
process - GCS
|
this is the cache
fusion part, it handles the consistent copies of blocks that are transferred
between instances. It receives requests from LMD to perform lock requests. It
rolls back any uncommitted transactions. There can be up to ten LMS processes
running and they can be started dynamically if demand requires it.
They manage lock
manager service requests for GCS resources and send them to a service queue
to be handled by the LMSn process. It also handles global deadlock detection
and monitors for lock conversion timeouts.
|
LMON
|
Lock Monitor Process
- GES
|
this process manages
the GES, it maintains consistency of GCS memory in case of process death. It
is also responsible for cluster reconfiguration and locks reconfiguration
(node joining or leaving), it checks for instance deaths and listens for
local messaging.
A detailed log file
is created that tracks any reconfigurations that have happened.
|
LMD
|
Lock Manager Daemon
- GES
|
this manages the
enqueue manager service requests for the GCS. It also handles deadlock detection
and remote resource requests from other instances.
|
LCK0
|
Lock Process - GES
|
manages instance
resource requests and cross-instance call operations for shared resources. It
builds a list of invalid lock elements and validates lock elements during recovery.
|
DIAG
|
Diagnostic Daemon
|
This is a
lightweight process, it uses the DIAG framework to monitor the health of the
cluster. It captures information for later diagnosis in the event of
failures. It will perform any necessary recovery if an operational hang is
detected.
|
Managing the Cluster
|
|
starting
|
/etc/init.d/init.crs
start
crsctl start crs |
stopping
|
/etc/init.d/init.crs
stop
crsctl stop crs |
enable/disable at
boot time
|
/etc/init.d/init.crs
enable
/etc/init.d/init.crs disable crsctl enable crs crsctl disable crs |
Managing the database configuration with SRVCTL
|
|
start all instances
|
srvctl start database
-d <database> -o <option>
Note: starts listeners if not already running, you can use the -o option to specify startup/shutdown options force open mount nomount |
stop all instances
|
srvctl stop database
-d <database> -o <option>
Note: the listeners are not stopped, you can use the -o option to specify startup/shutdown options immediate abort normal transactional |
start/stop
particular instance
|
srvctl [start|stop]
database -d <database> -i <instance>,<instance>
|
display the
registered databases
|
srvctl config database
|
status
|
srvctl status database
-d <database>
srvctl status instance -d <database> -i <instance>,<instance> srvctl status service -d <database> srvctl status nodeapps -n <node> srvctl status asm -n <node> |
stopping/starting
|
srvctl stop database
-d <database>
srvctl stop instance -d <database> -i <instance>,<instance> srvctl stop service -d <database> -s <service>,<service> -i <instance>,<instance> srvctl stop nodeapps -n <node> srvctl stop asm -n <node> srvctl start database -d <database> srvctl start instance -d <database> -i <instance>,<instance> srvctl start service -d <database> -s <service>,<service> -i <instance>,<instance> srvctl start nodeapps -n <node> srvctl start asm -n <node> |
adding/removing
|
srvctl add database -d
<database> -o <oracle_home>
srvctl add instance -d <database> -i <instance> -n <node> srvctl add service -d <database> -s <service> -r <preferred_list> srvctl add nodeapps -n <node> -o <oracle_home> -A <name|ip>/network srvctl add asm -n <node> -i <asm_instance> -o <oracle_home> srvctl remove database -d <database> -o <oracle_home> srvctl remove instance -d <database> -i <instance> -n <node> srvctl remove service -d <database> -s <service> -r <preferred_list> srvctl remove nodeapps -n <node> -o <oracle_home> -A <name|ip>/network srvctl asm remove -n <node> |
OCR utilities
|
|
log file
|
$ORA_CRS_HOME/log/<hostname>/client/ocrconfig_<pid>.log
|
checking
|
ocrcheck
Note: will return the OCR version, total space allocated, space used, free space, location of each device and the result of the integrity check |
dump contents
|
ocrdump -backupfile
<file>
Note: by default it dumps the contents into a file named OCRDUMP in the current directory |
export/import
|
ocrconfig -export
<file>
ocrconfig -import <file> |
backup/restore
|
# show backups
ocrconfig -showbackup # to change the location of the backup, you can even specify an ASM disk ocrconfig -backuploc <path|+asm> # perform a backup, will use the location specified by the -backuploc location ocrconfig -manualbackup # perform a restore ocrconfig -restore <file> # delete a backup ocrconfig -delete <file> Note: there are many more options so see the ocrconfig man page |
add/remove/replace
|
## add/relocate the
ocrmirror file to the specified location
ocrconfig -replace ocrmirror '/ocfs2/ocr2.dbf' ## relocate an existing OCR file ocrconfig -replace ocr '/ocfs1/ocr_new.dbf' ## remove the OCR or OCRMirror file ocrconfig -replace ocr ocrconfig -replace ocrmirror |
CRS Administration
CRS Administration
|
|
starting
|
## Starting CRS using
Oracle 10g R1
not possible
## Starting CRS using
Oracle 10g R2
$ORA_CRS_HOME/bin/crsctl start crs |
stopping
|
## Stopping CRS using
Oracle 10g R1
srvctl stop database -d <database> srvctl stop asm -n <node> srvctl stop nodeapps -n <node> /etc/init.d/init.crs stop ## Stopping CRS using Oracle 10g R2 $ORA_CRS_HOME/bin/crsctl stop crs |
disabling/enabling
|
## use to stop CRS
restarting after a reboot
## Oracle 10g R1 /etc/init.d/init.crs [disable|enable] ## Oracle 10g R2 $ORA_CRS_HOME/bin/crsctl [disable|enable] crs |
checking
|
$ORA_CRS_HOME/bin/crsctl
check crs
$ORA_CRS_HOME/bin/crsctl check evmd $ORA_CRS_HOME/bin/crsctl check cssd $ORA_CRS_HOME/bin/crsctl check crsd $ORA_CRS_HOME/bin/crsctl check install -wait 600 |
Resource Applications (CRS Utilities)
|
|
status
|
$ORA_CRS_HOME/bin/crs_stat
|
create profile
|
$ORA_CRS_HOME/bin/crs_profile
|
register/unregister
application
|
$ORA_CRS_HOME/bin/crs_register
$ORA_CRS_HOME/bin/crs_unregister |
Start/Stop an
application
|
$ORA_CRS_HOME/bin/crs_start
$ORA_CRS_HOME/bin/crs_stop |
Resource permissions
|
$ORA_CRS_HOME/bin/crs_getparam
$ORA_CRS_HOME/bin/crs_setparam |
Relocate a resource
|
$ORA_CRS_HOME/bin/crs_relocate
|
Nodes
|
|
member number/name
|
olsnodes -n
|
local node name
|
olsnodes -l
|
activates logging
|
olsnodes -g
|
Oracle Interfaces
|
|
display
|
oifcfg getif
|
delete
|
oifcfg delif -global
|
set
|
oifcfg setif -global
<interface name>/<subnet>:public
oifcfg setif -global <interface name>/<subnet>:cluster_interconnect |
Global Services Daemon Control
|
|
starting
|
gsdctl start
|
stopping
|
gsdctl stop
|
status
|
gsdctl status
|
Cluster Configuration (clscfg is used during installation)
|
|
create a new
configuration
|
clscfg -install
|
upgrade or downgrade
an existing configuration
|
clscfg -upgrade
clscfg -downgrade |
add or delete a node
from the configuration
|
clscfg -add
clscfg -delete |
create a special
single-node configuration for ASM
|
clscfg -local
|
brief listing of
terminology used in the other nodes
|
clscfg -concepts
|
used for tracing
|
clscfg -trace
|
help
|
clscfg -h
|
Cluster Name Check
|
|
print cluster name
|
cemutlo -n
Note: in Oracle 9i the utility was called "cemutls" |
print the clusterware
version
|
cemutlo -w
Note: in Oracle 9i the utility was called "cemutls" |
Node Scripts
|
|
Add Node
|
|
Delete Node
|
|
Enqueues
|
|
displaying
statistics
|
SQL> column
current_utilization heading current
SQL> column max_utilization heading max_usage SQL> column initial_allocation heading initial SQL> column resource_limit format a23; SQL> select * from v$resource_limit; |
Messaging (tickets)
|
|
ticket usage
|
select local_nid
local, remote_nid remote, tckt_avail avail, tckt_limit limit, snd_q_len
send_queue, tckt_wait waiting from v$ges_traffic_controller;
|
dump ticket
information
|
SQL> oradebug
setmypid
SQL> oradebug unlimit SQL> oradebug lkdebug -t |
Lightwork Rule and Fairness Threshold
|
|
downconvert
|
select cr_requests,
light_works, data_requests, fairness_down_converts from v$cr_block_server;
Note: lower the _fairness_threshold if the ratio goes above 40%, set to 0 if the instance is a query only instance. |
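As a minimal sketch of the note above (underscore parameters are hidden parameters, so change them only under the guidance of Oracle Support), the parameter could be set for a query-only instance like this:
SQL> alter system set "_fairness_threshold" = 0 scope=spfile sid='*';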
Remastering
|
|
force dynamic
remastering (DRM)
|
## Obtain the
OBJECT_ID from the below table
SQL> select * from v$gcspfmaster_info;
## Determine who
masters it
SQL> oradebug setmypid SQL> oradebug lkdebug -a <OBJECT_ID> ## Now remaster the resource SQL> oradebug setmypid SQL> oradebug lkdebug -m pkey <OBJECT_ID> |
GRD, SRVCTL, GSD and SRVCONFIG Tracing
|
|
Enable tracing
|
$ export
SRVM_TRACE=true
|
Disable tracing
|
$ export
SRVM_TRACE=""
|
Voting Disk
adding
|
crsctl add css
votedisk <file>
|
deleting
|
crsctl delete css
votedisk <file>
|
querying
|
crsctl query css
votedisk
|
What is RAC?
RAC stands for Real Application Clusters. It is a
clustering solution from Oracle Corporation that ensures high availability of
databases by providing instance failover and media failover features.
Mention the Oracle RAC software components:-
Oracle RAC is composed of two or more database
instances. They are composed of memory structures and background processes, the same
as a single-instance database. Oracle RAC instances use two services,
GES (Global Enqueue Service) and GCS (Global Cache
Service), that enable cache fusion.
Oracle RAC instances are composed of following
background processes:
ACMS—Atomic Controlfile to Memory Service (ACMS)
GTX0-j—Global Transaction Process
LMON—Global Enqueue Service Monitor
LMD—Global Enqueue Service Daemon
LMS—Global Cache Service Process
LCK0—Instance Enqueue Process
RMSn—Oracle RAC Management Processes (RMSn)
RSMN—Remote Slave Monitor
What is GRD?
GRD stands for Global Resource Directory. The GES
and GCS maintain records of the statuses of each datafile and each cached
block using the global resource directory. This process is referred to as cache
fusion and helps in data integrity.
Give Details on Cache Fusion:-
Oracle RAC is composed of two or more instances.
When a block of data is read from a datafile by an instance within the cluster
and another instance is in need of the same block, it is easy to get the block
image from the instance which has the block in its SGA rather than reading from
the disk. To enable inter-instance communication Oracle RAC makes use of the
interconnect. The Global Enqueue Service (GES) monitors and the Instance enqueue
process manages the cache fusion.
Give Details on ACMS:-
ACMS stands for Atomic Controlfile Memory Service.
In an Oracle RAC environment ACMS is an agent that ensures a distributed SGA
memory update, i.e. SGA updates are globally committed on success or globally
aborted in the event of a failure.
Give details on GTX0-j :-
The process provides transparent support for XA
global transactions in a RAC environment. The database autotunes the number of
these processes based on the workload of XA global transactions.
Give details on LMON:-
This process monitors global enqueues and resources
across the cluster and performs global enqueue recovery operations. This is
called as Global Enqueue Service Monitor.
Give details on LMD:-
This process is called as global enqueue service
daemon. This process manages incoming remote resource requests within each
instance.
Give details on LMS:-
This process is called as Global Cache service
process. This process maintains statuses of datafiles and each cached block by
recording information in a Global Resource Directory (GRD). This process also
controls the flow of messages to remote instances and manages global data block
access and transmits block images between the buffer caches of different
instances.This processing is a part of cache fusion feature.
Give details on LCK0:-
This process is called as Instance enqueue process.
This process manages non-cache fusion resource requests such as library and row
cache requests.
Give details on RMSn:-
This process is called as Oracle RAC management
process. These processes perform manageability tasks for Oracle RAC. Tasks
include creation of resources related to Oracle RAC when new instances are added
to the cluster.
Give details on RSMN:-
This process is called as Remote Slave Monitor. This
process manages background slave process creation and communication on remote
instances. This is a background slave process. This process performs tasks on
behalf of a co-ordinating process running in another instance.
What components in RAC must reside in shared
storage?
All datafiles, controlfiles, SPFILEs and redo log files
must reside on cluster-aware shared storage.
What is the significance of using cluster-aware
shared storage in an Oracle RAC environment?
All instances of an Oracle RAC database can access all the
datafiles, control files, SPFILEs and redo log files when these files are hosted
on cluster-aware shared storage, which is a group of shared disks.
Give few examples for solutions that support cluster
storage:-
ASM (Automatic Storage Management), raw disk devices,
network file system (NFS), OCFS2 and OCFS (Oracle Cluster File System).
What is an interconnect network?
An interconnect network is a private network that
connects all of the servers in a cluster.
The interconnect network uses a switch/multiple
switches that only the nodes in the cluster can access.
How can we configure the cluster interconnect?
Configure User Datagram Protocol(UDP) on Gigabit
ethernet for cluster interconnect.
On Unix and Linux systems UDP and RDS
(Reliable Datagram Sockets) protocols are used by Oracle Clusterware. Windows
clusters use the TCP protocol.
Can we use crossover cables with Oracle Clusterware
interconnects?
No, crossover cables are not supported with Oracle Clusterware
interconnects.
What is the use of cluster interconnect?
Cluster interconnect is used by the Cache fusion for
inter instance communication.
How do users connect to database in an Oracle RAC
environment?
Users can access a RAC database using a client/server
configuration or through one or more middle tiers, with or without connection
pooling. Users can use oracle services feature to connect to database.
What is the use of a service in Oracle RAC
environment?
Applications should use the services feature to
connect to the Oracle database. Services enable us to define rules and
characteristics to control how users and applications connect to database
instances.
What are the characteristics controlled by Oracle
services feature?
The characteristics include a unique name, workload
balancing and failover options, and high availability characteristics.
Which enable the load balancing of applications in
RAC?
Oracle Net Services enable the load balancing of
application connections across all of the instances in an Oracle RAC database.
What is a virtual IP address or VIP?
A virtual IP address or VIP is an alternate IP
address that the client connections use instead of the standard public IP
address. To configure a VIP address, we need to reserve a spare IP address for
each node, and the IP addresses must use the same subnet as the public network.
What is the use of VIP?
If a node fails, then the node's VIP address fails
over to another node on which the VIP address can accept TCP connections but it
cannot accept Oracle connections.
Give situations under which VIP address failover
happens:-
VIP address failover happens when the node on
which the VIP address runs fails, when all interfaces for the VIP address fail, or when all
interfaces for the VIP address are disconnected from the network.
What is the significance of VIP address failover?
When a VIP address failover happens, clients that
attempt to connect to the VIP address receive a rapid connection refused error.
They don't have to wait for TCP connection timeout messages.
What are the administrative tools used for Oracle
RAC environments?
An Oracle RAC cluster can be administered as a single
image using OEM (Enterprise Manager), SQL*Plus, Server Control (SRVCTL),
Cluster Verification Utility (CVU), DBCA and NETCA.
How do we verify that RAC instances are running?
Issue the following query from any one node
connecting through SQL*PLUS.
$connect sys/sys as sysdba
SQL>select * from V$ACTIVE_INSTANCES;
The query gives the instance number under the
INST_NUMBER column and host_name:instance_name under the
INST_NAME column.
What is FAN?
Fast application Notification as it abbreviates to
FAN relates to the events related to instances, services and nodes. This is a
notification mechanism that Oracle RAC uses to notify other processes about the
configuration and service level information that includes service status
changes such as, UP or DOWN events. Applications can respond to FAN events and
take immediate action.
Where can we apply FAN UP and DOWN events?
FAN UP and FAN DOWN events can be applied to
instances, services and nodes.
State the use of FAN events in case of a cluster
configuration change?
During times of cluster configuration changes, the Oracle
RAC high availability framework publishes a FAN event immediately when a state
change occurs in the cluster. So applications can receive FAN events and react
immediately. This prevents applications from polling the database and detecting a
problem after such a state change.
Why should we have separate homes for the ASM instance?
It is a good practice to have the ASM home separate from
the database home (ORACLE_HOME). This helps in upgrading and patching ASM and the
Oracle database software independent of each other. Also, we can deinstall the
Oracle database software independent of the ASM instance.
What is the advantage of using ASM?
Having ASM is the Oracle recommended storage option
for RAC databases as the ASM maximizes performance by managing the storage
configuration across the disks. ASM does this by distributing the database files
across all of the available storage within our cluster database environment.
What is rolling upgrade?
It is a new ASM feature from Database 11g. ASM
instances in Oracle database 11g releases (from 11.1) can be upgraded or patched
using the rolling upgrade feature. This enables us to patch or upgrade ASM nodes in
a clustered environment without affecting database availability. During a
rolling upgrade we can maintain a functional cluster while one or more of the
nodes in the cluster are running in different software versions.
Can rolling upgrade be used to upgrade from 10g to
11g database?
No, it can be used only for Oracle database 11g
releases(from 11.1).
State the initialization parameters that must have
same value for every instance in an Oracle RAC database:-
Some initialization parameters are critical at the
database creation time and must have same values. Their value must be specified
in SPFILE or PFILE for every instance. The list of parameters that must be
identical on every instance are given below:
ACTIVE_INSTANCE_COUNT
ARCHIVE_LAG_TARGET
COMPATIBLE
CLUSTER_DATABASE
CLUSTER_DATABASE_INSTANCES
CONTROL_FILES
DB_BLOCK_SIZE
DB_DOMAIN
DB_FILES
DB_NAME
DB_RECOVERY_FILE_DEST
DB_RECOVERY_FILE_DEST_SIZE
DB_UNIQUE_NAME
INSTANCE_TYPE (RDBMS or ASM)
PARALLEL_MAX_SERVERS
REMOTE_LOGIN_PASSWORD_FILE
UNDO_MANAGEMENT
Can the DML_LOCKS and RESULT_CACHE_MAX_SIZE be
identical on all instances?
These parameters can be identical on all instances
only if these parameter values are set to zero.
What two parameters must be set at the time of
starting up an ASM instance in a RAC
environment?
The parameters CLUSTER_DATABASE and INSTANCE_TYPE
must be set.
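As a minimal sketch of an ASM instance parameter file in a RAC environment (the disk group names are illustrative assumptions):
*.instance_type='ASM'
*.cluster_database=true
*.asm_diskgroups='DATA','FRA'
*.asm_power_limit=1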
Mention the components of Oracle clusterware:-
Oracle clusterware is made up of components like
voting disk and Oracle Cluster Registry(OCR).
What is a CRS resource?
Oracle clusterware is used to manage
high-availability operations in a cluster. Anything that Oracle Clusterware
manages is known as a CRS resource. Some examples of CRS resources are a
database, an instance, a service, a listener, a VIP address, an application process
etc.
What is the use of OCR?
Oracle clusterware manages CRS resources based on
the configuration information of CRS resources stored in OCR(Oracle Cluster
Registry).
How does a Oracle Clusterware manage CRS resources?
Oracle clusterware manages CRS resources based on
the configuration information of CRS resources stored in OCR(Oracle Cluster
Registry).
Name some Oracle clusterware tools and their uses?
OIFCFG - allocating and deallocating network
interfaces
OCRCONFIG - Command-line tool for managing Oracle
Cluster Registry
OCRDUMP - Dumps the contents of the Oracle Cluster Registry to a file for viewing
CVU - Cluster Verification Utility to verify the cluster setup and the health of
CRS components
What are the modes of deleting instances from Oracle
Real Application Cluster databases?
We can delete instances using silent mode or
interactive mode using DBCA(Database Configuration Assistant).
How do we remove ASM from an Oracle RAC environment?
We need to stop and delete the instance in the node
first in interactive or silent mode. After that ASM can be removed using the srvctl
tool as follows:
srvctl stop asm -n node_name
srvctl remove asm -n node_name
We can verify if ASM has been removed by issuing the
following command:
srvctl config asm -n node_name
How do we verify that an instance has been removed
from OCR after deleting an instance?
Issue the following srvctl command:
srvctl config database -d database_name
cd CRS_HOME/bin
./crs_stat
How do we verify an existing current backup of OCR?
We can verify the current backup of OCR using the
following command :
ocrconfig -showbackup
What are the performance views in an Oracle RAC
environment?
We have V$ views that are instance specific. In
addition we have GV$ views, called global views, that have an INST_ID column of
numeric data type. GV$ views obtain information from the individual V$ views.
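For example, a quick cluster-wide check of instance status using a GV$ view:
SQL> select inst_id, instance_name, host_name, status from gv$instance order by inst_id;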
What are the types of connection load-balancing?
There are two types of connection
load-balancing: server-side load balancing and client-side load balancing.
What is the difference between server-side and
client-side connection load balancing?
Client-side load balancing happens at the client side, where the client chooses
from the list of listener addresses in its connect descriptor. In case of server-side load balancing the
listener uses the load-balancing advisory to redirect connections to the instance
providing the best service.
Give the usage of srvctl:-
srvctl start instance -d db_name -i "inst_name_list" [-o start_options]
srvctl stop instance -d name -i "inst_name_list" [-o stop_options]
srvctl stop instance -d orcl -i "orcl3,orcl4" -o immediate
srvctl start database -d name [-o start_options]
srvctl stop database -d name [-o stop_options]
srvctl start database -d orcl -o mount
Top 4 Reasons for
Node Reboot or Node Eviction in Real Application Cluster (RAC) Environment
Remote DBAs usually face Node
Reboots or Node Evictions in a Real Application Cluster environment.
A Node Reboot is performed by CRS to maintain consistency in the cluster
environment by removing a node which is facing some critical issue.
A critical problem could be a node not responding via
a network heartbeat, a node not responding via a disk heartbeat, a hung or
severely degraded machine, or a hung ocssd.bin process, etc. There could be many more reasons for node eviction but some
of them are common and repetitive. Here, I am listing:-
Top 4 Reasons for Node Reboot or Node Eviction in Real Application Cluster (RAC) Environment:
Whenever a Database Administrator faces a Node Reboot issue, the first thing to look at should be /var/log/messages and the OS Watcher logs of the Database Node which was rebooted.
/var/log/messages will give you an actual picture of the reboot: exact time of the restart, status of resources like swap and RAM, etc.
1. High Load on Database
Server: Out of 100 issues, I have seen that 70 to 80
times high load on the system was the reason for the node eviction. One common
scenario is that due to high load the RAM and swap space of the DB node get exhausted and the
system stops working and finally reboots.
So, every time you see a node eviction, start the investigation with /var/log/messages and analyze the OS Watcher logs. Below is a situation when a Database Node was rebooted due to high load.
/var/log/messages output from a Database Node just before Node eviction:
Apr 23 08:15:04 remotedb06 kernel: Node 0 DMA:
2*4kB 1*8kB 0*16kB 1*32kB 2*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB
3*4096kB = 15792kB
Apr 23 08:15:04 remotedb06 kernel: Node 0 DMA32:
150*4kB 277*8kB 229*16kB 186*32kB 51*64kB 65*128kB 82*256kB 13*512kB 3*1024kB
3*2048kB 78*4096kB = 380368kB
Apr 23 08:15:04 remotedb06 kernel: Node 0 Normal:
12362*4kB 58*8kB 0*16kB 0*32kB 0*64kB 0*128kB 2*256kB 1*512kB 0*1024kB 1*2048kB
0*4096kB = 52984kB
Apr 23 08:15:09 remotedb06 kernel: 83907 total
pagecache pages
Apr 23 08:15:11 remotedb06 kernel: 39826 pages in
swap cache
Apr 23 08:15:33 remotedb06 kernel: osysmond.bin:
page allocation failure. order:4, mode:0xd0
From the above messages, we can see that
this system has only 4kB of free swap out of 24G of swap space. This means the system
has neither RAM nor swap available for processing, which causes a reboot. This picture is
also clear from the OS Watcher output for the system.
08:15:29   CPU   %user   %nice   %sys    %iowait   %irq   %soft   %steal   %idle   intr/s
08:15:29   all   0.67    0.00    82.60   16.60     0.00   0.06    0.00     0.07    42521.98
08:17:08   all   1.69    0.00    67.02   30.93     0.00   0.06    0.00     0.29    37316.55
How to avoid a Node Reboot due to High Load?
The simple and best way to avoid this
is to use Oracle Database Resource
Manager (DBRM). DBRM helps to resolve this by giving the database more
control over hardware resources and their allocation. The DBA should
set up resource consumer groups and a resource plan and should use
them as per requirements. On an Exadata system
the Exadata DBA can use IORM to set up resource allocation among multiple database
instances.
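As a minimal sketch of such a resource plan (the group and plan names are illustrative assumptions, and sessions still have to be mapped to the consumer group, for example with DBMS_RESOURCE_MANAGER.SET_CONSUMER_GROUP_MAPPING):
BEGIN
  DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA();
  DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP('BATCH_GROUP', 'low priority batch work');
  DBMS_RESOURCE_MANAGER.CREATE_PLAN('DAYTIME_PLAN', 'limit batch work during the day');
  -- give batch work a small CPU share, everything else gets the rest
  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE('DAYTIME_PLAN', 'BATCH_GROUP', 'cap batch CPU', mgmt_p1 => 20);
  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE('DAYTIME_PLAN', 'OTHER_GROUPS', 'everything else', mgmt_p1 => 80);
  DBMS_RESOURCE_MANAGER.VALIDATE_PENDING_AREA();
  DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA();
END;
/
alter system set resource_manager_plan = 'DAYTIME_PLAN' sid='*';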
2. Voting Disk not
Reachable: Another reason for a Node Reboot is that the clusterware is not
able to access a minimum number of the voting files. When the node
aborts for this reason, the node alert log will
show the CRS-1606 error.
Here is a scenario for voting disk not reachable:
2013-01-26 10:15:47.177
[cssd(3743)]CRS-1606:The number of voting files
available, 1, is less than the minimum number of voting files required, 2,
resulting in CSSD
termination to ensure data integrity; details at
(:CSSNM00018:) in /u01/app/11.2.0/grid/log/apdbc76n1/cssd/ocssd.log
There could be two reasons for this issue:
A. Connection to the voting disk is interrupted.
B. If only one voting disk is in use and the version is less than 11.2.0.3.4, you are hitting known bug 13869978.
How to Solve a Voting Disk Outage?
There could be many reasons why the voting disk is not reachable; here are a few general approaches for the DBA to follow.
1. Use command "crsctl query css
votedisk" on a node where clusterware is up to get a list of all the voting
files.
2. Check that each node can access
the devices underlying each voting file.
3. Check for permissions to each
voting file/disk have not been changed.
4. Check OS, SAN, and storage logs
for any errors from the time of the incident.
5. Apply fix for 13869978
if only one voting disk is in use. This is fixed in 11.2.0.3.4 patch set and
above, and 11.2.0.4 and above
If any voting files or underlying
devices are not currently accessible from any node, work with storage
administrator and/or system administrator to resolve it at storage and/or
OS level.
3. Missed Network Connection between Nodes: In technical terms this is called a Missed Network Heartbeat (NHB). Whenever there is a communication gap or no communication between nodes on the private network (interconnect) due to a network outage or some other reason, a node aborts itself to avoid a "split brain" situation. The most common (but not exclusive) cause of a missed NHB is network problems communicating over the private interconnect.
Suggestions to troubleshoot a Missed Network Heartbeat:
1. Check OS statistics from the
evicted node from the time of the eviction. DBA can use OS Watcher to look at
OS Stats at time of issue, check oswnetstat and oswprvtnet for network related
issues.
2. Validate the interconnect network
setup with the Help of Network administrator.
3. Check communication over the
private network.
4. Check that the OS network settings
are correct by running the RAC check tool.
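As a quick, hedged illustration of steps 2 and 3 (the interface name eth1 and the host name rac2-priv are assumptions for this sketch), the interconnect definition and basic connectivity can be checked with:
$ oifcfg getif                                   # confirm which interface is registered as cluster_interconnect
$ /sbin/ifconfig eth1                            # check the private interface is up and shows no errors
$ ping -c 5 rac2-priv                            # basic reachability of the other node's private address
$ netstat -s | grep -i 'packet receive errors'   # look for UDP receive errors on Linux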
4. Database Or ASM Instance
Hang: Sometimes a Database or ASM instance hang can cause a Node reboot. In
these cases the Database instance hangs and is terminated afterwards,
which causes either a cluster reboot or a node eviction. The DBA should check the
alert logs of the Database and ASM instances for any hang situation which might cause
this issue.
Database Alert log file entry for Database Hang Situation:
Wed Jun 05 03:25:01 2013
Sweep [inc][122970]: completed
System State dumped to trace file
/u01/app/oracle/diag/rdbms/remotedb/remotedb2/incident/incdir_122970/remotedb2_m005_118200_i122970_a.trc
Wed Jun 05 03:25:33 2013
DSKM process appears to be hung. Initiating system
state dump.
Incident details in:
/u01/app/oracle/diag/rdbms/remotedb/remotedb2/incident/incdir_122897/remotedb2_pmon_9707_i122897.trc
Use ADRCI or Support Workbench to package the
incident.
System State dumped to trace file
/u01/app/oracle/diag/rdbms/remotedb/remotedb2/trace/remotedb2_diag_9759.trc
Instance terminated by PMON, pid = 9707
Wed Jun 05 03:45:58 2013
Starting ORACLE instance (normal)
At the same time resources at the Cluster
level start failing and the node was evicted. Real Application Cluster
log files:
[/u01/app/11.2.0.3/grid/bin/oraagent.bin(96437)]CRS-5011:Check
of resource "PCAB1DE" failed: details at "(:CLSN00007:)" in
"/u01/app/11.2.0.3/grid/log/exs2db04/agent/crsd/oraagent_oracle/oraagent_oracle.log"
2013-06-05 03:27:06.649
Details at (:CRSAGF00113:) {0:6:29} in
/u01/app/11.2.0.3/grid/log/exs2db04/agent/crsd/oraagent_oracle/oraagent_oracle.log.
[cssd(67762)]CRS-1608:This node was evicted
by node 3, exs2db03; details at (:CSSNM00005:) in /u01/app/11.2.0.3/grid/log/exs2db04/cssd/ocssd.log.
2013-06-05 03:37:26.647
[cssd(67762)]CRS-1608:This node was evicted by
node 3, exs2db03; details at (:CSSNM00005:) in
This was not because of high load, since OS Watcher
at this time doesn't show any load.
So, I believe this could be due to some bug in the database.
In a few of the cases, bugs
could be the reason for the node reboot; the bug may be at the Database level, ASM
level or Real Application Cluster level. Here, after the initial investigation
from the Database Administrator side, the DBA should open an SR with Oracle Support.
What is Oracle RAC Node Eviction?
One of the most common and complex issues is the node eviction issue.
A node is evicted from the cluster after it kills itself because it is not able
to service the applications.
This generally happens during a communication failure between
the instances, when an instance is not able to send heartbeat information to
the control file, and for various other reasons.
Oracle Clusterware is designed to perform a node eviction by
removing one or more nodes from the cluster if some critical problem is
detected. A critical problem could be a node not responding via a network
heartbeat, a node not responding via a disk heartbeat, a hung or severely
degraded machine, or a hung ocssd.bin process. The purpose of this node
eviction is to maintain the overall health of the cluster by removing bad
members.
During failures, to avoid data corruption, the failing instance
evicts itself from the cluster group. The node eviction process is reported as
Oracle error ORA-29740 in the alert log and LMON trace files.
Oracle
RAC: Node evictions & 11gR2 node eviction means restart of cluster stack
not reboot of node
Cluster integrity and cluster
membership are governed by ocssd (the Oracle Cluster Synchronization Services daemon),
which monitors the nodes using 2 communication channels:
- Private
Interconnect aka Network Heartbeat
- Voting Disk based communication aka Disk Heartbeat
Network heartbeat:-
Each node in the cluster is
“pinged” every second
- Nodes must
respond in css_misscount time (defaults to 30 secs.)
– Reducing the css_misscount time is generally not supported - Network
heartbeat failures will lead to node evictions
- CSSD-log:
[date / time] [CSSD][1111902528]
clssnmPollingThread: node mynodename (5) at 75% heartbeat fatal, removal in 6.7 sec
Disk Heartbeat:-
Each node in the cluster
“pings” (r/w) the Voting Disk(s) every second
- Nodes must
receive a response in (long / short) diskTimeout time
– IF I/O errors indicate clear accessibility problems → timeout is irrelevant - Disk
heartbeat failures will lead to node evictions
- CSSD-log: …
[CSSD] [1115699552] >TRACE: clssnmReadDskHeartbeat:node(2) is down. rcfg(1) wrtcnt(1) LATS(63436584) Disk lastSeqNo(1)
Now, we know the above
possibilities (network and disk heartbeat failures can lead to node eviction, but
sometimes the server/ocssd/a resource request also makes the node get evicted,
which are extreme cases)
Why nodes should be
evicted?
Evicting (fencing) nodes is a
preventive measure (it’s a good thing)!
- Nodes are
evicted to prevent consequences of a split brain:
– Shared data must not be written by independently operating nodes
– The easiest way to prevent this is to forcibly remove a node from the cluster
How are nodes evicted?
– STONITH
Once it is determined that a node needs to be evicted,
- A “kill
request” is sent to the respective node(s)
– Using all (remaining) communication channels - A node
(CSSD) is requested to “kill itself” – “STONITH like”
– “STONITH” foresees that a remote node kills the node to be evicted
EXAMPLE: Voting Disk Failure
Voting Disks and heartbeat communication
are used to determine which nodes survive
- In a 2 node
cluster, the node with the lowest node number should survive
- In a n-node
cluster, the biggest sub-cluster should survive (votes based)
EXAMPLE: Network
heartbeat failure
- The network
heartbeat between nodes has failed
– It is determined which nodes can still talk to each other
– A “kill request” is sent to the node(s) to be evicted - Using all
(remaining) communication channels → Voting Disk(s)
- A node is
requested to “kill itself”; executer: typically CSSD
EXAMPLE: What if CSSD
is stuck or server itself is not responding?
A node is requested to “kill
itself”
- BUT CSSD is
“stuck” or “sick” (does not execute) – e.g.:
– CSSD failed for some reason
– CSSD is not scheduled within a certain margin
OCSSDMONITOR (was: oprocd) will take
over and execute
EXAMPLE: A cluster member (RAC instance) can request that another member (RAC instance) be killed
A cluster member (RAC instance) can request that another member be killed in order to protect data integrity, for example when the control file progress record is not written properly by the failing instance. In such cases ocssd first tries to kill that member; if that is not possible, it tries to evict the node.
11gR2 Changes
Important: starting with 11.2.0.2, fencing (eviction) does not necessarily mean a reboot.
- Until Oracle Clusterware 11.2.0.2, fencing (eviction) meant “re-boot”
- With Oracle Clusterware 11.2.0.2, re-boots will be seen less often, because:
– Re-boots affect applications that might run on a node but are not protected by the clusterware
– Customer requirement: prevent a reboot, just stop the cluster – implemented
How does this work?
With Oracle Clusterware
11.2.0.2, re-boots will be seen less: Instead of fast re-booting the node, a
graceful shutdown of the cluster stack is attempted
- It
starts with a failure – e.g. network heartbeat or interconnect failure
- Then
IO issuing processes are killed; it is made sure that no IO process
remains
– For a RAC DB, mainly the log writer and the database writer are of concern
- Once all IO issuing processes are killed, the remaining processes are stopped
– If the check for a successful kill of the IO processes fails → reboot
- Once all remaining processes are stopped, the stack stops itself with a “restart flag”
- OHASD
will finally attempt to restart the stack after the graceful shutdown
Exception to above:-
- IF
the check for a successful kill of the IO processes fails → reboot
- IF
CSSD gets killed during the operation → reboot
- IF
cssdmonitor (oprocd replacement) is not scheduled → reboot
- IF
the stack cannot be shutdown in “short_disk_timeout”-seconds → reboot
IMPORTANT
PARAMETER
By setting CLUSTER_DATABASE=FALSE, you are
configuring your system to be a single-node database. This in turn enables you
to 'MOUNT EXCLUSIVE,' allowing you to switch the archive mode for the instance.
You need to set this parameter value back to TRUE to re-establish your RAC configuration after the archiving mode has been changed.
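A minimal sketch of this procedure for enabling archivelog mode, assuming a database named racdb (the name and exact ordering are illustrative, not taken from the text above):
-- from one instance, set the parameter in the spfile
alter system set cluster_database=false scope=spfile sid='*';
-- stop the database on all nodes
$ srvctl stop database -d racdb
-- from one instance, mount exclusively and switch the archiving mode
startup mount
alter database archivelog;
alter system set cluster_database=true scope=spfile sid='*';
shutdown immediate
-- restart the RAC database on all nodes
$ srvctl start database -d racdb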
Managing Oracle Clusterware
Ongoing management of Oracle Clusterware is achieved
by using the crsctl and srvctl command-line utilities installed under the
Oracle Grid Infrastructure home directory. Oracle Clusterware components and
resources can be monitored and managed from any node in the cluster by using
crsctl. The srvctl utility provides similar monitoring and management
capabilities for Oracle-related resources such as database instances and
database services. Both utilities are provided with Oracle Clusterware.
However, most crsctl commands are available only to clusterware administrators,
whereas srvctl commands are available to other groups such as database
administrators.
crsctl
manages clusterware-related operations:
— Starting and stopping Oracle Clusterware
— Enabling and disabling Oracle Clusterware daemons
— Registering cluster resources
srvctl
manages Oracle resource–related operations:
— Starting and stopping database instances and
database services
Controlling Oracle Clusterware
The crsctl utility can be used to control Oracle
Clusterware.
To start or stop Oracle Clusterware on a specific
node:
# crsctl start crs
# crsctl stop crs
To enable or disable Oracle
Clusterware on a specific
node:
# crsctl enable crs
# crsctl disable crs
Controlling Oracle Clusterware
When a node that contains Oracle
Clusterware is started, the Oracle Clusterware wrapper
script is automatically started
by the /etc/init.d/ohasd startup script. When the
crsctl utility is used to disable
Cluster Ready Services (CRS) from automatically starting,
state information related to
startup is placed in the SCLS_SCR control files, preventing
automatic startup on machine
reboot. To check the status of CRS, use the following syntax:
# crsctl check crs
CRS-4638: Oracle High
Availability Services is online
CRS-4537: Cluster Ready Services
is online
CRS-4529: Cluster Synchronization
Services is online
CRS-4533: Event Manager is online
You may have to manually control
the Oracle Clusterware stack while applying patches or
during planned outages. You can
stop Oracle Clusterware by using the crsctl stop crs
command and start it by using the crsctl start crs
command.
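For example, during a planned outage or while patching the Grid Infrastructure home, a typical per-node sequence (run as root) would be:
# crsctl stop crs
...apply the patch or perform the maintenance on the node...
# crsctl start crs
# crsctl check crs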
Verifying the Status of Oracle
Clusterware
The crsctl utility can be used to
verify the status of Oracle Clusterware.
To determine the overall health on a specific node:
$ crsctl check crs
CRS-4638: Oracle High
Availability Services is online
CRS-4537: Cluster Ready Services
is online
CRS-4529: Cluster Synchronization
Services is online
CRS-4533: Event Manager is online
To check the viability of Cluster
Synchronization Services
(CSS) across nodes:
$ crsctl check cluster
CRS-4537: Cluster Ready Services
is online
CRS-4529: Cluster Synchronization
Services is online
CRS-4533: Event Manager is online
Verifying the Status of Oracle
Clusterware
The crsctl utility can be used to
verify the status of Oracle Clusterware on specific nodes and across nodes. In
contrast to the crsctl controlling commands that require root access (shown previously), the check commands do not require root and may be
executed by the Oracle Clusterware software owner. The overall health of the
clusterware on a specific node can be obtained by using the crsctl check crs
command. It is possible to target three of the individual daemons by using the
crsctl check <daemon> command for the crsd, evmd, and cssd daemons only.
These commands are processed only on the node on which they are executed. To
check the viability of Cluster Synchronization Services (CSS) across all nodes,
use the crsctl check cluster command. The output of the overall health check
performed on a specific node is shown in the slide.
crsctl check crs <<-- for the local node
crsctl check cluster <<-- for remote nodes in the cluster
crs_start
This command sets all the applications or the
specified applications ONLINE, and attempts to
start the specified registered applications or application resources. You
should have full admin privileges for this.
crs_stop
This command stops a resource on
the specified node. Checking your crs_stat always helps to verify which
application/resource we want to stop.
crs_stat -t
The crs_stat command lists the registered applications and resources along with their status. It is useful for checking the RAC resources and for identifying which applications you may need to shut down manually in order to shut the whole cluster down and restart it.
$ORA_CRS_HOME/bin/crs_stat -t
Note: We cannot stop all the RAC resources with crs_stop -all alone; it will return some errors. Therefore we need to stop our instances first.
Example:
srvctl
stop instance -d fokerac -i fokerac1
srvctl stop instance -d fokerac -i fokerac2
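Once the instances are down, the remaining registered resources can be stopped (a sketch following the 10g example above, reusing the same fokerac database):
$ORA_CRS_HOME/bin/crs_stop -all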
Difference on
using SRVCTL and CRSCTL commands:
Use
SRVCTL to manage Oracle-supplied resources such as
Listener,
Instances,
Disk groups,
Networks.
Use CRSCTL for managing Oracle Clusterware and its resources.
Oracle strongly discourages directly manipulating Oracle-supplied resources (resources whose names begin with ora) using CRSCTL. This could adversely impact the cluster configuration.
If the resource name begins with ora, then use SRVCTL.
How to check the status of CRS on specific node?
[ora11g@racnode1 ~]$ crsctl check crs
Note: the crsctl command is located in the Grid Infrastructure home (GRID_HOME/bin).
How
to check the status of services on all nodes:
[ora11g@racnode1 ~]$ srvctl status nodeapps
How
to check the status of complete clusterware stack on all nodes:
[ora11g@racnode1 ~]$ crsctl status resource -t
How
to check the status of clusterware servers:
[ora11g@racnode1 ~]$ crsctl status server -f
How to stop crs on all nodes of the cluster:
[root@racnode1 bin]# ./crsctl stop cluster -all
How
to stop crs on specific node:
[root@racnode1 bin]# ./crsctl stop crs
How
to start crs on specific node:
[root@racnode1 bin]# ./crsctl start crs
How to check the status of specific Instance across
cluster nodes:
[ora11g@racnode1 ~]$ srvctl status instance -d racdb
-i racdb1
How to check the status of all instances across the
cluster nodes:
[ora11g@racnode1 ~]$ srvctl status instance -d racdb
-i racdb1,racdb2,racdb3,racdb4
How to start specific Instance across cluster nodes:
[ora11g@racnode1 ~]$ srvctl start instance -d racdb
-i racdb4
How to check the status of database across cluster
nodes:
[ora11g@racnode1 ~]$ srvctl status database -d racdb
How to stop cluster database:
[ora11g@racnode1 ~]$ srvctl stop database -d racdb
-- Start/ Stop database / instance services status:
From oracle user
1) Verify instance status
$ srvctl status instance -d EHISKOL -i EHISKOL2
Instance EHISKOL2 is running on node ekrac2
2) To Stop instance
$ srvctl stop instance -d EHISKOL -i EHISKOL2
3) To start instance
$ srvctl start instance -d EHISKOL -i EHISKOL2
-- If the below message is found, then the instance services might already be running.
PRCC-1015 : EHISKOL was already running on ekrac2
PRCR-1004 : Resource ora.ehiskol.db is already
running
4) To stop all instance services across the nodes:
$ srvctl stop database -d EHISKOL
-- Start/ Stop SCAN listeners in RAC databases:
1) Verify SCAN listener status:
$ srvctl status scan_listener
SCAN Listener LISTENER_SCAN1 is enabled
SCAN listener LISTENER_SCAN1 is running on node
ekrac2
SCAN Listener LISTENER_SCAN2 is enabled
SCAN listener LISTENER_SCAN2 is running on node
ekrac1
SCAN Listener LISTENER_SCAN3 is enabled
SCAN listener LISTENER_SCAN3 is running on node
ekrac1
$ srvctl status listener
Listener LISTENER is enabled
Listener LISTENER is running on node(s):
ekrac2,ekrac1
Note: Assume 3 scan IP configured with DNS.
When single / default IP configured with SCAN
listener, then status will be shown like below:
$ srvctl status scan_listener
SCAN Listener LISTENER_SCAN1 is enabled
SCAN listener LISTENER_SCAN1 is running on node
host1
2) To stop the scan listener:
$ srvctl stop scan_listener
3) To start the scan listener
$ srvctl start scan_listener
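If a SCAN listener needs to be moved back to a particular node (for example after a failover), it can be relocated with srvctl; the ordinal number and node name below are illustrative:
$ srvctl relocate scan_listener -i 1 -n ekrac1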
1) Verify Cluster services / CRS status
(for a single node)
# ./crsctl check crs
CRS-4638: Oracle High Availability Services is
online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
(for all nodes)
# crsctl check cluster -all
**************************************************************
ekrac1:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
2) To stop Cluster services / CRS (single node)
./crsctl stop crs
TOP 5 ISSUES WITH SCAN
VIP AND LISTENER
ISSUE 1 :
SCAN VIP showing
status UNKNOWN – CHECK TIMED OUT
crsctl stat res -t
ora.scan1.vip 1 ONLINE UNKNOWN rac2 CHECK TIMED OUT
ora.scan2.vip 1 ONLINE ONLINE rac1
ora.scan3.vip 1 ONLINE ONLINE rac1
CAUSE :
/etc/resolv.conf is not consistent across all nodes.
SOLUTION :
Have
/etc/resolv.conf consistent across all nodes
Reduce the
value for lookup timeout so that the total lookup time is less than the check
timeout for the VIP resource, e.g. add to /etc/resolv.conf: options timeout:1
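An illustrative /etc/resolv.conf with the reduced lookup timeout (the domain and name server addresses are placeholders):
options timeout:1
search example.com
nameserver 192.168.56.10
nameserver 192.168.56.11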
ISSUE 2 :
What to check on SCAN
VIP’s before installing GRID and what to check after installation .
BEFORE INSTALLATION
/etc/resolv.conf
must be consistent across all nodes
SCAN
should have the netmask already used by public interface
“nslookup scan-name” should return the name server and 3 IP addresses (see the illustrative output after this list)
If
tcp.validnode_checking=yes in GRID’s home sqlnet.ora, then TCP.INVITED_NODES
will require the SCAN VIP’s and node
VIP’s to be added to the Grid Infrastructure sqlnet.ora file.
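An illustrative nslookup result for a SCAN name resolving to three addresses (the name, server and IPs are placeholders):
$ nslookup rac-scan.example.com
Server:         192.168.56.10
Address:        192.168.56.10#53
Name:   rac-scan.example.com
Address: 192.168.56.21
Name:   rac-scan.example.com
Address: 192.168.56.22
Name:   rac-scan.example.com
Address: 192.168.56.23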
AFTER INSTALLATION
crsctl status resource -w "TYPE=ora.scan_vip.type" --- it should show LISTENER_SCAN<X> ONLINE
Also use,
srvctl
config scan
srvctl
config scan_listener
cluvfy
comp scan
ISSUE 3 :
Service not registered
with SCAN listener after SCAN listener failover .
DESCRIPTION :
After
SCAN VIP and SCAN listener failover, instance does not register with SCAN
listener. It might happen for only 1 SCAN listener.
Client connection gets
intermittent ORA-12514 TNS:listener does not currently know of service
requested in connect descriptor.
SOLUTION :
For the above issue, the workaround is to unregister and re-register the remote listener on the database instance that does not register with a SCAN listener, with the following steps:
show parameter remote_listener;
alter system set remote_listener=' ';
alter system register;
alter system set remote_listener='<scan>:<port>';
alter system register;
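As a concrete illustration of the last two statements, with a hypothetical SCAN name and the default port (these values are not from the original note):
alter system set remote_listener='rac-scan.example.com:1521';
alter system register;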
ISSUE 4 :
SCAN VIPs do not fail over when the public network is down.
DESCRIPTION
:
SCAN VIPs should fail over to the next node when the public network is down. On 11.2.0.1, under some circumstances it is possible that a SCAN VIP stays on the wrong node.
SOLUTION
:
Install the last 11.2.0.1 PSU, or 11.2.0.2/11.2.0.3.
ISSUE 5 :
SCAN
listener troubles .
CAUSE
:
Listener.ora
exists in /etc or /var/opt/oracle .
Default port 1521 not used.
SOLUTION:
Move listener.ora to another location. This is fixed in 11.2.0.3 and the last 11.2.0.2 PSU.
SCAN listener port changed to default 1521 after upgrade to 11.2.0.3 GI from lower release.
Automatic
OCR Backups
The
Oracle Clusterware automatically creates OCR physical backups every four hours.
At any one time, Oracle retains the last three of these 4-hourly OCR backup copies. The CRSD process that creates these backups also creates and
retains an OCR backup for each full day and at the end of each week. You cannot
customize the backup frequencies or the number of OCR physical backup files
that Oracle retains.
The
default location for generating physical backups on UNIX-based systems
is CRS_home/cdata/cluster_name where cluster_name is
the name of your cluster (i.e. crs). Use the ocrconfig
-showbackup command to view all current OCR physical backups that were
taken from the master
node:
[oracle@racnode1 ~]$ ocrconfig -showbackup
Restoring
the OCR from an automatic physical backup is accomplished using the ocrconfig -restore command. Note that the CRS stack needs to
be shutdown on all nodes in the cluster prior to running the restore operation:
ocrconfig -restore <backup_file_name>
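A minimal restore sequence might therefore look like this (the backup file name backup00.ocr and its path are illustrative; run as root):
# crsctl stop crs            <-- repeat on every node in the cluster
# ocrconfig -restore /u01/app/crs/cdata/crs/backup00.ocr
# crsctl start crs           <-- repeat on every node
# ocrcheck                   <-- verify the restored registry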
The Master Node
As
documented in Doc ID: 357262.1 on the My Oracle Support web site, the CRSD process only creates automatic OCR
physical backups on one node in the cluster, which is the OCR master
node. It does not create automatic backup copies on the other nodes; only
from the OCR master node. If the master node fails, the OCR backups will be
created from the new master node. You can determine which node in the cluster
is the master node by examining the $ORA_CRS_HOME/log/<node_name>/cssd/ocssd.log file
on any node in the cluster. In this log file, check for reconfiguration
information (reconfiguration successful) after which you will see which
node is the master and how many nodes are active in the cluster:
Node 1 - (racnode1)
[ CSSD]CLSS-3000:
reconfiguration successful, incarnation 1 with 2 nodes
[ CSSD]CLSS-3001:
local node number 1, master node number 1
Node 2 - (racnode2)
[ CSSD]CLSS-3000:
reconfiguration successful, incarnation 1 with 2 nodes
[ CSSD]CLSS-3001:
local node number 2, master node number 1
Another
quick approach is to use either of the following methods:
Node 1 - (racnode1)
# grep -i "master node"
$ORA_CRS_HOME/log/racnode?/cssd/ocssd.log | tail -1
[ CSSD]CLSS-3001:
local node number 1, master node number 1
Node 2 - (racnode2)
# grep -i "master node"
$ORA_CRS_HOME/log/racnode?/cssd/ocssd.log | tail -1
[ CSSD]CLSS-3001:
local node number 2, master node number 1
# If not found in the ocssd.log, then look through all
# of the ocssd archives:
Node 1 - (racnode1)
# for x in $(ls -tr $ORA_CRS_HOME/log/racnode?/cssd/ocssd.*); do grep -i "master node" $x; done | tail -1
[ CSSD]CLSS-3001:
local node number 1, master node number 1
Node 2 - (racnode2)
# for x in $(ls -tr $ORA_CRS_HOME/log/racnode?/cssd/ocssd.*); do grep -i "master node" $x; done | tail -1
[ CSSD]CLSS-3001:
local node number 2, master node number 1
# The master node information is confirmed by the
# ocrconfig -showbackup command:
# ocrconfig -showbackup
racnode1 2009/09/29 13:05:22 /u01/app/crs/cdata/crs
racnode1 2009/09/29 09:05:22 /u01/app/crs/cdata/crs
racnode1 2009/09/29 05:05:22 /u01/app/crs/cdata/crs
racnode1 2009/09/28 05:05:21 /u01/app/crs/cdata/crs
racnode1 2009/09/22 05:05:13 /u01/app/crs/cdata/crs
Because
of the importance of OCR information, Oracle recommends that you make copies of
the automatically created backup files at least once a day from the master node
to a different device from where the primary OCR resides. You can use any
backup software to copy the automatically generated physical backup files to a
stable backup location:
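For example, a simple copy of the most recent automatic backup, plus an optional logical export of the OCR (file names and the destination path are illustrative):
cp /u01/app/crs/cdata/crs/backup00.ocr /u03/crs_backup/ocrbackup/
ocrconfig -export /u03/crs_backup/ocrbackup/OCRFileBackup.dmp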
Verify the OCR configuration by running the ocrcheck command:
[root@racnode1 ~]# ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          2
         Total space (kbytes)     :     262120
         Used space (kbytes)      :       4668
         Available space (kbytes) :     257452
         ID                       :    1331197
         Device/File Name         : /u02/oradata/racdb/OCRFile
                                    Device/File integrity check succeeded  <-- Primary OCR Restored
         Device/File Name         : /u02/oradata/racdb/OCRFile_mirror
                                    Device/File integrity check succeeded  <-- Mirror OCR Restored
         Cluster registry integrity check succeeded
View
Voting Disk Configuration Information
Use
the crsctl utility to verify how many voting disks are configured for
the cluster as well as their location. The crsctl command can be
run as either the oracle or root user account:
[oracle@racnode1 ~]$ crsctl query css votedisk
0. 0
/u02/oradata/racdb/CSSFile
located 1 votedisk(s).
Adding
or removing a voting disk from the cluster is a fairly straightforward process.
Oracle Clusterware 10g Release 1 (10.1) only allowed for one voting disk while
Oracle Clusterware 10g Release 2 (10.2) lifted this restriction to allow
for 32 voting disks. Having multiple voting disks available to the cluster
removes the voting disk as a single point of failure and eliminates the need to
mirror them outside of Oracle Clusterware (i.e. RAID). The Oracle Universal
Installer (OUI) allows you to configure either one or three voting disks during
the installation of Oracle Clusterware. Having three voting disks available
allows Oracle Clusterware (CRS) to continue operating uninterrupted when
any one of the voting disks fail.
crsctl add css votedisk
<path>
You must be logged in as
the root user to run the crsctl command to add/remove
voting disks.
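The corresponding removal operation (also run as root) is:
crsctl delete css votedisk <path>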
Backing up the voting disk(s) is often performed
on a regular basis by the DBA to guard the cluster against a single point of
failure as the result of hardware failure or user error. Because the node
membership information does not usually change, it is not a strict requirement
that you back up the voting disk every day. At a minimum, however, your backup
strategy should include procedures to back up all voting disks
at the following times and make certain that the backups are stored in a secure
location that is accessible from all nodes in the cluster in the event the
voting disk(s) need to be restored:
- After
installing Oracle Clusterware
- After
adding nodes to or deleting nodes from the cluster
- After
performing voting disk add or delete operations
Oracle Clusterware 10g Release 1 (10.1) only
allowed for one voting disk while Oracle Clusterware 10g Release 2 (10.2)
lifted this restriction to allow for 32 voting disks. For high availability,
Oracle recommends that Oracle Clusterware 10g R2 users configure multiple
voting disks while keeping in mind that you must have an odd number of voting
disks, such as three, five, and so on. To avoid simultaneous loss of multiple
voting disks, each voting disk should be placed on a shared storage device that
does not share any components (controller, interconnect, and so on) with the
storage devices used for the other voting disks. If you define a single voting
disk, then you should use external mirroring to provide redundancy.
To make a backup copy of the voting disk on
UNIX/Linux, use the dd command:
dd if=<voting_disk_name> of=<backup_file_name>
bs=<block_size>
Perform this operation on every voting
disk where voting_disk_name is the name of the active voting
disk (input file), backup_file_name is the name of the file to
which you want to back up the voting disk contents (output file), and block_size is
the value to set both the input and output block sizes. As a general rule on
most platforms, including Linux and Sun, the block size for the dd command
should be 4k to ensure that the backup of the voting disk gets complete blocks.
If your voting disk is stored on a raw device,
use the device name in place of voting_disk_name. For example:
dd if=/dev/raw/raw3
of=/u03/crs_backup/votebackup/VotingDiskBackup.dmp bs=4k
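To restore a voting disk from such a backup, the same dd approach can be used in the reverse direction (a sketch; ensure the clusterware stack is down first):
dd if=<backup_file_name> of=<voting_disk_name> bs=4k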
Step by Step Cloning Database 10g RAC to
RAC
Cloning RAC Database (RAC-RAC)
·
Install Oracle
Clusterware on target server
·
Install Oracle Database
Software on target server
·
Copy the Cold backup to
target server shared location.
(If you are using hotbackup then also copy the archive log files. We require
archivelog files from both instances for e.g : 1st Instance =
log_652206883_34730_1.arc 2nd Instance =log_652206883_35888_2.arc to recover
the database)
·
Generate pfile from
spfile on Production and transfer the file to target $ORACLE_HOME/dbs
·
Edit the pfile to replace the PROD entries with the TEST/UAT/DEV values. Also review the below entries in the pfile (initERPUAT.ora):
*.compatible='10.2.0'
*.control_files='/uatdata/oracle/db/apps_st/data/cntrl01.dbf','/uatdata/oracle/db/apps_st/data/cntrl02.dbf','/uatdata/oracle/db/apps_st/data/cntrl03.dbf'
*.db_name='ERPUAT'
ERPUAT.instance_name='ERPUAT1'
ERPUAT.local_listener='ERPUAT1_LOCAL'
*.log_archive_dest_1='location=/uatarch/arch'
*.log_archive_format='log_%r_%s_%t.arc'
ERPUAT.remote_listener='ERPUAT_REMOTE'
ERPUAT.service_names='ERPUAT'
*.db_file_name_convert=(/oraproddata/oracle/db/apps_st/DATA,/uatdata/oracle/db/apps_st/DATA)
*.log_file_name_convert=(/oraproddata/oracle/db/apps_st/DATA,/uatdata/oracle/db/apps_st/DATA)
*._no_recovery_through_resetlogs=TRUE
·
Export the
ORACLE_SID=ERPUAT and start the database in nomount state
·
Create the control file
or rename the datafiles (change the location of datafiles) if you want to keep
the db name same
Note: If using a hot backup, once the control file is created, recover the database and provide the necessary archived log files from both instances. To check which archived log files to supply, query the source database.
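A sketch of what this recovery step typically looks like (these are generic commands, not the author's exact ones; supply the archived logs copied from both threads when prompted):
-- on the target (clone) database:
recover database using backup controlfile until cancel;
alter database open resetlogs;
-- on the source database, to see the latest archived sequence per thread:
select thread#, max(sequence#) from v$archived_log group by thread#;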