Friday, February 28, 2020

ASM Administration Commands




Check ASMCMD version
$ asmcmd -V

Check diskgroups, including dismounted DGs
$ asmcmd
lsdg -g  --discovery

Check ASM disk status
$ asmcmd
lsdsk -k -g

Check ASM disk status for candidate disk
$  asmcmd
lsdsk -k -g --candidate

Clients accessing ASM instance:
$ asmcmd
lsct

Check a particular diskgroup.
$ asmcmd
lsdg DATA

Check mounted ASM disks.
$ asmcmd
lsdsk

Check details of ASM disks.
$ asmcmd
lsdsk -k

Check above information for a particular disk.
$ asmcmd
lsdsk -k -G <dg_name>

Check I/O statistics of disks.
$ asmcmd
iostat

Check I/O statistics for a particular diskgroup.
$ asmcmd
iostat -G FRA

Check read/write error information.
$ asmcmd
iostat -e

Check the location of the Oracle ASM server parameter file (SPFILE).
$ asmcmd
spget

Check permissions on a file.
$ asmcmd
ls --permission

Check disk usage.
$ asmcmd
du

To find files by name.
$ asmcmd
find + system* data*

To list files currently open by all instances in ASM, use the below command.
$ asmcmd
lsof
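
Note that ASMCMD commands can also be run non-interactively, straight from the OS prompt; a quick sketch (the diskgroup name DATA is only an example):

$ asmcmd lsdg DATA
$ asmcmd lsct
$ asmcmd lsdsk -k -G DATA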



How to check Grid Infrastructure version



# crsctl query crs activeversion
# crsctl query crs releaseversion
# crsctl query crs softwareversion

What does crsctl query crs activeversion mean ?
Use the crsctl query crs activeversion command to display the active version of Oracle Clusterware running in the cluster. If some nodes become unreachable in the middle of an upgrade, then you cannot complete the upgrade, because the upgrade script (rootupgrade.sh) did not run on the unreachable nodes. Because the upgrade is incomplete, the active version remains at the previous release.

What does crsctl query crs releaseversion mean ?
Use the crsctl query crs releaseversion command to display the version of the Oracle Clusterware software stored in the binaries on the local node.

What does crsctl query crs softwareversion mean ?
Use the crsctl query crs softwareversion command to display the latest version of the software that has been successfully started on the specified node.
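
For reference, typical output looks like the following (the version numbers and node name shown here are only illustrative):

# crsctl query crs activeversion
Oracle Clusterware active version on the cluster is [12.1.0.2.0]
# crsctl query crs releaseversion
Oracle High Availability Services release version on the local node is [12.1.0.2.0]
# crsctl query crs softwareversion
Oracle Clusterware version on node [rac1] is [12.1.0.2.0]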

Enable Archive Log Mode In Oracle RAC 11g



To check the status:

[oracle@rac1 ~]$ srvctl status database -d PROD
Instance PROD1 is running on node rac1
Instance PROD2 is running on node rac2

To stop the database:

[oracle@rac1 ~]$ srvctl stop database -d PROD -o immediate

To check the status:

[oracle@rac1 ~]$ srvctl status database -d PROD
Instance PROD1 is not running on node rac1
Instance PROD2 is not running on node rac2

To mount the database:

[oracle@rac1 ~]$ srvctl start database -d PROD -o mount

To enable archivelog (from SQL*Plus on any one instance while the database is mounted):

SQL> alter database archivelog;

To restart the database:

[oracle@rac1 ~]$ srvctl stop database -d PROD -o immediate
[oracle@rac1 ~]$ srvctl status database -d PROD
Instance PROD1 is not running on node rac1
Instance PROD2 is not running on node rac2

To start the database:

[oracle@rac1 ~]$ srvctl start database -d PROD
[oracle@rac1 ~]$ srvctl status database -d PROD
Instance PROD1 is running on node rac1
Instance PROD2 is running on node rac2

To set the archive destination to an ASM diskgroup:

alter system set log_archive_dest_1='LOCATION=+DATA/ARCH/' scope=both sid='*';
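
To confirm that archivelog mode and the new destination are in effect, a quick check from SQL*Plus:

SQL> archive log list
SQL> select log_mode from v$database;
SQL> show parameter log_archive_dest_1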

For disabling archive mode:

$ srvctl stop database -d PROD
$ srvctl start database -d PROD -o mount
SQL> alter database noarchivelog;
$ srvctl stop database -d PROD
$ srvctl start database -d PROD




Sunday, February 23, 2020

Oracle Networking Between Server & Client



Server Side:

- Set the IP address and host name on the server.
- Check the IP using # ifconfig.
- Create the listener.ora by using $ netmgr.
- Ping the client machine from the server side.


Client Side:

- Set the IP address and host name on the client.
- Check the IP using # ifconfig.
- Create the listener.ora by using $ netmgr.
- Ping the server machine from the client side.
- Create the tnsnames.ora file on the client side using $ netca.


Client side database:

netmgr --> listener --> expand the listener --> delete the old listener --> add listener --> add address (hostname and ip) --> choose database services (sid and global sid)

netca --> naming methods configuration --> local naming --> next --> next --> local net service naming configuration --> add --> <service_name_server_side> --> <hostname> --> next --> perform a test --> system/manager --> next --> net service name --> click no and choose finish.
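
For reference, the net service entry that this writes to tnsnames.ora on the client typically looks like the following (host name, port, and service name here are only illustrative):

ORCL =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = dbserver.example.com)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = orcl)
    )
  )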

RMAN Active cloning method




EXAMPLE:

SOURCE DB – QPROD ( Also called target instance )
DESTINATION DB – QTEST ( Also called auxiliary instance )

NOTE – If your auxiliary instance already exists, then drop the database before starting the cloning.


1. Add the tns entry of the both database in tnsnames.ora file of DESTINATION host :

-- source db tns :

QPROD =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = rac1.oracle.com)(PORT = 1528))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = QPROD)
    )
  )


-- destination (auxiliary) db tns :


QTEST =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = rac2.oracle.com)(PORT = 1533))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = QTEST)
    )
  )


2. Create a listener for the destination db (with static registration)

 LISTENER_QTEST =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = TCP)(HOST = rac2.oracle.com)(PORT = 1533))
    ))


SID_LIST_LISTENER_QTEST =
  (SID_LIST =
    (SID_DESC =
      (GLOBAL_DBNAME = QTEST)
      (ORACLE_HOME = /u01/app/oracle/product/11.2.0/dbhome_1)
      (SID_NAME = QTEST)
    ))


-- START THE LISTENER

lsnrctl start LISTENER_QTEST


3. Copy the pfile from source host :

Copy the pfile from the source host to the destination host and modify parameters such as control_files, diagnostic_dest, and audit_file_dest.
Apart from that, add the below two (mandatory) parameters to the destination db pfile.

*.db_file_name_convert = '/u09/QPROD', '/u01/QTEST'      # ('<source db datafile location>', '<destination db datafile location>')
*.log_file_name_convert = '/u09/QPROD', '/u01/QTEST'     # ('<source db redo log location>', '<destination db redo log location>')


NOTE – FOR RAC DATABASE:

alter system set cluster_database=FALSE scope=spfile sid='*';
alter system set log_file_name_convert='+REDOA/PROD/ONLINELOG','+REDO01/TEST/ONLINELOG','+REDOB/PROD/ONLINELOG','+REDO02/TEST/ONLINELOG','+PRODARCH02','+TESTARCH' scope=spfile sid='*';
alter system set db_create_file_dest='+DATA' scope=spfile sid='*';

4. Create password file on both source and target db ( keep same password )

-- SOURCE DB ( QPROD)

cd $ORACLE_HOME/dbs
orapwd file=orapw$ORACLE_SID password=oracle force=y

-- TARGET DB ( QTEST)

cd $ORACLE_HOME/dbs
orapwd file=orapw$ORACLE_SID password=oracle force=y

NOTE – For RAC DB, create the password file as below.
orapwd file='+DATA' dbuniquename=QTEST
Enter password for SYS:

5. Start the target db/auxiliary instance ( QTEST) in nomount state:

export ORACLE_SID=QTEST
SQL> create spfile from pfile ;
SQL> startup nomount

6. Check the connection to both the source and destination databases as below.

Run the below command from the QTEST host:
rman target sys/oracle@<source db_tns_name> auxiliary sys/oracle@<destination db_tns_name>
i.e.
rman target sys/oracle@QPROD auxiliary sys/oracle@QTEST
If you get any error while running this command, fix it before proceeding further.

7. Start the cloning:

Now run the below rman script from the destination db host:
rman target sys/oracle@QPROD auxiliary sys/oracle@QTEST

run
{
allocate channel c1 type disk;
allocate channel c2 type disk;
allocate auxiliary channel aux1 type disk;
allocate auxiliary channel aux2 type disk;
allocate auxiliary channel aux3 type disk;
allocate auxiliary channel aux4 type disk;
duplicate target database to QTEST from active database USING  BACKUPSET ;
}

Once this script completes, it opens the destination db (QTEST) with RESETLOGS. With this, the cloning is complete.
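
A quick post-clone sanity check from the destination host (a sketch; connect to the QTEST instance with SQL*Plus):

SQL> select name, db_unique_name, open_mode from v$database;
SQL> select instance_name, status from v$instance;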

NOTE: If your oracle version is 11g, then use the below rman script.

run
{
allocate channel c1 type disk;
allocate channel c2 type disk;
allocate auxiliary channel aux1 type disk;
allocate auxiliary channel aux2 type disk;
allocate auxiliary channel aux3 type disk;
allocate auxiliary channel aux4 type disk;
duplicate target database to QTEST from active database;
}

For excluding a particular tablespace while cloning:

Example: excluding tablespace Q_DATA.

run
{
allocate channel c1 type disk;
allocate channel c2 type disk;
allocate auxiliary channel aux1 type disk;
allocate auxiliary channel aux2 type disk;
allocate auxiliary channel aux3 type disk;
allocate auxiliary channel aux4 type disk;
duplicate target database to QTEST from active database SKIP TABLESPACE "Q_DATA";
}

Wednesday, February 19, 2020

Trick to read a database alertlogfile in detail



[oracle@node1 trace]$ ls -ltrh alert_QLAB.log
-rw-r----- 1 oracle oinstall 61K Feb 20 00:32 alert_QLAB.log
[oracle@node1 trace]$ du -sh alert_QLAB.log
68K     alert_QLAB.log
[oracle@node1 trace]$ cat alert_QLAB.log |wc -l
1496
[oracle@node1 trace]$
[oracle@node1 trace]$ tail -500 alert_QLAB.log > newalerbymkm.log
[oracle@node1 trace]$ du -sh newalerbymkm.log
20K     newalerbymkm.log
[oracle@node1 trace]$ cat newalerbymkm.log |wc -l
500
[oracle@node1 trace]$ vi newalerbymkm.log
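
Another handy trick is to pull out only the error lines first and then jump to the matching line numbers in the editor; a simple sketch using grep:

[oracle@node1 trace]$ grep -n "ORA-" alert_QLAB.log | tail -20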

Friday, February 14, 2020

Simple explanation of explain plan





SQL> explain plan for select * from emp where deptno=20;



SQL> create index idx1 on emp(deptno);
Index created.
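
To actually see the plan (and to compare it before and after creating the index), re-run the EXPLAIN PLAN and display it with DBMS_XPLAN; a minimal sketch using the standard EMP demo table:

SQL> explain plan for select * from emp where deptno=20;
Explained.

SQL> select * from table(dbms_xplan.display());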






Installing Sample Schemas



SQL> @?/rdbms/admin/scott.sql
SQL> @?/sqlplus/demo/demobld.sql
SQL> @?/rdbms/admin/utlsampl.sql
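
After running whichever script applies to your release, a quick check that the SCOTT schema is in place (SCOTT/TIGER is the default password these scripts set):

SQL> conn scott/tiger
Connected.
SQL> select table_name from user_tables;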

Multiple explain plans



explain plan for select * from mqm101;
select * from table(dbms_xplan.display());

Need to use SET STATEMENT_ID

statement.1
explain plan SET STATEMENT_ID='EXPLAIN1'
for select * from mqm101;

check the explain plan table status:
select plan_id, statement_id from plan_table;

statement.2
explain plan SET STATEMENT_ID='EXPLAIN2'
for select * from mqm102;

check the explain plan table status:
select plan_id, statement_id from plan_table;

statement.3
explain plan SET STATEMENT_ID='EXPLAIN3'
for select * from mqm103;

check the explain plan table status:
SQL> select plan_id, statement_id from plan_table;

   PLAN_ID STATEMENT_ID
---------- ------------------------------
         5 EXPLAIN1
         5 EXPLAIN1
         6 EXPLAIN2
         6 EXPLAIN2
         7 EXPLAIN3
         7 EXPLAIN3

Now generate the execution plan:

select * from table(dbms_xplan.display());
or
select * from table(dbms_xplan.display('PLAN_TABLE','EXPLAIN1','TYPICAL',NULL ));

Wednesday, February 12, 2020

RMAN useful information



The RMAN commands "crosscheck archivelog all" and "delete noprompt expired archivelog all" are used to clear RMAN's lookups so that the next RMAN run of "backup archivelog" does not look for non-existent archivelogs. However, if you query V$ARCHIVED_LOG, it will still show all the archived log records.

In one example, the number of records in V$ARCHIVED_LOG is 467, and RECORDS_USED, LAST_INDEX, and LAST_RECID in V$CONTROLFILE_RECORD_SECTION are also 467; the counts match.

V$ARCHIVED_LOG entries are maintained for as long as CONTROL_FILE_RECORD_KEEP_TIME specifies.

V$CONTROLFILE_RECORD_SECTION shows you the number and size of the entries.
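
To check the controlfile record section usage for archived logs yourself, a quick sketch:

SQL> select type, records_total, records_used, last_index, last_recid
     from v$controlfile_record_section
     where type = 'ARCHIVED LOG';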

Trick to remove and trim alertlog file in oracle



[root@rac1 trace]# du -sh alert_ORCL1.log
148K    alert_ORCL1.log
[root@rac1 trace]# pwd
/u02/app/oracle/diag/rdbms/orcl1/ORCL1/trace
[root@rac1 trace]# cat /dev/null > alert_ORCL1.log
[root@rac1 trace]# du -sh alert_ORCL1.log
0       alert_ORCL1.log

Below is another terminal where the alert log was already open with tail -f:

[root@rac1 trace]# tail -f alert_ORCL1.log

Wed Feb 12 23:07:24 2020
Begin automatic SQL Tuning Advisor run for special tuning task  "SYS_AUTO_SQL_TUNING_TASK"
End automatic SQL Tuning Advisor run for special tuning task  "SYS_AUTO_SQL_TUNING_TASK"
Wed Feb 12 23:12:48 2020
Resize operation completed for file# 3, old size 645120K, new size 655360K
tail: alert_ORCL1.log: file truncated
Wed Feb 12 23:16:17 2020
Thread 1 advanced to log sequence 22 (LGWR switch)
  Current log# 1 seq# 22 mem# 0: /u02/app/oracle/oradata/ORCL1/redo01.log
Wed Feb 12 23:16:17 2020
Archived Log entry 10 added for thread 1 sequence 21 ID 0x54684e49 dest 1:

If you remove the alert log file while the database is up and running, nothing will happen. After you remove it, any of the below commands will automatically create a new alert log; the exception is the "alter system checkpoint" command, which does not.

alter system switch logfile;
alter database backup controlfile to trace;
alter database begin backup;

Note:

Before implementing on production, please test on development or UAT boxes.

Monday, February 10, 2020

ORA-01153: an incompatible media recovery is active On standby Database



SQL> startup mount
ORACLE instance started.
Total System Global Area  835104768 bytes
Fixed Size                  2217952 bytes
Variable Size             490735648 bytes
Database Buffers          335544320 bytes
Redo Buffers                6606848 bytes
Database mounted.

alter database recover managed standby database using current logfile disconnect;
alter database recover managed standby database using current logfile disconnect
*
ERROR at line 1:
ORA-01153: an incompatible media recovery is active

Cause:
This indicates a currently running media recovery process.

Action:
alter database recover managed standby database cancel;
alter database recover managed standby database using current logfile disconnect;

Note:
When shutting down physical standby database, firstly turn off media recovery process. Otherwise the next time when starting up redo apply again, you will encounter error ORA-01153.
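
To confirm whether managed recovery is already running before issuing the command, check for the MRP process; a quick sketch:

SQL> select process, status, sequence# from v$managed_standby where process like 'MRP%';

$ ps -ef | grep mrp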

Dataguard Broker Basics



- The Data Guard broker logically groups primary and standby databases in a Data Guard setup into a broker configuration that allows the broker to manage and monitor them together as an integrated unit. You can manage a broker configuration using either the Oracle Enterprise Manager graphical user interface or the Data Guard command-line interface.

- Data Guard Broker is part of the normal Oracle Database Enterprise Edition installation and an integral part of Data Guard. With it you can perform Data Guard related configuration, set transport methods and apply settings, perform role changes, and change the overall protection mode.

The Broker consists of three major parts:

A) BACKGROUND PROCESSES ON EACH NODE
B) BROKER CONFIGURATION FILES
C) COMMAND LINE INTERFACE KNOWN AS DGMGRL.

A) BACKGROUND PROCESSES ON EACH NODE
All of the processes run on every node of Data Guard setup.

i) Data Guard Monitor (DMON)
The DMON process coordinates all Broker actions and maintains the Broker configuration files.
It is enabled or disabled with the DG_BROKER_START parameter.

ii) Broker Resource Manager (RSM)
The RSM process comes into play whenever the Broker needs to run a SQL command in the database.
SQL may need to be run during Data Guard setup or as a result of a change to the configuration made through DGMGRL.

iii) Data Guard Net Server (NSVn)
NSV processes are responsible for making connections to the remote database and sending work items across to it. So the DMON or RSM process may ask an NSV process to connect to the other node for any work request: DMON may need to send some communication to the other node, and RSM may have to get some data through SQL from the other node.

iv) DRCn
NSV processes contact the DRC process running on the other node to establish the connection, so the DRC process acts as a receiver on that node. Each NSV process has a partner DRC process on the target database, which performs the actual work on behalf of the source database's NSV process and returns the results or status.

v) Internode servers (INSVs)
If the Data Guard setup involves Real Application Clusters (RAC), INSV processes come into play. They maintain a connection between the RAC database nodes in the cluster to ensure that the Broker on each node knows the state of the cluster.

WHAT HAPPENS AT DATABASE STARTUP:

Step 1) Primary database startup.
Step 2) DMON process at primary ask local NSV process to connect to each standby database.
Step 3) NSV process in turn will contact DRC process on standby and get confirmation.
Step 4) DMON process at primary will then ask local RSM to send the setup commands to the standby database.
Step 5) RSM will send the setup commands to NSV.
Step 6) NSV process in turn will contact DRC process on standby to run the setup commands.


B) BROKER CONFIGURATION FILES

The Broker keeps its configuration details in flat files. These files are stored on each database node in the Data Guard configuration. Additionally, two copies of the configuration file are always stored on each database for redundancy.

Below parameters control where the configuration files will be stored.

DG_BROKER_CONFIG_FILE1
DG_BROKER_CONFIG_FILE2

For RAC instances, you need to keep the files in a shared location where all database instances can access them. The Broker keeps all of these configuration files in sync, but the DMON process running on the primary database owns the master copy of the files.

This means that if you start the standby database while the Data Guard processes are down on the primary node, no Data Guard related activities will be performed until the standby database can connect to the primary database.

The primary purpose of keeping configuration files on each node is that when your primary node is gone and you need to fail over, the Data Guard processes can determine the original settings for the entire configuration from the configuration files on the surviving node.

C) COMMAND LINE INTERFACE (CLI) KNOWN AS DGMGRL

The third important part of Broker setup is the interface through which users will interact with Broker. The Broker CLI DGMGRL is included in the Oracle Database Enterprise Edition.

To access DGMGRL, type dgmgrl at the command prompt. Use the CONNECT command after you have started DGMGRL to connect to databases.

Below steps to configure Dataguard Broker

SQL> select NAME,DATABASE_ROLE,DATAGUARD_BROKER from v$database;
NAME      DATABASE_ROLE    DATAGUAR
--------- ---------------- --------
DELL      PRIMARY          DISABLED

- Change the location of config files

Since our Primary and Standby databases are RAC, we will change the default location of DG Broker files to a centralized location so that all nodes can access them. In our case, we will be using the ACFS mount point.

Login as oracle user on Primary and execute below commands.

SQL> show parameter dg_broker_config

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
dg_broker_config_file1               string      /u01/app/oracle/products/12.1.0/db/dbs/dr1orcldb.dat
dg_broker_config_file2               string      /u01/app/oracle/products/12.1.0/db/dbs/dr2orcldb.dat

SQL> !mkdir -p /u01/app/acfs/reco/dg_broker_config_file

SQL> alter system set dg_broker_config_file1='/u01/app/acfs/reco/dg_broker_config_file/dr1orcldb.dat';
System altered.

SQL> alter system set dg_broker_config_file2='/u01/app/acfs/reco/dg_broker_config_file/dr2orcldb.dat';
System altered.

Similarly, change the settings on Standby database server.

SQL> !mkdir -p /u01/app/acfsreco/dg_broker_config_file

SQL> alter system set dg_broker_config_file1='/u01/app/acfsreco/dg_broker_config_file/dr1orcldbs.dat';
System altered.

SQL> alter system set dg_broker_config_file2='/u01/app/acfsreco/dg_broker_config_file/dr2orcldbs.dat';
System altered.

- Change the dg_broker_start parameter value

Now change the parameter 'dg_broker_start' to TRUE on both Primary and Standby databases as shown below.
SQL> alter system set dg_broker_start=TRUE;
System altered.

- Register databases using dgmgrl

On the primary server, login to command line interface using dgmgrl and register the primary database in the broker configuration.

[oracle@rac1 ~]$ dgmgrl
DGMGRL for Linux: Version 12.1.0.2.0 - 64bit Production
Copyright (c) 2000, 2013, Oracle. All rights reserved.
Welcome to DGMGRL, type "help" for information.
DGMGRL> connect sys/password@orcldb
Connected as SYSDBA.
DGMGRL>

DGMGRL> CREATE CONFIGURATION dg_config AS PRIMARY DATABASE IS orcldb CONNECT IDENTIFIER IS orcldb;
Configuration "dg_config" created with primary database "orcldb"

Now add the standby database in broker configuration.

DGMGRL> ADD DATABASE orcldbs AS CONNECT IDENTIFIER IS orcldbs MAINTAINED AS PHYSICAL;
Database "orcldbs" added

- Enable and Check the broker configuration
We need to enable the broker configuration and check if the configuration is enabled successfully or not.

DGMGRL> ENABLE CONFIGURATION;
Enabled.

DGMGRL> show configuration;
Configuration - dg_config
Protection Mode: MaxPerformance
Members:
orcldb  - Primary database
orcldbs - Physical standby database
Fast-Start Failover: DISABLED
Configuration Status:
SUCCESS   (status updated 13 seconds ago)


Once the broker configuration is enabled, the MRP process should be started on the Standby database server. You can check it using the below command.

DGMGRL> show database orcldbs
Database - orcldbs
Role:               PHYSICAL STANDBY
Intended State:     APPLY-ON
Transport Lag:      0 seconds (computed 0 seconds ago)
Apply Lag:          0 seconds (computed 0 seconds ago)
Average Apply Rate: 39.00 KByte/s
Real Time Query:    OFF
Instance(s):
orcldbs1 (apply instance)
orcldbs2
Database Status:
SUCCESS


The output of the above command shows that the MRP process is started on the instance 1 (orcldbs1) server. You can log in to the standby node 1 server and check whether MRP is running or not, as shown below.
$ ps -ef | grep mrp
oracle   26667     1  0 15:17 ?        00:00:00 ora_mrp0_orcldbs1
oracle   27826 20926  0 15:21 pts/1    00:00:00 /bin/bash -c ps -ef | grep mrp

Now that the MRP process is running, login to both Primary and Standby database and check whether the logs are in sync or not.
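
A common sync check is to compare the log sequences (a sketch: run the first query on the primary and the second on the standby):

-- On primary:
SQL> select thread#, max(sequence#) from v$archived_log group by thread#;

-- On standby:
SQL> select thread#, max(sequence#) from v$archived_log where applied='YES' group by thread#;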

- Dataguard Broker commands
Below are some extra commands which you can use to check the status of the databases.

DGMGRL> VALIDATE DATABASE orcldb;
Database Role:    Primary database
Ready for Switchover:  Yes
Flashback Database Status:
orcldb:  Off


DGMGRL> VALIDATE DATABASE orcldbs;
Database Role:     Physical standby database
Primary Database:  orcldb
Ready for Switchover:  Yes
Ready for Failover:    Yes (Primary Running)
Flashback Database Status:
orcldb:   Off
orcldbs:  Off

STANDBY DB (192.168.1.12)

SQL> select NAME,DATABASE_ROLE,DATAGUARD_BROKER from v$database;

NAME   DATABASE_ROLE    DATAGUAR
------ ---------------- --------
DELL   PHYSICAL STANDBY ENABLED

SQL> archive log list;
Database log mode              Archive Mode
Automatic archival             Enabled
Archive destination            /u01/app/oracle/DELL/arch
Oldest online log sequence     9
Next log sequence to archive   0
Current log sequence           11


Sunday, February 9, 2020

Node Eviction Overview



The Oracle Clusterware is designed to perform a node eviction by removing one or more nodes from the cluster if some critical problem is detected.  A critical problem could be a node not responding via a network heartbeat, a node not responding via a disk heartbeat, a hung or severely degraded machine, or a hung ocssd.bin process.  The purpose of this node eviction is to maintain the overall health of the cluster by removing bad members.

Starting in 11.2.0.2 RAC or above (or if you are on Exadata), a node eviction may not actually reboot the machine.  This is called a rebootless restart.  In this case we restart most of the clusterware stack to see if that fixes the unhealthy node.

1.0 - PROCESS ROLES FOR REBOOTS

OCSSD (aka CSS daemon) - This process is spawned by the cssdagent process. It runs in both
vendor clusterware and non-vendor clusterware environments.  OCSSD's primary job is internode health monitoring and RDBMS instance endpoint discovery. The health monitoring includes a network heartbeat and a disk heartbeat (to the voting files).  OCSSD can also evict a node after escalation of a member kill from a client (such as a database LMON process). This is a multi-threaded process that runs at an elevated priority and runs as the Oracle user.

Startup sequence: INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdagent --> ocssd --> ocssd.bin

CSSDAGENT - This process is spawned by OHASD and is responsible for spawning the OCSSD process, monitoring for node hangs (via oprocd functionality), monitoring the OCSSD process for hangs (via oclsomon functionality), and monitoring vendor clusterware (via vmon functionality).  This is a multi-threaded process that runs at an elevated priority and runs as the root user.

Startup sequence: INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdagent

CSSDMONITOR - This process also monitors for node hangs (via oprocd functionality), monitors the OCSSD process for hangs (via oclsomon functionality), and monitors vendor clusterware (via vmon functionality). This is a multi-threaded process that runs at an elevated priority and runs as the root user.

Startup sequence: INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdmonitor

2.0 - DETERMINING WHICH PROCESS IS RESPONSIBLE FOR A REBOOT
Important files to review:

- Clusterware alert log
- The cssdagent log(s)
- The cssdmonitor log(s)
- The ocssd log(s)
- The lastgasp log(s) in /etc/oracle/lastgasp or /var/opt/oracle/lastgasp
- CHM or OS Watcher data
- 'opatch lsinventory -detail' output for the GRID home
- Messages files

* Messages file locations:

Linux: /var/log/messages
Sun: /var/adm/messages
HP-UX: /var/adm/syslog/syslog.log
IBM: /bin/errpt -a > messages.out

11.2 Clusterware evictions should, in most cases, have some kind of meaningful error in the clusterware alert log.  This can be used to determine which process is responsible for the reboot.  Example message from a clusterware alert log:

[ohasd(11243)]CRS-8011:reboot advisory message from host: sta00129, component: cssagent, with timestamp: L-2009-05-05-10:03:25.340
[ohasd(11243)]CRS-8013:reboot advisory message text: Rebooting after limit 28500 exceeded; disk timeout 27630, network timeout 28500, last heartbeat from CSSD at epoch seconds 1241543005.340, 4294967295 milliseconds ago based on invariant clock value of 93235653

This particular eviction happened when we had hit the network timeout.  CSSD exited and the cssdagent took action to evict. The cssdagent knows the information in the error message from local heartbeats made from CSSD.

If no message is in the evicted node's clusterware alert log, check the lastgasp logs on the local node and/or the clusterware alert logs of other nodes.

3.0 - TROUBLESHOOTING OCSSD EVICTIONS

If you have encountered an OCSSD eviction, review the common causes in section 3.1 below.

3.1 - COMMON CAUSES OF OCSSD EVICTIONS
- Network failure or latency between nodes. It would take 30 consecutive missed checkins (by default - determined by the CSS misscount) to cause a node eviction (see the command sketch after this list).
- Problems writing to or reading from the CSS voting disk. If the node cannot perform a disk heartbeat to the majority of its voting files, then the node will be evicted.
- A member kill escalation. For example, the database LMON process may request CSS to remove an instance from the cluster via the instance eviction mechanism. If this times out, it could escalate to a node kill.
- An unexpected failure or hang of the OCSSD process; this can be caused by any of the above issues or something else.
- An Oracle bug.
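
To see the current misscount and disk timeout values referenced above, the following commands can be run from the Grid Infrastructure home (a quick sketch):

$ crsctl get css misscount
$ crsctl get css disktimeout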
3.2 - FILES TO REVIEW AND GATHER FOR OCSSD EVICTIONS

All files from section 2.0 from all cluster nodes.  More data may be required.

Example of an eviction due to loss of voting disk:

CSS log:

2012-03-27 22:05:48.693: [ CSSD][1100548416](:CSSNM00018:)clssnmvDiskCheck: Aborting, 0 of 3 configured voting disks available, need 2
2012-03-27 22:05:48.693: [ CSSD][1100548416]###################################
2012-03-27 22:05:48.693: [ CSSD][1100548416]clssscExit: CSSD aborting from thread clssnmvDiskPingMonitorThread

OS messages:

Mar 27 22:03:58 choldbr132p kernel: Error:Mpx:All paths to Symm 000190104720 vol 0c71 are dead.
Mar 27 22:03:58 choldbr132p kernel: Error:Mpx:Symm 000190104720 vol 0c71 is dead.
Mar 27 22:03:58 choldbr132p kernel: Buffer I/O error on device sdbig, logical block 0

4.0 - TROUBLESHOOTING CSSDAGENT OR CSSDMONITOR EVICTIONS

If you have encountered a CSSDAGENT or CSSDMONITOR eviction, review the common causes in section 4.1 below.

4.1 - COMMON CAUSES OF CSSDAGENT OR CSSDMONITOR EVICTIONS
- An OS scheduler problem. For example, the OS is getting locked up in a driver or hardware, or there is an excessive amount of load on the machine (at or near 100% CPU utilization), preventing the scheduler from behaving reasonably.
- A thread (or threads) within the CSS daemon hung.
- An Oracle bug.
4.2 - FILES TO REVIEW AND GATHER FOR CSSDAGENT OR CSSDMONITOR EVICTIONS

All files from section 2.0 from all cluster nodes. More data may be required.

Reference: Oracle Support (Metalink) Doc ID 1050693.1