Thursday, April 7, 2016

Hang-analyze In Oracle Database

Collecting Hanganalyze and Systemstate Dumps

Logging in to the system
Using SQL*Plus, connect as SYSDBA with the following command:
sqlplus '/ as sysdba'

If this connection cannot be made, then in 10gR2 and above the SQL*Plus "preliminary connection" can be used:
sqlplus -prelim '/ as sysdba'

Collection commands for Hanganalyze and Systemstate: Non-RAC
Sometimes the database may just be very slow rather than actually hanging. It is therefore recommended, where possible, to collect 2 hanganalyze and 2 systemstate dumps in order to determine whether processes are moving at all or whether they are "frozen".

Hanganalyze
sqlplus '/ as sysdba'
oradebug setmypid
oradebug unlimit
oradebug hanganalyze 3
-- Wait one minute before getting the second hanganalyze
oradebug hanganalyze 3
oradebug tracefile_name
exit

Systemstate
sqlplus '/ as sysdba'
oradebug setmypid
oradebug unlimit
oradebug dump systemstate 266
oradebug dump systemstate 266
oradebug tracefile_name
exit
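
The two listings above can be generated from a small helper so that the one-minute pause between the hanganalyze dumps is not forgotten. This is a minimal sketch, not an Oracle-supplied tool: the function name hang_commands and its arguments are hypothetical, running both collections in one session is a choice made here for brevity, and the pause uses the SQL*Plus HOST command with a Unix-like sleep.

```shell
#!/bin/sh
# Sketch only: prints the oradebug commands for a non-RAC collection.
# hang_commands is a hypothetical helper name, not part of Oracle.
#   $1 = hanganalyze level (default 3)
#   $2 = systemstate level (default 266)
#   $3 = seconds to wait between the two hanganalyze dumps (default 60)
hang_commands() {
  hc_ha_level="${1:-3}"
  hc_ss_level="${2:-266}"
  hc_wait="${3:-60}"
  cat <<EOF
oradebug setmypid
oradebug unlimit
oradebug hanganalyze ${hc_ha_level}
host sleep ${hc_wait}
oradebug hanganalyze ${hc_ha_level}
oradebug dump systemstate ${hc_ss_level}
oradebug dump systemstate ${hc_ss_level}
oradebug tracefile_name
exit
EOF
}
```

The output can be piped into a SYSDBA session, for example: hang_commands 3 266 60 | sqlplus -S '/ as sysdba' (add -prelim if a normal connection is not possible).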

Collection commands for Hanganalyze and Systemstate: RAC
There are 2 bugs affecting RAC which, unless the relevant patches have been applied on your system, make using level 266 or 267 very costly. Without these fixes in place it is therefore highly inadvisable to use these levels.

For information on these patches see:

Document 11800959.8 Bug 11800959 - A SYSTEMSTATE dump with level >= 10 in RAC dumps huge BUSY GLOBAL CACHE ELEMENTS - can hang/crash instances

Document 11827088.8 Bug 11827088 - Latch 'gc element' contention, LMHB terminates the instance

Collection commands for Hanganalyze and Systemstate: RAC with fixes for bug 11800959 and bug 11827088
sqlplus '/ as sysdba'
oradebug setorapname reco
oradebug unlimit
oradebug -g all hanganalyze 3
oradebug -g all hanganalyze 3
oradebug -g all dump systemstate 266
oradebug -g all dump systemstate 266
exit

Collection commands for Hanganalyze and Systemstate: RAC without fixes for Bug 11800959 and Bug 11827088
sqlplus '/ as sysdba'
oradebug setorapname reco
oradebug unlimit
oradebug -g all hanganalyze 3
oradebug -g all hanganalyze 3
oradebug -g all dump systemstate 258
oradebug -g all dump systemstate 258
exit

In a RAC environment, a dump is created for every RAC instance and written to the DIAG trace file of each instance.
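
The choice between the two RAC listings comes down to the systemstate level. A minimal sketch, assuming you already know whether the fixes for bug 11800959 and bug 11827088 are installed; rac_hang_commands is a hypothetical helper name, and it defaults to the cheaper level 258 when the patch status is not stated:

```shell
#!/bin/sh
# Sketch only: prints the RAC oradebug commands with the systemstate
# level chosen from the patch status. rac_hang_commands is hypothetical.
#   $1 = "yes" if the fixes for bug 11800959 and bug 11827088 are applied;
#        anything else (or nothing) selects the safer level 258.
rac_hang_commands() {
  if [ "${1:-no}" = "yes" ]; then
    rac_ss_level=266
  else
    rac_ss_level=258
  fi
  cat <<EOF
oradebug setorapname reco
oradebug unlimit
oradebug -g all hanganalyze 3
oradebug -g all hanganalyze 3
oradebug -g all dump systemstate ${rac_ss_level}
oradebug -g all dump systemstate ${rac_ss_level}
exit
EOF
}
```

As with the non-RAC case, the output is intended to be piped into sqlplus '/ as sysdba' on one instance.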

Explanation of Hanganalyze and Systemstate Levels

Hanganalyze levels:
Level 3: In 11g onwards, level 3 also collects a short stack for relevant processes in the hang chain

Systemstate levels:
Level 258 is a fast alternative, but some lock element data is lost
Level 267 can be used if additional buffer cache / lock element data is needed, with an understanding of the cost

Other Methods

If connection to the system is not possible in any form, then please refer to the following article which describes how to collect systemstates in that situation:

Document 121779.1 Taking a SYSTEMSTATE dump when you cannot CONNECT to Oracle.

On RAC systems, hanganalyze, systemstates and some other RAC information can be collected using the 'racdiag.sql' script; see:
Document 135714.1 Script to Collect RAC Diagnostic Information (racdiag.sql)

V$WAIT_CHAINS

Starting from 11g release 1, the dia0 background process collects hanganalyze information and stores it in memory in the "hang analysis cache". It does this every 3 seconds for local hanganalyze information and every 10 seconds for global (RAC) hanganalyze information. This information can provide a quick view of the hang chains present while a hang is being experienced.

More information:

Document 1428210.1 Troubleshooting Database Contention With V$Wait_Chains

Waiting session details:
SELECT chain_id, num_waiters, in_wait_secs, sid, sess_serial#, osid, blocker_osid, substr(wait_event_text,1,30) FROM v$wait_chains;

Blocking session details:
set pages 1000
set lines 120
set heading off
column w_proc format a50 tru
column instance format a20 tru
column inst format a28 tru
column wait_event format a50 tru
column p1 format a16 tru
column p2 format a16 tru
column p3 format a15 tru
column Seconds format a50 tru
column sincelw format a50 tru
column blocker_proc format a50 tru
column waiters format a50 tru
column chain_signature format a100 wra
column blocker_chain format a100 wra

SELECT *
FROM (SELECT 'Current Process: '||osid W_PROC, 'SID '||i.instance_name INSTANCE,
'INST #: '||instance INST,'Blocking Process: '||decode(blocker_osid,null,'<none>',blocker_osid)||
' from Instance '||blocker_instance BLOCKER_PROC,'Number of waiters: '||num_waiters waiters,
'Wait Event: ' ||wait_event_text wait_event, 'P1: '||p1 p1, 'P2: '||p2 p2, 'P3: '||p3 p3,
'Seconds in Wait: '||in_wait_secs Seconds, 'Seconds Since Last Wait: '||time_since_last_wait_secs sincelw,
'Wait Chain: '||chain_id ||': '||chain_signature chain_signature,'Blocking Wait Chain: '||decode(blocker_chain_id,null,
'<none>',blocker_chain_id) blocker_chain
FROM v$wait_chains wc,
v$instance i
WHERE wc.instance = i.instance_number (+)
AND ( num_waiters > 0
OR ( blocker_osid IS NOT NULL
AND in_wait_secs > 10 ) )
ORDER BY chain_id,
num_waiters DESC)
WHERE ROWNUM < 101;

Below is the SQL to get the blocking session, including the final blocker:
set pages 1000
set lines 120
set heading off
column w_proc format a50 tru
column instance format a20 tru
column inst format a28 tru
column wait_event format a50 tru
column p1 format a16 tru
column p2 format a16 tru
column p3 format a15 tru
column Seconds format a50 tru
column sincelw format a50 tru
column blocker_proc format a50 tru
column fblocker_proc format a50 tru
column waiters format a50 tru
column chain_signature format a100 wra
column blocker_chain format a100 wra

SELECT *
FROM (SELECT 'Current Process: '||osid W_PROC, 'SID '||i.instance_name INSTANCE,
'INST #: '||instance INST,'Blocking Process: '||decode(blocker_osid,null,'<none>',blocker_osid)||
' from Instance '||blocker_instance BLOCKER_PROC,
'Number of waiters: '||num_waiters waiters,
'Final Blocking Process: '||decode(p.spid,null,'<none>',
p.spid)||' from Instance '||s.final_blocking_instance FBLOCKER_PROC,
'Program: '||p.program image,
'Wait Event: ' ||wait_event_text wait_event, 'P1: '||wc.p1 p1, 'P2: '||wc.p2 p2, 'P3: '||wc.p3 p3,
'Seconds in Wait: '||in_wait_secs Seconds, 'Seconds Since Last Wait: '||time_since_last_wait_secs sincelw,
'Wait Chain: '||chain_id ||': '||chain_signature chain_signature,'Blocking Wait Chain: '||decode(blocker_chain_id,null,
'<none>',blocker_chain_id) blocker_chain
FROM v$wait_chains wc,
gv$session s,
gv$session bs,
gv$instance i,
gv$process p
WHERE wc.instance = i.instance_number (+)
AND (wc.instance = s.inst_id (+) and wc.sid = s.sid (+)
and wc.sess_serial# = s.serial# (+))
AND (s.final_blocking_instance = bs.inst_id (+) and s.final_blocking_session = bs.sid (+))
AND (bs.inst_id = p.inst_id (+) and bs.paddr = p.addr (+))
AND ( num_waiters > 0
OR ( blocker_osid IS NOT NULL
AND in_wait_secs > 10 ) )
ORDER BY chain_id,
num_waiters DESC)
WHERE ROWNUM < 101;

Provide AWR/Statspack snapshots of General database performance

Hangs are a visible effect of a number of potential causes, ranging from a single process issue to something brought on by a global problem.
Collecting information about the general performance of the database in the build up to, during and after the problem is of primary importance since these snapshots can help to determine the nature of the load on the database at these times and can provide vital diagnostic information. This may prove invaluable in identifying the area of the problem and ultimately resolving the issue.

To do this, please take and upload snapshot reports of database performance (AWR or Statspack reports) from immediately before, during and after the hang.

Please refer to the following article for details of what to collect:
Document 781198.1 Diagnostics for Database Performance Issues
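
If the automatic snapshots do not bracket the problem closely enough, a manual snapshot can be taken at each point of interest. A minimal sketch: snapshot_command is a hypothetical helper name; it assumes AWR is licensed (Diagnostics Pack) or, for the statspack case, that Statspack is installed and the session connects as its owner. On a badly hung instance the snapshot itself may not complete.

```shell
#!/bin/sh
# Sketch only: prints the SQL*Plus command that takes one manual snapshot.
# snapshot_command is a hypothetical helper name.
#   $1 = "awr" (default) or "statspack"
snapshot_command() {
  case "${1:-awr}" in
    awr)       echo "exec dbms_workload_repository.create_snapshot" ;;
    statspack) echo "exec statspack.snap" ;;
    *)         echo "unknown snapshot type: $1" >&2; return 1 ;;
  esac
}
```

For example, snapshot_command awr | sqlplus -S '/ as sysdba' run before, during and after the hang gives snapshot boundaries that the standard report scripts (awrrpt.sql, or spreport.sql for Statspack) can then report between.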

Gather an up-to-date RDA

An up-to-date RDA provides a lot of additional information about the configuration of the database and its performance metrics, and can be examined to spot background issues that may impact performance.
See the following note on My Oracle Support:
Document 314422.1 Remote Diagnostic Agent (RDA) 4 - Getting Started

PROACTIVE METHODS TO GATHER INFORMATION ON A HANGING SYSTEM

On some systems a hang can occur when the DBA is not available to run diagnostics or at times it may be too late to collect the relevant diagnostics. In these cases, the following methods may be used to gather diagnostics:
As an alternative to the manual collection method noted above, it is also possible to use the HANGFG script as described in the following note to collect the information:
Document 362094.1 HANGFG User Guide
Additionally, this script can collect information with lower impact on the target database.

LTOM

The Lite Onboard Monitor (LTOM) is a Java program designed as a real-time diagnostic platform for deployment to a customer site. LTOM proactively provides real-time automatic problem detection and data collection.
For more information see:
Document 352363.1 LTOM - The On-Board Monitor User Guide

Procwatcher

Procwatcher is a tool that examines and monitors Oracle database and/or clusterware processes at a specific interval.
The following notes explain how to use Procwatcher:
Document 459694.1 Procwatcher: Script to Monitor and Examine Oracle DB and Clusterware Processes
Document 1352623.1 How To Troubleshoot Database Contention With Procwatcher

OS Watcher Black Box

OSWatcher Black Box contains a built-in analyzer that allows the collected data to be analyzed automatically, proactively looking for CPU, memory, I/O and network issues. It is recommended that all users install and run OSWbb since it is invaluable for looking at issues on the OS and has very little overhead. It can also be extremely useful for looking at OS performance degradation that may be seen when a hang situation occurs.

Refer to the following for download, user guide and usage videos on OSWatcher Black Box:

Document 301137.1 OSWatcher Black Box User Guide

ORACLE ENTERPRISE MANAGER 12C REAL-TIME ADDM

Real-Time ADDM is a feature of Oracle Enterprise Manager Cloud Control 12c that allows you to analyze database performance automatically when you cannot log on to the database because it is hung or performing very slowly due to a performance issue. It analyzes current performance when the database is hanging or running slowly and reports sources of severe contention.

Oracle Enterprise Manager 12c Real-Time ADDM

RETROACTIVE INFORMATION COLLECTION

Sometimes we may only notice a hang after it has occurred. In this case the following information may help with Root Cause Analysis:

A series of AWR/Statspack reports leading up to and during the hang
ASH reports - one can obtain more granular reports for the time of the hang, down to a window as short as one minute
Raw ASH information. This can be obtained by issuing an ashdump trace. For more information see:
Document 243132.1 10g and above Active Session History (Ash) And Analysis Of Ash Online And Offline
Document 555303.1 ashdump* scripts and post-load processing of MMNL traces
Alert log and any traces created at the time of the hang
On RAC specifically, check the following trace files as well: dia0, lmhb, diag and lmd0 traces
RDA as above
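
The raw ASH dump mentioned in the list above can be requested through oradebug in the same way as the other dumps. A minimal sketch: ashdump_commands is a hypothetical helper name, and the dump level is taken here to be the number of minutes of ASH history to write, which should be confirmed against Document 243132.1 for your version.

```shell
#!/bin/sh
# Sketch only: prints oradebug commands for a raw ASH dump.
# ashdump_commands is a hypothetical helper name.
#   $1 = dump level, assumed to mean minutes of ASH history (default 10)
ashdump_commands() {
  ash_minutes="${1:-10}"
  cat <<EOF
oradebug setmypid
oradebug unlimit
oradebug dump ashdump ${ash_minutes}
oradebug tracefile_name
exit
EOF
}
```

For example: ashdump_commands 30 | sqlplus -S '/ as sysdba'. The trace file name printed at the end is the file to load or upload.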
