Saturday, April 9, 2016

Memory Corruption Question & Answers

What is memory corruption?

Memory in Oracle is classified by the lifetime of the allocation:

- the duration of the instance [SGA],
- the duration of the process [PGA],
- the duration of the session [UGA],
- the duration of a call [CGA].

When any structure in memory gets corrupted or altered, it becomes inaccessible, rendering its contents unusable. The structure might be a data block held in memory, a heap structure, etc. Oracle generally handles such memory corruption well, safeguarding the underlying data.

What is a PGA corruption?

Program Global Area (PGA) is a memory region containing data and control information for a single process (server or background). One PGA is allocated for each server process; the PGA is exclusive to that server process and is read and written only by Oracle code acting on behalf of that process.

When memory allocated to a PGA gets corrupted, we call it a PGA corruption. Access to the corrupt portion of the PGA can terminate the session that is accessing it. Since the corruption is confined to a PGA, which is specific to a single process, it does not bring down the instance.

When such errors occur only in a specific session, we can close that session, open a new one, and check whether the error recurs, since closing the session releases the memory allocated to it.

What is a SGA corruption?

The System Global Area (SGA) is made up of the database buffer cache, the shared pool and the redo log buffer. Basically, it holds the data and program caches that are shared among database users.

Since this portion of memory is shared, it is critical that it stays consistent. If a portion of the SGA is corrupted, the instance may be terminated when a process accesses that corrupted portion.

In many cases, restarting the instance solves the problem, unless a hardware problem, an Oracle bug or a third-party application problem keeps corrupting the SGA.

What is heap corruption?

Heap corruption errors occur when addresses or pointers in memory get corrupted, rendering the corresponding heap inaccessible. When Oracle tries to read that part of the heap and encounters incorrect information, errors are signaled.

What are the symptoms of heap corruption?

Heap corruption can occur in any heap: SGA, PGA, UGA or CGA.

Generally, corruption errors in the PGA and UGA are triggered when that particular heap is accessed or when the session is terminated.

When the corruption is in the SGA, we can expect a possible termination of the instance.

Corruption errors occur in the form of:

- ORA-00600 [17XXX]
- Core dump in any function related to manipulation of memory structures.

Also the following memory leak errors can occur:

- ORA-00600 [711] : Freeing memory and stack discipline violated
- ORA-00600 [723] : PGA memory leak
- ORA-00600 [729] : UGA memory leak
- ORA-00600 [733] : Memory requested size too big for this port
- ORA-00600 [736] : Problems with number of elements in segmented loop macro
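
As a rough illustration of the two lists above, the following Python sketch sorts an ORA-00600 line by its first bracketed argument. The function name `classify_ora600` and the category labels are made up for this example, and the sketch only understands numeric first arguments; it is a triage aid under those assumptions, not an Oracle-supplied tool.

```python
import re

# Descriptions taken from the memory leak list above.
LEAK_CODES = {
    711: "Freeing memory and stack discipline violated",
    723: "PGA memory leak",
    729: "UGA memory leak",
    733: "Memory requested size too big for this port",
    736: "Problems with number of elements in segmented loop macro",
}

# Captures the first bracketed numeric argument after ORA-00600 / ORA-600.
ORA600_RE = re.compile(r"ORA-0*600\b[^\[]*\[(\d+)\]")


def classify_ora600(line):
    """Return (category, detail) for an ORA-00600 line, or None if no match.

    Hypothetical helper: the category names are this example's own labels.
    """
    m = ORA600_RE.search(line)
    if m is None:
        return None
    code = int(m.group(1))
    if 17000 <= code <= 17999:
        # ORA-00600 [17XXX] is the heap corruption form described above.
        return ("heap corruption", f"internal heap error {code}")
    if code in LEAK_CODES:
        return ("memory leak", LEAK_CODES[code])
    return ("other", f"argument {code}")
```

For example, `classify_ora600("ORA-00600 [729]")` returns `("memory leak", "UGA memory leak")`, while a line without ORA-00600 returns `None`.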

What are the causes of heap corruption?

The cause of heap corruption is difficult to identify, because the process that corrupts the memory is usually not the process that reports the corruption.

The various possible causes of heap corruption are as follows:

OS/Hardware:

Problems in the OS or hardware can corrupt memory pointers, or even zero them out, and make the memory inaccessible. In such cases, it is better to check the system logs for possible problems. Running OS/hardware diagnostics might also help.

Oracle Bug:

Oracle itself can cause memory corruption if there is a bug in the Oracle code that manipulates the memory structures, for example dereferencing invalid pointers or overwriting memory chunks. In such cases, Oracle Support can help you identify whether it might be a bug.

Other applications:

It is also possible for third-party applications to cause memory corruption, if such an application mistakenly overwrites memory chunks or dereferences invalid pointers.

However, the causes above can be investigated only if the error is reproducible.

How are memory corruption errors reported?

Generally, memory corruption errors get reported when the session is being terminated. At that point, the heap allocated for the session is dissolved and all its chunks are freed. If Oracle identifies any corrupted chunks during this process, it reports them in the alert log file and generates a trace file.

Corruption in the SGA might result in termination of the instance. Before the instance crashes, the error message is reported in the alert log file and a corresponding trace file is generated.

How can we identify which heap is corrupted?

At the beginning of the trace file, after the trace header dump, you will find a message like the one below. In this case, the heap corruption error is ORA-00600 [17147].

********** Internal heap ERROR 17147 addr=ffffffff7b55dfa8 *********
***** Dump of memory around addr ffffffff7b55dfa8:

After this you will find a hex dump of the memory around the corrupted chunk. After the hex dump, you will find a formatted heap dump, which contains the name of the heap that holds the corrupted chunk.

******************************************************
HEAP DUMP heap name="pga heap" desc=106528de0
extent sz=0x20c0 alt=184 het=32767 rec=0 flg=2 opc=3
parent=0 owner=0 nex=0 xsz=0xfff0

In our case the corrupted heap is "PGA HEAP". Similarly we can identify the corresponding heap which is corrupted.
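
The lookup described above can be sketched in Python as follows. The function name `find_corrupted_heap` is hypothetical, and the two patterns are inferred only from the sample trace lines shown in this post, so other Oracle versions may format these lines differently.

```python
import re

# Patterns inferred from the sample trace lines in this post (an assumption).
ERROR_RE = re.compile(r"Internal heap ERROR (\d+) addr=([0-9a-fA-F]+)")
HEAP_RE = re.compile(r'HEAP DUMP heap name="([^"]+)"')


def find_corrupted_heap(trace_text):
    """Return (error_code, address, heap_name) for the first heap error.

    Any element is None when the corresponding line is missing.
    """
    err = ERROR_RE.search(trace_text)
    # Look for the heap dump header only after the error banner, if present.
    start = err.end() if err else 0
    heap = HEAP_RE.search(trace_text, start)
    code = int(err.group(1)) if err else None
    addr = err.group(2) if err else None
    name = heap.group(1) if heap else None
    return code, addr, name


sample = (
    "********** Internal heap ERROR 17147 addr=ffffffff7b55dfa8 *********\n"
    "***** Dump of memory around addr ffffffff7b55dfa8:\n"
    "******************************************************\n"
    'HEAP DUMP heap name="pga heap" desc=106528de0\n'
)
print(find_corrupted_heap(sample))  # (17147, 'ffffffff7b55dfa8', 'pga heap')
```

Searching for the heap name only after the error banner avoids picking up an unrelated heap dump that might appear earlier in the same trace file.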

What is a bad magic number?

The magic number in a chunk is used to verify that the chunk header is consistent. An ORA-600 [17xxx] error may indicate that a chunk is corrupted with a BAD MAGIC NUMBER.

In the formatted heap dump, you might find a line like the following, which identifies the corrupted chunk.

Chunk ffffffff7b55bf68 sz= 8256 BAD MAGIC NUMBER IN NEXT CHUNK (2059)
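
A small Python sketch for pulling such lines out of a formatted heap dump is shown below. The function name `find_bad_chunks` and the field names are hypothetical, the pattern is inferred from the single sample line above, and the number in parentheses is kept as an uninterpreted string because this post does not say what it represents.

```python
import re

# Pattern inferred from the one sample line in this post (an assumption).
BAD_CHUNK_RE = re.compile(
    r"Chunk\s+([0-9a-fA-F]+)\s+sz=\s*(\d+)\s+"
    r"BAD MAGIC NUMBER IN NEXT CHUNK\s*\((\w+)\)"
)


def find_bad_chunks(heap_dump_text):
    """Return one dict per chunk line flagged with a bad magic number."""
    return [
        {
            "addr": m.group(1),       # chunk address as printed (hex string)
            "size": int(m.group(2)),  # value after "sz="
            "value": m.group(3),      # parenthesised value, left uninterpreted
        }
        for m in BAD_CHUNK_RE.finditer(heap_dump_text)
    ]
```

Chunk lines that do not carry the BAD MAGIC NUMBER marker are simply skipped, so the result is an empty list for a clean heap dump.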

What is the impact of heap corruption?

In most cases heap corruption errors are short-lived, except when they are caused by Oracle bugs or OS/hardware problems. In the short-lived cases, memory chunks might get corrupted by a memory overwrite, but as Oracle flushes out the corrupted chunks and brings in new ones, the corruption resolves itself and Oracle stops reporting it in the alert log file.

Since this affects only the chunks in the memory, there is no data corruption due to this error. The data in the database is safe.

Is it possible to find the offending SQL if any?

The trace file should contain this information. If the error occurs while running a particular SQL statement, look in the trace file for the section "Current SQL statement for this session". Below this section header you should find the offending SQL, if any. You can run this SQL again to check whether the error recurs.
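
The search for that section can be sketched in Python as follows. The function name `current_sql` is hypothetical, and the end-of-statement rule (stop at the first blank line or at a line starting with `-----`) is an assumption made for this example, since the exact layout of the section is not shown in this post.

```python
HEADER = "Current SQL statement for this session"


def current_sql(trace_text):
    """Return the statement text that follows the section header, or None.

    Assumption: the statement runs from the line after the header up to
    the next blank line or a line starting with '-----'.
    """
    lines = trace_text.splitlines()
    for i, line in enumerate(lines):
        if HEADER in line:
            stmt = []
            for nxt in lines[i + 1:]:
                if not nxt.strip() or nxt.startswith("-----"):
                    break
                stmt.append(nxt.strip())
            # Join continuation lines into a single statement string.
            return " ".join(stmt) or None
    return None
```

If the header is missing, or nothing follows it, the function returns `None`, which matches the "if any" caveat above.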

Is there any initial diagnosis that can be done?

Possible things to check:

a. The operations that were being performed when the error occurred.
b. Whether the error occurs consistently.
c. Whether any specific statement generates this error when run.
d. Whether third-party applications are involved.
e. The system logs, for any OS/hardware error messages.

Are there any possible workarounds that can be tried?

In case of heap corruption, we can try the following workarounds:

a. Bounce the database, if possible. This clears out all the allocated memory chunks, and when the database comes back up it allocates fresh chunks of memory. This can resolve the memory corruption, unless the error reproduces after the bounce.

b. Flush the shared pool.

We can try to flush the shared pool, which frees the chunks that are not currently in use so that fresh chunks are brought in.

SQL> alter system flush shared_pool;

This can possibly resolve the memory corruption, unless the error reproduces after flushing the shared pool.
