
Saturday, 26 September 2020

Oracle Data Guard Explained – Step By Step

Oracle Data Guard consists of many components. In this post I'll explain an environment that helped me understand how all the components work. This is a 12.1.0.2 database (with the APR 2018 PSU, so it should be pretty stable) with a physical standby in a remote site. The primary has 4 redo groups of 10GB each (single member), and the network between the sites (primary to standby) is slow (only a few MB/s). Because the files are large and the network is slow, every step of the standby operation takes time. This taught me quite a lot about how DG works, step by step.


Data Guard Overview

Before we start diving in, a quick Data Guard overview. The common scenario for a physical standby is async real-time apply, and this is what I'll explain here ("real-time apply" = redo is applied as soon as it is copied to the standby; "async" = the primary doesn't wait for a confirmation from the standby to perform a commit). With this configuration (and actually with any DG configuration, for that matter), there are two separate steps:


Log Transport – move redo data from the primary to the standby

Redo Apply – apply the redo on the standby database
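A quick way to check which role a given database currently plays, and under which protection mode, is v$database (a generic query, not specific to this environment):

```sql
-- run on either site; DATABASE_ROLE is PRIMARY or PHYSICAL STANDBY
SQL> select database_role, protection_mode, open_mode from v$database;
```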

Log Transport

Under normal circumstances, a TT process (Redo Transport Slave) is sending the redo stream to the standby site. On the standby site an RFS process (Remote File Server) will get the data and write it to the standby logs.


When there is a lag for some reason, the standby will use FAL (Fetch Archive Log) information to connect to the primary to ask for the missing archives. The primary will allocate an ARC process to copy the archive to the standby (it will be received by an RFS process as well).
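Both the detected gap and the FAL configuration can be checked on the standby; the queries below are a generic sketch:

```sql
-- any missing sequence range the standby knows about
SQL> select thread#, low_sequence#, high_sequence# from v$archive_gap;

-- where the standby fetches missing archives from
SQL> show parameter fal_server
```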


How can we see this? In three steps, which we'll walk through below:

Query the standby database to find a specific RFS process

standby: lsof -p <RFS pid> to find the TCP connection to the primary

primary: netstat -np | grep <port> (based on the output of the previous step)



SQL> select process,pid,status from v$managed_standby where process='RFS';

 

PROCESS   PID                      STATUS

--------- ------------------------ ------------

RFS       38028                    RECEIVING

RFS       50159                    RECEIVING

RFS       32011                    RECEIVING

RFS       33444                    RECEIVING

The PID column is the OS process ID, so we’ll continue with shell on the standby (using PID 50159):



[oracle@standby:/tmp] $ lsof -p 50159 |grep TCP

oracle_50 50159 oracle   16u  IPv6 2552326962        0t0        TCP standby:ncube-lm->primary:10698 (ESTABLISHED)

Now we see the port this RFS connects to in the primary, so we’ll do the last step using shell on the primary (using port 10698):



[oracle@primary:~] $ netstat -np|grep 10698

tcp        0 1504800 1.1.1.1:10698         2.2.2.2:1521          ESTABLISHED 56242/ora_arc4_DB1

And this is how we see that ARC4 process on the primary is copying data to this specific RFS.
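The primary side of this picture is visible in v$managed_standby as well; a sketch of a query to see which ARC processes are busy shipping redo:

```sql
-- on the primary: archiver processes and the sequences they work on
SQL> select process, pid, status, sequence# from v$managed_standby
  2  where process like 'ARC%';
```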


Redo Apply

This part happens on the standby database and is basically recovering the standby database from redo information. The process responsible for this is called MRP0 (Managed Standby Recovery Process). This process reads archive logs and standby redo logs and applies them to the database files.


You can see its progress in v$managed_standby:



SQL> select process,status,sequence#,block#

  2  from v$managed_standby where process='MRP0';

 

PROCESS   STATUS        SEQUENCE#     BLOCK#

--------- ------------ ---------- ----------

MRP0      WAIT_FOR_LOG      33883      23871
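Besides v$managed_standby, the standby's v$dataguard_stats summarizes how far behind each of the two steps is:

```sql
-- transport lag = redo not yet received; apply lag = received but not yet applied
SQL> select name, value, time_computed from v$dataguard_stats
  2  where name in ('transport lag', 'apply lag');
```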

Log Transport Process

Now that we know how things work, let's see what really happens. In this environment the redo logs were 10GB each, which means it takes time to archive them, even locally. What I saw is that when a redo log is full, Oracle first archives the file locally and only then starts copying it to the standby database.


As I said before, the network was slow, so copying a 10GB file to the standby took a little more than an hour. In many cases, during this time the primary managed to create more archives. What I saw was that LGWR (or actually the TT process) took part in copying the files: it copied the current redo log, while the other archived logs were copied by ARCH processes.


One last thing: as far as I understand, in theory, when ARCH is copying an archive, it copies it to the archive location on the standby side. Then MRP can access the file and apply it. What I saw is that even when ARCH was copying a file, in some cases the data went into the standby redo logs (I followed this in v$standby_log, where you can see the sequence and the bytes used).
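The query I used for this looked roughly like the following (USED is the number of bytes written to the standby log so far):

```sql
-- on the standby: which SRL receives which sequence, and how full it is
SQL> select group#, thread#, sequence#, used, status from v$standby_log;
```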


I discussed that with Pieter van Puymbroeck (PM of Oracle DG) and this is what he said:


“When RFS receives live redo, it attempts to write it to an SRL. When RFS receives gap redo, it will use an SRL ONLY if there is an SRL already containing that branch/thread/sequence. Otherwise, RFS will write to an archive log (AL) file.


When RFS writes to an SRL, that file is available to MRP because the file already exists. When RFS writes to an archive log file, that file is not available to the MRP until all of the redo has been received and RFS has closed and committed the file. If RFS dies in the middle of receiving redo into an archive log file, the file is deleted. But redo written to an SRL is not lost. It is in the file as long as the disk write completes.”


Redo Apply Process

Following the process in v$managed_standby, I saw that even when the data came from the ARCH process and RFS wrote it to a standby log, MRP0 couldn't apply it in real time. Every file that was copied by ARCH had to be fully copied before MRP0 started applying it.


Pieter explained this:


“Once the MRP has access to the file (SRL or AL), it needs to know how much redo it can read. For an AL, it looks at the NAB field in the header block (block 1) of the AL. For an SRL, it uses the RTA boundary array or the NAB field in the header block of the SRL. The RTA boundary array will only contain the number of blocks that can be read if the SRL is for live redo. So if gap redo is being received into an SRL, the RTA boundary array will NOT contain the number of blocks that can be read and the NAB field in the header block will not be correct until the entire file has been received because it is only updated after all the redo has been received.”


One more thing that happened to me is that sometimes MRP0 just got stuck in the middle of a log. In this case it waited for the entire redo file to be copied and then continued. When this happened I got an "RTA boundary" message in the MRP0 trace file. I guess this happens when the primary switches a log file and starts writing to a new one, but I haven't checked that. I also have an SR open on this; I will update this post when I have more information.


Minimizing the Lag

If you read carefully and followed the logic, you probably realized that we had a constant lag issue. The fact that copying an entire redo log takes over an hour, while during this time we create more archives, led to a really huge lag (a few hours during peaks), and we wanted to reduce it.


The problem is that at peak time we created more redo than we could copy. This led to a situation where MRP0 couldn't apply in real time, so it waited for an entire redo log to be copied (which took more than an hour), and because it took so long to copy the file we had a large lag all the time.


Our solution was to reduce the size of the redo logs to 2GB. That way, even if we create a few GB in a short while, we can copy them in parallel, allowing shorter delay until MRP0 can start applying. We still have lag during this time, of course, but we minimize the lag because we allow MRP to start applying faster (after 2GB instead of 10GB) while we keep copying more archives in parallel.
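The resize itself is the usual add-new/drop-old procedure; a sketch with illustrative group numbers (the standby redo logs should be recreated at the new size too, with one extra group per thread):

```sql
-- add new, smaller online redo log groups
SQL> alter database add logfile group 5 size 2g;
SQL> alter database add logfile group 6 size 2g;

-- switch until an old 10GB group becomes INACTIVE, then drop it
SQL> alter system switch logfile;
SQL> select group#, status from v$log;
SQL> alter database drop logfile group 1;
```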


Script to Calculate Transfer Rate

This is not a perfect script, but I use it to measure the transfer rate. It's quite simple, so it has the following limitations:


Assumes that block size for redo is 512B

Works only if the set of files being copied does not change during the script's execution (if Oracle starts copying a new archive or finishes copying an old one, it will return an error message instead of the rate)

#!/bin/bash


t=30

if [ -n "$1" ]; then

        t=$1

fi

first=$(sqlplus -s / as sysdba << EOF

        set pages 0 lines 1000 echo off

        select sum(sequence#),sum(block#) from v\$managed_standby where process='RFS';

        exit

EOF

)

echo "sleeping for $t seconds"

sleep $t

second=$(sqlplus -s / as sysdba << EOF

        set pages 0 lines 1000 echo off

        select sum(sequence#),sum(block#) from v\$managed_standby where process='RFS';

        exit

EOF

)


first_seq=$(echo ${first} | awk '{print $1}')

first_sum=$(echo ${first} | awk '{print $2}')

second_seq=$(echo ${second} | awk '{print $1}')

second_sum=$(echo ${second} | awk '{print $2}')


if [ "${first_seq}" != "${second_seq}" ]; then

        echo "something has changed, try again"

else

        diff=$((second_sum - first_sum))

        echo "blocked copied in $t seconds is $diff"

        diff_mb=$(bc <<< "scale=2; ${diff}*0.5/${t}/1024")

        echo "${diff_mb} MB/s"

fi


How to change the DBNAME/DBID using the DBNEWID (nid) utility for an Oracle database

Change Only the DBID Using the NID Utility

Here we will be only changing the DBID of the Oracle database.


(1) Backup Database


rman target /

backup database;

exit

(2) shutdown immediate


sqlplus / as sysdba


SQL> shutdown immediate;

Database closed.

Database dismounted.

ORACLE instance shut down.

SQL>exit

(3) startup mount


sqlplus / as sysdba


SQL> startup mount;

ORACLE instance started.

 

Total System Global Area 4764747646 bytes

Fixed Size                  2253584 bytes

Variable Size             457873600 bytes

Database Buffers          343860800 bytes

Redo Buffers                7471104 bytes

Database mounted.

SQL> exit

(4) Open one session and run the NID (DBNEWID) utility with SYSDBA privileges to change the DBID.


nid TARGET=SYS/password@exptest


Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

 

Password:

Connected to database EXPTEST (DBID=587585)

 

Connected to server version 11.2.0

 

Control Files in database:

    +DATA/EXPTEST/control01.ctl

    +FLASH/EXPTEST/control02.ctl

   

 

Change database ID ? (Y/[N]) => Y

 

Proceeding with operation

Changing database ID from 587585 to 587585534

 


Database ID for database EXPTEST changed to 587585534.

All previous backups and archived redo logs for this database are unusable.

Database has been shutdown, open database with RESETLOGS option.

Successfully changed database  ID.

DBNEWID - Completed successfully.

(5) Startup of the database with open resetlogs


sqlplus / as sysdba


SQL> startup mount;

ORACLE instance started.

 

Total System Global Area 4764747646 bytes

Fixed Size                  2253584 bytes

Variable Size             457873600 bytes

Database Buffers          343860800 bytes

Redo Buffers                7471104 bytes

Database mounted.

SQL> alter database open resetlogs;



Change Only the DBNAME

Here we will be only changing the DBNAME of the Oracle database.


(1) Backup Database


rman target /

backup database;

exit

(2) shutdown immediate


sqlplus / as sysdba


SQL> shutdown immediate;

Database closed.

Database dismounted.

ORACLE instance shut down.

SQL>exit

(3) startup mount


sqlplus / as sysdba


SQL> startup mount;

ORACLE instance started.

 

Total System Global Area 4764747646 bytes

Fixed Size                  2253584 bytes

Variable Size             457873600 bytes

Database Buffers          343860800 bytes

Redo Buffers                7471104 bytes

Database mounted.

SQL> exit


(4) Open one session and run NID with sysdba privileges


nid TARGET=SYS/password@test_db DBNAME=exptest_db2 SETNAME=Y

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

 

Password:

Connected to database EXPTEST (DBID=587585)

 

Connected to server version 11.2.0

 

Control Files in database:

    +DATA/EXPTEST/control01.ctl

    +FLASH/EXPTEST/control02.ctl

   

 

Change database name ? (Y/[N]) => Y

 

Proceeding with operation

Database name changed to EXPTEST_DB2.

 

All previous backups and archived redo logs for this database are unusable.

Database has been shutdown, open database with RESETLOGS option.

Successfully changed database name.

DBNEWID - Completed successfully.

The value of DBNAME is the new database name.

SETNAME must be set to Y; the default is N, which causes the DBID to be changed as well.

(5) Set the DB_NAME initialization parameter in the initialization parameter file to the new database name


(6) Create a new password file using orapwd
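Steps (5) and (6) might look like the following; the file names and password here are illustrative, so adjust them to your environment:

```shell
# (5) in the pfile (e.g. initexptest_db2.ora), set the new name:
#     *.db_name='exptest_db2'

# (6) create a new password file for the renamed database
orapwd file=$ORACLE_HOME/dbs/orapwexptest_db2 password=MySysPassword entries=5
```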


(7) Startup of the database(with resetlogs)


sqlplus / as sysdba


SQL> startup mount;

ORACLE instance started.

 

Total System Global Area 4764747646 bytes

Fixed Size                  2253584 bytes

Variable Size             457873600 bytes

Database Buffers          343860800 bytes

Redo Buffers                7471104 bytes

Database mounted.

SQL> alter database open resetlogs;

Change Both DBID and DBNAME

Here we will be changing both the DBID and DBNAME of the Oracle database.


(1) Backup Database


rman target /

backup database;

exit

(2) shutdown immediate


sqlplus / as sysdba


SQL> shutdown immediate;

Database closed.

Database dismounted.

ORACLE instance shut down.

SQL>exit

(3) startup mount


sqlplus / as sysdba


SQL> startup mount;

ORACLE instance started.

 

Total System Global Area 4764747646 bytes

Fixed Size                  2253584 bytes

Variable Size             457873600 bytes

Database Buffers          343860800 bytes

Redo Buffers                7471104 bytes

Database mounted.

SQL> exit

(4) Open one session and run NID with sysdba privileges


nid TARGET=SYS/password@test_db DBNAME=exptest_db2


Copyright (c) 1982, 2011, Oracle and/or its affiliates. All rights reserved.


Password:

Connected to database EXPTEST (DBID=587585)


Connected to server version 11.2.0


Control Files in database:

+DATA/EXPTEST/control01.ctl

+FLASH/EXPTEST/control02.ctl



Change database name and ID ? (Y/[N]) => Y


Proceeding with operation

Database name changed to EXPTEST_DB2.

Modify parameter file and generate a new password file before restarting.

Database ID for database EXPTEST_DB2 changed to 587585534.

All previous backups and archived redo logs for this database are unusable.

Database has been shutdown, open database with RESETLOGS option.

Successfully changed database name and ID.

DBNEWID - Completed successfully.

The value of DBNAME is the new database name.

(5) After DBNEWID successfully changes the DBID, shutdown immediate


(6) Set the DB_NAME initialization parameter in the initialization parameter file to the new database name.


(7) Create a new password file with orapwd


(8) Startup of the database with open resetlogs


sqlplus / as sysdba


SQL> startup mount;

ORACLE instance started.

 

Total System Global Area 4764747646 bytes

Fixed Size                  2253584 bytes

Variable Size             457873600 bytes

Database Buffers          343860800 bytes

Redo Buffers                7471104 bytes

Database mounted.

SQL> alter database open resetlogs;

I hope you like this content on the NID utility in Oracle. I have provided the step-by-step implementation for each of the paths.

Oracle ORAchk | Database Health Check with Orachk


I will explain how to perform a database health check with Oracle ORAchk in this article.

 

Oracle ORAchk

ORAchk is a tool similar to the RACcheck and Exadata Exachk tools, offering database, clusterware, patch, and OS recommendations and best practices. The most up-to-date version of ORAchk is 19.2.

 

ORAchk ( Database Health Check )

You can use the ORAchk tool to perform a health check, and you can also run it before a RAC database upgrade to get pre-upgrade problems and recommendations.

The ORAchk tool can be used in the following environments.

Oracle Real Application Clusters (RAC) databases

non-RAC databases

MAA configuration

Oracle GoldenGate configuration

EBS checks

Preinstall configuration

Pre-upgrade configuration

Solaris system configuration

 

This Tool is supported on the following systems.

Linux x86-64* (Enterprise Linux, RedHat and SuSE 9, SuSE 10 & SuSE 11)

Oracle Solaris SPARC (Solaris 10 and 11)

Oracle Solaris x86-64 (Solaris 10 and 11)

AIX **

HPUX**

 

Supported database versions of the ORAchk tool are as follows.

Oracle 10g R2

Oracle 11g

Oracle 12c

Oracle 18c

Oracle 19c

 

I will use it for an upgrade of Oracle RAC 11.2.0.3 on IBM AIX, to learn the status of the database before upgrading and Oracle's best practices. After downloading the ORAchk tool, upload it to the server and use it with the following steps.

 


Unzip the orachk.zip file as follows.

[oracle@Mussaib ~]$ cd orachk/

[oracle@Mussaib ~]$ ls -all

total 11288

drwxr-xr-x 2 oracle oinstall 256 Jul 02 10:33 .

drwxr-xr-x 10 oracle oinstall 4096 Jul 02 10:33 ..

-rw-r--r-- 1 oracle oinstall 5770368 Jul 02 09:53 orachk.zip



[oracle@Mussaib ~]$ unzip orachk.zip

Archive: orachk.zip

inflating: UserGuide.txt

inflating: rules.dat

inflating: orachk

creating: .cgrep/

................

................

................

inflating: readme.txt

inflating: collections.dat

[oracle@Mussaib ~]$

 

 

The contents of ORAchk are as follows.

 

[oracle@Mussaib ~]$ ls -all

 total 75752

 drwxr-xr-x 3 oracle oinstall 4096 Jul 02 10:34 .

 drwxr-xr-x 10 oracle oinstall 4096 Jul 02 10:33 ..

 drwxrwxr-x 3 oracle oinstall 4096 May 31 09:37 .cgrep

 -rw-r--r-- 1 oracle oinstall 3435193 May 31 09:37 CollectionManager_App.sql

 -rw-r--r-- 1 oracle oinstall 432 May 31 09:37 UserGuide.txt

 -rw-rw-r-- 1 oracle oinstall 22951324 May 31 09:37 collections.dat

 -rwxr-xr-x 1 oracle oinstall 1604239 May 31 09:37 orachk

 -rw-r--r-- 1 oracle oinstall 5770368 Jul 02 09:53 orachk.zip

 -rwxr-xr-x 1 oracle oinstall 1604239 May 31 09:37 raccheck

 -rw-r--r-- 1 oracle oinstall 3879 May 31 09:37 readme.txt

 -rw-rw-r-- 1 oracle oinstall 3384097 May 31 09:37 rules.dat

 

 

Database Health Check

You can run the ORAchk tool with the following parameters for different purposes.

 

-a  When executed with the -a parameter, it performs all the checks on the database/clusterware and operating system and makes recommendations.

-b  Only the best practices are given with the -b parameter.

-p  With the -p parameter, patch recommendations are given for the database and the clusterware.

-u  With the -u parameter, pre-upgrade checks are performed and recommendations are provided.

-v  Gives ORAchk version

 


Let's make an example to learn ORAchk; I will use ORAchk for a pre-upgrade check as follows.

 

[oracle@Mussaib ~]$ ./orachk -u -o pre

Enter upgrade target version (valid versions are 11.2.0.3.0, 11.2.0.4.0, 12.1.0.1.0, 12.1.0.2.0):- 11.2.0.4.0

CRS stack is running and CRS_HOME is not set. Do you want to set CRS_HOME to /oracle/product/11.2.0.3/grid?[y/n][y]y

Checking ssh user equivalency settings on all nodes in cluster

Node testserver02 is configured for ssh user equivalency for oracle user

Searching for running databases . . . . .

. .

List of running databases registered in OCR

1. TESTDB

2. None of above

Select databases from list for checking best practices. For multiple databases, select 1 for All or comma separated number like 1,2 etc [1-2][1].1

. .

Checking Status of Oracle Software Stack - Clusterware, ASM, RDBMS

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

-------------------------------------------------------------------------------------------------------

Oracle Stack Status

-------------------------------------------------------------------------------------------------------

Host Name CRS Installed ASM HOME RDBMS Installed CRS UP ASM UP RDBMS UP DB Instance Name

-------------------------------------------------------------------------------------------------------

testserver01 Yes N/A Yes Yes Yes Yes TESTDB1

testserver02 Yes N/A Yes Yes Yes Yes TESTDB2

-------------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------------------------------------

Installed components summary

---------------------------------------------------------------------------------------------------------------------------------

GI_HOME ORACLE_HOME Database Names

---------------------------------------------------------------------------------------------------------------------------------

/oracle/product/11.2.0.3/grid - 11.2.0.3.0 /oracle/app/product/11.2.0.3/db - 11.2.0.3.0 TESTDB

---------------------------------------------------------------------------------------------------------------------------------

Copying plug-ins

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . .

Checking Grid Infrastructure and RDBMS upgrade readiness to version 112040


Collections and audit checks log file is

/oracle/orachk/orachk_testserver01_TESTDB_070214_104039/log/orachk.log


Checking for prompts in /home/users/oracle/.profile on testserver01 for oracle user...

Checking for prompts in /home/users/oracle/.profile on testserver02 for oracle user...


Starting to run orachk in background on testserver02

=============================================================

Node name - testserver01

=============================================================

Collecting - ASM Diskgroup Attributes

Collecting - ASM initialization parameters

Collecting - Database Parameters for TESTDB database

Collecting - Database Undocumented Parameters for TESTDB database

Collecting - Pre-upgrade File Locations for TESTDB database

Collecting - AIX Packages

Collecting - CPU Information

Collecting - Disk Information

Collecting - DiskFree Information

Collecting - DiskMount Information

Collecting - Memory Information

Collecting - Network Interface Configuration

Collecting - Network Performance

Collecting - Network Service Switch

Collecting - Network tunable parameters

Collecting - OS fixes or patches

Collecting - OS version

Collecting - Patches for Grid Infrastructure

Collecting - Patches for RDBMS Home

Collecting - System and Kernel Settings

Data collections completed. Checking best practices on testserver01.

--------------------------------------------------------------------------------------

INFO => Prior to Upgrade Verify NO Materialized Views Being Refreshed (Automatically or Manually) for TESTDB

FAIL => Opatch version is lower than recommended in RDBMS_HOME for /oracle/app/product/11.2.0.3/db

FAIL => Opatch version is lower than recommended in GRID_HOME

INFO => Special Considerations for SCAN and HAIP of 11.2.0.2 and above Grid Infrastructure

INFO => Minimizing Planned Downtime for Upgrades and Patching

INFO => Upgrade Related References

INFO => Lifetime Support Policy and Product Certification INFO

INFO => Capture performance baseline, backup important configuration files and batch, cron, DBMS_JOBS and DBMS_SCHEDULER jobs

INFO => Known Issues Integrating pre-11gR2 Databases with Grid Infrastructure

INFO => Useful Information for Operating Oracle RAC on AIX Power Architecture

INFO => Be Aware of New Instant Client for Simplified Deployment

INFO => Oracle E-Business Suite interoperability and migration resources

INFO => Oracle Software Download Sites

WARNING => Some Users Needing Network ACLs for Oracle Utility Packages Found for TESTDB

FAIL => Auditing tables (SYS.AUD$/sys.fga_log$) row count is large enough that there could potentially be a problem during the upgrade process for TESTDB

INFO => Steps to execute rootupgrade.sh

INFO => Save "Oracle Enterprise Manager Database Control" files and data with the emdwgrd Utility before upgrading database.

INFO => Information about ASM process parameter when its not set to default value

INFO => Information about Grid Infrastructure software installation directory

WARNING => Package gpfs.base-3.3.0.11-ppc_64 is recommended but NOT installed


Copying results from testserver02 and generating report. This might take a while. Be patient.

=============================================================

Node name - testserver02

=============================================================

Collecting - AIX Packages

Collecting - CPU Information

Collecting - Disk Information

Collecting - DiskFree Information

Collecting - DiskMount Information

Collecting - Memory Information

Collecting - Network Interface Configuration

Collecting - Network Performance

Collecting - Network Service Switch

Collecting - Network tunable parameters

Collecting - OS fixes or patches

Collecting - OS version

Collecting - Patches for Grid Infrastructure

Collecting - Patches for RDBMS Home

Collecting - System and Kernel Settings

Data collections completed. Checking best practices on testserver02.

--------------------------------------------------------------------------------------


FAIL => Opatch version is lower than recommended in RDBMS_HOME for /oracle/app/product/11.2.0.3/db

FAIL => Opatch version is lower than recommended in GRID_HOME

INFO => Information about ASM process parameter when its not set to default value

WARNING => Package gpfs.base-3.3.0.11-ppc_64 is recommended but NOT installed


---------------------------------------------------------------------------------

CLUSTERWIDE CHECKS

---------------------------------------------------------------------------------

---------------------------------------------------------------------------------

Detailed report (html) - /oracle/orachk/orachk_testserver01_TESTDB_070214_104039/orachk_testserver01_TESTDB_070214_104039.html

UPLOAD(if required) - /oracle/orachk/orachk_testserver01_TESTDB_070214_104039.zip

[oracle@Mussaib ~]$


ORAchk generates a compressed report; download and open it, and it will look like the following.

The ORAchk report will give your database a system health score and some recommendations.