Monitoring Db2 HADR Takeover


In a DB2 HADR system, if the Primary database encounters a fatal error or is brought down, then the Standby database has to takeover to become the new Primary. Sometimes the process could take longer than most would expect. This is because during takeover the old Standby must undo all the open transactions before transforming into the new Primary. Recently we have received a case that the user thought the takeover hanged while it was just taking a long time to finish backward phase. Today I will talk about how to monitor the process of takeover in such scenario and our new feature to allow faster takeover.

The scenario

When takeover command is issued on an active Standby database, the Standby will have to replay all the log records it has received, and then bring up the database consistently. This means at the end of the logs, if there are open transactions, these transactions must be undone as they have not committed at the time of takeover. Normally, this should not take too long. But in the case that there are many open transactions with large workload, it would take significant amount of time to complete undo phase. This could cause significant downtime for the HADR system and impact its high availability.

Monitoring workload

To prevent the scenario from happening during failover, the user should be aware of the open transactions and avoid long running transactions in the workload. It’s possible to monitor the transactions using db2pd tool. The “-transaction” option outputs the transactions in progress. The following is an example output of “db2pd -transaction” taken on a HADR standby database:

Transactions:

Address             AppHandl  [nod-index]  TranHdl  Locks  State  Tflag
0x00007F409D103280  14        [000-00014]  4        0      READ   0x00000000
0x00007F409D109380  0         [000-00000]  6        0      WRITE  0x00000000

Tflag2      Firstlsn            Lastlsn             Firstlso  Lastlso
0x00000000  0x0000000000000000  0x0000000000000000  0         0
0x00000000  0x00000000000772A8  0x0000000000077437  61762921  61796038

LogSpace  SpaceReserved  TID             AxRegCnt  GXID
0         0              0x000000000100  1         0
0         0              0x00000000022E  0         0


As we can see, there is one write transaction that is going on. The transaction starts at LSN 0x0772A8 and is currently at 0x077437. This transaction is still relatively small and would not take too long to undo. However, if the difference between LastLsn and FirstLsn becomes too big, then it may take considerable time to undo the transaction when takeover happens. It’s recommended for customer to optimize the workload to avoid long-running transactions, so that it would take reasonable time for HADR standby to complete takeover in case of fatal failure on primary.

Monitoring takeover using db2pd

It’s possible to monitor the takeover progress using db2pd tool. The “-recovery” option gives an overview of the current status of recovery, which is part of the takeover process. The following is an example output of “db2pd -recovery” taken during backward phase of a takeover:

Database Member 0 -- Database TESTDB -- Standby -- Up 0 days 00:37:34 -- Date 2018-03-26-17.10.00.248172

Recovery:
Recovery Status        1000040F40080802
Current Log            S0000036.LOG
Current LSN            0000000000040375
Current LRI            0000000000000001000000000000125B0000000000040375
Current LSO            45423240
Job Type               ROLLFORWARD RECOVERY
Job ID                 4
Job Start Time         (1522098285) Mon Mar 26 17:04:45 2018
Job Description        Database Rollforward Recovery
Invoker Type           User
Total Phases           2
Current Phase          2

Progress:
Address             PhaseNum  Description  StartTime
0x0000000203A5F708  1         Forward      Mon Mar 26 17:04:45
0x0000000203A5F890  2         Backward     Mon Mar 26 17:05:40

CompletedWork  TotalWork
249 bytes      249 bytes
0 bytes        249 bytes




As we can see, the output shows the current LSN and LSO, which are currently being replayed/compensated. There is also the “progress” section at the bottom showing the progress of the recovery, including the description, starting time, and total work done for each phase. If these values change over time, then the recovery process must be progressing, and takeover has not hanged.

Database accessibility during HADR takeover

Beginning in 11.1.2.2, the database can be configured to allow connectivity during the backward phase of crash recovery or takeover. The backward phase is divided into two parts. The first part is synchronous, for undoing transactions with DDL, catalog changes, column-organized tables, etc. This part still does not allow connections. The second part is asynchronous, for undoing transactions that only touch regular objects. In this part the database can be connectable. 
By setting the registry variable DB2_ONLINERECOVERY=YES before takeover, the takeover command will return at the start of asynchronous backward phase, when the database become connectable. At this moment, the tables, indexes, or objects NOT associated with uncommitted transactions will be fully accessible.
Another registry variable DB2_ONLINERECOVERY_WITH_UR_ACCESS will control if the tables, indexes, or objects associated with these uncommitted transactions are accessible using UR isolation level. As these uncommitted transactions are compensated, the locks on the associated objects will be released and they will become fully accessible.
For details on this new feature, please see the Knowledge Center article: https://www.ibm.com/support/knowledgecenter/SSEPGG_11.1.0/com.ibm.db2.luw.admin.ha.doc/doc/c0070281.html

Comments

Post a Comment