Monitoring Db2 HADR Takeover
In a DB2 HADR system, if the Primary database encounters a
fatal error or is brought down, then the Standby database has to takeover to
become the new Primary. Sometimes the process could take longer than most would
expect. This is because during takeover the old Standby must undo all the open
transactions before transforming into the new Primary. Recently we have
received a case that the user thought the takeover hanged while it was just
taking a long time to finish backward phase. Today I will talk about how to
monitor the process of takeover in such scenario and our new feature to allow faster
takeover.
The scenario
When takeover command is issued on an active Standby
database, the Standby will have to replay all the log records it has received,
and then bring up the database consistently. This means at the end of the logs,
if there are open transactions, these transactions must be undone as they have
not committed at the time of takeover. Normally, this should not take too long.
But in the case that there are many open transactions with large workload, it
would take significant amount of time to complete undo phase. This could cause
significant downtime for the HADR system and impact its high availability.
Monitoring workload
Transactions:
Address AppHandl [nod-index] TranHdl Locks State Tflag
0x00007F409D103280 14 [000-00014] 4 0 READ 0x00000000
0x00007F409D109380 0 [000-00000] 6 0 WRITE 0x00000000
Tflag2 Firstlsn Lastlsn Firstlso Lastlso
0x00000000 0x0000000000000000 0x0000000000000000 0 0
0x00000000 0x00000000000772A8 0x0000000000077437 61762921 61796038
LogSpace SpaceReserved TID AxRegCnt GXID
0 0 0x000000000100 1 0
0 0 0x00000000022E 0 0
As we can see, there is one write transaction that is going on. The transaction starts at LSN 0x0772A8 and is currently at 0x077437. This transaction is still relatively small and would not take too long to undo. However, if the difference between LastLsn and FirstLsn becomes too big, then it may take considerable time to undo the transaction when takeover happens. It’s recommended for customer to optimize the workload to avoid long-running transactions, so that it would take reasonable time for HADR standby to complete takeover in case of fatal failure on primary.
Monitoring takeover using db2pd
Database Member 0 -- Database TESTDB -- Standby -- Up 0 days 00:37:34 -- Date 2018-03-26-17.10.00.248172
Recovery:
Recovery Status 1000040F40080802
Current Log S0000036.LOG
Current LSN 0000000000040375
Current LRI 0000000000000001000000000000125B0000000000040375
Current LSO 45423240
Job Type ROLLFORWARD RECOVERY
Job ID 4
Job Start Time (1522098285) Mon Mar 26 17:04:45 2018
Job Description Database Rollforward Recovery
Invoker Type User
Total Phases 2
Current Phase 2
Progress:
Address PhaseNum Description StartTime
0x0000000203A5F708 1 Forward Mon Mar 26 17:04:45
0x0000000203A5F890 2 Backward Mon Mar 26 17:05:40
CompletedWork TotalWork
249 bytes 249 bytes
0 bytes 249 bytes
As we can see, the output shows the current LSN and LSO,
which are currently being replayed/compensated. There is also the “progress”
section at the bottom showing the progress of the recovery, including the
description, starting time, and total work done for each phase. If these values
change over time, then the recovery process must be progressing, and takeover
has not hanged.
Database accessibility during HADR takeover
Beginning in 11.1.2.2, the database can be configured to
allow connectivity during the backward phase of crash recovery or takeover. The
backward phase is divided into two parts. The first part is synchronous, for
undoing transactions with DDL, catalog changes, column-organized tables, etc.
This part still does not allow connections. The second part is asynchronous,
for undoing transactions that only touch regular objects. In this part the
database can be connectable.
By setting the registry variable DB2_ONLINERECOVERY=YES
before takeover, the takeover command will return at the start of asynchronous
backward phase, when the database become connectable. At this moment, the
tables, indexes, or objects NOT associated with uncommitted transactions will
be fully accessible.
Another registry variable DB2_ONLINERECOVERY_WITH_UR_ACCESS
will control if the tables, indexes, or objects associated with these
uncommitted transactions are accessible using UR isolation level. As these
uncommitted transactions are compensated, the locks on the associated objects
will be released and they will become fully accessible.
For details on this new feature, please see the Knowledge
Center article: https://www.ibm.com/support/knowledgecenter/SSEPGG_11.1.0/com.ibm.db2.luw.admin.ha.doc/doc/c0070281.html
Good ...
ReplyDelete