Friday 26 June 2015

Automatic Failover:

Data Guard implements a feature called Fast-Start Failover (FSFO), which uses the Broker to perform the failover when there is a problem. The architecture relies on a third member, acting as a quorum, which ensures that a failover occurs only when the rules you have defined are met. Once the failover has happened, the old primary is never allowed to open, which avoids a split-brain scenario: it would be a bit of a nightmare should both databases be open and processing transactions.

The third member is called the Observer. Its job is to maintain a connection with the primary and target standby databases, monitoring their health and performing any failover that is necessary. The Observer will also reinstate the old primary when it comes back, if it can. The Observer pings the primary database and at the first sign of trouble it starts a countdown (which you configure). If it re-establishes the connection, it makes all the necessary checks before going back to watch mode. If the timer expires, it checks that the standby can take over and initiates a failover; this failover happens automatically and in the background using the Broker. If and when the primary comes back, the Observer reinstates it as a standby database, again using the Broker.
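
The countdown behaviour described above can be sketched in a few lines of Python. This is only a simplified model for illustration, not how Oracle actually implements the Observer; the function name, the one-second ping interval and the "blocked" result are all assumptions made for the example.

```python
from typing import Iterable


def observer_decision(pings: Iterable[bool], threshold_seconds: int,
                      standby_ready: bool, ping_interval: int = 1) -> str:
    """Return 'watch', 'failover' or 'blocked' for a sequence of ping results.

    Hypothetical model: each item in `pings` is True if the primary answered.
    """
    silent_for = 0  # seconds since the primary last answered
    for answered in pings:
        if answered:
            silent_for = 0  # connection re-established, reset the countdown
            continue
        silent_for += ping_interval
        if silent_for >= threshold_seconds:
            # Timer expired: fail over only if the standby can take over.
            return "failover" if standby_ready else "blocked"
    return "watch"


if __name__ == "__main__":
    # A short network blip does not trigger a failover.
    print(observer_decision([True, False, False, True, False], 3, True))
    # Three silent seconds with a threshold of 3 does.
    print(observer_decision([False, False, False], 3, True))
```

The point of the model is that a single successful ping resets the timer, which is why a threshold that is too short makes the configuration sensitive to network blips.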

It is important where the Observer is placed in the network. Only one Observer can be installed per Data Guard configuration, so this server must have access to both the primary and standby databases, with as much redundant networking as possible. Your next thought is probably how much the Observer is going to cost: not much, as it can run on most platforms and only requires the Oracle Client kit for the version of Oracle that you are running. You must set up TNSNAMES on the Observer to allow it to ping the databases. If the Observer crashes it has no impact on the current Oracle environment; the only impact is that FSFO will not be available until the Observer is up and running again. The Observer can detect the following:

    Database crash
    System crash
    The loss of the network
    Complete site outage

You can also get FSFO to perform a shutdown abort on the primary when other issues arise such as

    Datafile Offline
    Corrupted Controlfile
    Corrupted Dictionary
    Inaccessible Logfile
    Stuck Archiver

The tags above must be entered exactly as shown, otherwise the Broker will not understand them.

Monitor a specific condition via the Broker:

    DGMGRL> enable fast_start failover condition "Corrupted Controlfile";
    DGMGRL> enable fast_start failover condition "Datafile Offline";
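
Because the tags have to match exactly, it can help to check a name before handing it to the Broker. The following is a minimal sketch, assuming you generate your DGMGRL commands from a script; the helper name is hypothetical and the exact-match rule (including case) is an assumption based on the note above.

```python
# The five health condition tags listed above, spelled exactly as shown.
HEALTH_CONDITIONS = (
    "Datafile Offline",
    "Corrupted Controlfile",
    "Corrupted Dictionary",
    "Inaccessible Logfile",
    "Stuck Archiver",
)


def enable_condition_command(condition: str) -> str:
    """Build the DGMGRL command for a health condition, rejecting unknown tags."""
    if condition not in HEALTH_CONDITIONS:
        raise ValueError(f"unknown health condition: {condition!r}")
    return f'enable fast_start failover condition "{condition}";'


if __name__ == "__main__":
    print(enable_condition_command("Stuck Archiver"))
```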

To display what you are monitoring, use the following.

Display the conditions that are being monitored:

    DGMGRL> show fast_start failover;

Fast-Start Failover: DISABLED

Threshold: 30 seconds
Target: (none)
Observer: (none)
Lag Limit: 30 seconds
Shutdown Primary: TRUE
Auto-reinstate: TRUE

Configurable Failover Conditions
Health Conditions:
Corrupted Controlfile YES
Corrupted Dictionary YES
Inaccessible Logfile NO
Stuck Archiver NO
Datafile Offline YES

Oracle Error Conditions:
(none)
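
If you want to check these settings from a monitoring script, the output above is simple enough to parse. This is a rough sketch that assumes the layout shown in this post (the exact spacing and fields may differ between Oracle versions); the function name and the returned structure are my own choices, not part of any Oracle tool.

```python
SAMPLE = """\
Fast-Start Failover: DISABLED

Threshold: 30 seconds
Target: (none)
Observer: (none)
Lag Limit: 30 seconds
Shutdown Primary: TRUE
Auto-reinstate: TRUE

Configurable Failover Conditions
Health Conditions:
Corrupted Controlfile YES
Corrupted Dictionary YES
Inaccessible Logfile NO
Stuck Archiver NO
Datafile Offline YES

Oracle Error Conditions:
(none)
"""


def parse_fsfo_status(text: str) -> dict:
    """Split 'show fast_start failover' output into settings and conditions."""
    settings, conditions = {}, {}
    for raw in text.splitlines():
        line = raw.strip()
        if ":" in line:
            key, _, value = line.partition(":")
            if value.strip():  # skip section headers such as 'Health Conditions:'
                settings[key.strip()] = value.strip()
        elif line.endswith((" YES", " NO")):
            name, _, flag = line.rpartition(" ")
            conditions[name.strip()] = flag == "YES"
    return {"settings": settings, "conditions": conditions}


if __name__ == "__main__":
    status = parse_fsfo_status(SAMPLE)
    print(status["settings"]["Threshold"])
    print(status["conditions"])
```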

Now that you have an overview of FSFO it's time to set it up and test it. Just a quick check before we progress: make sure that the following has been set up or configured

    Use the Broker with all its prerequisites
    Enable flashback database on both primary and standby
    Set up the configuration correctly for the protection mode (standby redo log files on both sides, redo transport set up the same in both directions)
    Install the Observer system and configure TNSNAMES

If you are using more than one standby you must let the Broker know which one you want to become the primary. If you only have one, the Broker will know already.

Select the standby to become the primary:

    DGMGRL> edit database prod1 set property FastStartFailoverTarget = 'prod1dr';
    DGMGRL> edit database prod1dr set property FastStartFailoverTarget = 'prod1';

Now it's time to discuss how long you should wait before you fail over. You don't want the threshold too short, just in case your network blips. By default it is set to 30 seconds, but you can go down to 6 seconds if you wish.

Change the threshold:

    DGMGRL> edit configuration set property FastStartFailoverThreshold = 45;

You can control the amount of data loss if you are using one of the lesser protection modes: the greater the lag limit, the greater the potential data loss. Again, the value is in seconds.

Set the lag limit:

    DGMGRL> edit configuration set property FastStartFailoverLagLimit = 60;

If the data loss is less than the limit, the failover will proceed. If more redo would be lost than the lag limit allows, the failover will not occur, and nothing happens until either the primary database comes back and processing continues or you choose to fail over manually, suffering the additional data loss. The lag limit applies only in maximum performance mode; in a zero-data-loss protection mode this property is ignored.
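
The lag limit rule boils down to a single comparison. Here is a tiny sketch of it; the function and parameter names are hypothetical, and treating a lag exactly equal to the limit as acceptable is my assumption.

```python
def failover_allowed(standby_lag_seconds: float, lag_limit_seconds: int) -> bool:
    """Return True if an automatic failover may proceed under the lag limit.

    Assumption: a lag exactly equal to the limit is still acceptable.
    """
    return standby_lag_seconds <= lag_limit_seconds


if __name__ == "__main__":
    # With FastStartFailoverLagLimit = 60, a standby 20 seconds behind can take over,
    # while one that is 90 seconds behind cannot.
    print(failover_allowed(20, 60))
    print(failover_allowed(90, 60))
```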

Here are two more properties that you can set regarding the primary: one shuts it down if it becomes hung, and the other reinstates it if a failover does occur.

Abort the primary if it is in a hung state:

    DGMGRL> edit configuration set property FastStartFailoverPmyShutdown = true;

Reinstate the primary after a failover:

    DGMGRL> edit configuration set property FastStartFailoverAutoReinstate = true;
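
If you set these properties on several environments, you may prefer to generate the commands rather than type them. The sketch below is one way to do that under a few assumptions: the helper name is hypothetical, the defaults are simply the values shown in the `show fast_start failover` output earlier, and the threshold property is spelled `FastStartFailoverThreshold` as in the Broker documentation. Only the 6 second minimum for the threshold is checked.

```python
def build_fsfo_script(threshold: int = 30, lag_limit: int = 30,
                      shutdown_primary: bool = True,
                      auto_reinstate: bool = True) -> str:
    """Return DGMGRL 'edit configuration' commands for the FSFO properties."""
    if threshold < 6:
        raise ValueError("threshold must be at least 6 seconds")
    properties = {
        "FastStartFailoverThreshold": threshold,
        "FastStartFailoverLagLimit": lag_limit,
        "FastStartFailoverPmyShutdown": str(shutdown_primary).lower(),
        "FastStartFailoverAutoReinstate": str(auto_reinstate).lower(),
    }
    lines = [f"edit configuration set property {name} = {value};"
             for name, value in properties.items()]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_fsfo_script(threshold=45, lag_limit=60))
```

The output is plain text, so review it before pasting it into a DGMGRL session.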

Once you are happy with everything you can enable FSFO.

Enable FSFO:

    DGMGRL> enable fast_start failover;

## Display the configuration

DGMGRL> show fast_start failover;

Once it is all set up you can test FSFO by performing a shutdown abort on the primary, then checking that the failover occurs, that the primary is reinstated, and that the amount of data loss is what you expect if you are using one of the lesser protection modes. If you are using a test environment this is the time to experiment and play around with different settings. Again, keep an eye on the log files, including the Broker log file, to see how Oracle handles the failovers and to become familiar with them.

