Google

Berkeley DB Reference Guide:
Berkeley DB Replication

PrevRefNext

Transactional guarantees

It is important to consider replication in the context of the overall database environment's transactional guarantees. To briefly review, transactional guarantees in a non-replicated application are based on the writing of log file records to "stable storage", usually a disk drive. If the application or system then fails, the Berkeley DB logging information is reviewed during recovery, and the databases are updated so that all changes made as part of committed transactions appear, and all changes made as part of uncommitted transactions do not appear. In this case, no information will have been lost.

If a database environment does not require that the log be flushed to stable storage on transaction commit (using the DB_TXN_NOSYNC flag to increase performance at the cost of sacrificing transactional durability), Berkeley DB recovery will only be able to restore the system to the state of the last commit found on stable storage. In this case, information may have been lost (for example, the changes made by some committed transactions may not appear in the databases after recovery).

Further, if there is database or log file loss or corruption (for example, if a disk drive fails), then catastrophic recovery is necessary, and Berkeley DB recovery will only be able to restore the system to the state of the last archived log file. In this case, information may also have been lost.

Replicating the database environment extends this model, by adding a new component to "stable storage": the client's replicated information. If a database environment is replicated, there is no lost information in the case of database or log file loss, because the replicated system can be configured to contain a complete set of databases and log records up to the point of failure. A database environment that loses a disk drive can have the drive replaced, and it can rejoin the replication group as a client.

Because of this new component of stable storage, specifying DB_TXN_NOSYNC in a replicated environment no longer sacrifices durability, as long as one or more clients have acknowledged receipt of the messages sent by the master. Since network connections are often faster than local disk writes, replication becomes a way for applications to significantly improve their performance as well as their reliability.

The return status from the send interface specified to the DB_ENV->set_rep_transport method must be set by the application to ensure the transactional guarantees the application wants to provide. The effect of the send interface returning failure is to flush the local database environment's log as necessary to ensure that any information critical to database integrity is not lost. Because this flush is an expensive operation in terms of database performance, applications will want to avoid returning an error from the send interface, if at all possible:

First, there is no reason for the send interface to ever return failure unless the DB_REP_PERMANENT flag is specified. Messages without that flag do not make visible changes to databases, and therefore the application's send interface can return success to Berkeley DB for such messages as soon as the message has been sent or even just copied to local memory.

Further, unless the master's database environment has been configured to not synchronously flush the log on transaction commit, there is no reason for the send interface to ever return failure, as any information critical to database integrity has already been flushed to the local log before send was called. Again, the send interface should return success to Berkeley DB as soon as possible. However, in this case, in order to avoid potential loss of information after the master database environment fails, the master database environment should be recovered before holding an election, as only the master database environment is guaranteed to have the most up-to-date logs.

To sum up, the only reason for the send interface to return failure is when the master database environment has been configured to not synchronously flush the log on transaction commit, the DB_REP_PERMANENT flag is specified for the message, and the send interface was unable to determine that some number of clients have received the current message (and all messages preceding the current message). How many clients should receive the message before the send interface can return success is an application choice (and may not depend as much on a specific number of clients reporting success as one or more geographically distributed clients).

Of course, it is important to ensure that the replicated master and client environments are truly independent of each other. For example, it does not help matters that a client has acknowledged receipt of a message if both master and clients are on the same power supply, as the failure of the power supply will still potentially lose information.

Finally, the Berkeley DB replication implementation has one other additional feature to increase application reliability. Replication in Berkeley DB is implemented to perform database updates using a different code path than the standard ones. This means operations which manage to crash the replication master due to a software bug will not necessarily also crash replication clients.

PrevRefNext

Copyright Sleepycat Software