Berkeley DB Reference Guide:
Distributed Transactions

Building a Global Transaction Manager

Managing distributed transactions and using the two-phase commit protocol of Berkeley DB from an application requires the application provide the functionality of a global transaction manager (GTM). The GTM is responsible for the following:

Communicating with the multiple environments (potentially on separate systems).
Managing the global transaction ID name space.
Maintaining state information about each distributed transaction.
Recovering from failures of individual environments.
Recovering the global transaction state after failure of the global transaction manager.

Communicating with multiple Berkeley DB environments

Two-phase commit is required if an application wants to transaction protect Berkeley DB calls across multiple environments. If the environments reside on the same machine, the application can communicate with each environment through its own address space with no additional complexity. If the environments reside on separate machines, the application can either use the Berkeley DB RPC server to manage the remote environments or it may use its own messaging capability, translating messages on the remote machine into calls into the Berkeley DB library (including the recovery calls). For some applications, it might be sufficient to use Tcl's remote invocation to remote copies of the tclsh utility into which the Berkeley DB library has been dynamically loaded.

Managing the Global Transaction ID (GID) name space

A global transaction is a transaction that spans multiple environments. Each global transaction must have a unique transaction ID. This unique ID is the global transaction ID (GID). In Berkeley DB, global transaction IDs must be represented with the confines of a DB_XIDDATASIZE size (currently 128 bytes) array. It is the responsibility of the global transaction manager to assign GIDs, guarantee their uniqueness, and manage the mapping of local transactions to GID. That is, for each GID, the GTM should know which local transactions managers participated. The Berkeley DB logging system or a Berkeley DB table could be used to record this information.

Maintaining state for each distributed transaction.

In addition to knowing which local environments participate in each global transaction, the GTM must also know the state of each active global transaction. As soon as a transaction becomes distributed (that is, a second environment participates), the GTM must record the existence of the global transaction and all participants (whether this must reside on stable storage or not depends on the exact configuration of the system). As new environments participate, the GTM must keep this information up to date.

When the GTM is ready to begin commit processing, it should issue DB_TXN->prepare calls to each participating environment, indicating the GID of the global transaction. Once all the participants have successfully prepared, then the GTM must record that the global transaction will be committed. This record should go to stable storage. Once written to stable storage, the GTM can send DB_TXN->commit requests to each participating environment. Once all environments have successfully completed the commit, the GTM can either record the successful commit or can somehow "forget" the global transaction.

If nested transactions are used (that is, the parent parameter is specified to DB_ENV->txn_begin), no DB_TXN->prepare call should be made on behalf of any child transaction. Only the ultimate parent should even issue a DB_TXN->prepare.

Should any participant fail to prepare, then the GTM must abort the global transaction. The fact that the transaction is going to be aborted should be written to stable storage. Once written, the GTM can then issue DB_TXN->abort requests to each environment. When all aborts have returned successfully, the GTM can either record the successful abort or "forget" the global transaction.

In summary, for each transaction, the GTM must maintain the following:

A list of participating environments
The current state of each transaction (pre-prepare, preparing, committing, aborting, done)

Recovering from the failure of a single environment

If a single environment fails, there is no need to bring down or recover other environments (the only exception to this is if all environments are managed in the same application address space and there is a risk that the failure of the environment corrupted other environments). Instead, once the failing environment comes back up, it should be recovered (that is, conventional recovery, via db_recover or by specifying the DB_RECOVER flag to DB_ENV->open should be run). If the db_recover utility is used, then the -e option must be specified. In this case, the application will almost certainly want to specify environmental parameters via a DB_CONFIG file in the environment's home directory, so that db_recover can create an appropriately configured environment. If the db_recover utility is not used, then DB_PRIVATE should not be specified, unless all processing including recovery, calls to DB_ENV->txn_recover, and calls to finish prepared, but not yet complete transactions take place using the same database environment handle. The GTM should then issue a DB_ENV->txn_recover call to the environment. This call will return a list of prepared, but not yet committed or aborted transactions. For each transaction, the GTM should look up the GID in its local store to determine if the transaction should commit or abort. If the GTM is running in a system with multiple GTMs, it is possible that some of the transactions returned via DB_ENV->txn_recover do not belong to the current environment. The GTM should detect this and call DB_TXN->discard on each such transaction handle. Furthermore, it is important to note that the environment does not retain information about which GTM has issued DB_ENV->txn_recover operations. Therefore, each GTM should issue all its DB_ENV->txn_recover calls, before another GTM issues its calls. If the calls are interleaved, each GTM may not get a complete and consistent set of transactions. The simplest way to enforce this is for each GTM to make sure it can receive all its outstanding transactions in a single DB_ENV->txn_recover call. The maximum number of possible outstanding transactions is bounded by the maximum number of active transactions in the environment. This number can be obtained by using the DB_ENV->txn_stat interface or the db_stat utility.

The newly recovered environment will forbid any new transactions from being started until the prepared but not yet committed/aborted transactions have been resolved. In the multiple GTM case, this means that all GTMs must recover before any GTM can begin issuing new transactions.

Because Berkeley DB flushed both commit and abort records to disk for two-phase transaction, once the global transaction has either committed or aborted, no action will be necessary in any environment. If local environments are running with the DB_TXN_WRITE_NOSYNC or DB_TXN_NOSYNC options (that is, is not writing and/or flushing the log synchronously at commit time), then it is possible that a commit or abort operation may not have been written in the environment. In this case, the GTM must always have a record of completed transactions to determine if prepared transactions should be committed or aborted.

Recovering from GTM failure

If the GTM fails, it must first recover its local state. Assuming the GTM uses Berkeley DB tables to maintain state, it should run db_recover (or the DB_RECOVER option to DB_ENV->open) upon startup. Once the GTM is back up and running, it needs to review all its outstanding global transactions, that is all transaction which are recorded, but not yet committed or aborted.

Any global transactions which have not yet reached the prepare phase should be aborted. If these transactions were on remote systems, the remote systems should eventually time them out and abort them. If these transactions are on the local system, we assume they crashed and were aborted as part of GTM startup.

The GTM must then identify all environments which need to have their DB_ENV->txn_recover interface called. This includes all environments that participate in any transaction that is in the preparing, aborting, or committing state. For each environment, the GTM should issue a DB_ENV->txn_recover call. Once each environment has responded, the GTM can determine the fate of each transaction. The correct behavior is defined depending on the state of the global transaction according to the table below.

preparing: if all participating environments return the transaction in the prepared but not yet committed/aborted state, then the GTM should commit the transaction. If any participating environment fails to return it, then the GTM should issue an abort to all environments that did return it.
committing: the GTM should send a commit to any environment that returned this transaction in its list of prepared but not yet committed/aborted transactions.
aborting: the GTM should send an abort to any environment that returned this transaction in its list of prepared but not yet committed/aborted transactions.