Previously on Jepsen, we introduced the problem of network partitions. Here, we demonstrate that a few transactions which “fail” during the start of a partition may have actually succeeded.

PostgreSQL is a terrific open-source relational database. It offers a variety of consistency guarantees, from read uncommitted to serializable. Because Postgres only accepts writes on a single primary node, we think of it as a CP system in the sense of the CAP theorem: if a partition occurs and you can’t talk to the server, the system is unavailable. Because transactions are ACID, we’re always consistent.


Well… almost. Even though the Postgres server is always consistent, the distributed system composed of the server and client together may not be consistent. It’s possible for the client and server to disagree about whether or not a transaction took place.


Postgres’ commit protocol, like those of most relational databases, is a special case of two-phase commit, or 2PC. In the first phase, the client votes to commit (or abort) the current transaction and sends that message to the server. The server checks whether its consistency constraints allow the transaction to proceed and, if so, votes to commit. It writes the transaction to storage and informs the client that the commit has taken place (or failed, as the case may be). Now both the client and the server agree on the outcome of the transaction.

What happens if the message acknowledging the commit is dropped before the client receives it? Then the client doesn’t know whether the commit succeeded or not! The 2PC protocol says that we must wait for the acknowledgement message to arrive in order to decide the outcome. If it never arrives, 2PC deadlocks; it’s not a partition-tolerant protocol. Waiting forever isn’t realistic for real systems, so at some point the client will time out and declare that an error occurred. The commit protocol is now in an indeterminate state.
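This failure mode can be sketched with a toy model in Python. This is an illustration of the handshake, not Postgres’ actual wire protocol; the message names and classes here are invented for the example:

```python
# Toy model of the commit handshake between a client and a single server.
# The server always receives and applies the commit; the only variable is
# whether the acknowledgement survives the trip back to the client.

class Server:
    def __init__(self):
        self.committed = set()

    def handle_commit(self, txn_id):
        # The server validates the transaction, writes it to storage,
        # and produces an acknowledgement for the client.
        self.committed.add(txn_id)
        return ("ack", txn_id)

def client_commit(server, txn_id, network_delivers_ack):
    """Returns the client's view of the transaction outcome."""
    ack = server.handle_commit(txn_id)  # commit message arrives; server commits
    if network_delivers_ack:
        return "committed"              # normal case: both sides agree
    # The ack was dropped: the client times out. From its perspective it
    # cannot distinguish "server committed but ack lost" from "commit
    # message never arrived at all".
    return "unknown"

server = Server()
assert client_commit(server, 1, network_delivers_ack=True) == "committed"
assert client_commit(server, 2, network_delivers_ack=False) == "unknown"
assert 2 in server.committed  # ...yet the server really did commit txn 2
```

The final assertion is the whole problem in miniature: the client reports an error for transaction 2, but the server’s state includes it.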

To demonstrate this, we’ll need an install of Postgres to work with.

salticid postgres.setup

This installs Postgres from apt, uploads some config files from jepsen/salticid/postgres, and creates a database for Jepsen. Then we’ll run a simple application which writes a single row for each number, inside a transaction.

cd jepsen
lein run pg -n 100

If all goes well, you’ll see something like:

85  :ok
91  :ok
90  :ok
95  :ok
96  :ok
Hit enter when ready to collect results.

Writes completed in 0.317 seconds

100 total
100 acknowledged
100 survivors
All 100 writes succeeded. :-D


Each line shows the number being written, followed by whether the write was acknowledged. In this example, all five nodes talk to a single Postgres server on n1. Out of 100 writes, the clients reported that all 100 succeeded, and at the end of the test, all 100 numbers were present in the result set.

Now let’s cause a partition. Since this failure mode only arises when the connection drops after the server decides to acknowledge the commit, but before the client receives the acknowledgement, there’s only a short window in which to begin the partition. We can widen that window by slowing down the network:

salticid jepsen.slow

Now, we start the test:

lein run pg

And while it’s running, cut off all postgres traffic to and from n1:

salticid jepsen.drop_pg

If we’re lucky, we’ll manage to catch one of those acknowledgement packets in flight, and the client will log an error like:

217 An I/O error occurred while sending to the backend. 
Failure to execute query with SQL:
INSERT INTO "set_app" ("element") VALUES (?)  ::  [219]
 Message: An I/O error occured while sending to the backend.
 SQLState: 08006
 Error Code: 0
218 An I/O error occured while sending to the backend.

After that, new transactions will just time out; the client will correctly log these as failures:

220 Connection attempt timed out.
222 Connection attempt timed out.

We can resolve the partition with salticid jepsen.heal, and wait for the test to complete.

1000 total
950 acknowledged
952 survivors
2 unacknowledged writes found! ヽ(´ー`)ノ
(215 218)
0.95 ack rate
0.0 loss rate
0.002105263 unacknowledged but successful rate


So out of 1000 attempted writes, 950 were successfully acknowledged, and all 950 of those writes were present in the result set. However, two writes (215 and 218) succeeded even though they threw an exception claiming that a failure occurred! Note that this exception doesn’t tell us whether the write succeeded or failed: 217 also threw an I/O error while sending, but because the connection dropped before the client’s commit message arrived at the server, that transaction never took place.

There is no way to distinguish these cases from the client. A network partition, and indeed most network errors, doesn’t mean a failure; it means the absence of information. Without a partition-tolerant commit protocol, like extended three-phase commit, we cannot assert the state of the system for these writes.

2PC strategies

Two-phase commit protocols aren’t just for relational databases. They crop up in all sorts of consensus problems. MongoDB’s documents essentially comprise an asynchronous network, and many users implement 2PC on top of their Mongo objects to obtain multi-key transactions.

If you’re working with two-phase commit, there are a few things you can do. One is to accept false negatives. In most relational databases, the probability of this failure occurring is low, and it can only affect writes which were in flight at the time the partition began. It may be perfectly acceptable to return failures to clients even if there’s a small chance the transaction succeeded.

Alternatively, you can use consistency guarantees or other data structures to allow for idempotent operations. When you encounter a network error, just retry them blindly. A highly available queue with at-least-once delivery is a great place to put repeatable writes which need to be retried later.
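As a sketch of the retry strategy, consider this Python toy model, with an in-memory set standing in for the database behind an unreliable network. `FlakyDB` and its drop rate are invented for illustration; the point is that because inserting an element into a set is idempotent, blind retries after ambiguous errors are harmless:

```python
import random

class FlakyDB:
    """In-memory stand-in for a database behind an unreliable network."""
    def __init__(self, drop_rate=0.5, seed=0):
        self.rows = set()
        self.rng = random.Random(seed)
        self.drop_rate = drop_rate

    def insert(self, element):
        # The write always reaches the server and is applied...
        self.rows.add(element)
        # ...but the acknowledgement may be lost on the way back.
        if self.rng.random() < self.drop_rate:
            raise IOError("connection reset")

def insert_with_retries(db, element, max_attempts=10):
    # Because the insert is idempotent (set semantics), retrying after an
    # ambiguous network error can never create duplicates or corrupt state.
    for _ in range(max_attempts):
        try:
            db.insert(element)
            return True
        except IOError:
            continue
    return False

db = FlakyDB()
for n in range(100):
    insert_with_retries(db, n)
assert db.rows == set(range(100))  # every write present exactly once
```

Even though half the acknowledgements are dropped, the final state is correct: retries of an already-applied write are no-ops.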

Finally, within some databases you can obtain strong consistency by taking note of the current transaction ID, and writing that ID to the database during the transaction. When the partition is resolved, the client can either retry or cancel the transaction at a later time, by checking whether or not that transaction ID was written. Again, this relies on having some sort of storage suitable for the timescales of the partition: perhaps a local log on disk, or an at-least-once queue.
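Here is a sketch of that recovery strategy, again using an in-memory stand-in rather than a real database: the client durably logs an identifier before committing, writes that identifier as part of the transaction, and after the partition heals, resolves any in-doubt transactions by checking whether the identifier was written. The `DB` class and method names are invented for the example:

```python
class DB:
    """In-memory stand-in: records transaction IDs written during commits."""
    def __init__(self):
        self.committed_txn_ids = set()

    def commit(self, txn_id, deliver_ack=True):
        # The transaction writes its own ID as one of its rows.
        self.committed_txn_ids.add(txn_id)
        if not deliver_ack:
            raise IOError("connection reset")  # ack lost in the partition

    def txn_committed(self, txn_id):
        # After the partition heals, anyone can ask whether the ID is present.
        return txn_id in self.committed_txn_ids

def commit_and_log(db, log, txn_id, deliver_ack=True):
    log.append(txn_id)  # durable local record, written *before* committing
    try:
        db.commit(txn_id, deliver_ack)
        return "committed"
    except IOError:
        return "unknown"  # in doubt: resolve later via the log

db, log = DB(), []
assert commit_and_log(db, log, "txn-1") == "committed"
assert commit_and_log(db, log, "txn-2", deliver_ack=False) == "unknown"
# After the partition heals, resolve every logged transaction:
assert all(db.txn_committed(t) for t in log)
```

The key design choice is writing the log entry before sending the commit: an in-doubt transaction is then always discoverable later, at the cost of keeping the log for the duration of the partition.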

In the next post, we look at a very different kind of consistency model: Redis Sentinel.


bah… can’t see what you’ve put in your “heal” and few other scripts, only browsed rapidly through that unreadable clojure thing (sorry, never liked lisp-like languages). So I quite don’t understand what’s behind your “survivors” messages.

Additionally, with PG and other RDBMSs, 2PC is not used for single-branch transactions, so I have serious doubts about the whole article. And the behaviour you describe seems normal and expected (I have already seen it with Oracle+JDBC).

FWIW with real 2PC you would have ended with stale transactions ready to commit, which would require manual intervention to definitely commit or rollback.

Colin Scott on

For what it’s worth the pg app needs to be invoked from the top-level jepsen directory (the cd salticid step above is incorrect)

Bryan Taylor on

This seems to be making a mountain out of a molehill. When a transaction is “in doubt”, after commit has been sent but before the commit has been ack’d or failed, there isn’t a violation of consistency. The test is whether concurrent requests get the same answer. It takes time to get that answer, and in the case of a network partition that affects us, it could take a very long time. But if the answer eventually comes, we can verify that it was consistent by comparing with other requests at the same time. The fact that we don’t know yet doesn’t mean that it will fail the test once we can perform it.

Michael South on

The database may be consistent, but the system isn’t. A concurrent request to the db will get the answer “yes, the transaction has committed”, but the same request of the remote client gets “no, the transaction has not yet committed.” The system may eventually become consistent, if the partition is healed and the acknowledgement reaches the client. But it isn’t consistent until that point.

And the client can’t just wait indefinitely for acknowledgement–the commit request may not have reached the server, in which case the client would deadlock forever. Not to mention practical concerns (a customer and clerk aren’t going to wait very long for a credit card transaction to complete). Introducing timeouts then causes the temporary inconsistency to become permanent.

Peter Mogensen on

The new PostgreSQL 9.4 bidirectional replication offers something they call “Global Sequences”.

They are supposed to be globally unique among several masters. They are not incremented in consensus, but different masters are allocated a subset of the sequence range in consensus. At least - so the documentation says:

“BDR uses a voting/election system to assign blocks from global sequences. The details are in the source code.”

misha on

aphyr, here is a new challenge, could you say some reasonable :) words about pg sync replication? 25.2.8. Synchronous Replication

Martin Grotzke

I’d also be interested to see what jepsen can say about postgresql multi master replication!

Shaposhnikov Alexander on

Indeed, it is time to test postgres once again, now for real

Gavin on

Finally, within some databases you can obtain strong consistency by taking note of the current transaction ID

What is necessary for this to work? Is postgres one of the databases? I imagine XID wraparound might be problematic.

Janus on

Why is this post not listed on the Jepsen front page?

Aphyr on

Two-generals isn’t really a problem in Postgres, and that’s how people, in retrospect, have come to interpret Jepsen posts. Without the context of the original six-part series, I don’t think it makes sense.

Matthew Hayward

Late to the party here, I applaud this endeavor - but there are some issues with this article.

  1. Most relational DBMSs do not use two-phase commit as their standard commit protocol. In fact, I doubt any do. Two-phase commit protocols support distributed databases, where multiple, distinct databases appear to external applications as a single database. Two-phase commit is a way of coordinating a transaction among multiple distinct databases, not a way of committing data by an application to a single database. Two-phase commit is usually invisible to the application.

When you commit a typical transaction to a typical relational database, two-phase commit is not used.

  2. Your test application is not issuing two-phase commits, which would make use of PostgreSQL’s PREPARE TRANSACTION statement.

  3. In any case, for either the standard commit protocol or the two-phase commit protocol, what the protocol is trying to achieve is that the database is in a consistent state after the commit.

Whether or not the database can communicate that the commit was successful to the requesting app is not part of the commit protocol or RDBMS/ACID consistency model.

See for example:

Nandakumar M on

Postgres since version 10 supports txid_current() and txid_status(bigint) which could be used by the client to check the status of the transaction even in case of a network partition.

The following is an excerpt from Postgres documentation

txid_status(bigint) reports the commit status of a recent transaction. Applications may use it to determine whether a transaction committed or aborted when the application and database server become disconnected while a COMMIT is in progress. The status of a transaction will be reported as either in progress, committed, or aborted, provided that the transaction is recent enough that the system retains the commit status of that transaction.

