Lock propagation in Postgres

Leo Sjöberg • August 9, 2024

We recently encountered the error Deadlock detected (SQLSTATE 40P01) in some of our code at work, and it took us ages to figure out what had gone wrong. I’m writing this to share the non-obvious ways in which postgres can lock a row for you, and how to avoid it.

At incident.io, we rely on our internal machine library, which is based on gocardless/statesman. We use this to track a bunch of different changes to entities, to ensure we have a historical record of changes made. In one of these processes, we managed to run into a deadlock, from running these seemingly innocuous queries in two processes (P1 meaning process 1, P2 meaning process 2):

1P1: SELECT * FROM escalations WHERE id = 'xyz' FOR UPDATE
2P2: UPDATE escalation_transitions SET most_recent = false WHERE escalation_id = 'xyz' and most_recent
3P1: UPDATE escalation_transitions SET most_recent = false WHERE escalation_id = 'xyz' and most_recent
4P1: LOCKWAIT (trying to run the same query as P2, which holds a lock)
5P2: INSERT INTO escalation_transitions (escalation_id, to_state, most_recent) VALUES ('xyz', 'new_state', true)
6P2: DEADLOCK [killed]

This really confused me when I first saw it. We’re getting a deadlock on an INSERT statement. A deadlock implies process 1 depends on process 2 and vice versa, but for process 2 to depend on process 1, it would have to depend on the escalation that P1 acquired a lock on, but it’s just an insert!

This confused me for a really long time. As it turns out, when you run an INSERT statement, Postgres will acquire a lock on all entities to which you have foreign key constraints.

That is to say, because we’re inserting escalation_id into our escalation_transitions table, we acquire a lock against escalations!

This was surprising to me, but what’s the lock? And importantly, why?

Understanding foreign key reference checks

When you insert a row that holds a foreign key, Postgres needs to verify that the key you’re inserting is valid. To do that, Postgres will acquire a lock FOR KEY SHARE, but what does this lock do?

A KEY SHARE lock is telling Postgres “you can update this row, but not any keys that may be referenced”. In Postgres, foreign keys must reference some unique column (primary key or otherwise), so when you take a FOR KEY SHARE lock, you allow other processes to update the row, as long as they don’t touch any unique columns.

That means another process can run an UPDATE query (but not DELETE), as long as it doesn’t modify row-unique keys. And this makes sense - in order to verify that the foreign key is valid, Postgres has to ensure that, by the end of the transaction, the key is still valid, so no other transaction is allowed to change it.

How our deadlock happened

In our case, this lock propagation caused a deadlock. Remember, the first thing that P1 does is acquire a lock by calling SELECT FOR UPDATE. This places a lock on the row in escalations, and allows the process that locked it to make any modifications, including changing the primary key. This means that no other process can take any lock, not even a key share lock, against the row.

So let’s go through the steps again, and see where we hit a deadlock:

  1. Process 1 executes SELECT * FROM escalations WHERE id = 'xyz' FOR UPDATE
    • That row in escalations now has a FOR UPDATE lock on it. Anyone who wants to do anything relating to the escalation will have to wait for the transaction to finish!
  2. Process 2 executes UPDATE escalation_transitions SET most_recent = false WHERE escalation_id = 'xyz' and most_recent
    • This acquires a FOR UPDATE lock on escalation_transitions against all rows matching the constraint within this transaction
  3. Process 1 starts executing UPDATE escalation_transitions SET most_recent = false WHERE escalation_id = 'xyz' and most_recent
    • This attempts to acquire the same FOR UPDATE lock that Process 2 does, but it’s already locked
  4. Process 1 enters a LOCKWAIT, waiting for Process 2 to release its lock on the rows in escalation_transitions
  5. Process 2 starts executing INSERT INTO escalation_transitions (escalation_id, to_state, most_recent) VALUES ('xyz', 'new_state', true)
    • This attempts to acquire a FOR KEY SHARE lock against the row in escalations that is referenced in the INSERT statement.
  6. 💥 DEADLOCK [killed] We’re now in a deadlock!
    • Process 1 is waiting for Process 2 to release its lock on escalation_transitions, while Process 2 is waiting for Process 1 to release its lock on escalations.

Fixing it

💡
It’s worth calling out that this had no impact on our product. It woke us up, but it was all happening in async processes, which got retried (and on the retry ended up not executing at the exact same time). It’s likely when you run into a deadlock, it’ll happen in async processes that start near simultaneously, rather than from user-triggered actions on HTTP endpoints.

In our case, we realised we wanted these two processes to never run concurrently, and so we simply modified Process 2 to start with the same SELECT FOR UPDATE as Process 1. That means any time this happens in the future, Process 2 will effectively wait (through LOCKWAIT) for all of Process 1 to finish before attempting to run any queries.

More generally, however, throwing FOR UPDATEs in all over is a bad idea, as it reduces your ability to concurrently handle work. Instead, if you need to lock against a resource, you should use a FOR NO KEY UPDATE lock. This will allow other processes to acquire a FOR KEY SHARE lock, but nothing else. It means you’re safe from other processes updating the row, but ensures that you can still insert new rows in other tables that reference the row you’re locking on.