Your relational database is good for it, really
Leo Sjöberg • July 26, 2024
I saw a post on X the other week about using Postgres for everything, and it immediately made me go “YES, THAT”. So often, we engineers like to overcomplicate our setups. We get bored with the tools we have and want to learn something new.
Up to a significant scale, you can use a single Postgres database for almost everything in your application.
— Shayan (@ImSh4yy) July 5, 2024
Queue system: Use SKIP LOCKED.
Document store: Use JSONB.
Full-text search: Use built-in tsvector and tsquery.
Vector search: Use pgvector extension.
Geospatial queries:…
We justify the use of some shiny new technology under the guise of performance, scalability, or the tool’s ability to solve some unique problem (we need Meilisearch for full-text search and InfluxDB for time series, our relational database just isn’t built to handle it!). And we’re not lying - those tools really are better at their specific tasks. But for most organisations, we should be honest that the tool we already have (which should be Postgres) is perfectly capable.
While it’s exciting to use a new technology, leave that to internal hackathons or the weekends, or in rare cases, for when your organisation actually needs it.
All you need in postgres
Below is a collection of things you can do with postgres, which you may have considered other tools for. For each of them, I’ve included a very brief example of how you’d implement it, with some pseudocode.
Queues
You can easily implement a queue using SELECT FOR UPDATE SKIP LOCKED. This will select one item off the queue and let you process it. Better yet, since it’s just a database, you don’t need a generic representation for queue jobs; you can process real things, like a dedicated scheduled_messages table.
For example (in pseudocode):
db.transaction():
    job = db.select('*')
        .from('queue_jobs')
        .clauses(clause.locking(
            strength: 'update',
            options: 'skip locked',
        ))
        .where('processed_at', null)
        .limit(1)
        .get()

    processJob(job)

    db.update('queue_jobs')
        .set('processed_at', now())
        .where('id', job.id)
        .run()

db.commit()
You can parallelise this with however many workers you need, as SKIP LOCKED will select the next available (unlocked) row from the table.
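In plain SQL, the locking query itself looks roughly like this - a sketch, assuming the queue_jobs table with a processed_at column from the pseudocode above:

BEGIN;

-- Claim the next unprocessed job; other workers skip rows we hold locked
SELECT *
FROM queue_jobs
WHERE processed_at IS NULL
ORDER BY id
LIMIT 1
FOR UPDATE SKIP LOCKED;

-- ...process the job in application code...

-- Mark it as done ($1 is the id of the row selected above)
UPDATE queue_jobs SET processed_at = now() WHERE id = $1;

COMMIT;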
Caching
You can cache expensive queries using Postgres’ materialised views. A materialised view is stored much like a regular table, but its contents are derived from a query. It isn’t updated automatically when the source data changes - you refresh it explicitly, for example on a schedule - but a refresh usually only takes seconds, which is good enough for all but the most demanding cache purposes.
Using a materialised view can be operationally simpler than maintaining a separate table that you keep in sync with the source data yourself, while still letting you avoid running a separate in-memory database like Redis or Memcached.
Mind you, caching using Postgres won’t be as quick as an in-memory database, so if you’re building for a use case where you need to handle tens of thousands of requests per second, you may well want to reach for an in-memory cache. But for most other purposes (up to hundreds of requests per second), Postgres will get you very far.
A materialised view is created just like a table, but derived from a query:
CREATE MATERIALIZED VIEW users_with_contact_details AS
    SELECT users.id, users.name, users.email, user_contact_methods.phone_number AS phone_number
    FROM users
    LEFT JOIN user_contact_methods ON user_contact_methods.user_id = users.id;

CREATE UNIQUE INDEX unique_user_phone_number_idx ON
    users_with_contact_details (id, phone_number);
Postgres stores this data just like any other table, allowing you to create efficient indices on derived data.
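Refreshing the view is a single command, and with the unique index above in place it can even run without blocking readers:

-- Rebuild the view without locking out readers
-- (CONCURRENTLY requires a unique index on the view, like the one above)
REFRESH MATERIALIZED VIEW CONCURRENTLY users_with_contact_details;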
Full-text search
Postgres supports full-text search by letting you build search vectors (and, as we’ll see below, trigrams) from large text documents, and then index them using a GIN index (just like dedicated search databases do!). The official documentation for it is excellent, and the setup looks a little like this:
-- Add a column
ALTER TABLE blog_posts ADD COLUMN content_search tsvector;
-- Populate our new column
UPDATE blog_posts SET content_search = to_tsvector('english', content);
-- Create an index to make the queries faster
CREATE INDEX blog_posts_content_search_idx ON blog_posts USING gin(content_search);
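One thing this snippet glosses over is keeping content_search up to date as rows are inserted or edited. You can do that with a trigger, or, on Postgres 12 and later, sidestep it entirely with a generated column - a sketch of the alternative, not part of the setup above:

-- Alternative to the manual ALTER + UPDATE: let Postgres maintain the column
ALTER TABLE blog_posts
    ADD COLUMN content_search tsvector
    GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;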
With this setup, you can then perform full-text searching with relevance ranking by combining the @@ operator with the to_tsquery and ts_rank functions:
SELECT id, title, content, ts_rank(content_search, to_tsquery('english', 'search terms')) AS rank
FROM blog_posts
WHERE content_search @@ to_tsquery('english', 'search terms')
ORDER BY rank DESC;
For text searches that don’t revolve around complete words, Postgres also supports trigram operations and indices through the pg_trgm extension. These let you compare strings in a way that disregards word boundaries.
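pg_trgm ships with Postgres as a contrib module, but it has to be enabled in your database before you can use it:

CREATE EXTENSION IF NOT EXISTS pg_trgm;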
You can use the % operator to get a boolean value for whether two terms are similar (based on the similarity threshold set by pg_trgm.similarity_threshold), or use <-> to get the distance between two terms, where a lower number means the terms are more similar.
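If you want % to be more or less strict, that threshold is an ordinary setting you can tweak (it defaults to 0.3):

-- Require a higher similarity before % returns true
SET pg_trgm.similarity_threshold = 0.5;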
So to sort some search results by relevance, you can run the following:
SELECT id, title, content
FROM blog_posts
WHERE 'search query' % title -- where there is *some* similarity
ORDER BY 'search query' <-> title; -- ordered by distance
You can make this query more performant by adding a GiST trigram index:
CREATE INDEX blog_posts_title_trgm_gist_idx ON blog_posts
    USING gist(title gist_trgm_ops(siglen=128));
Also note the siglen argument here:
gist_trgm_ops GiST opclass approximates a set of trigrams as a bitmap signature. Its optional integer parameter siglen determines the signature length in bytes. The default length is 12 bytes. Valid values of signature length are between 1 and 2024 bytes. Longer signatures lead to a more precise search (scanning a smaller fraction of the index and fewer heap pages), at the cost of a larger index.
So Postgres approximates the set of trigrams, and a higher siglen value results in a more accurate representation of them. How high your siglen needs to be depends on your content; generally higher is better, but it takes more space (and thus also results in slower writes).
With this, you can perform fuzzy text search, all in Postgres. There are plenty of optimisations you can make once you start using it in a real situation, too.
Geospatial queries
Geospatial queries aren’t natively supported by Postgres, but you can easily install the PostGIS extension on your Postgres server, which lets you perform geospatial queries like calculating the distance between two coordinate points:
SELECT ST_Distance(start_point, end_point) FROM locations;
Or checking if a point lies within a given arbitrary area (represented by a polygon):
SELECT * FROM locations WHERE ST_Intersects(coordinates, 'POLYGON(...)');
Just make sure you index your column with a GiST index:
CREATE INDEX point_idx ON locations USING gist(coordinates);
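For completeness, enabling PostGIS and defining a column to hold the points might look like this (the table and column names are just illustrative):

-- Enable the extension once per database
CREATE EXTENSION IF NOT EXISTS postgis;

-- A geography column holding WGS 84 (lat/long) points
CREATE TABLE locations (
    id bigserial PRIMARY KEY,
    coordinates geography(Point, 4326)
);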
PostGIS offers plenty more functionality, and the internet has lots of good guides for it. I don’t use it much myself - most applications rarely need to run geospatial queries at a scale where you need to do it in the database rather than just selecting all the data and using a library in your language of choice.
But what if we have to migrate later?
What if your traffic increases tenfold (after all, you do have that deal with Uber in the pipeline)? Or you have to build a feature next year that requires much more complicated query support to performantly retrieve the data you need?
Congratulations, you just found yourself (or an excited coworker!) a really exciting purely technical project to do in the future - the migration.
The truth is that migrating to a new database is a bit painful, but it’s nowhere near impossible. It requires some planning, and probably a few weeks of work, but if it lets you avoid the complexity of maintaining another database (and every engineer having to learn a new query language) in the meantime, that’s a worthwhile tradeoff. Especially when that day in the future might never come.