Using GIN indices for efficient queries on jsonb and array columns
Leo Sjöberg • July 16, 2024
PostgreSQL offers a rich variety of data types for your database columns, from the standard TEXT, INT, and TIMESTAMPTZ to more complex types like JSON, JSONB, and arrays.
However, querying large datasets with JSON columns or arrays can be slow, often requiring a full table scan rather than leveraging an index.
While the most common index type is a btree (short for balanced tree) index, those aren’t well suited for multi-value types like JSON and arrays. A btree index stores the entire value in the column, making it useful only for complete or “full-document” comparisons.
For instance, consider a database of users where each user can track their nicknames using a string array column (text[]). With a btree index, you can only perform complete comparisons:
select * from users where nicknames = '{"phroggyy"}'
This type of index is ineffective for queries like:
select * from users where nicknames @> '{phroggyy}'
If you’re unfamiliar with array types, the above query searches for users whose nicknames array contains the element phroggyy.
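If the containment operator is new to you, these two standalone expressions illustrate how @> behaves for arrays (a quick sketch you can run directly in psql):

-- true: every element of the right-hand array exists in the left-hand array
select array['leo', 'oel'] @> array['leo'];

-- false: 'phroggyy' is not an element of the left-hand array
select array['leo', 'oel'] @> array['phroggyy'];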
In the second query above, Postgres wouldn’t be able to use the index as we’re looking for a partial match (and remember, btree only stores whole values), instead falling back to a full table scan.
You can see this for yourself by creating a database with the relevant index:
create table users (
    id text primary key,
    name text,
    nicknames text[]
);
create index idx_user_nicknames on users using btree(nicknames);
After which we can run explain analyze on the two queries:
explain analyze
    select * from users where nicknames = '{"phroggyy"}';

explain analyze
    select * from users where nicknames @> '{phroggyy}';
The first query gives us what we’d expect, an index scan:
                                                         QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_user_nicknames on users  (cost=0.15..8.17 rows=1 width=96) (actual time=0.003..0.003 rows=0 loops=1)
   Index Cond: (nicknames = '{phroggyy}'::text[])
 Planning Time: 0.513 ms
 Execution Time: 0.020 ms
However, for the second one, we get a far less satisfying query plan:
                                            QUERY PLAN
-------------------------------------------------------------------------------------------------
 Seq Scan on users  (cost=0.00..24.62 rows=3 width=96) (actual time=0.004..0.004 rows=0 loops=1)
   Filter: (nicknames @> '{phroggyy}'::text[])
 Planning Time: 0.078 ms
 Execution Time: 0.019 ms
That’s a Seq Scan, or sequential scan, meaning we’d be running a full table scan in order to execute the query.
What does this mean in real terms? Well, if we load our table with just 10 000 users, each with two nicknames (the user’s name as a slug, and that slug reversed, e.g. leo and oel), and then run these two queries, we can see the impact of the index.
pip3 install faker "psycopg[binary]"
python3 - <<EOF
import psycopg
from faker import Faker

fake = Faker()

with psycopg.connect("dbname=sandbox user=postgres password=postgres") as conn:
    with conn.cursor() as cur:
        for i in range(10 * 1000):
            name = fake.name()
            slug = name.lower().replace(" ", "-")
            cur.execute("insert into users (id, name, nicknames) values (%s, %s, %s)", (i, name, [slug, slug[::-1]]))
EOF
This code will generate 10 000 users.
If I now run the two queries above, we can see the query time difference. For my test, I’ll query for the user Alexandra Owens that was generated in my database:
sandbox=# \timing on
Timing is on.
sandbox=# select * from users where nicknames = '{alexandra-owens,snewo-ardnaxela}';
 id |      name       |             nicknames
----+-----------------+-----------------------------------
 1  | Alexandra Owens | {alexandra-owens,snewo-ardnaxela}
(1 row)

Time: 0.442 ms
sandbox=# select * from users where nicknames @> '{alexandra-owens}';
 id |      name       |             nicknames
----+-----------------+-----------------------------------
 1  | Alexandra Owens | {alexandra-owens,snewo-ardnaxela}
(1 row)

Time: 4.306 ms
Even with just 10 000 rows in my database, the query time difference is nearly 10x (0.44 ms vs 4.31 ms). As the number of rows increases, this difference will also increase.
Indexing array and JSONB columns
Now, Postgres came up with a solution to this a long time ago: the GIN index (short for Generalized Inverted Index).
A GIN index takes all the individual values of an array, or all keys and values of a jsonb column, and stores them separately in the index. So if our user looks like this:
+----+------+----------------+
| id | name | nicknames      |
+----+------+----------------+
| 1  | Leo  | {'leo', 'oel'} |
+----+------+----------------+
Then a GIN index on nicknames will look like this:
+-------+--------+
| value | row_id |
+-------+--------+
| leo   | 1      |
| oel   | 1      |
+-------+--------+
In other words, Postgres will break down an array or JSON object into its parts, and create an index that maps the individual values inside it to the row.
By using this index, we can query on array elements, or on the values of specific keys in a JSON document. If we create a GIN index and then plan our original query:
create index idx_user_nicknames_gin on users using gin(nicknames);
explain analyze
    select * from users where nicknames @> '{phroggyy}';
We end up with a much better query plan!
                                                            QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on users  (cost=12.83..16.85 rows=1 width=76) (actual time=0.041..0.042 rows=0 loops=1)
   Recheck Cond: (nicknames @> '{phroggyy}'::text[])
   ->  Bitmap Index Scan on idx_user_nicknames_gin  (cost=0.00..12.83 rows=1 width=0) (actual time=0.037..0.037 rows=0 loops=1)
         Index Cond: (nicknames @> '{phroggyy}'::text[])
 Planning Time: 0.698 ms
 Execution Time: 0.092 ms
That’s now doing a bitmap index scan on the GIN index, instead of the full table scan from before.
And indeed, if we run the query, we can see it’s now significantly faster:
Before: 4.306 ms
After: 0.759 ms
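The same approach applies to jsonb columns. As a rough sketch (the profile column here is hypothetical and not part of the schema above), a GIN index over a jsonb column lets containment queries use the index in the same way:

-- hypothetical jsonb column for illustration only
alter table users add column profile jsonb;
create index idx_user_profile_gin on users using gin(profile);

-- containment queries like this can be served by the GIN index
select * from users where profile @> '{"country": "SE"}';

As with arrays, only operators supported by the index’s operator class benefit: for the default jsonb GIN operator class that includes containment (@>) and key existence (?, ?|, ?&), while a plain equality check on profile->>'country' would still fall back to a sequential scan unless you index that expression separately.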
GIN indices are an extremely powerful feature in Postgres, and they make it sustainable to keep practically all of your product’s data in Postgres. That means you don’t have to reach for a document database like MongoDB just to store some of your data as JSON while the majority remains relational. In real terms, it means less maintenance, and an engineering team that can master a single tool rather than just get by with a handful.