Using GIN indices for efficient queries on jsonb and array columns
Leo Sjöberg • July 16, 2024
PostgreSQL offers a rich variety of data types for your database columns, from the standard TEXT, INT, and TIMESTAMPTZ to more complex types like JSON, JSONB, and arrays.
However, querying large datasets with JSON columns or arrays can be slow, often requiring a full table scan rather than leveraging an index.
While the most common index type is a btree (short for balanced tree) index, those aren’t well suited for multi-value types like JSON and arrays. A btree index stores the entire value in the column, making it useful only for complete or “full-document” comparisons.
For instance, consider a database of users where each user can track their nicknames using a string array column (text[]). With a btree index, you can only perform complete comparisons:
select * from users where nicknames = '{"phroggyy"}'
This type of index is ineffective for queries like:
select * from users where nicknames @> '{phroggyy}'
If you’re unfamiliar with array types, the above query searches for users whose nicknames array contains the element phroggyy.
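If the containment operator is new to you, these two standalone expressions illustrate how @> behaves for arrays (a quick sketch you can run directly in psql):

-- true: every element of the right-hand array exists in the left-hand array
select array['leo', 'oel'] @> array['leo'];

-- false: 'phroggyy' is not an element of the left-hand array
select array['leo', 'oel'] @> array['phroggyy'];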
In the second query above, Postgres wouldn’t be able to use the index as we’re looking for a partial match (and remember, btree only stores whole values), instead falling back to a full table scan.
You can see this for yourself by creating a database with the relevant index:
create table users (
    id text primary key,
    name text,
    nicknames text[]
);
create index idx_user_nicknames on users using btree(nicknames);
After which we can run explain analyze on the two queries:
explain analyze
    select * from users where nicknames = '{"phroggyy"}';

explain analyze
    select * from users where nicknames @> '{phroggyy}';
The first query gives us what we’d expect, an index scan:
                                                         QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_user_nicknames on users  (cost=0.15..8.17 rows=1 width=96) (actual time=0.003..0.003 rows=0 loops=1)
   Index Cond: (nicknames = '{phroggyy}'::text[])
 Planning Time: 0.513 ms
 Execution Time: 0.020 ms
However, for the second one, we get a far less satisfying query plan:
                                            QUERY PLAN
-------------------------------------------------------------------------------------------------
 Seq Scan on users  (cost=0.00..24.62 rows=3 width=96) (actual time=0.004..0.004 rows=0 loops=1)
   Filter: (nicknames @> '{phroggyy}'::text[])
 Planning Time: 0.078 ms
 Execution Time: 0.019 ms
That’s a Seq Scan, or sequential scan, meaning we’d be running a full table scan in order to execute the query.
What does this mean in real terms? Well, if we load our table with just 10 000 users, each with two nicknames (the user’s name as a slug, and that slug reversed, e.g. leo and oel), and then run these two queries, we can see the impact of the index.
pip3 install faker "psycopg[binary]"
python3 - <<EOF
import psycopg
from faker import Faker

fake = Faker()

with psycopg.connect("dbname=sandbox user=postgres password=postgres") as conn:
    with conn.cursor() as cur:
        for i in range(10 * 1000):
            name = fake.name()
            slug = name.lower().replace(" ", "-")
            cur.execute("insert into users (id, name, nicknames) values (%s, %s, %s)", (i, name, [slug, slug[::-1]]))
EOF
This code will generate 10 000 users.
If I now run the two queries above, we can see the query time difference. For my test, I’ll query for the user Alexandra Owens that was generated in my database:
sandbox=# \timing on
Timing is on.
sandbox=# select * from users where nicknames = '{alexandra-owens,snewo-ardnaxela}';
 id |      name       |             nicknames
----+-----------------+-----------------------------------
 1  | Alexandra Owens | {alexandra-owens,snewo-ardnaxela}
(1 row)

Time: 0.442 ms
sandbox=# select * from users where nicknames @> '{alexandra-owens}';
 id |      name       |             nicknames
----+-----------------+-----------------------------------
 1  | Alexandra Owens | {alexandra-owens,snewo-ardnaxela}
(1 row)

Time: 4.306 ms
Even with just 10 000 rows in my database, the query time difference is nearly 10x (0.44 ms vs 4.31 ms). As the number of rows increases, this difference will also increase.
Indexing array and JSONB columns
Now, Postgres came up with a solution to this a long time ago: the GIN index (short for Generalized Inverted Index).
A GIN index takes all the individual values of an array, or all keys and values of a jsonb column, and stores them separately in the index. So if our user looks like this:
+----+------+----------------+
| id | name | nicknames      |
+----+------+----------------+
| 1  | Leo  | {'leo', 'oel'} |
+----+------+----------------+
Then a GIN index on nicknames will look like this:
+-------+--------+
| value | row_id |
+-------+--------+
| leo   | 1      |
| oel   | 1      |
+-------+--------+
In other words, Postgres will break down an array or JSON object into its parts, and create an index that maps the individual values inside it to the row.
By using this index, we can query on array elements, or on the values of specific keys in a JSON document. If we create a GIN index and then plan our original query:
create index idx_user_nicknames_gin on users using gin(nicknames);
explain analyze
    select * from users where nicknames @> '{phroggyy}';
We end up with a much better query plan!
                                                            QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on users  (cost=12.83..16.85 rows=1 width=76) (actual time=0.041..0.042 rows=0 loops=1)
   Recheck Cond: (nicknames @> '{phroggyy}'::text[])
   ->  Bitmap Index Scan on idx_user_nicknames_gin  (cost=0.00..12.83 rows=1 width=0) (actual time=0.037..0.037 rows=0 loops=1)
         Index Cond: (nicknames @> '{phroggyy}'::text[])
 Planning Time: 0.698 ms
 Execution Time: 0.092 ms
That’s now doing a bitmap index scan on the GIN index, instead of the full table scan from before.
And indeed, if we run the query, we can see it’s now significantly faster:
Before: 4.306 ms
After: 0.759 ms
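The same approach applies to jsonb columns. As a rough sketch (the profile column here is hypothetical and not part of the schema above), a GIN index over a jsonb column lets containment queries use the index in the same way:

-- hypothetical jsonb column for illustration only
alter table users add column profile jsonb;
create index idx_user_profile_gin on users using gin(profile);

-- containment queries like this can be served by the GIN index
select * from users where profile @> '{"country": "SE"}';

As with arrays, only operators supported by the index’s operator class benefit: for the default jsonb GIN operator class that includes containment (@>) and key existence (?, ?|, ?&), while a plain equality check on profile->>'country' would still fall back to a sequential scan unless you index that expression separately.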
GIN indices are an extremely powerful feature in Postgres, and they make it sustainable to keep practically all of your product’s data in Postgres. That means you don’t have to reach for a document database like MongoDB just to store some of your data as JSON while the majority remains relational. In real terms, it means less maintenance, and an engineering team that can master a single tool rather than just get by with a handful.