Handling null bytes in JSON payloads with postgres

Leo Sjöberg • June 28, 2024

📕
Throughout this post, I'll refer to UTF-8 codepoints, null bytes, and escape sequences. Here's what they mean:

  • A null byte is a byte with the value zero, representing nothingness. If you're inserting a null byte in code, you'll likely write \u0000 or \x00; these are escape sequences.
  • In UTF-8, the null byte is the encoding of codepoint 0. A codepoint is just the numeric value that designates a Unicode character.
  • An escape sequence represents a codepoint using a more limited character set, such as ASCII. This means you can write invisible characters, like a null byte, as \u0000 in your IDE and have that compile to an actual null byte (see the short sketch below).
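
To make that concrete, here's a minimal Go sketch (not from the incident code) of the difference between an escape sequence in source code and the same six characters written as literal text:

package main

import "fmt"

func main() {
    escaped := "\u0000" // escape sequence: the compiler emits one actual null byte
    literal := `\u0000` // raw string: six ordinary characters, no null byte at all

    fmt.Println(len(escaped))               // 1
    fmt.Println(len(literal))               // 6
    fmt.Printf("%q %q\n", escaped, literal) // "\x00" "\\u0000"
}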

As The Chainsmokers' "Don't Let Me" (which is my pager sound) wakes me up at 01:24 AM, I read the alert for INC-5607: parsing JSON into payload: invalid character '|' in string escape code. I spend 25 minutes investigating with little progress until, after escalating to a coworker to confirm, we agree that it can wait until the morning. I share some notes from my initial observations in the incident channel in Slack, and head back to bed for a few hours of sleep.

The next morning, after another hour of digging, I share an incident update:

We're not sure why this is happening yet, but suspect it may be caused by recent changes to our handling of null bytes resulting in invalid JSON.

Having dug into some of the code, we're gonna investigate whether a change last week to strip null-bytes caused this problem.

It's possible we're either doing something wrong in how we strip null bytes (which we do because they can't be inserted into postgres), and end up stripping too much, or that it's somehow possible to end up with "load-bearing" null data (i.e as some weird json delimiter).

I didn't know it yet, but this was the beginning of a deep dive into JSON, UTF-8, and the Go encoding library.

Some background

Before we dig into what happened, let's talk about JSON, postgres, and null bytes.

Firstly, the JSON payload itself. Actual null bytes are not valid in JSON strings according to the spec. However, Unicode escape sequences are! This means you can represent a null byte as \u0000, the literal six-character sequence ['\\', 'u', '0', '0', '0', '0'].

When a JSON payload is processed through a decoder, it will parse the escape sequence and convert it to the actual null byte.
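
As a quick illustration (a minimal sketch, not the production code), Go's encoding/json does exactly this:

package main

import (
    "encoding/json"
    "fmt"
)

func main() {
    var out map[string]string
    // The payload contains the six-character escape sequence \u0000...
    if err := json.Unmarshal([]byte(`{"error": "a\u0000b"}`), &out); err != nil {
        panic(err)
    }
    // ...but the decoded string contains a real null byte.
    fmt.Printf("%q (len %d)\n", out["error"], len(out["error"])) // "a\x00b" (len 3)
}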

So what about postgres? Postgres has two ways to store JSON data: json and jsonb. When you use a jsonb column, postgres parses the JSON at insert-time and decomposes it into a binary format under the hood. When you use a json column, postgres stores the payload as the raw text it received, and only parses it at query-time. This means a payload containing a \u0000 escape will happily go into a json column, and you'll only find out about the problem when you try to query it.

However, postgres has one little caveat: it doesn't support storing null bytes in the TEXT data type!

This means that if you try to insert a JSON payload with a null byte escape sequence into a jsonb column, postgres will parse the escape sequence as a null byte, and then fail to store it. When this happens, you'll get the error unsupported Unicode escape sequence (SQLSTATE 22P05). But that's not the error we were seeing; we were seeing invalid character '|' in string escape code, coming from our Go code 🤔
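
Here's roughly what that 22P05 failure looks like reproduced from Go. This is a hedged sketch: it assumes pgx, a local scratch database, and a hypothetical events table with a jsonb payload column, not our actual schema:

package main

import (
    "context"
    "fmt"

    "github.com/jackc/pgx/v5"
)

func main() {
    ctx := context.Background()
    // Assumes a local database with: CREATE TABLE events (payload jsonb);
    conn, err := pgx.Connect(ctx, "postgres://localhost:5432/scratch")
    if err != nil {
        panic(err)
    }
    defer conn.Close(ctx)

    _, err = conn.Exec(ctx,
        `INSERT INTO events (payload) VALUES ($1::jsonb)`,
        `{"error": "something went \u0000wrong"}`)
    // Expect something like:
    //   ERROR: unsupported Unicode escape sequence (SQLSTATE 22P05)
    fmt.Println(err)
}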

🤓
Why use jsonb instead of json?

jsonb makes querying JSON columns much more efficient when you use the JSON operators and functions, e.g. foo->>'bar', since the data is already stored in a decomposed binary format. When using a json column, postgres has to parse the JSON at query-time.

It also means that with a json column you still end up getting the same error, just later, at query-time.

Discovering the what

This was all coming from us ingesting a webhook, so I took the failing payload from production and attempted to reproduce the failure in my local environment. Lo and behold, I thought I had figured out what went wrong:

We've discovered the cause of this, and verified it is indeed caused by how we sanitise the input data.

As far as I can tell, Sentry did send us valid JSON. We, somehow, then parsed that JSON incorrectly, such that \\u0000 (an escaped null byte, which should be parsed as ['\', 'u', '0', '0', '0', '0']) got parsed as a backslash and a null byte (['\', '\u0000']).

When we then strip null-bytes, we can end up with invalid escape sequences, in this case trying to parse \|.

I'm now going to look into how we ended up parsing this as a null byte in the first place, as I don't think that should've happened.

Okay, so, I wasn't entirely correct here, but I had a local reproduction, and a suspicion. At this point, it's 4 PM the next day. Because of some meeting breaks, total time spent up until this point was around three hours of investigation, and I still hadn't found the cause... and then the penny dropped.

I realise the code is stripping the byte sequence for an escaped null byte

Three hours in, and I've realised that we strip the byte sequence \u0000, which works fine if your payload actually contains a null byte, but not if it's an escaped null byte. For the sequence \\u0000|, it means we strip out the middle bit, and end up with \|, which Go's JSON library then can't parse because it's invalid (and it'd fail inserting into postgres if our code ever got that far). But why did we strip out this byte sequence?

3 days earlier: INC-5579

A few days before my incident, we had another incident caused by null bytes. In that incident, we received actual null bytes (\u0000) in the JSON payload from Sentry, rather than escaped null bytes (\\u0000). So to fix that, we introduced a change to strip out that byte sequence, which worked until we received an escaped null byte!

That code looked like this:

func SanitizeByteArray(byteArray []byte) []byte {
    cleanBytes := bytes.ToValidUTF8(byteArray, []byte{})
    // This byte sequence is the JSON-marshalled, escaped version of a null byte (\u0000, \x00)
    cleanBytes = bytes.ReplaceAll(cleanBytes, []byte{92, 117, 48, 48, 48, 48}, []byte{})
    // And also strip the generic null character just in case
    cleanBytes = bytes.Trim(cleanBytes, "\x00")
    return cleanBytes
}

This takes care of any payload that just contains null bytes, like this:

1{"error": "something went \u0000wrong"}
2// -->
3{"error": "something went wrong"}

This allowed us to close out that incident, but it caused us to create invalid JSON if the input was escaped:

1{"error": "something went \\u0000wrong"}
2// -->
3{"error": "something went \wrong"}

Since we naively stripped bytes, we ended up creating invalid payloads from valid ones (in the example above, \w gets parsed as an escape sequence, which is invalid).
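
Here's a small, self-contained reproduction of that failure mode (re-declaring the SanitizeByteArray from above so it runs on its own), showing the same class of error that paged me:

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
)

func SanitizeByteArray(byteArray []byte) []byte {
    cleanBytes := bytes.ToValidUTF8(byteArray, []byte{})
    cleanBytes = bytes.ReplaceAll(cleanBytes, []byte{92, 117, 48, 48, 48, 48}, []byte{})
    cleanBytes = bytes.Trim(cleanBytes, "\x00")
    return cleanBytes
}

func main() {
    // The incoming JSON contains an *escaped* null byte: \\u0000
    input := []byte(`{"error": "something went \\u0000wrong"}`)

    sanitized := SanitizeByteArray(input)
    fmt.Println(string(sanitized)) // {"error": "something went \wrong"}

    var out map[string]string
    err := json.Unmarshal(sanitized, &out)
    fmt.Println(err) // invalid character 'w' in string escape code
}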

Why it's not enough to just strip the actual null byte

In the code above, you may have noticed the following line:

cleanBytes = bytes.Trim(cleanBytes, "\x00")

If you're like me when I first looked at this, you may think "wait, isn't it enough to just have that line?"; after all, Go strings are UTF-8, so you might expect, as I initially did, that the escape sequence would already have been decoded into a real null byte.

However, in this case we're dealing with raw input data straight off the wire, and Go simply sees the six-character sequence []byte{'\\', 'u', '0', '0', '0', '0'}; nothing decodes the escape until the payload actually goes through a JSON decoder.
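
In other words (a minimal sketch of that point), trimming the actual null byte does nothing to the six-character sequence sitting in the raw payload:

package main

import (
    "bytes"
    "fmt"
)

func main() {
    // Raw bytes as they arrive over the wire: a backslash, 'u', and four zeros.
    wire := []byte(`{"error": "a\u0000b"}`)

    // No byte in here has the value 0, so trimming (or replacing) "\x00" is a no-op.
    trimmed := bytes.Trim(wire, "\x00")
    fmt.Println(bytes.Equal(wire, trimmed))      // true
    fmt.Println(bytes.Contains(wire, []byte{0})) // false
}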

Coming up with options

At this point, we knew what was happening, why the change was introduced, and I started doing some research into what options we'd have. I spent the rest of the day combining research with various experiments as I kept learning.

Option 1: decoding and re-encoding

My first attempt was to try and coerce the input data into something with real null bytes, so I could strip null bytes. Having tried a bunch of different encodings, I slowly came to accept I'd have to do what I really didn't want to do: run the entire input through a JSON decoder, strip null bytes after decoding, and then re-encode it to an output stream.

At this point, I'd spent the entire afternoon talking to myself in a Slack channel

Me, talking to myself over hours, stating the technically correct solution to be running everything through a JSON decoder and encoder

It was at this point that I went down the rabbit hole of encoding.

Pseudocoding what I want

I kept combining research with experiments in Go, and started writing out what I wanted my code to do:

func SanitizeJsonBytes(jsonBytes []byte) []byte {
    dec := json.NewDecoder(bytes.NewReader(jsonBytes))
    enc := json.NewEncoder(bytes.NewBuffer([]byte{}))

    for {
        tok, err := dec.Token()
        if err == io.EOF {
            break
        }

        tok = bytes.Trim([]byte(tok.(string)), "\u0000")
        enc.Write(tok)
    }

    return enc.Out()
}

This would go through the input token-by-token, strip null bytes, and write each token back through an encoder. Now I just had to figure out a way to actually do it.

The problems with encoding/json

As alluded to in my Slack screenshot, decoding and re-encoding a bunch of bytes is painful in encoding/json, but why? Through my hours of combing through the internet, I happened upon a GitHub discussion around introducing a new encoding/json/v2 library, initiated by the fine folks over at Tailscale.

While the aforementioned GitHub discussion goes into much greater depth, in my case it came down to a very simple limitation of encoding/json: while there is a Decoder.Token() method to read a JSON payload token-by-token, there's no Encoder.WriteToken() method to write those tokens back out. The only way to write through an Encoder is Encode, which marshals a complete Go value into JSON in one go.
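
To make the asymmetry concrete, here's a minimal sketch: reading tokens works, but there's nothing on the encoding side to hand them back to.

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "io"
)

func main() {
    dec := json.NewDecoder(bytes.NewReader([]byte(`{"error": "boom"}`)))
    for {
        tok, err := dec.Token() // token-by-token reading exists...
        if err == io.EOF {
            break
        }
        if err != nil {
            panic(err)
        }
        fmt.Printf("%T %v\n", tok, tok)
        // ...but json.Encoder only exposes Encode(v any), which marshals a
        // whole value at once; there's no WriteToken(tok) to mirror this loop.
    }
}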

There was an RFC to introduce WriteToken which was accepted back in 2020, but at the time of writing, it remains unimplemented.

💡
There are other issues with encoding/json, largely around its inability to handle streams of data, which prevents memory-efficient processing of large JSON documents. I highly recommend reading the Tailscale folks' rationale for a v2 library; they provide far more nuance than I ever could!

I kept going until way past working hours, and eventually decided to put it down, and continue the next day.

Option 2: Switch from jsonb to json

The next morning, reading up on how Postgres stores and processes JSON data, I announced

I think the easiest solution is to switch from jsonb to json

A quick discussion followed in Slack, where we realised that, while easy, it's really not ideal. You end up storing the raw text instead of the decomposed JSON, but you still get the same errors at query-time as soon as you run a JSON query, so you're just delaying the problem. Worse yet, you're blind to it until a query actually runs!

A discussion between me and Lawrence, one of our engineers, summarising our two options
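
To illustrate that "delayed error" behaviour, here's a hedged sketch, again assuming pgx and a hypothetical scratch table (events_raw, with a json column) rather than our real schema:

package main

import (
    "context"
    "fmt"

    "github.com/jackc/pgx/v5"
)

func main() {
    ctx := context.Background()
    // Assumes: CREATE TABLE events_raw (payload json);
    conn, err := pgx.Connect(ctx, "postgres://localhost:5432/scratch")
    if err != nil {
        panic(err)
    }
    defer conn.Close(ctx)

    // The insert succeeds: the json type keeps the raw text and doesn't
    // resolve the \u0000 escape at write-time.
    _, err = conn.Exec(ctx,
        `INSERT INTO events_raw (payload) VALUES ($1::json)`,
        `{"error": "something went \u0000wrong"}`)
    fmt.Println(err) // <nil>

    // ...but converting that value to text at query-time blows up with the
    // same SQLSTATE 22P05 we'd have hit at insert-time with jsonb.
    var msg string
    err = conn.QueryRow(ctx, `SELECT payload->>'error' FROM events_raw`).Scan(&msg)
    fmt.Println(err) // ERROR: unsupported Unicode escape sequence (SQLSTATE 22P05)
}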

Option 3: Store the data in blob storage

We had a quick discussion around whether it would make sense to store the payloads in blob storage. After all, simply not storing the data in postgres would solve things! From a technical point of view, this may seem like the best option: use the right tool for the job. However, unless you're already using that tool widely across your stack, you're introducing risk and maintenance cost. How confident are you in your team debugging an error in the database everyone uses every day, versus debugging misconfigured IAM permissions on an S3 or GCS bucket that was set up nine months ago by someone who's no longer at the company? Unless you're already using the "right tool for the job" widely within your team, introducing it for a minor use case is likely more risk than reward.

Picking an option

On the balance of risk and reward, the first option (decoding, sanitising, and re-encoding the JSON) was the most appealing to us. While it means we don't store a true representation of what we received, we'd never end up rendering the raw payload anyway, and it allowed us to keep things within postgres, a tool that every engineer at incident.io is familiar with.

It was now late on Tuesday, my second day of this incident. I had learned a lot about JSON, Postgres, and encoding/json, and felt good about trying out Tailscale's experimental v2 library (which they themselves already use in production). But given the time, I decided to relax, and pick up in the morning.

Implementing a fix

Day 3 of this incident. The only words on my mind were escape sequence, json, encoding, UTF-8, and why is the internet so complicated?!

Despite having invested probably somewhere between 10 and 12 hours in research and experimentation by this point, I started the day optimistic. I knew what I had to do and how to do it: use Tailscale's experimental JSON encoding library to run the payload through a decoder, sanitise the data, and re-encode it, just like my pseudocode.

After only about an hour of work and a bit of debugging, I announced victory

Me, sharing successful output in Slack

The code

The code for my solution ended up remarkably similar to the pseudocode I wrote two days prior, with a bit more error handling:

func SanitizeJSONBytes(jsonBytes []byte) ([]byte, error) {
    dec := jsontext.NewDecoder(bytes.NewReader(jsonBytes))
    buf := bytes.NewBuffer([]byte{})
    enc := jsontext.NewEncoder(buf)

    for {
        tok, err := dec.ReadToken()
        if err == io.EOF {
            break
        }

        if err != nil {
            return nil, err
        }

        // Unfortunately, the `jsontext` library doesn't ship these Kinds
        // as variables, but `"` represents a string (you can see this
        // in the `jsontext` source code).
        if tok.Kind() == jsontext.Kind('"') {
            // Strip null bytes from the string
            tokString := string(SanitizeByteArray([]byte(tok.String())))
            tok = jsontext.String(tokString)
        }

        err = enc.WriteToken(tok)
        if err != nil {
            return nil, err
        }
    }

    // Trim the trailing newline added by the encoder
    return bytes.Trim(buf.Bytes(), "\n"), nil
}

func SanitizeByteArray(byteArray []byte) []byte {
    cleanBytes := bytes.ToValidUTF8(byteArray, []byte{})

    // Strip the actual (decoded) null byte
    cleanBytes = bytes.ReplaceAll(cleanBytes, []byte{'\u0000'}, []byte{})

    return cleanBytes
}

This simply parses the JSON token-by-token and, for any string token, ensures the bytes are valid UTF-8 and then strips out actual null bytes. Since we've decoded the JSON input at this point, the token contains real null bytes (i.e. codepoint 0) rather than the six-byte escape sequence, which makes stripping them out much easier.
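
For example, running the two payloads from earlier through it looks roughly like this (a usage sketch assuming the functions above, a main wrapper with fmt imported, and jsontext's default compact output):

func main() {
    // An escaped-in-JSON null byte is decoded, stripped, and re-encoded away...
    out, _ := SanitizeJSONBytes([]byte(`{"error": "something went \u0000wrong"}`))
    fmt.Println(string(out)) // {"error":"something went wrong"}

    // ...while a doubly-escaped \\u0000 (the literal text "\u0000") is just a
    // backslash plus ordinary characters, so it survives untouched.
    out, _ = SanitizeJSONBytes([]byte(`{"error": "something went \\u0000wrong"}`))
    fmt.Println(string(out)) // {"error":"something went \\u0000wrong"}
}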

You may also notice we've updated SanitizeByteArray to no longer strip the escape sequence, and have switched from bytes.Trim to bytes.ReplaceAll, as we previously accidentally only removed null bytes from the start and end of tokens (a lesson I learned the hard way by being paged at 4am).

Where we're at now

In implementing this, I ran into a couple bugs I had introduced and not caught in tests, all of which are now fixed. At this point in time, I feel extremely confident in how we process data coming into our systems from third parties, and hope that it will be many months (years?) before I have to think about sanitising null bytes again.

What I learned

This incident ran across 4 days in the end. While I wasn't working full-time on it, I spent a lot of time learning about working with JSON, Postgres, and Go's encoding libraries.