Shrinking Redis cache with msgp and zstd in Golang.

If you are storing structured data in Redis using encoding/json, you might be surprised how much memory you are wasting. JSON is readable, easy to debug, and universally supported. It's also bloated. Field names repeat on every single record, numbers get stored as strings, and boolean values take 4-5 bytes instead of 1.

In my previous article High-performance Golang struct optimizations: Paddings and Alignments, I showed how reordering struct fields can save 25% of RAM. This time, we are going after the other side of the equation: how the data is serialized before it hits Redis.

I will compare four approaches: plain JSON, MessagePack via tinylib/msgp, JSON compressed with zstd via klauspost/compress, and msgp + zstd combined. We care mostly about stored size here. CPU cost matters too, but for cache serialization, the bottleneck is almost always memory. The results weren't what I expected.

The struct.

I will reuse the struct from my previous article, with msg tags added for msgp code generation:

//go:generate go tool msgp

type NestedLayout struct {
    ID    int64 `msg:"id"`
    Phone int64 `msg:"phone"`
    Age   int32 `msg:"age"`
}

type Layout struct {
    BalanceInCents int64        `msg:"balance_in_cents"`
    IdempotencyKey int64        `msg:"idempotency_key"`
    Key            float64      `msg:"key"`
    User           NestedLayout `msg:"user"`
    AreaID         int32        `msg:"area_id"`
    CreatedAt      int32        `msg:"created_at"`
    UpdatedAt      int32        `msg:"updated_at"`
    ID             uint32       `msg:"id"`
    Status         uint16       `msg:"status"`
    IsActive       bool         `msg:"is_active"`
    IsSpecial      bool         `msg:"is_special"`
    IsMigrated     bool         `msg:"is_migrated"`
    TenantID       int8         `msg:"tenant_id"`
}

// Layouts is a named slice type.
// msgp cannot generate methods on anonymous slices like []*Layout,
// so we need this named type for codegen to work.
type Layouts []*Layout

One thing to note here: msgp requires a named slice type to generate MarshalMsg and UnmarshalMsg for collections. You can't just pass []*Layout to msgp's codegen. The named type Layouts solves this and lets you marshal the whole array in one call.

After adding these tags, install msgp as a tool dependency and run code generation:

go get -tool github.com/tinylib/msgp@latest
go generate ./...

This produces *_gen.go files with MarshalMsg, UnmarshalMsg, and Msgsize() methods for both types. No reflection, no runtime overhead for field lookup.

Why msgp and not protobuf or other codecs.

I know what you are thinking. Why not protobuf? I wrote about the pitfalls of using Protobuf for Kafka before, and some of those concerns apply to caching too: schema management, versioning, and the requirement to maintain .proto files separately from your Go structs. For a cache layer, I want something that works directly with existing Go types.

msgp generates code from Go struct tags. No separate schema files, no extra compilation step beyond go generate. The generated code is fast because it produces direct binary encoding with no reflection, similar to how protobuf works at runtime but without the schema overhead.

Other options like gob are Go-specific and not particularly compact. encoding/binary needs manual marshaling. msgp sits right where I want it: generates fast code from existing structs, no schema files required.

The encoder and decoder.

The zstd encoder and decoder are safe for concurrent use but expensive to create, so initialize them once at the package level:

import "github.com/klauspost/compress/zstd"

var (
    zstdEncoder, _ = zstd.NewWriter(nil, zstd.WithEncoderLevel(zstd.SpeedFastest))
    zstdDecoder, _ = zstd.NewReader(nil)
)

I use zstd.SpeedFastest here because we are optimizing for cache throughput. The compression ratio difference between fastest and default level is small for structured data like this, but the CPU savings are noticeable at high request rates.

Now here are the four serialization functions we will benchmark:

// 1. Plain JSON
func encodeJSON(data Layouts) ([]byte, error) {
    return json.Marshal(data)
}

// 2. msgp only
func encodeMsgp(data *Layouts) ([]byte, error) {
    return data.MarshalMsg(nil)
}

// 3. JSON + zstd
func encodeJSONZstd(data Layouts) ([]byte, error) {
    jsonBytes, err := json.Marshal(data)
    if err != nil {
        return nil, err
    }
    return zstdEncoder.EncodeAll(jsonBytes, nil), nil
}

// 4. msgp + zstd
func encodeMsgpZstd(data *Layouts) ([]byte, error) {
    msgpBytes, err := data.MarshalMsg(nil)
    if err != nil {
        return nil, err
    }
    return zstdEncoder.EncodeAll(msgpBytes, nil), nil
}

And decoding:

func decodeJSON(b []byte) (Layouts, error) {
    var result Layouts
    return result, json.Unmarshal(b, &result)
}

func decodeMsgp(b []byte) (Layouts, error) {
    var result Layouts
    _, err := result.UnmarshalMsg(b)
    return result, err
}

func decodeJSONZstd(b []byte) (Layouts, error) {
    decompressed, err := zstdDecoder.DecodeAll(b, nil)
    if err != nil {
        return nil, err
    }
    var result Layouts
    return result, json.Unmarshal(decompressed, &result)
}

func decodeMsgpZstd(b []byte) (Layouts, error) {
    decompressed, err := zstdDecoder.DecodeAll(b, nil)
    if err != nil {
        return nil, err
    }
    var result Layouts
    _, err = result.UnmarshalMsg(decompressed)
    return result, err
}

Benchmarks.

Let's create a test dataset of 1000 records with realistic data and measure the output size and encoding speed for each approach:

func generateTestData(n int) Layouts {
    data := make(Layouts, n)
    for i := range n {
        data[i] = &Layout{
            BalanceInCents: int64(i*100 + 42),
            IdempotencyKey: int64(1000000 + i),
            Key:            float64(i) * 1.337,
            User: NestedLayout{
                ID:    int64(i + 1),
                Phone: 15551234567,
                Age:   25 + int32(i%40),
            },
            AreaID:     int32(i % 50),
            CreatedAt:  1700000000 + int32(i),
            UpdatedAt:  1700000000 + int32(i) + 3600,
            ID:         uint32(i + 1),
            Status:     uint16(i % 5),
            IsActive:   i%2 == 0,
            IsSpecial:  i%7 == 0,
            IsMigrated: i%3 == 0,
            TenantID:   int8(i % 10),
        }
    }
    return data
}

func BenchmarkEncode(b *testing.B) {
    data := generateTestData(1000)

    b.Run("JSON", func(b *testing.B) {
        for b.Loop() {
            _, _ = encodeJSON(data)
        }
    })

    b.Run("Msgp", func(b *testing.B) {
        for b.Loop() {
            _, _ = encodeMsgp(&data)
        }
    })

    b.Run("JSON+Zstd", func(b *testing.B) {
        for b.Loop() {
            _, _ = encodeJSONZstd(data)
        }
    })

    b.Run("Msgp+Zstd", func(b *testing.B) {
        for b.Loop() {
            _, _ = encodeMsgpZstd(&data)
        }
    })
}

And a separate test to print the actual byte sizes, which is really the number we care about:

func TestOutputSize(t *testing.T) {
    data := generateTestData(1000)

    jsonBytes, _ := encodeJSON(data)
    msgpBytes, _ := encodeMsgp(&data)
    jsonZstdBytes, _ := encodeJSONZstd(data)
    msgpZstdBytes, _ := encodeMsgpZstd(&data)

    t.Logf("JSON:       %d bytes", len(jsonBytes))
    t.Logf("Msgp:       %d bytes", len(msgpBytes))
    t.Logf("JSON+Zstd:  %d bytes", len(jsonZstdBytes))
    t.Logf("Msgp+Zstd:  %d bytes", len(msgpZstdBytes))
}

Here are the results from my machine (Go 1.25, Apple M4 Pro). Sizes first, since that is the whole point:

=== RUN   TestOutputSize
    t_test.go:  JSON:       256243 bytes
    t_test.go:  Msgp:       189709 bytes
    t_test.go:  JSON+Zstd:  22914 bytes
    t_test.go:  Msgp+Zstd:  27517 bytes
--- PASS: TestOutputSize (0.01s)

JSON produces ~256 KB for 1000 records. msgp alone drops that to ~190 KB, a 26% reduction. And here is where it gets interesting: JSON+zstd compresses down to ~23 KB, but msgp+zstd lands at ~27.5 KB. The msgp+zstd combination is larger than JSON+zstd. That wasn't what I expected.

Why msgp+zstd is bigger than JSON+zstd.

This surprised me at first, but it makes sense once you think about how zstd works. zstd is a dictionary-based compressor. It finds repeated byte sequences and replaces them with short back-references. JSON is full of exactly that kind of redundancy: the field names "balance_in_cents":, "idempotency_key":, "is_active": repeat verbatim for every record in the array. With 1000 records, the string "balance_in_cents" appears 1000 times. zstd sees that pattern and after the first occurrence, each repetition costs almost nothing.

msgp has already eliminated that redundancy. Field names are encoded as short binary keys, so there are no long repeated strings left for zstd to find. The msgp output is already compact, which paradoxically means zstd has less to work with. You end up with zstd's frame overhead on top of data that doesn't compress well.

So the tradeoff is not as simple as "stack both optimizations for maximum savings." If your goal is strictly the smallest possible cache footprint and you can afford the CPU cost, JSON+zstd actually wins on size. But that is not the full picture.

The speed side of things.

Here are the encoding benchmarks:

BenchmarkEncode/JSON-14          4254       274337 ns/op     262699 B/op     2 allocs/op
BenchmarkEncode/Msgp-14         37448        31810 ns/op     221185 B/op     1 allocs/op
BenchmarkEncode/JSON+Zstd-14     1960       557230 ns/op     591831 B/op     3 allocs/op
BenchmarkEncode/Msgp+Zstd-14     3906       300849 ns/op     417836 B/op     2 allocs/op

msgp encoding is 8.6x faster than JSON. But JSON+zstd, the size winner, is also the slowest option at 557 us/op, more than twice as slow as plain JSON. You pay for JSON's reflection-based marshaling first, then zstd's compression pass on top of a much larger input buffer.

msgp+zstd at 300 us/op is almost 2x faster than JSON+zstd while producing a comparable cache size (27.5 KB vs 22.9 KB). For a cache layer handling thousands of requests per second, that speed difference matters more than 4.6 KB.

And msgp alone at 31 us/op is almost 9x faster than plain JSON. If your cache data fits comfortably in Redis at 190 KB per user instead of 256 KB, that might be all you need, and you avoid the zstd dependency entirely.

The auto-generated msgp benchmarks confirm the per-struct performance as well:

BenchmarkMarshalMsgLayout-14     32107621     36.08 ns/op    224 B/op    1 allocs/op
BenchmarkAppendMsgLayout-14      76366846     15.95 ns/op      0 B/op    0 allocs/op
BenchmarkUnmarshalLayout-14      18507248     64.46 ns/op      0 B/op    0 allocs/op

36 nanoseconds to marshal a single struct with 13 fields and a nested sub-struct. Zero allocations on the append path. Generated code with no reflection does well here.

So which one should you pick?

It depends on what you are constrained by. Here is how I think about it:

If Redis memory is the bottleneck and you need the absolute smallest stored size, use JSON+zstd. ~23 KB per 1000 records, 91% reduction from plain JSON. You pay for it with 557 us per encode operation, which might be fine if your write rate is low.

If CPU and latency matter and you want fast serialization with a decent size reduction, use msgp alone. ~190 KB per 1000 records, 26% reduction, but 8.6x faster encoding with a single allocation. No compression dependency.

If you want a balance between size and speed, use msgp+zstd. ~27.5 KB per 1000 records, 89% reduction from plain JSON (close to JSON+zstd's 91%) at nearly 2x the encoding speed. This is probably the right default for most applications that need to optimize cache size without introducing a CPU bottleneck.

What does this mean in real numbers? If you are caching data for 100,000 active users with 1000 records each, JSON will eat ~25.6 GB of Redis memory. JSON+zstd brings that to ~2.3 GB. Msgp+zstd lands at ~2.8 GB, but encodes twice as fast. Either way, that is the difference between needing a large Redis cluster and getting by with a single instance.

A note on the zstd singleton.

The first thing the klauspost/compress/zstd documentation warns about is creating a new zstd.NewWriter or zstd.NewReader per request. Both are designed to be reused and are safe for concurrent access. Creating them is expensive because zstd initializes internal lookup tables and allocates buffers on construction. If you put zstd.NewWriter(nil) inside your request handler, you will burn CPU on initialization that has nothing to do with your actual data.

// NOT this per request
func handleRequest(data []byte) []byte {
    encoder, _ := zstd.NewWriter(nil) // expensive, don't do this
    defer encoder.Close()
    return encoder.EncodeAll(data, nil)
}

A cleaner approach is to wrap both into a struct that you initialize once and inject where needed:

type ZstdCompressor struct {
    encoder *zstd.Encoder
    decoder *zstd.Decoder
}

func NewZstdCompressor() (*ZstdCompressor, error) {
    encoder, err := zstd.NewWriter(nil, zstd.WithEncoderLevel(zstd.SpeedFastest))
    if err != nil {
        return nil, fmt.Errorf("zstd: failed to initialize zstd encoder: %w", err)
    }

    decoder, err := zstd.NewReader(nil)
    if err != nil {
        _ = encoder.Close()
        return nil, fmt.Errorf("zstd: failed to initialize zstd decoder: %w", err)
    }

    return &ZstdCompressor{
        encoder: encoder,
        decoder: decoder,
    }, nil
}

func (z *ZstdCompressor) Compress(src []byte) []byte {
    return z.encoder.EncodeAll(src, nil)
}

func (z *ZstdCompressor) Decompress(src []byte) ([]byte, error) {
    result, err := z.decoder.DecodeAll(src, nil)
    if err != nil {
        return nil, fmt.Errorf("zstd: failed to decompress: %w", err)
    }

    return result, nil
}

This way you handle initialization errors properly instead of swallowing them with _, and the compressor can be passed around as a dependency.

A generic cache helper.

If you want to use this approach across different types, you can write a generic helper using Go's type constraints. The trick is the two-type-parameter pattern: Go generics cannot call pointer-receiver methods on a value of type T directly, so you need a constraint that ties the pointer type *T to the value type T.

type MsgpCodec[T any] interface {
    MarshalMsg([]byte) ([]byte, error)
    UnmarshalMsg([]byte) ([]byte, error)
    *T
}

func CacheSet[T any, PT MsgpCodec[T]](
    ctx context.Context,
    rdb *redis.Client,
    key string,
    value *T,
    ttl time.Duration,
) error {
    encoded, err := PT(value).MarshalMsg(nil)
    if err != nil {
        return err
    }
    compressed := zstdEncoder.EncodeAll(encoded, nil)
    return rdb.Set(ctx, key, compressed, ttl).Err()
}

func CacheGet[T any, PT MsgpCodec[T]](
    ctx context.Context,
    rdb *redis.Client,
    key string,
) (*T, error) {
    val, err := rdb.Get(ctx, key).Bytes()
    if err != nil {
        return nil, err
    }
    decompressed, err := zstdDecoder.DecodeAll(val, nil)
    if err != nil {
        return nil, err
    }
    result := PT(new(T))
    _, err = result.UnmarshalMsg(decompressed)
    if err != nil {
        return nil, err
    }
    return (*T)(result), nil
}

The call site looks a bit verbose because of Go's generics syntax:

err := CacheSet[Layouts, *Layouts](
    ctx, rdb, "user:123:accounts", &accounts, 10*time.Minute,
)

It's not pretty, but it is type safe and you only write the serialization logic once.
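If the constraint trick looks opaque, here is a self-contained toy version of the same two-type-parameter pattern (Counter and Incrementer are illustrative names), showing how the *T term lets generic code call a pointer-receiver method:

```go
package main

import "fmt"

// Counter has a pointer-receiver method, so *Counter satisfies
// interfaces that Counter alone does not.
type Counter struct{ n int }

func (c *Counter) Inc() { c.n++ }

// Incrementer ties the pointer type to the value type T, the same
// shape as the MsgpCodec constraint: a method set plus the *T term.
type Incrementer[T any] interface {
	Inc()
	*T
}

// NewIncremented allocates a T and calls the pointer-receiver
// method through the constraint, which a plain `T any` cannot do.
func NewIncremented[T any, PT Incrementer[T]]() *T {
	v := PT(new(T))
	v.Inc()
	return (*T)(v)
}

func main() {
	c := NewIncremented[Counter, *Counter]()
	fmt.Println(c.n) // 1
}
```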

When this does not help.

Before you get excited about 90%+ reductions, I need to be clear about what kind of data benefits from this approach. It only makes sense for serialized JSON stored in Redis. If you are caching a plain string, a boolean flag, a counter, or any other scalar value, there is nothing to optimize. Redis already stores those efficiently. A SET user:123:name "John" is 4 bytes of payload. Running it through msgp + zstd would make it larger, not smaller, because of the encoding headers and compression frame overhead.
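You can verify the small-value penalty with the stdlib alone; compress/flate stands in for zstd here, and zstd's frame overhead is larger still:

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
)

// deflateAll compresses src at the highest deflate level.
func deflateAll(src []byte) []byte {
	var buf bytes.Buffer
	w, _ := flate.NewWriter(&buf, flate.BestCompression)
	w.Write(src)
	w.Close()
	return buf.Bytes()
}

func main() {
	small := []byte("John") // a 4-byte cached value
	compressed := deflateAll(small)
	// Block header and end-of-stream overhead dominate: the
	// "compressed" value comes out larger than the original.
	fmt.Println(len(small), len(compressed), len(compressed) > len(small))
}
```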

The wins come from structured data: arrays of objects, nested JSON documents, anything where encoding/json adds repeated field names, type coercion overhead, and structural characters like {, }, [, ], :, and ,. The more records in your array and the more fields in your struct, the bigger the savings. A single flat struct with 3 fields will barely compress. A thousand records with 13 fields each will compress dramatically, as we saw in the benchmarks above.

So keep your regular SET/GET for simple values. This approach is specifically for the cases where you are serializing Go structs or slices of structs into Redis as JSON blobs.

Caveats.

A few gotchas before you switch everything to msgp + zstd. First, debugging gets harder. You can't just redis-cli GET a key and read the value anymore. You need a small tool to decode and decompress the data. For development and debugging, I would recommend keeping a fallback to JSON or at least having a CLI utility that can decode your cached values.

Second, msg tags on every field are mandatory. Without them, msgp falls back to Go field names for serialization keys. This works until someone renames a struct field and silently breaks deserialization of all existing cached data. Use explicit msg:"field_name" tags and treat them like database column names: once set, they should not change.

Third, versioning your cache keys is good practice when changing serialization formats. If you switch from JSON to msgp, old cached values will fail to decode. Use a versioned key prefix like v2:user:123:accounts so old and new formats can coexist during rollout.
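A tiny helper makes the versioned prefix hard to forget. This is a sketch; the function name and version constant are illustrative:

```go
package main

import "fmt"

// cacheSchemaVersion is bumped whenever the serialization format changes,
// so old and new cached values never collide on the same key.
const cacheSchemaVersion = 2

// cacheKey builds a versioned Redis key from its parts.
func cacheKey(parts ...any) string {
	key := fmt.Sprintf("v%d", cacheSchemaVersion)
	for _, p := range parts {
		key += fmt.Sprintf(":%v", p)
	}
	return key
}

func main() {
	fmt.Println(cacheKey("user", 123, "accounts")) // v2:user:123:accounts
}
```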

Fourth, zstd compression ratio depends on your data. Repetitive data like arrays of similar structs compresses well. A single small struct with unique values might not shrink much, and you will pay the CPU cost for no benefit. Test with your actual data before committing.

Fifth, this works because our struct uses simple built-in types. All fields here are int64, int32, float64, bool, and so on. msgp knows how to serialize those out of the box. If your struct contains fields from external libraries, like decimal.Decimal from shopspring/decimal or uuid.UUID from google/uuid, msgp won't know how to encode them. You would need to implement the msgp.Marshaler and msgp.Unmarshaler interfaces on those types yourself, or convert the fields to primitive types before serialization (for example, storing a Decimal as a string or as cents in int64). Not a dealbreaker, but worth knowing before you adopt msgp for structs with non-trivial field types.
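One sketch of the conversion approach, with an illustrative Money type standing in for decimal.Decimal: define a DTO that carries only primitive fields, and convert at the cache boundary:

```go
package main

import "fmt"

// Money is an illustrative stand-in for a type like decimal.Decimal
// that msgp cannot encode out of the box.
type Money struct{ Units, Cents int64 }

type Order struct {
	ID    int64
	Total Money
}

// OrderDTO is what actually gets cached: every field is a primitive
// type msgp understands, with the Money flattened to cents.
type OrderDTO struct {
	ID           int64 `msg:"id"`
	TotalInCents int64 `msg:"total_in_cents"`
}

func toDTO(o Order) OrderDTO {
	return OrderDTO{ID: o.ID, TotalInCents: o.Total.Units*100 + o.Total.Cents}
}

func fromDTO(d OrderDTO) Order {
	return Order{ID: d.ID, Total: Money{Units: d.TotalInCents / 100, Cents: d.TotalInCents % 100}}
}

func main() {
	o := Order{ID: 7, Total: Money{Units: 19, Cents: 99}}
	fmt.Println(toDTO(o).TotalInCents)  // 1999
	fmt.Println(fromDTO(toDTO(o)) == o) // true
}
```

You run msgp codegen against OrderDTO only, and the domain type stays free of serialization concerns.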

When to bother with this.

Same as with struct padding optimizations: most applications do not need this. If your Redis usage is well within limits and you are not worried about memory costs, plain JSON works fine and is easier to debug.

But if you are running into Redis memory limits, paying for oversized instances, or caching data for millions of users, a 90% size reduction is hard to ignore. Whether you pick JSON+zstd for maximum compression or msgp+zstd for the speed/size balance depends on your workload. Measure both on your actual data. The answer might surprise you, as it surprised me.

I described how to read benchmarks and what else you can optimize in your Go application in my article Optimization Odyssey: pprof-ing & Benchmarking Golang App.

3 Apr 2026