Long-term pitfalls of using Protobuf for Apache Kafka.

In my new blog post, I would like to discuss the nuances I encountered while working on a legacy project that uses Protobuf for messages in Apache Kafka. Specifically, the issues that arose years later from this decision, how the diversity of programming languages in the project complicated things, and how the untimely updating of gRPC tools further magnified the problem. The events and issues described here occurred between 2022 and 2023.

Input data.

We had two monoliths in NodeJS & Ruby, a few dozen containerized apps in Golang and Python, five thousand outdated Protobuf schemas, a salt shaker half full of gRPC protocols, and a whole galaxy of multi-colored serializers, deserializers, structured and unstructured data… and also a quart of Kafka queues, a quart of proto-Kafka messages, a case of critical bugs, a pint of raw VSOP protoc and two dozen highly outdated dependencies glued with DevOps love to over-complication.

Not that we needed all that for the team of 40 devs, but once you get locked into a serious legacy code collection, the tendency is to push it as far as you can.

What was the task?

We needed to update the protoc versions for five reasons:

  • For NodeJS, protoc and other gRPC tools were marked as deprecated and hadn't been updated in several years. New approaches have moved away from creating auto-generated files to working with Protobuf on the fly.
  • In NodeJS services, developers wanted TypeScript typings for their Protobuf messages, and the old solution was incapable of generating types.
  • We needed to access the option syntax in Protobuf (option <option_name> = <value>;) for Golang services (yes, our protoc was that old).
  • Our version of protoc didn't work on ARM processors, which made development challenging for employees using Macs.
  • It was one of the bottlenecks the company wanted to resolve in the process of updating old dependencies to reduce the number of CVEs in the images.

The root of the evil lay in the fact that the builders for the Go binaries and for NodeJS had been moved out of the repositories into an external dependency controlled by DevOps: a dependency without proper versioning, connected as a git submodule.

So, what does Kafka have to do with it?

The fact is that Protobuf was not only used for describing gRPC messages but also for Kafka messages. Often, the same contract applied in gRPC could be reused in Apache Kafka.

But there are a few key differences between how gRPC and Kafka handle messages:

gRPC:

  • gRPC messages are binary on the wire, serialized with Protocol Buffers by default.
  • gRPC tools (like protoc) generate code to serialize and deserialize messages automatically based on the defined Protobuf message types.
  • While binary is the default and highly recommended format for gRPC messages due to its efficiency, it's possible to use other serialization formats like JSON if needed. However, JSON serialization would require additional manual handling or third-party libraries.

Kafka:

  • Kafka messages are generally treated as binary payloads.
  • Kafka itself does not enforce any specific data format on the messages.
  • While Kafka itself is agnostic to the format of the messages it processes and stores, choosing an appropriate serialization format is crucial. Kafka messages are commonly serialized to a compact binary format to optimize for both size and speed, but alternative formats like JSON, Avro, or even plain text are entirely viable.

This flexibility means that Kafka, unlike gRPC, does not come with its own serializers in the way we are accustomed to. The serialization process, and even the decision about whether it is needed at all, lies entirely with the developers.
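To make that concrete, here is a minimal sketch of the static setup we had, assuming the kafkajs client and a message class generated by the old toolchain (the Order type, field names, and topic are illustrative, not the project's actual code):

import { Kafka } from "kafkajs";
import { Order } from "./gen/order_pb"; // hypothetical class generated by the old static toolchain

const kafka = new Kafka({ clientId: "billing", brokers: ["kafka:9092"] });
const producer = kafka.producer();

async function publishOrder(id: string): Promise<void> {
  const order = new Order();
  order.setId(id); // setters come from the generated code
  const bytes: Uint8Array = order.serializeBinary(); // serialization is entirely our responsibility
  await producer.connect();
  await producer.send({
    topic: "orders",
    messages: [{ key: id, value: Buffer.from(bytes) }], // Kafka itself only ever sees raw bytes
  });
}

// On the consumer side, the mirror call is Order.deserializeBinary(message.value).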

The naive approach.

Of course, the most naive approach would have been to simply update the protoc version and see what breaks. This was the quickest and most obvious option, and so we did just that.

What broke? Pretty much everything 🙂

Since I worked exclusively with NodeJS and Golang, I will speak about the problems that surfaced in my stack.

Golang

The syntax for generating Protobuf files changed significantly. Old syntax example:

protoc --proto_path="$dir" --go_out=plugins="grpc:$dir" "$file"

New syntax example:

protoc --go_out="$dir" --go_opt=paths=source_relative --go-grpc_out="$dir" --go-grpc_opt=paths=source_relative "$file"

Code generation now goes through the separate protoc-gen-go-grpc plugin from grpc-go. To generate with the new plugins, it's necessary to specify the Go package option in each proto file, for example:

option go_package = "github.com/your_company/your_package";

Some methods from the generated proto files have changed, so code modifications were inevitable.

Without go_package specified, generation fails with an error. Adding the option broke compatibility with the old static generator in NodeJS (it didn't understand the syntax), which significantly complicated the task. There was an alternative way to specify the Go package through a protoc console option:

--go_opt=Mprotos/buzz.proto=github.com/your_company/your_package

But no one could get it to work; the command constantly failed with a syntax error. Likely, we made a mistake somewhere, but after unsuccessful attempts, we moved on.

NodeJS

It was necessary to replace the npm library grpc with @grpc/grpc-js. The grpc-tools package for NodeJS had not been maintained: it bundles an outdated protoc and Protobuf v3.19.1. All replacements for this utility used the same outdated versions of protoc and Protobuf under the hood, which at the time had no ARM build and differed in several other ways from the latest releases.

We started looking for a current replacement for grpc-tools for NodeJS and concluded that modern JS approaches no longer generate files from proto contracts. Instead, they work with the contracts on the fly, for example via protobuf.js and @grpc/proto-loader.

The static approach (grpc-tools and protoc for generation, grpc or @grpc/grpc-js at runtime) is effectively deprecated: the underlying binaries are outdated and GitHub activity has stalled. The dynamic approach, built on @grpc/proto-loader and protobuf.js, has become the de facto industry standard.
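For illustration, a minimal sketch of the dynamic approach with @grpc/grpc-js and @grpc/proto-loader, loading a contract at runtime instead of generating code (the package, service, and method names are made up):

import * as grpc from "@grpc/grpc-js";
import * as protoLoader from "@grpc/proto-loader";

// Load the contract at runtime; no protoc step, no generated files.
const definition = protoLoader.loadSync("protos/buzz.proto", {
  keepCase: true,
  longs: String,
  enums: String,
  defaults: true,
  oneofs: true,
});
const proto = grpc.loadPackageDefinition(definition) as any;

// Create a client from the dynamically loaded service definition.
const client = new proto.yourcompany.BuzzService(
  "localhost:50051",
  grpc.credentials.createInsecure()
);

client.GetBuzz({ id: "42" }, (err: Error | null, reply: unknown) => {
  if (err) throw err;
  console.log(reply);
});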

After several unsuccessful attempts to painlessly update grpc-tools for NodeJS and stick with the static approach, we began exploring the dynamic one. Initially, everything went smoothly; working with protobuf.js was easy and straightforward. But after a few days of rewriting code to work with gRPC, we suddenly remembered… that our Apache Kafka also uses Protobuf alongside gRPC.

Specifically, it uses the deserializeBinary and serializeBinary methods from the generated files. The amount of code that would need to be rewritten to move from the static to the dynamic approach exceeded ten thousand lines, which was beyond our capabilities given the allocated time and the general reluctance to move in that direction. In every language we supported on the project, and in every language the team had previously worked with, the static approach of generating code from proto contracts had always been used.

What we have learned.

Updating the approach to working with gRPC on NodeJS will take time, but it doesn't pose any complexity. I've rewritten one contract, and the rest will go smoothly. However, the task with Kafka-protobuf seems to be more challenging.

We will have to write serializers and deserializers ourselves. Those that could be borrowed from the generated code are not very readable and not at all versatile. As a lazy approach, one could simply copy them and leave them somewhere inside the project, but then any change to the existing structures would become complicated. We need a universal solution. Moving away from Protobuf in Kafka was not something the team considered.
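A rough sketch of what such a universal solution could look like on the NodeJS side, assuming protobuf.js and illustrative type and file names (this is not the project's code, just the shape of the idea):

import * as protobuf from "protobufjs";

// Load all contracts once, then hand out (de)serializers by fully qualified type name.
export async function createProtoSerdes(protoFiles: string[]) {
  const root = await protobuf.load(protoFiles);
  return {
    serializer(typeName: string) {
      const type = root.lookupType(typeName);
      return (payload: Record<string, unknown>): Buffer => {
        const problem = type.verify(payload);
        if (problem) throw new Error(`Invalid ${typeName}: ${problem}`);
        // encode(...).finish() is the dynamic counterpart of the generated serializeBinary
        return Buffer.from(type.encode(type.create(payload)).finish());
      };
    },
    deserializer(typeName: string) {
      const type = root.lookupType(typeName);
      // decode(...) plus toObject(...) replaces the generated deserializeBinary
      return (bytes: Buffer) =>
        type.toObject(type.decode(bytes), { longs: String, enums: String });
    },
  };
}

// Usage (type and file names are hypothetical):
// const serdes = await createProtoSerdes(["protos/buzz.proto"]);
// const encodeBuzz = serdes.serializer("yourcompany.Buzz");
// const decodeBuzz = serdes.deserializer("yourcompany.Buzz");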

With Golang, things also did not go smoothly. Because of our failures with passing the package name via go_opt, we will have to add option go_package to all existing contracts and then update every consumer (contracts were connected as git submodules): an extremely tedious and monotonous task.

Conclusion.

Untimely updating of dependencies and neglected technical debt lead to long-lasting complex consequences. There are no perfect projects or products without technical debt, but that doesn't mean it shouldn't be addressed.

For a development team that grew from ~20 people to ~40, the decision to develop everything on top of gRPC seems excessive. Despite all my love for gRPC, I believe that maintaining and developing service interactions through gRPC for a small team constantly prolonged the completion time for tasks. Using a simple HTTP router and describing the interaction schema through OpenAPI (e.g., Swagger) would have significantly simplified the development and support of the API for a small team, while also generating excellent documentation for internal consumers.

The latency between services was never a key problem in the speed of our product. The problems were more obvious and expected: a huge number of N+1 query issues, inefficient database operations, and so on.

As for the use of Protobuf within Apache Kafka, I don't think it was a mistake. Serializing messages into compact binary data had advantages for us, as we had more than once hit Kafka topic size limits. Protobuf proved itself well as a unified format for describing structures between different programming languages.

However, it's worth noting that the team avoided using types in Protobuf messages: all, or nearly all, fields were of type string. So, in the end, I believe we could have just used JSON and sent it as binary, without the hassle (although it would have been less compact than Protobuf). JSON schemas can also be described and stored separately, and JSON can be validated against a schema in every programming language we use, with predictable results.
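To illustrate that alternative, a small sketch of the JSON route, with Ajv as one possible validator (the schema and field names are invented for the example):

import Ajv from "ajv";

const ajv = new Ajv();

// A schema that could be stored and versioned separately; inlined here for brevity.
const orderSchema = {
  type: "object",
  properties: {
    id: { type: "string" },
    amount: { type: "string" }, // mirrors our habit of keeping everything a string
  },
  required: ["id"],
  additionalProperties: false,
};
const validateOrder = ajv.compile(orderSchema);

function encodeOrderForKafka(payload: unknown): Buffer {
  if (!validateOrder(payload)) {
    throw new Error(`Invalid order: ${ajv.errorsText(validateOrder.errors)}`);
  }
  // Still a binary value from Kafka's point of view, just less compact than Protobuf.
  return Buffer.from(JSON.stringify(payload));
}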

A key lesson from this experience, beyond the well-understood issues surrounding technical debt, is the importance of preparation when choosing to use Protobuf for Kafka messages. It's vital to anticipate the need for custom serialization and deserialization mechanisms well ahead of implementation. Even though Protobuf is not strictly tied to gRPC, most developers only apply it there, which undoubtedly leaves its mark on public libraries. Any use of Protobuf outside the context of gRPC should be considered unconventional, meaning you need to avoid the bottleneck of relying solely on conventional gRPC-associated tools.

31 Mar 2024