Temporal workflows in Golang: the three things that bite in prod.

I like Temporal. It's the rare orchestration tool that delivers on durable execution, and I've leaned on it for real work. But after running it in production for a while, I've noticed that the bugs that hurt are almost never the ones the Go compiler catches. They slip past go build, past go vet, past code review, and then they pick the worst possible moment to show up.

There are three classes of these that I keep running into, roughly in order of how often they ruin someone's afternoon:

Workers panicking on bad activity calls, because the SDK erases your types.
Workflows breaking replay determinism.
Event history bloating until a worker runs out of memory.

Two of the three now have tooling. One is pure architecture and no linter will save you from it. I'll go through all three: what causes it, why it's invisible until runtime, and what actually fixes it. Part of the first section is about a linter I wrote, but this post is really about the failure modes, not the tool.

1. The compiler can't see your activity calls.

Here's the signature behind most of the trouble:

func ExecuteActivity(ctx Context, activity interface{}, args ...interface{}) Future

Everything after ctx is untyped. The SDK has no idea what activity is or what arguments it expects, so it can't warn you when you get it wrong. Take a normal activity and a workflow that calls it with one argument missing:

func (a *Activities) Greet(ctx context.Context, name string) (string, error) {
    return "Hello, " + name, nil
}

func GreetWorkflow(ctx workflow.Context) (string, error) {
    ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
        StartToCloseTimeout: time.Minute,
    })

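    // A nil *Activities is enough: only the method reference is used to name the activity.
    var a *Activities

    var greeting string
    // Greet wants (name string). We pass nothing.
    err := workflow.ExecuteActivity(ctx, a.Greet).Get(ctx, &greeting)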
    return greeting, err
}

This compiles. It looks fine in review. Then it runs, and the worker panics because Greet expected one argument and got zero.
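To make the gap concrete, here is a minimal sketch of the kind of check that ends up happening only at runtime once everything is interface{}. It uses nothing but reflect. checkArity is a name I made up for illustration, it is not an SDK function, and it ignores variadic activities for brevity.

```go
package main

import (
	"context"
	"fmt"
	"reflect"
)

// Greet mirrors the activity from the text, minus the receiver.
func Greet(ctx context.Context, name string) (string, error) {
	return "Hello, " + name, nil
}

// contextType is the reflect.Type of the context.Context interface.
var contextType = reflect.TypeOf((*context.Context)(nil)).Elem()

// checkArity is a hypothetical helper, not an SDK function. It compares the
// number of caller-supplied args with the parameters fn declares, skipping a
// leading context.Context because the framework supplies that one.
func checkArity(fn interface{}, args ...interface{}) error {
	t := reflect.TypeOf(fn)
	if t == nil || t.Kind() != reflect.Func {
		return fmt.Errorf("target %v is not a function, nothing to check against", fn)
	}
	want := t.NumIn()
	if want > 0 && t.In(0).Implements(contextType) {
		want--
	}
	if len(args) != want {
		return fmt.Errorf("expects %d argument(s), got %d", want, len(args))
	}
	return nil
}

func main() {
	fmt.Println(checkArity(Greet))          // an arity error
	fmt.Println(checkArity(Greet, "World")) // <nil>
}
```

The information needed to reject the call is all there in the function's type. It just isn't consulted until the program is running.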

Why a runtime panic is worse than it sounds.

When workflow code panics, what happens next depends on the worker's WorkflowPanicPolicy. The default is BlockWorkflow, and it does not fail the workflow. It puts the workflow task into a retry loop and leaves it there, on the assumption that you'll notice, fix the bug, and redeploy. The other option, FailWorkflow, fails the execution immediately, which is handy in development and dangerous in production, because one bad deploy can fail every open workflow at once.

So with the safe default, a missing argument doesn't fail one run and move on. It wedges every execution that reaches that line, and they all sit there retrying until you ship a fix. The workflow isn't dead, it's stuck, which is somehow more annoying.
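For reference, the policy is a single field on worker.Options. This is a minimal sketch rather than a recommended setup: it assumes a Temporal server on the default local address, and the task queue name is made up.

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	// Assumption: a Temporal server is reachable on the default local address.
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// "greet-task-queue" is a hypothetical task queue name.
	w := worker.New(c, "greet-task-queue", worker.Options{
		// BlockWorkflow is already the default; it is spelled out here only
		// to show where worker.FailWorkflow would go.
		WorkflowPanicPolicy: worker.BlockWorkflow,
	})

	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker exited:", err)
	}
}
```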

This is a static fact about your code. There's no reason to learn about it from a stuck worker at 2am.

A linter for the type safety the SDK throws away.

This is the part I got tired of, so I wrote temporalcheck-lint, a golangci-lint module plugin that recovers the checks interface{} erases. The core analyzer is execargs. It resolves the real signature of the function behind ExecuteActivity, ExecuteChildWorkflow and ExecuteLocalActivity, then checks your call against it:

workflow.go:14  ExecuteActivity: activity "Greet" expects 1 argument, got 0 (arity)

Argument count is checked by default because it's never a false positive. The type-level checks are stricter and opt-in: strict-types flags a wrong argument type, strict-pointers flags T vs *T mismatches the DataConverter hides, strict-struct-shape flags the wrong struct, and strict-tests extends the same arity checking to your OnActivity / OnWorkflow mock matchers so the tests can't drift from the real signatures.

There's one thing to fix first. execargs can only check a target it can resolve, and a target named by its registered string is opaque:

// stringtarget flags this. The string can't be checked against a signature.
workflow.ExecuteActivity(ctx, "Greet", name)
// Use a function reference instead, which execargs can resolve.
workflow.ExecuteActivity(ctx, a.Greet, name)

So enable stringtarget, replace string targets with function references, and the rest of the plugin can suddenly see far more. The remaining analyzers cover the quieter mistakes:

optionsdiscard flags a WithActivityOptions / WithChildOptions call whose returned context you throw away, so the options never apply.
optionscontext flags using a context built by the wrong helper, like configuring child-workflow options and then running an activity with that context.
activitytimeout flags an ActivityOptions literal with no StartToCloseTimeout or ScheduleToCloseTimeout, which Temporal rejects at runtime.
futureget flags a Future.Get whose returned error you discard, silently swallowing an activity or child-workflow failure.
continueasnew flags a NewContinueAsNewError result that's dropped instead of returned, so the workflow quietly ends instead of continuing.
lossynumber flags any / map[string]any / []any parameters, where the JSON converter decodes every number as float64 and an int64 past 2^53 loses its low bits.
nonserializable flags chan and func parameters the DataConverter can't serialize at all.
workeroptions flags a worker.Options literal that sets a MaxConcurrentWorkflowTask* field to 1, which panics the worker on start.
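The lossynumber case is the easiest of these to reproduce without Temporal at all. The sketch below assumes the converter behaves like encoding/json, which is what the description above implies. The helper names are mine.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// viaAny round-trips v through JSON into an untyped destination, the way a
// parameter declared as `any` would be decoded.
func viaAny(v int64) (any, error) {
	data, err := json.Marshal(v)
	if err != nil {
		return nil, err
	}
	var out any
	err = json.Unmarshal(data, &out)
	return out, err
}

// viaInt64 round-trips v through JSON into a typed destination.
func viaInt64(v int64) (int64, error) {
	data, err := json.Marshal(v)
	if err != nil {
		return 0, err
	}
	var out int64
	err = json.Unmarshal(data, &out)
	return out, err
}

func main() {
	const id int64 = 1<<53 + 1 // smallest positive integer a float64 cannot hold

	loose, err := viaAny(id)
	if err != nil {
		panic(err)
	}
	typed, err := viaInt64(id)
	if err != nil {
		panic(err)
	}

	f := loose.(float64) // untyped JSON numbers decode as float64
	fmt.Printf("decoded into any:   %T, round-trips exactly: %t\n", loose, int64(f) == id)
	fmt.Printf("decoded into int64: %T, round-trips exactly: %t\n", typed, typed == id)
}
```

Nothing errors on the untyped path. The value just comes back slightly different, which is why it's worth flagging statically.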

Installing it is the standard golangci-lint module-plugin dance: a .custom-gcl.yml listing the plugin, golangci-lint custom to build a custom binary, then enable temporalcheck in .golangci.yml. The README has the full config with every strict toggle. The same binary plugs into GoLand and CI like the stock one.

That covers the type-safety class. It does nothing for the next one.

2. Replay determinism.

Temporal workflows have to be deterministic, and the reason is replay. The engine doesn't snapshot your workflow's memory. It records an event history and, whenever a worker needs to resume a workflow, it re-runs your workflow code from the top and feeds it that history to rebuild state. If your code does something that produces a different result the second time through, the replay diverges from the recorded history and you get a non-determinism error.
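A toy model makes the mechanism easier to see. The sketch below is not Temporal's API or its replay algorithm: the "history" is just the list of steps a function scheduled, and replay re-runs the function and compares. Every name in it is hypothetical.

```go
package main

import "fmt"

// workflowFn is a toy workflow: it announces each step it wants to run by
// calling schedule. Nothing here is Temporal's API.
type workflowFn func(schedule func(step string))

// execute runs the workflow from the top and returns the steps it scheduled,
// standing in for the recorded event history.
func execute(wf workflowFn) []string {
	var steps []string
	wf(func(step string) { steps = append(steps, step) })
	return steps
}

// replay re-runs the workflow and checks that it reproduces the history.
// Scheduling more steps than the history holds is fine: that is progress.
func replay(history []string, wf workflowFn) error {
	got := execute(wf)
	for i, want := range history {
		if i >= len(got) {
			return fmt.Errorf("non-determinism at event %d: history has %q, replay stopped early", i, want)
		}
		if got[i] != want {
			return fmt.Errorf("non-determinism at event %d: history has %q, replay produced %q", i, want, got[i])
		}
	}
	return nil
}

// calls stands in for anything that differs between runs: the wall clock,
// a random number, a mutated global.
var calls int

func stable(schedule func(string)) {
	schedule("charge-card")
	schedule("send-receipt")
}

func unstable(schedule func(string)) {
	calls++
	if calls%2 == 1 {
		schedule("charge-card")
	} else {
		schedule("refund-card")
	}
	schedule("send-receipt")
}

func main() {
	fmt.Println(replay(execute(stable), stable))     // <nil>
	fmt.Println(replay(execute(unstable), unstable)) // a non-determinism error
}
```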

The usual culprits are the things that change between runs: time.Now, time.Sleep, math/rand, crypto/rand, reading the os standard streams. Plus a few Go constructs that are non-deterministic by nature: starting a goroutine, sending or receiving on a channel, ranging over a channel, and ranging over a map. The map one catches people, because Go randomizes map iteration order on purpose, so any logic that depends on that order replays differently.
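For the map case, the usual workaround is to pull the keys out, sort them, and iterate over the sorted slice. A minimal sketch, with a helper name of my own choosing:

```go
package main

import (
	"fmt"
	"sort"
)

// sortedKeys returns the keys of m in ascending order, so callers can walk
// the map in the same order on every run.
func sortedKeys(m map[string]int) []string {
	keys := make([]string, 0, len(m))
	for k := range m { // random order here is harmless: it only feeds the sort
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	balances := map[string]int{"carol": 7, "alice": 3, "bob": 5}
	for _, name := range sortedKeys(balances) {
		fmt.Println(name, balances[name])
	}
}
```

A static analyzer may still flag the range inside the helper, since it can't tell that the order is discarded. That is a reasonable candidate for an explicit ignore.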

workflowcheck for determinism across the call graph.

Temporal ships a static analyzer for exactly this, workflowcheck. It analyzes every function registered with RegisterWorkflow and reports non-deterministic code, and it follows the call graph. If a workflow calls Y which calls Z and Z ranges over a map, it reports the whole chain.

go install go.temporal.io/sdk/contrib/tools/workflowcheck@v0.5.0
workflowcheck ./...

The output is hierarchical, so you see why each call is a problem:

/path/to/workflow.go:12:2: MyWorkflow is non-deterministic, reason: calls non-deterministic function time.Now
  time.Now is non-deterministic, reason: declared non-deterministic
/path/to/workflow.go:12:2: MyWorkflow is non-deterministic, reason: calls non-deterministic function fmt.Printf
  fmt.Printf is non-deterministic, reason: accesses non-deterministic var os.Stdout

It works through go vet too, which is the setup I'd actually recommend, since go vet results are cached and most editors surface them:

go vet -vettool $(which workflowcheck) ./...

False positives can be silenced inline with a //workflowcheck:ignore comment, or whitelisted globally in a config file when a flagged function is safe on the path you use.

There's one gap, and it's stated at the top of the tool's own docs: workflowcheck does not catch global variable mutation. Reliably telling a deterministic global read apart from a non-deterministic mutation is hard in the general case, so it doesn't try. That's the one spot where my linter fills in. temporalcheck's workflowstate analyzer flags mutating a package-level variable from workflow code, and workflowlogger flags non-replay-aware logging (a plain log.Println re-emits on every replay, so one event shows up many times). The tradeoff is that those two only inspect the workflow function body and its closures, not helper calls. So the two tools cover opposite gaps: workflowcheck goes deep across the call graph on the determinism it knows about, and temporalcheck covers the global mutation and replay logging that workflowcheck skips.

Both of those classes are at least detectable. The third one isn't, and it's the one that actually took down a worker for me.

3. History size, and the OOM nobody warns you about.

This is the failure mode with no linter. It comes from the same replay mechanism that makes determinism matter, viewed from the other side.

When a worker needs to resume a workflow that isn't in its cache, because it was evicted, or the worker crashed and restarted, or it landed on a fresh worker, it downloads the entire event history and replays it from the beginning to rebuild state. That history lives in the worker's memory while it replays. The bigger the history, the more memory each replay holds, and the deserialized object graph in memory is larger than the bytes on the wire. Run a lot of workflows per worker, evict and reload a few of the fat ones at once, and the pod hits its memory limit and gets OOM-killed.

Temporal puts hard caps on history size precisely to stop this from getting unbounded. The numbers are worth knowing:

A single payload (each workflow and activity argument and return value) warns at 256KB and is capped at 2MB.
A single gRPC message, and a single event history transaction, are capped at 4MB. You can blow past this by scheduling many activities in one workflow task even when each individual payload is under 2MB.
The whole event history warns at 10MB (or 10,240 events) and is hard-terminated at 50MB (or 51,200 events). On Temporal Cloud these are non-configurable.
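The second limit is the one that surprises people, so here is the arithmetic as a sketch. checkBatch is a hypothetical pre-flight helper, not something the SDK provides. Treating MB as 1024*1024 bytes is my assumption, and the real transaction carries extra overhead on top of the payloads.

```go
package main

import "fmt"

// Limits quoted in the text. Binary units are an assumption.
const (
	payloadMaxBytes     = 2 * 1024 * 1024
	transactionMaxBytes = 4 * 1024 * 1024
)

// checkBatch is a hypothetical helper. It takes the payload sizes of the
// activities scheduled within one workflow task and reports the first limit
// they would exceed, counting payload bytes only.
func checkBatch(payloadSizes []int) error {
	total := 0
	for i, size := range payloadSizes {
		if size > payloadMaxBytes {
			return fmt.Errorf("payload %d is %d bytes, over the %d-byte payload cap", i, size, payloadMaxBytes)
		}
		total += size
	}
	if total > transactionMaxBytes {
		return fmt.Errorf("batch totals %d bytes, over the %d-byte transaction cap", total, transactionMaxBytes)
	}
	return nil
}

func main() {
	const mb = 1024 * 1024
	fmt.Println(checkBatch([]int{mb, mb}))                             // <nil>
	fmt.Println(checkBatch([]int{3 * mb / 2, 3 * mb / 2, 3 * mb / 2})) // each payload fits, the batch does not
	fmt.Println(checkBatch([]int{3 * mb}))                             // a single payload is too large
}
```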

So the practical ceiling is 50MB of history, and in my experience the worker starts feeling it well before the server terminates the workflow. That lines up with the documented reason the cap exists: loading large histories into worker memory on replay is exactly what these limits are there to prevent.

What actually bloats the history.

Two things, and they compound.

The first is big payloads. Every activity argument and return value gets recorded in the event history. If an activity returns 500KB, roughly 100 of them are enough to push the history to 50MB on payloads alone, before counting any other events. Signal and Update inputs count too. The data you pass around is the data Temporal durably stores in history, for as long as that history is retained.

The second is too many events. Every activity, timer, signal, and child workflow adds events, each carrying its payload. A workflow that does a lot of small steps gets there a different way than one that does a few huge ones, but both walk toward the same 50MB wall.
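Both walls reduce to one back-of-the-envelope formula. The sketch below is a rough upper bound under stated assumptions: MB means 1024*1024 bytes, each activity contributes three events (scheduled, started, completed), and per-event overhead and workflow task events are ignored, so real numbers come out lower.

```go
package main

import "fmt"

const (
	historyMaxBytes   = 50 * 1024 * 1024 // 50MB cap, assuming binary units
	historyMaxEvents  = 51200            // event-count cap from the text
	eventsPerActivity = 3                // scheduled, started, completed
)

// maxActivities estimates how many activities fit in one run when each one
// records payloadBytes of arguments plus results. It returns whichever limit
// is hit first, the size cap or the event cap.
func maxActivities(payloadBytes int) int {
	byEvents := historyMaxEvents / eventsPerActivity
	if payloadBytes <= 0 {
		return byEvents
	}
	bySize := historyMaxBytes / payloadBytes
	if bySize < byEvents {
		return bySize
	}
	return byEvents
}

func main() {
	fmt.Println(maxActivities(500 * 1024)) // 102: the size cap binds
	fmt.Println(maxActivities(1024))       // 17066: the event cap binds
}
```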

I've written before about hitting serialization size limits with Protobuf on Kafka topics, and the instinct is the same here: the moment you're routinely passing multi-megabyte blobs through a system that records everything, you're going to hit a wall. Temporal just draws the wall at 50MB and enforces it.

The fix: keep big data out of history.

The pattern is to stop passing the data itself and start passing a reference to it. Store the blob somewhere else, pass a key through the workflow, and let the activity that needs the data fetch it by key. This is sometimes called the claim-check pattern.

Where you store it depends on the size and lifetime:

For large or durable blobs, put them in object storage like S3 and pass the URL or key. The 50MB-per-history problem becomes a tiny string per event.
For smaller, hotter, short-lived data, Redis is a better fit, with a TTL so it cleans itself up. If you want it compact in there, the msgp plus zstd approach I used for shrinking Redis cache drops structured payloads by around 90%.

// Instead of passing a 5MB report through the workflow...
func ProcessWorkflow(ctx workflow.Context, report Report) error { /* ... */ }

// ...store it once and pass the key.
func ProcessWorkflow(ctx workflow.Context, reportKey string) error {
    // activities fetch the blob by key from S3/Redis and return only small results
    return nil
}

The activity does the upload at the edge and returns a key, the next activity downloads by key, and the workflow only ever sees identifiers. History stays small, replay gets cheap again, and the worker stops falling over. As a lighter measure, you can also compress payloads with a custom Data Converter before they ever reach history, which buys you headroom without restructuring anything, though it doesn't change the fundamental shape of the problem.
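Stripped of Temporal and of any real storage backend, the shape looks like this. BlobStore, memStore, and the two activity-like functions are all hypothetical names for illustration. A real version would sit on S3 or Redis and would need to be safe for concurrent use, which this in-memory stand-in is not.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

// BlobStore is a hypothetical interface over S3, Redis, or similar.
type BlobStore interface {
	Put(data []byte) (key string, err error)
	Get(key string) ([]byte, error)
}

// memStore is an in-memory stand-in so the sketch runs without a backend.
type memStore struct{ blobs map[string][]byte }

func newMemStore() *memStore { return &memStore{blobs: map[string][]byte{}} }

// Put stores data under a content-derived key, so a retried upload of the
// same bytes lands on the same key.
func (s *memStore) Put(data []byte) (string, error) {
	sum := sha256.Sum256(data)
	key := "reports/" + hex.EncodeToString(sum[:])
	s.blobs[key] = append([]byte(nil), data...)
	return key, nil
}

// Get returns the blob stored under key, or an error if there is none.
func (s *memStore) Get(key string) ([]byte, error) {
	data, ok := s.blobs[key]
	if !ok {
		return nil, errors.New("blob not found: " + key)
	}
	return data, nil
}

// uploadReport plays the first activity: store the blob, return only its key.
func uploadReport(store BlobStore, report []byte) (string, error) {
	return store.Put(report)
}

// reportSize plays a later activity: fetch by key, return a small result.
func reportSize(store BlobStore, key string) (int, error) {
	data, err := store.Get(key)
	if err != nil {
		return 0, err
	}
	return len(data), nil
}

func main() {
	store := newMemStore()
	report := make([]byte, 5*1024*1024) // the 5MB report from the text

	key, err := uploadReport(store, report)
	if err != nil {
		panic(err)
	}
	size, err := reportSize(store, key)
	if err != nil {
		panic(err)
	}
	fmt.Printf("history carries a %d-byte key instead of %d bytes\n", len(key), size)
}
```

Deriving the key from the content is a design choice, not a requirement. It makes a retried upload idempotent, at the cost of hashing the blob once.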

Where ContinueAsNew fits, and where it doesn't.

For workflows that bloat by event count rather than payload size, the long-running ones that loop forever, Temporal's answer is ContinueAsNew. It completes the current run and atomically starts a fresh one with the same Workflow ID, a new Run ID, and an empty history. You can decide when to call it with workflow.GetInfo(ctx).GetContinueAsNewSuggested() instead of hardcoding a threshold.

It works, but I want to be honest about the cost. ContinueAsNew is itself an action that consumes server and worker resources, and on Temporal Cloud it contributes to your overall usage and bill. The plumbing also doesn't scale gracefully: once you commit to it, every new timer, activity, or child workflow you add has to be threaded through the same continue-as-new bookkeeping, and that bookkeeping is some of the most intrusive code you'll write in a workflow. So I reach for it when a workflow really needs to run forever, not as a way to paper over giant payloads. For giant payloads, the claim-check pattern is the real fix. ContinueAsNew resets the event count, but if a single payload is the problem, a fresh history just fills up again.
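The bookkeeping is easier to reason about as a simulation. This sketch does not use the SDK: the threshold is a made-up constant standing in for the server's suggestion, and the chain of runs is an ordinary loop. In a real workflow the "continue" branch is where you would return the NewContinueAsNewError result with the carried state.

```go
package main

import "fmt"

// maxEventsPerRun is a hypothetical threshold standing in for the server's
// continue-as-new suggestion.
const maxEventsPerRun = 1000

// loopState is the only thing carried from one run to the next.
type loopState struct {
	Processed int
}

// runOnce simulates a single run. It processes items, one event each, until
// the work is done or this run's history reaches the threshold. The bool
// reports whether the caller should continue as new.
func runOnce(s loopState, total int) (loopState, bool) {
	events := 0 // every run starts with an empty history
	for s.Processed < total {
		if events >= maxEventsPerRun {
			return s, true
		}
		s.Processed++
		events++
	}
	return s, false
}

// runToCompletion chains runs until the work is done and counts them.
func runToCompletion(total int) (loopState, int) {
	s := loopState{}
	runs := 0
	for {
		runs++
		next, again := runOnce(s, total)
		s = next
		if !again {
			return s, runs
		}
	}
}

func main() {
	final, runs := runToCompletion(2500)
	fmt.Println(final.Processed, runs) // 2500 3
}
```

Anything not in loopState is gone after the handoff, which is exactly why each new timer or counter you add to the workflow has to be threaded through it.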

Catching it at review time with an AI reviewer.

The catch is that none of this is something a linter can check, because whether you'll blow the 50MB cap depends on data volume the compiler never sees. A workflow that fans out one activity per user account is perfectly fine for a user with three accounts and a problem for a user with three thousand. The code looks identical either way.

What's worked for me is pushing this onto the AI code reviewer. I write the hard limits straight into the guardrails file the reviewer reads, the CLAUDE.md or AGENTS.md or whatever your tool picks up: single payload caps at 2MB, a transaction at 4MB, history warns at 10MB and dies at 50MB or 51,200 events. Then, and this is the part that actually does the work, I document the approximate data shapes inside the project itself. Things like "a user has on the order of N accounts of this type and status," or "reading accounts across all users lands somewhere in this range." Rough numbers, not exact, but enough to reason with.

With both the limits and the rough cardinalities written down, the reviewer can do the arithmetic on a pull request. It multiplies the documented per-record size by the expected count and tells you a given workflow is going to schedule tens of thousands of activities in one run, or that an activity return is going to push history past the warn threshold once real data hits it. Modern models are good at this kind of back-of-the-envelope estimate, better than I expected, and it has caught a few workflows for us before they shipped. It's not a guarantee. It is the difference between finding the problem as a review comment and finding it as a 2am OOM, which is the whole game.

Prefer the deterministic tool.

I want to be careful not to oversell that last trick, because it cuts against something I believe more strongly. A deterministic check beats an AI judgment every time you can have one. A linter gives the same answer on the same code today, next week, and in CI on a Friday night. An AI reviewer is a probabilistic helper that has good days and bad days, and the bad day is the day the risky workflow ships anyway. So the AI estimate is a stopgap for the one case that resists static analysis, the data-volume problem, and not a license to stop building tools for everything else.

If anything, the better use of AI here is the opposite of catching bugs at review time. It's building the thing that catches them deterministically. temporalcheck-lint exists in part because I could lean on AI to move through the analyzer scaffolding, the AST matching, and the test cases far faster than I would have alone. That's the leverage worth chasing: point the model at shaping a tool that then runs the same way forever, instead of asking it to re-derive the same judgment on every pull request. The moment you catch yourself piling rule after rule into an AGENTS.md and hoping the reviewer holds them all in mind, that's usually the signal the rule wants to be a linter instead.

Putting it together.

Three failure modes, and they fail in three different ways. The worker panics on a bad activity call. The replay diverges on non-deterministic code. The history bloats until a worker runs out of memory. None of them are exotic. They're the ordinary mistakes that happened to be invisible until production.

Two of them you can catch in CI today. Run temporalcheck-lint for the type-safety and panic-prevention class, and Temporal's workflowcheck for determinism across the call graph. They cover opposite gaps, so use both. The third one is a habit rather than a tool: treat event history as expensive, keep big data out of it by reference, reach for ContinueAsNew deliberately rather than reflexively, and write the limits and your rough data sizes into the file your AI reviewer reads so it can do the math you can't see at compile time.

The thing all three share is that the compiler is happy and the tests pass right up until the moment they don't. Temporal gives you durable execution, but durability means it remembers every argument you ever passed and replays every line you ever wrote. It pays to assume it will.

See you in the next post 👾

Sources:

temporalcheck-lint: https://github.com/samgozman/temporalcheck-lint
Temporal workflowcheck: https://pkg.go.dev/go.temporal.io/sdk/contrib/tools/workflowcheck
Workflow Execution limits: https://docs.temporal.io/workflow-execution/limits
Self-hosted Temporal Service defaults (payload, gRPC, history limits): https://docs.temporal.io/self-hosted-guide/defaults
Troubleshoot payload and gRPC message size limit errors: https://docs.temporal.io/troubleshooting/blob-size-limit-error
Managing very long-running Workflows with Temporal: https://temporal.io/blog/very-long-running-workflows
WorkflowPanicPolicy (worker package): https://pkg.go.dev/go.temporal.io/sdk/worker#WorkflowPanicPolicy

3 Jul 2026