[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"$pC4ZxlFnru":3,"blog-post-temporal-workflows-in-golang-the-three-things-that-bite-in-production":4},"email-7jbmv38w",{"content":5,"created_at":6,"description":7,"id":8,"keywords":9,"reading_time":13,"slug":14,"title":15,"updated_at":16,"ok":17},"# Temporal workflows in Golang: the three things that bite in prod.\n\nI like Temporal. It's the rare orchestration tool that delivers on durable execution, and I've leaned on it for real work. But after running it in production for a while, I've noticed that the bugs that hurt are almost never the ones the Go compiler catches. They slip past `go build`, past `go vet`, past code review, and then they pick the worst possible moment to show up.\n\nThere are three classes of these that I keep running into, roughly in order of how often they ruin someone's afternoon:\n\n1. Workers panicking on bad activity calls, because the SDK erases your types.\n2. Workflows breaking replay determinism.\n3. Event history bloating until a worker runs out of memory.\n\nTwo of the three now have tooling. One is pure architecture and no linter will save you from it. I'll go through all three: what causes it, why it's invisible until runtime, and what actually fixes it. Part of the first section is about a linter I wrote, but this post is really about the failure modes, not the tool.\n\n## 1. The compiler can't see your activity calls.\n\nHere's the signature behind most of the trouble:\n\n```go\nfunc ExecuteActivity(ctx Context, activity interface{}, args ...interface{}) Future\n```\n\nEverything after `ctx` is untyped. The SDK has no idea what `activity` is or what arguments it expects, so it can't warn you when you get it wrong. 
Take a normal activity and a workflow that calls it with one argument missing:\n\n```go\nfunc (a *Activities) Greet(ctx context.Context, name string) (string, error) {\n    return \"Hello, \" + name, nil\n}\n\nfunc GreetWorkflow(ctx workflow.Context) (string, error) {\n    ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{\n        StartToCloseTimeout: time.Minute,\n    })\n\n    var a *Activities \u002F\u002F nil receiver, only used to name the activity method\n    var greeting string\n    \u002F\u002F Greet wants (name string). We pass nothing.\n    err := workflow.ExecuteActivity(ctx, a.Greet).Get(ctx, &greeting)\n    return greeting, err\n}\n```\n\nThis compiles. It looks fine in review. Then it runs, and the worker panics because `Greet` expected one argument and got zero.\n\n### Why a runtime panic is worse than it sounds.\n\nWhen workflow code panics, what happens next depends on the worker's `WorkflowPanicPolicy`. The default is `BlockWorkflow`, and it does not fail the workflow. It puts the workflow task into a retry loop and leaves it there, on the assumption that you'll notice, fix the bug, and redeploy. The other option, `FailWorkflow`, fails the execution immediately, which is handy in development and dangerous in production, because one bad deploy can fail every open workflow at once.\n\nSo with the safe default, a missing argument doesn't fail one run and move on. It wedges **every execution that reaches that line**, and they all sit there retrying until you ship a fix. The workflow isn't dead, it's stuck, which is somehow more annoying.\n\nThis is a static fact about your code. There's no reason to learn about it from a stuck worker at 2am.\n\n### A linter for the type safety the SDK throws away.\n\nThis is the part I got tired of, so I wrote [temporalcheck-lint](https:\u002F\u002Fgithub.com\u002Fsamgozman\u002Ftemporalcheck-lint), a golangci-lint module plugin that recovers the checks `interface{}` erases. The core analyzer is `execargs`. 
It resolves the real signature of the function behind `ExecuteActivity`, `ExecuteChildWorkflow` and `ExecuteLocalActivity`, then checks your call against it:\n\n```\nworkflow.go:14  ExecuteActivity: activity \"Greet\" expects 1 argument, got 0 (arity)\n```\n\nArgument count is checked by default because it's never a false positive. The type-level checks are stricter and opt-in: `strict-types` flags a wrong argument type, `strict-pointers` flags `T` vs `*T` mismatches the `DataConverter` hides, `strict-struct-shape` flags the wrong struct, and `strict-tests` extends the same arity checking to your `OnActivity` \u002F `OnWorkflow` mock matchers so the tests can't drift from the real signatures.\n\nThere's one thing to fix first. `execargs` can only check a target it can resolve, and a target named by its registered string is opaque:\n\n```go\n\u002F\u002F stringtarget flags this. The string can't be checked against a signature.\nworkflow.ExecuteActivity(ctx, \"Greet\", name)\n\u002F\u002F Use a function reference instead, which execargs can resolve.\nworkflow.ExecuteActivity(ctx, a.Greet, name)\n```\n\nSo enable `stringtarget`, replace string targets with function references, and the rest of the plugin can suddenly see far more. 
The remaining analyzers cover the quieter mistakes:\n\n- `optionsdiscard` flags a `WithActivityOptions` \u002F `WithChildOptions` call whose returned context you throw away, so the options never apply.\n- `optionscontext` flags using a context built by the wrong helper, like configuring child-workflow options and then running an activity with that context.\n- `activitytimeout` flags an `ActivityOptions` literal with no `StartToCloseTimeout` or `ScheduleToCloseTimeout`, which Temporal rejects at runtime.\n- `futureget` flags a `Future.Get` whose returned error you discard, silently swallowing an activity or child-workflow failure.\n- `continueasnew` flags a `NewContinueAsNewError` result that's dropped instead of returned, so the workflow quietly ends instead of continuing.\n- `lossynumber` flags `any` \u002F `map[string]any` \u002F `[]any` parameters, where the JSON converter decodes every number as `float64` and an `int64` past 2^53 loses its low bits.\n- `nonserializable` flags `chan` and `func` parameters the `DataConverter` can't serialize at all.\n- `workeroptions` flags a `worker.Options` literal that sets a `MaxConcurrentWorkflowTask*` field to `1`, which panics the worker on start.\n\nInstalling it is the standard golangci-lint module-plugin dance: a `.custom-gcl.yml` listing the plugin, `golangci-lint custom` to build a custom binary, then enable `temporalcheck` in `.golangci.yml`. The README has the full config with every strict toggle. The same binary plugs into GoLand and CI like the stock one.\n\nThat covers the type-safety class. It does nothing for the next one.\n\n## 2. Replay determinism.\n\nTemporal workflows have to be deterministic, and the reason is replay. The engine doesn't snapshot your workflow's memory. It records an event history and, whenever a worker needs to resume a workflow, it re-runs your workflow code from the top and feeds it that history to rebuild state. 
If your code does something that produces a different result the second time through, the replay diverges from the recorded history and you get a non-determinism error.\n\nThe usual culprits are the things that change between runs: `time.Now`, `time.Sleep`, `math\u002Frand`, `crypto\u002Frand`, reading the `os` standard streams. Plus a few Go constructs that are non-deterministic by nature: starting a goroutine, sending or receiving on a channel, ranging over a channel, and ranging over a map. The map one catches people, because Go randomizes map iteration order on purpose, so any logic that depends on that order replays differently.\n\n### workflowcheck for determinism across the call graph.\n\nTemporal ships a static analyzer for exactly this, [`workflowcheck`](https:\u002F\u002Fpkg.go.dev\u002Fgo.temporal.io\u002Fsdk\u002Fcontrib\u002Ftools\u002Fworkflowcheck). It analyzes every function registered with `RegisterWorkflow` and reports non-deterministic code, and it follows the call graph. 
If a workflow calls `Y`, which calls `Z`, and `Z` ranges over a map, it reports the whole chain.\n\n```bash\ngo install go.temporal.io\u002Fsdk\u002Fcontrib\u002Ftools\u002Fworkflowcheck@v0.5.0\nworkflowcheck .\u002F...\n```\n\nThe output is hierarchical, so you see why each call is a problem:\n\n```\n\u002Fpath\u002Fto\u002Fworkflow.go:12:2: MyWorkflow is non-deterministic, reason: calls non-deterministic function time.Now\n  time.Now is non-deterministic, reason: declared non-deterministic\n\u002Fpath\u002Fto\u002Fworkflow.go:12:2: MyWorkflow is non-deterministic, reason: calls non-deterministic function fmt.Printf\n  fmt.Printf is non-deterministic, reason: accesses non-deterministic var os.Stdout\n```\n\nIt works through `go vet` too, which is the setup I'd actually recommend, since `go vet` output is cached and most editors surface it:\n\n```bash\ngo vet -vettool $(which workflowcheck) .\u002F...\n```\n\nFalse positives can be silenced inline with a `\u002F\u002Fworkflowcheck:ignore` comment, or whitelisted globally in a config file when a flagged function is safe on the path you use.\n\nThere's one gap, and it's stated at the top of the tool's own docs: **`workflowcheck` does not catch global variable mutation.** Reliably telling a deterministic global read apart from a non-deterministic mutation is hard in the general case, so it doesn't try. That's the one spot where my linter fills in. `temporalcheck`'s `workflowstate` analyzer flags mutating a package-level variable from workflow code, and `workflowlogger` flags non-replay-aware logging (a plain `log.Println` re-emits on every replay, so one event shows up many times). The tradeoff is that those two only inspect the workflow function body and its closures, not helper calls. 
So the two tools cover opposite gaps: `workflowcheck` goes deep across the call graph on the determinism it knows about, and `temporalcheck` covers global mutation and replay logging that it skips.\n\nBoth of those classes are at least detectable. The third one isn't, and it's the one that actually took down a worker for me.\n\n## 3. History size, and the OOM nobody warns you about.\n\nThis is the failure mode with no linter. It comes from the same replay mechanism that makes determinism matter, viewed from the other side.\n\nWhen a worker needs to resume a workflow that isn't in its cache, because it was evicted, or the worker crashed and restarted, or it landed on a fresh worker, it downloads the **entire event history** and replays it from the beginning to rebuild state. That history lives in the worker's memory while it replays. The bigger the history, the more memory each replay holds, and the deserialized object graph in memory is larger than the bytes on the wire. Run a lot of workflows per worker, evict and reload a few of the fat ones at once, and the pod hits its memory limit and gets OOM-killed.\n\nTemporal puts hard caps on history size precisely to stop this from getting unbounded. The numbers are worth knowing:\n\n- A single payload (each workflow and activity argument and return value) warns at **256KB** and is capped at **2MB**.\n- A single gRPC message, and a single event history transaction, are capped at **4MB**. You can blow past this by scheduling many activities in one workflow task even when each individual payload is under 2MB.\n- The whole event history warns at **10MB** (or 10,240 events) and is hard-terminated at **50MB** (or 51,200 events). On Temporal Cloud these are non-configurable.\n\nSo the practical ceiling is **50MB of history**, and in my experience the worker starts feeling it well before the server terminates the workflow. 
That lines up with the documented reason the cap exists: loading large histories into worker memory on replay is exactly what these limits are there to prevent.\n\n### What actually bloats the history.\n\nTwo things, and they compound.\n\nThe first is **big payloads**. Every activity argument and return value gets recorded in the event history. If an activity returns 500KB, roughly 100 of them are enough to push the history past 50MB, before counting any other events. Signal and Update inputs count too. The data you pass around is the data Temporal durably stores in history, for as long as that history is retained.\n\nThe second is **too many events**. Every activity, timer, signal, and child workflow adds events, each carrying its payload. A workflow that does a lot of small steps gets there a different way than one that does a few huge ones, but both walk toward the same 50MB wall.\n\nI've written before about [hitting serialization size limits with Protobuf on Kafka topics](https:\u002F\u002Fgozman.space\u002Fblog\u002Flong-term-pitfalls-of-using-protobuf-for-apache-kafka), and the instinct is the same here: the moment you're routinely passing multi-megabyte blobs through a system that records everything, you're going to hit a wall. Temporal just draws the wall at 50MB and enforces it.\n\n### The fix: keep big data out of history.\n\nThe pattern is to stop passing the data itself and start passing a reference to it. Store the blob somewhere else, pass a key through the workflow, and let the activity that needs the data fetch it by key. This is sometimes called the claim-check pattern.\n\nWhere you store it depends on the size and lifetime:\n\n- For large or durable blobs, put them in object storage like S3 and pass the URL or key. The 50MB-per-history problem becomes a tiny string per event.\n- For smaller, hotter, short-lived data, Redis is a better fit, with a TTL so it cleans itself up. 
If you want it compact in there, the [msgp plus zstd approach I used for shrinking Redis cache](https:\u002F\u002Fgozman.space\u002Fblog\u002Fshrinking-redis-cache-with-msgp-and-zstd-in-golang) shrinks structured payloads by around 90%.\n\n```go\n\u002F\u002F Instead of passing a 5MB report through the workflow...\nfunc ProcessWorkflow(ctx workflow.Context, report Report) error { \u002F* ... *\u002F }\n\n\u002F\u002F ...store it once and pass the key.\nfunc ProcessWorkflow(ctx workflow.Context, reportKey string) error {\n    \u002F\u002F activities fetch the blob by key from S3\u002FRedis and return only small results\n    return nil\n}\n```\n\nThe activity does the upload at the edge and returns a key, the next activity downloads by key, and the workflow only ever sees identifiers. History stays small, replay gets cheap again, and the worker stops falling over. As a lighter measure, you can also compress payloads with a custom Data Converter before they ever reach history, which buys you headroom without restructuring anything, though it doesn't change the fundamental shape of the problem.\n\n### Where ContinueAsNew fits, and where it doesn't.\n\nFor workflows that bloat by **event count** rather than payload size, the long-running ones that loop forever, Temporal's answer is `ContinueAsNew`. It completes the current run and atomically starts a fresh one with the same Workflow ID, a new Run ID, and an empty history. You can decide when to call it with `workflow.GetInfo(ctx).GetContinueAsNewSuggested()` instead of hardcoding a threshold.\n\nIt works, but I want to be honest about the cost. `ContinueAsNew` is itself an action that consumes server and worker resources, and on Temporal Cloud it contributes to your overall usage and bill. The plumbing also doesn't scale gracefully: once you commit to it, every new timer, activity, or child workflow you add has to be threaded through the same continue-as-new bookkeeping, and that bookkeeping is some of the most intrusive code you'll write in a workflow. 
So I reach for it when a workflow really needs to run forever, not as a way to paper over giant payloads. For giant payloads, the claim-check pattern is the real fix. `ContinueAsNew` resets the event count, but if a single payload is the problem, a fresh history just fills up again.\n\n### Catching it at review time with an AI reviewer.\n\nThe catch is that none of this is something a linter can check, because whether you'll blow the 50MB cap depends on data volume the compiler never sees. A workflow that fans out one activity per user account is perfectly fine for a user with three accounts and a problem for a user with three thousand. The code looks identical either way.\n\nWhat's worked for me is pushing this onto the AI code reviewer. I write the hard limits straight into the guardrails file the reviewer reads, the `CLAUDE.md` or `AGENTS.md` or whatever your tool picks up: single payload caps at 2MB, a transaction at 4MB, history warns at 10MB and dies at 50MB or 51,200 events. Then, and this is the part that actually does the work, I document the approximate data shapes inside the project itself. Things like \"a user has on the order of N accounts of this type and status,\" or \"reading accounts across all users lands somewhere in this range.\" Rough numbers, not exact, but enough to reason with.\n\nWith both the limits and the rough cardinalities written down, the reviewer can do the arithmetic on a pull request. It multiplies the documented per-record size by the expected count and tells you a given workflow is going to schedule tens of thousands of activities in one run, or that an activity return is going to push history past the warn threshold once real data hits it. Modern models are good at this kind of back-of-the-envelope estimate, better than I expected, and it has caught a few workflows for us before they shipped. It's not a guarantee. 
It is the difference between finding the problem as a review comment and finding it as a 2am OOM, which is the whole game.\n\n### Prefer the deterministic tool.\n\nI want to be careful not to oversell that last trick, because it cuts against something I believe more strongly. A deterministic check beats an AI judgment every time you can have one. A linter gives the same answer on the same code today, next week, and in CI on a Friday night. An AI reviewer is a probabilistic helper that has good days and bad days, and the bad day is the day the risky workflow ships anyway. So the AI estimate is a stopgap for the one case that resists static analysis, the data-volume problem, and not a license to stop building tools for everything else.\n\nIf anything, the better use of AI here is the opposite of catching bugs at review time. It's building the thing that catches them deterministically. temporalcheck-lint exists in part because I could lean on AI to move through the analyzer scaffolding, the AST matching, and the test cases far faster than I would have alone. That's the leverage worth chasing: point the model at shaping a tool that then runs the same way forever, instead of asking it to re-derive the same judgment on every pull request. The moment you catch yourself piling rule after rule into an `AGENTS.md` and hoping the reviewer holds them all in mind, that's usually the signal the rule wants to be a linter instead.\n\n## Putting it together.\n\nThree failure modes, and they fail in three different ways. The worker panics on a bad activity call. The replay diverges on non-deterministic code. The history bloats until a worker runs out of memory. None of them are exotic. They're the ordinary mistakes that happened to be invisible until production.\n\nTwo of them you can catch in CI today. 
Run [temporalcheck-lint](https:\u002F\u002Fgithub.com\u002Fsamgozman\u002Ftemporalcheck-lint) for the type-safety and panic-prevention class, and Temporal's [`workflowcheck`](https:\u002F\u002Fpkg.go.dev\u002Fgo.temporal.io\u002Fsdk\u002Fcontrib\u002Ftools\u002Fworkflowcheck) for determinism across the call graph. They cover opposite gaps, so use both. The third one is a habit rather than a tool: treat event history as expensive, keep big data out of it by reference, reach for `ContinueAsNew` deliberately rather than reflexively, and write the limits and your rough data sizes into the file your AI reviewer reads so it can do the math you can't see at compile time.\n\nThe thing all three share is that the compiler is happy and the tests pass right up until the moment they don't. Temporal gives you durable execution, but durability means it remembers every argument you ever passed and replays every line you ever wrote. It pays to assume it will.\n\nSee you in the next post 👾\n\nSources:\n\n- temporalcheck-lint: [https:\u002F\u002Fgithub.com\u002Fsamgozman\u002Ftemporalcheck-lint](https:\u002F\u002Fgithub.com\u002Fsamgozman\u002Ftemporalcheck-lint)\n- Temporal workflowcheck: [https:\u002F\u002Fpkg.go.dev\u002Fgo.temporal.io\u002Fsdk\u002Fcontrib\u002Ftools\u002Fworkflowcheck](https:\u002F\u002Fpkg.go.dev\u002Fgo.temporal.io\u002Fsdk\u002Fcontrib\u002Ftools\u002Fworkflowcheck)\n- Workflow Execution limits: [https:\u002F\u002Fdocs.temporal.io\u002Fworkflow-execution\u002Flimits](https:\u002F\u002Fdocs.temporal.io\u002Fworkflow-execution\u002Flimits)\n- Self-hosted Temporal Service defaults (payload, gRPC, history limits): [https:\u002F\u002Fdocs.temporal.io\u002Fself-hosted-guide\u002Fdefaults](https:\u002F\u002Fdocs.temporal.io\u002Fself-hosted-guide\u002Fdefaults)\n- Troubleshoot payload and gRPC message size limit errors: 
[https:\u002F\u002Fdocs.temporal.io\u002Ftroubleshooting\u002Fblob-size-limit-error](https:\u002F\u002Fdocs.temporal.io\u002Ftroubleshooting\u002Fblob-size-limit-error)\n- Managing very long-running Workflows with Temporal: [https:\u002F\u002Ftemporal.io\u002Fblog\u002Fvery-long-running-workflows](https:\u002F\u002Ftemporal.io\u002Fblog\u002Fvery-long-running-workflows)\n- WorkflowPanicPolicy (worker package): [https:\u002F\u002Fpkg.go.dev\u002Fgo.temporal.io\u002Fsdk\u002Fworker#WorkflowPanicPolicy](https:\u002F\u002Fpkg.go.dev\u002Fgo.temporal.io\u002Fsdk\u002Fworker#WorkflowPanicPolicy)","2026-07-03T13:37:29.369721Z","Temporal is great for orchestration, but a few classes of problem never show up until you're running it for real. Workers panic on bad activity calls because the SDK takes everything as interface{}. Workflows break replay determinism. And event history quietly bloats until a worker runs out of memory somewhere around 50MB. Here's what causes each one, the tooling that catches the first two, and the architecture that fixes the third.",10,[10,11,12],"temporal","golang","linter",741,"temporal-workflows-in-golang-the-three-things-that-bite-in-production","Temporal workflows in Golang: the 3 things that bite in production","2026-07-03T15:16:13.629244Z",true]