[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"$pC4ZxlFnru":3,"blog-post-optimization-odyssey-profiling-and-benchmarking-golang-app-with-pprof":4},"email-3ed5lqvl",{"content":5,"created_at":6,"description":7,"id":8,"keywords":9,"reading_time":14,"slug":15,"title":16,"updated_at":17,"ok":18},"# Optimization Odyssey: pprof-ing & Benchmarking Golang App\n\nIn this post, I want to guide you through the whole process of profiling and benchmarking a Golang app. I will create a simple server app with the most naive implementation possible and walk you through the step-by-step process of finding slow spots in your application with the help of the `pprof` profiling tool.\n\nBut before we get started, I want to emphasize something:\n\n> Never engage in premature optimizations.\n\nUsually, the naive approach to the problem is the best and safest. In the real world, outside Leetcode, your employer wants you to ship fast, not to spend much time shaving nanoseconds off executions. Yes, there are some optimizations you still need to perform (like solving N+1 mysteries), but those are most likely business logic optimizations not covered in this post.\n\nPerform fine-grained optimizations and benchmarking only when necessary, or if you expect to see massive load on that part of the code. Your 0.1 RPS app does not need this, believe me 🙂\n\n## Stage 1: The Naive Approach.\n\nOK, so before we dive deep into pprof profiling, let's create our elementary project first. I will create a server app with just one route that iterates over a large JSON array to return a summary. The dummy data array will consist of fictional objects describing users' transactions and bank accounts; the app will need to iterate over the array and sum up those numbers. 
For the server, I will choose `gin`, since it is currently the most popular tool, and for calculating money amounts I will use `shopspring\u002Fdecimal`.\n\nEverything mentioned in this article is available in the GitHub repository: [samgozman\u002Fgolang-optimization-stages](https:\u002F\u002Fgithub.com\u002Fsamgozman\u002Fgolang-optimization-stages). At this point, you can safely skip Stage 1 and go directly to the next chapter.\n\nThe main function will be `ServeApp`, and you can find its source code here: [samgozman\u002Fgolang-optimization-stages\u002Fblob\u002Fmain\u002Fstage1\u002Fmain.go](https:\u002F\u002Fgithub.com\u002Fsamgozman\u002Fgolang-optimization-stages\u002Fblob\u002Fmain\u002Fstage1\u002Fmain.go). Nothing special about it, just a classic gin server with context-based shutdown. Let's focus on the part that does the actual work, in our case the router handler `GetJSONHandler`:\n\n```go\n\u002F\u002F GetJSONHandler is a simple handler that returns a JSON response with a message\nfunc GetJSONHandler(c *gin.Context) {\n  \u002F\u002F 1. Read the content of the file dummy_data.json\n  data := ReadFile(\"..\u002Fobject\u002Fdummy_data.json\")\n\n  \u002F\u002F 2. Parse the JSON into a slice of users\n  users := ParseJSON(data)\n\n  \u002F\u002F 3. Calculate the total balances\n  currents, pendings := BalancesTotals(users)\n\n  \u002F\u002F 4. 
Calculate the total transactions\n  transactionsSum, transactionsCount := TransactionsTotals(users)\n\n  c.JSON(200, gin.H{\n    \"current\":            currents.String(),\n    \"pending\":            pendings.String(),\n    \"transactions_sum\":   transactionsSum.String(),\n    \"transactions_count\": transactionsCount,\n  })\n}\n```\n\nIt consists of 4 major functions: ReadFile, ParseJSON, BalancesTotals and TransactionsTotals.\n\n```go\n\u002F\u002F ReadFile reads a file and returns its content as a byte slice\nfunc ReadFile(filePath string) []byte {\n  content, err := os.ReadFile(filePath)\n  if err != nil {\n    panic(fmt.Errorf(\"failed to read file: %w\", err))\n  }\n  return content\n}\n\n\u002F\u002F ParseJSON unmarshals the JSON data into a slice of users.\nfunc ParseJSON(data []byte) []object.User {\n  var users []object.User\n  if err := json.Unmarshal(data, &users); err != nil {\n    panic(fmt.Errorf(\"failed to unmarshal JSON: %w\", err))\n  }\n\n  return users\n}\n\n\u002F\u002F BalancesTotals calculates the total balances of the users.\nfunc BalancesTotals(users []object.User) (currents decimal.Decimal, pendings decimal.Decimal) {\n  for _, user := range users {\n    \u002F\u002F parse errors are ignored for brevity\n    current, _ := decimal.NewFromString(user.Balance.Current)\n    currents = currents.Add(current)\n\n    pending, _ := decimal.NewFromString(user.Balance.Pending)\n    pendings = pendings.Add(pending)\n  }\n\n  return\n}\n\n\u002F\u002F TransactionsTotals calculates the total transactions of the users.\nfunc TransactionsTotals(users []object.User) (sum decimal.Decimal, count int) {\n  var transactionsSum decimal.Decimal\n  var transactionsCount int\n\n  for _, user := range users {\n    for _, transaction := range user.Transactions {\n      amount, _ := decimal.NewFromString(transaction.Amount)\n      transactionsSum = transactionsSum.Add(amount)\n      transactionsCount++\n    }\n  }\n\n  return transactionsSum, transactionsCount\n}\n```\n\nIt doesn't really matter what fields `object.User` contains for this purpose. 
The main thing you should know is that it is a big structure and the array is ~1.8 MB on disk.\n\nYou can probably already spot a few places that could be made more efficient, like using a pointer to the structure, combining `BalancesTotals` and `TransactionsTotals`, reducing the Big O complexity (nested for-loops), or counting transactions with `len()`.\n\nIt would be fantastic if you could spot those things. However, guessing is not the best approach to an optimization process, because you can't be sure which changes will have the biggest impact. Do they have a positive impact at all? Will it be enough?\n\n### How to Im'pprof' our code: a guide to the Golang profiling tool.\n\nTo start the optimization process, we first need to prioritize our targets. To do so, we need to find the slowest parts of our application and the most memory-inefficient ones. This is where the `pprof` tool comes in.\n\n`pprof`, or performance profiler, is a tool that helps you collect CPU profiles, traces, and memory heap profiles of your Golang application. It also comes with features to analyze and visualize generated `pprof` reports (but this will require [graphviz](http:\u002F\u002Fwww.graphviz.org\u002F) to be installed).\n\nThere are a few techniques for capturing profiles from your app:\n\n1. By manually calling the `pprof` tool in your code (in the benchmarking process)\n2. By running a `pprof` API router alongside your application, which is just **one** line of code with the help of the standard library `net\u002Fhttp\u002Fpprof`.\n\nI will go with the first approach in this article because it shows more of how the tool can be used. Besides, you will not always be able to set up a `pprof` API comfortably in your application (if you are building a desktop application or a CLI tool, for instance).\n\nNow, back to the code! 
We need to create some very basic tests for the app, but for profiling purposes, I would recommend going with the benchmark test type.\n\n```go\nfunc BenchmarkServeApp(b *testing.B) {\n  \u002F\u002F Start pprof profiling\n  if err := utils.StartPprof(); err != nil {\n    log.Fatal(err)\n  }\n\n  \u002F\u002F Create a context\n  ctx, cancel := context.WithCancel(context.Background())\n  defer cancel()\n\n  \u002F\u002F Run the server\n  go ServeApp(ctx)\n\n  \u002F\u002F Run the benchmark\n  for i := 0; i \u003C b.N; i++ {\n    resp, err := http.Get(\"http:\u002F\u002Flocalhost:8080\u002Fjson\")\n    if err != nil {\n      b.Fatal(err)\n    }\n    \u002F\u002F Drain and close the body so the connection can be reused\n    _, _ = io.Copy(io.Discard, resp.Body)\n    _ = resp.Body.Close()\n  }\n\n  \u002F\u002F Cancel the context\n  cancel()\n}\n```\n\nI have moved the `pprof` initialization into `utils.StartPprof`; we will return to it in a bit. Normally, we would create a benchmark just for the router handler function `GetJSONHandler`, since it holds all the logic we are trying to optimize. But for educational purposes, I will test the whole `ServeApp` - because you might want to replace the `gin` router with something else to see if it improves any numbers.\n\nTo run a benchmark in Go, all you have to do, besides creating a test with `b *testing.B` as a parameter, is run a console command:\n\n```bash\ngo test -bench=.\n```\n\nAs simple as that, but we will add some more parameters to it later on. 
Back to the profiling initialization function, which I moved out to reduce repetition between the steps in the repository for this post.\n\n```go\nvar cpuprofile = flag.String(\"cpuprofile\", \"\", \"write cpu profile to `file`\")\nvar memprofile = flag.String(\"memprofile\", \"\", \"write memory profile to `file`\")\n\n\u002F\u002F StartPprof starts pprof profiling\nfunc StartPprof() error {\n  flag.Parse()\n  if *cpuprofile != \"\" {\n    f, err := os.Create(*cpuprofile)\n    if err != nil {\n      return fmt.Errorf(\"could not create CPU profile: %w\", err)\n    }\n    if err := pprof.StartCPUProfile(f); err != nil {\n      _ = f.Close()\n      return fmt.Errorf(\"could not start CPU profile: %w\", err)\n    }\n    \u002F\u002F Note: profiling keeps running from this point on. pprof.StopCPUProfile()\n    \u002F\u002F (followed by closing the file) must be called once the measured work is\n    \u002F\u002F done - stopping it here right away would record an empty profile.\n  }\n\n  if *memprofile != \"\" {\n    f, err := os.Create(*memprofile)\n    if err != nil {\n      return fmt.Errorf(\"could not create memory profile: %w\", err)\n    }\n    runtime.GC() \u002F\u002F get up-to-date statistics\n    \u002F\u002F Note: a heap profile is a snapshot of the moment it is written; to profile\n    \u002F\u002F the workload itself, write it after the work has run, not before.\n    if err := pprof.WriteHeapProfile(f); err != nil {\n      _ = f.Close()\n      return fmt.Errorf(\"could not write memory profile: %w\", err)\n    }\n    if err := f.Close(); err != nil {\n      return fmt.Errorf(\"could not close memory profile file: %w\", err)\n    }\n  }\n\n  return nil\n}\n```\n\nOk, that's quite a lot of code! Remember, you can avoid writing this manually if you can \"afford\" to run a **profiling router** alongside your application. We are doing it this way not just because we want to optimize the app, but also because we want to measure the results with controlled benchmarks.\n\nLet's take a look at this code. First, we get the `pprof` options from the CLI. In `StartPprof` we create 2 separate files for the binary output data from `pprof` (by the way, it is not arbitrary binary data - it is essentially serialized Protobuf). 
In this case, we are trying to capture CPU and memory heap profiles. `pprof` has a few more profile types you can capture, but these two are, in my opinion, the most common and important ones to start with.\n\nRemember that Golang has a garbage collector? Garbage collection is the process of finding and reclaiming memory that is no longer in use by your program. In Go, this is managed automatically, but here we are calling `runtime.GC()`, which is not something you would normally do in your application. The reason `runtime.GC()` is called before writing the memory profile is to ensure that the memory statistics are up-to-date. If there is any memory that can be reclaimed, `runtime.GC()` will reclaim it, and the memory profile will then reflect the state of memory usage after that cleanup.\n\nIn contrast, CPU profiling is about tracking where time is spent, rather than what resources are used. It records the function call stack at regular intervals during the profiling period. This is why `runtime.GC()` is not called in the CPU profiling section - it's not relevant to the data being collected.\n\n### Capturing & reading performance profiles with pprof.\n\nEverything is set up and ready for the first test! Let's proceed with the actual bench & profiling:\n\n```bash\ngo test -v -cpuprofile=cpu.pprof -memprofile=mem.pprof -benchmem -bench=. -benchtime=1s -count=10\n```\n\nThis runs our benchmark test and writes the performance profiles into the `mem.pprof` and `cpu.pprof` files. The part I want you to pay attention to is these 2 flags:\n\n- `-benchtime=1s` ensures we run as many iterations as possible within 1 second. This helps to quickly gauge your function's RPS capabilities. You can also lock the number of iterations instead of using a timeout by passing a count such as `5x` as the argument (to run exactly 5 times).\n- `-count=10` is how many times we will repeat our benchmark. 
For fast 1s benchmarks, I recommend running at least 10 repetitions to get meaningful averages. Each run will differ, depending on many things happening in your system. This is why it is recommended to run benchmarks in a dedicated server environment with as few background processes as possible to get reliable results.\n\nAfter running this command, you will get output like this (your actual numbers will vary):\n\n```bash\nBenchmarkServeApp-10   75   15359607 ns\u002Fop   5022236 B\u002Fop   63424 allocs\u002Fop\nBenchmarkServeApp-10   79   15283220 ns\u002Fop   5019885 B\u002Fop   63420 allocs\u002Fop\n...\n```\n\nIt is a lot of information and not that readable at first glance (I don't know why the Go team forgot to add some headers here, but anyway). You can average those values manually or run longer tests (benchmark results can also be compared automatically by a CLI tool, but I will address that later). Let's take a look at how to read those values:\n\n- **BenchmarkServeApp-10** - The `-10` suffix indicates the value of `GOMAXPROCS`, which is the maximum number of CPUs that can be executing simultaneously.\n- **75, 79,** etc.: These are the numbers of iterations that the benchmark was able to run in the time specified by `-benchtime` (1 second in our case).\n- **15359607 ns\u002Fop, 15283220 ns\u002Fop,** etc.: This is the average time taken per operation, in nanoseconds. An \"operation\" is one iteration of your benchmark function.\n- **5022236 B\u002Fop, 5019885 B\u002Fop,** etc.: This is the average amount of memory allocated per operation, in bytes. 
This includes both heap and stack memory.\n- **63424 allocs\u002Fop, 63420 allocs\u002Fop,** etc.: This is the average number of memory allocations made per operation.\n\nNow that you understand how to read Golang benchmark values, it is time for the next part: reading the `mem.pprof` and `cpu.pprof` profiles.\n\n> Small reminder - [graphviz](http:\u002F\u002Fwww.graphviz.org\u002F) installation required.\n\nI will start with `mem.pprof`, since it is impossible to open both with one command. To do so, run this command and head to the browser:\n\n```bash\ngo tool pprof -http=':8081' mem.pprof\n```\n\nOpen the URL provided in the console to access the `pprof` UI. Before you get overwhelmed by the chart, let's look at the menu. In the \"Sample\" section, there are 4 options available: `alloc_objects`, `alloc_space`, `inuse_objects` and `inuse_space`. I think it is obvious from the names what each sample group represents. The group that interests me the most is `alloc_space` - click on it.\n\n![pprof graph alloc_space](https:\u002F\u002Fgozman.space\u002Fimg\u002Fposts\u002Fpprof-may-2024\u002Fpprof-graph-1.png)\n\nWhat you see here is a call stack of the functions used in our app and the libraries inside it. Just by looking at this graph view, you can already notice which functions allocated the most space during execution. We had only 4 functions in `GetJSONHandler`, and `ParseJSON` together with `ReadFile` drains most of the memory.\n\n> Focus on optimizing functions in descending order of the resources they use.\n\nNext, in the \"View\" section of the menu, choose the \"Flame Graph\" view. The default graph was great, but I prefer to use the \"Flame Graph\" here (I find it more compact and readable).\n\n![pprof flame stage 1](https:\u002F\u002Fgozman.space\u002Fimg\u002Fposts\u002Fpprof-may-2024\u002Fpprof-flame-1.png)\n\nYou have probably noticed an alarming memory usage value, almost **4 GB** in our case! Oh, that's a lot! 
But don't be frightened, it is not constant RAM pressure. \n\nThe `pprof` tool in Go provides a visualization of the memory usage of your program. When you see large numbers like 4 GB, it is showing the cumulative size of the objects allocated by your program, not the current memory usage. And since we ran the command with the `-count=10` flag, at 1 second per run, you see the cumulative size of 10 runs.\n\nThe `pprof` tool tracks the allocations made by your program. If your program allocates plenty of small objects that are quickly garbage collected, the total size of those allocations can be much larger than the actual memory usage of your program at any given moment.\n\nFor example, if your program allocates 1 MB of memory 4000 times, and each time it frees that memory before the next allocation, your program would never use more than 1 MB of memory at a time. However, `pprof` would report that your program allocated 4 GB of memory.\n\nSo, the large numbers you're seeing in `pprof` are not a measure of your program's memory usage, but rather a measure of its memory allocation. It's a useful tool for identifying parts of your code that create a lot of garbage for the garbage collector to clean up, which can slow down your program.\n\n---\n\nOk, that was a lot of information. But we are not done, not yet! Remember that we still have `cpu.pprof` to process? Stop the current `pprof` execution and open the next file:\n\n```bash\ngo tool pprof -http=':8081' cpu.pprof\n```\n\nYou will notice right away that we have a larger and more complex graph. You can also see that your application itself used only ~51% of the CPU time… The rest was utilized by the Golang runtime and the garbage collector. 
Let's switch the view to \"Flame Graph\" to look at our program (the sample setting is \"cpu\"):\n\n![pprof flame cpu stage 1](https:\u002F\u002Fgozman.space\u002Fimg\u002Fposts\u002Fpprof-may-2024\u002Fpprof-flame-cpu-1.png)\n\nSee how it differs from the memory profile? Most of the CPU time was spent in the `ParseJSON` function. Every other function we wrote has such an insignificant impact on the CPU that we can't even see their names on the chart without clicking on the items. But is that the correct conclusion? No, it is not.\n\nWhile you can't see the `ReadFile`, `BalancesTotals` & `TransactionsTotals` functions here, they impact your application's performance in the most hidden way: through the runtime and the garbage collector (GC). Our root application only occupies about **51%** of the CPU time. Every unnecessary memory allocation in nested loops, pointer misuse, redundant goroutines etc. puts pressure on the CPU via the runtime and the GC. That is the remaining 49% of our program, and there is no easy way to measure the impact here (some of these processes we can't control), but we will use this information to compare the impact of our future optimizations.\n\n## Stage 2: Optimizing JSON parsing in Golang.\n\nBefore we continue, there is something I need you to modify in the `go test` command. Add `> stage1.txt` at the end of it to save the output to a text file instead of the console. We will use it later to compare our results between stages.\n\n```bash\ngo test -v -cpuprofile=cpu.pprof -memprofile=mem.pprof -benchmem -bench=. 
-benchtime=1s -count=10 > stage1.txt\n```\n\nYou will also need to install the `benchstat` command if you would rather not compare benchmark results manually every single time: [https:\u002F\u002Fcs.opensource.google\u002Fgo\u002Fx\u002Fperf](https:\u002F\u002Fcs.opensource.google\u002Fgo\u002Fx\u002Fperf).\n\n```bash\ngo install golang.org\u002Fx\u002Fperf\u002Fcmd\u002Fbenchstat@latest\n```\n\nJudging by every performance metric we've seen so far, JSON parsing is our main bottleneck. It is a well-known problem for applications that need to operate under high RPS load. Our first step here is to find a more efficient library to handle JSON tasks. I will save you some time and go with [go-json](https:\u002F\u002Fgithub.com\u002Fgoccy\u002Fgo-json); you will only need to replace the import. You could also go with the `sonic` library, but its performance depends heavily on the target system.\n\n```go\n-import \"encoding\u002Fjson\"\n+import \"github.com\u002Fgoccy\u002Fgo-json\"\n```\n\nNow it's time to run our performance test for the second time, but this time we will save it into the `stage2.txt` file.\n\n```bash\ngo test -v -cpuprofile=cpu.pprof -memprofile=mem.pprof -benchmem -bench=. -benchtime=1s -count=10 > stage2.txt\n```\n\nJust by looking at the `stage2.txt` output, I can already tell that we made a huge difference in performance. The number of iterations we were able to perform in one second increased by **3-3.5x**!\n\n```bash\nBenchmarkServeApp-10   229  4959000 ns\u002Fop  6174943 B\u002Fop   31765 allocs\u002Fop\nBenchmarkServeApp-10   247  4957436 ns\u002Fop  6168185 B\u002Fop   31759 allocs\u002Fop\n...\n```\n\nIt is time to compare the results with our new tool. 
To run `benchstat`, we only need to provide the files:\n\n```bash\nbenchstat -table .fullname .\u002Fstage1\u002Fstage1.txt .\u002Fstage2\u002Fstage2.txt\n```\n\nNow let's look at the results:\n\n```bash\nbenchstat -table .fullname .\u002Fstage1\u002Fstage1.txt .\u002Fstage2\u002Fstage2.txt\n.fullname: ServeApp-10\n            │ .\u002Fstage1\u002Fstage1.txt │         .\u002Fstage2\u002Fstage2.txt         │\n            │       sec\u002Fop        │   sec\u002Fop     vs base                │\nServeApp-10          15.569m ± 2%   5.221m ± 5%  -66.47% (p=0.000 n=10)\n\n            │ .\u002Fstage1\u002Fstage1.txt │         .\u002Fstage2\u002Fstage2.txt          │\n            │        B\u002Fop         │     B\u002Fop      vs base                │\nServeApp-10          4.785Mi ± 0%   5.732Mi ± 3%  +19.77% (p=0.000 n=10)\n\n            │ .\u002Fstage1\u002Fstage1.txt │         .\u002Fstage2\u002Fstage2.txt         │\n            │      allocs\u002Fop      │  allocs\u002Fop   vs base                │\nServeApp-10           63.42k ± 0%   31.75k ± 0%  -49.94% (p=0.000 n=10)\n```\n\nThe benchmark results show that after replacing the standard JSON decoder with `go-json`, the execution time per operation (sec\u002Fop) decreased by approximately 66.47% - a significant performance improvement. The number of allocations per operation (allocs\u002Fop) also decreased by nearly 49.94%. However, the memory usage per operation (B\u002Fop) increased by about 19.77%: the operation became faster and allocates less often, but it is slightly more memory-hungry.\n\n![pprof flame stage 2](https:\u002F\u002Fgozman.space\u002Fimg\u002Fposts\u002Fpprof-may-2024\u002Fpprof-flame-2.png)\n\nLooking at the updated pprof flame graph, you can see that the `ParseJSON` function still takes up a lot of space here. You can also notice that the total cumulative memory usage skyrocketed from 4 GB to almost 20 GB! 
But keep in mind that we increased the number of operations per second roughly 3x, and, well, B\u002Fop is up 19.77% - that was the tradeoff. This explains such a difference in cumulative memory usage.\n\nOverall, this was a giant step in our optimization journey! Let's continue forward.\n\n## Stage 3: Optimizing ReadFile operation.\n\nOur next step is to improve `ReadFile` performance. Instead of relying on `os.ReadFile`, we can use the built-in `bufio` package, which implements buffered readers and writers around the standard `io` interfaces.\n\n```go\nfunc ReadFile(filePath string) (io.Reader, func()) {\n  file, err := os.Open(filePath)\n  if err != nil {\n    panic(fmt.Errorf(\"failed to open file: %w\", err))\n  }\n\n  return bufio.NewReader(file), func() {\n    _ = file.Close()\n  }\n}\n\nfunc ParseJSON(data io.Reader) []object.User {\n  var users []object.User\n\n  if err := json.NewDecoder(data).Decode(&users); err != nil {\n    panic(fmt.Errorf(\"failed to unmarshal JSON: %w\", err))\n  }\n\n  return users\n}\n\nfunc GetJSONHandler(c *gin.Context) {\n  data, cancel := ReadFile(\"..\u002Fobject\u002Fdummy_data.json\")\n\n  users := ParseJSON(data)\n  \u002F\u002F Close the file reader after reading the content.\n  \u002F\u002F Note: it can be deferred, but it's better to close it as soon as possible.\n  cancel()\n  \n  ...\n}\n```\n\nThe primary distinction between buffered I\u002FO and standard I\u002FO lies in how data is processed. Buffered I\u002FO reads or writes data in blocks, rather than handling one byte at a time.\n\nLet's run our tests and compare the results!\n\n```bash\nBenchmarkServeApp-10     189   5937594 ns\u002Fop 4593636 B\u002Fop   31772 allocs\u002Fop\nBenchmarkServeApp-10     190   6154874 ns\u002Fop 4559697 B\u002Fop   31766 allocs\u002Fop\n...\n```\n\nWe can see that something is not right. The number of iterations decreased (from ~250 in Stage 2 to ~190). 
Next, I will run `benchstat` to compare the results further:\n\n```bash\nbenchstat -table .fullname .\u002Fstage2\u002Fstage2.txt .\u002Fstage3\u002Fstage3.txt\n.fullname: ServeApp-10\n            │ .\u002Fstage2\u002Fstage2.txt │         .\u002Fstage3\u002Fstage3.txt         │\n            │       sec\u002Fop        │   sec\u002Fop     vs base                │\nServeApp-10           4.795m ± 2%   6.155m ± 3%  +28.37% (p=0.000 n=10)\n\n            │ .\u002Fstage2\u002Fstage2.txt │         .\u002Fstage3\u002Fstage3.txt          │\n            │        B\u002Fop         │     B\u002Fop      vs base                │\nServeApp-10          5.699Mi ± 2%   4.235Mi ± 3%  -25.70% (p=0.000 n=10)\n\n            │ .\u002Fstage2\u002Fstage2.txt │        .\u002Fstage3\u002Fstage3.txt         │\n            │      allocs\u002Fop      │  allocs\u002Fop   vs base               │\nServeApp-10           31.75k ± 0%   31.76k ± 0%  +0.04% (p=0.012 n=10)\n```\n\nThe execution time per operation (sec\u002Fop) increased by 28% (which explains the significant drop in iterations); however, the memory usage per operation (B\u002Fop) decreased by almost the same amount! So we use less memory, but need more CPU time to achieve it.\n\n> What we have learned here is that each optimization has its own drawbacks. You need to consider what you are trying to achieve - whether it's speeding up your application or reducing its memory usage. These objectives do not always go hand in hand.\n\nI will continue to focus on speeding up the application here, so we should devise a different approach suitable for our case. Since we are consistently serving only one file, and it remains unchanged at all times, we can cache it.\n\nThe best approach here is to use the `sync.Once` struct from the standard library. 
It ensures that `ReadFile` reads the data from disk just once.\n\n```go\nvar (\n  once    sync.Once\n  content []byte\n)\n\nfunc ReadFile(filePath string) []byte {\n  once.Do(func() {\n    var err error\n    content, err = os.ReadFile(filePath)\n    if err != nil {\n      panic(fmt.Errorf(\"failed to read file: %w\", err))\n    }\n  })\n  return content\n}\n```\n\nNeat and simple. We should run our benchmarks to see the impact:\n\n```bash\nBenchmarkServeApp-10     280   4208854 ns\u002Fop 4188958 B\u002Fop   31745 allocs\u002Fop\nBenchmarkServeApp-10     286   4146523 ns\u002Fop 4165174 B\u002Fop   31741 allocs\u002Fop\n...\n```\n\nOK, that was better, really better! The number of iterations increased from Stage 2 by about 10-12%. Let's look at `benchstat` as we did before:\n\n```bash\nbenchstat -table .fullname .\u002Fstage2\u002Fstage2.txt .\u002Fstage3\u002Fstage3.txt\n.fullname: ServeApp-10\n            │ .\u002Fstage2\u002Fstage2.txt │        .\u002Fstage3\u002Fstage3.txt         │\n            │       sec\u002Fop        │   sec\u002Fop     vs base               │\nServeApp-10           4.795m ± 2%   4.329m ± 3%  -9.71% (p=0.000 n=10)\n\n            │ .\u002Fstage2\u002Fstage2.txt │         .\u002Fstage3\u002Fstage3.txt          │\n            │        B\u002Fop         │     B\u002Fop      vs base                │\nServeApp-10          5.699Mi ± 2%   3.928Mi ± 2%  -31.07% (p=0.000 n=10)\n\n            │ .\u002Fstage2\u002Fstage2.txt │        .\u002Fstage3\u002Fstage3.txt         │\n            │      allocs\u002Fop      │  allocs\u002Fop   vs base               │\nServeApp-10           31.75k ± 0%   31.74k ± 0%  -0.03% (p=0.000 n=10)\n```\n\nAfter caching the file, the benchmarking results show a significant performance improvement. 
The operation time decreased by approximately 9.71%, the memory usage declined by around 31.07%, and the number of allocations per operation remained almost the same.\n\n![pprof flame stage 3](https:\u002F\u002Fgozman.space\u002Fimg\u002Fposts\u002Fpprof-may-2024\u002Fpprof-flame-3.png)\n\nNow you can't even find the `ReadFile` function in the `pprof` UI! It's also worth mentioning that the total cumulative memory usage noticeably dropped from 20 GB to 15 GB. I'm pleased with the results and will continue further!\n\n## Stage 4: Pointers & Loops.\n\nI think the optimizations described in this chapter were the most obvious from the start. The first reason I saved them for so long is that you most likely use pointers automatically and can spot unnecessary loops on your own. The second reason is that while this approach matters, it will only help us a tiny bit in comparison with the previous changes.\n\nI will change 3 things in the code:\n\n1. Replace all `[]object.User` with `[]*object.User`. If you are new to this, you might be tempted to use a pointer to the slice (`*[]object.User`) instead, but that will make things worse in almost any use case. Copying pointers is a cheap operation.\n2. Combine `BalancesTotals` & `TransactionsTotals` into one function to remove redundant loops. Sidenote: you can also flatten the data to remove a nested loop, but that will require a bit more time to cook.\n3. Count transactions with `len()`\n\nI think that is enough of the code being shown here already. If you're interested, you can see the full example in the GitHub repository: [samgozman\u002Fgolang-optimization-stages\u002Ftree\u002Fmain\u002Fstage3](https:\u002F\u002Fgithub.com\u002Fsamgozman\u002Fgolang-optimization-stages\u002Ftree\u002Fmain\u002Fstage3). 
Let's move on to benchmarking once again…\n\n```bash\nBenchmarkServeApp-10     277   3953330 ns\u002Fop 3697193 B\u002Fop   31744 allocs\u002Fop\nBenchmarkServeApp-10     306   3929780 ns\u002Fop 3693402 B\u002Fop   31739 allocs\u002Fop\n...\n```\n\nJust a slight difference; `benchstat` will show us a bit more:\n\n```bash\nbenchstat -table .fullname .\u002Fstage3\u002Fstage3.txt .\u002Fstage4\u002Fstage4.txt\n.fullname: ServeApp-10\n            │ .\u002Fstage3\u002Fstage3.txt │      .\u002Fstage4\u002Fstage4.txt      │\n            │       sec\u002Fop        │   sec\u002Fop     vs base          │\nServeApp-10           4.329m ± 3%   4.264m ± 7%  ~ (p=0.684 n=10)\n\n            │ .\u002Fstage3\u002Fstage3.txt │         .\u002Fstage4\u002Fstage4.txt          │\n            │        B\u002Fop         │     B\u002Fop      vs base                │\nServeApp-10          3.928Mi ± 2%   3.519Mi ± 0%  -10.42% (p=0.000 n=10)\n\n            │ .\u002Fstage3\u002Fstage3.txt │      .\u002Fstage4\u002Fstage4.txt      │\n            │      allocs\u002Fop      │  allocs\u002Fop   vs base          │\nServeApp-10           31.74k ± 0%   31.73k ± 0%  ~ (p=0.467 n=10)\n```\n\nThe memory usage declined by 10% from the previous iteration; everything else is the same. Most of the impact here comes from switching to pointers; the loop optimization hasn't affected performance much.\n\n## Stage 5: Cache Cheating.\n\nAnd now it is time for a little cheat, which is only applicable to our code example! Remember that our router serves the same file every single time? That is not so uncommon, especially if your server's job is serving front-end texts. In Stage 3 I introduced `sync.Once` for the `ReadFile` function. But we can use it for the whole handler logic!\n\n```go\nvar (\n  once     sync.Once\n  response gin.H\n)\n\nfunc GetJSONHandler(c *gin.Context) {\n  once.Do(func() {\n    \u002F\u002F 1. 
Read the content of the file dummy_data.json\n    data := ReadFile(\"..\u002Fobject\u002Fdummy_data.json\")\n\n    \u002F\u002F 2. Parse the JSON data into user objects\n    users := ParseJSON(data)\n\n    \u002F\u002F 3. Calculate the total balances and transactions\n    currents, pendings, transactionsSum, transactionsCount := GetTotals(users)\n\n    response = gin.H{\n      \"current\":            currents.String(),\n      \"pending\":            pendings.String(),\n      \"transactions_sum\":   transactionsSum.String(),\n      \"transactions_count\": transactionsCount,\n    }\n  })\n\n  c.JSON(200, response)\n}\n```\n\nI have moved `once.Do` from `ReadFile` into `GetJSONHandler`. Now we only build this response once, which makes all our previous optimizations kinda useless. But I wanted to show them anyway, because this particular cheat only works in some very specific cases. In more complicated scenarios, you will probably cache responses in some Redis-like key-value storage via router middleware. 
Nevertheless, the global idea is the same: use cache.\n\n```bash\nBenchmarkServeApp-10    3146    466019 ns\u002Fop   24553 B\u002Fop     174 allocs\u002Fop\nBenchmarkServeApp-10    2426    511220 ns\u002Fop   22710 B\u002Fop     171 allocs\u002Fop\n...\n```\n\nAstonishing results!\n\n```bash\nbenchstat -table .fullname .\u002Fstage4\u002Fstage4.txt .\u002Fstage5\u002Fstage5.txt\n.fullname: ServeApp-10\n            │ .\u002Fstage4\u002Fstage4.txt │          .\u002Fstage5\u002Fstage5.txt          │\n            │       sec\u002Fop        │    sec\u002Fop      vs base                │\nServeApp-10          4264.5µ ± 7%   805.6µ ± 340%  -81.11% (p=0.000 n=10)\n\n            │ .\u002Fstage4\u002Fstage4.txt │          .\u002Fstage5\u002Fstage5.txt          │\n            │        B\u002Fop         │     B\u002Fop       vs base                │\nServeApp-10        3603.50Ki ± 0%   27.12Ki ± 12%  -99.25% (p=0.000 n=10)\n\n            │ .\u002Fstage4\u002Fstage4.txt │        .\u002Fstage5\u002Fstage5.txt         │\n            │      allocs\u002Fop      │ allocs\u002Fop   vs base                │\nServeApp-10          31733.5 ± 0%   174.0 ± 5%  -99.45% (p=0.000 n=10)\n```\n\nNow this router barely uses anything; it is just crazy! Yes, it is cheating, but a legal one in the case of this example. I suggest looking at our `alloc_space` graph one last time, just for fun:\n\n![pprof flame stage 5](https:\u002F\u002Fgozman.space\u002Fimg\u002Fposts\u002Fpprof-may-2024\u002Fpprof-flame-5.png)\n\nIt is difficult to even spot any of the functions we've been optimizing all this time on this graph! The total cumulative memory usage is now 1\u002F8 of its original value. The same applies to the CPU profile as well.\n\n![pprof flame cpu stage 5](https:\u002F\u002Fgozman.space\u002Fimg\u002Fposts\u002Fpprof-may-2024\u002Fpprof-flame-cpu-5.png)\n\n## Summary.\n\nIt was too much information for one blog post, but I'm glad I didn't split it in half! 
We moved from our first naive approach, which could handle up to 78 requests per second, to the optimized approach in Stage 4, which increased RPS to 279 on average. And in Stage 5 we used our cache cheat to raise the bar even higher, up to 1487 RPS.\n\nEverything that we've covered here is about how to approach the optimization process, what steps to take, and how to use `pprof` and benchmarks in Golang.\n\nSome tips and notes:\n\n- Never engage in premature optimizations\n- Read more about `pprof` in the README file: [github.com\u002Fgoogle\u002Fpprof](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fpprof\u002Fblob\u002Fmain\u002Fdoc\u002FREADME.md)\n- You can optimize work with large objects by using `sync.Pool`. It has its drawbacks, but it will help reduce the number of allocations in some cases\n- The `sync.Once` trick leads to the usage of global variables, which is an antipattern in most languages. Use it wisely and never export those variables to the outside\n- If you want to bench some part of the application inside a test, be sure not to use any mocks in it: most mocking libraries expect a `testing.T`, not the `testing.B` that benchmarks provide\n- If your benchmark's setup takes significant time before the actual measurement, call `b.ResetTimer()` to reset the timer statistics\n- Use the `net\u002Fhttp\u002Fpprof` package to run the `pprof` API alongside your application (this can help to \"debug\" some problems in production)\n- When optimizing your code, strive not to sacrifice readability\n\nI hope you found this information useful and see you in the next post! Subscribe for future updates :)","2024-05-06T07:39:58.669805Z","A guide through the whole process of profiling and benchmarking the Golang app in 5 stages. The step-by-step process of finding slow spots in your application with the help of the pprof profiling tool. 
Using various optimization techniques, I will show you how to increase the RPS of a simple router by 20 times.",4,[10,11,12,13],"profiling","pprof","golang","optimization",1203,"optimization-odyssey-profiling-and-benchmarking-golang-app-with-pprof","Optimization Odyssey: pprof-ing & Benchmarking Golang App","2024-05-07T06:59:45.704387Z",true]