Optimization Odyssey: pprof-ing & Benchmarking Golang App

In this post, I want to guide you through the whole process of profiling and benchmarking the Golang app. I will create a simple server app with the most naive implementation of it and will show you the step-by-step process of finding slow spots in your application with the help of the pprof profiling tool.

But before we get started, I want to emphasize something for you:

Never engage in premature optimizations.

Usually, the naive approach to the problem is the best and safest. In the real world, outside Leetcode, your employer wants you to ship fast, not spending much time on optimizing nanosecond executions. Yes, there are some optimizations you still need to perform (like solving N+1 mysteries), but they are most likely to be business logic optimizations not covered in this post.

Do fine-tune optimizations and benchmarking only when necessary or if you expect to see some massive load on this part of the code. Your 0.1 RPS app does not need this, believe me 🙂

Stage 1: The Naive Approach.

OK, so before we deep dive into the pprof profiling, let's create our elementary project first. I will create some server app with just one route that will iterate over the large JSON array to return some summary. The dummy data array will consist of some fictional data objects of users' transactions and bank accounts; the app will need to iterate over the array and sum up those numbers. For the server, I will choose gin since it is the most popular tool currently, and for calculating money amounts I will choose shopspring/decimal.

Everything I mentioned here in this article is available for you in the GitHub repository: samgozman/golang-optimization-stages. At this point, you can safely skip Stage 1 and go directly to the next chapter.

The main function will be ServeApp and you can find the source code for it here: samgozman/golang-optimization-stages/blob/main/stage1/main.go. Nothing special about it, just a classic gin server with context shutdown. Let's keep attention on the part that will do some work, in our case the router handler GetJSONHandler:

// GetJSONHandler is a simple handler that returns a JSON response with a message
func GetJSONHandler(c *gin.Context) {
  // 1. Read the content of the file dummy_data.json
  data := ReadFile("../object/dummy_data.json")

  // 2. Return a JSON response with a message
  users := ParseJSON(data)

  // 3. Calculate the total balances
  currents, pendings := BalancesTotals(users)

  // 4. Calculate the total transactions
  transactionsSum, transactionsCount := TransactionsTotals(users)

  c.JSON(200, gin.H{
    "current":            currents.String(),
    "pending":            pendings.String(),
    "transactions_sum":   transactionsSum.String(),
    "transactions_count": transactionsCount,
  })
}

It consists of 4 major functions: ReadFile, ParseJSON, BalancesTotals and TransactionsTotals.

// ReadFile reads a file and returns its content as a string
func ReadFile(filePath string) []byte {
  content, err := os.ReadFile(filePath)
  if err != nil {
    panic(fmt.Errorf("failed to read file: %w", err))
  }
  return content
}

func ParseJSON(data []byte) []object.User {
  var users []object.User
  if err := json.Unmarshal(data, &users); err != nil {
    panic(fmt.Errorf("failed to unmarshal JSON: %w", err))
  }

  return users
}

// BalancesTotals calculates the total balances of the users.
func BalancesTotals(users []object.User) (currents decimal.Decimal, pendings decimal.Decimal) {
  for _, user := range users {
    current, _ := decimal.NewFromString(user.Balance.Current)
    currents = currents.Add(current)

    pending, _ := decimal.NewFromString(user.Balance.Pending)
    pendings = pendings.Add(pending)
  }

  return
}

// TransactionsTotals calculates the total transactions of the users.
func TransactionsTotals(users []object.User) (sum decimal.Decimal, count int) {
  var transactionsSum decimal.Decimal
  var transactionsCount int

  for _, user := range users {
    for _, transaction := range user.Transactions {
      amount, _ := decimal.NewFromString(transaction.Amount)
      transactionsSum = transactionsSum.Add(amount)
      transactionsCount++
    }
  }

  return transactionsSum, transactionsCount
}

Doesn't really matter what fields are in object object.User for this purpose. The main thing you should know is that it is a big structure and the array size is ~1.8 MB on the disk.

I think that you can find already just a few spots that could be done more efficiently, like using a pointer to the structure, combining BalancesTotals and TransactionsTotals, reducing the Big O complexity (nested for-loops), or counting transactions with len().

It would be fantastic if you could spot those things. However, it is not the best approach for an optimization process because you are not sure which changes will have the biggest impact. And do they have a positive impact at all? Will this be enough?

How to Im'pprof' our code: the guide into Golang profiling tool.

To start the optimization process, we first need to prioritize our targets. To do so, we need to find the slowest parts of our application and the most memory-inefficient parts of it. This where the pprof tool comes in.

pprof, or performance profiler, is a tool that helps you to collect CPU profiles, traces, and memory heap profiles of your Golang application. It also comes with features to analyze and visualize generated pprof reports (but this will require graphviz installed).

There are a few techniques of how you can capture profiles from your app:

  1. By manually calling the pprof tool in your code (in the benchmarking process)
  2. By running pprof API router alongside of your application, which is just one line of code with the help of standard library net/http/pprof.

I will go with the first approach for this article because it can show you more of how you can use this tool. And you will not always be able to set up a pprof API comfortably in your application (if you are building some desktop application or a CLI tool for instance).

Now, back to the code! We need to create some very basic tests for the app, but in the case of profiling, I would recommend going with the benchmark test type.

func BenchmarkServeApp(b *testing.B) {
  // Start pprof profiling
  if err := utils.StartPprof(); err != nil {
    log.Fatal(err)
  }

  // Create a context
  ctx, cancel := context.WithCancel(context.Background())
  defer cancel()

  // Run the server
  go ServeApp(ctx)

  // Run the benchmark
  for i := 0; i < b.N; i++ {
    _, err := http.Get("http://localhost:8080/json")
    if err != nil {
      b.Fatal(err)
    }
  }

  // Cancel the context
  cancel()
}

I have moved pprof initialization to the utils.StartPprof, we will return to it in a bit. Normally, we would have to create a benchmark just for the router handler function GetJSONHandler, since it holds all the logic we are trying to optimize. But, for the educational approach, I will test the whole ServeApp - because you might want to replace gin router with something else to see if it improves any numbers.

To run the benchmark in Go, all you have to do, besides creating a test with b *testing.B as a parameter, is to run a console command:

go test -bench=.

As simple as that, but we will add some more parameters to it later on. Back to the profiling initialization function that I moved to have fewer repetitions between the steps in the repository for that post.

var cpuprofile = flag.String("cpuprofile", "", "write cpu profile to `file`")
var memprofile = flag.String("memprofile", "", "write memory profile to `file`")

// StartPprof starts pprof profiling
func StartPprof() error {
  flag.Parse()
  if *cpuprofile != "" {
    f, err := os.Create(*cpuprofile)
    if err != nil {
      return fmt.Errorf("could not create CPU profile: %w", err)
    }
    if err := pprof.StartCPUProfile(f); err != nil {
      _ = f.Close()
      return fmt.Errorf("could not start CPU profile: %w", err)
    }
    pprof.StopCPUProfile()
    if err := f.Close(); err != nil {
      return fmt.Errorf("could not close CPU profile file: %w", err)
    }
  }

  if *memprofile != "" {
    f, err := os.Create(*memprofile)
    if err != nil {
      return fmt.Errorf("could not create memory profile: %w", err)
    }
    runtime.GC() // get up-to-date statistics
    if err := pprof.WriteHeapProfile(f); err != nil {
      _ = f.Close()
      return fmt.Errorf("could not write memory profile: %w", err)
    }
    if err := f.Close(); err != nil {
      return fmt.Errorf("could not close memory profile file: %w", err)
    }
  }

  return nil
}

Ok, that's quite a lot of code! Remember, you can avoid writing this manually if you can "afford" to run a profiling router alongside your application. We are doing it this way not because we want just to optimize the app, but also to measure the results with controlled benchmarks.

Let's take a look at this code. First, we are getting the pprof options from the CLI. In the StartPprof we are creating 2 separate files for binary output data from pprof (btw, it is not just binary, it is basically a compiled Protobufs). In this case, we are trying to capture CPU and memory heap profiles. Pprof has a few more options you can trace, but those are the most common and important in my opinion to start with.

Remember that Golang has a garbage collector? Garbage collection is the process of finding and reclaiming memory that is no longer in use by your program. In Go, this is typically managed automatically, but here we are calling runtime.GC(), which is not the thing you will normally do in your application. The reason runtime.GC() is called before writing the memory profile is to ensure that the memory statistics are up-to-date. If there is any memory that can be reclaimed, runtime.GC() will do so, and then the memory profile will reflect the current state of memory usage after that cleanup.

In contrast, CPU profiling is about tracking where time is spent, rather than resources that are used. It records the function call stack at regular intervals during the period of profiling. This is why runtime.GC() is not called in the CPU profiling section - it's not relevant to the data being collected.

Capturing & reading performance profiles with pprof.

Everything is set up and ready for the first test! Let's proceed with the actual bech & profiling:

go test -v -cpuprofile=cpu.pprof -memprofile=mem.pprof -benchmem -bench=. -benchtime=1s -count=10

What it does is running our benchmark test with performance profile writing into mem.pprof and cpu.pprof files. The part I want you to keep attention to is those 2 flags:

  • -benchtime=1s will ensure we will run as many iterations as possible within 1 second. This will help to get fast results regarding your function RPS capabilities. You can also lock the number of iterations of your code to run instead of a timeout by setting 5x as an argument (to run 5 times, for example).
  • -count=10 is how many times we will repeat our benchmark. For fast 1s benchmarks, I recommend you to run at least 10 times to get some average values. Depending on many things in your system, each run will not be the same. This is why it is recommended to run benchmarks in some dedicated server environment with as few background processes as possible to get reliable results.

After running this command, you will get this output (your actual numbers will vary):

BenchmarkServeApp-10   75   15359607 ns/op   5022236 B/op   63424 allocs/op
BenchmarkServeApp-10   79   15283220 ns/op   5019885 B/op   63420 allocs/op
...

It is a lot of information and is not that readable at the first glance (I don't know why the go-team forgot to add some headers here, but anyway). You can manually average down those values or to run longer tests (also benchmark iterations can be compared automatically by the CLI tool, but I will address it later). Let's take a look at how we can read those values:

  • BenchmarkServeApp-10 - The -10 indicates the env GOMAXPROCS value, which is the maximum number of CPUs that can be executing simultaneously.
  • 75, 79, etc.: These are the number of iterations that the benchmark was able to run in the time specified by -benchtime (1 second in our case).
  • 15359607 ns/op, 15283220 ns/op, etc.: This is the average time taken per operation, in nanoseconds. An "operation" is one iteration of your benchmark function.
  • 5022236 B/op, 5019885 B/op, etc.: This is the average amount of memory allocated per operation, in bytes. This includes both heap and stack memory.
  • 63424 allocs/op, 63420 allocs/op, etc.: This is the average number of memory allocations made per operation.

Now that you understand how to read Golang benchmark values, it is time for the next part: reading mem.pprof and cpu.pprof profiles.

Small reminder - graphviz installation required.

I will start from the mem.pprof, since it is impossible to open them both with one command. To do so, run this command and head into the browser:

go tool pprof -http=':8081' mem.pprof

Open the URL provided in the console to access pprof UI. Before you will be overwhelmed with the chart, let's look at the menu. In the "Sample" section, there are 4 options available: alloc_objects, alloc_space, inuse_objects and inuse_space. I think it is obvious from the name what each sample group represents. The group that is interested the most to me is alloc_space - click on it.

pprof graph alloc_space

What you see here is a call stack of functions used in our app and the libraries inside it. Just by looking at this graph view, you can already notice what functions allocated the most space during execution. We had only 4 functions in GetJSONHandler where ParseJSON with ReadFile is draining most of the memory.

Focus on optimizing function in the descending order of the resources they use.

Next, in the "View" section of the menu, choose "Flame Graph" view. The default graph was great, but I prefer to use "Flame Graph" here (I found it more compact and readable).

pprof flame stage 1

You have probably noticed an alarming memory usage value, almost 4 GB in our case! Oh, that's a lot! But don't be frightened, it is not a constant RAM pressure.

The pprof tool in Go provides a visualization of the memory usage of your program. When you see large numbers like 4 GB, it's showing the cumulative size of the objects allocated in your program, not the current memory usage. And, since we were running the command with the -count=10 flag for 1 second each iteration, you see the cumulative size for 10 runs.

The pprof tool records every allocation made by your program. If your program allocates plenty of small objects that are quickly garbage collected, the total size of those allocations can be much larger than the actual memory usage of your program at any given moment.

For example, if your program allocates 1 MB of memory 4000 times, and each time it frees that memory before the next allocation, your program would never use more than 1 MB of memory at a time. However, pprof would report that your program allocated 4 GB of memory.

So, the large numbers you're seeing in pprof are not a measure of your program's memory usage, but rather a measure of its memory allocation. It's a useful tool for identifying parts of your code that are creating a lot of garbage for the garbage collector to clean up, which can slow down your program.


Ok, that was a lot of information. But we are not done, not yet! Remember that we are still having the cpu.pprof to process? Stop the current pprof execution and open the next file:

go tool pprof -http=':8081' cpu.pprof

From the first seconds, you can notice that we are having a larger and more complex graph. And you can see that your application was used ~51% of the CPU time… And the rest was utilized by Golang runtime and garbage collector. Let's switch the view to "Flame Graph" to look at our program (the sample setting is "cpu"):

pprof flame cpu stage 1

See how it differs from memory usage? Most of the CPU time was utilized on the ParseJSON function. Every other function that we wrote are having so insignificant an impact on the CPU, that we can't even set their names on the chart without clicking on the items. But is it a correct conclusion? No, it is not.

While you can't see ReadFile, BalancesTotals & TransactionsTotals functions here, they are impacting your application performance in the most hidden way: through the runtime and garbage collector (GC). Our root application only occupies about 51**%** of CPU time. Every unnecessary memory allocation in nested loops, pointers misuse, redundant goroutines etc. will lead to pressure on the CPU via the runtime process and GC. It is the rest 49% of our program and there is no easy way to measure the impact here (and some processes we can't control), but we will use this information to compare the impact of our future optimizations.

Stage 2: Optimizing JSON parsing in Golang.

Before we continue, there is something I need you to modify in the go test command. Add at the end of it > stage1.txt to save the output to the text file instead of the console. We will use it later to assess our results between stages.

go test -v -cpuprofile=cpu.pprof -memprofile=mem.pprof -benchmem -bench=. -benchtime=1s -count=10 > stage1.txt

You also will need to install benchstat cmd if you would rather not compare benchmark results manually every single time https://cs.opensource.google/go/x/perf.

go install golang.org/x/perf/cmd/benchstat@latest

From every performance metric that we've seen so far: JSON parsing is our main bottleneck. It is a well-known problem for applications that need to operate under high RPS load. Our first step here is to find a more efficient library to handle JSON tasks. I will save you some time and will go on with go-json, you will only need to replace import. You can also go with sonic library, but its performance is very dependent on the target system.

-import "encoding/json"
+import "github.com/goccy/go-json"

Now it's time to run our performance test for the second time, but this time we will save it into stage2.txt file.

go test -v -cpuprofile=cpu.pprof -memprofile=mem.pprof -benchmem -bench=. -benchtime=1s -count=10 > stage2.txt

Just by looking at the stage2.txt output, I can already tell that we made some huge difference in terms of performance. The number of iterations we were able to perform in one second significantly increased by 3-3.5x times!

BenchmarkServeApp-10   229  4959000 ns/op  6174943 B/op   31765 allocs/op
BenchmarkServeApp-10   247  4957436 ns/op  6168185 B/op   31759 allocs/op
...

It is time to compare results with our new tool. To run the benchstat we only need to provide those files:

benchstat -table .fullname ./stage1/stage1.txt ./stage2/stage2.txt

Now let's look at the results

benchstat -table .fullname ./stage1/stage1.txt ./stage2/stage2.txt
.fullname: ServeApp-10
 ./stage1/stage1.txt         ./stage2/stage2.txt       sec/op   sec/op     vs baseServeApp-10          15.569m ± 2%   5.221m ± 5%  -66.47% (p=0.000 n=10)

 ./stage1/stage1.txt         ./stage2/stage2.txt        B/op     B/op      vs baseServeApp-10          4.785Mi ± 0%   5.732Mi ± 3%  +19.77% (p=0.000 n=10)

 ./stage1/stage1.txt         ./stage2/stage2.txt      allocs/op  allocs/op   vs baseServeApp-10           63.42k ± 0%   31.75k ± 0%  -49.94% (p=0.000 n=10)

The benchmark results show that after replacing the standard JSON decoder with go-json, the execution time per operation (sec/op) decreased by approximately 66.47%, indicating a significant improvement in performance. However, the memory usage per operation (B/op) increased by about 19.77%, and the number of allocations per operation (allocs/op) decreased by nearly 49.94%, suggesting that while the operation became faster, it also became slightly more memory-intensive.

pprof flame stage 2

After looking at the updated pprof flame graph, you can see that ParseJSON function still takes up a lot of space here. You can also notice that the total cumulative memory usage skyrocketed from 4 GB to almost 20 GB! But keep in mind, that we were able to increase the number of operations per second by roughly 3 times and, well, B/op is up 19.77%, which was a tradeoff. This explains such a difference in cumulative memory usage.

Overall, this was a giant step in our optimization journey! Let's continue forward.

Stage 3: Optimizing ReadFile operation.

Our next step is to increase ReadFile performance. Instead of relying on os.ReadFile we can use the built-in bufio package. Buffio implements a buffered reader and writer of the same interface.

func ReadFile(filePath string) (io.Reader, func()) {
  file, err := os.Open(filePath)
  if err != nil {
    panic(fmt.Errorf("failed to open file: %w", err))
  }

  return bufio.NewReader(file), func() {
    _ = file.Close()
  }
}

func ParseJSON(data io.Reader) []object.User {
  var users []object.User

  if err := json.NewDecoder(data).Decode(&users); err != nil {
    panic(fmt.Errorf("failed to unmarshal JSON: %w", err))
  }

  return users
}

func GetJSONHandler(c *gin.Context) {
  data, cancel := ReadFile("../object/dummy_data.json")

  users := ParseJSON(data)
  // Close the file reader after reading the content.
  // Note: it can be deferred, but it's better to close it as soon as possible.
  cancel()
  
  ...
}

The primary distinction between buffered I/O and standard I/O lies in how data is processed. Buffered I/O operates by reading or writing data in blocks, rather than handling one byte at a time.

Let's run our tests and compare the results!

BenchmarkServeApp-10     189   5937594 ns/op 4593636 B/op   31772 allocs/op
BenchmarkServeApp-10     190   6154874 ns/op 4559697 B/op   31766 allocs/op
...

We can see that something is not right. The number of iterations decreased (from ~250 in Stage 2 to 190). Next, I will run a benchstat to compare results further:

benchstat -table .fullname ./stage2/stage2.txt ./stage3/stage3.txt
.fullname: ServeApp-10
 ./stage2/stage2.txt         ./stage3/stage3.txt       sec/op   sec/op     vs baseServeApp-10           4.795m ± 2%   6.155m ± 3%  +28.37% (p=0.000 n=10)

 ./stage2/stage2.txt         ./stage3/stage3.txt        B/op     B/op      vs baseServeApp-10          5.699Mi ± 2%   4.235Mi ± 3%  -25.70% (p=0.000 n=10)

 ./stage2/stage2.txt        ./stage3/stage3.txt      allocs/op  allocs/op   vs baseServeApp-10           31.75k ± 0%   31.76k ± 0%  +0.04% (p=0.012 n=10)

The execution time per operation (sec/op) increased by 28% (this explains the significant drop in iterations), however, the memory usage per operation (B/op) decreased by almost the same amount! So we are using less memory, but need more CPU to achieve this.

What we have learned here is that each optimization has its own drawbacks. You need to consider what you are trying to achieve - whether it's speeding up your application or increasing its performance. These objectives do not always go hand in hand.

I will continue to focus on speeding up the application here, so we should devise a different approach suitable for our case. Since we are consistently serving only one file, and it remains intact at all times, we can cache it.

The best approach here will be to use sync.Once struct from the standard package. It will ensure that ReadFile saves the data just once.

var (
  once    sync.Once
  content []byte
)

func ReadFile(filePath string) []byte {
    once.Do(func() {
        var err error
        content, err = os.ReadFile(filePath)
        if err != nil {
            panic(fmt.Errorf("failed to read file: %w", err))
        }
    })
    return content
}

Neat and simple. We should run our benchmarks to see the impact:

BenchmarkServeApp-10     280   4208854 ns/op 4188958 B/op   31745 allocs/op
BenchmarkServeApp-10     286   4146523 ns/op 4165174 B/op   31741 allocs/op
...

OK, that was better, really better! The number of iterations increased from Stage 2 by about 10-12%. Let's look at the benchstat as we did before:

benchstat -table .fullname ./stage2/stage2.txt ./stage3/stage3.txt
.fullname: ServeApp-10
 ./stage2/stage2.txt        ./stage3/stage3.txt       sec/op   sec/op     vs baseServeApp-10           4.795m ± 2%   4.329m ± 3%  -9.71% (p=0.000 n=10)

 ./stage2/stage2.txt         ./stage3/stage3.txt        B/op     B/op      vs baseServeApp-10          5.699Mi ± 2%   3.928Mi ± 2%  -31.07% (p=0.000 n=10)

 ./stage2/stage2.txt        ./stage3/stage3.txt      allocs/op  allocs/op   vs baseServeApp-10           31.75k ± 0%   31.74k ± 0%  -0.03% (p=0.000 n=10)

After caching the file, the benchmarking results show a significant performance improvement. The operation time decreased by approximately 9.71%, the memory usage declined by around 31.07%, and the number of allocations per operation remained almost the same.

pprof flame stage 3

Now you can't even find the ReadFile function from the pprof UI! Furthermore, good to mention that the total cumulative memory usage has noticeably reduced from 20 GB to 15 GB. I'm pleased with the results and will continue further!

Stage 4: Pointers & Loops.

I think that the optimizations described in this chapter were the most obvious from the start. The first reason I kept them for so long is that you are most likely to use pointers automatically and can spot unnecessary loops on your own. The second reason is that while this approach matters, it will only help us just a tiny bit in comparison with the previous things.

I will change 3 things in the code:

  1. Replace all []object.User with []*object.User. If you are new to this, you might want to do something like *[], but it will make things worse in any use case. Copying pointers is a cheap operation.
  2. Combine BalancesTotals & TransactionsTotals into one function to remove redundant loops. Sidenote: you can also flatten the data, so you can remove a nested loop, but that will require a bit more time to cook.
  3. Count transactions with len()

I think that is enough of the code being shown here already. If you're interested, you can see the full example in the GitHub repository: samgozman/golang-optimization-stages/tree/main/stage3. Let's move to benchmarking once again…

BenchmarkServeApp-10     277   3953330 ns/op 3697193 B/op   31744 allocs/op
BenchmarkServeApp-10     306   3929780 ns/op 3693402 B/op   31739 allocs/op
...

Just a slight difference, benchstat will show us a bit more:

benchstat -table .fullname ./stage3/stage3.txt ./stage4/stage4.txt
.fullname: ServeApp-10
 ./stage3/stage3.txt      ./stage4/stage4.txt       sec/op   sec/op     vs baseServeApp-10           4.329m ± 3%   4.264m ± 7%  ~ (p=0.684 n=10)

 ./stage3/stage3.txt         ./stage4/stage4.txt        B/op     B/op      vs baseServeApp-10          3.928Mi ± 2%   3.519Mi ± 0%  -10.42% (p=0.000 n=10)

 ./stage3/stage3.txt      ./stage4/stage4.txt      allocs/op  allocs/op   vs baseServeApp-10           31.74k ± 0%   31.73k ± 0%  ~ (p=0.467 n=10)

The memory usage declined by 10% from the previous iteration, everything else is the same. Most of the impact here comes from switching to pointers, the loop optimization hasn't affected the performance much.

Stage 5: Cache Cheating.

And now it is the time for a little cheat which is only applicable to our code example! Remember that our router only serves one file every single time? It is not that uncommon, especially if your server does the job of serving front-end texts. In Stage 3 I introduced sync.Once for the ReadFile function. But we can use it for the whole handler logic!

var (
  once     sync.Once
  response gin.H
)

func GetJSONHandler(c *gin.Context) {
  once.Do(func() {
    // 1. Read the content of the file dummy_data.json
    data := ReadFile("../object/dummy_data.json")

    // 2. Return a JSON response with a message
    users := ParseJSON(data)

    // 3. Calculate the total balances and transactions
    currents, pendings, transactionsSum, transactionsCount := GetTotals(users)

    response = gin.H{
      "current":            currents.String(),
      "pending":            pendings.String(),
      "transactions_sum":   transactionsSum.String(),
      "transactions_count": transactionsCount,
    }
  })

  c.JSON(200, response)
}

I have moved once.Do from ReadFile into GetJSONHandler. Now we only create this response once, which makes all our previous optimizations kinda useless. But I wanted to show them because this particular cheat can only be used in some very specific examples. In more complicated scenarios, you will probably cache responses in some Redis-like key-value storage via router middleware. Nevertheless, the global idea is the same: use cache.

BenchmarkServeApp-10    3146    466019 ns/op   24553 B/op     174 allocs/op
BenchmarkServeApp-10    2426    511220 ns/op   22710 B/op     171 allocs/op
...

Astonishing results!

benchstat -table .fullname ./stage4/stage4.txt ./stage5/stage5.txt
.fullname: ServeApp-10
 ./stage4/stage4.txt          ./stage5/stage5.txt       sec/op    sec/op      vs baseServeApp-10          4264.5µ ± 7%   805.6µ ± 340%  -81.11% (p=0.000 n=10)

 ./stage4/stage4.txt          ./stage5/stage5.txt        B/op     B/op       vs baseServeApp-10        3603.50Ki ± 0%   27.12Ki ± 12%  -99.25% (p=0.000 n=10)

 ./stage4/stage4.txt        ./stage5/stage5.txt      allocs/op allocs/op   vs baseServeApp-10          31733.5 ± 0%   174.0 ± 5%  -99.45% (p=0.000 n=10)

Now this router barely uses anything, it is just crazy! Yes, it is cheating, but the legal one in the case of this example. I suggest to look at our alloc_space graph for the last time, just for fun:

pprof flame stage 5

It is difficult to notice any function on this graph that we've been optimizing for all this time! The total cumulative memory usage is now 1/8 of its original value. The same applies to the CPU profile as well.

pprof flame cpu stage 5

Summary.

It was too much information for one blog post, but I'm glad I didn't split it in half! We moved from our first naive approach, which could hold up to 78 requests per second, to the optimal approach in stage 4 which helped to increase RPS up to 279 on average. And in stage 5 we've used some illegal cache cheating to move the plank even higher to up to 1487 RPS.

Everything that we've covered here is about how to approach the optimization process, what steps to take, and how to use pprof and benchmarks in Golang.

Some tips and notes:

  • Never engage in premature optimizations
  • Read more about pprof in the README file: github.com/google/pprof
  • You can optimize work with large objects by using sync.Pool method. It has its drawbacks, but will help to reduce the number of allocations in some cases
  • The sync.Once function will lead us to the usage of global variables, which is an antipattern in most languages. Use this thing wisely and never export those variables to the outside
  • If you want to bench some part of the application inside the test, be sure not to use any mocks in it, as it will lead to some troubles: testing.B interface is not expected by most mocking techniques (testing.T)
  • If your benchmark setup process takes significant time before the actual testing - use b.ResetTimer() function to reset timer statistics
  • Use net/http/pprof package to run pprof API alongside your application (this can help to "debug" some problems on production)
  • When optimizing your code, strive not to sacrifice readability

I hope you found this information useful and see you in the next post! Subscribe for future updates :)

6 May

2024