WebP Cloud Services Blog

Go 1.21 GA's PGO Optimization That Can't Be Understood from a Single Article - An Attempt on a WebP Server Go

这篇文章有简体中文版本,在: 一文读不懂的 Go 1.21 GA 的 PGO 优化——一次在 WebP Server Go 上的尝试

PGO

PGO, short for Profile-guided optimization, is described on Wikipedia as:

Profile-guided optimization (PGO, sometimes pronounced as pogo[1]), also known as profile-directed feedback (PDF),[2] and feedback-directed optimization (FDO)[3] is a compiler optimization technique in computer programming that uses profiling to improve program runtime performance.

Go 1.21 introduced PGO functionality, and currently, it includes only two optimizations: “Function Inlining” and “Devirtualization.” Regarding these two features, since we are not language experts, we recommend that capable readers explore the relevant content on their own. We have also provided links to some articles we believe are well-written (at least they don’t appear to be aimed at exploiting the general public) in the References section at the end of this article.

As it happens, we have also upgraded our WebP Server Go to version 1.21. Therefore, we decided to use this project to test whether PGO could improve our project’s performance.

Go PGO

Regarding PGO, the Go official documentation has the following documents/blogs:

In summary, if you want to use PGO with Go, follow these steps:

  • Find a way to run your program in your production environment with pprof enabled (Note: Version 1.21 currently only supports collecting information from net/http/pprof and runtime/pprof).
  • Rename the collected pprof files to default.pgo and place them in your project directory (or you can use a different name but need to specify the pprof file when building).
  • Rebuild your project.

WebP Server Go

According to the Go documentation’s requirements for Profile-guided optimization in Go 1.21, before using PGO, you should collect profile information in your production environment as much as possible to assist in compilation optimization. Since we’ve created WebP Server Go, we are quite familiar with what the production environment for this program looks like. We even created a repository, https://github.com/webp-sh/webp_bench/, to easily generate CPU/RAM usage graphs for different versions. This helps us identify potential performance regressions and memory usage (leakage) issues between different versions. For example, the graph for version 0.9.1 is as follows:

And for version 0.9.0:

From these graphs, we can see that version 0.9.1 consumes more memory and CPU time when converting the same image compared to version 0.9.0. Of course, we have conducted similar tests for all current versions of WebP Server Go, and you can find the complete report in our documentation: https://docs.webp.sh/benchmark/.

It was later discovered that this was due to changes in the ReductionEffort, which highlights the importance of continuous testing of the program.

Now, let’s briefly outline how WebP Server Go works:

  1. A request is sent to WebP Server Go, for example, GET /123.jpg.
  2. WebP Server Go first checks if the .jpg part is allowed in the configuration file.
    1. If it is allowed, then it checks if the local 123.jpg file exists. If it exists, it checks if there is a corresponding WebP format image that has been converted.
      1. If there is a WebP format image and it is smaller in size than the original image, it outputs the WebP format.
      2. If the WebP format image is larger, it outputs the original image.
      3. If the WebP format image does not exist, it converts it and outputs the smaller-sized file (as above).
    2. If it is not allowed, it directly returns “File extension not allowed!”

We want to determine if some of the code in this process can be automatically optimized by PGO, so the first step is to incorporate the pprof section into the program. The purpose is to generate a Profile file for subsequent optimization. Generally, we use Go’s net/http/pprof or runtime/pprof.

net/http/pprof

If your application is an HTTP server, then

import _ "net/http/pprof"

is imported as a side effect, so your web application will automatically include an /debug/pprof/profile accessible endpoint. Typically, this should suffice!

However, in the context of a WebP server in Go, we use * route matching, which means that directly applying net/http/pprof will cause /debug/pprof/profile to never be matched. In this case, we need to think of other solutions.

fiber

If you are using Fiber, it provides a convenient middleware for this.

import (
  "github.com/gofiber/fiber/v2"
  "github.com/gofiber/fiber/v2/middleware/pprof"
...
)

func main() {

...
	app.Use(pprof.New())
...
}

Manually Handling Routing Issues

Alternatively, if you prefer more control:

app.Get("/debug/pprof", adaptor.HTTPHandlerFunc(pprof.Index))
app.Get("/*", handler.Convert)

runtime/pprof

If your application is not an HTTP server, or you don’t want to expose debug information, or you want to measure a specific function rather than the entire application, you should use runtime/pprof. runtime/pprof requires us to start profiling using pprof.StartCPUProfile(f) and stop it using pprof.StopCPUProfile(). Here’s an example:

f, _ := os.Create("cpu.pprof")
pprof.StartCPUProfile(f)
// Some code for testing...

pprof.StopCPUProfile()
f.Close()

In the context of Fiber, it can be done very flexibly without exposing routes. You can also control the profile file through signals:

f, _ := os.Create("cpu.pprof")
pprof.StartCPUProfile(f)
signals := make(chan os.Signal, 1)
signal.Notify(signals, syscall.SIGINT, syscall.SIGTERM, syscall.SIGUSR1)

go func() {
	s := <-signals
	fmt.Println("Stopping...", s)
	pprof.StopCPUProfile()
	f.Close()
	app.Shutdown()
}()

After adding the above pprof code, we could use make to build a binary. After building the binary, we modified the https://github.com/webp-sh/webp_bench/blob/master/bench.sh script to remove the build part. The modified code is as follows:

#!/bin/bash

# Usage example: ./bench.sh 0.9.0
version="$1"
# Check if the version was provided
if [ -z "$version" ]; then
    echo "Please provide the version as the first command-line argument."
    exit 1
fi

## Prepare env
rm -rf exhaust metadata remote-raw
mkdir -p benchmark
find ./pics -type f -exec basename {} \\; | xargs -I {} echo "curl -s -H 'Accept: image/webp' http://localhost:3333/{}" > curl.sh
chmod +x curl.sh

## Build Start WebP Server Go
LD_PRLOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./webp-server-linux-amd64 &

# Step 2: Find the PID of the server process
server_pid=$(ps -aux | grep "webp-server-linux-amd64" | grep -v grep | awk '{print $2}')

# Step 3: Check if the PID was found and execute psrecord if found
if [ -n "$server_pid" ]; then
    # Get the version from the first command-line argument
    psrecord $server_pid --plot "./benchmark/$version.png" &
    # Execute the curl script
    ./curl.sh
    # Get server_pid RAM usage and running time
    # 0.9.0 1200 2:01
    # version RAM(MB) Time(MM:SS)
    echo "$version $(($(ps -p $server_pid -o rss=) / 1024)) $(ps -p $server_pid -o etime=)" >> ./benchmark/data.txt
else
    echo "Server process not found."
fi

# Kill WebP Server Go and psrecord
kill $server_pid
rm -rf exhaust metadata remote-raw curl.sh

The above code does the following:

  1. Cleans the local cache image directory.
  2. Generates a serial cURL command for a test set of images of about 2.4G under the ./pics directory (used for subsequent WebP Server Go testing).
  3. Starts the WebP Server Go program with the local filename webp-server-linux-amd64.
  4. Attaches psrecord to the webp-server-linux-amd64 process to collect CPU information and memory usage.
  5. Executes the cURL command in step 2.
  6. Collects the program’s CPU information and memory usage after the test is completed.
  7. Stops the WebP Server Go program and removes the files created in step 1.

Let’s run it:

bash bench-new.sh baseline-1
INFO[2023-09-08 21:15:17][47:main.setupLogger] WebP Server Go ready.

▌ ▌   ▌  ▛▀▖ ▞▀▖                ▞▀▖
▌▖▌▞▀▖▛▀▖▙▄▘ ▚▄ ▞▀▖▙▀▖▌ ▌▞▀▖▙▀▖ ▌▄▖▞▀▖
▙▚▌▛▀ ▌ ▌▌   ▖ ▌▛▀ ▌  ▐▐ ▛▀ ▌   ▌ ▌▌ ▌
▘ ▘▝▀▘▀▀ ▘   ▝▀ ▝▀▘▘   ▘ ▝▀▘▘   ▝▀ ▝▀

WebP Server Go - v0.9.9
Develop by WebP Server team. https://github.com/webp-sh

WebP Server Go is Running on http://127.0.0.1:3333
WARN[2023-09-08 21:15:17][36:webp_server_go/helper.ReadMetadata] can't read metadata: open metadata/local/c1b6abd2b92cef19.json: no such file or directory
Attaching to process 3793955
INFO[2023-09-08 21:15:18][295:webp_server_go/encoder.convertLog] WebP@80%: pics/DSC05982.jpg->exhaust/local/c1b6abd2b92cef19.webp 10093215->1560684 15.46% deflated
...
...
- [2023-09-08 21:17:00]  GET      /DSC06071.jpg  200   curl/7.82.0

Process finished (104.92 seconds)

The corresponding graph is as follows:

Now that we know this benchmark took approximately 104.92 seconds, we can collect some information via the /debug/pprof/ interface. We’ll run the benchmark once more, but this time we’ll use another terminal to save the pprof information using cURL:

curl -o cpu.pprof "http://localhost:3333/debug/pprof/profile?seconds=80"

We are collecting information for 80 seconds, which should be obtained before the benchmark completes.

The obtained file looks like this:

du -ch cpu.pprof 
28K	cpu.pprof
28K	total

go tool pprof  cpu.pprof
File: webp-server-linux-amd64
Build ID: e5de6a2d46540736f77d42a119799b2504d38a82
Type: cpu
Time: Sep 8, 2023 at 9:18pm (CST)
Duration: 80.11s, Total samples = 104.71s (130.70%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 102.99s, 98.36% of 104.71s total
Dropped 189 nodes (cum <= 0.52s)
Showing top 10 nodes out of 41
      flat  flat%   sum%        cum   cum%
    57.70s 55.10% 55.10%     57.70s 55.10%  runtime.cgocall
    26.74s 25.54% 80.64%     26.74s 25.54%  [libc.so.6]
    15.18s 14.50% 95.14%     15.18s 14.50%  [libjpeg.so.62.3.0]
     1.03s  0.98% 96.12%      1.03s  0.98%  [libglib-2.0.so.0.7200.3]
     1.02s  0.97% 97.10%      1.02s  0.97%  [libvips.so.42.14.2]
     0.75s  0.72% 97.81%      0.75s  0.72%  runtime/internal/syscall.Syscall6
     0.57s  0.54% 98.36%      0.57s  0.54%  runtime.memclrNoHeapPointers
         0     0% 98.36%     57.44s 54.86%  github.com/davidbyttow/govips/v2/vips.(*ImageRef).ExportWebp
         0     0% 98.36%     57.33s 54.75%  github.com/davidbyttow/govips/v2/vips._Cfunc_save_to_buffer
         0     0% 98.36%     57.44s 54.86%  github.com/davidbyttow/govips/v2/vips.vipsSaveToBuffer

Next, we rename this cpu.pprof file to default.pgo and place it in the modified source code directory of WebP Server Go. Afterward, we rebuild the binary.

You’ll notice that the build time is slightly longer this time.

You can check if this build includes the PGO file using:

go version -m ./builds/webp-server-linux-amd64

./builds/webp-server-linux-amd64: go1.21.0
	path	webp_server_go
	mod	webp_server_go	(devel)	
	dep	github.com/andybalholm/brotli	v1.0.5	h1:8uQZIdzKmjc/iuPu7O2ioW48L81FgatrcpfFmiq/cCs=
...
...
	build	vcs.modified=true
	build	-pgo=/home/Nova/bench/webp_server_go/default.pgo

Because we know that the encoding process may not see much improvement, let’s directly test the non-encoding processes, which are the two steps mentioned in the previous section, “How WebP Server Go Works”:

ab -k -c 5 -n 20000 http://127.0.0.1:3333/

First, we’ll start the old binary and run the benchmark three times, and here are the results:

Concurrency Level:      5
Time taken for tests:   0.431 seconds
Complete requests:      20000
Failed requests:        0
Non-2xx responses:      20000
Keep-Alive requests:    20000
Total transferred:      4060000 bytes
HTML transferred:       580000 bytes
Requests per second:    46352.31 [#/sec] (mean)
Time per request:       0.108 [ms] (mean)
Time per request:       0.022 [ms] (mean, across all concurrent requests)
Transfer rate:          9188.98 [Kbytes/sec] received

Concurrency Level:      5
Time taken for tests:   0.454 seconds
Complete requests:      20000
Failed requests:        0
Non-2xx responses:      20000
Keep-Alive requests:    20000
Total transferred:      4060000 bytes
HTML transferred:       580000 bytes
Requests per second:    44060.24 [#/sec] (mean)
Time per request:       0.113 [ms] (mean)
Time per request:       0.023 [ms] (mean, across all concurrent requests)
Transfer rate:          8734.60 [Kbytes/sec] received

Concurrency Level:      5
Time taken for tests:   0.426 seconds
Complete requests:      20000
Failed requests:        0
Non-2xx responses:      20000
Keep-Alive requests:    20000
Total transferred:      4060000 bytes
HTML transferred:       580000 bytes
Requests per second:    46912.68 [#/sec] (mean)
Time per request:       0.107 [ms] (mean)
Time per request:       0.021 [ms] (mean, across all concurrent requests)
Transfer rate:          9300.07 [Kbytes/sec] received

Based on the “Time per request,” the average of the three tests is 0.022 ms.

Now, let’s start the new binary and run the benchmark three times:

Concurrency Level:      5
Time taken for tests:   0.424 seconds
Complete requests:      20000
Failed requests:        0
Non-2xx responses:      20000
Keep-Alive requests:    20000
Total transferred:      4060000 bytes
HTML transferred:       580000 bytes
Requests per second:    47198.64 [#/sec] (mean)
Time per request:       0.106 [ms] (mean)
Time per request:       0.021 [ms] (mean, across all concurrent requests)
Transfer rate:          9356.76 [Kbytes/sec] received

Concurrency Level:      5
Time taken for tests:   0.424 seconds
Complete requests:      20000
Failed requests:        0
Non-2xx responses:      20000
Keep-Alive requests:    20000
Total transferred:      4060000 bytes
HTML transferred:       580000 bytes
Requests per second:    47140.68 [#/sec] (mean)
Time per request:       0.106 [ms] (mean)
Time per request:       0.

021 [ms] (mean, across all concurrent requests)
Transfer rate:          9345.27 [Kbytes/sec] received

Concurrency Level:      5
Time taken for tests:   0.405 seconds
Complete requests:      20000
Failed requests:        0
Non-2xx responses:      20000
Keep-Alive requests:    20000
Total transferred:      4060000 bytes
HTML transferred:       580000 bytes
Requests per second:    49411.26 [#/sec] (mean)
Time per request:       0.101 [ms] (mean)
Time per request:       0.020 [ms] (mean, across all concurrent requests)
Transfer rate:          9795.40 [Kbytes/sec] received

Based on the “Time per request” metric, the average result of the three tests mentioned earlier is approximately 0.020 [ms]. At this point, the benchmark script generates the following graph:

Graph

The time has decreased from the previously mentioned 104 seconds to 99 seconds, showing an improvement of approximately 4%.

Afterward, we compared the data from two runs using the webp_bench repository with the command go tool pprof -diff_base cpu-old.pprof cpu.pprof:

File: webp-server-linux-amd64
Build ID: 76103c6f56c404678a45a9e9e42ecee8765f3110
Type: cpu
Time: Sep 8, 2023, at 10:17 pm (CST)
Duration: 160.23s, Total samples = 106.09s (66.21%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top -cum
Showing nodes accounting for -0.96s, 0.9% of 106.09s total
Showing top 10 nodes out of 256
      flat  flat%   sum%        cum   cum%
         0     0%     0%     -0.76s  0.72%  runtime._ExternalCode
    -0.53s   0.5%   0.5%     -0.53s   0.5%  [libc.so.6]
    -0.25s  0.24%  0.74%     -0.25s  0.24%  [libglib-2.0.so.0.7200.3]
         0     0%  0.74%     -0.19s  0.18%  os.ReadFile
    -0.18s  0.17%   0.9%     -0.18s  0.17%  runtime.memclrNoHeapPointers
         0     0%   0.9%     -0.18s  0.17%  runtime.memclrNoHeapPointersChunked
         0     0%   0.9%     -0.15s  0.14%  runtime.makeslice
         0     0%   0.9%     -0.14s  0.13%  github.com/gofiber/fiber/v2.(*Ctx).Next
         0     0%   0.9%     -0.14s  0.13%  github.com/gofiber/fiber/v2/middleware/etag.New.func1
         0     0%   0.9%     -0.14s  0.13%  github.com/gofiber/fiber/v2/middleware/pprof.New.func1
(pprof) top
Showing nodes accounting for -1.03s, 0.97% of 106.09s total
Showing top 10 nodes out of 256
      flat  flat%   sum%        cum   cum%
    -0.53s   0.5%   0.5%     -0.53s   0.5%  [libc.so.6]
    -0.25s  0.24%  0.74%     -0.25s  0.24%  [libglib-2.0.so.0.7200.3]
    -0.18s  0.17%   0.9%     -0.18s  0.17%  runtime.memclrNoHeapPointers
     0.11s   0.1%   0.8%      0.11s   0.1%  [libvips.so.42.14.2]
    -0.09s 0.085%  0.89%     -0.12s  0.11%  runtime.scanobject
    -0.08s 0.075%  0.96%     -0.08s 0.075%  [libjpeg.so.62.3.0]
    -0.06s 0.057%  1.02%     -0.05s 0.047%  runtime.cgocall
     0.04s 0.038%  0.98%      0.04s 0.038%  runtime.futex
     0.04s 0.038%  0.94%      0.04s 0.038%  runtime/internal/syscall.Syscall6
    -0.03s 0.028%  0.97%     -0.03s 0.028%  runtime.memmove

Summary

The above text represents our attempt on WebP Server Go after discovering Go 1.21’s PGO GA. While we achieved some performance improvements, we still aren’t entirely certain if these improvements are due to PGO or caused by system errors (despite multiple tests to minimize errors).

Although PGO has been GA, according to the official documentation, the potential optimizations that can be achieved are quite limited. For programs like WebP Server Go, which have lightweight business logic and heavy reliance on low-level libvips code, the improvements may not be very significant.

This is the reason behind the title of this article, “Understanding Go 1.21 GA’s PGO Optimization in a Single Read,” as we lack experience from Tsinghua, private equity trading, and internet giants. We also don’t run a “WeChat Official account” to exploit readers, and we won’t simply translate the Go official blog and publish an article with a similar title like “Understanding PGO Performance Optimization Introduced in Go 1.20.”

References

  1. https://go.dev/blog/pgo
  2. https://zhuanlan.zhihu.com/p/609529412 (Webpage Mirror: https://archive.ph/dF2LN)

The WebP Cloud Services team is a small team of three individuals from Shanghai and Helsingborg. Since we are not funded and have no profit pressure, we remain committed to doing what we believe is right. We strive to do our best within the scope of our resources and capabilities. We also engage in various activities without affecting the services we provide to the public, and we continuously explore novel ideas in our products.

If you find this service interesting, feel free to log in to the WebP Cloud Dashboard to experience it. If you’re curious about other magical features it offers, take a look at our WebP Cloud Services Docs. We hope everyone enjoys using it!


Discuss on Hacker News