一文读不懂的 Go 1.21 GA 的 PGO 优化——一次在 WebP Server Go 上的尝试

Sep 15, 2023

This article is also available in English, at Go 1.21 GA’s PGO Optimization That Can’t Be Understood from a Single Article - An Attempt on a WebP Server Go.

PGO

PGO ，全称 Profile-guided optimization，在 Wikipedia 上的介绍是：

Profile-guided optimization (PGO, sometimes pronounced as pogo[1]), also known as profile-directed feedback (PDF),[2] and feedback-directed optimization (FDO)[3] is a compiler optimization technique in computer programming that uses profiling to improve program runtime performance.

Go 1.21 中 GA 了 PGO 的功能，目前的优化项只有「函数 Inline 」和「去虚拟化」这两个，关于这两个功能由于我们也不是语言专家不好班门弄斧，所以建议有能力的读者自行学习相关内容，我们在文末的 References 中也链接了一些我们认为写的还不错的（至少看上去不是开公众号割韭菜的）文章。

正好我们 WebP Server Go 也升级到了 1.21 版本，所以我们决定用这个项目来测试看 PGO 是否能对于项目有所提升。

Go PGO

关于 PGO ，在 Go 的官方资料中有如下文档/博客有介绍：

简单来说如果在 Go 下要使用 PGO 的话步骤如下：

想办法在自己的生产环境跑的程序中开 pprof 收集信息（且 1.21 版本目前只支持来自 net/http/pprof 和 runtime/pprof 采集的信息）
将收集到的 pprof 文件改名为 default.pgo 放在项目目录下（或者也可以是别的名字，需要在构建时带上 -pgo 指定 pprof 文件）
重新构建项目

WebP Server Go

根据 Go 文档 Profile-guided optimization in Go 1.21 的要求，在 PGO 前需要尽可能在生产环境收集 profile 信息来辅助编译优化，由于我们创造了 WebP Server Go，我们可太熟悉这个程序生产环境是怎么样的了，甚至我们还写了个仓库 https://github.com/webp-sh/webp_bench/ 用来方便地生成不同版本的 CPU/RAM 使用率图，用来方便我们排查不同版本之间可能的性能回退和内存使用（泄漏）问题，例如 0.9.1 版本的图如下：

0.9.0 的图如下：

从图中我们可以看出 0.9.1 相比较 0.9.0 版本在转换相同的图片的时候消耗了更多的内存和 CPU 时间（多么失败的一个版本哇）。当然，我们对目前所有的 WebP Server Go 版本都做了类似的测试，完整的报告在我们的文档 https://docs.webp.sh/benchmark/ 看到。

后续发现是由于修改了 ReductionEffort 导致的，这也充分说明了持续对程序进行测试的重要性。

我们来简单梳理一下 WebP Server Go 是怎么工作的：

一个请求发到 WebP Server Go 上，比如 GET /123.jpg
WebP Server Go 先判断 .jpg 部分是否在配置文件允许的文件后缀中
1. 如果在的话，那么判断本地是否存在 123.jpg 文件，如果存在的话就会判断是否有转换好的对应的 WebP 格式的图片
  1. 如果有 WebP 格式图片且体积比原图小的话就输出 WebP 格式
  2. 如果 WebP 格式更大的话就输出原图
  3. 如果 WebP 格式图片不存在的话就转换并判断体积后输出体积小的文件（同上）
2. 如果不在的话，直接返回 File extension not allowed!

我们想知道这个流程中的一些代码能不能被 PGO 自动优化，所以第一步我们在程序中加入 pprof 的部分，目的说用于生成Profile文件用于后续优化。一般来说我们要利用Go的 net/http/pprof 或 runtime/pprof

net/http/pprof

如果你的应用是 HTTP Server，那么

import _ "net/http/pprof"

以副作用的形式导入，这样你的 Web 应用就会自动带一个 /debug/pprof/profile 可供访问。通常这样就足够了！

但是 WebP Server Go 的场景下，我们使用了 * 路由匹配，那么意味着直接应用 net/http/pprof 会导致 /debug/pprof/profile 永远不会被匹配到。此时我们就要想些别的办法了

fiber

如果你用 fiber 的话，fiber 提供了一个很好用的中间件

import (
  "github.com/gofiber/fiber/v2"
  "github.com/gofiber/fiber/v2/middleware/pprof"
...
)

func main() {

...
	app.Use(pprof.New())
...
}

手动处理路由问题

或者，你更想掌控一切

app.Get("/debug/pprof", adaptor.HTTPHandlerFunc(pprof.Index))
app.Get("/*", handler.Convert)

runtime/pprof

如果你的应用不是 HTTP Server，或者不想暴露 Debug 信息，或者想单独衡量某个函数，而不是整个应用等，那么此时就要使用 runtime/pprof 了。runtime/pprof 要求我们通过 pprof.StartCPUProfile(f) 开启，pprof.StopCPUProfile() 结束。举例如下

f, _ := os.Create("cpu.pprof")
pprof.StartCPUProfile(f)
// some code for testing...

pprof.StopCPUProfile()
f.Close()

在 fiber 的场景下，我们是这么搞的，非常灵活，不仅不会暴露路由，还可以通过信号控制 profile 文件

f, _ := os.Create("cpu.pprof")
pprof.StartCPUProfile(f)
signals := make(chan os.Signal, 1)
signal.Notify(signals, syscall.SIGINT, syscall.SIGTERM, syscall.SIGUSR1)

go func() {
	s := <-signals
	fmt.Println("Stopping...", s)
	pprof.StopCPUProfile()
	f.Close()
	app.Shutdown()
}()

加入了上面的 pprof 代码之后我们便可以使用 make 来构建一个 binary 了，构建完 binary 之后我们魔改一下 https://github.com/webp-sh/webp_bench/blob/master/bench.sh 代码，删除构建的部分，修改后的代码如下：

#!/bin/bash

# Usage example: ./bench.sh 0.9.0
version="$1"
# Check if the version was provided
if [ -z "$version" ]; then
    echo "Please provide the version as the first command-line argument."
    exit 1
fi

## Prepare env
rm -rf exhaust metadata remote-raw
mkdir -p benchmark
find ./pics -type f -exec basename {} \\; | xargs -I {} echo "curl -s -H 'Accept: image/webp' http://localhost:3333/{}" > curl.sh
chmod +x curl.sh

## Build Start WebP Server Go
LD_PRLOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./webp-server-linux-amd64 &

# Step 2: Find the PID of the server process
server_pid=$(ps -aux | grep "webp-server-linux-amd64" | grep -v grep | awk '{print $2}')

# Step 3: Check if the PID was found and execute psrecord if found
if [ -n "$server_pid" ]; then
    # Get the version from the first command-line argument
    psrecord $server_pid --plot "./benchmark/$version.png" &
    # Execute the curl script
    ./curl.sh
    # Get server_pid RAM usage and running time
    # 0.9.0 1200 2:01
    # version RAM(MB) Time(MM:SS)
    echo "$version $(($(ps -p $server_pid -o rss=) / 1024)) $(ps -p $server_pid -o etime=)" >> ./benchmark/data.txt
else
    echo "Server process not found."
fi

# Kill WebP Server Go and psrecord
kill $server_pid
rm -rf exhaust metadata remote-raw curl.sh

上面的代码会做以下几件事情：

清理本地的缓存图片目录
针对 ./pics 目录下 2.4G 左右的图片测试集生成串行的 cURL 指令（用于后续发给 WebP Server Go）
启动本地文件名为 webp-server-linux-amd64 的 WebP Server Go 程序
将 psrecord 给 attach 到 webp-server-linux-amd64 进程上用于收集 CPU 信息和内存使用信息
在 cURL 指令序列完成之后停止整个程序并杀掉 webp-server-linux-amd64 进行（psrecord 也会自动停止并生成图）

我们来运行一遍：

bash bench-new.sh baseline-1

INFO[2023-09-08 21:15:17][47:main.setupLogger] WebP Server Go ready.                        

 
▌ ▌   ▌  ▛▀▖ ▞▀▖                ▞▀▖
▌▖▌▞▀▖▛▀▖▙▄▘ ▚▄ ▞▀▖▙▀▖▌ ▌▞▀▖▙▀▖ ▌▄▖▞▀▖
▙▚▌▛▀ ▌ ▌▌   ▖ ▌▛▀ ▌  ▐▐ ▛▀ ▌   ▌ ▌▌ ▌
▘ ▘▝▀▘▀▀ ▘   ▝▀ ▝▀▘▘   ▘ ▝▀▘▘   ▝▀ ▝▀

WebP Server Go - v0.9.9
Develop by WebP Server team. https://github.com/webp-sh

WebP Server Go is Running on http://127.0.0.1:3333
WARN[2023-09-08 21:15:17][36:webp_server_go/helper.ReadMetadata] can't read metadata: open metadata/local/c1b6abd2b92cef19.json: no such file or directory 
Attaching to process 3793955
INFO[2023-09-08 21:15:18][295:webp_server_go/encoder.convertLog] WebP@80%: pics/DSC05982.jpg->exhaust/local/c1b6abd2b92cef19.webp 10093215->1560684 15.46% deflated
...
...
- [2023-09-08 21:17:00]  GET      /DSC06071.jpg  200   curl/7.82.0

Process finished (104.92 seconds)

对应的图如下：

现在我们知道了这一次 Benchmark 差不多需要 104.92s 之后，我们便可以通过 /debug/pprof/ 接口来采集一部分信息了，我们用相同的指令 Benchmark 一次，只不过这一次我们额外开个终端用 cURL 保存一下 pprof 信息：

curl -o cpu.pprof "http://localhost:3333/debug/pprof/profile?seconds=80"

我们只收集 80s 的信息，所以肯定能在 Benchmark 跑完前拿到

拿到的文件如下：

du -ch cpu.pprof 
28K	cpu.pprof
28K	total

go tool pprof  cpu.pprof
File: webp-server-linux-amd64
Build ID: e5de6a2d46540736f77d42a119799b2504d38a82
Type: cpu
Time: Sep 8, 2023 at 9:18pm (CST)
Duration: 80.11s, Total samples = 104.71s (130.70%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 102.99s, 98.36% of 104.71s total
Dropped 189 nodes (cum <= 0.52s)
Showing top 10 nodes out of 41
      flat  flat%   sum%        cum   cum%
    57.70s 55.10% 55.10%     57.70s 55.10%  runtime.cgocall
    26.74s 25.54% 80.64%     26.74s 25.54%  [libc.so.6]
    15.18s 14.50% 95.14%     15.18s 14.50%  [libjpeg.so.62.3.0]
     1.03s  0.98% 96.12%      1.03s  0.98%  [libglib-2.0.so.0.7200.3]
     1.02s  0.97% 97.10%      1.02s  0.97%  [libvips.so.42.14.2]
     0.75s  0.72% 97.81%      0.75s  0.72%  runtime/internal/syscall.Syscall6
     0.57s  0.54% 98.36%      0.57s  0.54%  runtime.memclrNoHeapPointers
         0     0% 98.36%     57.44s 54.86%  github.com/davidbyttow/govips/v2/vips.(*ImageRef).ExportWebp
         0     0% 98.36%     57.33s 54.75%  github.com/davidbyttow/govips/v2/vips._Cfunc_save_to_buffer
         0     0% 98.36%     57.44s 54.86%  github.com/davidbyttow/govips/v2/vips.vipsSaveToBuffer

我们继续根据文档的指示，将这个 cpu.pprof 文件改名为 default.pgo 文件，并放到我们修改了源代码的 WebP Server Go 目录下，并重新构建 Binary

此时你会注意到构建时间长了一些

构建后可以通过最后一行的 -pgo 来判断这次构建已经带上了刚刚的 pgo 文件。

go version -m ./builds/webp-server-linux-amd64

./builds/webp-server-linux-amd64: go1.21.0
	path	webp_server_go
	mod	webp_server_go	(devel)	
	dep	github.com/andybalholm/brotli	v1.0.5	h1:8uQZIdzKmjc/iuPu7O2ioW48L81FgatrcpfFmiq/cCs=
...
...
	build	vcs.modified=true
	build	-pgo=/home/Nova/bench/webp_server_go/default.pgo

Benchmark

由于我们知道 Encode 过程可能不太会有什么提升，所以我们直接测试非 Encode 过程，也就是上文「 WebP Server Go 是怎么工作的」中的以下两个步骤。

一个请求发到 WebP Server Go 上，比如 GET /123.jpg
WebP Server Go 先判断 .jpg 部分是否在配置文件允许的文件后缀中

测试使用 ApacheBench 进行，指令如下：

ab -k -c 5 -n 20000 http://127.0.0.1:3333/

我们先启动旧的 binary 并开始 Bench ，Bench 三次，结果如下：

Concurrency Level:      5
Time taken for tests:   0.431 seconds
Complete requests:      20000
Failed requests:        0
Non-2xx responses:      20000
Keep-Alive requests:    20000
Total transferred:      4060000 bytes
HTML transferred:       580000 bytes
Requests per second:    46352.31 [#/sec] (mean)
Time per request:       0.108 [ms] (mean)
Time per request:       0.022 [ms] (mean, across all concurrent requests)
Transfer rate:          9188.98 [Kbytes/sec] received

Concurrency Level:      5
Time taken for tests:   0.454 seconds
Complete requests:      20000
Failed requests:        0
Non-2xx responses:      20000
Keep-Alive requests:    20000
Total transferred:      4060000 bytes
HTML transferred:       580000 bytes
Requests per second:    44060.24 [#/sec] (mean)
Time per request:       0.113 [ms] (mean)
Time per request:       0.023 [ms] (mean, across all concurrent requests)
Transfer rate:          8734.60 [Kbytes/sec] received

Concurrency Level:      5
Time taken for tests:   0.426 seconds
Complete requests:      20000
Failed requests:        0
Non-2xx responses:      20000
Keep-Alive requests:    20000
Total transferred:      4060000 bytes
HTML transferred:       580000 bytes
Requests per second:    46912.68 [#/sec] (mean)
Time per request:       0.107 [ms] (mean)
Time per request:       0.021 [ms] (mean, across all concurrent requests)
Transfer rate:          9300.07 [Kbytes/sec] received

根据 Time per request 来看，上面三次测试平均结果为 0.022 [ms]，现在启动新的 Binary 并再用同样的指令 Bench，Bench 三次，结果如下：

Concurrency Level:      5
Time taken for tests:   0.424 seconds
Complete requests:      20000
Failed requests:        0
Non-2xx responses:      20000
Keep-Alive requests:    20000
Total transferred:      4060000 bytes
HTML transferred:       580000 bytes
Requests per second:    47198.64 [#/sec] (mean)
Time per request:       0.106 [ms] (mean)
Time per request:       0.021 [ms] (mean, across all concurrent requests)
Transfer rate:          9356.76 [Kbytes/sec] received

Concurrency Level:      5
Time taken for tests:   0.424 seconds
Complete requests:      20000
Failed requests:        0
Non-2xx responses:      20000
Keep-Alive requests:    20000
Total transferred:      4060000 bytes
HTML transferred:       580000 bytes
Requests per second:    47140.68 [#/sec] (mean)
Time per request:       0.106 [ms] (mean)
Time per request:       0.021 [ms] (mean, across all concurrent requests)
Transfer rate:          9345.27 [Kbytes/sec] received

Concurrency Level:      5
Time taken for tests:   0.405 seconds
Complete requests:      20000
Failed requests:        0
Non-2xx responses:      20000
Keep-Alive requests:    20000
Total transferred:      4060000 bytes
HTML transferred:       580000 bytes
Requests per second:    49411.26 [#/sec] (mean)
Time per request:       0.101 [ms] (mean)
Time per request:       0.020 [ms] (mean, across all concurrent requests)
Transfer rate:          9795.40 [Kbytes/sec] received

根据 Time per request 来看，上面三次测试平均结果为 0.020 [ms]，此时 Bench 脚本生成图如下：

时间从上文的 104s 变到了 99s，提升了约 4%。

之后我们对比了一下两次用 webp_bench 仓库的数据，指令是 go tool pprof -diff_base cpu-old.pprof cpu.pprof：

File: webp-server-linux-amd64
Build ID: 76103c6f56c404678a45a9e9e42ecee8765f3110
Type: cpu
Time: Sep 8, 2023 at 10:17pm (CST)
Duration: 160.23s, Total samples = 106.09s (66.21%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top -cum
Showing nodes accounting for -0.96s, 0.9% of 106.09s total
Showing top 10 nodes out of 256
      flat  flat%   sum%        cum   cum%
         0     0%     0%     -0.76s  0.72%  runtime._ExternalCode
    -0.53s   0.5%   0.5%     -0.53s   0.5%  [libc.so.6]
    -0.25s  0.24%  0.74%     -0.25s  0.24%  [libglib-2.0.so.0.7200.3]
         0     0%  0.74%     -0.19s  0.18%  os.ReadFile
    -0.18s  0.17%   0.9%     -0.18s  0.17%  runtime.memclrNoHeapPointers
         0     0%   0.9%     -0.18s  0.17%  runtime.memclrNoHeapPointersChunked
         0     0%   0.9%     -0.15s  0.14%  runtime.makeslice
         0     0%   0.9%     -0.14s  0.13%  github.com/gofiber/fiber/v2.(*Ctx).Next
         0     0%   0.9%     -0.14s  0.13%  github.com/gofiber/fiber/v2/middleware/etag.New.func1
         0     0%   0.9%     -0.14s  0.13%  github.com/gofiber/fiber/v2/middleware/pprof.New.func1
(pprof) top
Showing nodes accounting for -1.03s, 0.97% of 106.09s total
Showing top 10 nodes out of 256
      flat  flat%   sum%        cum   cum%
    -0.53s   0.5%   0.5%     -0.53s   0.5%  [libc.so.6]
    -0.25s  0.24%  0.74%     -0.25s  0.24%  [libglib-2.0.so.0.7200.3]
    -0.18s  0.17%   0.9%     -0.18s  0.17%  runtime.memclrNoHeapPointers
     0.11s   0.1%   0.8%      0.11s   0.1%  [libvips.so.42.14.2]
    -0.09s 0.085%  0.89%     -0.12s  0.11%  runtime.scanobject
    -0.08s 0.075%  0.96%     -0.08s 0.075%  [libjpeg.so.62.3.0]
    -0.06s 0.057%  1.02%     -0.05s 0.047%  runtime.cgocall
     0.04s 0.038%  0.98%      0.04s 0.038%  runtime.futex
     0.04s 0.038%  0.94%      0.04s 0.038%  runtime/internal/syscall.Syscall6
    -0.03s 0.028%  0.97%     -0.03s 0.028%  runtime.memmove

小结

上文便是我们在发现 Go 1.21 中 PGO GA 之后的在 WebP Server Go 上的一次尝试，虽然取得了一些提升的效果，但是我们依然不是很确定这个是由于 PGO 产生的还是由于系统误差产生的（虽然我们已经尝试多次测试减少误差的产生）。

虽然 PGO 已经 GA，但是其实根据官方文档来看能做到的优化也非常有限，对于 WebP Server Go 这样轻业务逻辑，重度依赖 libvips 底层代码的程序来说提升就更加不明显。

这也是本文标题为「一文读不懂的 Go 1.21 GA 的 PGO 优化」的原因，因为我们没有清华，私募量化交易和互联网大厂经验，也不开公众号来割韭菜，也不会翻译了一下 Golang 官方博客就发布一篇「一文读懂Go 1.20引入的PGO性能优化」类似名字的文章。

References

https://go.dev/blog/pgo
https://zhuanlan.zhihu.com/p/609529412 （网页镜像： https://archive.ph/dF2LN ）

WebP Cloud Services 团队是一个来自上海和马尔默的三人小团队，由于我们不融资，且没有盈利压力，所以我们会坚持做我们认为正确的事情，力求在我们的资源和能力允许范围内尽量把事情做到最好，同时也会在不影响对外提供的服务的情况下整更多的活，并在我们产品上实践各种新奇的东西。

如果你觉得我们的这个服务有意思或者对我们服务感兴趣，欢迎登录 WebP Cloud Dashboard 来体验，如果你好奇它还有哪些神奇的功能，可以来看看我们的文档 WebP Cloud Services Docs，希望大家玩的开心~

Discuss on Hacker News