CPU usage keeps creeping up? A record of troubleshooting a Goroutine leak
In the previous article, “How can we dynamically modify the LogLevel of a Go program at runtime?”, we described how the logs of our WebPLB service were being flooded by a control request we send to ourselves, which led to the story of dynamically changing the LogLevel with signals.
Because we use the Fiber framework, the default Logger middleware writes a log line for every request. We can change the LogLevel of some logs inside the application, for example from:
log.Infoln("converting image to sRGB color space")
to
log.Debugln("converting image to sRGB color space")
but Fiber still logs every request itself. This led to a new requirement: how do we avoid logging certain requests?
For example, our control request is /some_kind_of_webp_cloud_command, so in one change we added the following code:
// Fiber logger format
app.Use(func(c *fiber.Ctx) error {
if c.Path() == "/some_kind_of_webp_cloud_command" {
return c.Next()
}
return logger.New(logger.Config{
Format: "${ip} - [${time}] ${method} ${url} ${status} ${referer} ${ua}\n",
TimeFormat: "2006-01-02 15:04:05",
})(c)
})
This way, no log is recorded for that specific path: c.Next() skips the Logger middleware entirely and hands the request to the next handler.
We also bumped some dependencies, such as:
github.com/ClickHouse/clickhouse-go/v2
github.com/gofiber/fiber/v2
github.com/valyala/fasthttp
gorm.io/driver/mysql
Additionally, we added the LogLevel modification mentioned in the previous article, and then we deployed the new version.
And then… we observed that the CPU usage of all our machines was gradually increasing. For example, this is the CPU load monitoring of a machine in our Germany region:
Each time the CPU usage drops, it indicates that we restarted the container (deployed a new version).
Now the question is, why is this happening?
- Is it because one of our scheduled tasks (such as cleaning local cache) is getting stuck due to traversing the file system? (But IO Wait is not high)
- Could it be a problem with a specific version of Fiber or ClickHouse?
- Or does the code listening for system signals have an issue causing a loop?
Interested readers can pause here. If you have encountered similar issues, how would you troubleshoot them?
Let's start by ruling out the three hypotheses above. Our scheduled task does traverse the file system, and calling fstat does raise CPU usage, but unless runs of the task overlap, that cannot make CPU usage climb steadily.
To investigate this issue, we added PProf to WebPLB. This step is relatively simple; you can refer to the Fiber documentation:
import (
"github.com/gofiber/fiber/v2"
"github.com/gofiber/fiber/v2/middleware/pprof"
)
app.Use(pprof.New())
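With the middleware registered, the standard pprof endpoints are served under /debug/pprof/, so a CPU profile can be pulled remotely with go tool pprof. A minimal example (the host and port here are placeholders, not our real endpoint):
go tool pprof 'http://localhost:3000/debug/pprof/profile?seconds=50'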
After deploying PProf, we waited for a day for the CPU usage to increase and obtained a CPU profile:
File: webplb
Build ID: 2d0e21c64238ef4dcfc1e549023659cb8b637004
Type: cpu
Duration: 50.14s, Total samples = 16.47s (32.85%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top20
Showing nodes accounting for 10650ms, 64.66% of 16470ms total
Dropped 227 nodes (cum <= 82.35ms)
Showing top 20 nodes out of 169
flat flat% sum% cum cum%
1670ms 10.14% 10.14% 1710ms 10.38% runtime.siftdownTimer
1410ms 8.56% 18.70% 1410ms 8.56% runtime/internal/syscall.Syscall6
830ms 5.04% 23.74% 830ms 5.04% runtime.futex
740ms 4.49% 28.23% 740ms 4.49% runtime/internal/atomic.(*Uint32).Load (inline)
610ms 3.70% 31.94% 5250ms 31.88% github.com/gofiber/fiber/v2/middleware/logger.New.func1
610ms 3.70% 35.64% 610ms 3.70% runtime/internal/atomic.(*Uint32).CompareAndSwap (inline)
580ms 3.52% 39.16% 590ms 3.58% runtime.gopark
510ms 3.10% 42.26% 510ms 3.10% runtime.nanotime (inline)
500ms 3.04% 45.29% 1600ms 9.71% time.Time.appendFormat
430ms 2.61% 47.91% 430ms 2.61% sync/atomic.(*Value).Store
400ms 2.43% 50.33% 3870ms 23.50% runtime.runtimer
350ms 2.13% 52.46% 530ms 3.22% runtime.lock2
330ms 2.00% 54.46% 330ms 2.00% [libc.so.6]
Hey, this is interesting. What is runtime.siftdownTimer? Why does github.com/gofiber/fiber/v2/middleware/logger.New.func1 have such a high cumulative time? Let's look at the call stack:
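(The graph we looked at can be reproduced from the same interactive pprof session; for example, peek lists the callers and callees of functions matching a regex, and web renders the call graph if graphviz is installed:)
(pprof) peek siftdownTimer
(pprof) web logger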
This indeed looks like a problem with Fiber’s Logger!
So we immediately diffed the newly bumped Fiber and ClickHouse versions against the versions we had been using before, but found no relevant changes in the Logger or in the other modified parts.
That brought us back to the other question: what is runtime.siftdownTimer, and why does it have such a high flat time?
After searching online, we didn’t find much useful information, except for this post: https://forum.golangbridge.org/t/runtime-siftdowntimer-consuming-60-of-the-cpu/3773
In the post, someone had a peculiar use of Ticker:
func() {
	for {
		select {
		case <-time.Tick(10 * time.Millisecond):
		}
	}
}
However, in our code every Ticker is created inside the function that uses it and stopped before the function returns; we never allocate Tickers inside a loop or leave them running unused, roughly following the pattern sketched below.
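(A minimal sketch of that pattern, with a made-up runPeriodically helper rather than our actual task code: the Ticker is created once and always stopped on return, so no timers pile up.)
// Uses only the standard "context" and "time" packages.
func runPeriodically(ctx context.Context, interval time.Duration, task func()) {
	// Create the Ticker once, outside the loop, and always stop it on return.
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Run the periodic task (e.g. cleaning the local cache).
			task()
		}
	}
}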
So we continued analyzing the profile and found something interesting. When inspecting goroutines, we noticed that every new request sent to WebPLB left behind one more goroutine, parked in the sleep state. After sending a handful of requests, we already had 80+ sleeping goroutines:
goroutine 85 [sleep]:
time.Sleep(0x1dcd6500)
/usr/local/go/src/runtime/time.go:195 +0x125
github.com/gofiber/fiber/v2/middleware/logger.New.func1()
/go/pkg/mod/github.com/gofiber/fiber/v2@v2.51.0/middleware/logger/logger.go:45 +0x73
created by github.com/gofiber/fiber/v2/middleware/logger.New in goroutine 35
/go/pkg/mod/github.com/gofiber/fiber/v2@v2.51.0/middleware/logger/logger.go:43 +0x2c9
...
goroutine 87 [sleep]:
time.Sleep(0x1dcd6500)
/usr/local/go/src/runtime/time.go:195 +0x125
github.com/gofiber/fiber/v2/middleware/logger.New.func1()
/go/pkg/mod/github.com/gofiber/fiber/v2@v2.51.0/middleware/logger/logger.go:45 +0x73
created by github.com/gofiber/fiber/v2/middleware/logger.New in goroutine 35
/go/pkg/mod/github.com/gofiber/fiber/v2@v2.51.0/middleware/logger/logger.go:43 +0x2c9
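(For reference, these dumps come from the goroutine endpoint that the pprof middleware exposes; assuming the service listens on port 3000, something like the following prints every goroutine with its full stack.)
curl 'http://localhost:3000/debug/pprof/goroutine?debug=2'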
We suspected the code we had just written around /some_kind_of_webp_cloud_command, so we quickly removed it and restored the original style:
// Fiber logger format
app.Use(logger.New(logger.Config{
Format: "${ip} - [${time}] ${method} ${url} ${status} ${referer} ${ua}\n",
TimeFormat: "2006-01-02 15:04:05",
}))
Then we started profiling again. This time, no matter how many requests we sent to WebPLB, the number of goroutines stayed stable, and only a single Logger-related goroutine appeared:
goroutine 20 [sleep]:
time.Sleep(0x1dcd6500)
/usr/local/go/src/runtime/time.go:195 +0x125
github.com/gofiber/fiber/v2/middleware/logger.New.func1()
/go/pkg/mod/github.com/gofiber/fiber/v2@v2.51.0/middleware/logger/logger.go:45 +0x73
created by github.com/gofiber/fiber/v2/middleware/logger.New in goroutine 1
/go/pkg/mod/github.com/gofiber/fiber/v2@v2.51.0/middleware/logger/logger.go:43 +0x2c9
So this was indeed the problem. Because our wrapper constructed a new Logger middleware on every request, and logger.New spawns a background goroutine that periodically refreshes the ${time} timestamp, Logger goroutines and their timers kept accumulating in the system, which over time produced the slow, steady rise in CPU usage.
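A minimal sketch of the mechanism (not our production code): every call to logger.New with a ${time} format leaves one of these sleeping goroutines behind, so constructing the middleware per request makes the goroutine count grow without bound.
package main

import (
	"fmt"
	"runtime"

	"github.com/gofiber/fiber/v2/middleware/logger"
)

func main() {
	for i := 0; i < 100; i++ {
		// Each call returns a fresh handler and leaves behind a
		// timestamp-refreshing goroutine that is never stopped.
		_ = logger.New(logger.Config{
			Format:     "${ip} - [${time}] ${method} ${url} ${status}\n",
			TimeFormat: "2006-01-02 15:04:05",
		})
	}
	// Roughly 100 more goroutines than a plain program would have.
	fmt.Println("goroutines:", runtime.NumGoroutine())
}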
Let’s see who wrote this:
Or rather, I suspect this code was auto-completed by Copilot.
But how do we solve this properly? According to the Fiber documentation (https://docs.gofiber.io/api/middleware/logger), the middleware's Next option is the intended way to skip logging for certain requests:
| Property | Type | Description | Default |
| --- | --- | --- | --- |
| Next | func(*fiber.Ctx) bool | Next defines a function to skip this middleware when returned true. | nil |
| Done | func(*fiber.Ctx, []byte) | Done is a function that is called after the log string for a request is written to Output, and pass the log string as parameter. | nil |
So, we modified the code to:
app.Use(logger.New(logger.Config{
Format: "${ip} - [${time}] ${method} ${url} ${status} ${referer} ${ua}\n",
TimeFormat: "2006-01-02 15:04:05",
Next: func(c *fiber.Ctx) bool {
// Ignore some paths
ignoredPaths := []string{"/some_kind_of_webp_cloud_healthz", "/some_kind_of_webp_cloud_command"}
return slices.Contains(ignoredPaths, c.Path())
},
}))
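One small note on the snippet above: slices.Contains comes from the standard library's slices package (added in Go 1.21); on older toolchains the same function is available from golang.org/x/exp/slices.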
The new version has been deployed for more than two days, and the CPU usage issue has not reappeared.
The WebP Cloud Services team is a small team of three individuals from Shanghai and Helsingborg. Since we are not funded and have no profit pressure, we remain committed to doing what we believe is right. We strive to do our best within the scope of our resources and capabilities. We also engage in various activities without affecting the services we provide to the public, and we continuously explore novel ideas in our products.
If you find this service interesting, feel free to log in to the WebP Cloud Dashboard to experience it. If you’re curious about other magical features it offers, take a look at our WebP Cloud Services Docs. We hope everyone enjoys using it!
Discuss on Hacker News