Finding and fixing a goroutine leak in a popular Go library

A while ago I contributed a fix to maroto, a popular Go library for generating PDFs: PR #499 closed a goroutine pool that was being created on every Generate() call and never shut down. Each document you generated left a set of worker goroutines parked forever.

This post walks through the anatomy of that class of bug: how to detect it, why it happens so often in library code specifically, and the rules I now follow when a library of mine spawns goroutines.

Why goroutine leaks are sneaky

A goroutine leak rarely announces itself. The program doesn’t crash, requests still succeed, and memory grows so slowly that dashboards stay green for days. The symptom usually shows up far from the cause:

Memory usage creeps up with the total number of requests served since startup, not with how busy the service is right now
runtime.NumGoroutine() only ever goes up
Latency degrades after the service has been running for a while, because the garbage collector has to scan thousands of parked goroutine stacks and everything they keep alive

In maroto’s case the leak was proportional to usage: every call to generate a PDF created a worker pool, and nothing ever told those workers to stop. A service rendering ten thousand invoices a day would accumulate tens of thousands of goroutines blocked on channel receives.

A minimal reproduction

The shape of the bug is almost always the same. Here is a simplified version of the pattern — a pool that processes jobs through a channel:

type pool struct {
	jobs chan job
}

func newPool(workers int) *pool {
	p := &pool{jobs: make(chan job)}
	for i := 0; i < workers; i++ {
		go func() {
			for j := range p.jobs { // blocks forever if jobs is never closed
				j.run()
			}
		}()
	}
	return p
}

Each worker blocks on for j := range p.jobs. That loop only exits when the channel is closed. If the code that owns the pool never closes it, every worker stays parked on the receive — invisible, unreachable, and never garbage collected, because a parked goroutine is a GC root.

Now imagine newPool being called inside a public API method:

func (m *Maroto) Generate() (Document, error) {
	p := newPool(runtime.NumCPU()) // new pool per call
	// ... process pages through the pool ...
	return doc, nil               // pool goes out of scope, workers live on
}

The pool value gets garbage collected. The goroutines do not. That’s the whole bug.
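
To watch the leak happen, the two snippets above can be folded into one runnable program. This is a sketch rather than maroto’s code: the job type, generate, and leakedAfter are hypothetical names chosen for the example, and the only thing it measures is runtime.NumGoroutine() before and after a batch of calls.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// job is a hypothetical unit of work; the real library's job type differs.
type job func()

type pool struct {
	jobs chan job
}

// newPool starts workers that range over p.jobs. Nothing closes the channel.
func newPool(workers int) *pool {
	p := &pool{jobs: make(chan job)}
	for i := 0; i < workers; i++ {
		go func() {
			for j := range p.jobs {
				j()
			}
		}()
	}
	return p
}

// generate stands in for a public API method that builds a pool per call.
func generate(workers, pages int) {
	p := newPool(workers)
	var wg sync.WaitGroup
	wg.Add(pages)
	for i := 0; i < pages; i++ {
		p.jobs <- wg.Done
	}
	wg.Wait() // every job ran, but the workers are still parked on p.jobs
}

// leakedAfter reports how many goroutines n calls to generate left behind.
func leakedAfter(n, workers int) int {
	before := runtime.NumGoroutine()
	for i := 0; i < n; i++ {
		generate(workers, 8)
	}
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("goroutines leaked:", leakedAfter(100, 4))
}
```

Each call leaves its four workers behind, so a hundred calls should leave roughly four hundred goroutines, and the number keeps growing for as long as the program keeps calling generate.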

Detecting it

Three tools make this class of leak visible in minutes.

1. Count goroutines in a test

The cheapest detector is an assertion that calling the API repeatedly doesn’t grow the goroutine count:

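import (
	"runtime"
	"testing"
	"time"
)

// m is assumed to be a configured document builder created in test setup.
func TestGenerateDoesNotLeak(t *testing.T) {
	before := runtime.NumGoroutine()

	for i := 0; i < 10; i++ {
		if _, err := m.Generate(); err != nil {
			t.Fatal(err)
		}
	}

	runtime.GC()
	time.Sleep(100 * time.Millisecond) // let exiting goroutines finish

	// Allow a little slack for unrelated runtime or test-harness goroutines.
	if after := runtime.NumGoroutine(); after > before+2 {
		t.Fatalf("goroutines grew from %d to %d", before, after)
	}
}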

It’s crude, but it catches the “leak per call” pattern reliably, and it runs in CI forever after.
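
A fixed sleep can make a check like this flaky on a slow CI machine. One possible refinement is to poll the goroutine count until it returns to the baseline or a timeout expires. The helper names below (settles, leaks, cleanLeaks, leakyLeaks) are hypothetical, and the timeouts are arbitrary values picked for the sketch.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// settles polls runtime.NumGoroutine until it drops to want or below, or
// until timeout elapses. It reports whether the count settled in time.
func settles(want int, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for runtime.NumGoroutine() > want {
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(5 * time.Millisecond)
	}
	return true
}

// leaks runs fn and reports whether it left goroutines behind.
func leaks(fn func(), timeout time.Duration) bool {
	before := runtime.NumGoroutine()
	fn()
	return !settles(before, timeout)
}

// cleanLeaks checks a function whose goroutine exits on its own.
func cleanLeaks() bool {
	return leaks(func() {
		done := make(chan struct{})
		go func() { close(done) }()
		<-done
	}, time.Second)
}

// leakyLeaks checks a function whose goroutines stay parked on a receive.
func leakyLeaks() bool {
	stop := make(chan struct{})
	defer close(stop) // release the goroutines after the check
	return leaks(func() {
		for i := 0; i < 3; i++ {
			go func() { <-stop }()
		}
	}, 100*time.Millisecond)
}

func main() {
	fmt.Println("clean leaks:", cleanLeaks())
	fmt.Println("leaky leaks:", leakyLeaks())
}
```

The trade-off is that a leaking function now costs the full timeout before the test fails, while a clean one returns as soon as the count is back to the baseline.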

2. pprof’s goroutine profile

In a running service that imports net/http/pprof and serves it on localhost:6060, go tool pprof http://localhost:6060/debug/pprof/goroutine groups goroutines by stack. A leak looks unmistakable: thousands of goroutines parked on the exact same chan receive line.
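
The same profile can also be read from inside the process, without an HTTP endpoint, through the standard runtime/pprof package. The sketch below swaps the HTTP route for pprof.Lookup("goroutine") and writes the profile with debug=1, which collapses goroutines that share a stack into one entry with a count. parkedWorker, goroutineDump, and dumpShowsParkedWorkers are hypothetical names, and the 50 ms pause is an arbitrary allowance for the workers to park.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
	"strings"
	"time"
)

// parkedWorker is a hypothetical leaked worker: it blocks on a receive
// until stop is closed.
func parkedWorker(stop <-chan struct{}) {
	<-stop
}

// goroutineDump returns the goroutine profile as text. With debug=1,
// goroutines that share a stack are grouped into one entry with a count.
func goroutineDump() string {
	var buf bytes.Buffer
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 1); err != nil {
		return ""
	}
	return buf.String()
}

// dumpShowsParkedWorkers starts n workers, gives them time to park, and
// reports whether the profile text mentions parkedWorker.
func dumpShowsParkedWorkers(n int) bool {
	stop := make(chan struct{})
	defer close(stop) // release the workers once the profile is taken
	for i := 0; i < n; i++ {
		go parkedWorker(stop)
	}
	time.Sleep(50 * time.Millisecond)
	return strings.Contains(goroutineDump(), "parkedWorker")
}

func main() {
	fmt.Println("profile shows parked workers:", dumpShowsParkedWorkers(1000))
}
```

Printing goroutineDump() directly is the quickest way to see the grouping: the leaked workers appear as a single stack with a large count in front of it.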

3. goleak

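Uber’s goleak (go.uber.org/goleak) packages the same idea: defer goleak.VerifyNone(t) at the top of a test fails it if unexpected goroutines are still running when the test ends, and it filters out runtime noise for you.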

The fix: every goroutine needs an owner

The actual fix in the PR was conceptually one line: when generation finishes, close the pool so the workers’ range loops exit.

func (m *Maroto) Generate() (Document, error) {
	p := newPool(runtime.NumCPU())
	defer p.Close() // workers exit when the jobs channel closes

	// ... process pages through the pool ...
	return doc, nil
}
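
The snippet above calls p.Close(), which the simplified pool from earlier does not define. Here is one way it could look. This is a sketch for illustration, not the code from the PR: it assumes a hypothetical job type, adds a sync.WaitGroup so Close can wait for the workers, and uses a sync.Once so that calling Close twice is harmless.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// job is a hypothetical unit of work used only for this sketch.
type job func()

type pool struct {
	jobs chan job
	wg   sync.WaitGroup // tracks live workers
	once sync.Once      // makes Close safe to call more than once
}

func newPool(workers int) *pool {
	p := &pool{jobs: make(chan job)}
	p.wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer p.wg.Done()
			for j := range p.jobs { // exits once jobs is closed
				j()
			}
		}()
	}
	return p
}

// Close stops accepting jobs and blocks until every worker has returned.
// Sending a job after Close panics, so callers must finish submitting first.
func (p *pool) Close() {
	p.once.Do(func() { close(p.jobs) })
	p.wg.Wait()
}

// generate stands in for a per-call API and returns the pages processed.
func generate(workers, pages int) int64 {
	var processed int64
	p := newPool(workers)
	for i := 0; i < pages; i++ {
		p.jobs <- func() { atomic.AddInt64(&processed, 1) }
	}
	p.Close() // waits for in-flight jobs; the workers have finished their loops
	return atomic.LoadInt64(&processed)
}

func main() {
	fmt.Println("pages processed:", generate(4, 20))
}
```

Waiting inside Close is a design choice: closing the channel alone is enough to let the workers exit, but waiting means the caller knows they are done by the time Close returns. In a function with early returns, defer p.Close() remains the safer way to call it.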

But the lasting lesson is the design rule behind it, which Go’s Code Review Comments page points at in its Goroutine Lifetimes section and code review keeps re-teaching:

Never start a goroutine without knowing how it will stop.

For library code I’d make it stricter. A library that spawns goroutines must do one of three things:

Tie their lifetime to a call: start them inside the function and guarantee they exit before or shortly after it returns (defer pool.Close(), or Wait() on an errgroup.Group), or
Tie their lifetime to an object, and give the object an explicit Close()/Shutdown(ctx) the caller is documented to call, or
Accept a context.Context and exit when it’s cancelled.
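
The third option is the one that tends to get hand-waved, so here is a minimal sketch of it. sumUntilCancelled and demo are hypothetical names. The goroutine selects on ctx.Done() alongside its input, so cancelling the context is always enough to make it return.

```go
package main

import (
	"context"
	"fmt"
)

// sumUntilCancelled adds up values from in until ctx is cancelled or in is
// closed, then delivers the total on the returned channel and exits.
func sumUntilCancelled(ctx context.Context, in <-chan int) <-chan int {
	out := make(chan int, 1) // buffered so the final send never blocks
	go func() {
		defer close(out)
		total := 0
		for {
			select {
			case <-ctx.Done():
				out <- total
				return
			case v, ok := <-in:
				if !ok {
					out <- total
					return
				}
				total += v
			}
		}
	}()
	return out
}

// demo feeds three values to the worker, cancels it, and returns the total.
func demo() int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	in := make(chan int)
	out := sumUntilCancelled(ctx, in)
	for i := 1; i <= 3; i++ {
		in <- i // unbuffered: returns once the worker has received i
	}
	cancel()     // the worker's only stop signal in this demo
	return <-out // blocks until the worker has reported and is exiting
}

func main() {
	fmt.Println("total:", demo())
}
```

The caller owns the context, so the caller decides when the goroutine stops, and the library never has to guess.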

Anything else means the caller pays for goroutines they can’t see and can’t stop. That’s exactly what made the maroto bug interesting: nothing inside the library was wrong from the library’s point of view — the contract with the caller was wrong.

Takeaways

Goroutine leaks of this kind grow with cumulative usage (calls made since startup), not with current load. Watch NumGoroutine() over time, not just memory.
A parked goroutine is a GC root. The pool object being collected does not save you.
Put a leak assertion in your test suite the day you introduce a worker pool. It’s about fifteen lines.
In library APIs, concurrency lifetime is part of the public contract. Document who stops what, or better, make it impossible to get wrong.

The fix shipped in maroto and the leak is gone. The test that counts goroutines is still there, making sure it stays gone.