
Golang Memory Allocation


Tags: #go #golang #memory #allocation

1. Memory Allocation Summary.md

Summary

Note: When a function accepts any, Go needs to create an interface value, and the storage of the underlying value might involve a heap allocation.

Full Details

Below is some guidance we’ve found on the patterns which typically cause variables to escape to the heap:

The rule of thumb is: pointers point to data allocated on the heap.
Ergo, reducing the number of pointers in a program reduces the number of heap allocations.
Pointers should primarily be used to reflect ownership semantics and mutability.

Note: an extra bonus of ‘passing by value’ is the increased safety that comes from eliminating nil pointer dereferences.

Copying a small value is often less expensive than the indirection (and possible heap allocation) that comes with using a pointer.

Reducing the number of pointers in a program can yield another helpful result: the garbage collector will skip regions of memory that it can prove will contain no pointers. For example, regions of the heap which back slices of type []byte aren’t scanned at all. This also holds true for arrays of struct types that don’t contain any fields with pointer types.

Not only does reducing pointers result in less work for the garbage collector, it produces more cache-friendly code. Reading memory moves data from main memory into the CPU caches. Caches are finite, so some other piece of data must be evicted to make room. Evicted data may still be relevant to other portions of the program.
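
As a concrete illustration of the rule of thumb above, here’s a minimal sketch (the user type and constructor names are made up for this example) contrasting returning a value with returning a pointer to a local. Building it with go build -gcflags=-m should report an escape for the pointer-returning constructor, though the exact messages vary by Go version and inlining at the call site can sometimes remove the allocation:

package main

// user is a hypothetical struct used purely for illustration; note it
// contains no pointer fields.
type user struct {
	id  int
	age int
}

// newUserValue returns the struct by value: the result is copied into
// the caller's frame, so no heap allocation is needed.
func newUserValue(id, age int) user {
	return user{id: id, age: age}
}

// newUserPointer returns a pointer to a local composite literal: the
// value has to outlive this function, so escape analysis moves it to
// the heap (reported as something like "&user{...} escapes to heap").
func newUserPointer(id, age int) *user {
	return &user{id: id, age: age}
}

func main() {
	u1 := newUserValue(1, 30)
	u2 := newUserPointer(2, 40)
	println(u1.id, u2.id)
}

As a bonus, because user contains no pointer fields, the backing array of a []user is a region of memory the garbage collector never has to scan, whereas a []*user gives it one pointer to chase per element.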

Golang Memory Analysis

To see which optimisations the Go compiler has applied to your code when building or running it, you can use the -gcflags flag:

go run -gcflags="-m" ./main.go

To find out more about this flag we need to understand where it actually lives: if you try to look up the flag and the values it accepts via go help run, you’ll be surprised to discover there’s no information about it there.

To start, let’s consider the go run subcommand. This triggers a build, and that’s where -gcflags is exposed.

go help build

  -gcflags '[pattern=]arg list'                
    arguments to pass on each go tool compile invocation.

This tells us that the flag’s values are passed on to the go tool compile subcommand, so that’s where we should look for the relevant values. Amongst many other flags we’ll find -m, which prints any optimisation decisions the compiler has made (such as when data escapes to the heap):

go tool compile -help

  -m    print optimization decisions 

Optimising Struct Fields

Modern CPU hardware performs reads and writes to memory most efficiently when the data is naturally aligned. The CPU reads data in word-sized chunks rather than byte by byte. A word on a 64-bit system is 8 bytes, while a word on a 32-bit system is 4 bytes. In short, the CPU reads memory at addresses that are multiples of its word size.

The side-effect of this is that the compiler will sometimes add ‘padding’ to the memory reservation to keep fields aligned. Padding is the key to achieving data alignment, but it means that a certain ordering of struct fields can yield extra, unnecessary memory reservation, as demonstrated in the following example program (for visual aids see: https://wagslane.dev/posts/go-struct-ordering/ and https://betterprogramming.pub/how-to-speed-up-your-struct-in-golang-76b846209587):

package main

import (
	"unsafe"
)

type T1 struct {
	a int8
	b int64
	c int16
}

type T2 struct {
	a int8
	c int16
	b int64
}

func main() {
	println(unsafe.Sizeof(T1{})) // 24 (more memory reserved)
	println(unsafe.Sizeof(T2{})) // 16
}

Imagine using a 64-bit system and fetching the b field of T1 if there were no padding (i.e. if b started at offset 1, straight after a). The CPU would take two cycles to access the data instead of one: the first cycle would fetch memory 0 to 7 and the subsequent cycle would fetch the rest. Think of it like a notebook where each page can only store word-size data, in this case 8 bytes. If the b data is scattered across two pages, it takes two page flips to retrieve the complete data.

Another way to verify this is to first work out how many bytes each type should consume and then check the actual offsets.

Here’s an example program to validate this approach:

package main

import (
	"fmt"
	"reflect"
	"runtime"
)

// Should be 4 bytes total but is actually 6!
//
// Foo only needs 1 byte, but Bar (uint16) must start at a
// 2-byte-aligned offset, so the compiler inserts 1 byte of
// padding after Foo. Another byte of trailing padding is added
// after Baz so the struct's size is a multiple of its alignment (2).
type stats struct {
	Foo uint8  // 1 byte  (offset: 0)
	Bar uint16 // 2 bytes (offset: 2, after 1 byte of padding)
	Baz uint8  // 1 byte  (offset: 4, followed by 1 byte of padding)
}

func main() {
	typ := reflect.TypeOf(stats{})
	fmt.Printf("Struct is %d bytes long\n", typ.Size())
	n := typ.NumField()
	for i := 0; i < n; i++ {
		field := typ.Field(i)
		fmt.Printf("%s at offset %v, size=%d, align=%d\n",
			field.Name, field.Offset, field.Type.Size(),
			field.Type.Align())
	}

	allStats := []stats{}
	for i := 0; i < 100000000; i++ {
		allStats = append(allStats, stats{})
	}

	printMemUsage()
}

func printMemUsage() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("Alloc = %v MiB", bToMb(m.Alloc))
	fmt.Printf("\tTotalAlloc = %v MiB", bToMb(m.TotalAlloc))
	fmt.Printf("\tSys = %v MiB", bToMb(m.Sys))
	fmt.Printf("\tNumGC = %v\n", m.NumGC)
}

func bToMb(b uint64) uint64 {
	return b / 1024 / 1024
}

The output of this program is:

Struct is 6 bytes long
Foo at offset 0, size=1, align=1
Bar at offset 2, size=2, align=2
Baz at offset 4, size=1, align=1

We can see that although the struct should really only take up four bytes, it actually takes up six: one byte of padding is inserted after Foo (so Bar can start on a 2-byte boundary) and one byte of trailing padding is added after Baz (so the struct’s size is a multiple of its largest field alignment).

If we change the struct such that the uint16 isn’t defined between the uint8s and check the offsets again, then we’ll get the following output:

Struct is 4 bytes long
Bar at offset 0, size=2, align=2
Foo at offset 2, size=1, align=1
Baz at offset 3, size=1, align=1

Notice that the offsets are now what we would expect: Bar sits at offset zero, Foo comes straight after it at offset 2, and since only 1 byte is reserved for Foo, Baz follows immediately at offset 3.

Also notice that because of this the total struct size is now smaller.
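
For reference, the reordered struct that produces this output might look like the following (same field names as before, with the uint16 moved to the front so the two uint8 fields pack immediately after it):

type stats struct {
	Bar uint16 // 2 bytes (offset: 0)
	Foo uint8  // 1 byte  (offset: 2)
	Baz uint8  // 1 byte  (offset: 3) -- total: 4 bytes, no padding required
}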

Golang Memory Allocation.go

package main

import "fmt"

func main() {
        x := 42
        fmt.Println(x)
}

Golang Memory Allocation.sh

$ go build -gcflags '-m' ./main.go

./main.go:7: x escapes to heap
./main.go:7: main ... argument does not escape

$ go build -gcflags '-m -m' ./main.go # extra -m's to increase verbosity (alternatively use "-m=N")

./main.go:5: cannot inline main: non-leaf function
./main.go:7: x escapes to heap
./main.go:7:         from ... argument (arg to ...) at ./main.go:7
./main.go:7:         from *(... argument) (indirection) at ./main.go:7
./main.go:7:         from ... argument (passed to call[argument content escapes]) at ./main.go:7
./main.go:7: main ... argument does not escape

# This ^^ shows that x escapes because it is passed to a function argument which itself escapes (I take this to mean that because fmt.Println accepts its arguments as interface{} values, x has to be boxed into an interface whose concrete type is only resolved at runtime, so the compiler can't prove it's safe to keep x on the stack).
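
To make that comparison explicit, here’s a minimal sketch (the square, box and sink names are made up for this example, and any requires Go 1.18+; use interface{} on older versions) contrasting a concrete-typed parameter with an any parameter. Building it with go build -gcflags=-m should report an escape only for the call that has to box its argument into an interface value (exact messages vary between Go versions):

package main

// sink is a hypothetical package-level interface value, used only to
// guarantee the boxed argument outlives the call.
var sink any

// square accepts a concrete int: no interface value is created and
// the argument can stay on the stack.
func square(n int) int {
	return n * n
}

// box accepts any: the caller has to wrap its argument in an
// interface value, and storing it in sink forces that value to
// escape to the heap.
func box(v any) {
	sink = v
}

func main() {
	x := 42
	_ = square(x) // no escape reported
	box(x)        // -gcflags=-m typically reports: x escapes to heap
}

The takeaway mirrors the note at the top of this Gist: it isn’t the function call itself that costs, it’s the conversion to an interface value whose lifetime the compiler can’t bound.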