Go Memory Management

Go's memory management is based on TCMalloc: it works on contiguous virtual address ranges, manages memory in page (8 KB) units, and layers several caches on top. When allocating, the requested size is rounded up to a size class, a suitable mspan is located best-fit, and any unused remainder is split off into mspans of other sizes for later use.

When a new object is created (ignoring escape analysis), the allocation strategy depends on the object's size; a minimal sketch of this dispatch follows the list:

  • Tiny objects (size < 16 bytes, containing no pointers) are allocated from the tiny block in the current P's mcache.
  • Small objects (16 bytes <= size <= 32 KB) are allocated from the free list of the matching size class in the current P's mcache; if that free list is empty, the mcache refills from mcentral (and mcentral, in turn, from mheap).
  • Large objects (size > 32 KB) are allocated directly from the mheap.
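
To make this dispatch concrete, here is a minimal, self-contained sketch of the three-way size check. The constants mirror the thresholds above, but the function and its labels are purely illustrative, not the runtime's actual code:

package main

import "fmt"

const (
    maxTinySize  = 16       // pointer-free objects below this use the tiny allocator
    maxSmallSize = 32 << 10 // 32 KB boundary between small and large objects
)

// classify mirrors the runtime's dispatch; the real allocator performs
// the allocation inline rather than returning a label.
func classify(size uintptr, noscan bool) string {
    switch {
    case noscan && size < maxTinySize:
        return "tiny: bump-allocated inside the mcache's tiny block"
    case size <= maxSmallSize:
        return "small: taken from the mcache free list for its size class"
    default:
        return "large: allocated directly from the mheap"
    }
}

func main() {
    fmt.Println(classify(8, true))       // tiny
    fmt.Println(classify(512, false))    // small
    fmt.Println(classify(64<<10, false)) // large
}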

1. Memory Allocation Knowledge

  • The main memory of a computer system is organized as an array of M contiguous byte-sized cells, each with a unique physical address (PA).
  • Modern processors use a form of addressing known as virtual addressing (VA), with the word as the basic unit of addressing.
  • Virtual addresses are translated to physical addresses by consulting a page table, which is itself stored in physical memory.
  • The MMU translates addresses at page granularity; the kernel treats the physical page as the basic unit of memory management.
  • In virtual memory terminology, a DRAM cache miss is called a page fault.
  • The page structure describes physical pages, not virtual pages.
  • Each physical page in the system is assigned a page structure.

Before looking at how Go's memory allocator works, let's review the general idea of a dynamic storage allocator.

2. Dynamic Storage Allocator

A dynamic storage allocator maintains a region of a process's virtual address space called the heap. The heap can be viewed as a collection of blocks (chunks) of different sizes: contiguous slices of virtual memory whose address contiguity both the memory allocator and the garbage-collection algorithm rely on.

Dynamic allocators are mainly divided into:

  • Explicit: the familiar malloc/free
  • Implicit: garbage collection

In Go, the allocator divides the memory blocks (chunks) it manages into two kinds:

  • span: a large block of memory made up of consecutive pages
  • object: a span cut into small pieces of a specific size, each of which can hold one object

In terms of purpose, spans are used for internal management, while objects are handed out to satisfy allocations.

The allocator distinguishes spans of different sizes by their page count. For example, spans can be stored in a management array indexed by page count, so a span of the needed size can be looked up directly. The span size is not fixed: when fetching a free span, if no span of the right size is found, a larger one (more pages) is returned, which triggers a trimming operation; the remainder forms a new span that is put back into the management array. The allocator also tries to merge free spans at adjacent addresses into larger blocks of memory, reducing fragmentation and providing a more flexible allocation strategy.
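
To illustrate the trim-and-return behavior, here is a toy sketch built on a simple map from page count to free spans; the real runtime structures (mspan, mheap) are far more elaborate, so treat the names here as invented for the example:

package main

import "fmt"

type span struct {
    start  uintptr // first page number
    npages uintptr // length in pages
}

// free maps a page count to spans of exactly that many pages.
var free = map[uintptr][]*span{}

// alloc returns a span of n pages, trimming a larger span if needed.
func alloc(n uintptr) *span {
    for size := n; size <= 128; size++ { // search upward for a big-enough span
        if list := free[size]; len(list) > 0 {
            s := list[len(list)-1]
            free[size] = list[:len(list)-1]
            if s.npages > n { // trim: put the tail back as a new span
                rest := &span{start: s.start + n, npages: s.npages - n}
                free[rest.npages] = append(free[rest.npages], rest)
                s.npages = n
            }
            return s
        }
    }
    return nil // the real allocator would grow the heap here
}

func main() {
    free[8] = append(free[8], &span{start: 0, npages: 8})
    s := alloc(3)
    fmt.Printf("got pages [%d,%d); 5-page remainder back in free[5]\n", s.start, s.start+s.npages)
}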

In Go's malloc.go:

const (
    _PageShift = 13
    _PageSize  = 1 << _PageShift
    _PageMask  = _PageSize - 1
)

The page size is therefore 2^13 = 8192 bytes (8 KB). Object sizes, in turn, are divided into n classes at 8-byte granularity. Although this rounding wastes some memory, the allocator only ever has to deal with a limited number of small-block sizes, which simplifies its allocation and reuse strategies.

The allocator also tries to save memory by packing multiple tiny objects into a single block, as sketched below.
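
As a rough illustration of that packing, the sketch below bump-allocates several pointer-free objects inside one 16-byte block, aligning each offset the way the runtime's tiny path does; the type and field names are invented for the example:

package main

import "fmt"

const maxTinySize = 16

type tinyCache struct {
    block  [maxTinySize]byte
    offset uintptr
}

// alloc returns an offset inside the shared block, or false if the
// request does not fit and a fresh block would be needed.
func (c *tinyCache) alloc(size uintptr) (uintptr, bool) {
    off := c.offset
    switch { // align the offset to the object's natural alignment
    case size&7 == 0:
        off = (off + 7) &^ 7
    case size&3 == 0:
        off = (off + 3) &^ 3
    case size&1 == 0:
        off = (off + 1) &^ 1
    }
    if off+size > maxTinySize {
        return 0, false
    }
    c.offset = off + size
    return off, true
}

func main() {
    var c tinyCache
    for _, sz := range []uintptr{1, 2, 8, 4} {
        off, ok := c.alloc(sz)
        fmt.Printf("size %d -> offset %d (fits=%v)\n", sz, off, ok)
    }
}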

Here is part of msize.go:

var class_to_size [_NumSizeClasses]int32
var class_to_allocnpages [_NumSizeClasses]int32
var class_to_divmagic [_NumSizeClasses]divMagic

var size_to_class8 [1024/8 + 1]int8
var size_to_class128 [(_MaxSmallSize-1024)/128 + 1]int8

func sizeToClass(size int32) int32 {
    if size > _MaxSmallSize {
        throw("invalid size")
    }
    if size > 1024-8 {
        return int32(size_to_class128[(size-1024+127)>>7])
    }
    return int32(size_to_class8[(size+7)>>3])
}
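
To see the index arithmetic at work, the snippet below reproduces just the bucketing expressions from sizeToClass; the class tables themselves are built at init time, so only the bucket index is shown:

package main

import "fmt"

func main() {
    for _, size := range []int32{1, 8, 9, 24, 1016, 1017, 2048} {
        if size > 1024-8 {
            fmt.Printf("size %4d -> size_to_class128[%d]\n", size, (size-1024+127)>>7)
        } else {
            fmt.Printf("size %4d -> size_to_class8[%d]\n", size, (size+7)>>3)
        }
    }
}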

When the allocator is initialized, it builds these tables mapping object sizes to size classes, including the number of span pages each class uses for slicing.

// Tunable constants.
_MaxSmallSize = 32 << 10

From the constant above we can see the threshold: an object larger than 32 KB (_MaxSmallSize) is treated as a large object.

3. mmap function

Unix processes can use the mmap function to create new virtual storage areas and map objects to them.

The mmap function asks the kernel to create a new virtual memory area, preferably starting at the given start address, and to map a contiguous chunk of the object specified by file descriptor fd into that area.
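
On Unix, Go exposes this same system call through the syscall package. Here is a minimal, Unix-only sketch of mapping anonymous memory, roughly what the runtime does under the hood; application code should not normally obtain memory this way:

package main

import (
    "fmt"
    "syscall"
)

func main() {
    const length = 1 << 20 // ask for 1 MB
    // MAP_ANON | MAP_PRIVATE: fresh zeroed pages not backed by any file,
    // so fd is -1 and offset is 0.
    mem, err := syscall.Mmap(-1, 0, length,
        syscall.PROT_READ|syscall.PROT_WRITE,
        syscall.MAP_ANON|syscall.MAP_PRIVATE)
    if err != nil {
        panic(err)
    }
    mem[0] = 42 // the mapping is usable like an ordinary byte slice
    fmt.Println("mapped", len(mem), "bytes; first byte =", mem[0])
    _ = syscall.Munmap(mem)
}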

4. Frequent data allocation and recycling

There are two general ways to allocate and free data frequently and efficiently while reducing fragmentation:

  • Free list: provides directly reusable, already-allocated blocks; its disadvantage is the lack of global control (see the sketch after this list).
  • slab: the Linux mechanism that organizes objects into so-called caches, one per object type.
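
Here is a minimal sketch of the first approach: a free list of fixed-size blocks, threaded through the blocks themselves. The names are illustrative:

package main

import "fmt"

type block struct {
    next *block
    data [64]byte // fixed block size for this list
}

type freeList struct{ head *block }

func (f *freeList) alloc() *block {
    if f.head == nil {
        return new(block) // list empty: fall back to a fresh block
    }
    b := f.head
    f.head = b.next
    b.next = nil
    return b
}

func (f *freeList) free(b *block) {
    b.next = f.head // push the block back for reuse
    f.head = b
}

func main() {
    var fl freeList
    a := fl.alloc()
    fl.free(a)
    b := fl.alloc() // reuses the freed block
    fmt.Println("reused:", a == b)
}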

5. Go's memory allocation

Go's memory allocator is modeled on Google's TCMalloc, a memory-pool-based allocator. At the bottom it calls mmap directly and uses best-fit for dynamic allocation.

Go allocates a local MCache for each system thread, so small allocations are served from the MCache directly, and garbage collection runs periodically; Go's allocator therefore combines explicit allocation with implicit (garbage-collected) reclamation.

Go treats blocks of 32 KB or less as small memory blocks, which the runtime cuts according to predefined specifications (about 100 size classes). To avoid arbitrary cutting, any requested byte count is rounded up to the nearest class, and a whole block of that class is handed to the requester from the free list.

The major components of Go's memory allocation:

  • MCache: structured like MHeap in that it keeps a free list per size class. Each M has its own local MCache (small objects are taken from it without locking), which is how Go manages memory efficiently across multiple threads.

  • MCentral: when the MCache has no free memory, it requests a whole span from the MCentral for its size class rather than loose pages; how many pages that span contains is determined by the MCentral's size class. Because MCentral is shared, spans freed by one thread can be reused by another.

  • MHeap: responsible for organizing and managing MSpans.

(1) Allocation: allocate a span from free[]; if the span is cut, put the remainder back into free[].

(2) Recycling: when an MSpan is recycled, the allocator first maps the adjacent addresses to their MSpans; if a neighboring MSpan is unused, the two are merged. Finally the page, or the merged run of pages, is returned to the free[] allocation pool or to the large-span list. A toy sketch of this coalescing follows.
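
The sketch assumes a simple page-to-span map in place of the runtime's real lookup structures; all names here are invented for the example:

package main

import "fmt"

type span struct {
    start, npages uintptr
    free          bool
}

// pageToSpan maps a page number to the span that owns it.
var pageToSpan = map[uintptr]*span{}

func register(s *span) {
    for p := s.start; p < s.start+s.npages; p++ {
        pageToSpan[p] = s
    }
}

func freeSpan(s *span) {
    s.free = true
    // Merge with the span just before this one, if it is free.
    if prev := pageToSpan[s.start-1]; prev != nil && prev.free {
        prev.npages += s.npages
        s = prev
        register(s)
    }
    // Merge with the span just after, if it is free.
    if next := pageToSpan[s.start+s.npages]; next != nil && next.free {
        s.npages += next.npages
        register(s)
    }
}

func main() {
    a := &span{start: 0, npages: 2}
    b := &span{start: 2, npages: 3}
    register(a)
    register(b)
    freeSpan(a)
    freeSpan(b) // b coalesces into a
    fmt.Printf("merged span: start=%d npages=%d\n", a.start, a.npages)
}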

6. Go's memory model

Go's memory model can be thought of as having two levels:

Level 1: MHeap is the main component. Allocation is done in pages, but management is done in MSpans; each allocation uses the best-fit principle to find consecutive pages, and recycling uses bitmaps.

Level 2: MCache is the main component. It works like a memory pool, and recycling uses reference counting.

Allocation scenarios:

Allocating memory for an object requires distinguishing between stack and heap allocation. In general, the compiler is responsible for using registers and the stack to store objects whenever possible, which improves performance and reduces pressure on the garbage collector.

An example:

package main

func patent() *int {
    x := new(int)
    *x = 1234
    return x
}

func main() {
    println(*patent())
}

We compile the code above with inlining disabled:

> go build -gcflags="-l" -o patent main.go

Disassembling main.patent from the resulting binary gives:

        main.go:3       0x2040  65488b0c25a0080000      GS MOVQ GS:0x8a0, CX
        main.go:3       0x2049  483b6110                CMPQ 0x10(CX), SP
        main.go:3       0x204d  7639                    JBE 0x2088
        main.go:3       0x204f  4883ec18                SUBQ $0x18, SP
        main.go:3       0x2053  48896c2410              MOVQ BP, 0x10(SP)
        main.go:3       0x2058  488d6c2410              LEAQ 0x10(SP), BP
        main.go:4       0x205d  488d05dc3b0500          LEAQ 0x53bdc(IP), AX
        main.go:4       0x2064  48890424                MOVQ AX, 0(SP)
        main.go:4       0x2068  e823a70000              CALL runtime.newobject(SB)
        main.go:4       0x206d  488b442408              MOVQ 0x8(SP), AX
        main.go:5       0x2072  48c700d2040000          MOVQ $0x4d2, 0(AX)
        main.go:6       0x2079  4889442420              MOVQ AX, 0x20(SP)
        main.go:6       0x207e  488b6c2410              MOVQ 0x10(SP), BP
        main.go:6       0x2083  4883c418                ADDQ $0x18, SP
        main.go:6       0x2087  c3                      RET
        main.go:3       0x2088  e893720400              CALL runtime.morestack_noctxt(SB)
        main.go:3       0x208d  ebb1                    JMP main.patent(SB)
        :-1             0x208f  cc                      INT $0x3

From the CALL runtime.newobject(SB) in the output, it is evident that our object was allocated on the heap.

Now let's build with the default flags and compare:

> go build  -o patent main.go

Disassembling the same function as before:

> go tool objdump -s "main\.patent" patent

The command produces no output: main.patent was inlined into its caller and eliminated from the binary. So let's disassemble main instead:

> go tool objdump -s "main\.main" patent

The results are as follows:

        main.go:9       0x2040  65488b0c25a0080000      GS MOVQ GS:0x8a0, CX
        main.go:9       0x2049  483b6110                CMPQ 0x10(CX), SP
        main.go:9       0x204d  763d                    JBE 0x208c
        main.go:9       0x204f  4883ec18                SUBQ $0x18, SP
        main.go:9       0x2053  48896c2410              MOVQ BP, 0x10(SP)
        main.go:9       0x2058  488d6c2410              LEAQ 0x10(SP), BP
        main.go:10      0x205d  48c7442408d2040000      MOVQ $0x4d2, 0x8(SP)
        main.go:10      0x2066  e875210200              CALL runtime.printlock(SB)
        main.go:10      0x206b  48c70424d2040000        MOVQ $0x4d2, 0(SP)
        main.go:10      0x2073  e8b8280200              CALL runtime.printint(SB)
        main.go:10      0x2078  e8e3230200              CALL runtime.printnl(SB)
        main.go:10      0x207d  e8ee210200              CALL runtime.printunlock(SB)
        main.go:11      0x2082  488b6c2410              MOVQ 0x10(SP), BP
        main.go:11      0x2087  4883c418                ADDQ $0x18, SP
        main.go:11      0x208b  c3                      RET
        main.go:9       0x208c  e83f720400              CALL runtime.morestack_noctxt(SB)
        main.go:9       0x2091  ebad                    JMP main.main(SB)

This shows that the inlined code never calls newobject, so no heap allocation occurs.

The compiler behaves this way because, without inlining, the object must be passed between two stack frames, and returning a pointer into a frame that is about to become invalid is not allowed, so the data goes on the heap. With inlining, x is effectively a local variable inside main's stack frame and needs no heap allocation.
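
To see the compiler's reasoning directly, you can ask for its inlining and escape-analysis report; look for messages such as "can inline patent" and whether new(int) "escapes to heap" or "does not escape" (exact wording varies by Go version):

> go build -gcflags="-m" -o patent main.go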

Memory Allocation Process
1. Round the small object's size up to a corresponding size class (one of roughly 100) and look at the matching free list in the MCache. If the list is not empty, take an object from it directly; this step requires no locking.

2. If the MCache free list is empty, refill it with some objects from the MCentral free list.

3. If the MCentral free list is empty, take pages from the MHeap to replenish the MCentral, cutting the memory into objects of the required size.

4. If the MHeap is empty, or has no sufficiently large run of pages, allocate a new group of pages from the operating system (usually at least 1 MB).

The core source implementation of Go's allocation path:

func mallocgc(size uintptr, typ *_type, flags uint32) unsafe.Pointer {
    if gcphase == _GCmarktermination {
        throw("mallocgc called with gcphase == _GCmarktermination")
    }

    if size == 0 {
        return unsafe.Pointer(&zerobase)
    }

    if flags&flagNoScan == 0 && typ == nil {
        throw("malloc missing type")
    }

    if debug.sbrk != 0 {
        align := uintptr(16)
        if typ != nil {
            align = uintptr(typ.align)
        }
        return persistentalloc(size, align, &memstats.other_sys)
    }

    // assistG is the G to charge for this allocation, or nil if
    // GC is not currently active.
    var assistG *g
    if gcBlackenEnabled != 0 {
        // Charge the current user G for this allocation.
        assistG = getg()
        if assistG.m.curg != nil {
            assistG = assistG.m.curg
        }
        // Charge the allocation against the G. We'll account
        // for internal fragmentation at the end of mallocgc.
        assistG.gcAssistBytes -= int64(size)

        if assistG.gcAssistBytes < 0 {
            // This G is in debt. Assist the GC to correct
            // this before allocating. This must happen
            // before disabling preemption.
            gcAssistAlloc(assistG)
        }
    }

    // Set mp.mallocing to keep from being preempted by GC.
    mp := acquirem()
    if mp.mallocing != 0 {
        throw("malloc deadlock")
    }
    if mp.gsignal == getg() {
        throw("malloc during signal")
    }
    mp.mallocing = 1

    shouldhelpgc := false
    dataSize := size
    c := gomcache()
    var s *mspan
    var x unsafe.Pointer
    if size <= maxSmallSize {
        if flags&flagNoScan != 0 && size < maxTinySize {
            // Tiny allocator path: pack multiple tiny (pointer-free) objects into one 16-byte block.
            off := c.tinyoffset
            // Align tiny pointer for required (conservative) alignment.
            if size&7 == 0 {
                off = round(off, 8)
            } else if size&3 == 0 {
                off = round(off, 4)
            } else if size&1 == 0 {
                off = round(off, 2)
            }
            if off+size <= maxTinySize && c.tiny != 0 {
                // The object fits into existing tiny block.
                x = unsafe.Pointer(c.tiny + off)
                c.tinyoffset = off + size
                c.local_tinyallocs++
                mp.mallocing = 0
                releasem(mp)
                return x
            }
            // Allocate a new maxTinySize block.
            s = c.alloc[tinySizeClass]
            v := s.freelist
            if v.ptr() == nil {
                systemstack(func() {
                    c.refill(tinySizeClass)
                })
                shouldhelpgc = true
                s = c.alloc[tinySizeClass]
                v = s.freelist
            }
            s.freelist = v.ptr().next
            s.ref++
            // prefetchnta offers best performance, see change list message.
            prefetchnta(uintptr(v.ptr().next))
            x = unsafe.Pointer(v)
            (*[2]uint64)(x)[0] = 0
            (*[2]uint64)(x)[1] = 0
            // See if we need to replace the existing tiny block with the new one
            // based on amount of remaining free space.
            if size < c.tinyoffset || c.tiny == 0 {
                c.tiny = uintptr(x)
                c.tinyoffset = size
            }
            size = maxTinySize
        } else {
            var sizeclass int8
            if size <= 1024-8 {
                sizeclass = size_to_class8[(size+7)>>3]
            } else {
                sizeclass = size_to_class128[(size-1024+127)>>7]
            }
            size = uintptr(class_to_size[sizeclass])
            s = c.alloc[sizeclass]
            v := s.freelist
            if v.ptr() == nil {
                systemstack(func() {
                    c.refill(int32(sizeclass))
                })
                shouldhelpgc = true
                s = c.alloc[sizeclass]
                v = s.freelist
            }
            s.freelist = v.ptr().next
            s.ref++
            // prefetchnta offers best performance, see change list message.
            prefetchnta(uintptr(v.ptr().next))
            x = unsafe.Pointer(v)
            if flags&flagNoZero == 0 {
                v.ptr().next = 0
                if size > 2*sys.PtrSize && ((*[2]uintptr)(x))[1] != 0 {
                    memclr(unsafe.Pointer(v), size)
                }
            }
        }
    } else {
        var s *mspan
        shouldhelpgc = true
        systemstack(func() {
            s = largeAlloc(size, flags)
        })
        x = unsafe.Pointer(uintptr(s.start << pageShift))
        size = s.elemsize
    }

    if flags&flagNoScan != 0 {
        // All objects are pre-marked as noscan. Nothing to do.
    } else {
        if typ == deferType {
            dataSize = unsafe.Sizeof(_defer{})
        }
        heapBitsSetType(uintptr(x), size, dataSize, typ)
        if dataSize > typ.size {
            // Array allocation. If there are any
            // pointers, GC has to scan to the last
            // element.
            if typ.ptrdata != 0 {
                c.local_scan += dataSize - typ.size + typ.ptrdata
            }
        } else {
            c.local_scan += typ.ptrdata
        }
        publicationBarrier()
    }
    if gcphase == _GCmarktermination || gcBlackenPromptly {
        systemstack(func() {
            gcmarknewobject_m(uintptr(x), size)
        })
    }

    if raceenabled {
        racemalloc(x, size)
    }
    if msanenabled {
        msanmalloc(x, size)
    }

    mp.mallocing = 0
    releasem(mp)

    if debug.allocfreetrace != 0 {
        tracealloc(x, size, typ)
    }

    if rate := MemProfileRate; rate > 0 {
        if size < uintptr(rate) && int32(size) < c.next_sample {
            c.next_sample -= int32(size)
        } else {
            mp := acquirem()
            profilealloc(mp, x, size)
            releasem(mp)
        }
    }

    if assistG != nil {
        // Account for internal fragmentation in the assist
        // debt now that we know it.
        assistG.gcAssistBytes -= int64(size - dataSize)
    }

    if shouldhelpgc && gcShouldStart(false) {
        gcStart(gcBackgroundMode, false)
    }

    return x
}

Finally, there are three common happens-before rules in Go, all involving channels:

  • A send on a channel happens before the corresponding receive from that channel completes.
  • For a channel with buffer capacity C, the k-th receive happens before the (k+C)-th send completes.
  • For an unbuffered channel, the receive happens before the send completes.
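
A small program demonstrating the first rule: because the send happens before the receive completes, main is guaranteed to observe the write made before the send:

package main

var msg string
var ch = make(chan struct{})

func setup() {
    msg = "hello, world" // write...
    ch <- struct{}{}     // ...then send: the send happens before the receive completes
}

func main() {
    go setup()
    <-ch         // after this receive, the write to msg is visible
    println(msg) // guaranteed to print "hello, world"
}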
