Deep understanding of Python GC

Object memory management

There are two methods for object memory management in python, reference count / GC.
When the reference counting policy is applied to the management of each object, the receiving / returning object needs a + - object count, and whether the object supports GC is optional, because the existence of GC is to solve the problem of circular reference left by reference counting, and GC can not be supported for objects without other object pointers.

Reference counting

The advantage of reference counting is that it is simple and allocates the object destruction time to the program life cycle

static PyObject* PyPerson_New(PyTypeObject *type, PyObject *args, PyObject *kwds){

    PyObject *ret = create_person(...);

    //Destroy object in case of exception
    if(PyErr_Occurred()){
        Py_XDECREF(ret);
        return NULL;
    }


    if(ret == NULL)
        FAST_RETURN_ERROR("create person obj fail");
    
    if(!check_person(ret)){
        Py_XDECREF(ret); //Destroy object-1
         Py_XINCREF(Py_None); //Return object + 1
        return Py_None;
    }
    
    return ret;
}

The example code above demonstrates that the return object count + 1 and the object overrun act on the count-1
When the object count = = 0, the TP ﹐ dealloc function of typeobject is called to complete the object cleaning

#define Py_DECREF(op)                                   \
    do {                                                \
        PyObject *_py_decref_tmp = (PyObject *)(op);    \
        if (_Py_DEC_REFTOTAL  _Py_REF_DEBUG_COMMA       \
        --(_py_decref_tmp)->ob_refcnt != 0)             \
            _Py_CHECK_REFCNT(_py_decref_tmp)            \
        else                                            \
            _Py_Dealloc(_py_decref_tmp);                \
    } while (0)

Of course, reference counting also has a well-known disadvantage, circular reference, so we need to introduce GC mechanism to make up for it

GC

The policy python gc uses is mark clear. It can judge whether an object is a garbage object according to whether it is reachable from root object

Object supports GC

Since it is optional to support GC or not, we need to actively select whether the object supports GC or not. We only need to add a tag to the typeobject

GC object memory model

In addition to the data of the object itself, some gc information is added to the gc object. For details, see the memory allocation process of the gc object:

PyObject *
PyType_GenericAlloc(PyTypeObject *type, Py_ssize_t nitems)
{
//......
    if (PyType_IS_GC(type))
        obj = _PyObject_GC_Malloc(size);
    else
        obj = (PyObject *)PyObject_MALLOC(size);
//.....
}

static PyObject *
_PyObject_GC_Alloc(int use_calloc, size_t basicsize)
{
    PyObject *op;
    PyGC_Head *g;
    size_t size;
    if (basicsize > PY_SSIZE_T_MAX - sizeof(PyGC_Head))
        return PyErr_NoMemory();
    size = sizeof(PyGC_Head) + basicsize;
//....
}

It can be seen that the gc object memory model is as follows:

-------
|gc head|
|-------|
|  obj  |
|       |
-------
//Form a linked list of gc objects through pygc ﹣ head
typedef union _gc_head {
    struct {
        union _gc_head *gc_next;
        union _gc_head *gc_prev;
        Py_ssize_t gc_refs;  //State of gc object
    } gc;
    double dummy;  /* force worst-case alignment */
} PyGC_Head;
/Set to untrack Not tracking
g->gc.gc_refs = 0;
_PyGCHead_SET_REFS(g, GC_UNTRACKED);  
//Count new objects in generation 0
_PyRuntime.gc.generations[0].count++; /* number of allocated GC objects */
//Trigger gc if the number of generation 0 objects exceeds the threshold
if (_PyRuntime.gc.generations[0].count > _PyRuntime.gc.generations[0].threshold &&
    _PyRuntime.gc.enabled &&
    _PyRuntime.gc.generations[0].threshold &&
    !_PyRuntime.gc.collecting &&
    !PyErr_Occurred()) {
    _PyRuntime.gc.collecting = 1;
    collect_generations(); //gc
    _PyRuntime.gc.collecting = 0;
}

python gc is also generational. It also has the manifestations of the new generation and the old generation. Different from the generation of jvm gcheap, the generation here is just statistically significant and has no memory occupation

Notice that after the object memory allocation process, the object is not actually allocated to a generation
Do not perform this step until memory is allocated in pyobject? GC? Alloc

 if (PyType_IS_GC(type))
        _PyObject_GC_TRACK(obj);  //Add to object list


//Add objects to the object list of generation 0
#define _PyObject_GC_TRACK(o) do { \
    PyGC_Head *g = _Py_AS_GC(o); \
    if (_PyGCHead_REFS(g) != _PyGC_REFS_UNTRACKED) \
        Py_FatalError("GC object already tracked"); \
    _PyGCHead_SET_REFS(g, _PyGC_REFS_REACHABLE); \
    g->gc.gc_next = _PyGC_generation0; \
    g->gc.gc_prev = _PyGC_generation0->gc.gc_prev; \
    g->gc.gc_prev->gc.gc_next = g; \
    _PyGC_generation0->gc.gc_prev = g; \
    } while (0);

GC process

To filter out garbage objects, the most direct way is to mark all accessible objects from rootobject, and the rest is garbage objects. Two things need to be done in this process: 1. Determine which is the rootobject; 2. Traverse the object from the rootobject.

Determine GC scope


    static Py_ssize_t
    collect_generations(void)
    {
        int i;
        Py_ssize_t n = 0;
    //Find oldest generation over threshold
        for (i = NUM_GENERATIONS-1; i >= 0; i--) {
            if (_PyRuntime.gc.generations[i].count > _PyRuntime.gc.generations[i].threshold) {
        
    //Avoid full gc if there are not many long Fu live objects
                if (i == NUM_GENERATIONS - 1
                    && _PyRuntime.gc.long_lived_pending < _PyRuntime.gc.long_lived_total / 4)
                    continue;
                n = collect_with_callback(i); //Collect gen[i] - gen[0]
                break;
            }
        }
        return n;
    }

Object traversal

Before determining the rootobject, we need to solve the problem of object traversal. Because we need to start from an object to access the object it references, that is, breadth first traversal, we can not directly traverse the gc object list, but use additional mechanisms.

static void
subtract_refs(PyGC_Head *containers)
{
    traverseproc traverse;
    PyGC_Head *gc = containers->gc.gc_next;
    for (; gc != containers; gc=gc->gc.gc_next) {
        traverse = Py_TYPE(FROM_GC(gc))->tp_traverse;
        (void) traverse(FROM_GC(gc),
                       (visitproc)visit_decref,
                       NULL);
    }
}

This traversal mechanism is the TP traverse function in the typeobject. In the TP traverse function, the object must be referred to the visitproc function for processing, thus completing the breadth first traversal of the object.

static int person_traverse(PyObject *self, visitproc visit, void *arg){

    Person *p = (Person*)self;
    //visit(p->dict,arg)
    Py_VISIT(p->dict);

    return 0;
}

Determine rootobject

All direct references to objects form a directed graph. First, the references between objects are removed. Then, the final count > 0 indicates that there are non object references between objects, that is, rootobject

Back to the above example, visit ABCD is a function used to refer to - 1 in an object

static int
visit_decref(PyObject *op, void *data)
{
    if (PyObject_IS_GC(op)) {
        PyGC_Head *gc = AS_GC(op);
  
        if (_PyGCHead_REFS(gc) > 0)
            _PyGCHead_DECREF(gc);
    }
    return 0;
}

After the first round of filtering, mark unreachable if count > 0 and unreachable if count = = 0


static void
move_unreachable(PyGC_Head *young, PyGC_Head *unreachable)
{
    PyGC_Head *gc = young->gc.gc_next;

        while (gc != young) {
        PyGC_Head *next;
//Refs > 0 goes through the above refs-1 root object refs > 0
        if (_PyGCHead_REFS(gc)) {

             PyObject *op = FROM_GC(gc);
            traverseproc traverse = Py_TYPE(op)->tp_traverse;
            assert(_PyGCHead_REFS(gc) > 0);
//Set the object to reachable
            _PyGCHead_SET_REFS(gc, GC_REACHABLE);
//All the objects that can be accessed from this rootobject are reachable
            (void) traverse(op,
                            (visitproc)visit_reachable,
                            (void *)young);
            next = gc->gc.gc_next;
            if (PyTuple_CheckExact(op)) {
                _PyTuple_MaybeUntrack(op);
            }
        }
        else {
//unreachable: it will be corrected when misjudging the traversal
            next = gc->gc.gc_next;
            gc_list_move(gc, unreachable);
            _PyGCHead_SET_REFS(gc, GC_TENTATIVELY_UNREACHABLE);
        }
        gc = next;
    }
}



static int
visit_reachable(PyObject *op, PyGC_Head *reachable)
{

    if (PyObject_IS_GC(op)) {
        PyGC_Head *gc = AS_GC(op);
        const Py_ssize_t gc_refs = _PyGCHead_REFS(gc);

        if (gc_refs == 0) {
//Count + 1 so that when you traverse it, it will be classified as reachable
            _PyGCHead_SET_REFS(gc, 1);
        }
        else if (gc_refs == GC_TENTATIVELY_UNREACHABLE) {
      
//In the above traversal, it is misjudged to put the object back into the reachable list
            gc_list_move(gc, reachable);
            _PyGCHead_SET_REFS(gc, 1);
        }
     
        }
    return 0;
}

Live object migration

After filtering the reachable and unreachable objects, the surviving objects need to be moved to the elderly generation

if (young != old) {
//If it's the survival number of gen[1], it will affect the full gc
    if (generation == NUM_GENERATIONS - 2) {
        _PyRuntime.gc.long_lived_pending += gc_list_size(young);
    }
//Live objects enter the older gen 
    gc_list_merge(young, old);
}
else {
//If it is full gc, it will untrack the dict object to reduce the gc burden
    untrack_dicts(young);
    _PyRuntime.gc.long_lived_pending = 0;
//Object enters long live state 
    _PyRuntime.gc.long_lived_total = gc_list_size(young);
}

It's still the last step to complete the object filtering, because there are some problems in the design if the object implements the TP del function. Because some objects will call the tp_del of the referenced object in tp_dealloc and tp_free to clean up, but gc does not guarantee that A refers to B. B must be destroyed later than A. If B destroys, A will also call the B's B causing memory errors, so the object will be abandoned. In order to give programmers the chance to clean up these objects manually, gc will store these objects in the garbage list.

gc_list_init(&finalizers);
//The object of TP del is moved to the finalizers list
move_legacy_finalizers(&unreachable, &finalizers);
//Set to reach
move_legacy_finalizer_reachable(&finalizers);
//Stored in the garbage linked list for the programmer to handle
handle_legacy_finalizers(&finalizers, old);

Object cleanup


static void
delete_garbage(PyGC_Head *collectable, PyGC_Head *old)
{
    inquiry clear;

    while (!gc_list_is_empty(collectable)) {
        PyGC_Head *gc = collectable->gc.gc_next;
        PyObject *op = FROM_GC(gc);
//If DEBUG_SAVEALL is defined, it will be stored in the grabage linked list instead of being cleared
        if (_PyRuntime.gc.debug & DEBUG_SAVEALL) {
            PyList_Append(_PyRuntime.gc.garbage, op);
        }
        else {
            if ((clear = Py_TYPE(op)->tp_clear) != NULL) {
//Call tp'u clear
               Py_INCREF(op);
                clear(op); 
                Py_DECREF(op); 
            }
        }

tp_clear is to reference the object count-1, remove the object from unreachable, and release the object memory back to the memory pool

static int person_clear(PyObject *self){
    Person *p = (Person*)self;
    Py_CLEAR(p->dict);
    //PyObject_GC_Del
    Py_TYPE(self)->tp_free(self);
    return 0;
}

Tags: Python jvm REST

Posted on Wed, 06 Nov 2019 22:09:55 -0800 by FlashHeart