Art of concurrent programming 06 composite & hierarchical spin locks

 

Composite lock

Consider the following observation: in a queue lock, only the thread at the front of the queue actually participates in the lock handoff. A scheme that balances a queue lock against a backoff lock is to allow only a small number of threads to wait in the queue at any time (a shorter queue means fewer threads competing for the lock), while the remaining threads use exponential backoff before retrying. It is also normal for a backed-off thread to time out and abandon the lock attempt entirely.

CompositeLock maintains a short, fixed-size array of lock nodes. Each thread trying to acquire the lock picks a node in the array at random. If the chosen node is in use by another thread, the current thread backs off for a while and then retries on the same node. Once a thread has claimed a node, it appends the node to the tail of the queue and spins on the node's predecessor. When the owner of the predecessor signals completion, the thread enters the critical section. When a thread leaves the critical section, or when it times out, it gives up ownership of its node so that another backed-off thread can claim it.

The tail field is an AtomicStampedReference<QNode>: it holds a reference to the queue's tail node together with a version stamp. The stamp avoids the ABA problem on updates. The tail field is either null or points to the node most recently inserted into the queue.
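The stamped-reference behaviour can be checked in isolation. This is a minimal sketch (the class name TailStampDemo is illustrative, not part of the lock code):

```java
import java.util.concurrent.atomic.AtomicStampedReference;

// Sketch of how the tail field pairs a reference with a version stamp:
// each successful CAS also bumps the stamp, so a stale observer cannot
// succeed even if the reference happens to match again (ABA protection).
public class TailStampDemo {
    public static void main(String[] args) {
        AtomicStampedReference<String> tail =
            new AtomicStampedReference<String>(null, 0);
        int[] stamp = {0};
        String cur = tail.get(stamp);  // read reference and stamp together
        boolean ok = tail.compareAndSet(cur, "nodeA", stamp[0], stamp[0] + 1);
        System.out.println(ok);               // true: ref and stamp matched
        System.out.println(tail.getStamp());  // stamp is now 1
        // a CAS carrying a stale stamp fails even though the reference matches
        System.out.println(tail.compareAndSet("nodeA", "nodeB", 0, 1)); // false
    }
}
```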

Four states of QNode:

WAITING: the node is linked into the queue, and its thread is either in the critical section or waiting to enter it.

RELEASED: the node's owner thread is ready to leave the critical section and release the lock.

ABORTED: the thread has given up acquiring the lock after its node was already enqueued.

FREE: the node is not in use: either no thread has claimed it, or its thread gave up acquiring the lock before the node was enqueued.
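These state transitions rest on atomic compare-and-set: a node is claimed by winning the FREE-to-WAITING transition. A minimal sketch (the class name QNodeStateDemo is illustrative):

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the FREE -> WAITING claim step: the node is "occupied" by
// whichever thread wins the compareAndSet from FREE to WAITING.
public class QNodeStateDemo {
    enum State { FREE, WAITING, RELEASED, ABORTED }
    public static void main(String[] args) {
        AtomicReference<State> state = new AtomicReference<>(State.FREE);
        boolean first  = state.compareAndSet(State.FREE, State.WAITING);
        boolean second = state.compareAndSet(State.FREE, State.WAITING);
        System.out.println(first);       // true: this "thread" claimed the node
        System.out.println(second);      // false: the node is already WAITING
        System.out.println(state.get()); // WAITING
    }
}
```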

 

acquireQNode Function Description:

1. The thread picks a node in the QNode array at random and tries to change its state from FREE to WAITING with a compare-and-set. (FREE means the node is not occupied by another thread; a successful transition to WAITING means the current thread now owns the node.) If the CAS succeeds, the node is returned for the subsequent steps.

2. If the CAS fails, the thread checks whether the node is in the ABORTED or RELEASED state. If it is, and the node is still the tail of the queue, the thread tries to unlink it by swinging the tail pointer. If the node is RELEASED, the tail is set to null, because a RELEASED tail means every request in the queue has already released the lock. If the node is ABORTED, the tail is set to the node referenced by its pred field, because the predecessor of an aborted node may still be waiting and must remain reachable in the queue. If the tail CAS succeeds, the node's state is set to WAITING and the node is returned.

 

3. If the node is neither ABORTED nor RELEASED, it is in the WAITING state and occupied by another thread, so the current thread simply backs off.

The acquireQNode function loops through the three steps above until it either successfully returns a QNode or times out.

 

spliceQNode Function Description:

This function splices the QNode obtained by the thread onto the tail of the queue and, on success, returns the previous tail node. If the thread times out while spinning on the splice, it throws TimeoutException.

 

Description of waitForPredecessor function:

The waitForPredecessor function spins on the predecessor node (pred) until its state becomes RELEASED, indicating that the thread occupying that node has released the lock. If the pred node's state is ABORTED, the function sets that node's state to FREE, removing it from the queue, and makes the aborted node's own pred the new predecessor to spin on. This repeats until either a timeout occurs or some predecessor's state becomes RELEASED. Once that happens, the function stops spinning, sets the predecessor's state to FREE, and the thread can enter the critical section.

 

Backoff

public class Backoff {
    final int minDelay, maxDelay;
    int limit;
    final Random random;

    public Backoff(int min, int max) {
        minDelay = min;
        maxDelay = max;
        limit = minDelay;
        random = new Random();
    }

    // Sleep for a random delay, then double the bound (capped at maxDelay).
    public void backoff() throws InterruptedException {
        int delay = random.nextInt(limit);
        limit = Math.min(maxDelay, 2 * limit);
        Thread.sleep(delay);
    }
}
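One way to see the exponential growth of the backoff bound is to print limit across successive calls. This sketch inlines a copy of the class above (BackoffDemo is an illustrative wrapper, with small delays so it runs instantly):

```java
import java.util.Random;

// Mirrors the article's Backoff: each call sleeps a random delay and
// doubles the upper bound, capped at maxDelay.
public class BackoffDemo {
    static class Backoff {
        final int minDelay, maxDelay;
        int limit;
        final Random random = new Random();
        Backoff(int min, int max) { minDelay = min; maxDelay = max; limit = minDelay; }
        void backoff() throws InterruptedException {
            int delay = random.nextInt(limit);
            limit = Math.min(maxDelay, 2 * limit);
            Thread.sleep(delay);
        }
    }
    public static void main(String[] args) throws InterruptedException {
        Backoff b = new Backoff(1, 8);
        for (int i = 0; i < 5; i++) {
            System.out.println("limit before call " + i + ": " + b.limit);
            b.backoff();
        }
        // limits go 1, 2, 4, 8, 8: doubling, then capped at maxDelay
    }
}
```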

 

CompositeLock

enum State {FREE, WAITING, RELEASED, ABORTED};
class QNode {
    AtomicReference<State> state;
    QNode pred;
    public QNode() {
      state = new AtomicReference<State>(State.FREE);
    }
}

public class CompositeLock implements Lock{
  private static final int SIZE = 4;
  private static final int MIN_BACKOFF = 1024;
  private static final int MAX_BACKOFF = 256 * MIN_BACKOFF;
  AtomicStampedReference<QNode> tail;
  QNode[] waiting;
  Random random;
  ThreadLocal<QNode> myNode = new ThreadLocal<QNode>() {
    protected QNode initialValue() { return null; };
  };
  
  public CompositeLock() {
    tail = new AtomicStampedReference<QNode>(null, 0);
    random = new Random();
    waiting = new QNode[SIZE];
    for (int i = 0; i < waiting.length; i++) {
      waiting[i] = new QNode();
    }
  }
  
  public boolean tryLock(long time, TimeUnit unit) throws InterruptedException {
    long patience = TimeUnit.MILLISECONDS.convert(time, unit);
    long startTime = System.currentTimeMillis();
    Backoff backoff = new Backoff(MIN_BACKOFF, MAX_BACKOFF);
    try {
      QNode node = acquireQNode(backoff, startTime, patience);
      QNode pred = spliceQNode(node, startTime, patience);
      waitForPredecessor(pred, node, startTime, patience);
      return true;
    } catch (TimeoutException e) {
      return false;
    }
  }
  
  public void unlock() {
    QNode acqNode = myNode.get();
    acqNode.state.set(State.RELEASED);
  }
  
  private boolean timeout(long startTime, long patience) {
    return System.currentTimeMillis() - startTime > patience;
  }
  
  private QNode acquireQNode(Backoff backoff, long startTime, long patience)
      throws TimeoutException, InterruptedException {
    QNode node = waiting[random.nextInt(SIZE)];
    QNode currTail;
    int[] currStamp = {0};
    while (true) {
      if (node.state.compareAndSet(State.FREE, State.WAITING)) {
        return node;
      }
      currTail = tail.get(currStamp);
      State state = node.state.get();
      if (state == State.ABORTED || state == State.RELEASED) {
        if (node == currTail) {
          QNode myPred = null;
          if (state == State.ABORTED) {
            myPred = node.pred;
          }
          if (tail.compareAndSet(currTail, myPred,
              currStamp[0], currStamp[0]+1)) {
            node.state.set(State.WAITING);
            return node;
          }
        }
      }
      backoff.backoff();
      if (timeout(startTime, patience)) {
        throw new TimeoutException();
      }
    }
  }
  
  private QNode spliceQNode(QNode node, long startTime, long patience)
      throws TimeoutException {
    QNode currTail;
    int[] currStamp = {0};
    // splice node into queue
    do {
      currTail = tail.get(currStamp);
      if (timeout(startTime, patience)) {
        node.state.set(State.FREE);
        throw new TimeoutException();
      }
    } while (!tail.compareAndSet(currTail, node,
        currStamp[0], currStamp[0]+1));
    return currTail;
  }
  
  private void waitForPredecessor(QNode pred, QNode node, long startTime, long patience)
      throws TimeoutException {
    // wait for predecessor to release lock
    int[] stamp = {0};
    if (pred == null) {
      myNode.set(node);
      return;
    }
    State predState = pred.state.get();
    while (predState != State.RELEASED) {
      if (predState == State.ABORTED) {
        QNode temp = pred;
        pred = pred.pred;
        temp.state.set(State.FREE);
      }
      if (timeout(startTime, patience)) {
        node.pred = pred;
        node.state.set(State.ABORTED);
        throw new TimeoutException();
      }
      predState = pred.state.get();
    }
    pred.state.set(State.FREE);
    myNode.set(node);
    return;
  }
}

CompositeLock has some attractive properties. When multiple threads back off, they do so on different memory locations, which reduces contention. Lock handoff is fast. Abandoning a lock request is trivial for a thread still in the backoff phase, and straightforward even for a thread that has already acquired a queue node. Its drawback is that it gives up first-come-first-served fairness.

 

Composite lock with fast path

Although the composite lock is designed to perform well under contention, performance in the absence of contention also matters. Ideally, a thread running alone should acquire the lock as cheaply as it would an uncontended TASLock. With the plain CompositeLock algorithm, however, a thread running alone must claim a node, splice it into the queue, and later reset the tail field when releasing the lock.

A fast path is a shortcut through a complex algorithm for the common uncontended case. The CompositeLock algorithm can be extended with a fast path in which a solitary thread acquires an idle lock without claiming a node or splicing it into the queue.

 

fastPathLock Function Description:

1. The function checks whether tail holds a reference to a non-null QNode. If it does, the thread is not running alone, the fast-path condition fails, and the function returns false.

2. The function then checks the stamp. FASTPATH is chosen as a value with a single high bit set (here 1 << 30), so (stamp & FASTPATH) != 0 exactly when the fast-path flag bit is set, meaning some thread already holds the fast-path lock; in that case the function returns false. (The & operator is a bitwise AND: a result bit is 1 only if both operand bits are 1.) If both checks pass, the function computes a new stamp with the version incremented and the FASTPATH bit set, and installs it with a compareAndSet that keeps the tail node null.

fastPathUnlock Function Description:

This function clears the FASTPATH bit from the stamp, leaving the incremented version number behind. For instance:

1. The stamp is 11.

2. fastPathLock() changes the stamp to (11 + 1) | FASTPATH = 1073741836.

3. fastPathUnlock() clears the FASTPATH bit, leaving the stamp at 12.
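The stamp arithmetic can be verified directly with the actual constants (the class name FastPathStampDemo is illustrative):

```java
// The top bit (FASTPATH) marks "fast-path lock held"; the remaining
// bits carry the version number that fastPathLock increments.
public class FastPathStampDemo {
    static final int FASTPATH = 1 << 30;
    public static void main(String[] args) {
        int stamp = 11;
        int locked = (stamp + 1) | FASTPATH;  // fastPathLock: bump version, set flag
        System.out.println(locked);                    // 1073741836
        System.out.println((locked & FASTPATH) != 0);  // true: flag is set
        int unlocked = locked & ~FASTPATH;    // fastPathUnlock: clear flag
        System.out.println(unlocked);                  // 12
    }
}
```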

 

fastPathWait Function Description:

The fastPathWait() function is called on the slow path to ensure that no other thread holds the fast-path lock: it spins until the FASTPATH flag is cleared before returning.

CompositeFastPathLock

public class CompositeFastPathLock extends CompositeLock {
  private static final int FASTPATH = 1 << 30;
  public int fastPathTaken = 0;
  
  public boolean tryLock(long time, TimeUnit unit) throws InterruptedException {
    if (fastPathLock()) {
      fastPathTaken++;
      return true;
    }
    if (super.tryLock(time, unit)) {
      fastPathWait();
      return true;
    }
    return false;
  }
  
  public void unlock() {
    if (!fastPathUnlock()) {
      super.unlock();
    }
  }
  
  private boolean fastPathLock() {
    int oldStamp, newStamp;
    int stamp[] = {0};
    QNode qnode;
    qnode = tail.get(stamp);
    oldStamp = stamp[0];
    if (qnode != null) {
      return false;
    }
    if ((oldStamp & FASTPATH) != 0) {
      return false;
    }
    newStamp = (oldStamp + 1) | FASTPATH;   // set flag
    return tail.compareAndSet(qnode, null, oldStamp, newStamp);
  }
  
  private boolean fastPathUnlock() {
    int oldStamp, newStamp;
    oldStamp = tail.getStamp();
    if ((oldStamp & FASTPATH) == 0) {
      return false;
    }
    int[] stamp = {0};
    QNode qnode;
    do {
      qnode = tail.get(stamp);
      oldStamp = stamp[0];
      newStamp = oldStamp & (~FASTPATH);   // unset flag
    } while (!tail.compareAndSet(qnode, qnode, oldStamp, newStamp));
    return true;
  }
  
  private void fastPathWait() {
    while ((tail.getStamp() & FASTPATH) != 0) {} // spin while flag is set
  }
}

 

Hierarchical lock

Most cache-coherent architectures today organize processors into clusters, where communication within a cluster is much faster than communication between clusters. For example, a cluster may be a group of processors sharing memory through a fast interconnect, or the threads running on a single core of a multicore machine. This section considers locks that are sensitive to such locality differences. They are called hierarchical locks because their design must account for the hierarchical memory structure and access costs of the machine.

A machine's memory hierarchy can have many levels; for simplicity, assume there are only two. Consider an architecture made up of clusters of processors, where processors in the same cluster communicate efficiently through a shared cache, and inter-cluster communication is much more expensive than intra-cluster communication. Assume each cluster has a unique cluster ID, known to every thread in the cluster and obtainable via ThreadID.getCluster(), and that threads do not migrate between clusters.
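ThreadID.getCluster() is assumed by the text rather than defined. A purely hypothetical stand-in (not a real topology query) could assign cluster IDs round-robin from per-thread IDs:

```java
// Hypothetical stand-in for the ThreadID helper the article assumes.
// Real code would query the machine topology; here we just derive a
// cluster from a per-thread id, round-robin over CLUSTERS clusters.
public class ThreadID {
    static final int CLUSTERS = 4;
    private static int nextId = 0;
    private static final ThreadLocal<Integer> id = ThreadLocal.withInitial(() -> {
        synchronized (ThreadID.class) { return nextId++; }
    });
    public static int get() { return id.get(); }
    public static int getCluster() { return get() % CLUSTERS; }
    public static void main(String[] args) {
        System.out.println(ThreadID.getCluster()); // main thread gets id 0
    }
}
```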

 

Hierarchical backoff lock

A test-and-test-and-set lock adapts easily to clusters. Suppose thread A holds the lock. If the threads in A's cluster use a shorter backoff time, then when the lock is released, a local thread is more likely to acquire it than a remote thread, which reduces the overall time needed to hand off lock ownership.

One drawback of HBOLock is that it may exploit locality too aggressively: threads in one cluster can keep passing the lock among themselves while threads in other clusters starve. In addition, acquiring and releasing the lock invalidates remotely cached copies, which is expensive on architectures that maintain cache coherence.

HBOLock

public class HBOLock implements Lock {
  private static final int LOCAL_MIN_DELAY = 8;
  private static final int LOCAL_MAX_DELAY = 256;
  private static final int REMOTE_MIN_DELAY = 256;
  private static final int REMOTE_MAX_DELAY = 1024;
  private static final int FREE = -1;
  AtomicInteger state;
  
  public HBOLock() {
    state = new AtomicInteger(FREE);
  }
  
  public void lock() {
    int myCluster = ThreadID.getCluster();
    Backoff localBackoff = new Backoff(LOCAL_MIN_DELAY, LOCAL_MAX_DELAY);
    Backoff remoteBackoff = new Backoff(REMOTE_MIN_DELAY, REMOTE_MAX_DELAY);
    while (true) {
      if (state.compareAndSet(FREE, myCluster)) {
        return;
      }
      int lockState = state.get();
      try {
        if (lockState == myCluster) {
          localBackoff.backoff();
        } else {
          remoteBackoff.backoff();
        }
      } catch (InterruptedException ex) {
      }
    }
  }
  
  public void unlock() {
    state.set(FREE);
  }
}
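The core handoff in HBOLock is a single CAS on an AtomicInteger that records the holder's cluster. This single-threaded sketch (HBOLockDemo and its getCluster stub are illustrative) exercises that transition:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Single-threaded exercise of the HBOLock state transitions:
// FREE (-1) -> holder's cluster id on lock, back to FREE on unlock.
public class HBOLockDemo {
    static final int FREE = -1;
    static final AtomicInteger state = new AtomicInteger(FREE);
    static int getCluster() { return 0; } // stand-in for ThreadID.getCluster()
    public static void main(String[] args) {
        boolean acquired = state.compareAndSet(FREE, getCluster()); // lock
        System.out.println(acquired);    // true
        System.out.println(state.get()); // 0: lock records the holder's cluster
        // a second attempt fails until the holder releases
        System.out.println(state.compareAndSet(FREE, getCluster())); // false
        state.set(FREE);                 // unlock
        System.out.println(state.get()); // -1
    }
}
```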

 

Hierarchical CLH queue lock

To exploit clusters in a more balanced way, consider the design of a hierarchical queue lock. The key problem is reconciling conflicting fairness requirements: to keep communication costs low, we want consecutive lock handoffs within the same cluster, yet we also want enough global fairness that remote lock requests are not deferred too far behind local ones. We balance these goals by batching requests from the same cluster and scheduling them together.

HCLHLock consists of a collection of local queues, one per cluster, and a single global queue. Each queue is a linked list of nodes, but the links are implicit: each list is maintained through the thread-local fields currNode and predNode.

 

QNode has three virtual fields:

1. clusterID: the cluster of the node's current (or most recent) owner.

2. successorMustWait: set to true before the node is enqueued and set to false when the lock is released. A thread waiting for the lock may stop spinning once its predecessor's successorMustWait becomes false.

3. tailWhenSpliced: indicates whether the node was the last node of its batch when the batch was spliced into the global queue.

These fields are called virtual because they must be updated atomically as a unit: they are packed as bit fields into a single AtomicInteger, and their values are read and written with simple shift-and-mask operations.
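The packing can be illustrated with the same masks the QNode class below uses (the class name VirtualFieldsDemo is illustrative):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the "virtual fields": clusterID, successorMustWait and
// tailWhenSpliced all live in one AtomicInteger, so a thread can read
// or update all three atomically with mask operations.
public class VirtualFieldsDemo {
    static final int TWS_MASK = 0x80000000;     // tailWhenSpliced bit
    static final int SMW_MASK = 0x40000000;     // successorMustWait bit
    static final int CLUSTER_MASK = 0x3FFFFFFF; // low 30 bits: clusterID
    public static void main(String[] args) {
        AtomicInteger state = new AtomicInteger(0);
        // pack: cluster 5, successorMustWait = true, tailWhenSpliced = false
        state.set((5 & CLUSTER_MASK) | SMW_MASK);
        System.out.println(state.get() & CLUSTER_MASK);    // 5
        System.out.println((state.get() & SMW_MASK) != 0); // true
        System.out.println((state.get() & TWS_MASK) != 0); // false
    }
}
```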

The lock() function first appends the thread's node to its local queue, then waits until the thread either acquires the lock and enters the critical section, or its node becomes the head of the local queue. In the latter case the thread becomes the cluster master, responsible for splicing the local queue into the global queue. When the node is initialized, its successorMustWait field is true, its tailWhenSpliced field is false, and its clusterID is the calling thread's cluster. The thread appends its node to the tail of the local queue and then calls waitForGrantOrClusterMaster() on its predecessor, spinning until one of the following holds:

1. The predecessor is from the same cluster, and both its tailWhenSpliced and successorMustWait flags are false.

2. The predecessor is from a different cluster, or the predecessor's tailWhenSpliced flag is true.

In the first case, the thread's node is now at the head of the global queue and its predecessor has released the lock, so the thread enters the critical section and returns.

In the second case, the thread's node is at the head of the local queue, so the thread is the cluster master and is responsible for splicing the local queue into the global queue.

To see why, note that in the second case either the predecessor's tailWhenSpliced flag is true, or the predecessor belongs to a different cluster. If the clusters differ, the predecessor cannot be in this thread's local queue: it must already have been moved to the global queue and recycled by a thread in another cluster. If, on the other hand, the predecessor's tailWhenSpliced flag is true, then the predecessor was the last node spliced into the global queue, so this thread's node is now at the head of the local queue. The node itself cannot have been moved to the global queue, because only the cluster master, whose node is at the head of the local queue, moves nodes into the global queue.

As cluster master, the thread's task is to splice the local queue onto the global queue. Every thread in the local batch continues spinning on its predecessor. Once spliced into the global queue, the cluster master enters the critical section when its predecessor's successorMustWait flag becomes false, just as in an ordinary CLHLock.

HCLHLock

public class HCLHLock implements Lock {
    
  static final int MAX_CLUSTERS = 32;
  List<AtomicReference<QNode>> localQueues;
  AtomicReference<QNode> globalQueue;
 
  ThreadLocal<QNode> currNode = new ThreadLocal<QNode>() {
    protected QNode initialValue() { return new QNode(); };
  };
 
 ThreadLocal<QNode> predNode = new ThreadLocal<QNode>() {
    protected QNode initialValue() { return null; };
 };
  
  public HCLHLock() {
    localQueues = new ArrayList<AtomicReference<QNode>>(MAX_CLUSTERS);
    for (int i = 0; i < MAX_CLUSTERS; i++) {
      localQueues.add(new AtomicReference<QNode>());
    }
    QNode head = new QNode();
    globalQueue = new AtomicReference<QNode>(head);
    
  }
  
  public void lock() {
    QNode myNode = currNode.get();
    AtomicReference<QNode> localQueue = localQueues.get(ThreadID.getCluster());
    // splice my QNode into local queue
    QNode myPred = null;
    do {
      myPred = localQueue.get();
    } while (!localQueue.compareAndSet(myPred, myNode));
    if (myPred != null) {
      boolean iOwnLock = myPred.waitForGrantOrClusterMaster();
      if (iOwnLock) {
        // I have the lock. Save QNode just released by previous leader
        predNode.set(myPred);
        return;
      }
    }
    // At this point I am the cluster master.
    // Splice local queue into global queue.
    QNode localTail = null;
    do {
      myPred = globalQueue.get();
      localTail = localQueue.get();
    } while(!globalQueue.compareAndSet(myPred, localTail));
    // inform successor it is the new master
    localTail.setTailWhenSpliced(true);
    // wait for predecessor to release lock
    while (myPred.isSuccessorMustWait()) {};
    // I have the lock. Save QNode just released by previous leader
    predNode.set(myPred);
    return;
  }
  
  public void unlock() {
    QNode myNode = currNode.get();
    myNode.setSuccessorMustWait(false);
    // promote pred node to current
    QNode node = predNode.get();
    node.unlock();
    currNode.set(node);
  }
  
  static class QNode {
    // private boolean tailWhenSpliced;
    private static final int TWS_MASK = 0x80000000;
    // private boolean successorMustWait= false;
    private static final int SMW_MASK = 0x40000000;
    // private int clusterID;
    private static final int CLUSTER_MASK = 0x3FFFFFFF;
    AtomicInteger state;
    public QNode() {
      state = new AtomicInteger(0);
    }
    boolean waitForGrantOrClusterMaster() {
      int myCluster = ThreadID.getCluster();
      while(true) {
        if (getClusterID() == myCluster &&
            !isTailWhenSpliced() &&
            !isSuccessorMustWait()) {
          return true;
        } else if (getClusterID() != myCluster || isTailWhenSpliced()) {
          return false;
        }
      }
    }
    public void unlock() {
      int oldState = 0;
      int newState = ThreadID.getCluster();
      // successorMustWait = true;
      newState |= SMW_MASK;
      // tailWhenSpliced = false;
      newState &= (~TWS_MASK);
      do {
        oldState = state.get();
      } while (! state.compareAndSet(oldState, newState));
    }
    
    public int getClusterID() {
      return state.get() & CLUSTER_MASK;
    }
    
    public void setClusterID(int clusterID) {
      int oldState, newState;
      do {
        oldState = state.get();
        newState = (oldState & ~CLUSTER_MASK) | clusterID;
      } while (! state.compareAndSet(oldState, newState));
    }
    
    public boolean isSuccessorMustWait() {
      return (state.get() & SMW_MASK) != 0;
    }
    
    public void setSuccessorMustWait(boolean successorMustWait) {
      int oldState, newState;
      do {
        oldState = state.get();
        if (successorMustWait) {
          newState = oldState | SMW_MASK;
        } else {
          newState = oldState & ~SMW_MASK;
        }
      } while (! state.compareAndSet(oldState, newState));
    }
    
    public boolean isTailWhenSpliced() {
      return (state.get() & TWS_MASK) != 0;
    }
    
    public void setTailWhenSpliced(boolean tailWhenSpliced) {
      int oldState, newState;
      do {
        oldState = state.get();
        if (tailWhenSpliced) {
          newState = oldState | TWS_MASK;
        } else {
          newState = oldState & ~TWS_MASK;
        }
      } while (! state.compareAndSet(oldState, newState));
    }
    
  }
}

 

Past highlights

Art of concurrent programming 01 basics of concurrent programming

Art of concurrent programming 02 filter lock algorithm

Art of concurrent programming 03 bakery mutual exclusion algorithm

Art of concurrent programming 04-TAS spin lock

Art of concurrent programming 05 queue spin lock

Daily reading computer bus system

What the hell is the memory barrier?

Learn about cache consistency protocol MESI


Tags: Programming

Posted on Wed, 25 Mar 2020 01:56:00 -0700 by bengaltgrs