Queued Spinlocks in the WRK

Windows Research Kernel @ HPI

A few days ago, we had a discussion about the advantages of queued spinlocks over normal spinlocks. The biggest advantage, in our opinion, is that queued spinlocks guarantee a FIFO ordering among competing processors, while normal spinlocks don't. In this article, we show the implementation of queued spinlocks in the WRK. We present the source code of the 64-bit version for two reasons: first, the 64-bit version contains the implementation in plain C rather than in assembly language, and second, in the 32-bit version, queued spinlocks are implemented in the HAL, which is not available as source code. The rationale behind the implementation remains the same, however.

Queued spinlocks come in two flavors in the WRK: numbered queued spinlocks and in-stack queued spinlocks. In this article, we only cover the latter, as they are part of the exposed kernel API and can therefore also be used by driver developers.

The basic idea of a queued spinlock is that all competing CPUs or cores form a queue, and upon release, the lock is handed to the CPU at the head of the queue. Queued spinlocks use the following fundamental data structures:

typedef struct _KSPIN_LOCK_QUEUE
{
    struct _KSPIN_LOCK_QUEUE * volatile Next;   // next waiter in the queue, if any
    PKSPIN_LOCK volatile Lock;                  // the spinlock itself; low bits carry status
} KSPIN_LOCK_QUEUE, *PKSPIN_LOCK_QUEUE;

typedef struct _KLOCK_QUEUE_HANDLE
{
    KSPIN_LOCK_QUEUE LockQueue;                 // this CPU's entry in the waiter queue
    KIRQL OldIrql;                              // IRQL to restore upon release
} KLOCK_QUEUE_HANDLE, *PKLOCK_QUEUE_HANDLE;

The KSPIN_LOCK_QUEUE data structure is defined in the public\inc\sdk\ntkeapi.h file, so it ships with every Windows Driver Development Kit. We will return to the meaning of these fields a bit later.

The WRK and later Windows versions expose the following API for in-stack queued spinlocks:

  • KeAcquireInStackQueuedSpinLock
  • KeAcquireInStackQueuedSpinLockAtDpcLevel
  • KeReleaseInStackQueuedSpinLock
  • KeReleaseInStackQueuedSpinLockFromDpcLevel

In the XxxDpcLevel functions, the WRK does not change the IRQL. So, if your driver or kernel code is already executing at DPC level or higher, you may want to use these APIs. Otherwise, the WRK first raises the IRQL to DPC level on acquisition and lowers it to the previous level upon release. As raising and lowering the IRQL is not of interest here, we only cover the XxxDpcLevel functions.
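Before we dive into the implementation, here is a minimal usage sketch of the non-DpcLevel variants. Only the Ke* routines are part of the actual API; DeviceLock, SharedCounter, and IncrementSharedCounter are hypothetical names for this example. Note that the KLOCK_QUEUE_HANDLE lives on the caller's stack:

KSPIN_LOCK DeviceLock;              // shared; initialized once with KeInitializeSpinLock
LONG SharedCounter;                 // state protected by DeviceLock

VOID
IncrementSharedCounter (
    VOID
    )
{
    KLOCK_QUEUE_HANDLE LockHandle;  // per-acquisition state on this thread's stack

    KeAcquireInStackQueuedSpinLock(&DeviceLock, &LockHandle);
    SharedCounter += 1;             // critical section
    KeReleaseInStackQueuedSpinLock(&LockHandle);
}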

Acquiring a Queued Spinlock

The signature for acquiring a queued spinlock is as follows:

VOID
KeAcquireInStackQueuedSpinLockAtDpcLevel (
    __inout PKSPIN_LOCK SpinLock,
    __out PKLOCK_QUEUE_HANDLE LockHandle
    )

The first parameter is a pointer to the spinlock that should be acquired, while the second parameter is a pointer to the lock queue handle whose embedded queue entry will be used to enqueue the CPU if necessary. The second parameter may, and should, point to a stack location of the current thread.

First, the KLOCK_QUEUE_HANDLE data structure is initialized (base\ntos\ke\amd64\queuelock.c:729):

{
    //
    // Acquire the specified in stack queued spin lock at the current
    // IRQL.
    //

#if !defined(NT_UP)

    LockHandle->LockQueue.Lock = SpinLock;
    LockHandle->LockQueue.Next = NULL;

#else

    UNREFERENCED_PARAMETER(SpinLock);

#endif

    KxAcquireQueuedSpinLock(&LockHandle->LockQueue, SpinLock);
    return;
}

The actual acquisition is done in KxAcquireQueuedSpinLock, an internal helper function. It should be clear that on uniprocessor machines spinlocks are unnecessary: once the IRQL is raised to DPC level, no dispatching takes place, which means the currently running thread holds the lock and runs until completion. That is the reason for the #ifdefs in the above and following code fragments. The implementation of KxAcquireQueuedSpinLock (base\ntos\ke\amd64\queuelock.c:114) is as follows:

__forceinline
VOID
KxAcquireQueuedSpinLock (
    __inout PKSPIN_LOCK_QUEUE LockQueue,
    __inout PKSPIN_LOCK SpinLock
    )
{

    //
    // [...]
    //

#if !defined(NT_UP)

    PKSPIN_LOCK_QUEUE TailQueue;

    TailQueue = InterlockedExchangePointer((PVOID *)SpinLock, LockQueue);
    if (TailQueue != NULL) {
        KxWaitForLockOwnerShip(LockQueue, TailQueue);
    }

#else

    UNREFERENCED_PARAMETER(LockQueue);
    UNREFERENCED_PARAMETER(SpinLock);

#endif

    return;
}

First, the value of SpinLock is atomically exchanged with the address of our lock queue entry (PKSPIN_LOCK_QUEUE), and the previous value is stored in TailQueue. If TailQueue is NULL, the lock was free and the caller now owns it. In that case, we are done.
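For illustration only, the atomic exchange behaves like the following sketch; ExchangeTailSketch is our name, not a WRK function, and in reality both steps execute as one indivisible operation:

PKSPIN_LOCK_QUEUE
ExchangeTailSketch (
    PKSPIN_LOCK SpinLock,
    PKSPIN_LOCK_QUEUE LockQueue
    )
{
    PKSPIN_LOCK_QUEUE OldTail;

    OldTail = (PKSPIN_LOCK_QUEUE)*SpinLock;  // previous tail, or NULL if the lock was free
    *SpinLock = (KSPIN_LOCK)LockQueue;       // our entry becomes the new tail of the queue
    return OldTail;                          // really performed as ONE atomic step
}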

If TailQueue is not NULL, it holds the address of our predecessor in the waiter queue. Our predecessor is not necessarily the current lock owner; there may be more waiters ahead of us. In any case, we must now wait for the lock to be handed to us. This is done in the function KxWaitForLockOwnerShip (base\ntos\ke\amd64\queuelock.c:62):

DECLSPEC_NOINLINE
ULONG64
KxWaitForLockOwnerShip (
    __inout PKSPIN_LOCK_QUEUE LockQueue,
    __inout PKSPIN_LOCK_QUEUE TailQueue
    )
{

    ULONG64 SpinCount;

    //
    // [...]
    //

    *((ULONG64 volatile *)&LockQueue->Lock) |= LOCK_QUEUE_WAIT;
    TailQueue->Next = LockQueue;

    //
    // [...]
    //

    SpinCount = 0;
    do {
        KeYieldProcessor();
    } while ((*((ULONG64 volatile *)&LockQueue->Lock) & LOCK_QUEUE_WAIT) != 0);

    KeMemoryBarrier();
    return SpinCount;
}

What happens here is that we first set the LOCK_QUEUE_WAIT bit in the Lock field of our own KSPIN_LOCK_QUEUE structure. Then, we update the Next field in our predecessor's queue entry; this is necessary to inform the preceding CPU that someone else is waiting. Finally, we spin on our own Lock field until the LOCK_QUEUE_WAIT bit is cleared. Once the bit is cleared, the spinlock has been acquired. Please note that each CPU spins on its own, typically in-stack, location. This is in contrast to normal spinlocks, where all competing CPUs spin on one global variable, which may degrade performance due to cache-coherency traffic.
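To make the contrast concrete, a conventional test-and-set spinlock looks roughly like the following sketch (ours, not WRK code). Every waiter polls the one shared lock word, so each handover bounces that cache line across all waiting CPUs, whereas a queued-spinlock waiter polls only its own queue entry:

VOID
PlainSpinLockSketch (
    LONG64 volatile *SpinLock
    )
{
    while (InterlockedExchange64(SpinLock, 1) != 0) {
        do {
            KeYieldProcessor();     // every waiter spins on the SAME shared location
        } while (*SpinLock != 0);
    }
}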

Interestingly, the SpinCount variable that is supposed to return the number of spin loops is defined, initialized, and returned, but never updated. Perhaps updating it was omitted for performance reasons, or it is only enabled in debug versions of the kernel.

Releasing a Queued Spinlock

When KeReleaseInStackQueuedSpinLockFromDpcLevel is called, it basically redirects control to KxReleaseQueuedSpinLock (base\ntos\ke\amd64\queuelock.c:227):

__forceinline
VOID
KxReleaseQueuedSpinLock (
    __inout PKSPIN_LOCK_QUEUE LockQueue
    )
{

    //
    // [...]
    //

#if !defined(NT_UP)

    PKSPIN_LOCK_QUEUE NextQueue;

    NextQueue = ReadForWriteAccess(&LockQueue->Next);
    if (NextQueue == NULL) {
        if (InterlockedCompareExchangePointer((PVOID *)LockQueue->Lock,
                                              NULL,
                                              LockQueue) == LockQueue) {
            return;
        }

        NextQueue = KxWaitForLockChainValid(LockQueue);
    }

    ASSERT(((ULONG64)NextQueue->Lock & LOCK_QUEUE_WAIT) != 0);

    InterlockedXor64((LONG64 volatile *)&NextQueue->Lock, LOCK_QUEUE_WAIT);
    LockQueue->Next = NULL;

#else

    UNREFERENCED_PARAMETER(LockQueue);

#endif

    return;
}

What happens here is that we first determine whether any processors are waiting for the lock by examining the Next field. If it is NULL, we appear to be the last (and only) queued processor and must free the lock, which is released by setting it to NULL. However, we must do this with an atomic compare-and-exchange operation to make sure that no other processor has updated the spinlock in the meantime. InterlockedCompareExchangePointer sets the value that LockQueue->Lock points to to NULL if and only if it still contains the address of LockQueue. If it succeeds, it returns the LockQueue address, and the lock has been released.
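Again purely for illustration, the compare-exchange behaves like this sketch; CompareExchangeSketch is our name, and in reality the whole body executes as one indivisible operation:

PVOID
CompareExchangeSketch (
    PVOID volatile *Destination,
    PVOID Exchange,
    PVOID Comparand
    )
{
    PVOID Old;

    Old = *Destination;
    if (Old == Comparand) {
        *Destination = Exchange;    // release only succeeds if we are still the queue tail
    }
    return Old;                     // really performed as ONE atomic step
}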

If the compare-exchange fails, it returns the current value of the spinlock instead. Unfortunately, that address does not necessarily identify the next processor to acquire the lock, as multiple acquisition attempts may have been made in the meantime and the spinlock only holds the address of the last queue entry.

To retrieve the actual next processor, our CPU must spin-wait until its successor has finished its initialization, that is, until the succeeding CPU has set the Next field in our KSPIN_LOCK_QUEUE entry. The busy-wait is performed in the KxWaitForLockChainValid (base\ntos\ke\amd64\queuelock.c:24) function:

DECLSPEC_NOINLINE
PKSPIN_LOCK_QUEUE
KxWaitForLockChainValid (
    __inout PKSPIN_LOCK_QUEUE LockQueue
    )
{

    PKSPIN_LOCK_QUEUE NextQueue;

    // [...]

    do {
        KeYieldProcessor();
    } while ((NextQueue = LockQueue->Next) == NULL);

    return NextQueue;
}

It spins until the LockQueue->Next entry is no longer NULL. At that point, our successor in the queue has finished its initialization and is spinning on the LOCK_QUEUE_WAIT bit.

To inform the next CPU that the spinlock has been released, InterlockedXor64 is called to clear the LOCK_QUEUE_WAIT bit in the NextQueue->Lock field. Once the bit is cleared, ownership of the lock automatically passes to the waiting processor, and after cleaning up the KSPIN_LOCK_QUEUE structure, the function returns.
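To see why a plain XOR is safe here, recall what the successor's Lock field holds while it is spinning. The wait bit is known to be set (see the ASSERT above), so the XOR flips exactly that bit and leaves the pointer bits untouched:

//
// While the successor spins in KxWaitForLockOwnerShip:
//
//     NextQueue->Lock == (ULONG64)SpinLock | LOCK_QUEUE_WAIT
//
// After InterlockedXor64 with LOCK_QUEUE_WAIT:
//
//     NextQueue->Lock == (ULONG64)SpinLock
//
// The successor observes the cleared bit, leaves its spin loop, and
// now owns the lock.
//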

A nice feature of this implementation is that if the user of the queued spinlock API allocates the storage for the KLOCK_QUEUE_HANDLE structure on the stack, the memory management of the lock queue entries comes for free.
