Synchronization

Chapter 5 — Thread synchronization, barriers, and atomic operations

Why Synchronization Matters

When thousands of threads run in parallel, you often need to coordinate them. Without proper synchronization, threads can read stale data or produce non-deterministic results (race conditions).
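The problem is easy to reproduce. A minimal sketch of a race (the kernel name and counter layout are illustrative): every thread performs a plain read-modify-write on the same location, so increments are lost.

```cuda
// RACE: each thread does a non-atomic load, add, store on *counter.
// With n threads the final value is usually far less than n, and it
// varies from run to run.
__global__ void racyCount(int *counter, int n) {
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) {
        *counter = *counter + 1;  // three steps, any of which can interleave
    }
}
```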

Block-Level Synchronization: __syncthreads()

__syncthreads() is a barrier — all threads in a block must reach it before any thread can proceed past it. Use it whenever threads in the same block need to share data via shared memory.

parallel_reduction.cu

__global__ void blockSum(float *input, float *output, int n) {
    __shared__ float sdata[256];
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Load into shared memory
    sdata[tid] = (gid < n) ? input[gid] : 0.0f;
    __syncthreads();

    // Parallel reduction in shared memory
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            sdata[tid] += sdata[tid + stride];
        }
        __syncthreads();  // sync after each step!
    }

    // Thread 0 writes the block's sum
    if (tid == 0) {
        output[blockIdx.x] = sdata[0];
    }
}
Deadlock danger

Never place __syncthreads() inside a conditional branch that some threads in the block won't reach. Every thread in the block must execute the same __syncthreads() call; otherwise the behavior is undefined, and in practice the kernel usually hangs.
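A sketch of the broken pattern next to the fix (illustrative kernel, 256-thread blocks assumed):

```cuda
__global__ void barrierPlacement(const float *data) {
    __shared__ float tile[256];
    int tid = threadIdx.x;

    // WRONG: threads with tid >= 128 never reach this barrier,
    // so the threads that do reach it wait forever.
    // if (tid < 128) {
    //     tile[tid] = data[tid];
    //     __syncthreads();
    // }

    // CORRECT: keep the work in the branch, but hoist the barrier
    // out so every thread in the block executes it.
    if (tid < 128) {
        tile[tid] = data[tid];
    }
    __syncthreads();
}
```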

Warp-Level Execution

A warp is a group of 32 threads that execute instructions together (SIMT — Single Instruction, Multiple Threads). On pre-Volta GPUs, threads in a warp ran in strict lockstep. Since Volta (compute capability 7.0), independent thread scheduling allows threads within a warp to diverge and reconverge, which is why the modern warp primitives carry a _sync suffix and an explicit participation mask.

Modern CUDA (CUDA 9 and later) provides these warp-level primitives:

cuda
// Warp shuffle — exchange data between threads in a warp
// No shared memory needed!
float val = /* thread's value */;

// Get the value from the lane whose ID is one lower (shift up by 1)
float neighbor = __shfl_up_sync(0xFFFFFFFF, val, 1);

// Broadcast lane 0's value to all threads in the warp
float broadcast = __shfl_sync(0xFFFFFFFF, val, 0);

// Sum reduction within a warp (no shared memory!)
for (int offset = 16; offset > 0; offset >>= 1) {
    val += __shfl_down_sync(0xFFFFFFFF, val, offset);
}
// After this loop, lane 0 of each warp holds the warp's sum
The mask parameter 0xFFFFFFFF

The first argument to __shfl_*_sync is a 32-bit mask indicating which threads participate. 0xFFFFFFFF means all 32 threads in the warp.
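Masks also come back as results: __ballot_sync builds one from a per-thread predicate, with bit i set if lane i's predicate is true. A sketch (the kernel name and output layout are illustrative):

```cuda
// Count, within each warp, how many threads hold a positive value.
__global__ void countPositive(const float *data, int *warpCounts, int n) {
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (gid < n) ? data[gid] : -1.0f;

    // All 32 lanes participate (full mask); the result is a 32-bit
    // mask of the lanes whose predicate was true.
    unsigned positive = __ballot_sync(0xFFFFFFFF, val > 0.0f);

    // Lane 0 of each warp writes the population count of that mask.
    if ((threadIdx.x & 31) == 0) {
        warpCounts[gid >> 5] = __popc(positive);
    }
}
```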

Atomic Operations

When multiple threads need to update the same memory location, use atomic operations to prevent race conditions:

atomics.cu
__global__ void histogram(int *data, int *bins, int n) {
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) {
        int bin = data[gid];
        atomicAdd(&bins[bin], 1);  // safe concurrent increment
    }
}

// Available atomic operations:
// atomicAdd, atomicSub, atomicMin, atomicMax
// atomicAnd, atomicOr, atomicXor
// atomicExch  — swap
// atomicCAS   — compare-and-swap (most flexible)
Atomics are slow

Atomic operations serialize access, creating contention. To build a fast histogram, first accumulate in shared memory per-block, then use one atomic per block to update global memory.
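A sketch of that two-level scheme (NUM_BINS and the kernel name are illustrative): each block accumulates into a private shared-memory histogram, then flushes it with one global atomic per bin per block instead of one per input element.

```cuda
#define NUM_BINS 256

__global__ void histogramShared(const int *data, int *bins, int n) {
    __shared__ int localBins[NUM_BINS];

    // Cooperatively zero the block-private histogram
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        localBins[i] = 0;
    __syncthreads();

    // Accumulate in shared memory: fast atomics, contention limited
    // to the threads of one block
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        atomicAdd(&localBins[data[gid]], 1);
    __syncthreads();

    // One global atomic per bin per block
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        atomicAdd(&bins[i], localBins[i]);
}
```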

Cooperative Groups (Modern CUDA)

CUDA 9+ introduced Cooperative Groups — a flexible API for synchronization at various granularities:

cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void kernel() {
    // Get the thread block group
    cg::thread_block block = cg::this_thread_block();
    block.sync();  // equivalent to __syncthreads()

    // Get the group of currently active threads
    // (the coalesced part of this thread's warp)
    cg::coalesced_group active = cg::coalesced_threads();

    // Get a tile of 16 threads within a warp
    cg::thread_block_tile<16> tile =
        cg::tiled_partition<16>(block);
    tile.sync();
}
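Tiles expose warp shuffles as member functions, so the warp reduction from earlier can be written without raw masks. A sketch under the same assumptions (the kernel name is illustrative; output must be zeroed before launch):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void tileReduce(const float *input, float *output, int n) {
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (gid < n) ? input[gid] : 0.0f;

    // shfl_down on the tile manages the participation mask for you
    for (int offset = warp.size() / 2; offset > 0; offset >>= 1)
        val += warp.shfl_down(val, offset);

    if (warp.thread_rank() == 0)
        atomicAdd(output, val);  // one global atomic per warp
}
```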

Summary

  • __syncthreads() synchronizes all threads within a block — required when sharing data via shared memory
  • Warps (32 threads) execute in lockstep; warp shuffle functions exchange data without shared memory
  • Atomic operations (atomicAdd, atomicCAS, etc.) prevent race conditions on shared data
  • Cooperative Groups provide modern, flexible synchronization primitives