Streams & Concurrency

Chapter 6 — Overlapping computation with data transfers

What are CUDA Streams?

A CUDA stream is an ordered sequence of operations (kernel launches, memory copies) that the GPU executes in the order they are issued. Operations in different streams may overlap and run concurrently, which can substantially improve throughput.

By default, all operations go into stream 0 (the default stream). Under the legacy default-stream semantics, work in stream 0 implicitly synchronizes with every other stream, so nothing runs concurrently with it; blocking calls such as cudaMemcpy also synchronize with the host.

Why streams matter

Without streams, the GPU sits idle while data is being transferred. With streams, you can overlap kernel execution on one chunk of data with the transfer of the next chunk — keeping both the compute engine and the copy engine(s) busy at the same time.

Creating and Using Streams

streams_basic.cu
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Async copy in stream1
cudaMemcpyAsync(d_a, h_a, bytes,
                cudaMemcpyHostToDevice, stream1);

// Async copy in stream2
cudaMemcpyAsync(d_b, h_b, bytes,
                cudaMemcpyHostToDevice, stream2);

// Launch kernels in different streams
kernelA<<<blocks, threads, 0, stream1>>>(d_a);
kernelB<<<blocks, threads, 0, stream2>>>(d_b);

// Copy results back asynchronously
cudaMemcpyAsync(h_a, d_a, bytes,
                cudaMemcpyDeviceToHost, stream1);
cudaMemcpyAsync(h_b, d_b, bytes,
                cudaMemcpyDeviceToHost, stream2);

// Wait for everything to finish
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);

// Cleanup
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
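The calls above are shown bare for clarity, but every CUDA runtime call returns a cudaError_t worth checking. A common convenience is a checking macro — a sketch below; the macro name CUDA_CHECK is our own, not part of the CUDA API:

```cuda
#include <cstdio>
#include <cstdlib>

// Hypothetical error-checking macro: aborts with file/line info
// if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error %s at %s:%d\n",            \
                    cudaGetErrorString(err_), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Usage:
// CUDA_CHECK(cudaStreamCreate(&stream1));
// CUDA_CHECK(cudaMemcpyAsync(d_a, h_a, bytes,
//                            cudaMemcpyHostToDevice, stream1));
```

Kernel launches themselves return nothing; check them afterwards with cudaGetLastError().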
Pinned memory required

cudaMemcpyAsync only achieves truly asynchronous transfers with pinned (page-locked) host memory. Use cudaMallocHost instead of malloc for the host buffers; with pageable memory the copy silently falls back to synchronous behavior.
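A minimal sketch of allocating and freeing pinned buffers (buffer names and sizes are illustrative, matching the example above):

```cuda
float *h_a, *h_b;
size_t bytes = N * sizeof(float);

// Page-locked allocations: the copy engine can DMA directly
// from/to these buffers, so cudaMemcpyAsync can truly overlap
// with kernel execution.
cudaMallocHost(&h_a, bytes);
cudaMallocHost(&h_b, bytes);

// ... fill h_a / h_b, issue async copies and kernels ...

// Pinned memory must be released with cudaFreeHost, not free().
cudaFreeHost(h_a);
cudaFreeHost(h_b);
```

Pinned memory is a scarce resource: it cannot be paged out, so allocating too much of it can degrade overall system performance.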

Pipeline Pattern: Chunked Processing

The most common stream pattern is to split data into chunks and process them in a pipeline:

pipeline.cu
const int nStreams = 4;
const int chunkSize = N / nStreams;  // assumes N is divisible by nStreams
cudaStream_t streams[nStreams];

// Allocate pinned host memory and the matching device buffer
float *h_data, *d_data;
cudaMallocHost(&h_data, N * sizeof(float));
cudaMalloc(&d_data, N * sizeof(float));

for (int i = 0; i < nStreams; i++) {
    cudaStreamCreate(&streams[i]);
}

for (int i = 0; i < nStreams; i++) {
    int offset = i * chunkSize;
    size_t chunkBytes = chunkSize * sizeof(float);

    // 1. Copy chunk to GPU
    cudaMemcpyAsync(&d_data[offset], &h_data[offset],
                    chunkBytes, cudaMemcpyHostToDevice,
                    streams[i]);

    // 2. Process chunk
    int blocks = (chunkSize + 255) / 256;
    process<<<blocks, 256, 0, streams[i]>>>(
        &d_data[offset], chunkSize);

    // 3. Copy result back
    cudaMemcpyAsync(&h_data[offset], &d_data[offset],
                    chunkBytes, cudaMemcpyDeviceToHost,
                    streams[i]);
}

// Wait for all streams
cudaDeviceSynchronize();

for (int i = 0; i < nStreams; i++) {
    cudaStreamDestroy(streams[i]);
}
cudaFreeHost(h_data);
cudaFree(d_data);
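One caveat: on GPUs with a single copy engine, the "depth-first" issue order above (copy, kernel, copy per stream) can prevent overlap, because the device-to-host copy of chunk i is queued on the copy engine ahead of the host-to-device copy of chunk i+1. A commonly used alternative — sketched below with the same names as the example above — issues each operation type in its own loop ("breadth-first"):

```cuda
// Breadth-first issue order: all H2D copies first, then all
// kernels, then all D2H copies. Per-stream ordering is preserved,
// but the copy engine's queue no longer blocks the next chunk.
for (int i = 0; i < nStreams; i++) {
    int offset = i * chunkSize;
    cudaMemcpyAsync(&d_data[offset], &h_data[offset],
                    chunkSize * sizeof(float),
                    cudaMemcpyHostToDevice, streams[i]);
}
for (int i = 0; i < nStreams; i++) {
    int offset = i * chunkSize;
    process<<<(chunkSize + 255) / 256, 256, 0, streams[i]>>>(
        &d_data[offset], chunkSize);
}
for (int i = 0; i < nStreams; i++) {
    int offset = i * chunkSize;
    cudaMemcpyAsync(&h_data[offset], &d_data[offset],
                    chunkSize * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[i]);
}
```

Which order performs better depends on the device's number of copy engines, so it is worth profiling both variants on your target hardware.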

CUDA Events

Events let you measure timing and create inter-stream dependencies:

events.cu
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);
myKernel<<<blocks, threads, 0, stream>>>(args);
cudaEventRecord(stop, stream);

// Wait for the stop event
cudaEventSynchronize(stop);

float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
printf("Kernel took %.3f ms\n", milliseconds);

cudaEventDestroy(start);
cudaEventDestroy(stop);

Inter-Stream Dependencies

You can make one stream wait for an event recorded in another:

cudaEvent_t event;
cudaEventCreate(&event);

// Record event after kernel in stream1
kernelA<<<..., stream1>>>(args);
cudaEventRecord(event, stream1);

// Make stream2 wait for the event
cudaStreamWaitEvent(stream2, event, 0);
kernelB<<<..., stream2>>>(args);  // runs only after kernelA
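When an event is used purely for ordering like this, and never for timing, it can be created without timestamp bookkeeping, which makes recording it slightly cheaper:

```cuda
cudaEvent_t event;
// Disable timing for synchronization-only events; such an event
// cannot be passed to cudaEventElapsedTime.
cudaEventCreateWithFlags(&event, cudaEventDisableTiming);
```

Note that cudaStreamWaitEvent returns immediately on the host; only future work queued in stream2 waits for the event.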

Summary

  • Streams enable concurrent execution of kernels and memory transfers on different data chunks
  • Use cudaMemcpyAsync with pinned host memory for true async transfers
  • The chunked pipeline pattern overlaps copy↔compute for maximum throughput
  • CUDA Events provide precise GPU timing and inter-stream synchronization