Streams & Concurrency
Chapter 6 — Overlapping computation with data transfers
What are CUDA Streams?
A CUDA stream is a sequence of operations (kernel launches, memory copies) that execute in the order they are issued. Operations in different streams may overlap and run concurrently, hiding transfer latency behind computation.
By default, all operations go into stream 0 (the default stream). The legacy default stream is a synchronizing stream: work issued to it does not overlap with work in other blocking streams, and synchronous calls such as cudaMemcpy also block the host until they complete.
Without streams, the GPU sits idle while data is being transferred. With streams, you can overlap kernel execution on one chunk of data with the transfer of the next chunk — utilizing the full bandwidth of both the compute engine and the copy engine simultaneously.
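Whether copy and compute can actually overlap depends on the hardware: the device needs at least one dedicated copy engine alongside its compute engine. A minimal capability check using the runtime API (querying device 0; adjust the index for multi-GPU systems) might look like:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // asyncEngineCount: 1 = one copy engine (overlap copies in one
    // direction with kernels), 2 = separate H2D and D2H engines
    // (overlap both directions simultaneously)
    printf("%s: %d copy engine(s), concurrent kernels: %s\n",
           prop.name, prop.asyncEngineCount,
           prop.concurrentKernels ? "yes" : "no");
    return 0;
}
```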
Creating and Using Streams
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
// Async copy in stream1
cudaMemcpyAsync(d_a, h_a, bytes,
cudaMemcpyHostToDevice, stream1);
// Async copy in stream2
cudaMemcpyAsync(d_b, h_b, bytes,
cudaMemcpyHostToDevice, stream2);
// Launch kernels in different streams
kernelA<<<blocks, threads, 0, stream1>>>(d_a);
kernelB<<<blocks, threads, 0, stream2>>>(d_b);
// Copy results back asynchronously
cudaMemcpyAsync(h_a, d_a, bytes,
cudaMemcpyDeviceToHost, stream1);
cudaMemcpyAsync(h_b, d_b, bytes,
cudaMemcpyDeviceToHost, stream2);
// Wait for everything to finish
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);
// Cleanup
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
cudaMemcpyAsync only achieves truly asynchronous transfers with pinned (page-locked) host memory. Use cudaMallocHost instead of malloc for the host buffers; with ordinary pageable memory, the copy silently falls back to a synchronous path.
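A minimal allocation pattern for the host buffers in the example above (the size `bytes` and the buffer names are assumed from the surrounding code):

```cuda
float *h_a, *h_b;
// Pinned (page-locked) allocations: the DMA engine can access these
// pages directly, so cudaMemcpyAsync can return to the host
// immediately instead of degrading to a synchronous copy.
cudaMallocHost(&h_a, bytes);
cudaMallocHost(&h_b, bytes);

// ... issue async copies and kernel launches as shown above ...

cudaFreeHost(h_a);   // pair every cudaMallocHost with cudaFreeHost
cudaFreeHost(h_b);
```

Pinned memory is a limited resource: over-allocating it can degrade overall system performance, so pin only the staging buffers involved in transfers.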
Pipeline Pattern: Chunked Processing
The most common stream pattern is to split data into chunks and process them in a pipeline:
const int nStreams = 4;
const int chunkSize = N / nStreams;  // assumes N % nStreams == 0
cudaStream_t streams[nStreams];
// Allocate pinned host memory and matching device memory
float *h_data, *d_data;
cudaMallocHost(&h_data, N * sizeof(float));
cudaMalloc(&d_data, N * sizeof(float));
for (int i = 0; i < nStreams; i++) {
cudaStreamCreate(&streams[i]);
}
for (int i = 0; i < nStreams; i++) {
int offset = i * chunkSize;
size_t chunkBytes = chunkSize * sizeof(float);
// 1. Copy chunk to GPU
cudaMemcpyAsync(&d_data[offset], &h_data[offset],
chunkBytes, cudaMemcpyHostToDevice,
streams[i]);
// 2. Process chunk
int blocks = (chunkSize + 255) / 256;
process<<<blocks, 256, 0, streams[i]>>>(
&d_data[offset], chunkSize);
// 3. Copy result back
cudaMemcpyAsync(&h_data[offset], &d_data[offset],
chunkBytes, cudaMemcpyDeviceToHost,
streams[i]);
}
// Wait for all streams
cudaDeviceSynchronize();
for (int i = 0; i < nStreams; i++) {
cudaStreamDestroy(streams[i]);
}
cudaFreeHost(h_data);
cudaFree(d_data);
CUDA Events
Events let you measure timing and create inter-stream dependencies:
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, stream);
myKernel<<<blocks, threads, 0, stream>>>(args);
cudaEventRecord(stop, stream);
// Wait for the stop event
cudaEventSynchronize(stop);
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
printf("Kernel took %.3f ms\n", milliseconds);
cudaEventDestroy(start);
cudaEventDestroy(stop);
Inter-Stream Dependencies
You can make one stream wait for an event recorded in another:
cudaEvent_t event;
cudaEventCreate(&event);
// Record event after kernel in stream1
kernelA<<<..., stream1>>>(args);
cudaEventRecord(event, stream1);
// Make stream2 wait for the event
cudaStreamWaitEvent(stream2, event, 0);
kernelB<<<..., stream2>>>(args); // runs only after kernelA
Summary
- Streams enable concurrent execution of kernels and memory transfers on different data chunks
- Use cudaMemcpyAsync with pinned host memory for true async transfers
- The chunked pipeline pattern overlaps copy↔compute for maximum throughput
- CUDA Events provide precise GPU timing and inter-stream synchronization