Welcome to the second tutorial on how to write high performance CUDA based applications. This tutorial will cover the basics of how to write a kernel, and how to organize threads, blocks, and grids. If you haven't read the first CUDA tutorial, it may be a good idea to go back and read it first. For this tutorial, we will complete the previous tutorial by writing a kernel function. The goal of this application is very simple: take two arrays of floating point numbers, perform an operation on them, and store the result in a third floating point array. We will then study how fast the code executes on a CUDA device and compare it to a traditional CPU; the data analysis will take place toward the end of the article.
Organizing threads

A critical part of designing CUDA applications is to organize threads, thread blocks, and grids appropriately. For this application, the simplest choice is to have each thread calculate one, and only one, element in the final result array. A general guideline is that a block should consist of at least 192 threads in order to hide memory access latency, so 256 and 512 threads are common and practical numbers. For the purposes of this tutorial, 256 threads per block is chosen.

It's best to think of a thread block as a 3-D brick of threads, and you may shape the block essentially any way you would like. For some applications, it may make sense to shape a block as 16 x 16 x 1. For our application, we are dealing with linear data, so it's probably simplest to keep the thread structure linear. Therefore, the blocks will all be shaped with dimensions 256 x 1 x 1. This block size is stored in a variable of type dim3, which will be used when calling the CUDA kernel; a sketch of such a declaration follows.
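Here is a minimal sketch of what that declaration might look like; the variable name blockSize is a hypothetical choice rather than one taken from the original code.

    dim3 blockSize(256, 1, 1);  // 256 threads per block, arranged linearly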
Now it's time to think about how we're going to structure the blocks. Exactly like the thread block, you may think of each grid as a 3-D brick, filled with blocks. Naturally, since our problem is linear, we would like the grid to have a linear structure as well. For this tutorial, I wanted to support having each array hold about 32 million elements.

    Size of each array = 1024 x 1024 x 32 = 33,554,432 total elements
    Number of threads each block will calculate = 256
    Number of blocks required = total number of elements / 256 = 131,072

Okay, so naturally, we would like our grid to have the dimensions 131,072 x 1 x 1. Unfortunately, the maximum size of any grid dimension is 65,535! Therefore, we are forced to choose another grid structure. For the purposes of this tutorial, I have chosen 128 x 1024 x 1, which yields the same 131,072 blocks.
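The grid dimensions can be expressed the same way. A minimal sketch, again with a hypothetical variable name:

    dim3 gridSize(128, 1024, 1);  // 128 blocks per grid row x 1024 rows = 131,072 blocks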
Now that you know what the thread structure will be like, we can write the kernel.

    __global__ void multiplyNumbersGPU(float *pDataA, float *pDataB, float *pResult)
    {
        int tid = (blockIdx.y * 128 * 256) + (blockIdx.x * 256) + threadIdx.x;  // This gives every thread a unique ID.
        pResult[tid] = pDataA[tid] * pDataB[tid];  // Each thread only multiplies one data element.
        // A more computationally expensive variation on the same operation:
        // pResult[tid] = sqrt(pDataA[tid] * pDataB[tid] / 12.34567f) * sin(pDataA[tid]);
    }
The first thing to notice is the __global__ keyword. This indicates that the function is a kernel: it is called from the host PC, but it runs on the CUDA device. The next thing you should notice is how each thread figures out exactly which data element it is responsible for computing. Each thread runs the same code, so the only way for a thread to differentiate itself from the other threads is to use its threadIdx and its blockIdx. In this example, the index of the array element that a thread is computing is determined by the block it's in and by its thread ID. Remember that we already set 256 threads per block, with 128 thread blocks per grid row; those are the constants that appear in the index calculation. blockIdx.y * 128 * 256 is equal to the number of threads in all rows of the grid above the current thread's position. blockIdx.x * 256 is equal to the number of threads in all columns of the current grid row before the current thread's block. Finally, threadIdx.x is equal to the number of threads before the current thread in the current block (remember that the threads within a thread block are organized linearly in this tutorial, so threadIdx.y isn't needed).
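The launch itself is not shown in this excerpt, so the following is a minimal sketch of how the kernel might be invoked with the sizes chosen above. The variable names (blockSize, gridSize, pDeviceA, pDeviceB, pDeviceResult) are hypothetical, and the device arrays are assumed to have been allocated and copied to the GPU as covered in the first tutorial.

    dim3 blockSize(256, 1, 1);    // 256 threads per block
    dim3 gridSize(128, 1024, 1);  // 131,072 blocks in total
    // Hypothetical device pointers, assumed to have been allocated with cudaMalloc beforehand.
    multiplyNumbersGPU<<<gridSize, blockSize>>>(pDeviceA, pDeviceB, pDeviceResult);

    // As a worked example, the thread with blockIdx.y = 2, blockIdx.x = 3, threadIdx.x = 5
    // computes tid = (2 * 128 * 256) + (3 * 256) + 5 = 65,536 + 768 + 5 = 66,309.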
Please feel free to download the source code from the bottom of this article.