Learning CUDA: Step 02: Device code

1. File extension

The file extension must be .cu when compiling with the nvcc command.

nano main.cu
#include <stdio.h>
#include <stdlib.h>

2. Get GPU card information

int main()
{
    cudaError_t cuErr;
    int cuDev;
    struct cudaDeviceProp cuDevProp;
    cuErr = cudaGetDevice(&cuDev);
    if(cuErr == cudaSuccess){
        printf("Device %d\r\n",cuDev);
        cuErr = cudaGetDeviceProperties(&cuDevProp,cuDev);
        if(cuErr == cudaSuccess){
            printf("Device Properties:\r\n");
            printf("- Name                 : %s\r\n",cuDevProp.name);
            printf("- Total Global Mem     : %lu\r\n",cuDevProp.totalGlobalMem);
            printf("- Shared Mem Per Block : %lu\r\n",cuDevProp.sharedMemPerBlock);
            printf("- Max Threads Per Block: %d\r\n",cuDevProp.maxThreadsPerBlock);
            printf("- Compute Mode         : %d\r\n",cuDevProp.computeMode);
        }
        else{
            printf("Failed to get Device Properties!\r\n");
        }
    }
    else{
        printf("Failed to get Device!\r\n");
    }
    return 0;
}

Compile and test

nvcc main.cu
./a.out

Result

ltkhanh@ServerTX:~/cuda$ ./a.out 
Device 0
Device Properties:
- Name                 : GeForce GT 630
- Total Global Mem     : 2081619968
- Shared Mem Per Block : 49152
- Max Threads Per Block: 1024
- Compute Mode         : 0 
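If the machine has more than one GPU, cudaGetDeviceCount() lets us enumerate them all instead of querying only the current device. A minimal sketch (same runtime API as above; it needs a CUDA-capable machine to run):

```cuda
#include <stdio.h>

int main(void)
{
    int count = 0;
    cudaError_t cuErr = cudaGetDeviceCount(&count);
    if (cuErr != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\r\n", cudaGetErrorString(cuErr));
        return 1;
    }
    for (int i = 0; i < count; i++) {
        struct cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) == cudaSuccess) {
            // major/minor form the compute capability, e.g. 2.1 for sm_21
            printf("Device %d: %s (compute capability %d.%d)\r\n",
                   i, prop.name, prop.major, prop.minor);
        }
    }
    return 0;
}
```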

3. Terminology

  • Host: the CPU and its memory (host memory)
  • Device: the GPU and its memory (device memory)

read more at https://www.nvidia.com/docs/IO/116711/sc11-cuda-c-basics.pdf
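CUDA C marks where a function runs, and from where it may be called, with function qualifiers. A short sketch of the three common ones (the function names here are just placeholders):

```cuda
// __global__ : runs on the device, called from the host (a kernel)
__global__ void kernel_fn(void) { }

// __device__ : runs on the device, callable only from device code
__device__ int device_fn(int x) { return x * 2; }

// __host__ : runs on the host (the default for unqualified functions)
__host__ int host_fn(int x) { return x + 1; }
```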

4. Device code

__global__ void mykernel(void) {
    printf("[Device] Hello Host from Cuda block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

5. Host code

int main(void) {
    mykernel<<<1,1>>>();
    cudaDeviceSynchronize();
    printf("[Host] Device error: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}

Line 2 calls the device function mykernel. The triple angle brackets '<<<' and '>>>' indicate that we are calling from the host into the device (a kernel launch).
The generic format is <function name><<<N,K>>>, where N is the number of blocks launched in parallel and K is the number of threads per block.
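For example, launching with N = 2 and K = 4 runs eight copies of the kernel. Inside the kernel, blockDim.x (the number of threads per block) lets each thread compute a unique global index. A sketch, assuming the same mykernel setup as above:

```cuda
#include <stdio.h>

__global__ void mykernel(void) {
    // Unique index of this thread across all blocks
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    printf("[Device] block %d, thread %d, global index %d\n",
           blockIdx.x, threadIdx.x, idx);
}

int main(void) {
    mykernel<<<2, 4>>>();   // N = 2 blocks, K = 4 threads per block: 8 threads total
    cudaDeviceSynchronize();
    return 0;
}
```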

Line 3 blocks until the device has completed all preceding tasks; this allows the printf inside the device function to deliver its output to the host terminal. Without this line, the program may exit before any output appears.

In line 4, we fetch the last error (if any) raised by the CUDA calls and convert it to a string for printing to the terminal.
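Checking the error after every CUDA call gets repetitive. A common pattern is to wrap each runtime call in a macro that reports where the failure happened; a sketch (the macro name CUDA_CHECK is our own, not part of the CUDA API):

```cuda
#include <stdio.h>
#include <stdlib.h>

// Wrap every CUDA runtime call; on failure, report where it happened and exit.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage: CUDA_CHECK(cudaDeviceSynchronize());
```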

6. Compile and test

nvcc -arch sm_21 -o test main.cu

The -arch option tells nvcc which GPU architecture to target (sm_21, the Fermi architecture, in this case). The -o option sets the output filename. Now we execute the binary file

ltkhanh@ServerTX:~/cuda$ ./test 
[Device] Hello Host from Cuda block 0, thread 0
[Host] Device error: no error

7. Transfer data between host and device

Host and device memory are separate entities. To transfer data between host and device, we need pointers for each memory:

  • Device pointers point to GPU memory. They may be passed to/from host code and may not be dereferenced in host code.
  • Host pointers point to CPU memory. They may be passed to/from device code and may not be dereferenced in device code.

read more at https://www.nvidia.com/docs/IO/116711/sc11-cuda-c-basics.pdf

To handle device memory, we use the following functions:

  • cudaMalloc() allocates memory on the device. Syntax:
    cudaMalloc ( void** devPtr, size_t size )
  • cudaFree() frees memory on the device. Syntax:
    cudaFree ( void* devPtr )
  • cudaMemcpy() copies data between host and device. Syntax:
    cudaMemcpy ( void* dst, const void* src, size_t count, cudaMemcpyKind kind )

    Here cudaMemcpyKind specifies the direction of the copy: cudaMemcpyHostToHost (host -> host), cudaMemcpyHostToDevice (host -> device), cudaMemcpyDeviceToHost (device -> host), cudaMemcpyDeviceToDevice (device -> device), or cudaMemcpyDefault (direction inferred from the pointer values).
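Putting the three calls together, a minimal round trip (host -> device -> host) for a small array might look like this sketch (variable names d_buf, h_src, h_dst are our own):

```cuda
#include <stdio.h>

int main(void) {
    int h_src[4] = {1, 2, 3, 4};   // host source array
    int h_dst[4] = {0};            // host destination array
    int *d_buf;                    // device pointer
    size_t bytes = 4 * sizeof(int);

    cudaMalloc((void **)&d_buf, bytes);
    // Host -> Device, then Device -> Host round trip
    cudaMemcpy(d_buf, h_src, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h_dst, d_buf, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_buf);

    printf("h_dst[0..3] = %d %d %d %d\n", h_dst[0], h_dst[1], h_dst[2], h_dst[3]);
    return 0;
}
```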

read more at CUDA MEMORY

Device function for processing device data

__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

The function above adds the values stored at two device memory locations and writes the result to a third location in device memory.

Host function for passing data to the device and getting the result back

int main(void) {
    // host variables
    int a, b, c;
    // device pointers
    int *d_a, *d_b, *d_c;
    // Allocate space for device pointers
    cudaMalloc((void **)&d_a, sizeof(int));
    cudaMalloc((void **)&d_b, sizeof(int));
    cudaMalloc((void **)&d_c, sizeof(int));
    printf("cudaMalloc() error: %s\n", cudaGetErrorString(cudaGetLastError()));
    // Set values for host variables
    a = 1;
    b = 2;
    // Transfer data from host to device memory
    cudaMemcpy(d_a, &a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, sizeof(int), cudaMemcpyHostToDevice);
    // Device function
    add<<<1,1>>>(d_a,d_b,d_c);
    printf("Device function error: %s\n", cudaGetErrorString(cudaGetLastError()));
    // Copy result back to host
    cudaMemcpy(&c, d_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("cudaMemcpy() error: %s\n", cudaGetErrorString(cudaGetLastError()));
    printf("[Host] a + b = %d + %d = %d\n",a,b,c);
    // Cleanup pointers which are allocated by cudaMalloc()
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}

Compile and test

nvcc -arch sm_21 -o test main.cu
./test

Result

cudaMalloc() error: no error
Device function error: no error
cudaMemcpy() error: no error
[Host] a + b = 1 + 2 = 3
