{"id":1270,"date":"2022-05-23T09:51:04","date_gmt":"2022-05-23T02:51:04","guid":{"rendered":"https:\/\/bigdolphin.com.vn\/?p=1270"},"modified":"2024-03-26T15:24:59","modified_gmt":"2024-03-26T08:24:59","slug":"learning-cuda-step-02-device-code","status":"publish","type":"post","link":"https:\/\/bigdolphin.com.vn\/?p=1270","title":{"rendered":"Learning CUDA: Step 02: Device code"},"content":{"rendered":"\n<h3 class=\"wp-block-heading\">1. File extension<\/h3>\n\n\n\n<p>The file extension must be <em>.cu<\/em> when compiling with the <em>nvcc<\/em> command.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"bash\" class=\"language-bash line-numbers\">nano main.cu<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"cpp\" class=\"language-cpp line-numbers\">#include &lt;stdio.h&gt;\n#include &lt;stdlib.h&gt;<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">2. Get GPU card information<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"cpp\" class=\"language-cpp line-numbers\">int main()\n{\n    cudaError_t cuErr;\n    int cuDev;\n    struct cudaDeviceProp cuDevProp;\n    cuErr = cudaGetDevice(&amp;cuDev);\n    if(cuErr == cudaSuccess){\n        printf(\"Device %d\\r\\n\",cuDev);\n        cuErr = cudaGetDeviceProperties(&amp;cuDevProp,cuDev);\n        if(cuErr == cudaSuccess){\n            printf(\"Device Properties:\\r\\n\");\n            printf(\"- Name                 : %s\\r\\n\",cuDevProp.name);\n            printf(\"- Total Global Mem     : %lu\\r\\n\",cuDevProp.totalGlobalMem);\n            printf(\"- Shared Mem Per Block : %lu\\r\\n\",cuDevProp.sharedMemPerBlock);\n            printf(\"- Max Threads Per Block: %d\\r\\n\",cuDevProp.maxThreadsPerBlock);\n            printf(\"- Compute Mode         : %d\\r\\n\",cuDevProp.computeMode);\n        }\n        else{\n            printf(\"Failed to get Device Properties!\\r\\n\");\n        }\n    }\n    else{\n        printf(\"Failed to get Device!\\r\\n\");\n    }\n    return 
0;\n}<\/code><\/pre>\n\n\n\n<p><strong>Compile and test<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"bash\" class=\"language-bash line-numbers\">nvcc main.cu\n.\/a.out<\/code><\/pre>\n\n\n\n<p><strong>Result<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"bash\" class=\"language-bash line-numbers\">ltkhanh@ServerTX:~\/cuda$ .\/a.out \nDevice 0\nDevice Properties:\n- Name                 : GeForce GT 630\n- Total Global Mem     : 2081619968\n- Shared Mem Per Block : 49152\n- Max Threads Per Block: 1024\n- Compute Mode         : 0 <\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">3. Terminology<\/h3>\n\n\n\n<ul><li><strong>Host<\/strong>: the CPU and its memory (host memory)<\/li><li><strong>Device<\/strong>: the GPU and its memory (device memory)<\/li><\/ul>\n\n\n\n<p>Read more at <a href=\"https:\/\/www.nvidia.com\/docs\/IO\/116711\/sc11-cuda-c-basics.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/www.nvidia.com\/docs\/IO\/116711\/sc11-cuda-c-basics.pdf<\/a><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">4. Device code<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"cpp\" class=\"language-cpp line-numbers\">__global__ void mykernel(void) {\n    printf(\"[Device] Hello Host from Cuda block %d, thread %d\\n\", blockIdx.x, threadIdx.x);\n}<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">5. 
Host code<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"cpp\" class=\"language-cpp line-numbers\">int main(void) {\n    mykernel&lt;&lt;&lt;1,1&gt;&gt;&gt;();\n    cudaDeviceSynchronize();\n    printf(\"[Host] Device error: %s\\n\", cudaGetErrorString(cudaGetLastError()));\n    return 0;\n}<\/code><\/pre>\n\n\n\n<p>Line 2 calls the device function <em>mykernel<\/em>. The &#8216;&lt;&lt;&lt;&#8217; and &#8216;&gt;&gt;&gt;&#8217; brackets indicate that we are calling from the host into the device.<br>The generic format is <em>&lt;function name&gt;&lt;&lt;&lt;<strong>N<\/strong>,<strong>K<\/strong>&gt;&gt;&gt;<\/em>, where N is the number of blocks running in parallel and K is the number of threads per block.<\/p>\n\n\n\n<p>Line 3 blocks until the device has completed all previously requested tasks; this allows the printf in the device function to deliver its output to the host terminal. Without this line, we would see no output.<\/p>\n\n\n\n<p>Line 4 retrieves the last error (if any) produced by the CUDA calls and converts it to a string for printing to the terminal.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">6. Compile and test<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"bash\" class=\"language-bash line-numbers\">nvcc -arch sm_21 -o test main.cu<\/code><\/pre>\n\n\n\n<p>The -arch option tells nvcc which GPU architecture to target (sm_21, i.e. the Fermi architecture, in this case). The -o option sets the output filename. Now we execute the binary file:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"bash\" class=\"language-bash line-numbers\">ltkhanh@ServerTX:~\/cuda$ .\/test \n[Device] Hello Host from Cuda block 0, thread 0\n[Host] Device error: no error<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">7. Transfer data between host and device<\/h3>\n\n\n\n<p>Host and device memory are separate entities. To transfer data between host and device, we need pointers for each memory:<\/p>\n\n\n\n<ul><li><strong><em>Device<\/em><\/strong> pointers point to GPU memory. 
They may be passed to\/from host code and may not be dereferenced in host code.<\/li><li><strong><em>Host<\/em><\/strong> pointers point to CPU memory. They may be passed to\/from device code and may not be dereferenced in device code.<\/li><\/ul>\n\n\n\n<p>Read more at <a href=\"https:\/\/www.nvidia.com\/docs\/IO\/116711\/sc11-cuda-c-basics.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/www.nvidia.com\/docs\/IO\/116711\/sc11-cuda-c-basics.pdf<\/a><\/p>\n\n\n\n<p>To manage device memory, we use the following functions:<\/p>\n\n\n\n<ul><li><em>cudaMalloc()<\/em> allocates memory on the device. Syntax:<br><center><em>cudaMalloc ( void**&nbsp;devPtr, size_t&nbsp;size )<\/em><\/center><\/li><li><em>cudaFree()<\/em> frees memory on the device. Syntax:<br><center><i>cudaFree ( void* devPtr )<\/i><\/center><\/li><li><em>cudaMemcpy()<\/em> copies data between host and device. Syntax:<br><center><i>cudaMemcpy ( void* dst, const void* src, size_t count, cudaMemcpyKind kind )<\/i><\/center><br>where <em>cudaMemcpyKind<\/em> is the memory copy type: <em><font color=\"red\">cudaMemcpyHostToHost<\/font><\/em> (Host -&gt; Host), <em><font color=\"red\">cudaMemcpyHostToDevice<\/font><\/em> (Host -&gt; Device), <em><font color=\"red\">cudaMemcpyDeviceToHost<\/font><\/em> (Device -&gt; Host), <em><font color=\"red\">cudaMemcpyDeviceToDevice<\/font><\/em> (Device -&gt; Device) and <em><font color=\"red\">cudaMemcpyDefault<\/font><\/em> (inferred from the pointer values).<\/li><\/ul>\n\n\n\n<p>Read more at <a href=\"https:\/\/docs.nvidia.com\/cuda\/cuda-runtime-api\/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1gc263dbe6574220cc776b45438fc351e8\" target=\"_blank\" rel=\"noreferrer noopener\" title=\"CUDA MEMORY\">CUDA MEMORY<\/a><\/p>\n\n\n\n<p><strong>Device function for processing device data<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"cpp\" class=\"language-cpp line-numbers\">__global__ void add(int *a,int *b,int *c) {\n     *c = *a + *b;\n 
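\/\/ Hypothetical sketch (not part of the original example): if a, b and c\n     \/\/ were N-element device arrays and the kernel were launched as\n     \/\/ add&lt;&lt;&lt;N,1&gt;&gt;&gt;, each block could add one element via its block index:\n     \/\/ c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];\n 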
}<\/code><\/pre>\n\n\n\n<p>The function above adds the values stored at two locations in device memory and places the result at a third location in device memory.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/bigdolphin.com.vn\/wp-content\/uploads\/2022\/05\/image.png\" alt=\"\" class=\"wp-image-1308\" width=\"221\" height=\"276\" srcset=\"https:\/\/bigdolphin.com.vn\/wp-content\/uploads\/2022\/05\/image.png 294w, https:\/\/bigdolphin.com.vn\/wp-content\/uploads\/2022\/05\/image-240x300.png 240w\" sizes=\"(max-width: 221px) 100vw, 221px\" \/><\/figure><\/div>\n\n\n\n<p><strong>Host function for passing data to the device and getting back the result<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"cpp\" class=\"language-cpp line-numbers\">int main(void) {\n    \/\/ Host variables\n    int a, b, c;\n    \/\/ Device pointers\n    int *d_a, *d_b, *d_c;\n    \/\/ Allocate device memory for the two inputs and the output\n    cudaMalloc((void **)&amp;d_a, sizeof(int));\n    cudaMalloc((void **)&amp;d_b, sizeof(int));\n    cudaMalloc((void **)&amp;d_c, sizeof(int));\n    printf(\"cudaMalloc() error: %s\\n\", cudaGetErrorString(cudaGetLastError()));\n    \/\/ Set values for host variables\n    a = 1;\n    b = 2;\n    \/\/ Transfer data from host to device memory\n    cudaMemcpy(d_a, &amp;a, sizeof(int), cudaMemcpyHostToDevice);\n    cudaMemcpy(d_b, &amp;b, sizeof(int), cudaMemcpyHostToDevice);\n    \/\/ Launch the device function\n    add&lt;&lt;&lt;1,1&gt;&gt;&gt;(d_a,d_b,d_c);\n    printf(\"Device function error: %s\\n\", cudaGetErrorString(cudaGetLastError()));\n    \/\/ Copy the result back to the host\n    cudaMemcpy(&amp;c, d_c, sizeof(int), cudaMemcpyDeviceToHost);\n    printf(\"cudaMemcpy() error: %s\\n\", cudaGetErrorString(cudaGetLastError()));\n    printf(\"[Host] a + b = %d + %d = %d\\n\",a,b,c);\n    \/\/ Free the device memory allocated by cudaMalloc()\n    cudaFree(d_a);\n    cudaFree(d_b);\n    cudaFree(d_c);\n    return 
0;\n}<\/code><\/pre>\n\n\n\n<p><strong>Compile and test<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"bash\" class=\"language-bash line-numbers\">nvcc -arch sm_21 -o test main.cu\n.\/test<\/code><\/pre>\n\n\n\n<p><strong>Result<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"bash\" class=\"language-bash line-numbers\">cudaMalloc() error: no error\nDevice function error: no error\ncudaMemcpy() error: no error\n[Host] a + b = 1 + 2 = 3<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Simple programs and data transfer functions between host and device.<\/p>\n","protected":false},"author":2,"featured_media":1308,"comment_status":"open","ping_status":"open","sticky":false,"template":"single-with-sidebar","format":"standard","meta":{"gtb_hide_title":false,"gtb_wrap_title":false,"gtb_class_title":"","gtb_remove_headerfooter":false,"footnotes":""},"categories":[10],"tags":[63,59,61],"_links":{"self":[{"href":"https:\/\/bigdolphin.com.vn\/index.php?rest_route=\/wp\/v2\/posts\/1270"}],"collection":[{"href":"https:\/\/bigdolphin.com.vn\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/bigdolphin.com.vn\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/bigdolphin.com.vn\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/bigdolphin.com.vn\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1270"}],"version-history":[{"count":39,"href":"https:\/\/bigdolphin.com.vn\/index.php?rest_route=\/wp\/v2\/posts\/1270\/revisions"}],"predecessor-version":[{"id":1320,"href":"https:\/\/bigdolphin.com.vn\/index.php?rest_route=\/wp\/v2\/posts\/1270\/revisions\/1320"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/bigdolphin.com.vn\/index.php?rest_route=\/wp\/v2\/media\/1308"}],"wp:attachment":[{"href":"https:\/\/bigdolphin.com.vn\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1270"}],"wp:term":[{"taxonomy":"category","embeddable":true,"h
ref":"https:\/\/bigdolphin.com.vn\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1270"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/bigdolphin.com.vn\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1270"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}