
GPGPU with Jcuda the Good, the Bad and … the Ugly

In our previous article, GPGPU for Java Programming, we showed how to set up an environment to execute CUDA from within Java code. However, that article focused only on setting up the environment, leaving the subject of parallelism untouched. In this article we will see how we can use a GPU to do what it does best: parallel processing. Along the way we will take some measurements and see where GPU processing is stronger or weaker than using a CPU… and of course, as the title suggests, there is an ugly part at the end.

We will start our exploration of GPU parallelism by devising an example that differs from the samples found in most of the available GPGPU documentation, which is authored primarily by people with a strong graphics or science background. Most of those examples deal with vector additions or some other mathematical construct. Let's work on an example that somewhat resembles a business situation: imagine that we have a list of products, each with a code tag and a price, and we would like to apply a 10% overhead to all products whose code is “abc”.

We will implement the example first in C so we can make some performance comparisons between CPU and GPU processing. Afterwards we will of course implement the same in Java, but we will avoid making any measurements there, as they are trickier in Java: we would have to take into account things like garbage collection, just-in-time compilation and so on.

//============================================================================
// Name        : StoreDiscountExample.cu
// Author      : Spyros Sakellariou
// Version     : 1.0
// Description : The Good the Bad and the Ugly
//============================================================================
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <sys/time.h>
typedef struct {
  char code[3];
  float listPrice;
} product;

void printProducts(long size, product *myProduct){
  printf("Price of First item=%f,%s\n", myProduct[0].listPrice, myProduct[0].code);
  printf("Price of Second item=%f,%s\n", myProduct[1].listPrice, myProduct[1].code);
  printf("Price of Middle item=%f,%s\n", myProduct[(size-1)/2].listPrice, myProduct[(size-1)/2].code);
  printf("Price of Almost Last item=%f,%s\n", myProduct[size-2].listPrice, myProduct[size-2].code);
  printf("Price of Last item=%f,%s\n", myProduct[size-1].listPrice, myProduct[size-1].code);
}

float calculateMiliseconds(timeval t1, timeval t2) {
  float elapsedTime;
  elapsedTime = (t2.tv_sec - t1.tv_sec) * 1000.0;
  elapsedTime += (t2.tv_usec - t1.tv_usec) / 1000.0;
  return elapsedTime;
}
 
 

__global__ void kernel(long size, product *products)
{
  long kernelid = threadIdx.x + blockIdx.x * blockDim.x;
  while (kernelid < size) {
    if (products[kernelid].code[0]=='a' && products[kernelid].code[1]=='b' && products[kernelid].code[2]=='c')
      products[kernelid].listPrice *= 1.10;
    kernelid += blockDim.x * gridDim.x;
  }
}

int main( int argc, char** argv)
{
  timeval t1,t2;
  cudaEvent_t eStart,eStop;
  float elapsedTime;
  long threads = 256;
  long blocks = 1024;
  long size = 9000000;
  const char *product1 = "abc";
  const char *product2 = "bcd";
  product *myProduct;
  product *dev_Product;

 printf("blocks=%d x threads=%d total threads=%d total number of products=%d\n\n",blocks,threads,threads*blocks,size);

  myProduct = (product*)malloc(sizeof(product) * size);
  cudaMalloc((void**)&dev_Product, sizeof(product) * size);
  cudaEventCreate(&eStart);
  cudaEventCreate(&eStop);
 
  gettimeofday(&t1, NULL);
  for (long i = 0; i < size; i++){
    if (i%2==0)
      strcpy(myProduct[i].code, product1);
    else
      strcpy(myProduct[i].code, product2);
    myProduct[i].listPrice = i+1;
  }
  gettimeofday(&t2, NULL);
  printf("Initialization time %4.2f ms\n", calculateMiliseconds(t1,t2));
  printProducts(size, myProduct);
  cudaMemcpy(dev_Product, myProduct, sizeof(product)*size, cudaMemcpyHostToDevice);
  
 cudaEventRecord(eStart,0);
  kernel<<<blocks,threads>>>(size,dev_Product);
  cudaEventRecord(eStop,0);
  cudaEventSynchronize(eStop);
  
  cudaMemcpy(myProduct, dev_Product, sizeof(product)*size, cudaMemcpyDeviceToHost);
  
  cudaEventElapsedTime(&elapsedTime,eStart,eStop);
  printf ( "\nCuda Kernel Time=%4.2f ms\n", elapsedTime );
  printProducts(size,myProduct);
 
 
  gettimeofday(&t1, NULL);
  long j = 0;
  while (j < size){
    if (myProduct[j].code[0]=='a' && myProduct[j].code[1]=='b' && myProduct[j].code[2]=='c')
      myProduct[j].listPrice *= 0.5;
    j++;
  }

  gettimeofday(&t2, NULL);
 
  printf ( "\nCPU Time=%4.2f ms\n", calculateMiliseconds(t1,t2) );
  printProducts(size,myProduct);
 
  cudaFree(dev_Product);
  free(myProduct);
} 

In lines 11-14 there is the definition of a structure containing our product: a character array for the product code and a float for its price.

In lines 16 and 24 we define two utility functions: one that prints a few sample products (so we can see whether work has been done) and one that converts raw timeval differences into milliseconds. Note that the standard C clock function will not work here, as its granularity is not fine enough to measure milliseconds.
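To make the measurement pattern explicit, here is a minimal usage sketch of these utilities (the same pattern the main function uses below): wrap the work between two gettimeofday calls and convert the difference.

timeval t1, t2;
gettimeofday(&t1, NULL);
// ... the work being measured ...
gettimeofday(&t2, NULL);
printf("elapsed %4.2f ms\n", calculateMiliseconds(t1, t2));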

Line 33 is where our kernel is defined. Compared to the previous article this looks somewhat more complex, so let's dissect it further…

In line 35 we define a kernelid variable, which will hold a unique id for the thread being executed. CUDA assigns each thread a thread id and a block id, but each is unique only within its own dimension. In our example we instruct the GPU to launch 256 threads per block and 1024 blocks, so in effect the GPU will execute 262144 threads. This is GOOD! Although CUDA provides us with the threadIdx.x and blockIdx.x parameters during execution, we need to construct a unique thread id manually in order to know which thread we are currently in. The unique thread id needs to run from 0 up to 262143, and we can easily create it by multiplying the number of threads per block (available in the CUDA parameter blockDim.x) by the current block id and adding the current thread id, thus:

unique thread id = current thread id + current block id * number of threads per block
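As a quick sanity check, here is the formula applied to concrete values from our launch configuration:

// blockDim.x = 256 (threads per block), gridDim.x = 1024 (blocks)
// thread 5 of block 3:      unique id = 5 + 3 * 256 = 773
// thread 255 of block 1023: unique id = 255 + 1023 * 256 = 262143 (the last one)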

If you read ahead you will have already realized that, although 262 thousand threads is impressive, our data set is made of 9 million items, so each thread needs to process more than one item. We do this by setting up a loop in line 36 that checks that our thread id does not overshoot the products array. The loop uses the thread id as its index, but we increment it using the following formula:

index increment += threads per block * total number of blocks

Thus each thread will go around the loop 9 million / 262144 ≈ 34 times, meaning it will process about 34 items.
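To make the stride concrete, here is the sequence of items the first two threads visit with our numbers:

// stride = blockDim.x * gridDim.x = 256 * 1024 = 262144
// thread 0 processes items 0, 262144, 524288, ... while < 9000000
// thread 1 processes items 1, 262145, 524289, ...
// together the threads cover every item 0..8999999 exactly once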

The rest of the kernel code is pretty simple and self-explanatory: whenever we find a product with code “abc” we multiply its price by 1.1 (our 10% overhead). Notice that host string functions such as strcpy or strcmp cannot be used inside our kernel (you will get a compilation error if you try), which is why the comparison is done character by character. Not so good!
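If the character-by-character test feels too raw, we can always roll our own device-side helper; here is a minimal sketch (codeEquals is our own name, not a CUDA function):

// device code can only call other device code, so we write the comparison ourselves
__device__ bool codeEquals(const char *code, const char *target) {
  for (int k = 0; k < 3; ++k)
    if (code[k] != target[k]) return false;
  return true;
}

// inside the kernel the test then becomes:
//   const char target[3] = {'a','b','c'};
//   if (codeEquals(products[kernelid].code, target))
//     products[kernelid].listPrice *= 1.10;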

Going into the main function, in lines 45 and 46 we define two C timers (t1 and t2) and two CUDA event timers (eStart and eStop). We need the CUDA timers because the kernel is executed asynchronously: the kernel call returns instantly, so our C timers would only measure the time it took to complete the function call itself. The fact that the kernel call returns instantly also means the CPU is free to do other tasks during GPU code execution. This is also GOOD!
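A minimal sketch of why host timers alone are not enough: the launch returns before the GPU has done any work, so one would have to synchronize first (cudaDeviceSynchronize blocks until the device is idle):

gettimeofday(&t1, NULL);
kernel<<<blocks,threads>>>(size, dev_Product); // returns immediately
gettimeofday(&t2, NULL);                       // measures only the launch overhead

cudaDeviceSynchronize();                       // wait for the kernel to finish
// only after this point would a host-side timer reflect the real execution time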

The declarations following our timers are self-explanatory: the number of threads and blocks, the size of the products array and so on. The myProduct pointer will be used for CPU processing and the dev_Product pointer for GPU processing.

In lines 58 to 61 we allocate host RAM for myProduct and GPU memory for dev_Product, and we also create the CUDA events that will help us measure kernel execution time.

In lines 63 to 73 we initialize myProduct with codes and prices and print the time it took the CPU to complete the task. We also print some sample products from our array to make sure the job was done correctly.

In lines 74 to 85 we copy the products array to GPU memory, execute our kernel with the number of blocks and threads we want, and copy the results back into the myProduct array. We print the time it took to execute the kernel, and again some sample products to verify that the job was done correctly.

Finally, in lines 88 to 99 we let the CPU do a similar amount of work to what the GPU did, i.e. apply a 50% discount to all products the GPU added overheads to. Then we print the time it took to execute the CPU task, plus some sample products to make sure the job is done.

Let’s compile and run this code:

# nvcc StoreDiscountExample.cu -o StoreDiscountExample
# ./StoreDiscountExample
blocks=1024 x threads=256 total threads=262144 total number of products=9000000

Initialization time 105.81 ms
Price of First item=1.000000,abc
Price of Second item=2.000000,bcd
Price of Middle item=4500000.000000,bcd
Price of Almost Last item=8999999.000000,abc
Price of Last item=9000000.000000,bcd

Cuda Kernel Time=1.38 ms
Price of First item=1.100000,abc
Price of Second item=2.000000,bcd
Price of Middle item=4500000.000000,bcd
Price of Almost Last item=9899999.000000,abc
Price of Last item=9000000.000000,bcd

CPU Time=59.58 ms
Price of First item=0.550000,abc
Price of Second item=2.000000,bcd
Price of Middle item=4500000.000000,bcd
Price of Almost Last item=4949999.500000,abc
Price of Last item=9000000.000000,bcd
#

Wow! It took the GPU 1.38 milliseconds to do what took the CPU 59.58 milliseconds (numbers will vary depending on your hardware, of course). This is GOOD!

Hold your horses for a second! Before you decide to delete all your code and start rewriting everything in CUDA, there is a catch: we omitted something serious, namely measuring how long it takes to copy 9 million records from RAM to GPU memory and back. Here is the code from lines 74 to 85, altered with timers that measure copying the product list to and from GPU memory:

gettimeofday(&t1, NULL);
cudaMemcpy(dev_Product, myProduct, sizeof(product)*size, cudaMemcpyHostToDevice);

cudaEventRecord(eStart,0);
kernel<<<blocks,threads>>>(size, dev_Product);
cudaEventRecord(eStop,0);
cudaEventSynchronize(eStop);

cudaMemcpy(myProduct, dev_Product, sizeof(product)*size, cudaMemcpyDeviceToHost);
gettimeofday(&t2, NULL);
printf("\nCuda Total Time=%4.2f ms\n", calculateMiliseconds(t1,t2));
cudaEventElapsedTime(&elapsedTime, eStart, eStop);
printf("Cuda Kernel Time=%4.2f ms\n", elapsedTime);
printProducts(size, myProduct);

Let’s compile and run this code:

# nvcc StoreDiscountExample.cu -o StoreDiscountExample
# ./StoreDiscountExample
blocks=1024 x threads=256 total threads=262144 total number of products=9000000

Initialization time 108.31 ms
Price of First item=1.000000,abc
Price of Second item=2.000000,bcd
Price of Middle item=4500000.000000,bcd
Price of Almost Last item=8999999.000000,abc
Price of Last item=9000000.000000,bcd

Cuda Total Time=55.13 ms
Cuda Kernel Time=1.38 ms
Price of First item=1.100000,abc
Price of Second item=2.000000,bcd
Price of Middle item=4500000.000000,bcd
Price of Almost Last item=9899999.000000,abc
Price of Last item=9000000.000000,bcd

CPU Time=59.03 ms
Price of First item=0.550000,abc
Price of Second item=2.000000,bcd
Price of Middle item=4500000.000000,bcd
Price of Almost Last item=4949999.500000,abc
Price of Last item=9000000.000000,bcd

#

Notice that the CUDA Total Time is 55 milliseconds; that's only 4 milliseconds faster than using the CPU in a single thread. This is BAD!

So although the GPU is super-fast when it comes to parallel task execution, there is a heavy penalty when we copy items between RAM and GPU memory. There are advanced tricks, such as pinned (page-locked) memory that enables direct memory access, which can reduce this cost, but the moral is that you have to be very careful when deciding to use the GPU. If your algorithm requires a lot of data movement, then GPGPU is probably not the answer.
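For completeness, here is a hedged sketch of the pinned-memory variant; cudaHostAlloc and cudaFreeHost are standard CUDA runtime calls, though we have not measured this version here:

product *myProduct;
// page-locked host memory: the GPU can DMA directly from it,
// which typically speeds up host-to-device and device-to-host copies
cudaHostAlloc((void**)&myProduct, sizeof(product) * size, cudaHostAllocDefault);

// ... initialize myProduct and copy to/from the device as before ...

cudaFreeHost(myProduct); // pinned memory is released with cudaFreeHost, not free()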

Now that we are done with our performance tests, let's have a look at how we can implement the same function using JCuda.

Here is the code for the Java part:

import static jcuda.driver.JCudaDriver.*;
import jcuda.*;
import jcuda.driver.*;
import jcuda.runtime.JCuda;


public class StoreDiscountExample {
  public static void main(String[] args) {
    int threads = 256;
    int blocks = 1024;
    final int size = 9000000;
    byte product1[] = "abc".getBytes();
    byte product2[] = "bcd".getBytes();
    byte productList[] = new byte[size*3];
    float productPrices[] = new float[size];
    long size_array[] = {size};
   
    cuInit(0);
    CUcontext pctx = new CUcontext();
    CUdevice dev = new CUdevice();
    cuDeviceGet(dev, 0);
    cuCtxCreate(pctx, 0, dev);
    CUmodule module = new CUmodule();
    cuModuleLoad(module, "StoreDiscountKernel.ptx");
    CUfunction function = new CUfunction();
    cuModuleGetFunction(function, module, "kernel");
 
  
    int j = 0;
    for (int i = 0; i < size; i++){
      j = i*3;
      if (i%2==0) {
        productList[j]   = product1[0];
        productList[j+1] = product1[1];
        productList[j+2] = product1[2];
      }
      else {
        productList[j]   = product2[0];
        productList[j+1] = product2[1];
        productList[j+2] = product2[2];
      }

      productPrices[i] = i+1;

    }
   
    printSamples(size, productList, productPrices);

    // the scalar size travels to the device as a one-element array
    CUdeviceptr size_dev = new CUdeviceptr();
    cuMemAlloc(size_dev, Sizeof.LONG);
    cuMemcpyHtoD(size_dev, Pointer.to(size_array), Sizeof.LONG);

    CUdeviceptr productList_dev = new CUdeviceptr();
    cuMemAlloc(productList_dev, Sizeof.BYTE*3*size);
    cuMemcpyHtoD(productList_dev, Pointer.to(productList), Sizeof.BYTE*3*size);

    CUdeviceptr productPrice_dev = new CUdeviceptr();
    cuMemAlloc(productPrice_dev, Sizeof.FLOAT*size);
    cuMemcpyHtoD(productPrice_dev, Pointer.to(productPrices), Sizeof.FLOAT*size);

    Pointer kernelParameters = Pointer.to(
      Pointer.to(size_dev),
      Pointer.to(productList_dev),
      Pointer.to(productPrice_dev)
    );

    cuLaunchKernel(function,
      blocks, 1, 1,    // grid dimensions
      threads, 1, 1,   // block dimensions
      0, null,         // shared memory size and stream
      kernelParameters, null);

    // copy only the prices back; the codes were not modified by the kernel
    cuMemcpyDtoH(Pointer.to(productPrices), productPrice_dev, Sizeof.FLOAT*size);

    printSamples(size, productList, productPrices);

    JCuda.cudaFree(productList_dev);
    JCuda.cudaFree(productPrice_dev);
    JCuda.cudaFree(size_dev);
 }
  

  public static void printSamples(int size, byte[] productList, float[] productPrices) {
    System.out.println(new String(productList, 0, 3) + " " + productPrices[0]);
    System.out.println(new String(productList, 3, 3) + " " + productPrices[1]);
    System.out.println(new String(productList, 6, 3) + " " + productPrices[2]);
    System.out.println(new String(productList, 9, 3) + " " + productPrices[3]);
    System.out.println(new String(productList, (size-2)*3, 3) + " " + productPrices[size-2]);
    System.out.println(new String(productList, (size-1)*3, 3) + " " + productPrices[size-1]);
  }
}

Starting with lines 14, 15 and 16, we see that we can no longer use a product structure or class. In fact, in JCuda everything that is passed as a parameter to the kernel has to be a one-dimensional array of primitives. So in line 14 we create what is conceptually a two-dimensional array of bytes, represented as a one-dimensional array: the product list array size is the number of products times the size in bytes of each product code (in our case just three bytes). We also create a second array to store the product prices as floats, and finally even the size of our product list needs to be put into a one-element array.
I think by now you have probably guessed what I am about to say: this is just plain UGLY!
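The mapping from the conceptual table of products to the flat arrays is plain row-major indexing; as a sketch:

// conceptually:  code[size][3]  and  price[size]
// flattened:     productList[i*3 + k]  holds character k of product i
//                productPrices[i]      holds the price of product i
// e.g. product 2 ("abc") occupies productList[6..8] and productPrices[2]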

In lines 29 to 45 we populate the product list and product price arrays, and then we hand them to the kernel for processing: we create CUDA device pointers, allocate device memory and copy the data to the GPU before calling the kernel function.

Since we had to convert everything into one-dimensional arrays of primitives, our kernel code needs to change a bit as well:

extern "C"

__global__ void kernel(long *size, char *productCodes, float *productPrices)
{
 long kernelid = threadIdx.x + blockIdx.x * blockDim.x;
 long charIndex = kernelid*3;
 while(kernelid < size[0]) {
    if (productCodes[charIndex]=='a' && productCodes[charIndex+1]=='b' && productCodes[charIndex+2]=='c')
      productPrices[kernelid]*=1.10;
         kernelid += blockDim.x * gridDim.x;
         charIndex = kernelid*3;
        }
}

The main differences are that the size now arrives as a pointer to a one-element array, and that we multiply the kernelid index by 3 in order to find the first character of each product code in the productCodes array.

Let's compile and run the Java example:

# nvcc -ptx StoreDiscountKernel.cu -o StoreDiscountKernel.ptx
# javac -cp ~/GPGPU/jcuda/JCuda-All-0.4.0-beta1-bin-linux-x86_64/jcuda-0.4.0-beta1.jar StoreDiscountExample.java
# java -cp ~/GPGPU/jcuda/JCuda-All-0.4.0-beta1-bin-linux-x86_64/jcuda-0.4.0-beta1.jar:. StoreDiscountExample
abc 1.0
bcd 2.0
abc 3.0
bcd 4.0
abc 8999999.0
bcd 9000000.0
abc 1.1
bcd 2.0
abc 3.3
bcd 4.0
abc 9899999.0
bcd 9000000.0
#

Ugly though it is, the code works in the same fashion as the C version.

So here is a summary of our experience with GPU processing and JCuda:

GOOD: Very fast performance (my hardware was an AMD Phenom II Quad Core 3.4GHz CPU and an NVIDIA GeForce GTX 560 with 336 cores)

GOOD: Asynchronous operation, letting the CPU do other tasks while the GPU works

BAD: Memory copies impose a considerable performance penalty

UGLY: JCuda is undoubtedly useful if you want to execute CUDA kernels from within Java, but having to convert everything into one-dimensional arrays of primitives is really not convenient.

In our previous article there were some very interesting comments about Java tools for OpenCL (the open alternative to CUDA for GPGPU). In the next article we will have a look at these tools and see whether they are any “prettier” than JCuda.

Reference: GPGPU with Jcuda the Good, the Bad and … the Ugly from our W4G partner Spyros Sakellariou.
