CPU vs. GPGPU

GPGPU is a new trend and it has triggered many new questions in the minds of young computer scientists. In this post, I cover some of the FAQs on this topic.

What types of problems are better suited to regular multicore and what types are better suited to GPGPU?

GPUs contain both fixed-function and programmable hardware. While GPUs are trending towards more and more programmable units, today’s GPUs still perform some common graphics tasks, such as texture sampling and rendering, using special-purpose hardware. In contrast, pixel shading is done on programmable SIMD cores. GPGPU workloads mostly run on these SIMD shader cores.

GPUs are built for very regular throughput workloads, e.g., graphics, dense matrix-matrix multiply, simple Photoshop filters, etc. They are good at tolerating long latencies because they are inherently designed to tolerate the latency of texture sampling, a 1000+ cycle operation. GPU cores have a lot of threads: when one thread issues a long-latency operation (say, a memory access), that thread is put to sleep until the operation finishes, while the other threads continue to work. This allows GPUs to keep their execution units busy far more of the time than traditional cores.
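To make the latency-hiding argument concrete, here is a minimal back-of-the-envelope model in Python. It is only a sketch: it assumes every thread alternates between a fixed amount of compute and one long-latency operation, and that switching threads is free. The function name and the cycle counts are illustrative assumptions, not measurements of any real GPU.

```python
def execution_unit_utilization(num_threads, compute_cycles, latency_cycles):
    """Estimate how busy one execution unit is in a simplified model.

    Assumptions (illustrative, not a model of any specific GPU):
    - every thread repeats: `compute_cycles` of work, then one
      long-latency operation that takes `latency_cycles`;
    - while a thread waits, any other ready thread can run;
    - switching between threads costs zero cycles.
    """
    if num_threads < 1 or compute_cycles < 1 or latency_cycles < 0:
        raise ValueError(
            "need num_threads >= 1, compute_cycles >= 1, latency_cycles >= 0"
        )
    # One thread keeps the unit busy for compute_cycles out of every period.
    period = compute_cycles + latency_cycles
    # Extra threads fill the idle gaps, up to full utilization.
    return min(1.0, num_threads * compute_cycles / period)


if __name__ == "__main__":
    # Hypothetical workload: 10 cycles of compute per 1000-cycle wait.
    for threads in (1, 8, 32, 128):
        busy = execution_unit_utilization(threads, 10, 1000)
        print(f"{threads:4d} threads -> {busy:.1%} busy")
```

Under these assumptions a single thread leaves the unit almost entirely idle, and utilization grows linearly with the thread count until the waits are fully covered.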

GPUs are bad at handling branches because they batch “threads” (SIMD lanes if you are not NVIDIA) into warps and send them down the pipeline together to save on instruction fetch/decode power. If threads encounter a branch, they may diverge, e.g., 2 threads in an 8-thread warp may take the branch while the other 6 do not. The warp is then split into two warps of sizes 2 and 6, and both run inefficiently: the 2-thread warp keeps only 2 of the 8 lanes busy (25% efficiency) and the 6-thread warp keeps 6 of the 8 lanes busy (75% efficiency). If a GPU keeps encountering nested branches, its efficiency becomes very low. Therefore, GPUs aren’t good at handling branches, and heavily branching code is a poor fit for them.
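The arithmetic behind the 2/6 split can be sketched in a few lines of Python. This is a simplified model, assuming the diverged paths execute one after another on the same 8-lane hardware and that each path takes a fixed number of cycles; the function names and cycle counts are hypothetical.

```python
def lane_efficiency(active_lanes, warp_width):
    """Fraction of SIMD lanes doing useful work while one path executes."""
    if not 0 <= active_lanes <= warp_width:
        raise ValueError("active_lanes must be between 0 and warp_width")
    return active_lanes / warp_width


def simd_efficiency(paths, warp_width):
    """Overall efficiency when diverged paths run one after another.

    `paths` is a list of (active_lanes, cycles) pairs, one per path.
    Useful work is active_lanes * cycles; the hardware spends
    warp_width * cycles lane-cycles on each path regardless.
    """
    useful = sum(lanes * cycles for lanes, cycles in paths)
    spent = sum(warp_width * cycles for _, cycles in paths)
    return useful / spent


if __name__ == "__main__":
    width = 8
    # The example from the text: 2 threads take the branch, 6 do not.
    print(lane_efficiency(2, width), lane_efficiency(6, width))
    # Both paths assumed to take 10 cycles each.
    print(simd_efficiency([(2, 10), (6, 10)], width))
    # One more level of divergence inside each path (1+1 and 3+3 lanes).
    print(simd_efficiency([(1, 10), (1, 10), (3, 10), (3, 10)], width))
```

With equal-length paths, each additional level of two-way divergence halves the overall efficiency in this model, which is why nested branches hurt so much.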

GPUs are also bad at co-operative threading because synchronization is not well supported on GPUs (but NVIDIA is on it).

Therefore, the worst code for a GPU is code with little parallelism, or code with lots of branches or synchronization, e.g., databases, operating systems, graph algorithms, etc.

What are the key differences in programming model?

GPUs don’t support interrupts and exceptions. To me, that’s the biggest difference. Other than that, CUDA is not very different from C. You can write a CUDA program where you ship code to the GPU and run it there. You access memory in CUDA a bit differently, but again that’s not fundamental to our discussion.
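The "ship code to the GPU and run it there" idea can be illustrated without a GPU. The sketch below is plain Python, not CUDA: `launch` is a hypothetical sequential stand-in for a kernel launch, and `saxpy_kernel` is an illustrative kernel body. The point it shows is that every thread runs the same function and differs only in its thread index.

```python
def saxpy_kernel(thread_id, a, x, y, out):
    """Work done by ONE thread: compute a single output element."""
    # Guard against extra threads, since launches are often rounded up
    # to more threads than there are elements.
    if thread_id < len(x):
        out[thread_id] = a * x[thread_id] + y[thread_id]


def launch(kernel, num_threads, *args):
    """Hypothetical stand-in for a GPU kernel launch.

    A real GPU would run the threads in parallel; here they run one
    after another, which gives the same result for independent threads.
    """
    for thread_id in range(num_threads):
        kernel(thread_id, *args)


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    y = [10.0, 20.0, 30.0]
    out = [0.0] * len(x)
    # Launch 4 threads for 3 elements; the last thread does nothing.
    launch(saxpy_kernel, 4, 2.0, x, y, out)
    print(out)
```

Note that the kernel contains no loop over the data: the loop is replaced by the launch itself, one thread per element.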

What are the key underlying hardware differences that necessitate any differences in programming model?

I mentioned them already. The biggest is the SIMD nature of GPUs, which requires code to be written in a very regular fashion, with as few branches and as little inter-thread communication as possible. This is part of why, e.g., CUDA restricts the number of nested branches in the code.

Which one is typically easier to use and by how much?

It depends on what you are coding and what your target is.

For easily vectorizable code, the CPU is easier to program but gives lower performance; the GPU is slightly harder to program but provides a big bang for the buck. For everything else, the CPU is easier and often performs better as well.

Is it practical, in the long term, to implement high level parallelism libraries for the GPU, such as Microsoft’s task parallel library or D’s std.parallelism?

Task-parallelism, by definition, requires thread communication and has branches as well. The idea of tasks is that different threads do different things. GPUs are designed for lots of threads that are doing identical things. I would not build task parallelism libraries for GPUs.
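The contrast between the two styles can be sketched on a CPU with Python's standard `concurrent.futures` module. This is only an illustration of the programming models, not GPU code; the helper functions (`square`, `parse`, `summarize`, `largest`) are hypothetical examples.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    return x * x


def data_parallel(values):
    """Data parallelism: every worker applies the SAME function."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(square, values))


def parse(text):
    return [int(token) for token in text.split()]


def summarize(numbers):
    return {"count": len(numbers), "total": sum(numbers)}


def largest(numbers):
    return max(numbers)


def task_parallel(text):
    """Task parallelism: workers run DIFFERENT functions and hand
    results to each other, so later tasks depend on earlier ones."""
    with ThreadPoolExecutor() as pool:
        # The two later tasks both need the output of the first.
        numbers = pool.submit(parse, text).result()
        summary_future = pool.submit(summarize, numbers)
        largest_future = pool.submit(largest, numbers)
        return summary_future.result(), largest_future.result()


if __name__ == "__main__":
    print(data_parallel([1, 2, 3, 4]))
    print(task_parallel("3 1 2"))
```

The first function maps naturally onto many identical threads; the second has distinct code paths and dependencies between tasks, which is the pattern the answer above argues against for GPUs.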

If GPU computing is so spectacularly efficient, why aren’t CPUs designed more like GPUs?

Lots of problems in the world are branchy and irregular. There are thousands of examples: graph search algorithms, operating systems, web browsers, etc. Just to add: even graphics is becoming more branchy and general-purpose with every generation, so GPUs will become more and more like CPUs. I am not saying they will become just like CPUs, but they will become more programmable. The right model is somewhere in between the power-inefficient CPUs and the very specialized GPUs.

Reference: CPU vs. GPGPU from our JCG partner Aater Suleman at the Future Chips blog.

One Response to "CPU vs. GPGPU"

  1. streamcomputing says:

    Don’t forget the vector extensions of the CPU, SSE and AVX. AVX provides 256-bit-wide registers, which hold 8 floats or 4 doubles that are processed in a SIMD way. This is the main reason a 3.5 GHz CPU gets so many GFLOPS, when only 3.5 GFLOPS would be expected from one scalar operation per cycle. Because you don’t need to transfer data to the discrete GPU, I got faster data processing on the CPU in some cases where the GPU was the expected winner based on GFLOPS.

    Also worth mentioning is the MAD instruction: a multiply + add in one clock cycle. If you don’t use this, you won’t get even close to the maximum theoretical GFLOPS. This applies to both CPU and GPU.

    So actually CPUs are a sort of GPU in that way. And the good news: both Intel and AMD CPUs can be easily programmed in OpenCL (the Open Computing Language) for faster computing on CPUs, GPUs, FPGAs and DSPs.
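
The peak-throughput arithmetic in the comment above can be written out as a small Python sketch. It assumes an idealized core that issues one SIMD instruction per cycle, which real CPUs may exceed or fall short of; the function name and parameters are illustrative, and only the 3.5 GHz clock and the 8-float width come from the comment.

```python
def peak_gflops(clock_ghz, simd_lanes=1, flops_per_lane_per_cycle=1, cores=1):
    """Theoretical peak in GFLOPS under an idealized issue model.

    Assumes one instruction issued per cycle per core (a simplification).
    `flops_per_lane_per_cycle` is 2 when a multiply + add counts as
    two floating-point operations in one cycle.
    """
    return clock_ghz * simd_lanes * flops_per_lane_per_cycle * cores


if __name__ == "__main__":
    scalar = peak_gflops(3.5)                               # one op per cycle
    avx = peak_gflops(3.5, simd_lanes=8)                    # 8 floats per op
    avx_mad = peak_gflops(3.5, simd_lanes=8,
                          flops_per_lane_per_cycle=2)       # multiply + add
    print(scalar, avx, avx_mad)
```

Per core, the same clock rate goes from 3.5 to 28 to 56 GFLOPS in this model, which is why skipping the vector units or the multiply + add instruction leaves most of the theoretical peak unused.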

Java Code Geeks and all content copyright © 2010-2014, Exelixis Media Ltd | Terms of Use | Privacy Policy
All trademarks and registered trademarks appearing on Java Code Geeks are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries.
Java Code Geeks is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.