After several years of trying, graphics processing units (GPUs) are beginning to win over the major server vendors. Dell and IBM are the first tier-one server vendors to adopt GPUs as server processors for high-performance computing (HPC). Here’s a manager’s view of the hardware change and what it might mean to your data center. (Hint: faster servers.)
GPUs are normally used in desktop PCs, where they serve as high-speed graphics accelerators, primarily for games. But it’s slowly dawning on both server makers and end-users that GPUs make great math co-processors for more tasks than just gaming.
In May, IBM announced plans to offer a pair of Tesla M2050 GPUs in its iDataPlex dx360 M3 scale-out servers. Dell followed with an early June announcement that its PowerEdge M610x blade servers would come with a pair of Tesla GPUs. The M610x, equipped with an Intel Xeon 5500 or 5600, can generate up to 400 gigaflops of performance.
These announcements were a big win for Nvidia, which for some time has been pushing the idea of using its graphics chips as massive co-processors for math-intensive tasks. Until the Dell and IBM announcements, Nvidia didn’t have much tier-one hardware support.
If you’ve been around PCs long enough, you remember that the 8086, 80286 and 80386 processors had math co-processors available: the 8087, 80287, and 80387, sold separately. If you bought a PC back in the late 1980s to do a lot of math or scientific work, the PC sales staff would sell you on the notion of the math co-processor. These add-on chips were tuned specifically for fast and accurate calculations, assuming that the software was written to exploit the chip’s capabilities. The primary buyers were spreadsheet users, since the original x86 “killer app,” Lotus 1-2-3, recalculated much faster when a math co-processor was installed.
With the 80486, the math co-processor was integrated into the CPU, and subsequent architectures added instructions to speed up math calculations. Even today, CPU design documentation refers to a “floating-point unit,” since floating-point math underlies most numeric work. This includes graphics processing, signal processing, string processing, and encryption, not just obvious number-crunching.
Computers see a number as either an integer or floating point. Integers have no decimal place (13 people, for example) while floating-point numbers do (3.14159, for example). Fine-grain calculations are all the floating-point unit’s job.
This is especially important in graphics, because calculating the positions of the triangles that make up a smooth 3D image requires extremely precise fractions. If the calculations are off, gaps appear between the triangles, breaking the image. Graphics software developers need to calculate to the 30th decimal place to get precise fitting, color, and lighting.
Over the years, Nvidia and its rival ATI (a part of AMD since 2006) have built massively multicore math processors. Here’s a little perspective: Intel and AMD CPUs are at the four- to six-core level these days; Nvidia’s new Fermi architecture has 480 stream processors (and would have had more except for heat and power concerns), while the ATI Radeon 5000 series has as many as 1,600 stream processors.
Stream processing is a computation technique used in parallel processing designed to use many computational units, whereby the software manages things like memory allocation, data synchronization, and communication. All of those cores are connected by high-speed interconnects.
GPU threads are much smaller than CPU threads because they are just a bunch of math instructions. Often, the math instruction might be as simple as addition, done over and over. That makes GPU cores great for switching threads, because the cores can change from one thread to another in one clock cycle, something CPUs can’t do. A CPU thread is a complex series of instructions, such as a system process or operating system call.
Lately, people with jobs requiring high-performance computing are realizing that those hundreds of math cores might do something besides render video games. Nvidia and AMD wholeheartedly agree, and both have promoted their GPUs as math co-processors, although Nvidia has been far more emphatic in pushing the idea.
The final missing ingredient was double-precision floating-point math, a necessity in complex scientific calculations. Both Nvidia and ATI added double-precision floating-point capability to their chips. A single-precision floating-point number is 32 bits long, which allows 2^32 distinct bit patterns, while a double-precision number is 64 bits long, which allows 2^64. The extra bits buy more significant digits and a wider range. This is irrelevant to gamers but a must-have for a scientific researcher or software developer who is, say, simulating global climate patterns.
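To make the single-versus-double distinction concrete, here is a minimal sketch in Python. It only assumes the standard `struct` module, which can store a number in the 32-bit and 64-bit IEEE 754 formats; the helper names `round_trip_single` and `round_trip_double` are made up for this illustration and have nothing to do with any GPU toolkit.

```python
import struct


def round_trip_single(x: float) -> float:
    """Store x as a 32-bit IEEE 754 float, then read it back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round_trip_double(x: float) -> float:
    """Store x as a 64-bit IEEE 754 float, then read it back."""
    return struct.unpack("<d", struct.pack("<d", x))[0]


# Pi, written to the precision of a 64-bit Python float.
value = 3.141592653589793

single = round_trip_single(value)
double = round_trip_double(value)

# How far each stored copy drifted from the value we started with.
single_error = abs(single - value)
double_error = abs(double - value)

single_bits = struct.calcsize("<f") * 8
double_bits = struct.calcsize("<d") * 8

print(f"{single_bits}-bit copy: {single!r}, error {single_error:.3e}")
print(f"{double_bits}-bit copy: {double!r}, error {double_error:.3e}")
```

The 32-bit copy keeps only about seven significant decimal digits, so it drifts from the original value, while the 64-bit copy comes back unchanged. Over millions of chained calculations, such as a climate simulation, those small single-precision errors can accumulate.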
Programming a GPU
If the server in your data center is likely to include a GPU, then it follows that the programmers in your IT shop who write server applications should understand the basics of programming for the new hardware. Here are some of the decisions you, and they, need to be aware of.
Adopting GPU computing is not a drop-in task. You can’t just add a few boards and let the processors do the rest, as when you add more CPUs. Some programming work has to be done, and it’s not something that can be accomplished with a few libraries and lines of code.
Video games have always been the primary example of applications that use massive amounts of floating-point calculations, but they are far from the only ones. Many math-heavy applications are in visualization-related fields. These include medical and scientific imaging, 3D imaging and visualization for oil and gas exploration, and processing 3D graphics used in entertainment and advertising. Financial modeling is another heavy user.
The process is known as GPGPU computing, or general-purpose GPU computing. The task basically involves programming an application to send its calculation work to the GPU instead of the CPU. Up to now, that has meant rewriting code. Nvidia handled it with its CUDA development language.
CUDA is a C-like language that gives developers the instructions to write applications that run in parallel and execute on the Nvidia GPU. Unlike an x86 processor, the applications aren’t running two, four, or eight threads in parallel; they run hundreds of threads.
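The "hundreds of threads" model can be sketched in plain Python. This is an emulation, not CUDA code: the `launch` function is a hypothetical stand-in that runs each logical thread one after another, whereas a real GPU would run them in parallel. The names `saxpy_kernel`, `launch`, `block_dim`, and `grid_dim` are assumptions chosen for the illustration. The idea it shows is that each thread works out its own index and handles exactly one array element.

```python
def saxpy_kernel(block_idx, thread_idx, block_dim, a, x, y, out):
    """Body run by one logical thread: handle exactly one array element."""
    i = block_idx * block_dim + thread_idx  # this thread's global index
    if i < len(x):  # guard: the last block may have spare threads
        out[i] = a * x[i] + y[i]


def launch(kernel, grid_dim, block_dim, *args):
    """Hypothetical stand-in for a GPU launch.

    Runs every thread of every block in turn. This loop is sequential;
    on real hardware the threads would execute in parallel.
    """
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)


n = 1000
x = [float(i) for i in range(n)]
y = [1.0] * n
out = [0.0] * n

block_dim = 256  # threads per block (an arbitrary choice for this sketch)
grid_dim = (n + block_dim - 1) // block_dim  # enough blocks to cover n

launch(saxpy_kernel, grid_dim, block_dim, 2.0, x, y, out)

print("logical threads launched:", grid_dim * block_dim)
print("first results:", out[:4])
```

Notice that the kernel contains no loop over the data. The loop is replaced by the launch itself, which is why existing CPU code usually has to be restructured rather than simply recompiled.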
Nvidia made a serious effort to get universities behind CUDA, with more than 350 universities around the world offering courses. But, just as Nvidia gets momentum going, it may prove for naught.
The OpenCL project is a sibling of OpenGL, the graphics library used by 3D cards (although largely supplanted by Microsoft’s DirectX), and is maintained by the same standards group. Apple was the original author of the Open Computing Language (OpenCL) framework, which is used to write programs that execute across a CPU, GPU, and any other processor. OpenCL includes a language for writing kernels and APIs used in task-based and data-based parallel programming.
OpenCL has strengths and weaknesses when compared to CUDA. For starters, it supports many kinds of processors, while CUDA supports Nvidia GPUs only. An application written for OpenCL can reach the GPU without being rewritten for each vendor’s hardware; CUDA means writing C-like code specifically for the Nvidia GPU. Because OpenCL is not tied to one type of processor, it could conceivably support everything from an Itanium to a Sun UltraSparc to an ARM embedded processor.
The OpenCL framework is newer than CUDA and lacks a lot of CUDA’s features and maturity. Most notably, CUDA has its own Fast Fourier Transform (FFT) kernel but OpenCL does not. FFT is a complex algorithm used in everything from advanced scientific calculations to image processing.
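For readers who have not met the FFT, the following Python sketch shows the textbook recursive radix-2 approach. It is an illustration of the algorithm only, not the kernel that ships with CUDA, and the function names `fft` and `dft` are chosen for this example. The slower `dft` function is included purely as a cross-check.

```python
import cmath


def fft(samples):
    """Recursive radix-2 FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return list(samples)
    # Split into even- and odd-indexed halves and transform each.
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    result = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        result[k] = even[k] + twiddle
        result[k + n // 2] = even[k] - twiddle
    return result


def dft(samples):
    """Direct transform with n*n multiplications, used only as a check."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


# Eight samples of a cosine that completes two cycles over the window.
signal = [1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 0.0]
spectrum = fft(signal)

# Report the strength of each frequency bin.
for k, value in enumerate(spectrum):
    print(k, round(abs(value), 6))
```

The same arithmetic is repeated across many independent pairs of values at each level of the recursion, which is the kind of regular, math-heavy workload that suits hundreds of GPU cores.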
Both frameworks have pluses and minuses. CUDA is in the hands of Nvidia, a hard-charging company that’s its own biggest motivator and competitor, while OpenCL is managed by a standards group, which is a great way to get nothing done, at least in a hurry. So the choice is between the specter of vendor lock-in with CUDA and waiting who-knows-how-long for the OpenCL library to get updated.
There’s a third competitor: Microsoft’s DirectCompute. DirectCompute is a component of Microsoft’s DirectX 11 API library, which was released with Windows 7, which means it works on exactly one operating system. Like OpenCL, it enables applications to take advantage of the GPU’s computing power.
Because it’s only in DirectX 11, DirectCompute is the most limited of the three; it only runs on Windows 7 computers. Microsoft has not made DirectX available for any other platform, and there aren’t that many DirectX 11 cards on the market yet. ATI has a lead, but Nvidia is trying to get product out as fast as possible. For now, DirectCompute is not a server technology.
So where is a GPU best? As said previously, it’s ideal for math-heavy use cases. This includes medical research, where diseases are “reverse engineered”; medical imaging, where an ultrasound image can be rendered in minutes, not hours; and video and image processing, such as movie effects.
Nvidia recently scored significant bragging rights when all three films up for Best Visual Effects at the 2010 Oscars (Star Trek, Avatar, and District 9) turned out to have been rendered using Nvidia GPUs. In the business arena, Nvidia pointed to GPUs being used in energy research to crunch seismic data for hints of oil and gas reserves, and in finance, where the entire stock market is analyzed in real time for trends and patterns.
Your data center does not necessarily need GPUs, so don’t feel as though they must be included in your next server purchase. For basic server tasks, such as file serving, Web page serving, or databases, a GPU is not needed at all. Those are not number-crunching tasks. Anything I/O intensive, such as an application server, does not need a GPU. It needs memory, fast connections, and perhaps solid-state storage, but not a GPU.
Some programming work has to be done to take advantage of GPUs, and it’s a new type of programming. That requires some education and training on the part of your programmers. The training won’t necessarily come from the server vendors; it will come from Nvidia and AMD, which have solid developer programs but are still relatively new to enterprise development.
Copyright 2013 © Godem Online Inc.