Computers have been used in diagnostic imaging for decades, but High-Performance Computing (HPC) remains rare in this field because of the high cost of traditional HPC systems. Recent advances in shared-memory computers, together with the large market for commodity graphics cards, have produced an unprecedented increase in low-cost computing power, opening the door to Computed Tomography (CT) reconstruction algorithms that were previously considered intractable. Because computing power has been limited, Filtered Back-Projection (FBP) is the algorithm of choice, and in commercial CT systems it is usually accelerated in hardware using Field-Programmable Gate Arrays (FPGAs). In FBP, each one-dimensional view is convolved with a one-dimensional filter kernel to create a set of filtered views, which are then back-projected to produce the reconstructed image. FBP relies on oversampling the low-frequency region to obtain a reasonable reconstruction of the outer edge. In many ways this oversampling amounts to unnecessary radiation exposure for the patient, since the extra samples do not really improve image quality. Algorithms based on the Algebraic Reconstruction Technique (ART) have been shown to require fewer than 50% of the projections needed by FBP, corresponding to approximately half of the radiation exposure to the patient. The advantages of algebraic approaches such as ART include lower sensitivity to noise and the ability to reconstruct an optimal image from incomplete data. The main computational problem with ART is that the system matrix is often underdetermined, which results in an infinite number of solutions. An alternative is the Simultaneous Algebraic Reconstruction Technique (SART), which updates the reconstructed image or volume projection by projection rather than ray by ray.
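The two FBP steps described above (filter each view, then back-project) can be illustrated with a minimal pure-Python sketch. It is not the AREMI implementation: it assumes parallel-beam geometry, unit detector spacing, and the textbook spatial-domain ramp (Ram-Lak) kernel, and the function names and the disk test object are hypothetical choices made for illustration only.

```python
import math


def ram_lak(n):
    """Ramp (Ram-Lak) kernel sample at integer offset n, unit detector spacing (assumed)."""
    if n == 0:
        return 0.25
    if n % 2 == 0:
        return 0.0
    return -1.0 / (math.pi * n) ** 2


def filter_view(view):
    """Step 1: convolve one 1-D view with the 1-D filter kernel."""
    size = len(view)
    return [
        sum(ram_lak(i - k) * view[k] for k in range(size))
        for i in range(size)
    ]


def backproject_pixel(filtered_views, angles, x, y):
    """Step 2: accumulate the filtered views along the rays through pixel (x, y).

    x and y are in detector-spacing units, measured from the rotation centre.
    """
    centre = (len(filtered_views[0]) - 1) / 2.0
    total = 0.0
    for view, theta in zip(filtered_views, angles):
        t = x * math.cos(theta) + y * math.sin(theta) + centre
        low = math.floor(t)
        if low < 0 or low + 1 >= len(view):
            continue  # this ray misses the detector
        frac = t - low
        # linear interpolation between the two nearest detector samples
        total += (1.0 - frac) * view[low] + frac * view[low + 1]
    return total * math.pi / len(angles)


def disk_view(radius, n_detectors):
    """Analytic projection of a centred unit-density disk (identical at every angle)."""
    centre = (n_detectors - 1) / 2.0
    return [
        2.0 * math.sqrt(max(radius ** 2 - (i - centre) ** 2, 0.0))
        for i in range(n_detectors)
    ]


# Hypothetical test object: disk of radius 10 seen by 65 detectors over 90 angles.
angles = [k * math.pi / 90 for k in range(90)]
views = [disk_view(10, 65) for _ in angles]
filtered = [filter_view(v) for v in views]
inside = backproject_pixel(filtered, angles, 0, 0)
outside = backproject_pixel(filtered, angles, 20, 0)
```

With this test object, `inside` should come out close to the disk density of 1 and `outside` close to 0. Every pixel and every view is processed independently, which is one reason this algorithm maps well onto many-core hardware.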
Until the advent of the GPU, algorithms such as SART were considered impractical with conventional computing power and too complicated to implement on FPGAs.
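The difference between the ray-wise update of ART and the projection-wise update of SART can be sketched on a toy problem. The code below is an illustrative assumption rather than the AREMI code: it uses the textbook Kaczmarz form of ART and the standard SART correction (residuals normalised by ray sums, then by per-pixel column sums within a view), and the 2 x 2 image with two views is a hypothetical example.

```python
def dot(row, image):
    """Forward-project the image along one ray (one row of the system matrix)."""
    return sum(a * v for a, v in zip(row, image))


def art_sweep(views, image, relax=1.0):
    """ART (Kaczmarz): the image is corrected after every single ray."""
    x = list(image)
    for rays, measured in views:
        for row, b in zip(rays, measured):
            norm_sq = sum(a * a for a in row)
            if norm_sq == 0:
                continue
            step = relax * (b - dot(row, x)) / norm_sq
            x = [v + step * a for v, a in zip(x, row)]
    return x


def sart_sweep(views, image, relax=1.0):
    """SART: all rays of one projection are combined into one image update."""
    x = list(image)
    for rays, measured in views:
        # Residuals of the whole view are computed from the same image estimate.
        scaled = []
        for row, b in zip(rays, measured):
            row_sum = sum(row)
            scaled.append((b - dot(row, x)) / row_sum if row_sum else 0.0)
        for j in range(len(x)):
            col_sum = sum(row[j] for row in rays)
            if col_sum:
                correction = sum(row[j] * s for row, s in zip(rays, scaled))
                x[j] += relax * correction / col_sum
    return x


# Hypothetical 2 x 2 image [[1, 2], [3, 4]], flattened row by row.
# View 0 sums the image rows, view 1 sums the image columns.
views = [
    ([[1, 1, 0, 0], [0, 0, 1, 1]], [3.0, 7.0]),
    ([[1, 0, 1, 0], [0, 1, 0, 1]], [4.0, 6.0]),
]
art_image = [0.0] * 4
sart_image = [0.0] * 4
for _ in range(20):
    art_image = art_sweep(views, art_image)
    sart_image = sart_sweep(views, sart_image)
```

Four measurements of rank three cannot pin down four unknowns, so this toy system is underdetermined in the sense mentioned above; both schemes still reach an image that reproduces all the measurements. In SART the residuals of a view do not depend on each other, so they can be evaluated in parallel, whereas each ART step depends on the previous one.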
The Advanced Reconstruction Environment for Medical Imaging (AREMI) project, started three years ago at the SERVIER Virtual Cardiac Center, was an attempt to solve many of these problems and to revisit old algorithms from a modern perspective. The first version of AREMI uses multi-core GPUs to reconstruct images from raw attenuation data measured by a Siemens Definition Flash CT scanner. AREMI was implemented on a Dell computer equipped with two NVIDIA Tesla C2070 GPUs, each containing 448 CUDA cores across 14 streaming multiprocessors. This environment allowed us to analyse how various reconstruction techniques (FBP, ART, and SART) can be implemented in a multi-core context. We studied their parallel implementation, their speed, and the quality of the reconstruction. Because we had full control over the reconstruction process, we also studied the effect of High Dynamic Range (HDR) reconstruction on the detection of subtle changes in CT images.
Following this successful first phase, it became obvious that reaching our objective of real-time processing would require considerably more compute power, especially for SART techniques. To that end, we are planning to build an eight-GPU cluster and to rewrite parts of the AREMI code to take advantage of the cluster's hardware characteristics. Using multiple GPUs is inherently complicated because it requires careful, hardware-dependent algorithm design. Our preliminary studies show that, for a simple FBP reconstruction of a 512 x 512 image, we obtain a speed-up of 830 times compared with a CPU implementation, but only 11 times (1766.2 s to 163.9 s) for the SART algorithm. With two GPUs the SART speed-up was 19 times (92.5 s), which is not completely linear. With a more optimized implementation, an eight-GPU cluster will speed up SART sufficiently to be useful in clinical settings where fast reconstructions are required (~1 s/slice). In addition, new technologies such as direct GPU-to-GPU transfer using NVIDIA GPUDirect memory transfer and the Unified Virtual Addressing (UVA) mechanism may completely change the scalability of GPU clusters, because they remove the need for the CPU to be involved in massive transfers between GPU memories. Our simple experiment with two GPUs suggests that the new reconstruction technique should scale well, because the amount of data shared among GPUs is constant. It is our hope that, by using well-designed algorithms and state-of-the-art hardware, we will be the first group in the world to implement a SART algorithm for real clinical applications. With the help of the SERVIER Virtual Cardiac Center, we will test these GPU-accelerated reconstruction techniques with clinicians on real-time cardiac CT data.
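The SART timings quoted above can be turned into speed-up and scaling figures with a few lines of arithmetic. The sketch below uses only the three measured times from the text; the Amdahl's-law fit and the eight-GPU projection are an assumption added for illustration (a fixed serial fraction estimated from two data points), not a measurement or a prediction made by the project.

```python
CPU_SECONDS = 1766.2      # SART on the CPU (from the text)
ONE_GPU_SECONDS = 163.9   # SART on one GPU (from the text)
TWO_GPU_SECONDS = 92.5    # SART on two GPUs (from the text)


def speedup(reference_time, accelerated_time):
    """How many times faster the accelerated run is than the reference run."""
    return reference_time / accelerated_time


def parallel_efficiency(time_one, time_n, n):
    """Fraction of ideal linear scaling achieved when going from 1 to n devices."""
    return time_one / (n * time_n)


def amdahl_serial_fraction(time_one, time_n, n):
    """Serial fraction s solving time_n = time_one * (s + (1 - s) / n)."""
    ratio = time_n / time_one
    return (ratio - 1.0 / n) / (1.0 - 1.0 / n)


def amdahl_time(time_one, serial_fraction, n):
    """Projected run time on n devices under Amdahl's law (assumed model)."""
    return time_one * (serial_fraction + (1.0 - serial_fraction) / n)


one_gpu_speedup = speedup(CPU_SECONDS, ONE_GPU_SECONDS)
two_gpu_speedup = speedup(CPU_SECONDS, TWO_GPU_SECONDS)
efficiency = parallel_efficiency(ONE_GPU_SECONDS, TWO_GPU_SECONDS, 2)
serial = amdahl_serial_fraction(ONE_GPU_SECONDS, TWO_GPU_SECONDS, 2)
projected_eight_gpu_seconds = amdahl_time(ONE_GPU_SECONDS, serial, 8)

print(f"1 GPU speed-up: {one_gpu_speedup:.1f}x, 2 GPUs: {two_gpu_speedup:.1f}x")
print(f"2-GPU parallel efficiency: {efficiency:.1%}")
print(f"Projected 8-GPU time (Amdahl assumption): {projected_eight_gpu_seconds:.1f} s")
```

Under this simple model, adding GPUs to the current code does not by itself bring the run time anywhere near the 1 s target, which is consistent with the need for a more optimized implementation and for reducing CPU-mediated transfers.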