Dr. Jun Ni, Ph.D., M.E., Associate Professor of Radiology, Biomedical Engineering, Mechanical Engineering, and Computer Science, The University of Iowa, Iowa City, Iowa, USA
Outline
Multi-core Architecture and Programming Environment; Enabling Technology of Multi-core Computing for Medical Imaging
Multicore programming on Wikipedia. Extracted from: Multi-core Programming, by Shameem Akhter and Jason Roberts; Professional Multicore Programming: Design and Implementation for C++ Developers (Wrox), 2008; The Art of Multiprocessor Programming, by Maurice Herlihy and Nir Shavit, 2008; Parallel MATLAB for Multicore and Multinode Computers, by Jeremy Kepner, 2009; Java Performance on Multi-Core Platforms, by Charles J. Hunt, Paul Hohensee, Binu John, and David Dagastine.
A seminal memo on the Electronic Discrete Variable Automatic Computer (EDVAC) suggested the stored-program model of computing.
A program is a sequence of instructions stored sequentially in the computer's memory; instructions are executed one after the other in a linear, single-threaded fashion.
One process at a time: the operating system handled the details of allocating CPU time to each individual program. Concurrency existed at the process level, with the systems programmer switching job tasks.
Early PCs
Multitasking became standard and enabled users to start and run multiple programs in the same user environment.
Challenges
Most end-users have a simplistic view of complex computer systems; implementing such a system is far more difficult.
Reality:
Consider a client-server-based computation environment for multimedia streaming and display. On the client side:
The PC must be able to download the streaming video data, decompress/decode it, and draw it on the video display. The PC also handles any streaming audio that accompanies the video stream and sends it to the sound card.
A streaming multimedia delivery service looks simple from the end user's perspective. In order to provide an acceptable end-user experience, however, system designers must be able to effectively manage many independent subsystems that operate in parallel.
Concurrency
Concurrency is a way to manage the sharing of resources used at the same time. It is important for several reasons: concurrency allows for the most efficient use of system resources, and efficient resource utilization is the key to maximizing the performance of computing systems.
Without concurrency, the system sits completely idle while waiting for data to come in from the network. A better approach is to stage the work: while the system waits for the next job to arrive from the network, the previous job is being decoded by the CPU, thereby improving overall resource utilization. Recently developed multi-core processors answer this demand.
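The staging described here is the classic producer/consumer pattern. A minimal sketch follows; the "network" and "decode" stand-ins and all names are our own illustration, not from the text:

```python
import queue
import threading

jobs = queue.Queue()
results = []

def network_reader():
    # Stands in for the network: hands over jobs as they "arrive".
    for job in ["frame-1", "frame-2", "frame-3"]:
        jobs.put(job)
    jobs.put(None)  # sentinel: no more jobs

def decoder():
    # Decodes each job while the reader is free to fetch the next one.
    while (job := jobs.get()) is not None:
        results.append(job.upper())  # stand-in for real decoding work

reader = threading.Thread(target=network_reader)
worker = threading.Thread(target=decoder)
reader.start(); worker.start()
reader.join(); worker.join()
print(results)  # ['FRAME-1', 'FRAME-2', 'FRAME-3']
```

The queue decouples the two stages, so neither has to sit idle waiting for the other.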
What is Multi-core?
A multi-core processor is a processing system in which a single integrated circuit contains two or more independent processing units, called cores. A dual-core processor contains two cores; a quad-core processor contains four. A multi-core processor thus implements multiprocessing in a single physical package.
Just as with single-processor systems, cores in multi-core systems may implement different architectures
Multi-core processors are widely used across many application domains.
Parallel Cases:
The speedup is limited by the fraction of the software that can be parallelized to run on multiple cores simultaneously, as described by Amdahl's law.
In the best case, embarrassingly parallel problems may realize speedup factors near the number of cores
Many typical applications, however, do not realize such large speedup factors; parallelization of software is a significant ongoing topic of research.
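Amdahl's law can be sketched numerically. The helper below (our own naming, not from the text) computes the speedup bound for a given parallel fraction and core count:

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Maximum speedup when `parallel_fraction` of the work can run
    on `cores` cores and the remainder stays strictly serial."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)

# Even with 90% of the code parallelized, four cores give barely 3x,
# and no number of cores can ever exceed 10x.
print(amdahl_speedup(0.90, 4))     # ≈ 3.08
print(amdahl_speedup(0.90, 1000))  # ≈ 9.91
```

This is why the serial fraction, not the core count, usually dominates real-world speedups.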
Terminology
There is some discrepancy in the semantics by which the terms multi-core and dual-core are defined. Most commonly they refer to some sort of central processing unit (CPU); sometimes they are also applied to digital signal processors (DSPs) and systems-on-a-chip (SoCs).
Some use these terms to refer only to multi-core microprocessors that are manufactured on the same integrated circuit die; they generally refer to separate microprocessor dies in the same package by another name, such as multi-chip module. Here, both "multi-core" and "dual-core" refer to microelectronic CPUs manufactured on the same integrated circuit.
In contrast to multi-core systems, the term multi-CPU refers to multiple physically separate processing units.
The terms many-core and massively multi-core are sometimes used to describe multi-core architectures with an especially high number of cores (tens or hundreds).
Some systems use many soft microprocessor cores placed on a single FPGA. Each of these "cores" can be considered a "semiconductor intellectual property core" as well as a CPU core.
Development
While manufacturing technology continues to improve, reducing the size of individual gates, the physical limits of semiconductor-based microelectronics have become a major design concern.
The demand for more capable microprocessors causes CPU designers to use various methods of increasing performance. Some instruction-level parallelism (ILP) methods, like superscalar pipelining, are suitable for many applications but inefficient for others.
Many applications are better suited to thread-level parallelism (TLP) methods, and multiple independent CPUs are one common way to increase a system's overall TLP. A combination of increased available space, due to refined manufacturing processes, and the demand for increased TLP is the logic behind the creation of multi-core CPUs.
Commercial Incentives
Several business motives drive the development of dual-core architectures. Since symmetric multiprocessing (SMP) designs have long been implemented using discrete CPUs, the issues of implementing the architecture and supporting it in software are well known, and utilizing a proven processing-core design without architectural changes reduces design risk significantly.
For general-purpose processors, much of the motivation for multi-core processors comes from greatly diminished gains in processor performance from increasing the operating frequency.
The memory wall: the increasing gap between processor and memory speeds. This effect pushes cache sizes larger in order to mask the latency of memory, which helps only to the extent that memory bandwidth is not the bottleneck in performance.
The ILP wall: the increasing difficulty of finding enough parallelism in a single instruction stream to keep a high-performance single-core processor busy.
The power wall: the trend of power consumption growing steeply with each increase of operating frequency. This increase can be mitigated by "shrinking" the processor, using smaller traces for the same logic. The power wall poses manufacturing, system-design, and deployment problems that are not justified in the face of the diminished performance gains caused by the memory wall and the ILP wall.
The terminology "dual-core" (and other multiples) also lends itself to marketing efforts. In order to continue delivering regular performance improvements for general-purpose processors, manufacturers such as Intel and AMD have turned to multi-core designs, sacrificing lower manufacturing costs for higher performance in some applications and systems. Multi-core architectures are being developed, but so are the alternatives; an especially strong contender for established markets is the further integration of peripheral functions into the chip.
Advantages
The proximity of multiple CPU cores on the same die allows the cache-coherency circuitry to operate at a much higher clock rate than is possible if the signals have to travel off-chip. Combining equivalent CPUs on a single die significantly improves the performance of cache-snoop (bus-snooping) operations.
Put simply, this means that signals between different CPUs travel shorter distances, and therefore those signals degrade less. These higher quality signals allow more data to be sent in a given time period since individual signals can be shorter and do not need to be repeated as often.
The largest boost in performance will likely be noticed in improved response time while running CPU-intensive processes, like antivirus scans, ripping/burning media (requiring file conversion), or searching for folders. If an automatic virus scan initiates while a movie is being watched, the application running the movie is far less likely to be starved of processor power, as the antivirus program will be assigned to a different processor core than the one running the movie playback.
Assuming that the die can physically fit into the package, multi-core CPU designs require much less printed circuit board (PCB) space than multi-chip SMP designs. A dual-core processor also uses slightly less power than two coupled single-core processors, principally because of the decreased power required to drive signals external to the chip.
Furthermore, the cores share some circuitry, like the L2 cache and the interface to the front-side bus (FSB). In terms of competing technologies for the available silicon die area, a multi-core design can make use of proven CPU core library designs and produce a product with a lower risk of design error than devising a new, wider core design.
Disadvantages
In addition to operating system (OS) support, adjustments to existing software are required to maximize utilization of the computing resources provided by multi-core processors. Also, the ability of multi-core processors to increase application performance depends on the use of multiple threads within applications.
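The point about needing multiple threads within an application can be sketched as follows. The checksum workload and all names are our own illustration; note that in CPython the GIL limits CPU-bound thread parallelism, so this shows the structure rather than a guaranteed speedup:

```python
from concurrent.futures import ThreadPoolExecutor

def checksum(block: bytes) -> int:
    # Stand-in for the per-block work an application might parallelize.
    return sum(block) % 256

blocks = [bytes([i]) * 1000 for i in range(8)]

# One task per block; on a multi-core CPU the OS can schedule the
# pool's worker threads onto different cores. An application that
# never splits its work like this cannot benefit from extra cores.
with ThreadPoolExecutor(max_workers=4) as pool:
    sums = list(pool.map(checksum, blocks))

print(sums)
```

The key design point is that the work was divided into independent tasks at all; a single-threaded version of the same loop leaves all but one core idle.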
Valve Corporation's Source engine offers multi-core support, and Crytek has developed similar technologies for CryEngine 2, which powers their game Crysis. Emergent Game Technologies' Gamebryo engine includes their Floodgate technology, which simplifies multi-core development across game platforms.
Integration of a multi-core chip drives production yields down, and multi-core chips are more difficult to manage thermally than lower-density single-chip designs. Intel has partially countered the first problem by building its quad-core designs from two dual-core dies with a unified cache: any two working dual-core dies can be used, as opposed to producing four cores on a single die and requiring all four to work to produce a quad-core.
From an architectural point of view, ultimately, single CPU designs may make better use of the silicon surface area than multiprocessing cores, so a development commitment to this architecture may carry the risk of obsolescence.
Finally, raw processing power is not the only constraint on system performance. Two processing cores sharing the same system bus and memory bandwidth limit the real-world performance advantage. If a single core is close to being memory-bandwidth limited, going to dual-core might only give a 30% to 70% improvement.
If memory bandwidth is not a problem, a 90% improvement can be expected. It would even be possible for an application that used two separate CPUs to run faster on one dual-core chip, if communication between the CPUs was the limiting factor; this would count as more than a 100% improvement.
Hardware
The general trend in processor development has been from multi-core to many-core: from dual-, tri-, quad-, hexa-, octo-core chips to ones with tens or even hundreds of cores. In addition, multi-core chips mixed with simultaneous multithreading, memory-on-chip, and special-purpose "heterogeneous" cores promise further performance and efficiency gains, especially in processing multimedia, recognition and networking applications.
There is also a trend of improving energy efficiency by focusing on performance-per-watt with advanced fine-grain or ultra-fine-grain power management and dynamic voltage and frequency scaling (e.g. in laptop computers and portable media players).
Architecture
One of the biggest areas for variety in multi-core architecture is the composition and balance of the cores themselves. Some architectures use one core design which is repeated consistently ("homogeneous"), while others use a mixture of different cores, each optimized for a different role ("heterogeneous").
As an example of this discussion, the article "CPU designers debate multi-core future" by Rick Merritt, EE Times 2008, includes these comments:
"Chuck Moore... suggested computers should be more like cellphones, using a variety of specialty cores to run modular software scheduled by a high-level applications programming interface."
A typical application may create a new thread for a long-running task such as a scan, while its GUI thread waits for commands from the user (e.g. to cancel the scan). In such cases, a multi-core architecture is of little benefit for the application itself, because the single worker thread does all the heavy lifting and the work cannot be balanced evenly across multiple cores.
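This division of labor can be sketched with a worker thread plus a cancellation flag. The file names and the `scan_files` helper are our own illustration, not from the text:

```python
import threading

cancel = threading.Event()
scanned = []

def scan_files(files):
    # The worker thread does the heavy lifting; the "GUI" thread
    # stays free to react to the user (e.g. set the cancel flag).
    for f in files:
        if cancel.is_set():
            break
        scanned.append(f)  # stand-in for actually scanning one file

worker = threading.Thread(target=scan_files,
                          args=(["a.doc", "b.exe", "c.pdf"],))
worker.start()
worker.join()       # a real GUI would keep handling events instead
print(scanned)      # all three files, since nothing set the flag
```

Note that however many cores are available, all the scanning work still runs on one thread; a second core only helps keep the GUI responsive.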
Programming truly multithreaded code often requires complex co-ordination of threads and can easily introduce subtle and difficult-to-find bugs due to the interleaving of processing on data shared between threads (see thread-safety). Consequently, such code is much more difficult to debug than single-threaded code when it breaks.
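A minimal illustration of the shared-data hazard and its usual fix, a lock; the counter example is our own, not from the text:

```python
import threading

counter = 0
lock = threading.Lock()

def add_many(n: int) -> None:
    global counter
    for _ in range(n):
        with lock:        # without the lock, the read-modify-write
            counter += 1  # below could interleave and lose updates

threads = [threading.Thread(target=add_many, args=(10_000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000 — deterministic only because of the lock
```

Without the lock, two threads can read the same old value of `counter` and both write back the same incremented value, silently losing an update; such bugs depend on timing, which is exactly why they are hard to reproduce and debug.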
There has been a perceived lack of motivation for writing consumer-level threaded applications because of the relative rarity of consumer-level multiprocessor hardware. Although threaded applications incur little additional performance penalty on single-processor machines, the extra development overhead has been difficult to justify due to the preponderance of single-processor machines.
Programming Environment
Given the increasing emphasis on multicore chip design, stemming from the grave thermal and power consumption problems posed by any further significant increase in processor clock speeds, the extent to which software can be multithreaded to take advantage of these new chips is likely to be the single greatest constraint on computer performance in the future. If developers are unable to design software to fully exploit the resources provided by multiple cores, then they will ultimately reach an insurmountable performance ceiling.
The telecommunications market was one of the first to need a new design for parallel datapath packet processing, because of the very quick adoption of these multi-core processors for the datapath and the control plane. These MPUs are going to replace the traditional network processors that were based on proprietary micro- or pico-code.
Parallel programming techniques can benefit from multiple cores directly. Some existing parallel programming models, such as Cilk++, OpenMP, Skandium, and MPI, can be used on multi-core platforms. Intel introduced a new abstraction for C++ parallelism called TBB (Threading Building Blocks), and other research efforts are ongoing.
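As a rough illustration of the fork-join style these models share (popularized by OpenMP's `parallel for`), here is a thread-based sketch; the `parallel_for` helper and its names are ours, not a real binding for any of the libraries above:

```python
import threading

def parallel_for(n, body, workers=4):
    """Fork-join loop in the spirit of OpenMP's 'parallel for':
    split the iteration space into chunks, run one thread per
    chunk, then join all threads (the implicit barrier)."""
    chunk = (n + workers - 1) // workers
    threads = [
        threading.Thread(
            target=lambda lo=w * chunk:
                [body(i) for i in range(lo, min(lo + chunk, n))])
        for w in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

squares = [0] * 10
# Each index is written by exactly one thread, so no lock is needed.
parallel_for(10, lambda i: squares.__setitem__(i, i * i))
print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

Real frameworks like OpenMP or TBB additionally handle scheduling, load balancing, and reductions; the sketch only shows the fork-join shape of the model.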
Multi-core processing has also affected modern computational software development. Developers programming in newer languages might find that those languages do not support multi-core functionality.
This then requires the use of numerical libraries to access code written in languages like C and Fortran, which perform math computations faster than newer languages like C#. Intel's MKL and AMD's ACML are written in these native languages and take advantage of multi-core processing.
Managing concurrency acquires a central role in developing parallel applications. The basic steps in designing parallel applications are:
Partitioning: The partitioning stage of a design is intended to expose opportunities for parallel execution. Hence, the focus is on defining a large number of small tasks in order to yield what is termed a fine-grained decomposition of the problem.
Communication: The tasks generated by a partition are intended to execute concurrently but cannot, in general, execute independently. The computation to be performed in one task will typically require data associated with another task.
Data must then be transferred between tasks so as to allow computation to proceed. This information flow is specified in the communication phase of a design.
Agglomeration: In the third stage, we move from the abstract toward the concrete. We revisit decisions made in the partitioning and communication phases with a view to obtaining an algorithm that will execute efficiently on some class of parallel computer.
In particular, we consider whether it is useful to combine, or agglomerate, tasks identified by the partitioning phase, so as to provide a smaller number of tasks, each of greater size. We also determine whether it is worthwhile to replicate data and/or computation.
Mapping: In the fourth and final stage of the parallel algorithm design process, we specify where each task is to execute. This mapping problem does not arise on uniprocessors or on shared-memory computers that provide automatic task scheduling.
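The four stages above can be made concrete on a toy problem, summing the numbers 1..100. The chunking, the shared `partials` list, and the thread-per-task mapping below are our own illustration:

```python
import threading

data = list(range(1, 101))

# Partitioning: split the data into small independent tasks (chunks).
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]

# Communication: each task reports its partial sum back through a
# shared list, one slot per task, so no two tasks ever conflict.
partials = [0] * len(chunks)

def task(idx: int) -> None:
    partials[idx] = sum(chunks[idx])

# Agglomeration + mapping: we agglomerate down to four sizeable
# tasks and map one task per thread; the OS then places each
# thread on an available core.
threads = [threading.Thread(target=task, args=(i,))
           for i in range(len(chunks))]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sum(partials))  # 5050
```

The final reduction over `partials` is the point where the communicated results are combined into the answer.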
On the other hand, on the server side, multicore processors are ideal because they allow many users to connect to a site simultaneously and have independent threads of execution. This allows for Web servers and application servers that have much better throughput.
Typically, proprietary enterprise server software is licensed "per processor". In the past, a CPU was a processor, and most computers had only one CPU, so there was no ambiguity.
Now there is the possibility of counting cores as processors and charging a customer for multiple licenses for one multi-core CPU. However, the trend seems to be counting a dual-core chip as a single processor: Microsoft, Intel, and AMD support this view, and Microsoft has said it would treat a socket as a single processor.
Oracle counts an AMD X2 or Intel dual-core CPU as a single processor but has other numbers for other types, especially for processors with more than two cores.
IBM and HP count a multi-chip module as multiple processors. If multi-chip modules count as one processor, CPU makers have an incentive to make large expensive multi-chip modules so their customers save on software licensing. So it seems that the industry is slowly heading towards counting each die (see Integrated circuit) as a processor, no matter how many cores each die has.
An area of processor technology distinct from "mainstream" PCs is that of embedded computing. The same technological drivers towards multicore apply here too. Indeed, in many cases the application is a "natural" fit for multicore technologies, if the task can easily be partitioned between the different processors.
In addition, embedded software is typically developed for a specific hardware release, making issues of software portability, legacy code, and support for independent developers less critical than in PC or enterprise computing. As a result, it is easier for developers to adopt new technologies, and there is consequently a greater variety of multi-core processing architectures and suppliers.
In network processing, it is now mainstream for devices to be multi-core, with companies such as Freescale Semiconductor, Cavium Networks, and Broadcom all manufacturing products with eight processors.
Commercial Hardware
Texas Instruments: three-core TMS320C6488 and four-core TMS320C5441. Freescale: four-core MSC8144 (with eight-core successors). Newer entries include the Storm-1 family, with 40 and 80 general-purpose ALUs per chip, all programmable in C as a SIMD engine, and Picochip, with three hundred processors on a single die, focused on communication applications.
A multi-core SPARC processor. Ageia PhysX, a multi-core physics processing unit. Ambric Am2045, a massively parallel processor array.
AMD Athlon 64, Athlon 64 FX, and Athlon 64 X2 family: dual-core desktop processors. Opteron: dual-core server/workstation processors.
Phenom: dual-, triple-, and quad-core desktop processors. Also dual-core entry-level processors; Turion 64 X2 dual-core laptop processors; and a multi-core GPU/GPGPU (10 cores, 16 5-issue-wide superscalar stream processors per core).
Analog Devices BF561, a symmetrical dual-core processor. ARM MPCore: a fully synthesizable multi-core container for ARM Cortex-A9 MPCore processor cores, intended for high-performance embedded and entertainment applications.
Azul Systems Vega 1, a 24-core processor, released in 2005; Vega 2, a 48-core processor, released in 2006; Vega 3, a 54-core processor, released in 2008.
Broadcom SiByte SB1250, SB1255 and SB1455. Cradle Technologies CT3400 and CT3600, both multi-core DSPs. Cavium Networks Octeon, a 16-core MIPS MPU. Freescale Semiconductor QorIQ series processors, up to 8 cores, Power Architecture MPU. Hewlett-Packard PA-8800 and PA-8900, dual-core PA-RISC processors.
IBM
POWER4, the world's first non-embedded dual-core processor, released in 2001. POWER5, a dual-core processor, released in 2004. POWER6, a dual-core processor, released in 2007. PowerPC 970MP, a dual-core processor, used in the Apple Power Mac G5. Xenon, a triple-core, SMT-capable PowerPC microprocessor used in the Microsoft Xbox 360 game console.
IBM, Sony, and Toshiba: the Cell processor, a nine-core processor with one general-purpose PowerPC core and eight specialized Synergistic Processing Units (SPUs) optimized for vector operations, used in the Sony PlayStation 3. Infineon Danube, a dual-core, MIPS-based home gateway processor.
Intel
Celeron Dual-Core, the first dual-core processor for the budget/entry-level market. Core Duo, a dual-core processor. Core 2 Duo, a dual-core processor. Core 2 Quad, a quad-core processor. Core i3, Core i5, Core i7, and Core i9, a family of multi-core processors and the successor of the Core 2 Duo and Core 2 Quad. Itanium 2, a dual-core processor.
Pentium D, two single-core dies packaged in a multi-chip module. Pentium Dual-Core, a dual-core processor. Teraflops Research Chip (Polaris), a 3.16 GHz, 80-core processor prototype, which the company says will be released within the next five years. Xeon dual-, quad-, and hexa-core processors.
IntellaSys: SEAforth 40C18, a 40-core processor; SEAforth24, a 24-core processor designed by Charles H. Moore.
Nvidia: GeForce 9 multi-core GPU (8 cores, 16 scalar stream processors per core); GeForce 200 multi-core GPU (10 cores, 24 scalar stream processors per core); Tesla multi-core GPGPU (10 cores, 24 scalar stream processors per core).
Parallax Propeller P8X32, an eight-core microcontroller. picoChip PC200 series, 200–300 cores per device for DSP & wireless. Plurality HAL series, tightly coupled 16–256 cores with L1 shared memory and a hardware-synchronized processor. Rapport Kilocore KC256, a 257-core microcontroller with a PowerPC core and 256 8-bit "processing elements" (Rapport is now out of business).
SiCortex: the "SiCortex node" has six MIPS64 cores on a single chip. Sun Microsystems: MAJC 5200, a two-core VLIW processor; UltraSPARC IV and UltraSPARC IV+, dual-core processors; UltraSPARC T1, an eight-core, 32-thread processor; UltraSPARC T2, an eight-core, 64-concurrent-thread processor.
Texas Instruments TMS320C80 MVP, a five-core multimedia video processor. Tilera TILE64, a 64-core processor. XMOS Software Defined Silicon, quad-core XS1-G4.
Academic
MIT: 16-core RAW processor. University of California, Davis: Asynchronous Array of Simple Processors (AsAP), in 36-core and 167-core versions.
Keywords
Multicore Association, multithreading (computer hardware), multiprocessing, hyper-threading, symmetric multiprocessing (SMP), simultaneous multithreading (SMT), multitasking, parallel computing, PureMVC MultiCore (a modular programming framework), XMTC, parallel random access machine.
References
TechTarget: multi-core processor. Multi-core in the Source Engine. AMD: dual-core not for gamers... yet. Gamebryo's Floodgate page. "CPU designers debate multi-core future", by Rick Merritt, EE Times 2008. Multicore Packet Processing Forum.
Parallel Computing Research wiki: "Chip Multiprocessor Comparison Chart". A Berkeley View on the Parallel Computing Landscape, which argues for the urgent need to innovate around "manycore". BMDFM: Binary Modular Dataflow Machine, a multi-core runtime environment.
Intel Tera-scale Computing Research Program. Overview of Intel's dual-core CPUs' specifications (Intel's website). Multi-core programming blog. E-book on multicore programming, outlining multicore programming challenges and the leading programming approaches for dealing with them.
XMTC: PRAM-like programming, software release. Online multicore community. IEEE: "Multicore Is Bad News for Supercomputers" (for some computing tasks, 8 cores aren't yet much better than 4). Multicore short course at MIT. Diploma thesis: "A Virtual Platform for High Speed Message-Passing-Hardware Research", a virtual network interface for many-core CPUs.
IBM and Mayo Clinic announced a collaboration to explore parallel computer architecture and memory bandwidth for the processing of 3-D medical images. Graphics chips that Sony, Toshiba, and IBM made for gaming can be employed to improve health-care services.
Mayo Clinic scientists utilized the IBM Cell processors to align two medical images obtained on different dates and with different imaging devices, so that Mayo Clinic radiologists can more easily detect structural changes such as the growth or shrinkage of tumors.
"This alignment of images both improves the accuracy of interpretation and improves radiologist efficiency, particularly for diseases like cancer," says Mayo radiology researcher Bradley Erickson, M.D., Ph.D. who initially contacted IBM to discuss Mayo's computing needs.
Through porting and optimization of Mayo Clinic's image registration application on the IBM BladeCenter QS20, the image registration ran 50 times faster than the same application on a traditional processor configuration. This breakthrough inspired UI medical imaging researchers to seek a high-end computing facility to accelerate their current NIH-funded research projects. Presently, no supercomputer or HPC cluster is available to these project investigators.
This project is driven entirely by UI's end-users in medical imaging and informatics, who fall into 5 major user groups. Medical imaging application profiles help us define the system's basic requirements:
(1) a high-end supercomputer with multiple computing nodes; (2) multi-core and graphics-accelerator processors to speed up and handle multiple threads; (3) a high-performance interconnect; (4) capability for graphics computing and data visualization; (5) a certain data storage capacity and a connection to PACS; (6) a multi-core programming environment, selected medical imaging software, parallel libraries and tools, and administration/management suites (accounting, job scheduling, monitoring, etc.); (7) strong technical support; and (8) parallel application support.
Medical imaging registration using Cell/BE and GPU processors. Cell processors and graphics accelerators, initially developed for the game industry, have begun to replace traditional CPUs in some applications. This recent trend makes it possible to port a medical imaging application to a Cell- or GPU-based system. For example, Sony, Toshiba, and IBM established a joint effort to develop the Cell Broadband Engine (Cell/BE). In 2007, IBM and Mayo Clinic conducted a linear image registration of 98 sets of medical images using IBM Cell QS20 processors in place of regular processors.
They used their own application software, the MRIcro viewer and Mayo Clinic image tools (ImageFile and Mayo open-source ITK), to register a moving image to a fixed image on an IBM Cell-based cluster, and achieved a 60-fold speed-up, reducing the total registration time for the 98 data sets from hours to 516 seconds.
It was critical to restructure the entire program to achieve this performance gain: maximizing SPE usage, minimizing memory traffic, and optimizing the code for the SPE pipeline structure with SIMD intrinsics (IBM Research Report RC24138, 2007; Ohara et al., 2007a,b; Gong et al., 2008). The collaboration between IBM and Mayo Clinic makes it possible to register medical images up to 50 times faster, providing critical diagnoses, such as detection of the growth or shrinkage of tumors, in seconds instead of hours.
With the IBM Cell/BE cluster, they are tackling several clinically promising projects, including maximum-resolution organ imaging, image-guided tumor ablation, and automated change detection and analysis. This successful study encourages many of our UI users who need high-performance registration, and inspires us to conduct a preliminary study of how to develop efficient parallel algorithms and data decompositions that exploit the Cell's intrinsic (PPE and SPE) architecture for multithreaded data fetching.
Alternatively, GPU processors can handle image registration. For example, Samant et al. (2008) compared traditional CPU-based with GPU-based deformable image registration (DIR) for adaptive radiotherapy; they concluded that GPU registration is about 50 times faster than a single-threaded CPU implementation and 30 times faster than a multi-threaded CPU implementation. Yang (2009) described a robust and accurate 2-D/3-D image registration algorithm whose 2-D version is implemented on the IBM Cell/B.E. Yang achieved about a 10-fold speed-up, which allows the nonlinear registration of a pair of images (192 × 192) to complete in less than five seconds.
Whether Cell or GPU is the better solution is unclear; it depends on the cluster's internal architecture and on multi-core programming skills. For this application, we will take a three-step strategy. First, we implement our existing parallel registration codes on the CPU-based processors, so that we have a basic solution. Then, we will explore deploying our registration programs on the Cell/BE or GPU processors for comparison. Finally, we conduct our parallel production registrations on the system for the NIH projects. The experience, parallel algorithms, implementation procedures, tips, and software programs will be open for public use.
Medical image reconstruction on Cell/B.E. processors. Image reconstruction is one of our technical tasks. Computational acceleration on graphics processing units (GPUs) can make advanced magnetic resonance imaging (MRI) reconstruction algorithms attractive in clinical settings, thereby improving the quality of MR images across a broad spectrum of applications. Stone et al. (2008) presented their acceleration algorithm on a single NVIDIA Quadro FX 5600. The reconstruction of a 3-D image with 128³ voxels achieves up to 180 GFLOPS and requires just over one minute on the Quadro, while reconstruction on a quad-core CPU is twenty-one times slower. Cell processor technology offers the advantages of a cost-effective, high-performance platform for medical reconstruction and imaging.
A research group at the Institute of Medical Physics in Erlangen, Germany worked with Mercury Computer Systems in Germany to experiment with many cases of CT image reconstruction of a 512³ volume on the Cell/BE. They achieved sufficient computing performance for high image quality (Knaup and Kachelrieß, 2007; Kachelrieß et al., 2007a,b; Knaup et al., 2007). They systematically compared the performance of CT reconstruction on GPUs, Field Programmable Gate Arrays (FPGAs), and Cell processors.
Their recent study shows that the cone-beam backprojection of 512 projections into the 512³ volume took 3.2 min on a PC but only 13.6 s on the Cell; the Cell thereby greatly outperforms today's top-notch GPU-based backprojections. Using both CBEs of a dual-Cell blade provided by Mercury Computer Systems allows 2D backprojection of 330 images/s, and the 3D cone-beam backprojection completes in 6.8 s (Kachelrieß, 2007).
We will deploy many image reconstruction algorithms (2D and 3D; parallel-beam, fan-beam, and spiral cone-beam; FBP and EM; etc.) on the systems and compare the results with those we obtained before. We will begin programming our algorithms on the Cell processors to study performance benchmarks, and the best option will be recommended for production use by the major users.
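The structure that makes backprojection such a good fit for Cell and GPU processors can be seen in a toy example: each projection is smeared back across the image independently, so the loop over projections (or over output pixels) parallelizes trivially. The sketch below is a deliberately minimal unfiltered parallel-beam backprojection at only two angles (0° and 90°), not any of the cited implementations.

```python
def project(img):
    """Parallel-beam projections at 0 deg (column sums) and 90 deg (row sums)."""
    n = len(img)
    cols = [sum(img[y][x] for y in range(n)) for x in range(n)]
    rows = [sum(img[y][x] for x in range(n)) for y in range(n)]
    return cols, rows

def backproject(cols, rows):
    """Smear each projection back across the image and accumulate.
    This per-pixel accumulation is the loop a Cell/GPU port parallelizes."""
    n = len(cols)
    return [[cols[x] + rows[y] for x in range(n)] for y in range(n)]

n = 8
img = [[0.0] * n for _ in range(n)]
img[3][5] = 1.0                       # single bright voxel
recon = backproject(*project(img))
peak = max((recon[y][x], y, x) for y in range(n) for x in range(n))
print(peak[1], peak[2])               # brightest reconstructed pixel: 3 5
```

A real FBP adds ramp filtering and many more angles, but the data-parallel accumulation pattern is the same.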
Medical Image Segmentation on GPU/Cell/B.E. processors. As discussed, besides image registration, image segmentation is one of the applications used most frequently by our major users.
Medical image segmentation using high-end computing technology is still at a very early stage, although it is extremely important to clinical practice. Baggia et al. (2007) presented a performance comparison of image segmentation across different multi-core architectures, namely Cell, GPU, and SIMD.
For single processors, their results for a 256 × 256 image show 3.3 ms on a CPU, 2.4 ms on a CPU with SIMD, 1.0 ms on a GPU (8 pixel shaders), 0.87 ms on a PS3 Cell (6 SPEs), and 0.4 ms on another GPU (32 pixel shaders).
Again, effective and fundamental parallelism in segmentation is the key that opens the door to advanced computation on Cell or GPU processors. The major concern for the Cell is the programming effort. It has been suggested that implementing the code with NVIDIA's CUDA API on a GPU cluster can yield a factor-of-9 speedup over CPU processors.
They strongly suggested that GPU- and Cell-based image libraries should be made available in the future. We have already started work on parallel segmentation algorithms running on the NCSA TeraGrid clusters. We will migrate our segmentation program to the proposed system using the CPU processors first, and then test multi-core programming on the Cell processors.
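Segmentation parallelizes naturally because each pixel (or strip of pixels) can be classified independently. As a minimal sketch, assuming a simple binary threshold as the classifier (real segmentation pipelines are far richer), strips of rows are handed to concurrent workers, just as each SPE or GPU block would own one strip of the image:

```python
from concurrent.futures import ThreadPoolExecutor

def segment_rows(image, lo, hi, threshold):
    """Binary-threshold rows lo..hi; one strip per worker."""
    return [[1 if v >= threshold else 0 for v in row] for row in image[lo:hi]]

def parallel_threshold(image, threshold, workers=4):
    """Split the image into row strips, classify them concurrently,
    and stitch the strips back together in order."""
    n = len(image)
    step = (n + workers - 1) // workers
    strips = [(i, min(i + step, n)) for i in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda s: segment_rows(image, s[0], s[1], threshold),
                         strips)
    return [row for part in parts for row in part]

image = [[(x * y) % 10 for x in range(8)] for y in range(8)]
mask = parallel_threshold(image, 5)
print(sum(sum(row) for row in mask))  # number of foreground pixels
```

Because there is no data dependence between strips, this pattern scales with the number of cores until memory bandwidth becomes the limit, which is exactly the regime the Cell's DMA-fed SPEs were designed for.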
Bioinformatics on Cell/B.E. processors. Recently, Sachdeva et al. (2008) systematically evaluated the performance of three popular bioinformatics applications (namely FASTA, ClustalW, and HMMer) on the Cell/B.E. Their preliminary results show that a Cell-based cluster is a promising, power-efficient platform for future bioinformatics.
Zola et al. (2009) constructed gene regulatory networks across multiple Cells, the multiple cores within each Cell, and the vector units within the cores to develop a high-performance implementation. They presented experimental results comparing the Cell implementation with a standard uniprocessor implementation and with an implementation on a conventional supercomputer. They concluded that a Cell cluster outperforms the BlueGene/L system: computation time with 64 SPE cores on the Cell cluster matches that with 128 PPC440 cores on BG/L, a factor-of-two performance gain.
Martin (2008), at the University of Aarhus, Denmark, evaluated the applicability of the Cell processor to phylogenetics and other computationally intensive problems. Martin concluded that the Cell processor has impressive performance and is an interesting alternative to mainstream processors such as x86.
However, the Cell architecture makes software development hard and time-consuming compared with mainstream architectures. Libraries and compilers that could ease software development are still under development. Although development tools such as the Cell SDK, debuggers, and optimization tools exist for Cell software development, extensive knowledge of the Cell architecture is still required.
Martin concluded that the Cell should be seen as a hybrid between an x86 processor and a GPU, and is therefore suited to problems that require the properties of both architectures. If a suitable problem can be found and there is ample time for software development, the Cell processor is worth considering; otherwise x86 processors are the better choice. GPUs become more useful with support for branching and wider floating-point calculations.
Regarding technical feasibility, applicability, and cost/performance, Buehrer and Parthasarathy (2007) conducted an NSF project to study the potential of the Cell/BE for data mining. They report that the Cell processor is up to 34 times more efficient than competing technologies in general.
However, for the major data mining algorithms, their preliminary investigation indicated that it is not quite ready to employ Cell technology for end-user applications, although it has great potential. Therefore, for this application, we will use CPUs only, while keeping an eye open for new solutions. We believe the genetic-linkage codes will soon be portable to GPU and Cell processors.
Fast Fourier Transform and Discrete Wavelet Transform on Cell/BE processors. The FFT is of primary importance and a fundamental kernel in medical imaging applications. David Bader investigated its performance on the Cell/BE, a heterogeneous multi-core chip architected for intensive gaming applications and high-performance computing (Bader and Agarwal, 2007).
After a careful design of their FFTC parallelism on the Cell and a partitioning of the workload among the SPEs, their performance results outperform the Intel Core Duo for inputs larger than 2K samples, yielding one of the fastest parallel FFT implementations. Although this development requires strong multi-core programming knowledge, it provides a high-end FFT methodology for future medical imaging wherever Fourier transforms are used.
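The structure being partitioned in such designs is the classic Cooley-Tukey recursion: each stage combines half-size transforms with twiddle-factor "butterflies", and the butterflies within a stage are independent, which is what gets distributed across SPEs. A minimal radix-2 sketch (not Bader's FFTC, just the textbook recursion it builds on):

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; input length must be a power of two.
    The per-stage butterflies below are the units a Cell port partitions
    across SPEs."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    tw = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return [even[k] + tw[k] for k in range(n // 2)] + \
           [even[k] - tw[k] for k in range(n // 2)]

# A pure complex tone puts all of its energy into one frequency bin.
N = 8
signal = [cmath.exp(2j * cmath.pi * 2 * t / N) for t in range(N)]
spectrum = fft(signal)
print(max(range(N), key=lambda k: abs(spectrum[k])))  # dominant bin: 2
```

Production FFTs are iterative and cache/DMA-blocked rather than recursive, but the dataflow is the same.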
In addition to Bader's contribution to the FFT on Cell processors, his group also studied the Discrete Wavelet Transform (DWT) in JPEG2000. They achieved 34- and 56-fold speedups on the Cell/BE chip over the baseline code for the lossless and lossy transforms respectively, compared with AMD Barcelona (quad-core Opteron) processors (Bader et al., 2009).
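The "lossless" property of the reversible DWT comes from integer lifting: every step is an exact integer operation with an exact integer inverse. JPEG2000's reversible transform uses the 5/3 lifting filter; the sketch below uses the even simpler integer Haar (S-transform) purely to show the lossless-lifting idea, and is not the JPEG2000 filter itself.

```python
def haar_fwd(x):
    """One level of the integer Haar (S) transform: floor-averages (lowpass)
    and differences (highpass). Exactly invertible over the integers."""
    s = [(a + b) >> 1 for a, b in zip(x[0::2], x[1::2])]   # lowpass
    d = [a - b for a, b in zip(x[0::2], x[1::2])]          # highpass
    return s, d

def haar_inv(s, d):
    """Exact inverse: recover a = s + ceil(d / 2), then b = a - d."""
    out = []
    for si, di in zip(s, d):
        a = si + ((di + 1) >> 1)
        out.extend([a, a - di])
    return out

x = [12, 10, 7, 7, 3, 9, 200, 0]
s, d = haar_fwd(x)
print(haar_inv(s, d) == x)  # lossless round trip: True
```

Because each lifting step touches only a fixed-size neighborhood, rows and columns can be streamed through the SPEs in blocks, which is what makes the DWT such a good Cell workload.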
Dr. Bader's research team at the Georgia Institute of Technology joins the workforce in the STI Center of Competence for the Cell/BE, sponsored by Sony, Toshiba, and IBM. They provide useful resources to the community using multi-core and Cell processors for scientific computing. Their CellBuzz project is a great resource that allows Cell developers to obtain first-hand training and warm up their skills in Cell/BE programming and application development.
Bader's contributions to the FFT and wavelet transforms on Cell processors are excellent research. The PI has already applied for an account on Bader's lab's Cell cluster and begun collaborating to learn how to conduct these transforms on Cell blades, so that they can be implemented quickly on the proposed system for users' production work.
Isosurface Extraction for Medical Volume Datasets on Cell/BE. The sizes of volumetric datasets generated by medical imaging and scientific simulations increase exponentially. Medical volumetric data can be visualized by many graphics algorithms.
For example, one popular algorithm is the Marching Cubes (MC) algorithm. It can efficiently handle isosurface extraction for clinical use, but it requires intensive computation. The Cell/BE processor, with its eight SPE cores, can handle extremely demanding computations over large data streams. A research group in China led by Dr. Hai Jin developed a streaming-based model and scheme that efficiently maps the MC algorithm onto Cell processors (Jin, 2009).
They introduced a block filter running on the PPE as a preprocessing stage to avoid unnecessary data transfer and computation, and implemented the MC kernels on the SPEs as the subsequent stage.
By tuning the block size, the workload between the PPE and the SPEs is orchestrated. Their experimental results demonstrate that the overall isosurface extraction achieves a speedup of more than 10 times compared with conventional CPUs. They will soon apply this technique to medical imaging research and even clinical use.
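The PPE-side block filter is essentially empty-space skipping: compute the min/max of each block and forward to the SPEs only the blocks whose value range straddles the isovalue, since only those can contain surface geometry. A minimal sketch of that filtering stage (the block size and the synthetic ramp volume are illustrative assumptions):

```python
def active_blocks(volume, block, isovalue):
    """PPE-style block filter: keep only blocks whose [min, max] range
    contains the isovalue; only these would be streamed to the SPEs
    for marching-cubes extraction."""
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    kept = []
    for z0 in range(0, nz, block):
        for y0 in range(0, ny, block):
            for x0 in range(0, nx, block):
                vals = [volume[z][y][x]
                        for z in range(z0, min(z0 + block, nz))
                        for y in range(y0, min(y0 + block, ny))
                        for x in range(x0, min(x0 + block, nx))]
                if min(vals) <= isovalue <= max(vals):
                    kept.append((z0, y0, x0))
    return kept

# Synthetic 8^3 volume: a ramp along x, so the isosurface is one plane.
vol = [[[x for x in range(8)] for _ in range(8)] for _ in range(8)]
blocks = active_blocks(vol, block=4, isovalue=2.5)
print(len(blocks))  # only the 4 blocks in the x < 4 half straddle 2.5
```

Tuning `block` trades PPE filtering work against SPE transfer volume, which is the balance the cited scheme orchestrates.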
Although we have already parallelized the MC algorithm, we will first install it on the system using the CPU processors. We may soon experiment with the MC algorithm on the Cell processors to verify Jin's approach. It is also important to compare the results on Cell processors with those on GPU processors. The critical issue is how soon this 3D graphic visualization can be applied to clinical use.
Graphic ray-tracing on GPU/Cell/BE. Devices such as MRI scanners bring complex 3D imaging to the desktop. The importance of volume rendering has increased as data volumes grow with the widespread use of 3D imaging devices such as CT scanners, 3D laser scanners, and MRI equipment. Ray casting, recognized as one of the best techniques for image quality, has been limited to modest data sizes because of its slowness.
Dr. Jusub Kim of the University of Maryland, College Park (USA) completed his Ph.D. thesis on interactive rendering of volumetric data on multi-core CPUs and programmable GPUs.
After working on volume ray casting with NVIDIA CUDA GPUs, he deployed his ray-tracing algorithm on the Cell/BE architecture, which provides the opportunity to finally put ray casting to practical use on the desktop computers of scientists and engineers. He presented a new volume ray-casting algorithm designed to take full advantage of the Cell/BE.
His research showed that the Cell/BE is the main enabling technology for providing the finest-image-quality volume rendering at practical data sizes. Experimental results also showed that one can interactively render 256 × 256 × 256 data onto a 256 × 256 image at 15 frames/s with one Cell/B.E. processor, about 100 times faster than the same implementation on a 3 GHz Intel Xeon. This is a very important application in molecular imaging. Although the major users may not use this function in their research, it is worth experimenting with on Cell processors; a minor effort may be made to investigate this potential application.
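The parallelism that makes ray casting a natural Cell/GPU workload is per-ray independence: every output pixel corresponds to one ray marching through the volume, with no communication between rays. A deliberately minimal sketch, using an orthographic maximum-intensity projection (MIP) rather than Kim's full compositing pipeline:

```python
def mip_render(volume):
    """Orthographic maximum-intensity projection: one ray per (y, x) pixel
    marching along z. Each ray is independent, which is exactly the
    parallelism a Cell/GPU ray caster exploits."""
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    return [[max(volume[z][y][x] for z in range(nz))
             for x in range(nx)] for y in range(ny)]

# 4x4x4 volume with one bright voxel buried at z = 2.
vol = [[[0] * 4 for _ in range(4)] for _ in range(4)]
vol[2][1][3] = 9
image = mip_render(vol)
print(image[1][3])  # the bright voxel survives the projection: 9
```

A production ray caster adds perspective rays, trilinear sampling, and front-to-back compositing, but the pixel-parallel structure is unchanged.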
Micro-fluid flow on Cell/BE processors. The study of micro-fluid flows, whether in pulmonary airways or in blood vessels, has become an important research field in biophysics and biomedical engineering. Stürmer et al. (2007) deployed the Lattice-Boltzmann Method (LBM) on a single Cell/BE system. Their results show that the Cell with 6 SPEs gives the best performance, compared with conventional CPU and SMP nodes.
They obtained consistent results on an HP DL140 G3, a PS3, and an IBM QS20, respectively. This is an interesting topic: if the Cell processor is suitable for the LBM, computations on Cell processors will bring significant benefit to dynamic micro-fluid simulations such as blood perfusion, vessel flows, and pulmonary airflow.
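The LBM's appeal on the Cell is that each time step is a local collision followed by a nearest-neighbor streaming, so the lattice can be tiled across SPEs with only thin halo exchanges. A minimal D2Q9 sketch of one BGK collision-plus-streaming step on a periodic lattice (the lattice size and relaxation time are illustrative; the cited work uses a 3D model):

```python
W = [4/9] + [1/9] * 4 + [1/36] * 4
E = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
     (1, 1), (-1, 1), (-1, -1), (1, -1)]

def step(f, n, tau=0.8):
    """One BGK collision + periodic streaming step on an n x n lattice.
    f[i][y][x] is the population moving with lattice velocity E[i]."""
    rho = [[sum(f[i][y][x] for i in range(9)) for x in range(n)] for y in range(n)]
    ux = [[sum(f[i][y][x] * E[i][0] for i in range(9)) / rho[y][x]
           for x in range(n)] for y in range(n)]
    uy = [[sum(f[i][y][x] * E[i][1] for i in range(9)) / rho[y][x]
           for x in range(n)] for y in range(n)]
    out = [[[0.0] * n for _ in range(n)] for _ in range(9)]
    for i, (ex, ey) in enumerate(E):
        for y in range(n):
            for x in range(n):
                eu = ex * ux[y][x] + ey * uy[y][x]
                u2 = ux[y][x] ** 2 + uy[y][x] ** 2
                feq = W[i] * rho[y][x] * (1 + 3 * eu + 4.5 * eu * eu - 1.5 * u2)
                post = f[i][y][x] + (feq - f[i][y][x]) / tau   # relax to feq
                out[i][(y + ey) % n][(x + ex) % n] = post       # stream
    return out

n = 6  # tiny lattice with a small density bump at (2, 2)
f = [[[W[i] * (1.2 if (x, y) == (2, 2) else 1.0) for x in range(n)]
      for y in range(n)] for i in range(9)]
mass0 = sum(f[i][y][x] for i in range(9) for y in range(n) for x in range(n))
f = step(f, n)
mass1 = sum(f[i][y][x] for i in range(9) for y in range(n) for x in range(n))
print(abs(mass1 - mass0) < 1e-9)  # BGK collision conserves mass: True
```

Because the collision is purely local and the streaming touches only immediate neighbors, the per-site arithmetic vectorizes well on the SPEs, which is why the LBM is repeatedly cited as a Cell-friendly kernel.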
Figure: Isosurface extraction on the Cell/BE (a streaming-data block filter on the PPE; partition classification, surface formation, vertex interpolation, and the computation kernels multithreaded on the SPEs). The MRI skull-and-trunk dataset is displayed with two different isovalues (1250 for skeleton, 800 for skin), together with the execution time on the Cell/BE versus a regular dual-core CPU for each number of SPEs.
Figure: (a) Registration.
Figure: Lattice-Boltzmann Method for blood-vessel flows on the Cell processor: (a) 3D vessel-tree formation; (b) lattice with 1 central and 18 neighboring points; (c) performance comparison between the Core 2 Duo and the Cell processor.
For the medical informatics group, the jobs are loosely coupled, and it is relatively easy to configure a system to their satisfaction. Each job can be queued through job-scheduling tools such as Sun's Grid Engine.
In order to meet user needs, we design two major computing modules. One is for intensive computation in image processing (such as parallel registration, segmentation, reconstruction, enhancement, and background removal) and medical informatics (such as statistical data analysis; image, biological, and genetic data mining; etc.).
The other is for graphic computing, such as isosurface extraction, 3D object rendering, data visualization, object animation, and other graphics processing. The system should have fast network connections to the end-user systems and to the existing research database systems and PACS.
The master node handles management functions, while the computing nodes are dedicated to intensive computation or graphic computing. The balance between the two depends not only on the processors' instruction sets but also on the precision required: most of the computations need double-precision floating point, while graphic computing can often use single precision or even integer arithmetic.
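The precision trade-off can be illustrated by round-tripping a value through 32-bit storage with Python's struct module (a quick demonstration of single- versus double-precision rounding, not part of the proposed system):

```python
import struct

def to_f32(x):
    """Round a Python double to the nearest IEEE-754 single-precision value."""
    return struct.unpack('f', struct.pack('f', x))[0]

x = 0.1
err = abs(to_f32(x) - x)
print(err > 0)       # single precision loses bits relative to double: True
print(err < 1e-7)    # ...but the error is far below what graphics needs: True
```

This is why graphics kernels can run comfortably on single-precision hardware such as the SPEs' fast single-precision pipelines, while the scientific computations above demand double precision.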
Multi-core Cell technology is a high-end microprocessor technology. The Cell combines a general-purpose core of modest performance with streamlined co-processing elements that greatly accelerate graphics and vector processing applications, as well as many other forms of dedicated computation.
The Cell-based microprocessor is designed to bridge the gap between conventional desktop processors (such as the Athlon 64 and Core 2 families) and more specialized high-performance processors, such as the NVIDIA and ATI graphics processors (GPUs).
It is designed for current and future digital distribution systems, and is suited to digital imaging systems (medical, scientific, etc.) and scientific simulations. The major challenge to the scientific computing community is how to invest in a future computing system that enhances computing power for today's computations while remaining highly sustainable and scalable for future use.
This question also applies to the communities of medical imaging and biomedical informatics.
Cluster options: multi-core multiprocessors and development environments (the mapping below reconstructs the flattened table as best the extraction allows):

- Linux cluster (dual-/quad-core multiprocessors): MPI, multi-core threading, pThreads
- GPGPU cluster: MPI, multi-core threading; CUDA Toolkit and SDK; AMD's GPU tools and libraries; OpenCL tools for graphics
- Cell/BE cluster: MPI, multi-core threading; IBM's Cell SDK; Mercury's MultiCore Plus SDK
- Mercury cluster: MPI, multi-core threading; IBM's Cell SDK; Mercury's MultiCore Plus SDK
- Hybrid CPU-GPU cluster: MPI, multi-core threading; CUDA Toolkit and SDK; AMD's GPU tools and libraries
Thanks
Questions?