GPU Computing

tools and libraries

K. Fatahalian, J. Sugerman, and P. Hanrahan, “Understanding the efficiency of GPU algorithms for matrix-matrix multiplication,” in HWWS ’04: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, 2004, pp. 133–137.

J. Paul Walters, V. Balu, S. Kompalli, and V. Chaudhary, “Evaluating the use of GPUs in liver image segmentation and HMMER database searches,” Parallel and Distributed Processing Symposium, International, vol. 0, pp. 1–12, 2009.

C. Liu, “cuHMM: a CUDA Implementation of Hidden Markov Model Training and Classification.”

N. Ganesan, R. D. Chamberlain, J. Buhler, and M. Taufer, “Accelerating HMMER on GPUs by Implementing Hybrid Data and Task Parallelism,” in International Conference on Bioinformatics and Computational Biology (ACM-BCB), 2010.

Z. Du, Z. Yin, and D. A. Bader, “A Tile-based Parallel Viterbi Algorithm for Biological Sequence Alignment on GPU with CUDA,” in Ninth IEEE International Workshop on High Performance Computational Biology (HiCOMB), 2010.

F. Ries, T. De Marco, M. Zivieri, and R. Guerrieri, “Triangular matrix inversion on Graphics Processing Unit,” in SC ’09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2009, pp. 1–10.

N. Arora, A. Shringarpure, and R. W. Vuduc, “Direct N-body Kernels for Multicore Platforms,” in ICPP ’09: Proceedings of the 2009 International Conference on Parallel Processing, 2009, pp. 379–387.

K. Fujiwara and N. Nakasato, “Fast Simulations of Gravitational Many-body Problem on RV770 GPU,” 2009.

P. R. Dixon, T. Oonishi, and S. Furui, “Harnessing graphics processors for the fast computation of acoustic likelihoods in speech recognition,” Comput. Speech Lang., vol. 23, no. 4, pp. 510–526, 2009.

D. Göddeke and R. Strzodka, “Cyclic Reduction Tridiagonal Solvers on GPUs Applied to Mixed Precision Multigrid,” IEEE Transactions on Parallel and Distributed Systems, Mar. 2010.

M. Geveler, D. Ribbrock, D. Göddeke, and S. Turek, “Lattice-Boltzmann Simulation of the Shallow-Water Equations with Fluid-Structure Interaction on Multi- and Manycore Processors,” in Lecture Notes in Computer Science, vol. 6310, R. Keller, D. Kramer, and J.-P. Weiß, Eds. Springer, 2010, pp. 92–104.

V. Demchik, “Pseudo-random number generators for Monte Carlo simulations on Graphics Processing Units,” 2010.

J. Kong, M. Dimitrov, Y. Yang, J. Liyanage, L. Cao, J. Staples, M. Mantor, and H. Zhou, “Accelerating MATLAB Image Processing Toolbox functions on GPUs,” in GPGPU, 2010, vol. 425, pp. 75–85.

R. Capuzzo-Dolcetta, A. Mastrobuono-Battisti, and D. Maschietti, “NBSymple, a double parallel, symplectic N-body code running on Graphic Processing Units,” 2010.

A. R. Brodtkorb, T. R. Hagen, K.-A. Lie, and J. R. Natvig, “Simulation and Visualization of the Saint-Venant System using GPUs.”

F. Zafar, A. Curtis, and M. Olano, “GPU Random Numbers via the Tiny Encryption Algorithm,” in Proceedings of the ACM SIGGRAPH/Eurographics Symposium on High Performance Graphics (HPG), 2010.

D. Göddeke, “Fast and Accurate Finite-Element Multigrid Solvers for PDE Simulations on GPU Clusters,” 2010.

D. Komatitsch, G. Erlebacher, D. Göddeke, and D. Michéa, “High-order finite-element seismic wave propagation modeling with MPI on a large GPU cluster,” 2010.

G. Vasiliadis, M. Polychronakis, S. Antonatos, E. P. Markatos, and S. Ioannidis, Regular Expression Matching on Graphics Hardware for Intrusion Detection. 2009.

G. Vasiliadis and S. Ioannidis, “GrAVity: A Massively Parallel Antivirus Engine,” in 13th International Symposium On Recent Advances In Intrusion Detection (RAID), 2010.

B. Block, P. Virnau, and T. Preis, “Multi-GPU accelerated multi-spin Monte Carlo simulations of the 2D Ising model,” Computer Physics Communications, vol. 181, no. 9, pp. 1549–1556, Sep. 2010.

C. Dick, J. Georgii, and R. Westermann, “A Real-Time Multigrid Finite Hexahedra Method for Elasticity Simulation using CUDA,” Computer Graphics and Visualization Group, Technische Universität München, Germany, Jul. 2010.

M. Mehri Dehnavi, D. Fernandez, and D. Giannacopoulos, “Finite-Element Sparse Matrix Vector Multiplication on Graphic Processing Units,” 2010.

A. Moore and A. C. Quillen, “QYMSYM: A GPU-Accelerated Hybrid Symplectic Integrator That Permits Close Encounters.”

P. Emeliyanenko, “A complete modular resultant algorithm targeted for realization on graphics hardware,” in PASCO ’10: Proceedings of the 4th International Workshop on Parallel and Symbolic Computation, 2010, pp. 35–43.

P. B. Volk, D. Habich, and W. Lehner, “GPU-Based Speculative Query Processing for Database Operations,” in First International Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures (ADMS), 2010.

W. Fang, B. He, and Q. Luo, “Database Compression on Graphics Processors,” in PVLDB/VLDB, 2010.

S. Han, K. Jang, K. Park, and S. Moon, “PacketShader: a GPU-accelerated software router,” in SIGCOMM ’10: Proceedings of the ACM SIGCOMM 2010 conference on SIGCOMM, 2010, pp. 195–206.

A. Stivala, P. Stuckey, and A. Wirth, “Fast and accurate protein substructure searching with simulated annealing and GPUs,” 2010.

O. Roy, I. Jovanovic, and R. Parhizkar, “WaveTomography: 2D time-domain waveform tomography reconstruction algorithm.”

J. Singh and I. Aruni, “GPU computing for R Statistical Environment.”

T. Kocak and N. Hinitt, “Exploiting the Power of GPUs for Multi-gigabit Wireless Baseband Processing,” in ISPDC ’10: Proceedings of the 2010 Ninth International Symposium on Parallel and Distributed Computing, 2010, pp. 56–62.

U. Erra, B. Frola, and V. Scarano, BehaveRT: A GPU-Based Library for Autonomous Characters. 2010.

G. Vasiliadis, M. Polychronakis, and S. Ioannidis, “GPU-Assisted Malware.”

A. Sen, B. Aksanli, M. Bozkurt, and M. Mert, “Parallel Cycle Based Logic Simulation Using Graphics Processing Units,” in ISPDC ’10: Proceedings of the 2010 Ninth International Symposium on Parallel and Distributed Computing, 2010, pp. 71–78.

NVIDIA, “CUDA Show Case.”

M. Tang and D. Manocha et al., “Fast GPU-based Collision Detection for Deformable Models,” in Proceedings of ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (i3D 2011), 2010.

G. G. and R. Montella et al., “A GPGPU transparent virtualization component for high performance computing clouds,” in Euro-Par 2010 – Parallel Processing.

N. Nakasato, “A Fast GEMM Implementation on a Cypress GPU,” in 1st International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS 10).

J. E. Stone and D. J. Hardy et al., “GPU-Accelerated Molecular Modeling Coming Of Age,” 2010.

P. Gwosdek and H. Zimmer et al., “A Highly Efficient GPU Implementation for Variational Optic Flow Based on the Euler-Lagrange Framework,” in Proceedings of the ECCV Workshop for Computer Vision with GPUs, 2010.

Nexiwave.com, “Nexiwave.com and UbiCast Partner to Offer GPU-Accelerated Deep Audio Search.” 2010.

A. R. Brodtkorb, “PhD Thesis: Scientific Computing on Heterogeneous Architectures,” University of Oslo, 2010.

V. Pham and P. Vo et al., “GPU Implementation of Extended Gaussian Mixture Model for Background Subtraction,” in IEEE International Conference on Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF), 2010.

IS&T/SPIE, “GPGPU papers from Parallel Processing for Imaging Applications conference,” 2011.

E. E. Ferrero and J. Pablo De Francesco et al., “q-state Potts model metastability study using optimized GPU-based Monte Carlo algorithms,” 2011.

L. Cheng and M. Gong et al., “Real-time Discriminative Background Subtraction,” in IEEE Transactions on Image Processing, 2011.

J. Sung Yoon and W.-H. Jung, “A GPU-accelerated bioinformatics application for large-scale protein networks,” in Asia Pacific Bioinformatics Conference, 2011.

A. Dziekonski and A. Lamecki et al., “GPU Acceleration of Multilevel Solvers for Analysis of Microwave Components With Finite Element Method,” in IEEE Microwave and Wireless Components Letters, 2011.

J. Singh and I. Aruni, “Accelerating Power Flow studies on Graphics Processing Unit,” in Annual IEEE India Conference (INDICON), 2010, pp. 1–5.

D. Rossinelli, B. Hejazialhosseini, D. G. Spampinato, and P. Koumoutsakos, “Multicore/Multi-GPU Accelerated Simulations of Multiphase Compressible Flows Using Wavelet Adapted Grids,” SIAM Journal on Scientific Computing, vol. 33, no. 2, pp. 512–540, 2011.

L. Gelisio, C. L. Azanza Ricardo, M. Leoni, and P. Scardi, “Real-space calculation of powder diffraction patterns on graphics processing units,” Journal of Applied Crystallography, vol. 43, no. 3, pp. 647–653, 2010.

A. Dziekonski, A. Lamecki, and M. Mrozowski, “A memory efficient and fast sparse matrix vector product on a GPU,” Progress In Electromagnetics Research, vol. 116, pp. 49–63, 2011.

D. Rossinelli, C. Conti, and P. Koumoutsakos, “Mesh–particle interpolations on graphics processing units and multicore central processing units,” Phil. Trans. R. Soc. A, vol. 369, no. 1944, pp. 2164–2175, 2011.

G. Pratx and L. Xing, “GPU computing in medical physics: A review,” Medical Physics, vol. 38, no. 5, 2011.

C. Nugteren, G.-J. van den Braak, H. Corporaal, and B. Mesman, “High performance predictable histogramming on GPUs: exploring and evaluating algorithm trade-offs,” in Fourth Workshop on General Purpose Processing on Graphics Processing Units (GPGPU-4).

J. C. Bowden, “Application of the OpenCL API for Implementation of the NIPALS Algorithm for Principal Component Analysis of Large Data Sets,” in 2010 Sixth IEEE International Conference on e-Science Workshops, 2010, pp. 25–30.

N. Satish, M. Harris, and M. Garland, “Designing efficient sorting algorithms for manycore GPUs,” in 2009 IEEE International Symposium on Parallel & Distributed Processing, 2009, pp. 1–10.

M. Andrecut, “Parallel GPU Implementation of Iterative PCA Algorithms,” arXiv, no. 0811.1081, Nov. 2008.

readings

J. Nickolls, I. Buck, M. Garland, and K. Skadron, “Scalable Parallel Programming with CUDA,” Queue, vol. 6, no. 2, pp. 40–53, 2008.

S. Lee, S.-J. Min, and R. Eigenmann, “OpenMP to GPGPU: a compiler framework for automatic translation and optimization,” in PPoPP ’09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, 2009, pp. 101–110.

M. Harris, “Mapping computational concepts to GPUs,” 2005.

Z. Fan, F. Qiu, A. Kaufman, and S. Yoakum-Stover, “GPU Cluster for High Performance Computing,” in SC ’04: Proceedings of the 2004 ACM/IEEE conference on Supercomputing, 2004.

I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan, “Brook for GPUs: stream computing on graphics hardware,” 2004, pp. 777–786.

J. A. Stratton, S. S. Stone, and W. W. Hwu, “MCUDA: An Efficient Implementation of CUDA Kernels for Multi-Core CPUs,” Proceedings of the 21st International Workshop on Languages and Compilers for Parallel Computing, Aug. 2008.

C. Lamb, OpenCL for NVIDIA GPUs. 2009.

B. Catanzaro, “OpenCL™ Optimization Case Study: Diagonal Sparse Matrix Vector Multiplication.” 05-Oct-2010.

Q. Hou, K. Zhou, and B. Guo, “BSGP: bulk-synchronous GPU programming,” in SIGGRAPH ’08: ACM SIGGRAPH 2008 papers, 2008, pp. 1–12.

J. Sugerman, K. Fatahalian, S. Boulos, K. Akeley, and P. Hanrahan, “GRAMPS: A programming model for graphics pipelines,” ACM Trans. Graph., vol. 28, no. 1, pp. 1–11, 2009.

S.-Z. Ueng, M. Lathara, S. S. Baghsorkhi, and W.-M. W. Hwu, “CUDA-Lite: Reducing GPU Programming Complexity,” in 21st International Workshop on Languages and Compilers for Parallel Computing (LCPC), 2008, pp. 1–15.

Advanced Micro Devices, ATI Stream Computing OpenCL Programming Guide - Version 1.03. 2010.

K. Fatahalian and M. Houston, “GPUs: A Closer Look,” ACM Queue, vol. 6, no. 2, pp. 18–28, 2008.

A. Kerr, G. Diamos, and S. Yalamanchili, “A characterization and analysis of PTX kernels,” IEEE Workload Characterization Symposium, vol. 0, pp. 3–12, 2009.

A. Kerr, G. Diamos, and S. Yalamanchili, “Modeling GPU-CPU workloads and systems,” in GPGPU ’10: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, 2010, pp. 31–42.

GE Intelligent Platforms, “Many-Core Processors Report Ready for Duty,” 2010.

F. Feinbube, “GPU Readings List,” 2010. [Online]. Available: http://www.dcl.hpi.uni-potsdam.de/research/gpureadings/.

R. Bordawekar, U. Bondhugula, and R. Rao, “Believe it or Not! Multi-core CPUs Can Match GPU Performance for FLOP-intensive Application!,” Apr. 2010.

A. Rigland Brodtkorb, C. Dyken, T. Runar Hagen, J. M. Hjelmervik, and O. O. Storaasli, “State-of-the-art in heterogeneous computing,” Scientific Programming, vol. 18, no. 1, pp. 1–33, 2010.

R. Garg and J. N. Amaral, “Compiling Python to a hybrid execution environment,” in GPGPU ’10: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, 2010, pp. 19–30.

A. Kerr, G. Diamos, and S. Yalamanchili, “Modelling GPU-CPU Workloads and Systems,” in Third Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU-3), 2010.

I. Gelado, J. E. Stone, J. Cabezas, S. Patel, N. Navarro, and W. W. Hwu, “An asymmetric distributed shared memory model for heterogeneous parallel systems,” SIGARCH Comput. Archit. News, vol. 38, no. 1, pp. 347–358, 2010.

K. Asanovic, R. Bodik, B. Christopher Catanzaro, J. James Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. Lester Plishker, J. Shalf, S. Webb Williams, and K. A. Yelick, “The Landscape of Parallel Computing Research: A View from Berkeley,” Electrical Engineering and Computer Sciences, Dec. 2006.

W. Hwu, S. Ryoo, S.-Z. Ueng, J. H. Kelm, I. Gelado, S. S. Stone, R. E. Kidd, S. S. Baghsorkhi, A. A. Mahesri, S. C. Tsao, N. Navarro, S. S. Lumetta, M. I. Frank, and S. J. Patel, “Implicitly parallel programming models for thousand-core microprocessors,” in DAC ’07: Proceedings of the 44th annual Design Automation Conference, 2007, pp. 754–759.

D. H. Woo and H.-H. S. Lee, “COMPASS: a programmable data prefetcher using idle GPU shaders,” in ASPLOS ’10: Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems, 2010, pp. 297–310.

S. Ryoo, C. I. Rodrigues, S. S. Stone, S. S. Baghsorkhi, S.-Z. Ueng, J. A. Stratton, and W. W. Hwu, “Program optimization space pruning for a multithreaded gpu,” in CGO ’08: Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization, 2008, pp. 195–204.

NVIDIA, Optimizing CUDA. 2009.

V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, and P. Dubey, “Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU,” in ISCA ’10: Proceedings of the 37th annual international symposium on Computer architecture, 2010, pp. 451–460.

D. Behr, AMD GPU Architecture: OpenCL™ Tutorial, PPAM 2009. 2009.

T. Mattson, The Future of Many Core Computing: Software for many core processors. 2010.

S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W. W. Hwu, “Optimization principles and application performance evaluation of a multithreaded GPU using CUDA,” in PPoPP ’08: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, 2008, pp. 73–82.

W. Mark, “Future Graphics Architectures,” Queue, vol. 6, no. 2, pp. 54–64, 2008.

J. Sanders and E. Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming, 1st ed. Addison-Wesley Professional, 2010.

F. Feinbube, P. Tröger, and A. Polze, “Joint Forces: From Multithreaded Programming to GPU Computing,” IEEE Software, vol. 28, no. 1, pp. 51–57, Oct. 2010.

J. H. Kelm, D. R. Johnson, M. R. Johnson, N. C. Crago, W. Tuohy, A. Mahesri, S. S. Lumetta, M. I. Frank, and S. J. Patel, “Rigel: an architecture and scalable programming interface for a 1000-core accelerator,” SIGARCH Comput. Archit. News, vol. 37, no. 3, pp. 140–151, 2009.

C.-K. Luk, S. Hong, and H. Kim, “Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping,” in MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, 2009, pp. 45–55.

F. Feinbube, B. Rabe, M. von Löwis, and A. Polze, “NQueens on CUDA: Optimization Issues,” in 2010 Ninth International Symposium on Parallel and Distributed Computing, 2010, pp. 63–70.

NVIDIA, OpenCL Programming for the CUDA Architecture - Version 2.3. 2009.

NVIDIA, NVIDIA CUDA C Programming Guide 3.2. NVIDIA Corporation, 2010.

D. B. Kirk and W. W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, 1st ed. Morgan Kaufmann, 2010.

S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, and K. Skadron, “A performance study of general-purpose applications on graphics processors using CUDA,” Journal of Parallel and Distributed Computing, vol. 68, no. 10, 2008.

A. Munshi, Ed., The OpenCL Specification - Version 1.1. The Khronos Group Inc., 2010.

NVIDIA, NVIDIA OpenCL Best Practices Guide - Version 2.3. 2009.

MAVERICK PR, “PEER 1 Hosting: Large-Scale Hosted NVIDIA GPU Cloud,” 2011.

UK GPU Computing Conference, Presentations from the 2nd UK GPU Computing Conference. 2011.

PASI, Pan-American Advanced Studies Institutes: Materials now online. 2011.

top500.org, “3 of the 5 fastest supercomputers in the world use GPUs,” 2010.

M. Garland and D. B. Kirk, “Understanding throughput-oriented architectures,” in Communications of the ACM 53, 2010.

R. Bordawekar and U. Bondhugula et al., “Can CPUs Match GPUs on Performance with Productivity?: Experiences with Optimizing a FLOP-intensive Application on CPUs and GPU,” in Technical Report, IBM T. J. Watson Research Center, 2010.

NVIDIA, “NVIDIA Tesla GPUs Power World’s Fastest Supercomputer,” 2010.

NVIDIA, GPU Technology Conference Session Archive Available. 2010.

S. J. Pennycook and S. D. Hammond et al., “Performance Analysis of a Hybrid MPI/CUDA Implementation of the NAS-LU Benchmark,” in 1st International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS 10), 2010.

SciComp Inc., “SciComp Speeds Derivatives Performance with Support for New NVIDIA Hardware and Software.” 2010.

HipHaC’11, CfP: New Frontiers in High-performance and Hardware-aware Computing.

Insilicos, “Insilicos Awarded NIH Grant Applying GPU Computing to Human Disease.”

P. Shivam, S. Babu, and J. S. Chase, “Learning Application Models for Utility Resource Planning,” in International Conference on Autonomic Computing (ICAC), 2006, pp. 255–264.

Amazon, “Amazon announces GPUs for Cloud Computing,” 2010.

J. A. Stratton, V. Grover, J. Marathe, B. Aarts, M. Murphy, Z. Hu, and W. W. Hwu, “Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs,” in CGO ’10: Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization, 2010, pp. 111–119.

M. Herlihy and N. Shavit, The Art of Multiprocessor Programming. Morgan Kaufmann, 2008.

E. Karlson, Optimizing OpenCL for nVidia GPGPU. 2010.

NVIDIA, “CUDA Developer Zone,” 2011. [Online]. Available: http://developer.nvidia.com/category/zone/cuda-zone. [Accessed: 06-Oct-2011].

use cases