
Readings collection for GPU computing
J. Nickolls, I. Buck, M. Garland, and K. Skadron, “Scalable Parallel Programming with CUDA,” Queue, vol. 6, no. 2, pp. 40–53, 2008.
S. Lee, S.-J. Min, and R. Eigenmann, “OpenMP to GPGPU: a compiler framework for automatic translation and optimization,” in PPoPP ’09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, 2009, pp. 101–110.
M. Harris, “Mapping computational concepts to GPUs,” 2005.
Z. Fan, F. Qiu, A. Kaufman, and S. Yoakum-Stover, “GPU Cluster for High Performance Computing,” in SC ’04: Proceedings of the 2004 ACM/IEEE conference on Supercomputing, 2004.
I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan, “Brook for GPUs: stream computing on graphics hardware,” 2004, pp. 777–786.
J. A. Stratton, S. S. Stone, and W. W. Hwu, “MCUDA: An Efficient Implementation of CUDA Kernels for Multi-Core CPUs,” Proceedings of the 21st International Workshop on Languages and Compilers for Parallel Computing, Aug. 2008.
C. Lamb, OpenCL for NVIDIA GPUs. 2009.
B. Catanzaro, “OpenCL™ Optimization Case Study: Diagonal Sparse Matrix Vector Multiplication.” 05-Oct-2010.
Q. Hou, K. Zhou, and B. Guo, “BSGP: bulk-synchronous GPU programming,” in SIGGRAPH ’08: ACM SIGGRAPH 2008 papers, 2008, pp. 1–12.
J. Sugerman, K. Fatahalian, S. Boulos, K. Akeley, and P. Hanrahan, “GRAMPS: A programming model for graphics pipelines,” ACM Trans. Graph., vol. 28, no. 1, pp. 1–11, 2009.
S.-Z. Ueng, M. Lathara, S. S. Baghsorkhi, and W.-M. W. Hwu, “CUDA-Lite: Reducing GPU Programming Complexity,” in 21st International Workshop on Languages and Compilers for Parallel Computing (LCPC), 2008, pp. 1–15.
Advanced Micro Devices, ATI Stream Computing OpenCL Programming Guide - Version 1.03. 2010.
K. Fatahalian and M. Houston, “GPUs: A Closer Look,” ACM Queue, vol. 6, no. 2, pp. 18–28, 2008.
A. Kerr, G. Diamos, and S. Yalamanchili, “A characterization and analysis of PTX kernels,” IEEE Workload Characterization Symposium, vol. 0, pp. 3–12, 2009.
A. Kerr, G. Diamos, and S. Yalamanchili, “Modeling GPU-CPU workloads and systems,” in GPGPU ’10: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, 2010, pp. 31–42.
GE Intelligent Platforms, “Many-Core Processors Report Ready for Duty,” 2010.
F. Feinbube, “GPU Readings List,” 2010. [Online]. Available: http://www.dcl.hpi.unipotsdam.de/research/gpureadings/.
R. Bordawekar, U. Bondhugula, and R. Rao, “Believe it or Not! Multicore CPUs Can Match GPU Performance for FLOP-intensive Application!,” Apr. 2010.
A. Rigland Brodtkorb, C. Dyken, T. Runar Hagen, J. M. Hjelmervik, and O. O. Storaasli, “State-of-the-art in heterogeneous computing,” Scientific Programming, vol. 18, no. 1, pp. 1–33, 2010.
R. Garg and J. N. Amaral, “Compiling Python to a hybrid execution environment,” in GPGPU ’10: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, 2010, pp. 19–30.
A. Kerr, G. Diamos, and S. Yalamanchili, “Modelling GPU-CPU Workloads and Systems,” in Third Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU-3), 2010.
I. Gelado, J. E. Stone, J. Cabezas, S. Patel, N. Navarro, and W. W. Hwu, “An asymmetric distributed shared memory model for heterogeneous parallel systems,” SIGARCH Comput. Archit. News, vol. 38, no. 1, pp. 347–358, 2010.
K. Asanovic, R. Bodik, B. Christopher Catanzaro, J. James Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. Lester Plishker, J. Shalf, S. Webb Williams, and K. A. Yelick, “The Landscape of Parallel Computing Research: A View from Berkeley,” Electrical Engineering and Computer Sciences, Dec. 2006.
W. Hwu, S. Ryoo, S.-Z. Ueng, J. H. Kelm, I. Gelado, S. S. Stone, R. E. Kidd, S. S. Baghsorkhi, A. A. Mahesri, S. C. Tsao, N. Navarro, S. S. Lumetta, M. I. Frank, and S. J. Patel, “Implicitly parallel programming models for thousand-core microprocessors,” in DAC ’07: Proceedings of the 44th annual Design Automation Conference, 2007, pp. 754–759.
D. H. Woo and H.-H. S. Lee, “COMPASS: a programmable data prefetcher using idle GPU shaders,” in ASPLOS ’10: Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems, 2010, pp. 297–310.
S. Ryoo, C. I. Rodrigues, S. S. Stone, S. S. Baghsorkhi, S.-Z. Ueng, J. A. Stratton, and W. W. Hwu, “Program optimization space pruning for a multithreaded GPU,” in CGO ’08: Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization, 2008, pp. 195–204.
NVIDIA, Optimizing CUDA. 2009.
V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, and P. Dubey, “Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU,” in ISCA ’10: Proceedings of the 37th annual international symposium on Computer architecture, 2010, pp. 451–460.
D. Behr, AMD GPU Architecture: OpenCL™ Tutorial, PPAM 2009. 2009.
T. Mattson, The Future of Many Core Computing: Software for many core processors. 2010.
S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W. W. Hwu, “Optimization principles and application performance evaluation of a multithreaded GPU using CUDA,” in PPoPP ’08: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, 2008, pp. 73–82.
W. Mark, “Future Graphics Architectures,” Queue, vol. 6, no. 2, pp. 54–64, 2008.
J. Sanders and E. Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming, 1st ed. Addison-Wesley Professional, 2010.
F. Feinbube, P. Tröger, and A. Polze, “Joint Forces: From Multithreaded Programming to GPU Computing,” IEEE Software (Software), vol. 28, no. 1, pp. 51–57, Oct. 2010.
J. H. Kelm, D. R. Johnson, M. R. Johnson, N. C. Crago, W. Tuohy, A. Mahesri, S. S. Lumetta, M. I. Frank, and S. J. Patel, “Rigel: an architecture and scalable programming interface for a 1000-core accelerator,” SIGARCH Comput. Archit. News, vol. 37, no. 3, pp. 140–151, 2009.
C.-K. Luk, S. Hong, and H. Kim, “Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping,” in MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, 2009, pp. 45–55.
F. Feinbube, B. Rabe, M. von Löwis, and A. Polze, “NQueens on CUDA: Optimization Issues,” in 2010 Ninth International Symposium on Parallel and Distributed Computing, 2010, pp. 63–70.
NVIDIA, OpenCL Programming for the CUDA Architecture - Version 2.3. 2009.
Nvidia, NVIDIA CUDA C Programming Guide 3.2. NVIDIA Corporation, 2010.
D. B. Kirk and W. W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, 1st ed. Morgan Kaufmann, 2010.
S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, and K. Skadron, “A performance study of general-purpose applications on graphics processors using CUDA,” Journal of Parallel and Distributed Computing, vol. 68, no. 10, 2008.
A. Munshi, Ed., The OpenCL Specification - Version 1.1. The Khronos Group Inc., 2010.
NVIDIA, NVIDIA OpenCL Best Practices Guide - Version 2.3. 2009.
MAVERICK PR, “PEER 1 Hosting: Large-Scale Hosted NVIDIA GPU Cloud,” 2011.
UK GPU Computing Conference, Presentations from the 2nd UK GPU Computing Conference. 2011.
PASI, Pan-American Advanced Studies Institutes: Materials now online. 2011.
top500.org, “3 of the 5 fastest supercomputers in the world use GPUs,” 2010.
M. Garland and D. B. Kirk, “Understanding throughput-oriented architectures,” in Communications of the ACM 53, 2010.
R. Bordawekar and U. Bondhugula et al., “Can CPUs Match GPUs on Performance with Productivity?: Experiences with Optimizing a FLOP-intensive Application on CPUs and GPU,” in Technical Report, IBM T. J. Watson Research Center, 2010.
NVIDIA, “NVIDIA Tesla GPUs Power World’s Fastest Supercomputer,” 2010.
NVIDIA, GPU Technology Conference Session Archive Available. 2010.
S. J. Pennycook and S. D. Hammond et al., “Performance Analysis of a Hybrid MPI/CUDA Implementation of the NAS-LU Benchmark,” in 1st International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS 10), 2010.
SciComp Inc., “SciComp Speeds Derivatives Performance with Support for New NVIDIA Hardware and Software.” 2010.
HipHaC’11, CfP: New Frontiers in High-performance and Hardware-aware Computing.
Insilicos, “Insilicos Awarded NIH Grant Applying GPU Computing to Human Disease.”
P. Shivam, S. Babu, and J. S. Chase, “Learning Application Models for Utility Resource Planning,” in International Conference on Autonomic Computing (ICAC), 2006, pp. 255–264.
Amazon, “Amazon announces GPUs for Cloud Computing,” 2010.
J. A. Stratton, V. Grover, J. Marathe, B. Aarts, M. Murphy, Z. Hu, and W. W. Hwu, “Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs,” in CGO ’10: Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization, 2010, pp. 111–119.
M. Herlihy and N. Shavit, The Art of Multiprocessor Programming. Morgan Kaufmann, 2008.
E. Karlson, Optimizing OpenCL for nVidia GPGPU. 2010.
NVIDIA, “CUDA Developer Zone,” 2011. [Online]. Available: http://developer.nvidia.com/category/zone/cudazone. [Accessed: 06-Oct-2011].
GPU Compute Use Cases
K. Fatahalian, J. Sugerman, and P. Hanrahan, “Understanding the efficiency of GPU algorithms for matrix-matrix multiplication,” in HWWS ’04: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, 2004, pp. 133–137.
J. Paul Walters, V. Balu, S. Kompalli, and V. Chaudhary, “Evaluating the use of GPUs in liver image segmentation and HMMER database searches,” Parallel and Distributed Processing Symposium, International, vol. 0, pp. 1–12, 2009.
C. Liu, “cuHMM: a CUDA Implementation of Hidden Markov Model Training and Classification.”
N. Ganesan, R. D. Chamberlain, J. Buhler, and M. Taufer, “Accelerating HMMER on GPUs by Implementing Hybrid Data and Task Parallelism,” in International Conference on Bioinformatics and Computational Biology (ACM-BCB), 2010.
Z. Du, Z. Yin, and D. A. Bader, “A Tile-based Parallel Viterbi Algorithm for Biological Sequence Alignment on GPU with CUDA,” in Ninth IEEE International Workshop on High Performance Computational Biology (HiCOMB), 2010.
F. Ries, T. De Marco, M. Zivieri, and R. Guerrieri, “Triangular matrix inversion on Graphics Processing Unit,” in SC ’09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2009, pp. 1–10.
N. Arora, A. Shringarpure, and R. W. Vuduc, “Direct N-body Kernels for Multicore Platforms,” in ICPP ’09: Proceedings of the 2009 International Conference on Parallel Processing, 2009, pp. 379–387.
K. Fujiwara and N. Nakasato, “Fast Simulations of Gravitational Many-body Problem on RV770 GPU,” 2009.
P. R. Dixon, T. Oonishi, and S. Furui, “Harnessing graphics processors for the fast computation of acoustic likelihoods in speech recognition,” Comput. Speech Lang., vol. 23, no. 4, pp. 510–526, 2009.
D. Göddeke and R. Strzodka, “Cyclic Reduction Tridiagonal Solvers on GPUs Applied to Mixed Precision Multigrid,” IEEE Transactions on Parallel and Distributed Systems, Mar. 2010.
M. Geveler, D. Ribbrock, D. Göddeke, and S. Turek, “Lattice-Boltzmann Simulation of the Shallow-Water Equations with Fluid-Structure Interaction on Multi- and Manycore Processors,” in Lecture Notes in Computer Science, vol. 6310, R. Keller, D. Kramer, and J.-P. Weiß, Eds. Springer, 2010, pp. 92–104.
V. Demchik, “Pseudorandom number generators for Monte Carlo simulations on Graphics Processing Units,” 2010.
J. Kong, M. Dimitrov, Y. Yang, J. Liyanage, L. Cao, J. Staples, M. Mantor, and H. Zhou, “Accelerating MATLAB Image Processing Toolbox functions on GPUs,” in GPGPU, 2010, vol. 425, pp. 75–85.
R. Capuzzo-Dolcetta, A. Mastrobuono-Battisti, and D. Maschietti, “NBSymple, a double parallel, symplectic N-body code running on Graphic Processing Units,” 2010.
A. R. Brodtkorb, T. R. Hagen, K.-A. Lie, and J. R. Natvig, “Simulation and Visualization of the Saint-Venant System using GPUs.”
F. Zafar, A. Curtis, and M. Olano, “GPU Random Numbers via the Tiny Encryption Algorithm,” in Proceedings of the ACM SIGGRAPH/Eurographics Symposium on High Performance Graphics (HPG), 2010.
D. Göddeke, “Fast and Accurate Finite-Element Multigrid Solvers for PDE Simulations on GPU Clusters,” 2010.
D. Komatitsch, G. Erlebacher, D. Göddeke, and D. Michéa, “High-order finite-element seismic wave propagation modeling with MPI on a large GPU cluster,” 2010.
G. Vasiliadis, M. Polychronakis, S. Antonatos, E. P. Markatos, and S. Ioannidis, Regular Expression Matching on Graphics Hardware for Intrusion Detection. 2009.
G. Vasiliadis and S. Ioannidis, “GrAVity: A Massively Parallel Antivirus Engine,” in 13th International Symposium On Recent Advances In Intrusion Detection (RAID), 2010.
B. Block, P. Virnau, and T. Preis, “Multi-GPU accelerated multi-spin Monte Carlo simulations of the 2D Ising model,” Computer Physics Communications, vol. 181, no. 9, pp. 1549–1556, Sep. 2010.
C. Dick, J. Georgii, and R. Westermann, “A Real-Time Multigrid Finite Hexahedra Method for Elasticity Simulation using CUDA,” Computer Graphics and Visualization Group, Technische Universität München, Germany, Jul. 2010.
M. Mehri Dehnavi, D. Fernandez, and D. Giannacopoulos, “Finite-Element Sparse Matrix Vector Multiplication on Graphic Processing Units,” 2010.
A. Moore and A. C. Quillen, “QYMSYM: A GPU-Accelerated Hybrid Symplectic Integrator That Permits Close Encounters.”
P. Emeliyanenko, “A complete modular resultant algorithm targeted for realization on graphics hardware,” in PASCO ’10: Proceedings of the 4th International Workshop on Parallel and Symbolic Computation, 2010, pp. 35–43.
P. B. Volk, D. Habich, and W. Lehner, “GPU-Based Speculative Query Processing for Database Operations,” in First International Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures (ADMS), 2010.
W. Fang, B. He, and Q. Luo, “Database Compression on Graphics Processors,” in PVLDB/VLDB, 2010.
S. Han, K. Jang, K. Park, and S. Moon, “PacketShader: a GPU-accelerated software router,” in SIGCOMM ’10: Proceedings of the ACM SIGCOMM 2010 conference on SIGCOMM, 2010, pp. 195–206.
A. Stivala, P. Stuckey, and A. Wirth, “Fast and accurate protein substructure searching with simulated annealing and GPUs,” 2010.
O. Roy, I. Jovanovic, and R. Parhizkar, “WaveTomography: 2D time-domain waveform tomography reconstruction algorithm.”
J. Singh and I. Aruni, “GPU computing for R Statistical Environment.”
T. Kocak and N. Hinitt, “Exploiting the Power of GPUs for Multi-gigabit Wireless Baseband Processing,” in ISPDC ’10: Proceedings of the 2010 Ninth International Symposium on Parallel and Distributed Computing, 2010, pp. 56–62.
U. Erra, B. Frola, and V. Scarano, BehaveRT: A GPU-Based Library for Autonomous Characters. 2010.
G. Vasiliadis, M. Polychronakis, and S. Ioannidis, “GPU-Assisted Malware.”
A. Sen, B. Aksanli, M. Bozkurt, and M. Mert, “Parallel Cycle Based Logic Simulation Using Graphics Processing Units,” in ISPDC ’10: Proceedings of the 2010 Ninth International Symposium on Parallel and Distributed Computing, 2010, pp. 71–78.
Nvidia, “CUDA Show Case.”
M. Tang and D. Manocha et al., “Fast GPU-based Collision Detection for Deformable Models,” in Proceedings of ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (i3D 2011), 2010.
G. G. and R. Montella et al., “A GPGPU transparent virtualization component for high performance computing clouds,” in Euro-Par 2010 – Parallel Processing.
N. Nakasato, “A Fast GEMM Implementation on a Cypress GPU,” in 1st International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS 10).
J. E. Stone and D. J. Hardy et al., “GPU-Accelerated Molecular Modeling Coming Of Age,” 2010.
P. Gwosdek and H. Zimmer et al., “A Highly Efficient GPU Implementation for Variational Optic Flow Based on the Euler-Lagrange Framework,” in Proceedings of the ECCV Workshop for Computer Vision with GPUs, 2010.
Nexiwave.com, “Nexiwave.com and UbiCast Partner to Offer GPU-Accelerated Deep Audio Search.” 2010.
A. R. Brodtkorb, “PhD Thesis: Scientific Computing on Heterogeneous Architectures,” University of Oslo, 2010.
V. Pham and P. Vo et al., “GPU Implementation of Extended Gaussian Mixture Model for Background Subtraction,” in IEEE International Conference on Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF), 2010.
IST/SPIE, “GPGPU papers from Parallel Processing for Imaging Applications conference,” 2011.
E. E. Ferrero and J. Pablo De Francesco et al., “q-state Potts model metastability study using optimized GPU-based Monte Carlo algorithms,” 2011.
L. Cheng and M. Gong et al., “Real-time Discriminative Background Subtraction,” in IEEE Transactions on Image Processing, 2011.
J. Sung Yoon and W.-H. Jung, “A GPU-accelerated bioinformatics application for large-scale protein networks,” in Asia Pacific Bioinformatics Conference, 2011.
A. Dziekonski and A. Lamecki et al., “GPU Acceleration of Multilevel Solvers for Analysis of Microwave Components With Finite Element Method,” in IEEE Microwave and Wireless Components Letters, 2011.
J. Singh and I. Aruni, “Accelerating Power Flow studies on Graphics Processing Unit,” in Annual IEEE India Conference (INDICON), 2010, pp. 1–5.
D. Rossinelli, B. Hejazialhosseini, D. G. Spampinato, and P. Koumoutsakos, “Multicore/Multi-GPU Accelerated Simulations of Multiphase Compressible Flows Using Wavelet Adapted Grids,” SIAM Journal of Scientific Computing, vol. 33, no. 2, pp. 512–540, 2011.
L. Gelisio, C. L. Azanza Ricardo, M. Leoni, and P. Scardi, “Real-space calculation of powder diffraction patterns on graphics processing units,” Journal of Applied Crystallography, vol. 43, no. 3, pp. 647–653, 2010.
A. Dziekonski, A. Lamecki, and M. Mrozowski, “A memory efficient and fast sparse matrix vector product on a GPU,” Progress In Electromagnetics Research, vol. 116, pp. 49–63, 2011.
D. Rossinelli, C. Conti, and P. Koumoutsakos, “Mesh–particle interpolations on graphics processing units and multicore central processing units,” Phil. Trans. R. Soc. A, vol. 369, no. 1944, pp. 2164–2175, 2011.
G. Pratx and L. Xing, “GPU computing in medical physics: A review,” Medical Physics, vol. 38, no. 5, 2011.
C. Nugteren, G.-J. van den Braak, H. Corporaal, and B. Mesman, “High performance predictable histogramming on GPUs: exploring and evaluating algorithm tradeoffs,” in Fourth Workshop on General Purpose Processing on Graphics Processing Units (GPGPU-4).
J. C. Bowden, “Application of the OpenCL API for Implementation of the NIPALS Algorithm for Principal Component Analysis of Large Data Sets,” in 2010 Sixth IEEE International Conference on e-Science Workshops, 2010, pp. 25–30.
N. Satish, M. Harris, and M. Garland, “Designing efficient sorting algorithms for manycore GPUs,” in 2009 IEEE International Symposium on ParallelProcessing, 2009, pp. 1–10.
M. Andrecut, “Parallel GPU Implementation of Iterative PCA Algorithms,” arXiv, no. 0811.1081, Nov. 2008.
GPU Computing Tools and Libraries
NVIDIA, “NVIDIA GPU computing downloads.”
D. van Dyk, M. Geveler, S. Mallach, D. Ribbrock, D. Göddeke, and C. Gutwenger, “HONEI: A collection of libraries for numerical computations targeting multiple processor architectures,” Computer Physics Communications, vol. 180, no. 12, pp. 2534–2543, 2009.
OpenNL, “Open Numerical Library (OpenNL),” 2010.
CLyther, “CLyther.” [Online]. Available: http://srossross.github.com/Clyther/.
S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, “Rodinia: A benchmark suite for heterogeneous computing,” IEEE Workload Characterization Symposium, vol. 0, pp. 44–54, 2009.
cudpp, “CUDA Data Parallel Primitives Library (cudpp).”
A. Barak and A. Shiloh, “MOSIX Virtual OpenCL (VCL).”
Graphic Remedy, “gDEBugger CL.”
AccelerEyes, “Jacket: The GPU Engine for MATLAB.”
I. Gelado, J. Cabezas, J. Stone, S. Patel, N. Navarro, and W. Hwu, “An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems,” in Fifteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2010.
multiscalelab, “Swan: A simple tool for porting CUDA to OpenCL.”
A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford, V. Tipparaju, and J. S. Vetter, “The Scalable Heterogeneous Computing (SHOC) benchmark suite,” in GPGPU ’10: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, 2010, pp. 63–74.
Fixstars Corporation, “Yellow Dog Enterprise Linux for CUDA.”
Palix Technologies LLC, “Advanced Numerical Design Solver (ANDSolver).”
J. Hoberock and N. Bell, “Thrust: open-source template library for developing CUDA applications.”
Vratis.com, “SpeedIT Tools library.”
A. J. Peña, “rCUDA Framework: concurrent usage of CUDA-compatible devices remotely.”
Geist Software Labs Inc., “OpenCL Studio.”
J. Cohen, “OpenCurrent: open source C++ library for solving Partial Differential Equations (PDEs) over regular grids.”
University of Michigan, “Highly Optimized Object-oriented Many-particle Dynamics - Blue Edition.”
Advanced Micro Devices, Inc., “ATI Stream Profiler.”
CAPS entreprise, “HMPP Workbench: a directive-based compiler for hybrid computing.”
Gpu Systems, “Libra Technology compiler and runtime architecture.”
Tuna Code, “CUDA Vision and Imaging Library (CUVILib).”
Institute for Microelectronics, TU Wien, “Vienna Computing Library (ViennaCL): scientific computing library.”
NIH Center for Biomedical Computation at Stanford University, “OpenMM: Accelerate Molecular Dynamics.”
EM Photonics, Inc., “CULA: GPU-accelerated linear algebra library.”
NVIDIA Corporation, “NVIDIA® Parallel Nsight™.”
G. F. Diamos, A. R. Kerr, S. Yalamanchili, and N. Clark, “Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems,” in PACT ’10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques, 2010, pp. 353–364.
Advanced Micro Devices, Inc., “ATI GPU Services (AGS) Library.”
GE Intelligent Platforms, “AXISLib-GPU.”
The MathWorks, Inc., “MATLAB GPU Computing.”
Advanced Micro Devices, Inc., “Aparapi.”
AMD, “ATI Stream Software Development Kit (SDK).”
ASU/Temple Zeolite Project, “PyCULA: Python Bindings for CULA GPGPU LAPACK.”
Thrust, “Thrust v1.3 release.”
HOOMD-blue, “HOOMD-blue: particle dynamics simulations.”
IMPETUS Afea, “IMPETUS Afea Solver: A novel Finite Element code adapted to GPU technology.” 2010.
ACUSIM Software, Inc., “ACUSIM Software Releases Latest Version of AcuSolve CFD Solver.” 2010.
MathWorks, “MATLAB Adds GPU Support.”
GPU Systems, “GPU Systems release MATLAB CPUGPU Support.” 2010.
NVIDIA, “CUDA 3.2 Released.” 2010.
GAP, “rCUDA 2.0 released.” 2010.
VratisLTD, “OpenFOAM SpeedIT plugin 1.1 released.” 2010.
Innovative Computing Laboratory, “MAGMA 1.0 – LAPACK for GPUs – has been released.” 2010.
TidePowerd, “Announcing GPU.NET from TidePowerd: ‘Native’ GPU computing for .NET.” 2010.
MOSIX group, “MOSIX Virtual OpenCL (VCL) Cluster Platform.” 2010.
H. Bauke, “TRNG - A library for parallel Monte Carlo on NVIDIA graphics cards.” 2011.
NVIDIA, “CUDA Libraries Performance Report Now Available.” 2011.
VratisLTD, “SpeedIT 1.2 released.” 2011.
Ocelot, “GPUOcelot 2.0 Released.” 2011.
GMAC, “GMAC 0.0.20 Released.” 2011.
OpenCLcc, “OpenCLcc: Offline OpenCL Compilation.” 2011.
U. Verner, A. Schuster, and M. Silberstein, “Processing data streams with hard real-time constraints on heterogeneous systems,” in International Conference on Supercomputing (ICS), 2011.
KGPU, “KGPU: enabling GPU computing in Linux kernel,” 2011. [Online]. Available: http://code.google.com/p/kgpu/. [Accessed: 2011].
symscape, “GPU Linear Solver Library for OpenFOAM,” 2011. [Online]. Available: http://www.symscape.com/gpuopenfoam. [Accessed: 2011].
xman, “SGC Ruby CUDA.”
TidePowerd Ltd., “GPU.NET,” 2011. [Online]. Available: http://www.tidepowerd.com/. [Accessed: Oct2011].
G. Dotzler, R. Veldema, and M. Klemm, “JCudaMP: OpenMP/Java on CUDA,” in 3rd International Workshop on Multicore Software Engineering, 2010, pp. 10–17.
C. Sánchez de La Lama, “Portable OpenCL,” 2011. [Online]. Available: https://launchpad.net/pocl.
“Extending MPI to Accelerators,” PACT 2011 Workshop Series: Architectures and Systems for Big Data, Oct. 2011.
