Hasso-Plattner-Institut Potsdam Operating Systems and Middleware Group at HPI University of Potsdam, Germany

Readings collection for GPU computing

Stratton, J. A., Stone, S. S. & Hwu, W.-mei W., 2008. MCUDA: An Efficient Implementation of CUDA Kernels for Multi-Core CPUs. Proceedings of the 21st International Workshop on Languages and Compilers for Parallel Computing.

Buck, I. et al., 2004. Brook for GPUs: stream computing on graphics hardware, p.777-786.

Fan, Z. et al., 2004. GPU Cluster for High Performance Computing. SC ’04: Proceedings of the 2004 ACM/IEEE conference on Supercomputing.

Harris, M., 2005. Mapping computational concepts to GPUs.

Lee, S., Min, S.-J. & Eigenmann, R., 2009. OpenMP to GPGPU: a compiler framework for automatic translation and optimization. PPoPP ’09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, p.101-110.

Nickolls, J. et al., 2008. Scalable Parallel Programming with CUDA. Queue, 6(2), p.40-53.

Fatahalian, K. & Houston, M., 2008. GPUs: A Closer Look. ACM Queue, 6(2), p.18-28.

Advanced Micro Devices, 2010. ATI Stream Computing OpenCL Programming Guide - Version 1.03.

Ueng, S.-Z. et al., 2008. CUDA-Lite: Reducing GPU Programming Complexity. 21st International Workshop on Languages and Compilers for Parallel Computing (LCPC), p.1-15.

Sugerman, J. et al., 2009. GRAMPS: A programming model for graphics pipelines. ACM Trans. Graph., 28(1), p.1-11.

Hou, Q., Zhou, K. & Guo, B., 2008. BSGP: bulk-synchronous GPU programming. SIGGRAPH ’08: ACM SIGGRAPH 2008 papers, p.1-12.

Catanzaro, B., 2010. OpenCL™ Optimization Case Study: Diagonal Sparse Matrix Vector Multiplication.

Lamb, C., 2009. OpenCL for NVIDIA GPUs.

Sanders, J. & Kandrot, E., 2010. CUDA by Example: An Introduction to General-Purpose GPU Programming 1st ed., Addison-Wesley Professional.

Mark, W., 2008. Future Graphics Architectures. Queue, 6(2), p.54-64.

Ryoo, S. et al., 2008. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. PPoPP ’08: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, p.73-82.

Mattson, T., 2010. The Future of Many Core Computing: Software for many core processors.

Behr, D., 2009. AMD GPU Architecture: OpenCL™ Tutorial, PPAM 2009.

Lee, V.W. et al., 2010. Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. ISCA ’10: Proceedings of the 37th annual international symposium on Computer architecture, p.451-460.

NVIDIA, 2009c. Optimizing CUDA.

Ryoo, S. et al., 2008. Program optimization space pruning for a multithreaded GPU. CGO ’08: Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization, p.195-204.

Woo, D.H. & Lee, H.-H.S., 2010. COMPASS: a programmable data prefetcher using idle GPU shaders. ASPLOS ’10: Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems, p.297-310.

Hwu, W.-mei et al., 2007. Implicitly parallel programming models for thousand-core microprocessors. DAC ’07: Proceedings of the 44th annual Design Automation Conference, p.754-759.

Keutzer, K., Asanovic, K. et al., 2006. The Landscape of Parallel Computing Research: A View from Berkeley.

Gelado, I. et al., 2010. An asymmetric distributed shared memory model for heterogeneous parallel systems. SIGARCH Comput. Archit. News, 38(1), p.347-358.

Kerr, Andrew, Diamos, Gregory & Yalamanchili, Sudhakar, 2010. Modeling GPU-CPU Workloads and Systems. Third Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU-3).

Garg, R. & Amaral, J.N., 2010. Compiling Python to a hybrid execution environment. GPGPU ’10: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, p.19-30.

Brodtkorb, A. Rigland et al., 2010. State-of-the-art in heterogeneous computing. Scientific Programming, 18(1), p.1-33.

Bordawekar, Rajesh, Bondhugula, Uday & Rao, R., 2010. Believe it or Not! Multi-core CPUs Can Match GPU Performance for FLOP-intensive Application!

Feinbube, F., 2010. GPU Readings List. Available at: http://www.dcl.hpi.uni-potsdam.de/research/gpureadings/.

GE Intelligent Platforms, 2010. Many-Core Processors Report Ready for Duty.

Kerr, Andrew, Diamos, Gregory & Yalamanchili, Sudhakar, 2010. Modeling GPU-CPU workloads and systems. GPGPU ’10: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, p.31-42.

Kerr, Andrew, Diamos, Gregory & Yalamanchili, Sudhakar, 2009. A characterization and analysis of PTX kernels. IEEE Workload Characterization Symposium, p.3-12.

Luk, C.-K., Hong, S. & Kim, H., 2009. Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping. MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, p.45-55.

Kelm, J.H. et al., 2009. Rigel: an architecture and scalable programming interface for a 1000-core accelerator. SIGARCH Comput. Archit. News, 37(3), p.140-151.

Feinbube, F., Tröger, P. & Polze, A., 2010. Joint Forces: From Multithreaded Programming to GPU Computing. IEEE Software, 28(1), p.51-57. Available at: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5601687.

NVIDIA, 2009a. NVIDIA OpenCL Best Practices Guide - Version 2.3.

Munshi, A. ed., 2010. The OpenCL Specification - Version 1.1, The Khronos Group Inc.

Che, S. et al., 2008. A performance study of general-purpose applications on graphics processors using CUDA. Journal of Parallel and Distributed Computing, 68(10).

Kirk, David B. & Hwu, W.-mei W., 2010. Programming Massively Parallel Processors: A Hands-on Approach 1st ed., Morgan Kaufmann.

NVIDIA, 2010. NVIDIA CUDA C Programming Guide 3.2, NVIDIA Corporation.

NVIDIA, 2009b. OpenCL Programming for the CUDA Architecture - Version 2.3.

Feinbube, F. et al., 2010. NQueens on CUDA: Optimization Issues. 2010 Ninth International Symposium on Parallel and Distributed Computing, p.63-70. Available at: http://dl.acm.org/citation.cfm?id=1848298.

Insilicos, Insilicos Awarded NIH Grant Applying GPU Computing to Human Disease.

HipHaC’11, CfP: New Frontiers in High-performance and Hardware-aware Computing.

SciComp Inc., 2010. SciComp Speeds Derivatives Performance with Support for New NVIDIA Hardware and Software.

Pennycook, S. J., Hammond, S. D. et al., 2010. Performance Analysis of a Hybrid MPI/CUDA Implementation of the NAS-LU Benchmark. 1st International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS 10).

NVIDIA, 2010a. GPU Technology Conference Session Archive Available.

NVIDIA, 2010b. NVIDIA Tesla GPUs Power World’s Fastest Supercomputer.

Bordawekar, R. & Bondhugula, U., et al., 2010. “Can CPUs Match GPUs on Performance with Productivity?: Experiences with Optimizing a FLOP-intensive Application on CPUs and GPU.” Technical Report, IBM T. J. Watson Research Center.

Garland, Michael & Kirk, David B., 2010. Understanding throughput-oriented architectures. Communications of the ACM, 53.

top500.org, 2010. 3 of the 5 fastest supercomputers in the world use GPUs.

PASI, 2011. Pan-American Advanced Studies Institutes: Materials now online.

UK GPU Computing Conference, 2011. Presentations from the 2nd UK GPU Computing Conference.

MAVERICK PR, 2011. PEER 1 Hosting: Large-Scale Hosted NVIDIA GPU Cloud.

Shivam, P., Babu, S. & Chase, J. S., 2006. Learning Application Models for Utility Resource Planning. International Conference on Autonomic Computing (ICAC), p.255-264.

Amazon, 2010. Amazon announces GPUs for Cloud Computing.

Stratton, J.A. et al., 2010. Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs. CGO ’10: Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization, p.111-119.

Herlihy, M. & Shavit, N., 2008. The Art of Multiprocessor Programming, Morgan Kaufmann.

Karlson, E., 2010. Optimizing OpenCL for nVidia GPGPU.

NVIDIA, 2011. CUDA Developer Zone. Available at: http://developer.nvidia.com/category/zone/cuda-zone [Accessed October 6, 2011].

GPU Compute Use Cases

Fatahalian, K., Sugerman, J. & Hanrahan, P., 2004. Understanding the efficiency of GPU algorithms for matrix-matrix multiplication. HWWS ’04: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, p.133-137.

Walters, J. Paul et al., 2009. Evaluating the use of GPUs in liver image segmentation and HMMER database searches. Parallel and Distributed Processing Symposium, International, p.1-12.

Liu, C., cuHMM: a CUDA Implementation of Hidden Markov Model Training and Classification.

Ganesan, N. et al., 2010. Accelerating HMMER on GPUs by Implementing Hybrid Data and Task Parallelism. International Conference on Bioinformatics and Computational Biology (ACM-BCB).

Du, Z., Yin, Z. & Bader, D. A., 2010. A Tile-based Parallel Viterbi Algorithm for Biological Sequence Alignment on GPU with CUDA. Ninth IEEE International Workshop on High Performance Computational Biology (HiCOMB).

Ries, F. et al., 2009. Triangular matrix inversion on Graphics Processing Unit. SC ’09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, p.1-10.

Arora, N., Shringarpure, A. & Vuduc, R.W., 2009. Direct N-body Kernels for Multicore Platforms. ICPP ’09: Proceedings of the 2009 International Conference on Parallel Processing, p.379-387.

Fujiwara, K. & Nakasato, Naohito, 2009. Fast Simulations of Gravitational Many-body Problem on RV770 GPU.

Dixon, P.R., Oonishi, T. & Furui, S., 2009. Harnessing graphics processors for the fast computation of acoustic likelihoods in speech recognition. Comput. Speech Lang., 23(4), p.510-526.

Göddeke, Dominik & Strzodka, R., 2010. Cyclic Reduction Tridiagonal Solvers on GPUs Applied to Mixed Precision Multigrid. IEEE Transactions on Parallel and Distributed Systems.

Geveler, M. et al., 2010. Lattice-Boltzmann Simulation of the Shallow-Water Equations with Fluid-Structure Interaction on Multi- and Manycore Processors. In R. Keller, D. Kramer, & J.-P. Weiß, eds. Lecture Notes in Computer Science. Springer, p.92-104.

Demchik, V., 2010. Pseudo-random number generators for Monte Carlo simulations on Graphics Processing Units.

Kong, J. et al., 2010. Accelerating MATLAB Image Processing Toolbox functions on GPUs. D. R. Kaeli & M. Leeser, eds. GPGPU, 425, p.75-85.

Capuzzo-Dolcetta, R., Mastrobuono-Battisti, A. & Maschietti, D., 2010. NBSymple, a double parallel, symplectic N-body code running on Graphic Processing Units.

Brodtkorb, A.R. et al., Simulation and Visualization of the Saint-Venant System using GPUs.

Zafar, F., Curtis, A. & Olano, M., 2010. GPU Random Numbers via the Tiny Encryption Algorithm. Proceedings of the ACM SIGGRAPH/Eurographics Symposium on High Performance Graphics (HPG).

Göddeke, Dominik, 2010. Fast and Accurate Finite-Element Multigrid Solvers for PDE Simulations on GPU Clusters.

Komatitsch, D. et al., 2010. High-order finite-element seismic wave propagation modeling with MPI on a large GPU cluster.

Vasiliadis, G. et al., 2009. Regular Expression Matching on Graphics Hardware for Intrusion Detection.

Vasiliadis, G. & Ioannidis, S., 2010. GrAVity: A Massively Parallel Antivirus Engine. 13th International Symposium On Recent Advances In Intrusion Detection (RAID).

Block, B., Virnau, P. & Preis, T., 2010. Multi-GPU accelerated multi-spin Monte Carlo simulations of the 2D Ising model. Computer Physics Communications, 181(9), p.1549-1556.

Dick, C., Georgii, J. & Westermann, R., 2010. A Real-Time Multigrid Finite Hexahedra Method for Elasticity Simulation using CUDA.

Dehnavi, M. Mehri, Fernandez, D. & Giannacopoulos, D., 2010. Finite-Element Sparse Matrix Vector Multiplication on Graphic Processing Units.

Moore, A. & Quillen, A. C., QYMSYM: A GPU-Accelerated Hybrid Symplectic Integrator That Permits Close Encounters.

Emeliyanenko, P., 2010. A complete modular resultant algorithm targeted for realization on graphics hardware. PASCO ’10: Proceedings of the 4th International Workshop on Parallel and Symbolic Computation, p.35-43.

Volk, P. B., Habich, D. & Lehner, W., 2010. GPU-Based Speculative Query Processing for Database Operations. First International Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures (ADMS).

Fang, W., He, B. & Luo, Q., 2010. Database Compression on Graphics Processors. PVLDB/VLDB.

Han, S. et al., 2010. PacketShader: a GPU-accelerated software router. SIGCOMM ’10: Proceedings of the ACM SIGCOMM 2010 conference on SIGCOMM, p.195-206.

Stivala, A., Stuckey, P. & Wirth, A., 2010. Fast and accurate protein substructure searching with simulated annealing and GPUs.

Roy, O., Jovanovic, I. & Parhizkar, R., WaveTomography: 2D time-domain waveform tomography reconstruction algorithm.

Singh, J. & Aruni, I., GPU computing for R Statistical Environment.

Kocak, T. & Hinitt, N., 2010. Exploiting the Power of GPUs for Multi-gigabit Wireless Baseband Processing. ISPDC ’10: Proceedings of the 2010 Ninth International Symposium on Parallel and Distributed Computing, p.56-62.

Erra, U., Frola, B. & Scarano, V., 2010. BehaveRT: A GPU-Based Library for Autonomous Characters.

Vasiliadis, G., Polychronakis, M. & Ioannidis, S., GPU-Assisted Malware.

Sen, A. et al., 2010. Parallel Cycle Based Logic Simulation Using Graphics Processing Units. ISPDC ’10: Proceedings of the 2010 Ninth International Symposium on Parallel and Distributed Computing, p.71-78.

NVIDIA, CUDA Show Case.

Tang, M. & Manocha, D., et al., 2010. Fast GPU-based Collision Detection for Deformable Models. Proceedings of ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (i3D 2011).

Giunta, G., Montella, R. et al., A GPGPU transparent virtualization component for high performance computing clouds. Euro-Par 2010 – Parallel Processing.

Nakasato, N., A Fast GEMM Implementation on a Cypress GPU. 1st International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems (PMBS 10).

Stone, J. E. & Hardy, D. J., et al., 2010. GPU-Accelerated Molecular Modeling Coming Of Age.

Gwosdek, P. & Zimmer, H., et al., 2010. A Highly Efficient GPU Implementation for Variational Optic Flow Based on the Euler-Lagrange Framework. Proceedings of the ECCV Workshop for Computer Vision with GPUs.

Nexiwave.com, 2010. Nexiwave.com and UbiCast Partner to Offer GPU-Accelerated Deep Audio Search.

Brodtkorb, A. R., 2010. PhD Thesis: Scientific Computing on Heterogeneous Architectures. University of Oslo.

Pham, V. & Vo, P., et al., 2010. GPU Implementation of Extended Gaussian Mixture Model for Background Subtraction. IEEE International Conference on Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF).

IS&T/SPIE, 2011. GPGPU papers from Parallel Processing for Imaging Applications conference.

Ferrero, E. E. & De Francesco, J. Pablo, et al., 2011. q-state Potts model metastability study using optimized GPU-based Monte Carlo algorithms.

Cheng, L. & Gong, M., et al., 2011. Real-time Discriminative Background Subtraction. IEEE Transactions on Image Processing. To appear.

Yoon, J. Sung & Jung, W.-H., 2011. A GPU-accelerated bioinformatics application for large-scale protein networks. Asia Pacific Bioinformatics Conference.

Dziekonski, A. & Lamecki, A., et al., 2011. GPU Acceleration of Multilevel Solvers for Analysis of Microwave Components With Finite Element Method. In IEEE Microwave and Wireless Components Letters.

Singh, J. & Aruni, I., 2010. Accelerating Power Flow studies on Graphics Processing Unit. Annual IEEE India Conference (INDICON), p.1-5.

Rossinelli, D. et al., 2011. Multicore/Multi-GPU Accelerated Simulations of Multiphase Compressible Flows Using Wavelet Adapted Grids. SIAM Journal on Scientific Computing, 33(2), p.512-540.

Gelisio, L. et al., 2010. Real-space calculation of powder diffraction patterns on graphics processing units. Journal of Applied Crystallography, 43(3), p.647-653.

Dziekonski, A., Lamecki, A. & Mrozowski, M., 2011. A memory efficient and fast sparse matrix vector product on a GPU. Progress In Electromagnetics Research, 116, p.49-63.

Rossinelli, D., Conti, C. & Koumoutsakos, P., 2011. Mesh–particle interpolations on graphics processing units and multicore central processing units. Phil. Trans. R. Soc. A, 369(1944), p.2164-2175.

Pratx, G. & Xing, L., 2011. GPU computing in medical physics: A review. Medical Physics, 38(5).

Nugteren, C. et al., High performance predictable histogramming on GPUs: exploring and evaluating algorithm trade-offs. Fourth Workshop on General Purpose Processing on Graphics Processing Units (GPGPU-4).

Bowden, J. C., 2010. Application of the OpenCL API for Implementation of the NIPALS Algorithm for Principal Component Analysis of Large Data Sets. 2010 Sixth IEEE International Conference on e-Science Workshops, p.25-30. Available at: http://dx.doi.org/10.1109/eScienceW.2010.14.

Satish, N., Harris, M. & Garland, M., 2009. Designing efficient sorting algorithms for manycore GPUs. 2009 IEEE International Symposium on Parallel & Distributed Processing, p.1-10. Available at: http://dx.doi.org/10.1109/IPDPS.2009.5161005.

Andrecut, M., 2008. Parallel GPU Implementation of Iterative PCA Algorithms. arXiv, (0811.1081). Available at: http://arxiv.org/abs/0811.1081.

GPU Computing Tools and Libraries

NVIDIA, NVIDIA GPU computing downloads.

Dyk, D. van et al., 2009. HONEI: A collection of libraries for numerical computations targeting multiple processor architectures. Computer Physics Communications, 180(12), p.2534-2543.

OpenNL, 2010. Open Numerical Library (OpenNL).

CLyther, CLyther.

Che, S. et al., 2009. Rodinia: A benchmark suite for heterogeneous computing. IEEE Workload Characterization Symposium, p.44-54.

cudpp, CUDA Data Parallel Primitives Library (cudpp).

Barak, A. & Shiloh, A., MOSIX Virtual OpenCL (VCL).

Graphic Remedy, gDEBugger CL.

AccelerEyes, Jacket: The GPU Engine for MATLAB.

Gelado, I. et al., 2010. An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems. Fifteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).

multiscalelab, Swan: A simple tool for porting CUDA to OpenCL.

Danalis, A. et al., 2010. The Scalable Heterogeneous Computing (SHOC) benchmark suite. GPGPU ’10: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, p.63-74.

Fixstars Corporation, Yellow Dog Enterprise Linux for CUDA.

Palix Technologies LLC, Advanced Numerical Design Solver (ANDSolver).

Hoberock, J. & Bell, N., Thrust: open-source template library for developing CUDA applications.

Vratis.com, SpeedIT Tools library.

Peña, A. J., rCUDA Framework: concurrent usage of CUDA-compatible devices remotely.

Geist Software Labs Inc., OpenCL Studio.

Cohen, J., OpenCurrent: open source C++ library for solving Partial Differential Equations (PDEs) over regular grids.

University of Michigan, Highly Optimized Object-oriented Many-particle Dynamics - Blue Edition.

Advanced Micro Devices, Inc., ATI Stream Profiler.

CAPS entreprise, HMPP Workbench: a directive-based compiler for hybrid computing.

Gpu Systems, Libra Technology compiler and runtime architecture.

Tuna Code, CUDA Vision and Imaging Library (CUVILib).

Institute for Microelectronics, TU Wien, Vienna Computing Library (ViennaCL): scientific computing library.

NIH Center for Biomedical Computation at Stanford University, OpenMM: Accelerate Molecular Dynamics.

EM Photonics, Inc., CULA: GPU-accelerated linear algebra library.

NVIDIA Corporation, NVIDIA® Parallel Nsight™.

Diamos, G.F. et al., 2010. Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems. PACT ’10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques, p.353-364.

Advanced Micro Devices, Inc., ATI GPU Services (AGS) Library.

GE Intelligent Platforms, AXISLib-GPU.

The MathWorks, Inc., MATLAB GPU Computing.

Advanced Micro Devices, Inc., Aparapi.

AMD, ATI Stream Software Development Kit (SDK).

ASU/Temple Zeolite Project, PyCULA: Python Bindings for CULA GPGPU LAPACK.

Thrust, Thrust v1.3 release.

HOOMD-blue, HOOMD-blue: particle dynamics simulations.

IMPETUS Afea, 2010. IMPETUS Afea Solver: A novel Finite Element code adapted to GPU technology.

ACUSIM Software, Inc., 2010. ACUSIM Software Releases Latest Version of AcuSolve CFD Solver.

MathWorks, MATLAB Adds GPU Support.

GPU Systems, 2010. GPU Systems release MATLAB CPU-GPU Support.

NVIDIA, 2010. CUDA 3.2 Released.

GAP, 2010. rCUDA 2.0 released.

VratisLTD, 2010. OpenFOAM SpeedIT plugin 1.1 released.

Innovative Computing Laboratory, 2010. MAGMA 1.0 – LAPACK for GPUs – has been released.

TidePowerd, 2010. Announcing GPU.NET from TidePowerd: “Native” GPU computing for .NET.

MOSIX group, 2010. MOSIX Virtual OpenCL (VCL) Cluster Platform.

Bauke, H., 2011. TRNG-A library for parallel Monte Carlo on NVIDIA graphics cards.

NVIDIA, 2011. CUDA Libraries Performance Report Now Available.

VratisLTD, 2011. SpeedIT 1.2 released.

Ocelot, 2011. GPU-Ocelot 2.0 Released.

GMAC, 2011. GMAC 0.0.20 Released.

OpenCLcc, 2011. OpenCLcc: Offline OpenCL Compilation.

Verner, U., Schuster, A. & Silberstein, M., 2011. Processing data streams with hard real-time constraints on heterogeneous systems. International Conference on Supercomputing (ICS). To appear.

KGPU, 2011. KGPU: enabling GPU computing in Linux kernel. Available at: http://code.google.com/p/kgpu/ [Accessed 2011].

symscape, 2011. GPU Linear Solver Library for OpenFOAM. Available at: http://www.symscape.com/gpu-openfoam [Accessed 2011].

xman, SGC Ruby CUDA.

TidePowerd Ltd., 2011. GPU.NET. Available at: http://www.tidepowerd.com/ [Accessed October 2011].

Dotzler, G., Veldema, R. & Klemm, M., 2010. JCudaMP: OpenMP/Java on CUDA. 3rd International Workshop on Multicore Software Engineering, p.10-17. Available at: http://doi.acm.org/10.1145/1808954.1808959.

La Lama, C. Sánchez de, 2011. Portable OpenCL. Available at: https://launchpad.net/pocl.

Anon., 2011. Extending MPI to Accelerators. PACT 2011 Workshop Series: Architectures and Systems for Big Data.