Hybrid Parallel Programming on GPU Clusters
Abstract

Nowadays, NVIDIA's CUDA is a general-purpose, scalable parallel programming model for writing highly parallel applications. It provides several key abstractions: a hierarchy of thread blocks, shared memory, and barrier synchronization. This model has proven quite successful at programming multithreaded many-core GPUs and scales transparently to hundreds of cores: scientists throughout industry and academia are already using CUDA to achieve dramatic speedups on production and research codes. In this paper, we propose a hybrid parallel programming approach using CUDA and MPI together, which partitions loop iterations according to the number of C1060 GPU nodes in a GPU cluster consisting of one C1060 and one S1070. Loop iterations assigned to one MPI process are processed in parallel by CUDA, run by the processor cores in the same computational node.

Keywords: CUDA, GPU, MPI, OpenMP, hybrid, parallel programming

I. INTRODUCTION

Nowadays, NVIDIA's CUDA [1], [16] is a general-purpose, scalable parallel programming model for writing highly parallel applications. It provides several key abstractions: a hierarchy of thread blocks, shared memory, and barrier synchronization. This model has proven quite successful at programming multithreaded many-core GPUs and scales transparently to hundreds of cores: scientists throughout industry and academia are already using CUDA [1], [16] to achieve dramatic speedups on production and research codes.
NVIDIA builds the chips behind CUDA out of hundreds of cores, and here we try to use the computing devices NVIDIA provides for parallel computing. This paper proposes a solution that not only simplifies the use of hardware acceleration in conventional general-purpose applications, but also keeps the application code portable. In this paper, we propose a parallel programming approach using hybrid CUDA, OpenMP and MPI [3] programming, which partitions loop iterations according to the performance weighting of the multi-core [4] nodes in a cluster. Because iterations assigned to one MPI process are processed in parallel by OpenMP threads run by the processor cores in the same computational node, the number of loop iterations allocated to one computational node at each scheduling step depends on the number of processor cores in that node.
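To make this scheme concrete, the following is a minimal sketch of performance-weighted loop partitioning with MPI and OpenMP. It is not the paper's implementation: the two-node weights array, N, and vector_op() are hypothetical placeholders, and a real system would derive the weights from the performance functions introduced below.

    /* Sketch: MPI splits the loop iterations across nodes by a
       performance weight; OpenMP threads work on each node's share. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    static void vector_op(int i) { (void)i; /* per-iteration work goes here */ }

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Hypothetical per-node performance weights (sum to 1.0);
           this sketch assumes exactly two nodes. */
        double weights[] = { 0.6, 0.4 };
        if (size != 2) { MPI_Finalize(); return 1; }

        /* Turn the weights into this rank's contiguous [begin, end) range. */
        int begin = 0, end = 0;
        double acc = 0.0;
        for (int r = 0; r <= rank; ++r) {
            begin = end;
            acc += weights[r];
            end = (int)(acc * (double)N);
        }

        /* Iterations assigned to this MPI process are run in parallel
           by OpenMP threads on the node's processor cores. */
        #pragma omp parallel for
        for (int i = begin; i < end; ++i)
            vector_op(i);

        printf("rank %d computed iterations [%d, %d)\n", rank, begin, end);
        MPI_Finalize();
        return 0;
    }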
In this paper, we propose a general approach that uses performance functions to estimate performance weights for each node. To verify the proposed approach, a heterogeneous cluster and a homogeneous cluster were built. In our implementation, the master node also participates in computation, whereas in previous schemes only slave nodes do computation work. Empirical results show that in both heterogeneous and homogeneous cluster environments, the proposed approach improved performance over all previous schemes.

The rest of this paper is organized as follows. In Section 2, we introduce several typical and well-known self-scheduling schemes and a famous benchmark used to analyze computer system performance. In Section 3, we define our model and describe our approach. Our system configuration is then specified in Section 4, and experimental results for three types of application program are presented. Concluding remarks and future work are given in Section 5.

II. BACKGROUND REVIEW

A. History of GPU and CUDA
In the past, we had to use more than one computer, with multiple CPUs, for parallel computing. As the history of chips shows, early chips did not need much computing power; gradually, games and graphics created a demand for 3D, 3D accelerator cards appeared, display processing moved onto separate chips, and eventually a chip similar in stature to the CPU emerged for graphics: the GPU. We know that GPU computing can give us the answers we want, but why choose the GPU? Consider a comparison of current CPUs and GPUs. First, a CPU today has at most eight cores, whereas a GPU has grown to 260 cores; with that many cores, and despite the relatively low frequency of each core, we believe such a large amount of parallel computing power is unlikely to be weaker than a single-issue CPU. Next, the GPU has its own on-board memory; comparing the GPU's access to its memory against the CPU's access to main memory, we find that GPU memory access is about ten times faster, a full 90 GB/s. This is quite an alarming gap, and it means that computations that must access large amounts of data can gain a great deal from the GPU. A CPU uses advanced flow control, such as branch prediction or delayed branches, together with a large cache, to reduce memory access latency.
A GPU's caches are comparatively small and its flow control simple; instead, the GPU covers the problem of memory latency by keeping a large number of computations in flight. That is, suppose one GPU memory access takes 5 units of time: if 100 threads access memory simultaneously, the total time is still 5 units. But suppose one CPU memory access takes 0.1 units of time: if 100 accesses are issued one after another, the total time is 10 units. Therefore, the GPU's parallel processing can hide memory access latency and even outrun the CPU. The GPU is designed such that more transistors are devoted to data processing rather than data caching and flow control, as schematically illustrated by Figure 1.
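This latency-hiding idea is visible even in the simplest CUDA kernel: each thread issues its own memory access, and the hardware switches between warps while loads are outstanding. The kernel below is an illustrative sketch, not code from the paper; scale() and the launch sizes are made-up names.

    /* Each of many concurrent threads issues one load and one store;
       while some warps wait on memory, the scheduler runs others. */
    __global__ void scale(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = 2.0f * in[i];
    }

    /* Host side: launching far more threads than physical cores keeps
       the memory pipeline busy (d_in and d_out are device buffers
       previously allocated with cudaMalloc):
           int threads = 256;
           int blocks  = (n + threads - 1) / threads;
           scale<<<blocks, threads>>>(d_in, d_out, n);
    */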
Therefore, we exploit the GPU's advantage in arithmetic logic, using the many cores NVIDIA makes available to help with heavy computation, and we program those cores through the parallel programming API that NVIDIA provides. Must we use the form NVIDIA provides to do GPU computing? Not really. We can use NVIDIA's CUDA, ATI's CTM, or Apple's proposed OpenCL (Open Computing Language). CUDA was developed earliest and has the most users at this stage, but CUDA supports only NVIDIA's own graphics cards; indeed, at this stage almost all graphics cards used for GPU computing are NVIDIA's. ATI developed its own language, CTM, and Apple proposed OpenCL (Open Computing Language), which has been supported by both NVIDIA and ATI, while ATI has since given up CTM for another language. Because of their graphics heritage, GPUs usually supported only single-precision floating-point operations, yet in science, precision is a very important indicator.
Therefore, the compute-oriented graphics cards introduced this year support double-precision floating-point operations.

B. CUDA Programming

CUDA (an acronym for Compute Unified Device Architecture) is a parallel computing [2] architecture developed by NVIDIA. CUDA is the computing engine in NVIDIA graphics processing units, or GPUs, that is accessible to software developers through industry-standard programming languages. The CUDA software stack is composed of several layers, as illustrated in Figure 2: a hardware driver, an application programming interface (API) and its runtime, and two higher-level mathematical libraries.
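A small kernel helps show the three abstractions the model is built on: a grid of thread blocks, per-block shared memory, and barrier synchronization. This block-wise sum is a generic illustration assuming a launch with 256 threads per block; it is not an example taken from the paper.

    /* Block-wise reduction: each thread block cooperatively sums its
       slice of the input and writes one partial sum. */
    __global__ void block_sum(const float *in, float *partial, int n)
    {
        __shared__ float buf[256];          /* per-block shared memory */
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;

        buf[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                    /* barrier: block sees full buf */

        /* Tree reduction within one thread block; blockDim.x must be
           256 here, matching the shared buffer and a power of two. */
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                buf[tid] += buf[tid + stride];
            __syncthreads();
        }

        if (tid == 0)
            partial[blockIdx.x] = buf[0];   /* one partial sum per block */
    }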