Parallel Programming Principle and Practice
Lecture 4: Parallel Programming Methodology
Jin, Hai
School of Computer Science and Technology
Huazhong University of Science and Technology

Outline
- Motivating Problems
- Steps in Creating a Parallel Program
- What a Simple Parallel Program Looks Like

MOTIVATING PROBLEMS (Parallel programming methodology)

Motivating Problems
- Simulating ocean currents: regular structure, scientific computing
- Simulating the evolution of galaxies: irregular structure, scientific computing
- Rendering scenes by ray tracing: irregular structure, computer graphics

Simulating Ocean Currents
- Model the ocean as two-dimensional grids
- Discretize in space and time: finer spatial and temporal resolution gives greater accuracy
- Many different computations per time step: set up and solve equations
- Concurrency across and within grid computations

Simulating Galaxy Evolution
- Simulate the interactions of many stars evolving over time
- Computing forces is expensive: O(n^2) for the brute-force approach
- Hierarchical methods take advantage of the force law F = G m1 m2 / r^2
- Many time steps, and plenty of concurrency across stars within each one

Rendering Scenes by Ray Tracing
- Shoot rays into the scene through the pixels in the image plane
- Follow their paths: rays bounce around as they strike objects and generate new rays, one ray tree per input ray
- The result is a color and opacity for that pixel
- Parallelism is across rays
- All three case studies have abundant concurrency

Creating a Parallel Program
- Assumption: a sequential algorithm is given (sometimes a very different algorithm is needed, but that is beyond our scope)
- Pieces of the job:
  - identify work that can be done in parallel
  - partition work and perhaps data among processes
  - manage data access, communication, and synchronization
  - note: work includes computation, data access, and I/O
- Main goal: speedup for a fixed problem (plus low programming effort and resource needs)

STEPS IN CREATING A PARALLEL PROGRAM (Parallel programming methodology)

Some Important Concepts
- Task
  - an arbitrary piece of undecomposed work in the parallel computation
  - executed sequentially; concurrency is only across tasks
  - e.g. a particle/cell in Barnes-Hut, a ray or ray group in Raytrace
  - fine-grained versus coarse-grained tasks
- Process (thread)
  - an abstract entity that performs the tasks assigned to it
  - processes communicate and synchronize to perform their tasks
- Processor
  - the physical engine on which a process executes
  - processes virtualize the machine to the programmer: first write the program in terms of processes, then map processes to processors

Limited Concurrency: Amdahl's Law
- Fundamental limitation on parallel speedup
- If s is the fraction of sequential execution that is inherently serial, then speedup <= 1/s

Amdahl's Law Example
- Two-phase computation over an n-by-n grid
  - Phase 1: perform an independent computation on each grid element (easy to parallelize)
  - Phase 2: add a value from each grid element into a global sum (more difficult to parallelize; serial by default)
- Sequential execution: both phases take n^2 time, 2n^2 in total

First Attempt at Parallelization
- Strategy
  - Phase 1: execute in parallel, time n^2/p
  - Phase 2: execute serially, time n^2
- Overall performance: speedup = 2n^2 / (n^2/p + n^2), i.e. no more than 2

Parallelizing Phase 2
- Trick: divide the second phase into two steps
  - Step 1: accumulate into a private sum during the sweep
  - Step 2: add the per-process private sums into the global sum
- Overall performance: parallel time = n^2/p + n^2/p + p, so speedup = 2n^2 / (2n^2/p + p), close to p when n >> p
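For reference, the two speedup expressions used above work out as follows; this only restates the n^2/p, n^2, and p terms already given on the slides, with no new quantities introduced:

\[
\text{Speedup}_{\text{first attempt}}
  = \frac{2n^2}{\frac{n^2}{p} + n^2}
  = \frac{2p}{p + 1} \le 2,
\qquad
\text{Speedup}_{\text{private sums}}
  = \frac{2n^2}{\frac{2n^2}{p} + p}
  \approx p \quad \text{when } n \gg p.
\]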
Concurrency Profiles
- A program cannot usually be divided into a serial part and a fully parallel part
- The area under the concurrency-profile curve is the total work done, i.e. the time with 1 processor
- The horizontal extent is a lower bound on execution time (with infinitely many processors)
- Speedup is the ratio of the two; the base case with serial fraction s is the 1/s bound above
- Amdahl's law applies to any overhead, not just limited concurrency

Steps in Creating a Parallel Program
- 4 steps: Decomposition, Assignment, Orchestration, Mapping
- Done by the programmer or by system software (compiler, runtime, etc.)
- The issues are the same, so assume the programmer does it all explicitly

Decomposition
- Break the computation into tasks to be divided among processes, i.e. identify concurrency and decide the level at which to exploit it
- Tasks may or may not be defined statically
  - tasks may become available dynamically
  - the number of available tasks may vary with time
- Goal: enough tasks to keep processes busy, but not too many
  - the number of tasks available at a time is an upper bound on the achievable speedup

Assignment
- Specify the mechanism for dividing work up among processes
  - e.g. which process computes forces on which stars, or which rays
  - together with decomposition, also called partitioning
  - goals: balance the workload, reduce communication and management cost
- Structured approaches usually work well
  - code inspection (parallel loops) or understanding of the application
  - well-known heuristics
  - static versus dynamic assignment
- As programmers, we worry about partitioning first
  - usually independent of architecture or programming model
  - but the cost and complexity of using primitives may affect decisions
- As architects, we assume the program does a reasonable job of it

Orchestration
- Main tasks
  - naming data
  - structuring communication
  - synchronization
  - organizing data structures and scheduling tasks temporally
- Goals
  - reduce the cost of communication and synchronization as seen by processors
  - preserve locality of data reference (including data structure organization)
  - schedule tasks so that dependences are satisfied early
  - reduce the overhead of parallelism management
- Closest to the architecture (and to the programming model and language)
  - choices depend heavily on the communication abstraction and the efficiency of its primitives
  - architects should provide appropriate primitives efficiently

Mapping
- After orchestration we already have a parallel program
- Two aspects of mapping
  - which processes will run on the same processor, if necessary
  - which process runs on which particular processor (mapping to a network topology)
- One extreme: space sharing
  - the machine is divided into subsets, with only one application at a time in a subset
  - processes can be pinned to processors, or left to the OS
- Another extreme: leave complete resource-management control to the OS
  - the OS uses the performance techniques we will discuss later
- The real world is between the two
  - the user specifies desires in some aspects; the system may ignore them
- Usually we adopt the view of one process per processor

Parallelizing Computation vs. Data
- The view above is centered on computation: computation is decomposed and assigned (partitioned)
- Partitioning data is often a natural view too
  - computation follows data: "owner computes"
  - grid example; data mining; High Performance Fortran (HPF)
- But it is not general enough
  - the distinction between computation and data is stronger in many applications
  - so we retain the computation-centric view; data access and communication are part of orchestration

High-level Goals
- High performance (speedup over the sequential program)
- But also low resource usage and low development effort
- Implications for algorithm designers and architects
  - algorithm designers: high performance, low resource needs
  - architects: high performance, low cost, reduced programming effort
  - e.g. gradually improving performance with programming effort may be preferable to a sudden threshold reached only after a large programming effort

WHAT PARALLEL PROGRAMS LOOK LIKE (Parallel programming methodology)

Parallelization of an Example Program
- The motivating problems all lead to large, complex programs
- Examine a simplified version of a piece of the Ocean simulation: an iterative equation solver
- Illustrate the parallel program in a low-level parallel language
  - C-like pseudocode with simple extensions for parallelism
  - exposes the basic communication and synchronization primitives that must be supported
  - this is the state of most real parallel programming today

Grid Solver Example
- Simplified version of the solver in the Ocean simulation
- Gauss-Seidel (near-neighbor) sweeps to convergence
  - the interior n-by-n points of an (n+2)-by-(n+2) grid are updated in each sweep
  - updates are done in place in the grid, and the diff from the previous value is computed
  - partial diffs are accumulated into a global diff at the end of every sweep
  - check whether the error has converged (to within a tolerance parameter); if so, exit the solver, otherwise do another sweep

Decomposition
- A simple way to identify concurrency is to look at loop iterations: dependence analysis; if there is not enough concurrency there, look further
- Not much concurrency at this level (all loops are sequential)
- Examine the fundamental dependences, ignoring loop structure
  - concurrency O(n) along anti-diagonals, serialization O(n) along the diagonal
  - retain the loop structure and use point-to-point synchronization: too many synchronization operations
  - restructure the loops and use global synchronization: imbalance and too much synchronization

Exploit Application Knowledge
- Reorder the grid traversal: red-black ordering
  - a different ordering of updates may converge more quickly or more slowly
  - the red sweep and the black sweep are each fully parallel
  - global synchronization between them (conservative but convenient)
- Ocean uses red-black; we use a simpler, asynchronous ordering to illustrate
  - no red-black; simply ignore the dependences within a sweep
  - the sequential order is the same as the original; the parallel program is nondeterministic

Decomposition Only
- Decomposition into elements: degree of concurrency n^2
- To decompose into rows instead, keep the inner loop over the columns of a row sequential: degree of concurrency n
- for_all leaves the assignment to the system, but there is an implicit global synchronization at the end of a for_all loop

Assignment
- Static assignments (given the decomposition into rows)
  - block assignment of rows: each process gets a contiguous block of n/p rows
  - cyclic assignment of rows: process i is assigned rows i, i+p, and so on
- Dynamic assignment
  - get a row index, work on the row, get a new row, and so on
- Static assignment into rows reduces concurrency (from n to p)
  - block assignment reduces communication by keeping adjacent rows together
- Let's dig into orchestration under three programming models

Data Parallel Solver
- (The data-parallel version of the solver appeared as a code figure in the original slides and is not reproduced here.)

Shared Address Space Solver
- Single Program Multiple Data (SPMD)
- Assignment is controlled by the values of variables used as loop bounds
- (The SPMD solver code appeared as a figure in the original slides; a hedged sketch follows the notes below.)

Notes on SAS Program
- SPMD: not lockstep, and not necessarily even the same instructions
- Assignment is controlled by the values of variables used as loop bounds
  - a unique pid per process is used to control the assignment
- The "done" condition is evaluated redundantly by all processes
- The code that does the update is identical to the sequential program
  - each process has a private mydiff variable
- The most interesting special operations are for synchronization
  - accumulations into the shared diff have to be mutually exclusive
  - why the need for all the barriers?
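The SPMD solver code itself only survives as a figure reference, so the following is a minimal sketch of the program the notes describe, written in C with POSIX threads standing in for the lecture's abstract processes and its LOCK/UNLOCK and BARRIER primitives. The grid size, thread count, tolerance, boundary values, and the name solve_worker are illustrative assumptions rather than the course's code.

/* Build: cc -O2 -pthread sas_solver.c -lm */
#include <math.h>
#include <pthread.h>
#include <stdio.h>

#define N      256    /* interior grid is N x N; boundary ring makes it (N+2)^2 */
#define NPROCS 4      /* number of worker threads; assumed to divide N evenly   */
#define TOL    1e-3   /* convergence tolerance (illustrative value)             */

static double A[N + 2][N + 2];   /* shared grid, updated in place               */
static double diff;              /* shared accumulated difference per sweep     */
static pthread_mutex_t  diff_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_barrier_t bar;

static void *solve_worker(void *arg)
{
    int pid   = (int)(long)arg;
    int rows  = N / NPROCS;      /* block assignment of rows to processes       */
    int first = 1 + pid * rows;
    int last  = first + rows - 1;
    int done  = 0;               /* "done" is evaluated redundantly by all      */

    while (!done) {
        double mydiff = 0.0;     /* private partial accumulation                */

        /* Sweep over my block of rows.  As in the lecture's asynchronous
         * version, dependences within a sweep are simply ignored, so the
         * parallel program is nondeterministic (and races on boundary rows).
         * Near-neighbour averaging update; the weights are illustrative.      */
        for (int i = first; i <= last; i++)
            for (int j = 1; j <= N; j++) {
                double temp = A[i][j];
                A[i][j] = 0.2 * (A[i][j] + A[i][j - 1] + A[i - 1][j]
                                 + A[i][j + 1] + A[i + 1][j]);
                mydiff += fabs(A[i][j] - temp);
            }

        /* Accumulation into the shared diff must be mutually exclusive.       */
        pthread_mutex_lock(&diff_lock);
        diff += mydiff;
        pthread_mutex_unlock(&diff_lock);

        pthread_barrier_wait(&bar);   /* all contributions are in diff          */
        done = (diff / ((double)N * N) < TOL);
        pthread_barrier_wait(&bar);   /* everyone has tested before the reset   */
        if (pid == 0)
            diff = 0.0;
        pthread_barrier_wait(&bar);   /* reset completes before the next sweep  */
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NPROCS];

    for (int j = 0; j < N + 2; j++)   /* illustrative boundary condition        */
        A[0][j] = 1.0;

    pthread_barrier_init(&bar, NULL, NPROCS);
    for (long p = 0; p < NPROCS; p++)
        pthread_create(&tid[p], NULL, solve_worker, (void *)p);
    for (int p = 0; p < NPROCS; p++)
        pthread_join(tid[p], NULL);

    printf("converged, centre value %.4f\n", A[N / 2][N / 2]);
    pthread_barrier_destroy(&bar);
    return 0;
}

The three barrier waits answer the question above: every contribution must be in diff before anyone tests it, everyone must finish testing before diff is reset, and the reset must complete before the next sweep's accumulations begin.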
Need for Mutual Exclusion
- Code each process executes:
  - load the value of diff into register r1
  - add register r2 to register r1
  - store the value of register r1 into diff
- A possible interleaving of these instruction sequences from two processes can lose one of the updates
- We need each set of operations to be atomic (mutually exclusive)

Mutual Exclusion
- Provided by LOCK-UNLOCK around a critical section
  - the set of operations we want to execute atomically
  - the implementation of LOCK/UNLOCK must guarantee mutual exclusion
- Can lead to significant serialization if contended
  - especially since we expect non-local accesses in the critical section
  - another reason to use a private mydiff for partial accumulation

Global Event Synchronization
- BARRIER(nprocs): wait here until nprocs processes get here
  - built using lower-level primitives
- Global sum example: wait for all processes to accumulate before using the sum
- Often used to separate phases of computation
  - a conservative way of preserving dependences, but easy to use
- WAIT_FOR_END(nprocs-1)

Point-to-Point Event Synchronization
- One process notifies another of an event so that it can proceed
  - common example: producer-consumer (bounded buffer)
  - concurrent programming on a uniprocessor: semaphores
  - shared address space parallel programs: semaphores, or ordinary variables used as flags (busy-waiting or spinning)

Group Event Synchronization
- A subset of the processes is involved
  - can use flags or barriers (involving only the subset)
  - concept of producers and consumers
- Major types
  - single-producer, multiple-consumer
  - multiple-producer, single-consumer
  - multiple-producer, multiple-consumer

Message Passing Grid Solver
- Cannot declare A to be a shared array any more
- Need to compose it logically from per-process private arrays
  - usually allocated in accordance with the assignment of work: a process assigned a set of rows allocates them locally
- Entire rows are transferred between traversals
- Structurally similar to the SAS program (e.g. SPMD), but the orchestration is different
  - data structures and data access/naming
  - communication
  - synchronization
- (The message-passing solver code appeared as a figure in the original slides; a hedged sketch follows the ghost-row notes below.)

Notes on Message Passing Program
- Use of ghost rows
- A receive does not transfer data, a send does
  - unlike SAS, which is usually receiver-initiated (a load fetches the data)
- Communication is done at the beginning of an iteration, so there is no asynchrony
- Communication is in whole rows, not one element at a time
- The core is similar, but indices and bounds are in local rather than global space
- Synchronization is through sends and receives
  - update of the global diff and event synchronization for the done condition
  - locks and barriers could be implemented with messages
- REDUCE and BROADCAST library calls can be used to simplify the code

Ghost Points and Ghost Row
- Each process keeps local copies of the boundary rows of its neighbors (the ghost rows); these are refreshed by row transfers before each sweep (the original slide illustrates this with a figure)
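The message-passing solver likewise survives only as a figure reference. Below is a minimal sketch of the same computation with explicit ghost-row exchange and a reduction for the global diff, using MPI as an assumed stand-in for the lecture's SEND/RECEIVE and REDUCE primitives; the grid size, tolerance, boundary values, and the assumption that the number of ranks divides N evenly are illustrative.

/* Build: mpicc -O2 mp_solver.c -lm ; run: mpirun -np 4 ./a.out */
#include <math.h>
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N   256     /* interior grid is N x N (illustrative size)      */
#define TOL 1e-3    /* convergence tolerance (illustrative value)      */

int main(int argc, char **argv)
{
    int pid, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int rows = N / nprocs;   /* block of rows owned locally; assumes nprocs divides N */
    int up   = (pid == 0)          ? MPI_PROC_NULL : pid - 1;
    int down = (pid == nprocs - 1) ? MPI_PROC_NULL : pid + 1;

    /* Local allocation: my rows in local index space 1..rows, plus a ghost
     * row above (index 0) and below (index rows+1). */
    double (*A)[N + 2] = calloc((size_t)rows + 2, sizeof *A);
    if (pid == 0)            /* illustrative hot boundary held by rank 0 */
        for (int j = 0; j < N + 2; j++)
            A[0][j] = 1.0;

    int done = 0, sweeps = 0;
    while (!done) {
        /* Exchange ghost rows with neighbours; transfers to or from
         * MPI_PROC_NULL at the top and bottom of the grid are no-ops.  */
        MPI_Sendrecv(A[1],        N + 2, MPI_DOUBLE, up,   0,
                     A[rows + 1], N + 2, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(A[rows],     N + 2, MPI_DOUBLE, down, 1,
                     A[0],        N + 2, MPI_DOUBLE, up,   1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Sweep over my rows in local index space, accumulating a private diff.
         * Near-neighbour averaging update; the weights are illustrative.      */
        double mydiff = 0.0;
        for (int i = 1; i <= rows; i++)
            for (int j = 1; j <= N; j++) {
                double temp = A[i][j];
                A[i][j] = 0.2 * (A[i][j] + A[i][j - 1] + A[i - 1][j]
                                 + A[i][j + 1] + A[i + 1][j]);
                mydiff += fabs(A[i][j] - temp);
            }

        /* Global sum of the private diffs; every rank gets the result and
         * evaluates the convergence test itself (REDUCE and BROADCAST in one). */
        double diff;
        MPI_Allreduce(&mydiff, &diff, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        done = (diff / ((double)N * N) < TOL);
        sweeps++;
    }

    if (pid == 0)
        printf("converged after %d sweeps\n", sweeps);
    free(A);
    MPI_Finalize();
    return 0;
}

Using MPI_Sendrecv pairs each send with its matching receive, which avoids the careful send/receive ordering (or the deadlock under purely synchronous sends) discussed under Send and Receive Alternatives below.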
Send and Receive Alternatives
- Functionality can be extended: stride, scatter-gather, groups
- Semantic flavors, based on when control is returned
  - affect when data structures or buffers can be reused at either end
  - affect event synchronization (mutual exclusion: only one process touches the data)
  - affect ease of programming and performance
- Synchronous messages provide built-in synchronization through the send-receive match
  - separate event synchronization is needed with asynchronous messages
- With synchronous messages, our solver code as written is deadlocked

Orchestration: Comparison
- Shared address space
  - shared and private data are explicitly separate
  - communication is implicit in access patterns
  - no correctness need for data distribution
  - synchronization is via atomic operations on shared data
  - synchronization is explicit and distinct from data communication
- Message passing
  - data distribution among local address spaces is needed
  - no explicit shared structures (implicit in communication patterns)
  - communication is explicit
  - synchronization is implicit in the communication (at least in the synchronous case)

Summary in Grid Solver Program
- Decomposition and assignment are similar in SAS and message passing
- Orchestration is different
  - data structures, data access/naming, communication, synchronization
- The requirements for performance are another story

References
- The content in this chapter comes from Carnegie Mellon University's public course Parallel Computer Architecture and Programming (CS 418), http://www.cs.cmu.edu/afs/cs/academic/class/15418-s11/public/lectures/