Computer Architecture Homework Exercises
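The speedup exercises (1.1–1.3) below all apply Amdahl's law with one or more enhancements, only one of which is active at a time. As a sanity check on the arithmetic, the formula can be coded directly (a minimal sketch; the helper name `amdahl` is our own, not from the exercises):

```python
def amdahl(fractions, speedups):
    """Overall speedup when enhancement i is usable for fractions[i]
    of the original execution time and runs speedups[i] times faster.
    Only one enhancement is active at any moment."""
    unenhanced = 1.0 - sum(fractions)          # time with no enhancement in use
    enhanced = sum(f / s for f, s in zip(fractions, speedups))
    return 1.0 / (unenhanced + enhanced)

# Exercise 1.3: 20% sequential, 80% spread across 4 nodes
print(round(amdahl([0.8], [4]), 2))              # → 2.5

# Exercise 1.1(3), two enhancements: 1+3 beats 2+3
print(round(amdahl([0.15, 0.70], [30, 15]), 2))  # → 4.96
print(round(amdahl([0.15, 0.70], [20, 15]), 2))  # → 4.9
```

The same helper reproduces the other values in 1.1 and 1.2 by substituting the corresponding fractions and speedups.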
1.1 Three enhancements with the following speedups are proposed for a new architecture:
Speedup1 = 30, Speedup2 = 20, Speedup3 = 15
Only one enhancement is usable at a time.
(1) If enhancements 1 and 2 are each usable for 25% of the time, what fraction of the time must enhancement 3 be used to achieve an overall speedup of 10?
(2) Assume the enhancements can be used 25%, 35%, and 10% of the time for enhancements 1, 2, and 3, respectively. For what fraction of the reduced execution time is no enhancement in use?
(3) Assume, for some benchmark, the possible fraction of use is 15% for each of enhancements 1 and 2 and 70% for enhancement 3. We want to maximize performance. If only one enhancement can be implemented, which should it be? If two enhancements can be implemented, which should be chosen?

Answer:
(1) Let x be the fraction of the time enhancement 3 must be used to achieve an overall speedup of 10. By Amdahl's law,
Speedup_overall = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
10 = 1 / ((1 - 25% - 25% - x) + 25%/30 + 25%/20 + x/15)
Solving gives x ≈ 45%.
(2) Let Time_before be the total execution time before the three enhancements are applied. The time during which no enhancement is in use is
Time_no = (1 - 25% - 35% - 10%) × Time_before = 30% × Time_before
The total execution time after the three enhancements are applied is
Time_after = Time_no + (25%/30 + 35%/20 + 10%/15) × Time_before = 0.3325 × Time_before
So Time_no / Time_after = 0.30 / 0.3325 ≈ 90.2%.
(3) By Speedup_overall = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced):
If only one enhancement can be implemented:
Speedup_overall,1 = 1 / (1 - 15% + 15%/30) = 1.170
Speedup_overall,2 = 1 / (1 - 15% + 15%/20) = 1.166
Speedup_overall,3 = 1 / (1 - 70% + 70%/15) = 2.88
So we should select enhancement 3 to maximize performance.
If two enhancements can be implemented:
Speedup_overall,1+2 = 1 / (1 - 15% - 15% + 15%/30 + 15%/20) = 1.40
Speedup_overall,1+3 = 1 / (1 - 15% - 70% + 15%/30 + 70%/15) = 4.96
Speedup_overall,2+3 = 1 / (1 - 15% - 70% + 15%/20 + 70%/15) = 4.90
So we should select enhancements 1 and 3 to maximize performance.

1.2 Suppose there is a graphics operation that accounts for 10% of execution time in an application, and by adding
special hardware we can speed this up by a factor of 18. Furthermore, we could use twice as much hardware and make the graphics operation run 36 times faster. Is it worth exploring such a further architectural change? Give your reasoning.

Answer: By Speedup_overall = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced):
Speedup_overall,1 = 1 / (1 - 10% + 10%/18) = 1 / (0.9 + 0.00556) = 1.104
Speedup_overall,2 = 1 / (1 - 10% + 10%/36) = 1 / (0.9 + 0.00278) = 1.108
Doubling the hardware improves the overall speedup only from 1.104 to 1.108, so it is not worth exploring such a further architectural change.

1.3 In many practical applications that demand a real-time response, the computational workload W is often fixed. As the number of processors increases in a parallel computer, the fixed workload is distributed to more processors for parallel execution. Assume 20 percent of W must be executed sequentially, and 80 percent can be executed by 4 nodes simultaneously. What is the fixed-load speedup?

Answer:
Speedup_overall = W / (W × 20% + (W × 80%)/4) = 1 / (0.2 + 0.2) = 2.5
So the fixed-load speedup is 2.5.

2.1 There is a model machine with nine instructions, whose frequencies are ADD (0.30), SUB (0.24), JOM (0.06), STO (0.07), JMP (0.07), SHR (0.02), CIL (0.03), CLA (0.20), and STP (0.01), respectively. There are several GPRs in the machine. Memory is byte addressable, with accessed addresses aligned, and the memory word width is 16 bits. Suppose the nine instructions have the following characteristics:
· Two-operand instructions
· Two kinds of instruction length
· Extended coding
· Shorter instruction operand format: R(register)-R(register)
· Longer instruction operand format: R(register)-M(memory)
· With displacement memory addressing mode
A. Encode the nine instructions with Huffman coding, and give the average code length.
B. Design the practical instruction codes, and give the average code length.
C. Write the two instruction word formats in detail.
D.
What is the maximum offset for accessing a memory address?

Answer:
(A) Huffman coding by Huffman tree:
· ADD 30%: 01
· SUB 24%: 11
· CLA 20%: 10
· JOM 6%: 0001
· STO 7%: 0011
· JMP 7%: 0010
· SHR 2%: 000001
· CIL 3%: 00001
· STP 1%: 000000
So the average code length is sum over i = 1..9 of p_i × l_i = 2.61 bits.
(B) Two instruction lengths with extended coding (the short opcodes use 01, 10, and 11; the code 00 is the escape that extends to 5 bits for the six infrequent instructions):
· ADD 30%: 01
· SUB 24%: 11
· CLA 20%: 10
· JOM 6%: 00000
· STO 7%: 00001
· JMP 7%: 00010
· SHR 2%: 00011
· CIL 3%: 00100
· STP 1%: 00101
So the average code length is 2 × (0.30 + 0.24 + 0.20) + 5 × (0.06 + 0.07 + 0.07 + 0.02 + 0.03 + 0.01) = 2.78 bits.
(C) Shorter instruction format (8 bits):
| Opcode (2 bits) | Register (3 bits) | Register (3 bits) |
Longer instruction format (16 bits):
| Opcode (5 bits) | Register (3 bits) | Register (3 bits) | Offset (5 bits) |
(D) With a 5-bit displacement field, the maximum offset for accessing a memory address is 2^5 = 32 bytes.

3.1 Identify all of the data dependences in the following code. Which dependences are data hazards that will be resolved via forwarding?
ADD R2,R5,R4
ADD R4,R2,R5
SW  R5,100(R2)
ADD R3,R2,R4

Answer:
· True (RAW) dependences: on R2 from instruction 1 to instructions 2, 3, and 4; on R4 from instruction 2 to instruction 4.
· Antidependence (WAR): on R4, from instruction 1 (which reads R4) to instruction 2 (which writes it).
The RAW dependences are data hazards in the pipeline, and all of them are resolved via forwarding; the WAR dependence causes no hazard in the five-stage pipeline because registers are read in ID, well before the later instruction writes in WB.

3.2 How could we modify the following code to make use of a delayed branch slot?
Loop: LW   R2,100(R3)
      ADDI R3,R3,#4
      BEQ  R3,R4,Loop

Answer: Hoist the first load above the loop and move the load into the delay slot:
      LW   R2,100(R3)
Loop: ADDI R3,R3,#4
      BEQ  R3,R4,Loop
      LW   R2,100(R3)   ; delayed branch slot

3.3 Consider the following reservation table for a four-stage pipeline with a clock cycle t = 20 ns.
A. What are the forbidden latencies and the initial collision vector?
B. Draw the state transition diagram for scheduling the pipeline.
C. Determine the MAL associated with the shortest greedy cycle.
D. Determine the pipeline maximum throughput corresponding to the MAL and the given t.

Reservation table (X marks stage usage):
Cycle:   1  2  3  4  5  6
s1       X              X
s2          X     X
s3             X
s4                X  X

Answer:
A. The forbidden latencies are F = {1, 2, 5}, and the initial collision vector is C = (10011).
B. State transition diagram: from the initial state (10011) the permissible latencies are 3, 4, and any latency ≥ 6, and each of them leads back to (10011); the diagram is therefore a single state with self-loops labeled 3, 4, and 6+.
C. MAL (minimal average latency) = 3 clock cycles, achieved by the greedy cycle (3).
D. The pipeline maximum throughput is Hk = 1/(3 × 20 ns) ≈ 16.7 million tasks per second.

3.4 Using the following code fragment:
Loop: LW   R1,0(R2)    ; load R1 from address 0+R2
      ADDI R1,R1,#1    ; R1 = R1 + 1
      SW   0(R2),R1    ; store R1 at address 0+R2
      ADDI R2,R2,#4    ; R2 = R2 + 4
      SUB  R4,R3,R2    ; R4 = R3 - R2
      BNEZ R4,Loop     ; branch to Loop if R4 != 0
Assume that the initial value of R3 is R2 + 396. Throughout this exercise use the classic RISC five-stage integer pipeline and assume all memory accesses take 1 clock cycle.
A.
Show the timing of this instruction sequence for the RISC pipeline without any forwarding or bypassing hardware, but assuming that a register read and a write in the same clock cycle "forward" through the register file. Assume that the branch is handled by flushing the pipeline. If all memory references take 1 cycle, how many cycles does this loop take to execute?
B. Show the timing of this instruction sequence for the RISC pipeline with normal forwarding and bypassing hardware. Assume that the branch is handled by predicting it as not taken. If all memory references take 1 cycle, how many cycles does this loop take to execute?
C. Assume the RISC pipeline with a single-cycle delayed branch and normal forwarding and bypassing hardware. Schedule the instructions in the loop, including the branch delay slot. You may reorder instructions and modify the individual instruction operands, but do not undertake other loop transformations that change the number or opcode of the instructions in the loop. Show a pipeline timing diagram and compute the number of cycles needed to execute the entire loop.

Answer:
A.
· The loop iterates 396/4 = 99 times.
· Go through one complete iteration of the loop and the first instruction of the next iteration.
· Total length = the length of iterations 1 through 98 (the first 98 iterations are all of the same length) + the length of the last iteration.
· We have assumed the version of DLX described in Figure 3.21 (page 97) of the book, which resolves branches in MEM.
· From that figure, the second iteration begins 17 clocks after the first, and the last iteration takes 18 cycles to complete.
· Total length = 17 × 98 + 18 = 1684 clock cycles.
B.
· From the figure, the second iteration begins 10 clocks after the first, and the last iteration takes 11 cycles to complete.
· Total length = 10 × 98 + 11 = 991 clock cycles.
C.
The original loop:
Loop: LW   R1,0(R2)    ; load R1 from address 0+R2
      ADDI R1,R1,#1    ; R1 = R1 + 1
      SW   0(R2),R1    ; store R1 at address 0+R2
      ADDI R2,R2,#4    ; R2 = R2 + 4
      SUB  R4,R3,R2    ; R4 = R3 - R2
      BNEZ R4,Loop     ; branch to Loop if R4 != 0
is reordered to:
Loop: LW   R1,0(R2)    ; load R1 from address 0+R2
      ADDI R2,R2,#4    ; R2 = R2 + 4
      SUB  R4,R3,R2    ; R4 = R3 - R2
      ADDI R1,R1,#1    ; R1 = R1 + 1
      BNEZ R4,Loop     ; branch to Loop if R4 != 0
      SW   -4(R2),R1   ; store R1 at address R2-4 (branch delay slot)
· From the timing diagram, the second iteration begins 6 clocks after the first, and the last iteration takes 10 cycles to complete.
· Total length = 6 × 98 + 10 = 598 clock cycles.
The schedule is derived as follows. With normal forwarding, the original order has three stalls per iteration:
Loop: LW   R1,0(R2)
      stall
      ADDI R1,R1,#1
      SW   0(R2),R1
      ADDI R2,R2,#4
      SUB  R4,R3,R2
      stall
      BNEZ R4,Loop
      stall
Moving ADDI R2,R2,#4 into the load delay (and adjusting the SW offset to -4) removes the load stall:
Loop: LW   R1,0(R2)
      ADDI R2,R2,#4
      ADDI R1,R1,#1
      SW   -4(R2),R1
      SUB  R4,R3,R2
      stall
      BNEZ R4,Loop
      stall
Finally, moving SUB ahead of ADDI R1 (separating it from BNEZ) and placing the SW in the branch delay slot removes the remaining stalls, yielding the reordered loop shown above.

3.5 Consider the following reservation table for a four-stage pipeline.
A. What are the forbidden latencies and the initial collision vector?
B. Draw the state transition diagram for scheduling the pipeline.
C. Determine the MAL associated with the shortest greedy cycle.
D. Determine the pipeline maximum throughput corresponding to the MAL.
E. According to the shortest greedy cycle, put six tasks into the pipeline and determine the pipeline's actual throughput.

Reservation table (X marks stage usage):
Cycle:   1  2  3  4  5  6  7
s1       X                 X
s2          X           X
s3             X     X
s4                X

Answer:
A. The forbidden latencies are F = {2, 4, 6}, and the initial collision vector is C = (101010).
B. State transition diagram: from the initial state (101010) the permissible latencies are 1, 3, 5, and any latency ≥ 7. Latency 1 leads to (111111), from which only latencies ≥ 7 (returning to the initial state) are permissible; latency 3 leads to (101111); latency 5 leads to (101011); any latency ≥ 7 returns to the initial state. The greedy cycle from the initial state is (1, 7).
C. The simple cycles and their average latencies are:
Cycle    Average latency
(1,7)    4
(3,5)    4
(5,3)    4
(5)      5
(3,7)    5
(5,7)    6
(7)      7
The MAL associated with the shortest greedy cycle is 4 clock cycles.
D. The pipeline maximum throughput corresponding to the MAL is Hk = 1/(4 clock cycles).
E.
According to the shortest greedy cycle, put six tasks into the pipeline. The best schedule is the greedy cycle (1,7), because:
· With the (1,7) schedule, the actual throughput is Hk = 6/(1+7+1+7+1+7) = 6/(24 cycles).
· With the (3,5) schedule, Hk = 6/(3+5+3+5+3+7) = 6/(26 cycles).
· With the (5,3) schedule, Hk = 6/(5+3+5+3+5+7) = 6/(28 cycles).
(In each sum the final 7 cycles are the time for the last task to drain the pipeline.)

4.1 The following C program is run (with no optimizations) on a machine with a cache that has four-word (16-byte) blocks and holds 256 bytes of data:

int i, j, c, stride, array[256];
for (i = 0; i < 10000; i++)
    for (j = 0; j < 256; j = j + stride)
        c = array[j] + 5;

If we consider only the cache activity generated by references to the array, and we assume that integers are words, what is the expected miss rate when the cache is direct-mapped and stride = 132? How about if stride = 131? Would either of these change if the cache were two-way set associative?

Answer:
If stride = 132 and the cache is direct-mapped (pp. 201, 211):
· The number of blocks in the cache is 256/16 = 16.
· The block address of array[0] is 0/16 = 0, so array[0] maps to cache block 0 mod 16 = 0.
· The block address of array[132] is 132 × 4/16 = 33, so array[132] maps to cache block 33 mod 16 = 1.
The two elements map to different blocks, so only the first access to each misses:
miss rate = 2/(2 × 10000) = 1/10000.
If stride = 131 and the cache is direct-mapped:
· The block address of array[0] is 0, which maps to cache block 0.
· The block address of array[131] is 131 × 4/16 = 32, which maps to cache block 32 mod 16 = 0.
The two elements conflict in block 0, so every access misses:
miss rate = (2 × 10000)/(2 × 10000) = 1.
If stride = 132 and the cache is two-way set associative (pp. 224-227, 211):
· The number of sets in the cache is 16/2 = 8.
· array[0] (block address 0) maps to set 0 mod 8 = 0; array[132] (block address 33) maps to set 33 mod 8 = 1.
So, as before, miss rate = 2/(2 × 10000) = 1/10000.
If stride = 131 and the cache is two-way set associative:
· The number of blocks in the cache
is 256/16 = 16, and the number of sets is 16/2 = 8.
· The block address of array[0] is 0/16 = 0, which maps to set 0 mod 8 = 0.
· The block address of array[131] is 131 × 4/16 = 32, which maps to set 32 mod 8 = 0.
Both elements map to set 0, but a two-way set holds both blocks at once, so only the first access to each misses:
miss rate = 2/(2 × 10000) = 1/10000.

4.2 Consider a virtual memory system with the following properties:
· 40-bit virtual byte address
· 16-KB pages
· 36-bit physical byte address
(1) What is the total size of the page table for each process on this machine, assuming that the valid, protection, dirty, and use bits take a total of 4 bits and that all the virtual pages are in use? (Assume that disk addresses are not stored in the page table.)
(2) Assume that the virtual memory system is implemented with a two-way set-associative TLB with a total of 256 TLB entries. Show the virtual-to-physical mapping with a figure. Make sure to label the width of all fields and signals.

Answer:
(1) A 16-KB page gives a 14-bit page offset, so there are 2^(40-14) = 2^26 virtual pages, and each page-table entry holds a (36-14) = 22-bit physical page number plus the 4 status bits. The total size of the page table for each process is therefore:
2^(40-14) × (4 + (36-14)) bits = 2^26 × 26 bits = 208 MB.
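The page-table arithmetic in 4.2(1) can be checked with a short script (a sketch; the variable names are our own):

```python
# Exercise 4.2(1): total page-table size per process
virtual_bits, physical_bits = 40, 36
page_bits = 14                                   # 16-KB pages -> 14 offset bits

entries = 2 ** (virtual_bits - page_bits)        # one entry per virtual page
entry_bits = (physical_bits - page_bits) + 4     # PPN + valid/protection/dirty/use
size_bytes = entries * entry_bits // 8

print(entries)               # → 67108864 (2^26 virtual pages)
print(size_bytes // 2**20)   # → 208 (MB)
```

The same style of check works for the cache arithmetic in 4.1: for example, `(132 * 4 // 16) % 16` gives block 1 for `array[132]` in the direct-mapped case, while `(131 * 4 // 16) % 16` gives block 0, confirming the conflict with `array[0]`.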