高级操作系统高级操作系统 (34).pdf
第 12 讲:Scalable Synchronization on Shared-MemoryMultiprocessors-IIMultiprocessor Memory Model&Multiprocessor ProgrammingRecapCache Coherent in Multi-processorMSI 一致性协议MESI 一致性协议ref:Some info are fromPaul McKenney(IBM)Tom Hart(University of Toronto),Frans Kaashoek(MIT),Daniel J.Sorin”A Primer on Memory Consistency and Cache Coherence”,Fabian Giesen”Cache coherency primer”,Mingyu Gao(Tsinghua),Yubin Xia(SJTU)”The C/C+Memory Model:Overview and Formalization”,Mark Batty,etc.”C+11 Memory ConsistencyModel”,Sebastian Gerstenberg;”HOW UBISOFT MONTREAL DEVELOPS GAMES FOR MULTICORE Before&After C+11”,Jeff Preshing;etc.RecapMemory Consistency in Multi-processorSequential ConsistencyTotal Store Order ConsistencyAcquire/Release ConsistencyRelaxed/Weak Consistency.Multiprocessor Programming陈渝(清华大学)第 12 讲2020 年 5 月 9 日4/55.Ways to Achieve Synchronizes-WithC+11/17/20,Go,RUST,Java,陈渝(清华大学)第 12 讲2020 年 5 月 9 日5/55.Real Hardware/CompilerReal hardware doesnt run the code that you wrote.Real compiler doesnt produce the code that you wrote.陈渝(清华大学)第 12 讲2020 年 5 月 9 日6/55.Real Hardware/CompilerReal hardware doesnt run the code that you wrote.Real compiler doesnt produce the code that you wrote.http:/ 12 讲2020 年 5 月 9 日7/55.Real Hardware/CompilerReal hardware doesnt run the code that you wrote.Real compiler doesnt produce the code that you wrote.根据英特尔的规范:在本示例的末尾,r1 和 r2 都等于 0 是合法的陈渝(清华大学)第 12 讲2020 年 5 月 9 日8/55.Real Hardware/CompilerReal hardware doesnt run the code that you wrote.Real compiler doesnt produce the code that you wrote.陈渝(清华大学)第 12 讲2020 年 5 月 9 日9/55.Real Hardware/CompilerReal hardware doesnt run the code that you wrote.Real compiler doesnt produce the code that you wrote.https:/ 12 讲2020 年 5 月 9 日10/55.Real Hardware/CompilerReal hardware doesnt run the code that you wrote.Real compiler doesnt produce the code that you wrote.陈渝(清华大学)第 12 讲2020 年 5 月 9 日11/55.Real Hardware/CompilerReal hardware doesnt run the code that you wrote.Real compiler doesnt produce the code that you wrote.陈渝(清华大学)第 12 讲2020 年 5 月 9 日12/55.Real Hardware/CompilerReal hardware doesnt run the code that you wrote.Real compiler doesnt produce the code that you wrote.Dekkers Algorithm,g+-O2陈渝(清华大学)第 12 讲2020 年 5 月 9 日13/55.C+11 Memory ModelHistoryIn 2011,new versions of the ISO standards for C andC+,informally known as C11 and C+11,wereratified.These standards define a memory model for C/C+Support for this model has recently become availablein popular compilers(GCC 4.4,Intel C+13.0,MSVC11.0,Clang 3.1).陈渝(清华大学)第 12 讲2020 年 5 月 9 日14/55.C+11 Memory ModelWhy C+11 Memory Model在 C+11 之前,其实是没有定义内存模型的,我们所使用的都是一些处理器/编译器暴露的同步原语,比如 GCC 的内联汇编,内建函数之类的,有着不同的实现。C+11 在标准库中引入了 memory model 的意义在于在 High Level Language 层面实现对在多处理器中多线程共享内存的访问,实现跨编译器,OS 和硬件的差异性。陈渝(清华大学)第 12 讲2020 年 5 月 9 日15/55.C+11 Memory ModelHappens-beforeAn important fundamental concept inunderstanding the memory modelA guarantee that memory writes by one specificstatement are visible to another specific statement陈渝(清华大学)第 12 讲2020 年 5 月 9 日16/55.C+11 Memory ModelHappens-beforeint A=0;int B=0;void foo()A=B+1;/(1)B=1;/(2)陈渝(清华大学)第 12 讲2020 年 5 月 9 日17/55.C+11 Memory ModelHappens-before陈渝(清华大学)第 12 讲2020 年 5 月 9 日18/55.C+11 Memory ModelHappens-beforeint A,B;void foo()A=B+1;B=0;陈渝(清华大学)第 12 讲2020 年 5 月 9 日19/55.C+11 Memory ModelHappens-beforeGCC 4.6.1$gcc-S-masm=intel foo.c$cat foo.s.moveax,DWORD PTR _B(redo this at home.)addeax,1movDWORD PTR _A,eaxmovDWORD PTR _B,0.陈渝(清华大学)第 12 讲2020 年 5 月 9 日20/55.C+11 Memory ModelHappens-beforeGCC 4.6.1$gcc-O2-S-masm=intel foo.c$cat foo.s.moveax,DWORD PTR BmovDWORD PTR B,0addeax,1movDWORD PTR A,eax.陈渝(清华大学)第 12 讲2020 年 5 月 9 日21/55.C+11 Memory ModelHappens-beforeGCC 4.6.1int A,B;void foo()A=B+1;asm volatile(:memory);B=0;陈渝(清华大学)第 12 讲2020 年 5 月 9 日22/55.C+11 Memory ModelHappens-beforeGCC 4.6.1$gcc-O2-S-masm=intel foo.c$cat foo.s.moveax,DWORD PTR _Baddeax,1movDWORD PTR _A,eaxmovDWORD PTR _B,0.陈渝(清华大学)第 12 讲2020 年 5 月 9 日23/55.C+11 Memory ModelHappens-before/x86#define COMPILER_BARRIER()asm volatile(:memory)/PowerPC#define RELEASE_FENCE()asm volatile(lwsync:memory)=int Value;std:atomic IsPublished(0);void sendValue(int x)Value=x;/-reordering is prevented here!IsPublished.store(1,std:memory_order_release);陈渝(清华大学)第 12 讲2020 年 5 月 9 日24/55.C+11 Memory Modelstd:atomicx.load(memory order)x.store(T,memory order)Concurrent accesses on atomic locations do not race.The memory order argument specifies orderingconstraints between atomic and non-atomic memoryaccesses in different threads.陈渝(清华大学)第 12 讲2020 年 5 月 9 日25/55.C+11 Memory ModelIf multiple threads access the same variable concurrently,and at leastone thread modifies it,all threads must use C+11 atomic operations.陈渝(清华大学)第 12 讲2020 年 5 月 9 日26/55.C+11 Memory Modelstd:atomic陈渝(清华大学)第 12 讲2020 年 5 月 9 日27/55.C+11 Memory Modelstd:atomic陈渝(清华大学)第 12 讲2020 年 5 月 9 日28/55.C+11 Memory Modelstd:memory_order陈渝(清华大学)第 12 讲2020 年 5 月 9 日29/55.C+11 Memory Modelstd:memory_order_seq_cstThere is a total order over all seq cst operations.This order contributes to inter-thread orderingconstraints.陈渝(清华大学)第 12 讲2020 年 5 月 9 日30/55.C+11 Memory Modelstd:memory_order_seq_cst陈渝(清华大学)第 12 讲2020 年 5 月 9 日31/55.C+11 Memory Modelstd:memory_order_seq_cstsequential consistency直观上,读操作应该返回“最后”一次写入的值。在单处理器系统中,“最后”由程序次序定义。在多处理器系统中,我们称之为顺序连贯(sequential consistency,SC).约束条件在每个处理器内,维护每个处理器的程序次序;在所有处理器间,维护单一的表征所有操作的次序。对于写操作 W1,W2,不能出现从处理器 P1 看来,执行次序为 W1-W2;从处理器 P2 看来,执行次序却为 W2-W1 这种情况。陈渝(清华大学)第 12 讲2020 年 5 月 9 日32/55.C+11 Memory Modelstd:memory_order_release/acquireAn acquire load makes prior writes to other memory locations made by the thread that did therelease visible in the loading thread.陈渝(清华大学)第 12 讲2020 年 5 月 9 日33/55.C+11 Memory Modelstd:memory_order_release/acquireAn acquire load makes prior writes to other memory locations made by the thread that did therelease visible in the loading thread.陈渝(清华大学)第 12 讲2020 年 5 月 9 日34/55.C+11 Memory Modelstd:memory_order_release/acquireAcquire semanticsAcquire semantics is a property that can only apply to operations that read from sharedmemory,whether they are read-modify-write operations or plain loads.The operationis then considered a read-acquire.Acquire semantics prevent memory reordering ofthe read-acquire with any read or write operation that follows it in program order.陈渝(清华大学)第 12 讲2020 年 5 月 9 日35/55.C+11 Memory Modelstd:memory_order_release/acquireRelease semanticsRelease semantics is a property that can only apply to operations that write to sharedmemory,whether they are read-modify-write operations or plain stores.The operationis then considered a write-release.Release semantics prevent memory reordering of thewrite-release with any read or write operation that precedes it in program order.陈渝(清华大学)第 12 讲2020 年 5 月 9 日36/55.C+11 Memory Modelstd:memory_order_release/acquire陈渝(清华大学)第 12 讲2020 年 5 月 9 日37/55.C+11 Memory Modelstd:memory_order_release/acquirevoid produce()payload=42;guard.store(1,std:memory_order_release)void consume(int iterations)for(int i=0;i iterations;i+)if(guard.load(std:memory_order_acquire)resulti=payload;陈渝(清华大学)第 12 讲2020 年 5 月 9 日38/55.C+11 Memory Modelstd:memory_order_release/acquirehttp:/ 12 讲2020 年 5 月 9 日39/55.C+11 Memory Modelstd:memory_order_release/acquire1000 iterations:http:/ 12 讲2020 年 5 月 9 日40/55.C+11 Memory Modelstd:memory_order_consume(Data Dependency Ordering)A consume load makes prior writes to data-dependent memorylocations made by the thread that did the release visible in the loading thread.Data Dependency陈渝(清华大学)第 12 讲2020 年 5 月 9 日41/55.C+11 Memory Modelstd:memory_order_consume(Data Dependency Ordering)A consume load makes prior writes to data-dependent memorylocations made by the thread that did the release visible in the loading thread.陈渝(清华大学)第 12 讲2020 年 5 月 9 日42/55.C+11 Memory Modelstd:memory_order_consume(Data Dependency Ordering)A consume load makes prior writes to data-dependent memorylocations made by the thread that did the release visible in the loading thread.陈渝(清华大学)第 12 讲2020 年 5 月 9 日43/55.C+11 Memory Modelstd:memory_order_consumex86-64 machine code loads Guard into register rcx,then,if rcx is not null,uses rcxto load the payload,thus creating a data dependency between the two loadinstructions.86-64s strong memory model already guarantees that loads are performedin-order,even if there isnt a data dependency.陈渝(清华大学)第 12 讲2020 年 5 月 9 日44/55.C+11 Memory Modelstd:memory_order_consumePowerPC machine code loads Guard into register r9,then uses r9 to load thepayload,thus creating a data dependency between the two load pletely avoid the”cmp;bne;isync”sequence of instructions that formed amemory barrier in the original example,while still ensuring that the two loads areperformed in-order.陈渝(清华大学)第 12 讲2020 年 5 月 9 日45/55.C+11 Memory Modelstd:memory_order_consumeARMV7 machine code loads Guard into register r4,then uses r4 to load thepayload,thus creating a data dependency between the two load pletely avoid the”dmb ish”instruction that was present in the originalexample,while still ensuring that the two loads are performed in-order.陈渝(清华大学)第 12 讲2020 年 5 月 9 日46/55.C+11 Memory Modelstd:memory_order_consume(Data Dependency Ordering)A consume load makes prior writes to data-dependent memorylocations made by the thread that did the release visible in the loading thread.陈渝(清华大学)第 12 讲2020 年 5 月 9 日47/55.C+11 Memory Modelstd:memory_order_consumereal world RCU在真实世界中使用这一技术利用数据依赖顺序以避免内存栅栏的例子就是 Linux 内核。Linux 提供了读-复制-更新(RCU)的同步机制,适合构建在多个线程中需要多读少写的共享变量(包括指针等)数据结构。Linux 实际上没有使用 C+11 消费语义来去除那些内存栅栏,而是依靠它自己的 API 和规范。其实起初 RCU 就被看作是给 C+11 添加消费语义的动机。陈渝(清华大学)第 12 讲2020 年 5 月 9 日48/55.C+11 Memory Modelstd:memory_order_relaxedThe memory_order_relaxed ensure these operations are atomic,but dont impose any orderingconstraints/memory barriers that arent already there.relaxed order 允许单线程上不同的 memory location 进行 reorder,但是对于同一个 memorylocation 不能进行 reorder。陈渝(清华大学)第 12 讲2020 年 5 月 9 日49/55.Lock-Free Programming陈渝(清华大学)第 12 讲2020 年 5 月 9 日50/55.Lock-Free Programming陈渝(清华大学)第 12 讲2020 年 5 月 9 日51/55.Lock-Free ProgrammingLock-free Stacktemplateclass lock_free_stackprivate:struct nodeT data;node*next;node(T const&data_):/1data(data_);陈渝(清华大学)第 12 讲2020 年 5 月 9 日52/55.Lock-Free ProgrammingLock-free Stackstd:atomic head;public:void push(T const&data)node*const new_node=new node(data);/2new_node-next=head.load();/3while(!pare_exchange_weak(new_node-next,new_node);/4使用“比较/交换”操作:返回 false 时,因为比较失败(例如,head 被其他线程修改),会使用 head 中的内容更新 new_node-next(第一个参数)的内容。陈渝(清华大学)第 12 讲2020 年 5 月 9 日53/55.Lock-Free ProgrammingLock-free Stackwhile(!pare_exchange_weak(new_node-next,new_node);/4if(head=new_node-next)head=new_node;return true;elsenew_node-next=head;return false;使用“比较/交换”操作:返回 false 时,因为比较失败(例如,head 被其他线程修改),会使用 head 中的内容更新 new_node-next(第一个参数)的内容。陈渝(清华大学)第 12 讲2020 年 5 月 9 日54/55.Lock-Free ProgrammingLock-free Stackvoid pop(T&result)node*old_head=head.load();while(!pare_exchange_weak(old_head,old_head-next);result=old_head-data;这段代码很优雅,但有关于节点泄露的两个问题。首先,这段代码在空链表时不工作:当head 指针是空指针时,要访问 next 指针时,将引起未定义行为。陈渝(清华大学)第 12 讲2020 年 5 月 9 日55/55