..
  Copyright (c) 2015-2020 Linaro Ltd.

  This work is licensed under the terms of the GNU GPL, version 2 or
  later. See the COPYING file in the top-level directory.

==================
Multi-threaded TCG
==================

This document outlines the design for multi-threaded TCG (a.k.a. MTTCG)
system-mode emulation. User-mode emulation has always mirrored the
thread structure of the translated executable, although some of the
changes done for MTTCG system emulation have improved the stability of
linux-user emulation.

The original system-mode TCG implementation was single threaded and
dealt with multiple CPUs with simple round-robin scheduling. This
simplified a lot of things but became increasingly limited as systems
being emulated gained additional cores and per-core performance gains
for host systems started to level off.

vCPU Scheduling
===============

We introduce a new running mode where each vCPU will run on its own
user-space thread. This is enabled by default for all FE/BE
combinations where the host memory model is able to accommodate the
guest (TCG_GUEST_DEFAULT_MO & ~TCG_TARGET_DEFAULT_MO is zero) and the
guest has had the required work done to support this safely
(TARGET_SUPPORTS_MTTCG).

System emulation will fall back to the original round robin approach
if:

* forced by --accel tcg,thread=single
* enabling --icount mode
* 64 bit guests on 32 bit hosts (TCG_OVERSIZED_GUEST)

In the general case of running translated code there should be no
inter-vCPU dependencies and all vCPUs should be able to run at full
speed. Synchronisation will only be required while accessing internal
shared data structures or when the emulated architecture requires a
coherent representation of the emulated machine state.

Shared Data Structures
======================

Main Run Loop
-------------

Even when there is no code being generated there are a number of
structures associated with the hot-path through the main run-loop.
These are associated with looking up the next translation block to
execute. These include:

    tb_jmp_cache (per-vCPU, cache of recent jumps)
    tb_ctx.htable (global hash table, phys address->tb lookup)

As TB linking only occurs when blocks are in the same page this code
is critical to performance, as looking up the next TB to execute is
the most common reason to exit the generated code.

DESIGN REQUIREMENT: Make access to lookup structures safe with
multiple reader/writer threads. Minimise any lock contention while
doing so.

The hot-path avoids using locks where possible. The tb_jmp_cache is
updated with atomic accesses to ensure consistent results. The
fall-back QHT-based hash table is also designed for lockless lookups.
Locks are only taken when code generation is required or
TranslationBlocks have their block-to-block jumps patched.

Global TCG State
----------------

User-mode emulation
~~~~~~~~~~~~~~~~~~~

We need to protect the entire code generation cycle including any post
generation patching of the translated code. This also implies a shared
translation buffer which contains code running on all cores. Any
execution path that comes to the main run loop will need to hold a
mutex for code generation. This also includes times when we need to
flush code or entries from any shared lookups/caches. Structures held
on a per-vCPU basis won't need locking unless other vCPUs will need to
modify them.

DESIGN REQUIREMENT: Add locking around all code generation and TB
patching.

(Current solution)

Code generation is serialised with mmap_lock().

!User-mode emulation
~~~~~~~~~~~~~~~~~~~~

Each vCPU has its own TCG context and associated TCG region, thereby
requiring no locking during translation.

Translation Blocks
------------------

Currently the whole system shares a single code generation buffer
which, when full, will force a flush of all translations and a restart
from scratch. Some operations also force a full flush of translations,
including:

  - debugging operations (breakpoint insertion/removal)
  - some CPU helper functions
  - linux-user spawning its first thread
  - operations related to TCG Plugins

This is done with the async_safe_run_on_cpu() mechanism to ensure all
vCPUs are quiescent when changes are being made to shared global
structures.

More granular translation invalidation events are typically due
to a change of the state of a physical page:

  - code modification (self-modifying code, patching code)
  - page changes (new page mapping in linux-user mode)

While setting the invalid flag in a TranslationBlock will stop it
being used when looked up in the hot-path, there are a number of other
book-keeping structures that need to be safely cleared.

Any TranslationBlocks which have been patched to jump directly to the
now invalid blocks need their jump patches reverted so they will
return to the C code.

There are a number of look-up caches that need to be properly updated,
including:

  - jump lookup cache
  - the physical-to-tb lookup hash table
  - the global page table

The global page table (l1_map) provides a multi-level look-up for
PageDesc structures, which contain pointers to the start of a linked
list of all TranslationBlocks in that page (see page_next).

Both the jump patching and the page cache involve linked lists that
the invalidated TranslationBlock needs to be removed from.

DESIGN REQUIREMENT: Safely handle invalidation of TBs
                      - safely patch/revert direct jumps
                      - remove central PageDesc lookup entries
                      - ensure lookup caches/hashes are safely updated

(Current solution)

The direct jumps themselves are updated atomically by the TCG
tb_set_jmp_target() code. Modification to the linked lists that allow
searching for linked pages is done under the protection of tb->jmp_lock,
where tb is the destination block of a jump. Each origin block keeps a
pointer to its destinations so that the appropriate lock can be acquired
before iterating over a jump list.

The global page table is a lockless radix tree; cmpxchg is used
to atomically insert new elements.

The lookup caches are updated atomically and the lookup hash uses QHT
which is designed for concurrent safe lookup.

Parallel code generation is supported. QHT is used at insertion time
as the synchronization point across threads, thereby ensuring that we only
keep track of a single TranslationBlock for each guest code block.

Memory maps and TLBs
--------------------

The memory handling code is fairly critical to the speed of memory
access in the emulated system. The SoftMMU code is designed so the
hot-path can be handled entirely within translated code. This is
handled with a per-vCPU TLB structure which once populated will allow
a series of accesses to the page to occur without exiting the
translated code. It is possible to set flags in the TLB address which
will ensure the slow-path is taken for each access. This can be done
to support:

  - Memory regions (dividing up access to PIO, MMIO and RAM)
  - Dirty page tracking (for code gen, SMC detection, migration and display)
  - Virtual TLB (for translating guest address->real address)

When the TLB tables are updated by a vCPU thread other than its own
we need to ensure it is done in a safe way so no inconsistent state is
seen by the vCPU thread.

Some operations require updating a number of vCPUs' TLBs at the same
time in a synchronised manner.

DESIGN REQUIREMENTS:

  - TLB Flush All/Page
    - can be across-vCPUs
    - a cross-vCPU TLB flush may need other vCPUs brought to a halt
    - the change may need to be visible to the calling vCPU immediately
  - TLB Flag Update
    - usually cross-vCPU
    - want the change to be visible as soon as possible
  - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
    - This is a per-vCPU table - by definition can't race
    - updated by its own thread when the slow-path is forced

(Current solution)

A new set of tlb flush operations (tlb_flush_*_all_cpus_synced) force
synchronisation by setting the source vCPU's work as "safe work" and
exiting the cpu run loop. This ensures that by the time execution
restarts all flush operations have completed.

TLB flag updates are all done atomically and are also protected by the
corresponding page lock.

(Known limitation)

Not really a limitation, but the wait mechanism is overly strict for
some architectures which only need flushes completed by a barrier
instruction. This could be a future optimisation.

Emulated hardware state
-----------------------

Currently, thanks to KVM work, any access to IO memory is automatically
protected by the BQL (Big QEMU Lock). Any IO region that doesn't use
the BQL is expected to do its own locking.

However IO memory isn't the only way emulated hardware state can be
modified. Some architectures have model specific registers that
trigger hardware emulation features. Generally any translation helper
that needs to update more than a single vCPU's state should take the
BQL.

As the BQL, or global iothread mutex, is shared across the system we
push the use of the lock as far down into the TCG code as possible to
minimise contention.

(Current solution)

MMIO access automatically serialises hardware emulation by way of the
BQL. Currently Arm targets serialise all ARM_CP_IO register accesses
and also defer the reset/startup of vCPUs to the vCPU context by way
of async_run_on_cpu().

Updates to interrupt state are also protected by the BQL as they can
often be cross vCPU.

Memory Consistency
==================

Between emulated guests and host systems there are a range of memory
consistency models. Even emulating weakly ordered systems on strongly
ordered hosts needs to ensure things like store-after-load re-ordering
can be prevented when the guest wants to.

Memory Barriers
---------------

Barriers (sometimes known as fences) provide a mechanism for software
to enforce a particular ordering of memory operations from the point
of view of external observers (e.g. another processor core). They can
apply to any memory operations as well as just loads or stores.

The Linux kernel has an excellent `write-up
<https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt>`_
on the various forms of memory barrier and the guarantees they can
provide.

Barriers are often wrapped around synchronisation primitives to
provide explicit memory ordering semantics. However they can be used
by themselves to provide safe lockless access by ensuring, for example,
that a change to a signal flag will only be visible once the changes
to the payload are.

DESIGN REQUIREMENT: Add a new tcg_memory_barrier op

This would enforce a strong load/store ordering so all loads/stores
complete at the memory barrier. On single-core non-SMP strongly
ordered backends this could become a NOP.

Aside from explicit standalone memory barrier instructions there are
also implicit memory ordering semantics which come with each guest
memory access instruction. For example all x86 load/stores come with
fairly strong guarantees of sequential consistency whereas Arm has
special variants of load/store instructions that imply acquire/release
semantics.

In the case of a strongly ordered guest architecture being emulated on
a weakly ordered host the scope for a heavy performance impact is
quite high.

DESIGN REQUIREMENTS: Be efficient with use of memory barriers
       - host systems with stronger implied guarantees can skip some barriers
       - merge consecutive barriers to the strongest one

(Current solution)

The system currently has a tcg_gen_mb() which will add memory barrier
operations if code generation is being done in a parallel context. The
tcg_optimize() function attempts to merge barriers up to their
strongest form before any load/store operations. The solution was
originally developed and tested for linux-user based systems. All
backends have been converted to emit fences when required. So far the
following front-ends have been updated to emit fences when required:

    - target-i386
    - target-arm
    - target-aarch64
    - target-alpha
    - target-mips

Memory Control and Maintenance
------------------------------

This includes a class of instructions for controlling system cache
behaviour. While QEMU doesn't model cache behaviour these instructions
are often seen when code modification has taken place to ensure the
changes take effect.

Synchronisation Primitives
--------------------------

There are two broad types of synchronisation primitives found in
modern ISAs: atomic instructions and exclusive regions.

The first type offers a simple atomic instruction which will guarantee
some sort of test and conditional store will be truly atomic w.r.t.
other cores sharing access to the memory. The classic example is the
x86 cmpxchg instruction.

The second type offers a pair of load/store instructions which offer a
guarantee that a region of memory has not been touched between the
load and store instructions. An example of this is Arm's ldrex/strex
pair where the strex instruction will return a flag indicating a
successful store only if no other CPU has accessed the memory region
since the ldrex.

Traditionally TCG has generated a series of operations that work
because they are within the context of a single translation block so
will have completed before another CPU is scheduled. However with
the ability to have multiple threads running to emulate multiple CPUs
we will need to explicitly expose these semantics.

DESIGN REQUIREMENTS:
  - Support classic atomic instructions
  - Support load/store exclusive (or load link/store conditional) pairs
  - Generic enough infrastructure to support all guest architectures
CURRENT OPEN QUESTIONS:
  - How problematic is the ABA problem in general?

(Current solution)

The TCG provides a number of atomic helpers (tcg_gen_atomic_*) which
can be used directly or combined to emulate other instructions like
Arm's ldrex/strex instructions. While they are susceptible to the ABA
problem, so far common guests have not implemented patterns where
this may be a problem - typically they present a locking ABI which
assumes cmpxchg-like semantics.

The code also includes a fall-back for cases where multi-threaded TCG
ops can't work (e.g. guest atomic width > host atomic width). In this
case an EXCP_ATOMIC exit occurs and the instruction is emulated with
an exclusive lock which ensures all emulation is serialised.

While the atomic helpers look good enough for now there may be a need
to look at solutions that can more closely model the guest
architecture's semantics.