..
  Copyright (c) 2015-2020 Linaro Ltd.

  This work is licensed under the terms of the GNU GPL, version 2 or
  later. See the COPYING file in the top-level directory.

==================
Multi-threaded TCG
==================

This document outlines the design for multi-threaded TCG (a.k.a. MTTCG)
system-mode emulation. User-mode emulation has always mirrored the
thread structure of the translated executable, although some of the
changes done for MTTCG system emulation have improved the stability of
linux-user emulation.

The original system-mode TCG implementation was single threaded and
dealt with multiple CPUs with simple round-robin scheduling. This
simplified a lot of things but became increasingly limited as systems
being emulated gained additional cores and per-core performance gains
for host systems started to level off.

vCPU Scheduling
===============

We introduce a new running mode where each vCPU will run on its own
user-space thread. This is enabled by default for all front-end/back-end
(FE/BE) combinations where the host memory model is able to accommodate
the guest (TCG_GUEST_DEFAULT_MO & ~TCG_TARGET_DEFAULT_MO is zero) and the
guest has had the required work done to support this safely
(TARGET_SUPPORTS_MTTCG).

System emulation will fall back to the original round-robin approach
if:

* it is forced by --accel tcg,thread=single
* --icount mode is enabled
* a 64 bit guest is run on a 32 bit host (TCG_OVERSIZED_GUEST)
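
The default-mode decision described above can be sketched as a small
predicate. This is only an illustration: the ordering bits, the
mttcg_config structure and mttcg_by_default() are hypothetical names
invented for this example, standing in for TCG_GUEST_DEFAULT_MO,
TCG_TARGET_DEFAULT_MO, TARGET_SUPPORTS_MTTCG and TCG_OVERSIZED_GUEST.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative ordering bits: "earlier access, later access" stay ordered. */
enum {
    MO_LD_LD = 1 << 0,
    MO_ST_LD = 1 << 1,
    MO_LD_ST = 1 << 2,
    MO_ST_ST = 1 << 3,
    MO_ALL   = MO_LD_LD | MO_ST_LD | MO_LD_ST | MO_ST_ST,
};

/* Hypothetical description of one guest/host/command-line combination. */
struct mttcg_config {
    unsigned guest_mo;     /* orderings the guest architecture relies on */
    unsigned host_mo;      /* orderings the host provides for free */
    bool target_supports;  /* stands in for TARGET_SUPPORTS_MTTCG */
    bool oversized_guest;  /* stands in for TCG_OVERSIZED_GUEST */
    bool icount;           /* --icount mode requested */
    bool forced_single;    /* --accel tcg,thread=single requested */
};

/* Returns true if one thread per vCPU would be the default. */
static bool mttcg_by_default(const struct mttcg_config *c)
{
    /* Any of the fall-back conditions selects round-robin scheduling. */
    if (c->forced_single || c->icount || c->oversized_guest) {
        return false;
    }
    if (!c->target_supports) {
        return false;
    }
    /* The host must provide every ordering the guest relies on. */
    return (c->guest_mo & ~c->host_mo) == 0;
}
```

With these assumptions a weakly ordered guest on a strongly ordered host
passes the mask test, while a strongly ordered guest on a weakly ordered
host does not.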

In the general case of running translated code there should be no
inter-vCPU dependencies and all vCPUs should be able to run at full
speed. Synchronisation will only be required while accessing internal
shared data structures or when the emulated architecture requires a
coherent representation of the emulated machine state.

Shared Data Structures
======================

Main Run Loop
-------------

Even when there is no code being generated there are a number of
structures associated with the hot-path through the main run-loop.
These are associated with looking up the next translation block to
execute. These include:

    tb_jmp_cache (per-vCPU, cache of recent jumps)
    tb_ctx.htable (global hash table, phys address->tb lookup)

As TB linking only occurs when blocks are in the same page, this code
is critical to performance: looking up the next TB to execute is the
most common reason to exit the generated code.

DESIGN REQUIREMENT: Make access to lookup structures safe with
multiple reader/writer threads. Minimise any lock contention to do it.

The hot-path avoids using locks where possible. The tb_jmp_cache is
updated with atomic accesses to ensure consistent results. The fall-back
QHT-based hash table is also designed for lockless lookups. Locks
are only taken when code generation is required or TranslationBlocks
have their block-to-block jumps patched.
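
A minimal sketch of such a lookup, using C11 atomics, is shown below.
It is not QEMU's implementation: the TB and VCPU types, the hash
function and the fixed global_tbs array (standing in for the QHT-based
table) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define JMP_CACHE_BITS 4
#define JMP_CACHE_SIZE (1u << JMP_CACHE_BITS)

typedef struct TB {
    uint64_t pc;                      /* guest address of the block */
} TB;

typedef struct VCPU {
    _Atomic(TB *) jmp_cache[JMP_CACHE_SIZE];
    unsigned slow_lookups;            /* statistics for the example only */
} VCPU;

/* Stand-in for the global hash table: a small fixed array. */
static TB global_tbs[] = { { 0x1000 }, { 0x2000 }, { 0x3000 } };

static void vcpu_init(VCPU *cpu)
{
    for (size_t i = 0; i < JMP_CACHE_SIZE; i++) {
        atomic_init(&cpu->jmp_cache[i], NULL);
    }
    cpu->slow_lookups = 0;
}

static TB *slow_lookup(uint64_t pc)
{
    for (size_t i = 0; i < sizeof(global_tbs) / sizeof(global_tbs[0]); i++) {
        if (global_tbs[i].pc == pc) {
            return &global_tbs[i];
        }
    }
    return NULL;                      /* real code would translate here */
}

static unsigned jmp_cache_hash(uint64_t pc)
{
    return (unsigned)(pc >> 12) & (JMP_CACHE_SIZE - 1);
}

static TB *tb_lookup(VCPU *cpu, uint64_t pc)
{
    unsigned idx = jmp_cache_hash(pc);
    /* Lock-free read: another thread may clear or replace the slot. */
    TB *tb = atomic_load_explicit(&cpu->jmp_cache[idx], memory_order_acquire);

    if (tb != NULL && tb->pc == pc) {
        return tb;                    /* fast path, no lock taken */
    }
    cpu->slow_lookups++;
    tb = slow_lookup(pc);
    if (tb != NULL) {
        atomic_store_explicit(&cpu->jmp_cache[idx], tb, memory_order_release);
    }
    return tb;
}
```

The cached pointer is always re-validated against the requested address,
so a stale or replaced entry only costs a trip through the slower global
lookup.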

Global TCG State
----------------

User-mode emulation
~~~~~~~~~~~~~~~~~~~

We need to protect the entire code generation cycle including any post
generation patching of the translated code. This also implies a shared
translation buffer which contains code running on all cores. Any
execution path that comes to the main run loop will need to hold a
mutex for code generation. This also includes times when we need to flush
code or entries from any shared lookups/caches. Structures held on a
per-vCPU basis won't need locking unless other vCPUs will need to
modify them.

DESIGN REQUIREMENT: Add locking around all code generation and TB
patching.

(Current solution)

Code generation is serialised with mmap_lock().

!User-mode emulation
~~~~~~~~~~~~~~~~~~~~

Each vCPU has its own TCG context and associated TCG region, thereby
requiring no locking during translation.

Translation Blocks
------------------

Currently the whole system shares a single code generation buffer
which, when full, will force a flush of all translations and start from
scratch again. Some operations also force a full flush of translations
including:

  - debugging operations (breakpoint insertion/removal)
  - some CPU helper functions
  - linux-user spawning its first thread
  - operations related to TCG Plugins

This is done with the async_safe_run_on_cpu() mechanism to ensure all
vCPUs are quiescent when changes are being made to shared global
structures.

More granular translation invalidation events are typically due
to a change of the state of a physical page:

  - code modification (self-modifying code, patching code)
  - page changes (new page mapping in linux-user mode)

While setting the invalid flag in a TranslationBlock will stop it
being used when looked up in the hot-path, there are a number of other
book-keeping structures that need to be safely cleared.

Any TranslationBlocks which have been patched to jump directly to the
now invalid blocks need their jump patches reversed so they will return
to the C code.

There are a number of look-up caches that need to be properly updated
including the:

  - jump lookup cache
  - the physical-to-tb lookup hash table
  - the global page table

The global page table (l1_map) provides a multi-level look-up
for PageDesc structures, which contain pointers to the start of a
linked list of all Translation Blocks in that page (see page_next).

Both the jump patching and the page cache involve linked lists that
the invalidated TranslationBlock needs to be removed from.

DESIGN REQUIREMENT: Safely handle invalidation of TBs
                      - safely patch/revert direct jumps
                      - remove central PageDesc lookup entries
                      - ensure lookup caches/hashes are safely updated

(Current solution)

The direct jumps themselves are updated atomically by the TCG
tb_set_jmp_target() code. Modifications to the linked lists that allow
searching for linked pages are done under the protection of tb->jmp_lock,
where tb is the destination block of a jump. Each origin block keeps a
pointer to its destinations so that the appropriate lock can be acquired before
iterating over a jump list.

The global page table is a lockless radix tree; cmpxchg is used
to atomically insert new elements.
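
The cmpxchg-based insertion can be illustrated with a two-level table
whose second level is allocated on demand. This is a sketch under
assumptions, not the QEMU code: demo_l1_map, page_desc_get() and the
PageDesc layout are invented for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define LEVEL_BITS 4
#define LEVEL_SIZE (1u << LEVEL_BITS)

/* Hypothetical payload: head of the list of TBs in this page. */
typedef struct PageDesc {
    void *first_tb;
} PageDesc;

/* Top level of the tree; static storage starts out as all NULL. */
static _Atomic(PageDesc *) demo_l1_map[LEVEL_SIZE];

/* Find the PageDesc for a page index, allocating its leaf if needed. */
static PageDesc *page_desc_get(unsigned page_index)
{
    unsigned hi = (page_index >> LEVEL_BITS) & (LEVEL_SIZE - 1);
    unsigned lo = page_index & (LEVEL_SIZE - 1);
    PageDesc *leaf = atomic_load_explicit(&demo_l1_map[hi],
                                          memory_order_acquire);

    if (leaf == NULL) {
        PageDesc *fresh = calloc(LEVEL_SIZE, sizeof(*fresh));
        PageDesc *expected = NULL;

        if (fresh == NULL) {
            return NULL;
        }
        /* Publish the new leaf only if the slot is still empty. */
        if (atomic_compare_exchange_strong_explicit(&demo_l1_map[hi],
                                                    &expected, fresh,
                                                    memory_order_acq_rel,
                                                    memory_order_acquire)) {
            leaf = fresh;
        } else {
            /* Another thread won the race: use its leaf, drop ours. */
            free(fresh);
            leaf = expected;
        }
    }
    return &leaf[lo];
}
```

Readers never take a lock: they either see NULL or a fully zeroed leaf,
and racing writers agree on a single winner.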

The lookup caches are updated atomically and the lookup hash uses QHT
which is designed for concurrent safe lookup.

Parallel code generation is supported. QHT is used at insertion time
as the synchronization point across threads, thereby ensuring that we only
keep track of a single TranslationBlock for each guest code block.

Memory maps and TLBs
--------------------

The memory handling code is fairly critical to the speed of memory
access in the emulated system. The SoftMMU code is designed so the
hot-path can be handled entirely within translated code. This is
handled with a per-vCPU TLB structure which once populated will allow
a series of accesses to the page to occur without exiting the
translated code. It is possible to set flags in the TLB address which
will ensure the slow-path is taken for each access. This can be done
to support:

  - Memory regions (dividing up access to PIO, MMIO and RAM)
  - Dirty page tracking (for code gen, SMC detection, migration and display)
  - Virtual TLB (for translating guest address->real address)
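
The idea of keeping flags in the TLB address can be sketched as
follows. The flag names, the TlbEntry layout and tlb_check() are
hypothetical; the point is only that a page-aligned address leaves its
low bits free, and any non-zero low bit diverts the access away from
the fast path.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BITS 12
#define PAGE_MASK (~(((uint64_t)1 << PAGE_BITS) - 1))

/* Hypothetical flags stored in the unused low bits of a page address. */
#define TLBF_INVALID  ((uint64_t)1 << 0)
#define TLBF_MMIO     ((uint64_t)1 << 1)
#define TLBF_NOTDIRTY ((uint64_t)1 << 2)

typedef enum { TLB_MISS, TLB_FAST, TLB_SLOW } TlbResult;

typedef struct TlbEntry {
    uint64_t addr;   /* page-aligned guest address, ORed with flags */
} TlbEntry;

static TlbResult tlb_check(const TlbEntry *e, uint64_t vaddr)
{
    /*
     * Keeping TLBF_INVALID in the comparison means an invalid entry can
     * never equal a page-aligned address, so it always misses.
     */
    if ((vaddr & PAGE_MASK) != (e->addr & (PAGE_MASK | TLBF_INVALID))) {
        return TLB_MISS;
    }
    /* Any remaining flag forces the access through the slow path. */
    if (e->addr & ~PAGE_MASK) {
        return TLB_SLOW;
    }
    return TLB_FAST;
}
```

An entry for plain RAM compares equal and takes the fast path, whereas
an entry tagged as MMIO or not-dirty still matches the page but is
routed to the slow path on every access.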

When the TLB tables are updated by a vCPU thread other than their own
we need to ensure it is done in a safe way so no inconsistent state is
seen by the vCPU thread.

Some operations require updating a number of vCPUs' TLBs at the same
time in a synchronised manner.

DESIGN REQUIREMENTS:

  - TLB Flush All/Page
    - can be across-vCPUs
    - cross vCPU TLB flush may need other vCPUs brought to a halt
    - change may need to be visible to the calling vCPU immediately
  - TLB Flag Update
    - usually cross-vCPU
    - want change to be visible as soon as possible
  - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
    - This is a per-vCPU table - by definition can't race
    - updated by its own thread when the slow-path is forced

(Current solution)

We have updated cputlb.c to defer cross-vCPU operations with
async_run_on_cpu(), which ensures each vCPU sees a coherent state when
it next runs its work (in a few instructions' time).
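
The deferral pattern can be illustrated with a deliberately simplified
model in which the queued work is reduced to a single atomic flag. The
real mechanism queues work items; the VCPU type, tlb_flush_request()
and process_queued_work() below are hypothetical names used only for
this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TLB_SIZE 8

typedef struct VCPU {
    uint64_t tlb[TLB_SIZE];     /* only ever touched by the owning thread */
    atomic_bool flush_pending;  /* may be set by any thread */
} VCPU;

static void vcpu_init(VCPU *cpu)
{
    memset(cpu->tlb, 0, sizeof(cpu->tlb));
    atomic_init(&cpu->flush_pending, false);
}

/* Called from another vCPU thread: queue the request, touch nothing else. */
static void tlb_flush_request(VCPU *target)
{
    atomic_store_explicit(&target->flush_pending, true, memory_order_release);
}

/*
 * Called by the owning thread when it next leaves translated code.
 * Returns true if a flush was carried out.
 */
static bool process_queued_work(VCPU *self)
{
    if (!atomic_exchange_explicit(&self->flush_pending, false,
                                  memory_order_acq_rel)) {
        return false;
    }
    memset(self->tlb, 0, sizeof(self->tlb));
    return true;
}
```

Because only the owning thread ever writes its TLB, the requester never
observes or creates a half-updated table; the cost is that the flush
takes effect slightly later.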

A new set of operations (tlb_flush_*_all_cpus) take an additional flag
which when set will force synchronisation by setting the source vCPU's
work as "safe work" and exiting the cpu run loop. This ensures that by
the time execution restarts all flush operations have completed.

TLB flag updates are all done atomically and are also protected by the
corresponding page lock.

(Known limitation)

Not really a limitation but the wait mechanism is overly strict for
some architectures which only need flushes completed by a barrier
instruction. This could be a future optimisation.

Emulated hardware state
-----------------------

Currently, thanks to KVM work, any access to IO memory is automatically
protected by the BQL (Big QEMU Lock). Any IO region that doesn't use the
BQL is expected to do its own locking.

However, IO memory isn't the only way emulated hardware state can be
modified. Some architectures have model specific registers that
trigger hardware emulation features. Generally any translation helper
that needs to update more than a single vCPU's state should take the
BQL.

As the BQL, or global iothread mutex, is shared across the system we
push the use of the lock as far down into the TCG code as possible to
minimise contention.

(Current solution)

MMIO access automatically serialises hardware emulation by way of the
BQL. Currently Arm targets serialise all ARM_CP_IO register accesses
and also defer the reset/startup of vCPUs to the vCPU context by way
of async_run_on_cpu().

Updates to interrupt state are also protected by the BQL as they can
often be cross vCPU.

Memory Consistency
==================

Between emulated guests and host systems there is a range of memory
consistency models. Even emulating weakly ordered systems on strongly
ordered hosts needs to ensure things like store-after-load re-ordering
can be prevented when the guest wants to.

Memory Barriers
---------------

Barriers (sometimes known as fences) provide a mechanism for software
to enforce a particular ordering of memory operations from the point
of view of external observers (e.g. another processor core). They can
apply to all memory operations, or to loads or stores only.

The Linux kernel has an excellent `write-up
<https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt>`_
on the various forms of memory barrier and the guarantees they can
provide.

Barriers are often wrapped around synchronisation primitives to
provide explicit memory ordering semantics. However they can be used
by themselves to provide safe lockless access by ensuring, for example,
that a change to a signal flag only becomes visible once the changes to
the payload are.
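
The flag-and-payload case can be sketched with C11 fences. The Mailbox
type and its helpers are hypothetical names for this example; the
pattern itself (write payload, release fence, set flag; read flag,
acquire fence, read payload) is the standard one.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct Mailbox {
    int payload;          /* plain data, written before the flag */
    atomic_bool ready;    /* signal flag */
} Mailbox;

static void mailbox_init(Mailbox *m)
{
    m->payload = 0;
    atomic_init(&m->ready, false);
}

/* Producer side: the payload must be visible before the flag. */
static void mailbox_publish(Mailbox *m, int value)
{
    m->payload = value;
    /* Order the payload store before the flag store. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&m->ready, true, memory_order_relaxed);
}

/* Consumer side: returns true and fills *out once the flag is seen. */
static bool mailbox_try_read(Mailbox *m, int *out)
{
    if (!atomic_load_explicit(&m->ready, memory_order_relaxed)) {
        return false;
    }
    /* Order the flag load before the payload load. */
    atomic_thread_fence(memory_order_acquire);
    *out = m->payload;
    return true;
}
```

A consumer that observes the flag is then guaranteed to read the
published payload rather than a stale value, without any lock being
taken.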

DESIGN REQUIREMENT: Add a new tcg_memory_barrier op

This would enforce a strong load/store ordering so all loads/stores
complete at the memory barrier. On single-core non-SMP strongly
ordered backends this could become a NOP.

Aside from explicit standalone memory barrier instructions there are
also implicit memory ordering semantics which come with each guest
memory access instruction. For example all x86 load/stores come with
fairly strong guarantees of sequential consistency whereas Arm has
special variants of load/store instructions that imply acquire/release
semantics.

In the case of a strongly ordered guest architecture being emulated on
a weakly ordered host the scope for a heavy performance impact is
quite high.

DESIGN REQUIREMENTS: Be efficient with use of memory barriers
       - host systems with stronger implied guarantees can skip some barriers
       - merge consecutive barriers to the strongest one
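
Both requirements can be illustrated on a toy operation stream. The
sketch below is not the TCG optimiser: the Op type, the ordering bits
and optimize_barriers() are assumptions made for the example. Barriers
are modelled as bitmasks, so "the strongest one" is simply the OR of
the masks being merged.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative ordering bits for a barrier. */
enum {
    MO_LD_LD = 1 << 0,
    MO_ST_LD = 1 << 1,
    MO_LD_ST = 1 << 2,
    MO_ST_ST = 1 << 3,
    MO_ALL   = MO_LD_LD | MO_ST_LD | MO_LD_ST | MO_ST_ST,
};

typedef enum { OP_MB, OP_LD, OP_ST } OpKind;

typedef struct Op {
    OpKind kind;
    unsigned mo;   /* ordering bits, only meaningful for OP_MB */
} Op;

/*
 * Rewrites ops[] in place and returns the new length.
 * host_mo holds the orderings the host guarantees without a fence.
 */
static size_t optimize_barriers(Op *ops, size_t n, unsigned host_mo)
{
    size_t out = 0;

    for (size_t i = 0; i < n; i++) {
        Op op = ops[i];

        if (op.kind == OP_MB) {
            op.mo &= ~host_mo;          /* host already provides these */
            if (op.mo == 0) {
                continue;               /* nothing left to enforce */
            }
            if (out > 0 && ops[out - 1].kind == OP_MB) {
                ops[out - 1].mo |= op.mo;   /* merge into previous barrier */
                continue;
            }
        }
        ops[out++] = op;
    }
    return out;
}
```

On a host that guarantees nothing, two adjacent barriers collapse into
one carrying both sets of bits; on a host that already orders everything
except store-then-load, only barriers requesting that ordering survive.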

(Current solution)

The system currently has a tcg_gen_mb() which will add memory barrier
operations if code generation is being done in a parallel context. The
tcg_optimize() function attempts to merge barriers up to their
strongest form before any load/store operations. The solution was
originally developed and tested for linux-user based systems. All
backends have been converted to emit fences when required. So far the
following front-ends have been updated to emit fences when required:

    - target-i386
    - target-arm
    - target-aarch64
    - target-alpha
    - target-mips

Memory Control and Maintenance
------------------------------

This includes a class of instructions for controlling system cache
behaviour. While QEMU doesn't model cache behaviour these instructions
are often seen when code modification has taken place to ensure the
changes take effect.

Synchronisation Primitives
--------------------------

There are two broad types of synchronisation primitives found in
modern ISAs: atomic instructions and exclusive regions.

The first type offers a simple atomic instruction which will guarantee
some sort of test and conditional store will be truly atomic w.r.t.
other cores sharing access to the memory. The classic example is the
x86 cmpxchg instruction.

The second type offers a pair of load/store instructions which
guarantee that a region of memory has not been touched between the
load and store instructions. An example of this is Arm's ldrex/strex
pair where the strex instruction will return a flag indicating a
successful store only if no other CPU has accessed the memory region
since the ldrex.

Traditionally TCG has generated a series of operations that work
because they are within the context of a single translation block so
will have completed before another CPU is scheduled. However with
the ability to have multiple threads running to emulate multiple CPUs
we will need to explicitly expose these semantics.

DESIGN REQUIREMENTS:
  - Support classic atomic instructions
  - Support load/store exclusive (or load link/store conditional) pairs
  - Generic enough infrastructure to support all guest architectures

CURRENT OPEN QUESTIONS:
  - How problematic is the ABA problem in general?

(Current solution)

The TCG provides a number of atomic helpers (tcg_gen_atomic_*) which
can be used directly or combined to emulate other instructions like
Arm's ldrex/strex instructions. While they are susceptible to the ABA
problem, so far common guests have not implemented patterns where
this may be a problem - typically presenting a locking ABI which
assumes cmpxchg-like semantics.
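
How an exclusive pair can be built from a compare-and-swap, and why
that is exposed to ABA, is sketched below with C11 atomics. The VCPU
fields and the load_exclusive()/store_exclusive() helpers are
hypothetical and much simpler than a real front-end; unlike Arm's
strex, store_exclusive() here returns true on success.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct VCPU {
    _Atomic uint32_t *exclusive_addr;  /* NULL when no monitor is armed */
    uint32_t exclusive_val;            /* value seen by the exclusive load */
} VCPU;

/* Emulated load-exclusive: remember where we loaded from and what we saw. */
static uint32_t load_exclusive(VCPU *cpu, _Atomic uint32_t *addr)
{
    cpu->exclusive_addr = addr;
    cpu->exclusive_val = atomic_load(addr);
    return cpu->exclusive_val;
}

/* Emulated store-exclusive: succeeds only if memory still holds that value. */
static bool store_exclusive(VCPU *cpu, _Atomic uint32_t *addr, uint32_t newval)
{
    bool ok = false;

    if (cpu->exclusive_addr == addr) {
        uint32_t expected = cpu->exclusive_val;
        ok = atomic_compare_exchange_strong(addr, &expected, newval);
    }
    cpu->exclusive_addr = NULL;        /* the monitor is always cleared */
    return ok;
}
```

The compare-and-swap only checks the value, not whether the location was
written in between. If another thread changes the word and then restores
the original value, the emulated store still succeeds where real
hardware would report failure; this is the ABA exposure mentioned above.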

The code also includes a fall-back for cases where multi-threaded TCG
ops can't work (e.g. guest atomic width > host atomic width). In this
case an EXCP_ATOMIC exit occurs and the instruction is emulated with
an exclusive lock which ensures all emulation is serialised.

While the atomic helpers look good enough for now there may be a need
to look at solutions that can more closely model the guest
architecture's semantics.