1.. Porting SIMD code:
2
3.. role:: raw-html(raw)
4    :format: html
5
6=======================================
7Porting SIMD code targeting WebAssembly
8=======================================
9
10Emscripten supports the `WebAssembly SIMD proposal <https://github.com/webassembly/simd/>`_ when using the WebAssembly LLVM backend. To enable SIMD, pass the -msimd128 flag at compile time. This will also turn on LLVM's autovectorization passes, so no source modifications are necessary to benefit from SIMD.
11
12At the source level, the GCC/Clang `SIMD Vector Extensions <https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html>`_ can be used and will be lowered to WebAssembly SIMD instructions where possible. In addition, there is a portable intrinsics header file that can be used.
13
14    .. code-block:: cpp
15
16       #include <wasm_simd128.h>
17
18Separate documentation for the intrinsics header is a work in progress, but its usage is straightforward and its source can be found at `wasm_simd128.h <https://github.com/llvm/llvm-project/blob/master/clang/lib/Headers/wasm_simd128.h>`_. These intrinsics are under active development in parallel with the SIMD proposal and should not be considered any more stable than the proposal itself. Note that most engines will also require an extra flag to enable SIMD. For example, Node requires `--experimental-wasm-simd`.
19
20WebAssembly SIMD is not supported when using the Fastcomp backend.
21
22======================================
23Limitations and behavioral differences
24======================================
25
26When porting native SIMD code, it should be noted that because of portability concerns, the WebAssembly SIMD specification does not expose the full native instruction sets. In particular the following changes exist:
27
28 - Emscripten does not support x86 or any other native inline SIMD assembly or building .s assembly files, so all code should be written to use SIMD intrinsic functions or compiler vector extensions.
29
30 - WebAssembly SIMD does not have control over managing floating point rounding modes or handling denormals.
31
32 - Cache line prefetch instructions are not available, and calls to these functions will compile, but are treated as no-ops.
33
34 - Asymmetric memory fence operations are not available, but will be implemented as fully synchronous memory fences when SharedArrayBuffer is enabled (-s USE_PTHREADS=1) or as no-ops when multithreading is not enabled (default, -s USE_PTHREADS=0).
35
36SIMD-related bug reports are tracked in the `Emscripten bug tracker with the label SIMD <https://github.com/emscripten-core/emscripten/issues?q=is%3Aopen+is%3Aissue+label%3ASIMD>`_.
37
38=====================================================
39Compiling SIMD code targeting x86 SSE instruction set
40=====================================================
41
42Emscripten supports compiling existing codebases that use x86 SSE by passing the `-msse` directive to the compiler, and including the header `<xmmintrin.h>`.
43
44Currently only the SSE1 and SSE2 instruction sets are supported.
45
46The following table highlights the availability and expected performance of different SSE1 intrinsics. Even if you are directly targeting the native Wasm SIMD opcodes via wasm_simd128.h header, this table can be useful for understanding the performance limitations that the Wasm SIMD specification has when running on x86 hardware.
47
48For detailed information on each SSE intrinsic function, visit the excellent `Intel Intrinsics Guide on SSE1 <https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE>`_.
49
50The following legend is used to highlight the expected performance of various instructions:
51 - ✅ Wasm SIMD has a native opcode that matches the x86 SSE instruction, should yield native performance
52 - �� while the Wasm SIMD spec does not provide a proper performance guarantee, given a suitably smart enough compiler and a runtime VM path, this intrinsic should be able to generate the identical native SSE instruction.
53 - �� there is some information missing (e.g. type or alignment information) for a Wasm VM to be guaranteed to be able to reconstruct the intended x86 SSE opcode. This might cause a penalty depending on the target CPU hardware family, especially on older CPU generations.
54 - ⚠️ the underlying x86 SSE instruction is not available, but it is emulated via at most few other Wasm SIMD instructions, causing a small penalty.
55 - ❌ the underlying x86 SSE instruction is not exposed by the Wasm SIMD specification, so it must be emulated via a slow path, e.g. a sequence of several slower SIMD instructions, or a scalar implementation.
56 - �� the underlying x86 SSE opcode is not available in Wasm SIMD, and the implementation must resort to such a slow emulated path, that a workaround rethinking the algorithm at a higher level is advised.
57 - �� the given SSE intrinsic is available to let applications compile, but does nothing.
58 - ⚫ the given SSE intrinsic is not available. Referencing the intrinsic will cause a compiler error.
59
60Certain intrinsics in the table below are marked "virtual". This means that there does not actually exist a native x86 SSE instruction set opcode to implement them, but native compilers offer the function as a convenience. Different compilers might generate a different instruction sequence for these.
61
62.. list-table:: x86 SSE intrinsics available via #include <xmmintrin.h>
63   :widths: 20 30
64   :header-rows: 1
65
66   * - Intrinsic name
67     - WebAssembly SIMD support
68   * - _mm_set_ps
69     - ✅ wasm_f32x4_make
70   * - _mm_setr_ps
71     - ✅ wasm_f32x4_make
72   * - _mm_set_ss
73     - �� emulated with wasm_f32x4_make
74   * - _mm_set_ps1 (_mm_set1_ps)
75     - ✅ wasm_f32x4_splat
76   * - _mm_setzero_ps
77     - �� emulated with wasm_f32x4_const(0)
78   * - _mm_load_ps
79     - �� wasm_v128_load. VM must guess type. :raw-html:`<br />` Unaligned load on x86 CPUs.
80   * - _mm_loadl_pi
81     - ❌ No Wasm SIMD support. :raw-html:`<br />` Emulated with scalar loads + shuffle.
82   * - _mm_loadh_pi
83     - ❌ No Wasm SIMD support. :raw-html:`<br />` Emulated with scalar loads + shuffle.
84   * - _mm_loadr_ps
85     - �� Virtual. Simd load + shuffle.
86   * - _mm_loadu_ps
87     - �� wasm_v128_load. VM must guess type.
88   * - _mm_load_ps1 (_mm_load1_ps)
89     - �� Virtual. Simd load + shuffle.
90   * - _mm_load_ss
91     - ❌ emulated with wasm_f32x4_make
92   * - _mm_storel_pi
93     - ❌ scalar stores
94   * - _mm_storeh_pi
95     - ❌ shuffle + scalar stores
96   * - _mm_store_ps
97     - �� wasm_v128_store. VM must guess type. :raw-html:`<br />` Unaligned store on x86 CPUs.
98   * - _mm_stream_ps
99     - �� wasm_v128_store. VM must guess type. :raw-html:`<br />` No cache control in Wasm SIMD.
100   * - _mm_prefetch
101     - �� No-op.
102   * - _mm_sfence
103     - ⚠️ A full barrier in multithreaded builds.
104   * - _mm_shuffle_ps
105     - �� wasm_v32x4_shuffle. VM must guess type.
106   * - _mm_storer_ps
107     - �� Virtual. Shuffle + Simd store.
108   * - _mm_store_ps1 (_mm_store1_ps)
109     - �� Virtual. Emulated with shuffle. :raw-html:`<br />` Unaligned store on x86 CPUs.
110   * - _mm_store_ss
111     - �� emulated with scalar store
112   * - _mm_storeu_ps
113     - �� wasm_v128_store. VM must guess type.
114   * - _mm_storeu_si16
115     - �� emulated with scalar store
116   * - _mm_storeu_si64
117     - �� emulated with scalar store
118   * - _mm_movemask_ps
119     - �� No Wasm SIMD support. Emulated in scalar. `simd/#131 <https://github.com/WebAssembly/simd/issues/131>`_
120   * - _mm_move_ss
121     - �� emulated with a shuffle. VM must guess type.
122   * - _mm_add_ps
123     - ✅ wasm_f32x4_add
124   * - _mm_add_ss
125     - ⚠️ emulated with a shuffle
126   * - _mm_sub_ps
127     - ✅ wasm_f32x4_sub
128   * - _mm_sub_ss
129     - ⚠️ emulated with a shuffle
130   * - _mm_mul_ps
131     - ✅ wasm_f32x4_mul
132   * - _mm_mul_ss
133     - ⚠️ emulated with a shuffle
134   * - _mm_div_ps
135     - ✅ wasm_f32x4_div
136   * - _mm_div_ss
137     - ⚠️ emulated with a shuffle
138   * - _mm_min_ps
139     - TODO: pmin once it works
140   * - _mm_min_ss
141     - ⚠️ emulated with a shuffle
142   * - _mm_max_ps
143     - TODO: pmax once it works
144   * - _mm_max_ss
145     - ⚠️ emulated with a shuffle
146   * - _mm_rcp_ps
147     - ❌ No Wasm SIMD support. :raw-html:`<br />` Emulated with full precision div. `simd/#3 <https://github.com/WebAssembly/simd/issues/3>`_
148   * - _mm_rcp_ss
149     - ❌ No Wasm SIMD support. :raw-html:`<br />` Emulated with full precision div+shuffle `simd/#3 <https://github.com/WebAssembly/simd/issues/3>`_
150   * - _mm_sqrt_ps
151     - ✅ wasm_f32x4_sqrt
152   * - _mm_sqrt_ss
153     - ⚠️ emulated with a shuffle
154   * - _mm_rsqrt_ps
155     - ❌ No Wasm SIMD support. :raw-html:`<br />` Emulated with full precision div+sqrt. `simd/#3 <https://github.com/WebAssembly/simd/issues/3>`_
156   * - _mm_rsqrt_ss
157     - ❌ No Wasm SIMD support. :raw-html:`<br />` Emulated with full precision div+sqrt+shuffle. `simd/#3 <https://github.com/WebAssembly/simd/issues/3>`_
158   * - _mm_unpackhi_ps
159     - �� emulated with a shuffle
160   * - _mm_unpacklo_ps
161     - �� emulated with a shuffle
162   * - _mm_movehl_ps
163     - �� emulated with a shuffle
164   * - _mm_movelh_ps
165     - �� emulated with a shuffle
166   * - _MM_TRANSPOSE4_PS
167     - �� emulated with a shuffle
168   * - _mm_cmplt_ps
169     - ✅ wasm_f32x4_lt
170   * - _mm_cmplt_ss
171     - ⚠️ emulated with a shuffle
172   * - _mm_cmple_ps
173     - ✅ wasm_f32x4_le
174   * - _mm_cmple_ss
175     - ⚠️ emulated with a shuffle
176   * - _mm_cmpeq_ps
177     - ✅ wasm_f32x4_eq
178   * - _mm_cmpeq_ss
179     - ⚠️ emulated with a shuffle
180   * - _mm_cmpge_ps
181     - ✅ wasm_f32x4_ge
182   * - _mm_cmpge_ss
183     - ⚠️ emulated with a shuffle
184   * - _mm_cmpgt_ps
185     - ✅ wasm_f32x4_gt
186   * - _mm_cmpgt_ss
187     - ⚠️ emulated with a shuffle
188   * - _mm_cmpord_ps
189     - ❌ emulated with 2xcmp+and
190   * - _mm_cmpord_ss
191     - ❌ emulated with 2xcmp+and+shuffle
192   * - _mm_cmpunord_ps
193     - ❌ emulated with 2xcmp+or
194   * - _mm_cmpunord_ss
195     - ❌ emulated with 2xcmp+or+shuffle
196   * - _mm_and_ps
197     - �� wasm_v128_and. VM must guess type.
198   * - _mm_andnot_ps
199     - �� wasm_v128_andnot. VM must guess type.
200   * - _mm_or_ps
201     - �� wasm_v128_or. VM must guess type.
202   * - _mm_xor_ps
203     - �� wasm_v128_xor. VM must guess type.
204   * - _mm_cmpneq_ps
205     - ✅ wasm_f32x4_ne
206   * - _mm_cmpneq_ss
207     - ⚠️ emulated with a shuffle
208   * - _mm_cmpnge_ps
209     - ⚠️ emulated with not+ge
210   * - _mm_cmpnge_ss
211     - ⚠️ emulated with not+ge+shuffle
212   * - _mm_cmpngt_ps
213     - ⚠️ emulated with not+gt
214   * - _mm_cmpngt_ss
215     - ⚠️ emulated with not+gt+shuffle
216   * - _mm_cmpnle_ps
217     - ⚠️ emulated with not+le
218   * - _mm_cmpnle_ss
219     - ⚠️ emulated with not+le+shuffle
220   * - _mm_cmpnlt_ps
221     - ⚠️ emulated with not+lt
222   * - _mm_cmpnlt_ss
223     - ⚠️ emulated with not+lt+shuffle
224   * - _mm_comieq_ss
225     - ❌ scalarized
226   * - _mm_comige_ss
227     - ❌ scalarized
228   * - _mm_comigt_ss
229     - ❌ scalarized
230   * - _mm_comile_ss
231     - ❌ scalarized
232   * - _mm_comilt_ss
233     - ❌ scalarized
234   * - _mm_comineq_ss
235     - ❌ scalarized
236   * - _mm_ucomieq_ss
237     - ❌ scalarized
238   * - _mm_ucomige_ss
239     - ❌ scalarized
240   * - _mm_ucomigt_ss
241     - ❌ scalarized
242   * - _mm_ucomile_ss
243     - ❌ scalarized
244   * - _mm_ucomilt_ss
245     - ❌ scalarized
246   * - _mm_ucomineq_ss
247     - ❌ scalarized
248   * - _mm_cvtsi32_ss (_mm_cvt_si2ss)
249     - ❌ scalarized
250   * - _mm_cvtss_si32 (_mm_cvt_ss2si)
251     - �� scalar with complex emulated semantics
252   * - _mm_cvttss_si32 (_mm_cvtt_ss2si)
253     - �� scalar with complex emulated semantics
254   * - _mm_cvtsi64_ss
255     - ❌ scalarized
256   * - _mm_cvtss_si64
257     - �� scalar with complex emulated semantics
258   * - _mm_cvttss_si64
259     - �� scalar with complex emulated semantics
260   * - _mm_cvtss_f32
261     - �� scalar get
262   * - _mm_malloc
263     - ✅ Allocates memory with specified alignment.
264   * - _mm_free
265     - ✅ Aliases to free().
266   * - _MM_GET_EXCEPTION_MASK
267     - ✅ Always returns all exceptions masked (0x1f80).
268   * - _MM_GET_EXCEPTION_STATE
269     - ❌ Exception state is not tracked. Always returns 0.
270   * - _MM_GET_FLUSH_ZERO_MODE
271     - ✅ Always returns _MM_FLUSH_ZERO_OFF.
272   * - _MM_GET_ROUNDING_MODE
273     - ✅ Always returns _MM_ROUND_NEAREST.
274   * - _mm_getcsr
275     - ✅ Always returns _MM_FLUSH_ZERO_OFF :raw-html:`<br />` | _MM_ROUND_NEAREST | all exceptions masked (0x1f80).
276   * - _MM_SET_EXCEPTION_MASK
277     - ⚫ Not available. Fixed to all exceptions masked.
278   * - _MM_SET_EXCEPTION_STATE
279     - ⚫ Not available. Fixed to zero/clear state.
280   * - _MM_SET_FLUSH_ZERO_MODE
281     - ⚫ Not available. Fixed to _MM_FLUSH_ZERO_OFF.
282   * - _MM_SET_ROUNDING_MODE
283     - ⚫ Not available. Fixed to _MM_ROUND_NEAREST.
284   * - _mm_setcsr
285     - ⚫ Not available.
286   * - _mm_undefined_ps
287     - ✅ Virtual
288
289⚫ The following extensions that SSE1 instruction set brought to 64-bit wide MMX registers are not available:
290 - _mm_avg_pu8, _mm_avg_pu16, _mm_cvt_pi2ps, _mm_cvt_ps2pi, _mm_cvt_pi16_ps, _mm_cvt_pi32_ps, _mm_cvt_pi32x2_ps, _mm_cvt_pi8_ps, _mm_cvt_ps_pi16, _mm_cvt_ps_pi32, _mm_cvt_ps_pi8, _mm_cvt_pu16_ps, _mm_cvt_pu8_ps, _mm_cvtt_ps2pi, _mm_cvtt_pi16_ps, _mm_cvttps_pi32, _mm_extract_pi16, _mm_insert_pi16, _mm_maskmove_si64, _m_maskmovq, _mm_max_pi16, _mm_max_pu8, _mm_min_pi16, _mm_min_pu8, _mm_movemask_pi8, _mm_mulhi_pu16, _m_pavgb, _m_pavgw, _m_pextrw, _m_pinsrw, _m_pmaxsw, _m_pmaxub, _m_pminsw, _m_pminub, _m_pmovmskb, _m_pmulhuw, _m_psadbw, _m_pshufw, _mm_sad_pu8, _mm_shuffle_pi16 and _mm_stream_pi.
291
292Any code referencing these intrinsics will not compile.
293
294The following table highlights the availability and expected performance of different SSE2 intrinsics. Refer to `Intel Intrinsics Guide on SSE2 <https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE2>`_.
295
296.. list-table:: x86 SSE2 intrinsics available via #include <emmintrin.h>
297   :widths: 20 30
298   :header-rows: 1
299
300   * - Intrinsic name
301     - WebAssembly SIMD support
302   * - _mm_add_epi16
303     - ✅ wasm_i16x8_add
304   * - _mm_add_epi32
305     - ✅ wasm_i32x4_add
306   * - _mm_add_epi64
307     - ✅ wasm_i64x2_add
308   * - _mm_add_epi8
309     - ✅ wasm_i8x16_add
310   * - _mm_add_pd
311     - ✅ wasm_f64x2_add
312   * - _mm_add_sd
313     - ⚠️ emulated with a shuffle
314   * - _mm_adds_epi16
315     - ✅ wasm_i16x8_add_saturate
316   * - _mm_adds_epi8
317     - ✅ wasm_i8x16_add_saturate
318   * - _mm_adds_epu16
319     - ✅ wasm_u16x8_add_saturate
320   * - _mm_adds_epu8
321     - ✅ wasm_u8x16_add_saturate
322   * - _mm_and_pd
323     - �� wasm_v128_and. VM must guess type.
324   * - _mm_and_si128
325     - �� wasm_v128_and. VM must guess type.
326   * - _mm_andnot_pd
327     - �� wasm_v128_andnot. VM must guess type.
328   * - _mm_andnot_si128
329     - �� wasm_v128_andnot. VM must guess type.
330   * - _mm_avg_epu16
331     - ✅ wasm_u16x8_avgr
332   * - _mm_avg_epu8
333     - ✅ wasm_u8x16_avgr
334   * - _mm_castpd_ps
335     - ✅ no-op
336   * - _mm_castpd_si128
337     - ✅ no-op
338   * - _mm_castps_pd
339     - ✅ no-op
340   * - _mm_castps_si128
341     - ✅ no-op
342   * - _mm_castsi128_pd
343     - ✅ no-op
344   * - _mm_castsi128_ps
345     - ✅ no-op
346   * - _mm_clflush
347     - �� No-op. No cache hinting in Wasm SIMD.
348   * - _mm_cmpeq_epi16
349     - ✅ wasm_i16x8_eq
350   * - _mm_cmpeq_epi32
351     - ✅ wasm_i32x4_eq
352   * - _mm_cmpeq_epi8
353     - ✅ wasm_i8x16_eq
354   * - _mm_cmpeq_pd
355     - ✅ wasm_f64x2_eq
356   * - _mm_cmpeq_sd
357     - ⚠️ emulated with a shuffle
358   * - _mm_cmpge_pd
359     - ✅ wasm_f64x2_ge
360   * - _mm_cmpge_sd
361     - ⚠️ emulated with a shuffle
362   * - _mm_cmpgt_epi16
363     - ✅ wasm_i16x8_gt
364   * - _mm_cmpgt_epi32
365     - ✅ wasm_i32x4_gt
366   * - _mm_cmpgt_epi8
367     - ✅ wasm_i8x16_gt
368   * - _mm_cmpgt_pd
369     - ✅ wasm_f64x2_gt
370   * - _mm_cmpgt_sd
371     - ⚠️ emulated with a shuffle
372   * - _mm_cmple_pd
373     - ✅ wasm_f64x2_le
374   * - _mm_cmple_sd
375     - ⚠️ emulated with a shuffle
376   * - _mm_cmplt_epi16
377     - ✅ wasm_i16x8_lt
378   * - _mm_cmplt_epi32
379     - ✅ wasm_i32x4_lt
380   * - _mm_cmplt_epi8
381     - ✅ wasm_i8x16_lt
382   * - _mm_cmplt_pd
383     - ✅ wasm_f64x2_lt
384   * - _mm_cmplt_sd
385     - ⚠️ emulated with a shuffle
386   * - _mm_cmpneq_pd
387     - ✅ wasm_f64x2_ne
388   * - _mm_cmpneq_sd
389     - ⚠️ emulated with a shuffle
390   * - _mm_cmpnge_pd
391     - ⚠️ emulated with not+ge
392   * - _mm_cmpnge_sd
393     - ⚠️ emulated with not+ge+shuffle
394   * - _mm_cmpngt_pd
395     - ⚠️ emulated with not+gt
396   * - _mm_cmpngt_sd
397     - ⚠️ emulated with not+gt+shuffle
398   * - _mm_cmpnle_pd
399     - ⚠️ emulated with not+le
400   * - _mm_cmpnle_sd
401     - ⚠️ emulated with not+le+shuffle
402   * - _mm_cmpnlt_pd
403     - ⚠️ emulated with not+lt
404   * - _mm_cmpnlt_sd
405     - ⚠️ emulated with not+lt+shuffle
406   * - _mm_cmpord_pd
407     - ❌ emulated with 2xcmp+and
408   * - _mm_cmpord_sd
409     - ❌ emulated with 2xcmp+and+shuffle
410   * - _mm_cmpunord_pd
411     - ❌ emulated with 2xcmp+or
412   * - _mm_cmpunord_sd
413     - ❌ emulated with 2xcmp+or+shuffle
414   * - _mm_comieq_sd
415     - ❌ scalarized
416   * - _mm_comige_sd
417     - ❌ scalarized
418   * - _mm_comigt_sd
419     - ❌ scalarized
420   * - _mm_comile_sd
421     - ❌ scalarized
422   * - _mm_comilt_sd
423     - ❌ scalarized
424   * - _mm_comineq_sd
425     - ❌ scalarized
426   * - _mm_cvtepi32_pd
427     - ❌ scalarized
428   * - _mm_cvtepi32_ps
429     - ✅ wasm_f32x4_convert_i32x4
430   * - _mm_cvtpd_epi32
431     - ❌ scalarized
432   * - _mm_cvtpd_ps
433     - ❌ scalarized
434   * - _mm_cvtps_epi32
435     - ❌ scalarized
436   * - _mm_cvtps_pd
437     - ❌ scalarized
438   * - _mm_cvtsd_f64
439     - ✅ wasm_f64x2_extract_lane
440   * - _mm_cvtsd_si32
441     - ❌ scalarized
442   * - _mm_cvtsd_si64
443     - ❌ scalarized
444   * - _mm_cvtsd_si64x
445     - ❌ scalarized
446   * - _mm_cvtsd_ss
447     - ❌ scalarized
448   * - _mm_cvtsi128_si32
449     - ✅ wasm_i32x4_extract_lane
450   * - _mm_cvtsi128_si64 (_mm_cvtsi128_si64x)
451     - ✅ wasm_i64x2_extract_lane
452   * - _mm_cvtsi32_sd
453     - ❌ scalarized
454   * - _mm_cvtsi32_si128
455     - �� emulated with wasm_i32x4_make
456   * - _mm_cvtsi64_sd (_mm_cvtsi64x_sd)
457     - ❌ scalarized
458   * - _mm_cvtsi64_si128 (_mm_cvtsi64x_si128)
459     - �� emulated with wasm_i64x2_make
460   * - _mm_cvtss_sd
461     - ❌ scalarized
462   * - _mm_cvttpd_epi32
463     - ❌ scalarized
464   * - _mm_cvttps_epi32
465     - ❌ scalarized
466   * - _mm_cvttsd_si32
467     - ❌ scalarized
468   * - _mm_cvttsd_si64 (_mm_cvttsd_si64x)
469     - ❌ scalarized
470   * - _mm_div_pd
471     - ✅ wasm_f64x2_div
472   * - _mm_div_sd
473     - ⚠️ emulated with a shuffle
474   * - _mm_extract_epi16
475     - ✅ wasm_u16x8_extract_lane
476   * - _mm_insert_epi16
477     - ✅ wasm_i16x8_replace_lane
478   * - _mm_lfence
479     - ⚠️ A full barrier in multithreaded builds.
480   * - _mm_load_pd
481     - �� wasm_v128_load. VM must guess type. :raw-html:`<br />` Unaligned load on x86 CPUs.
482   * - _mm_load1_pd (_mm_load_pd1)
483     - �� Virtual. v64x2.load_splat, VM must guess type.
484   * - _mm_load_sd
485     - ❌ emulated with wasm_f64x2_make
486   * - _mm_load_si128
487     - �� wasm_v128_load. VM must guess type. :raw-html:`<br />` Unaligned load on x86 CPUs.
488   * - _mm_loadh_pd
489     - ❌ No Wasm SIMD support. :raw-html:`<br />` Emulated with scalar loads + shuffle.
490   * - _mm_loadl_epi64
491     - ❌ No Wasm SIMD support. :raw-html:`<br />` Emulated with scalar loads + shuffle.
492   * - _mm_loadl_pd
493     - ❌ No Wasm SIMD support. :raw-html:`<br />` Emulated with scalar loads + shuffle.
494   * - _mm_loadr_pd
495     - �� Virtual. Simd load + shuffle.
496   * - _mm_loadu_pd
497     - �� wasm_v128_load. VM must guess type.
498   * - _mm_loadu_si128
499     - �� wasm_v128_load. VM must guess type.
500   * - _mm_loadu_si32
501     - ❌ emulated with wasm_i32x4_make
502   * - _mm_madd_epi16
503     - ❌ scalarized
504   * - _mm_maskmoveu_si128
505     - ❌ scalarized
506   * - _mm_max_epi16
507     - ✅ wasm_i16x8_max
508   * - _mm_max_epu8
509     - ✅ wasm_u8x16_max
510   * - _mm_max_pd
511     - TODO: migrate to wasm_f64x2_pmax
512   * - _mm_max_sd
513     - ⚠️ emulated with a shuffle
514   * - _mm_mfence
515     - ⚠️ A full barrier in multithreaded builds.
516   * - _mm_min_epi16
517     - ✅ wasm_i16x8_min
518   * - _mm_min_epu8
519     - ✅ wasm_u8x16_min
520   * - _mm_min_pd
521     - TODO: migrate to wasm_f64x2_pmin
522   * - _mm_min_sd
523     - ⚠️ emulated with a shuffle
524   * - _mm_move_epi64
525     - �� emulated with a shuffle. VM must guess type.
526   * - _mm_move_sd
527     - �� emulated with a shuffle. VM must guess type.
528   * - _mm_movemask_epi8
529     - ❌ scalarized
530   * - _mm_movemask_pd
531     - ❌ scalarized
532   * - _mm_mul_epu32
533     - ❌ scalarized
534   * - _mm_mul_pd
535     - ✅ wasm_f64x2_mul
536   * - _mm_mul_sd
537     - ⚠️ emulated with a shuffle
538   * - _mm_mulhi_epi16
539     - ❌ scalarized
540   * - _mm_mulhi_epu16
541     - ❌ scalarized
542   * - _mm_mullo_epi16
543     - ✅ wasm_i16x8_mul
544   * - _mm_or_pd
545     - �� wasm_v128_or. VM must guess type.
546   * - _mm_or_si128
547     - �� wasm_v128_or. VM must guess type.
548   * - _mm_packs_epi16
549     - ❌ scalarized
550   * - _mm_packs_epi32
551     - ❌ scalarized
552   * - _mm_packus_epi16
553     - ❌ scalarized
554   * - _mm_pause
555     - �� No-op.
556   * - _mm_sad_epu8
557     - ❌ scalarized
558   * - _mm_set_epi16
559     - ✅ wasm_i16x8_make
560   * - _mm_set_epi32
561     - ✅ wasm_i32x4_make
562   * - _mm_set_epi64 (_mm_set_epi64x)
563     - ✅ wasm_i64x2_make
564   * - _mm_set_epi8
565     - ✅ wasm_i8x16_make
566   * - _mm_set_pd
567     - ✅ wasm_f64x2_make
568   * - _mm_set_sd
569     - �� emulated with wasm_f64x2_make
570   * - _mm_set1_epi16
571     - ✅ wasm_i16x8_splat
572   * - _mm_set1_epi32
573     - ✅ wasm_i32x4_splat
574   * - _mm_set1_epi64 (_mm_set1_epi64x)
575     - ✅ wasm_i64x2_splat
576   * - _mm_set1_epi8
577     - ✅ wasm_i8x16_splat
578   * - _mm_set1_pd (_mm_set_pd1)
579     - ✅ wasm_f64x2_splat
580   * - _mm_setr_epi16
581     - ✅ wasm_i16x8_make
582   * - _mm_setr_epi32
583     - ✅ wasm_i32x4_make
584   * - _mm_setr_epi64
585     - ✅ wasm_i64x2_make
586   * - _mm_setr_epi8
587     - ✅ wasm_i8x16_make
588   * - _mm_setr_pd
589     - ✅ wasm_f64x2_make
590   * - _mm_setzero_pd
591     - �� emulated with wasm_f64x2_const
592   * - _mm_setzero_si128
593     - �� emulated with wasm_i64x2_const
594   * - _mm_shuffle_epi32
595     - �� emulated with a general shuffle
596   * - _mm_shuffle_pd
597     - �� emulated with a general shuffle
598   * - _mm_shufflehi_epi16
599     - �� emulated with a general shuffle
600   * - _mm_shufflelo_epi16
601     - �� emulated with a general shuffle
602   * - _mm_sll_epi16
603     - ❌ scalarized
604   * - _mm_sll_epi32
605     - ❌ scalarized
606   * - _mm_sll_epi64
607     - ❌ scalarized
608   * - _mm_slli_epi16
609     - �� wasm_i16x8_shl :raw-html:`<br />` ✅ if shift count is immediate constant.
610   * - _mm_slli_epi32
611     - �� wasm_i32x4_shl :raw-html:`<br />` ✅ if shift count is immediate constant.
612   * - _mm_slli_epi64
613     - �� wasm_i64x2_shl :raw-html:`<br />` ✅ if shift count is immediate constant.
614   * - _mm_slli_si128 (_mm_bslli_si128)
615     - �� emulated with a general shuffle
616   * - _mm_sqrt_pd
617     - ✅ wasm_f64x2_sqrt
618   * - _mm_sqrt_sd
619     - ⚠️ emulated with a shuffle
620   * - _mm_sra_epi16
621     - ❌ scalarized
622   * - _mm_sra_epi32
623     - ❌ scalarized
624   * - _mm_srai_epi16
625     - �� wasm_i16x8_shr :raw-html:`<br />` ✅ if shift count is immediate constant.
626   * - _mm_srai_epi32
627     - �� wasm_i32x4_shr :raw-html:`<br />` ✅ if shift count is immediate constant.
628   * - _mm_srl_epi16
629     - ❌ scalarized
630   * - _mm_srl_epi32
631     - ❌ scalarized
632   * - _mm_srl_epi64
633     - ❌ scalarized
634   * - _mm_srli_epi16
635     - �� wasm_u16x8_shr :raw-html:`<br />` ✅ if shift count is immediate constant.
636   * - _mm_srli_epi32
637     - �� wasm_u32x4_shr :raw-html:`<br />` ✅ if shift count is immediate constant.
638   * - _mm_srli_epi64
639     - �� wasm_u64x2_shr :raw-html:`<br />` ✅ if shift count is immediate constant.
640   * - _mm_srli_si128 (_mm_bsrli_si128)
641     - �� emulated with a general shuffle
642   * - _mm_store_pd
643     - �� wasm_v128_store. VM must guess type. :raw-html:`<br />` Unaligned store on x86 CPUs.
644   * - _mm_store_sd
645     - �� emulated with scalar store
646   * - _mm_store_si128
647     - �� wasm_v128_store. VM must guess type. :raw-html:`<br />` Unaligned store on x86 CPUs.
648   * - _mm_store1_pd (_mm_store_pd1)
649     - �� Virtual. Emulated with shuffle. :raw-html:`<br />` Unaligned store on x86 CPUs.
650   * - _mm_storeh_pd
651     - ❌ shuffle + scalar stores
652   * - _mm_storel_epi64
653     - ❌ scalar store
654   * - _mm_storel_pd
655     - ❌ scalar store
656   * - _mm_storer_pd
657     - ❌ shuffle + scalar stores
658   * - _mm_storeu_pd
659     - �� wasm_v128_store. VM must guess type.
660   * - _mm_storeu_si128
661     - �� wasm_v128_store. VM must guess type.
662   * - _mm_storeu_si32
663     - �� emulated with scalar store
664   * - _mm_stream_pd
665     - �� wasm_v128_store. VM must guess type. :raw-html:`<br />` No cache control in Wasm SIMD.
666   * - _mm_stream_si128
667     - �� wasm_v128_store. VM must guess type. :raw-html:`<br />` No cache control in Wasm SIMD.
668   * - _mm_stream_si32
669     - �� wasm_v128_store. VM must guess type. :raw-html:`<br />` No cache control in Wasm SIMD.
670   * - _mm_stream_si64
671     - �� wasm_v128_store. VM must guess type. :raw-html:`<br />` No cache control in Wasm SIMD.
672   * - _mm_sub_epi16
673     - ✅ wasm_i16x8_sub
674   * - _mm_sub_epi32
675     - ✅ wasm_i32x4_sub
676   * - _mm_sub_epi64
677     - ✅ wasm_i64x2_sub
678   * - _mm_sub_epi8
679     - ✅ wasm_i8x16_sub
680   * - _mm_sub_pd
681     - ✅ wasm_f64x2_sub
682   * - _mm_sub_sd
683     - ⚠️ emulated with a shuffle
684   * - _mm_subs_epi16
685     - ✅ wasm_i16x8_sub_saturate
686   * - _mm_subs_epi8
687     - ✅ wasm_i8x16_sub_saturate
688   * - _mm_subs_epu16
689     - ✅ wasm_u16x8_sub_saturate
690   * - _mm_subs_epu8
691     - ✅ wasm_u8x16_sub_saturate
692   * - _mm_ucomieq_sd
693     - ❌ scalarized
694   * - _mm_ucomige_sd
695     - ❌ scalarized
696   * - _mm_ucomigt_sd
697     - ❌ scalarized
698   * - _mm_ucomile_sd
699     - ❌ scalarized
700   * - _mm_ucomilt_sd
701     - ❌ scalarized
702   * - _mm_ucomineq_sd
703     - ❌ scalarized
704   * - _mm_undefined_pd
705     - ✅ Virtual
706   * - _mm_undefined_si128
707     - ✅ Virtual
708   * - _mm_unpackhi_epi16
709     - �� emulated with a shuffle
710   * - _mm_unpackhi_epi32
711     - �� emulated with a shuffle
712   * - _mm_unpackhi_epi64
713     - �� emulated with a shuffle
714   * - _mm_unpackhi_epi8
715     - �� emulated with a shuffle
716   * - _mm_unpachi_pd
717     - �� emulated with a shuffle
718   * - _mm_unpacklo_epi16
719     - �� emulated with a shuffle
720   * - _mm_unpacklo_epi32
721     - �� emulated with a shuffle
722   * - _mm_unpacklo_epi64
723     - �� emulated with a shuffle
724   * - _mm_unpacklo_epi8
725     - �� emulated with a shuffle
726   * - _mm_unpacklo_pd
727     - �� emulated with a shuffle
728   * - _mm_xor_pd
729     - �� wasm_v128_or. VM must guess type.
730   * - _mm_xor_si128
731     - �� wasm_v128_or. VM must guess type.
732
733⚫ The following extensions that SSE2 instruction set brought to 64-bit wide MMX registers are not available:
734 - _mm_add_si64, _mm_movepi64_pi64, _mm_movpi64_epi64, _mm_mul_su32, _mm_sub_si64, _mm_cvtpd_pi32, _mm_cvtpi32_pd, _mm_cvttpd_pi32
735
736Any code referencing these intrinsics will not compile.
737
738The following table highlights the availability and expected performance of different SSE3 intrinsics. Refer to `Intel Intrinsics Guide on SSE3 <https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE3>`_.
739
740.. list-table:: x86 SSE3 intrinsics available via #include <pmmintrin.h>
741   :widths: 20 30
742   :header-rows: 1
743
744   * - Intrinsic name
745     - WebAssembly SIMD support
746   * - _mm_lddqu_si128
747     - ✅ wasm_v128_load.
748   * - _mm_addsub_ps
749     - ⚠️ emulated with a SIMD add+mul+const
750   * - _mm_hadd_ps
751     - ⚠️ emulated with a SIMD add+two shuffles
752   * - _mm_hsub_ps
753     - ⚠️ emulated with a SIMD sub+two shuffles
754   * - _mm_movehdup_ps
755     - �� emulated with a general shuffle
756   * - _mm_moveldup_ps
757     - �� emulated with a general shuffle
758   * - _mm_addsub_pd
759     - ⚠️ emulated with a SIMD add+mul+const
760   * - _mm_hadd_pd
761     - ⚠️ emulated with a SIMD add+two shuffles
762   * - _mm_hsub_pd
763     - ⚠️ emulated with a SIMD add+two shuffles
764   * - _mm_loaddup_pd
765     - �� Scalar load + splat.
766   * - _mm_movedup_pd
767     - �� emulated with a general shuffle
768   * - _MM_GET_DENORMALS_ZERO_MODE
769     - ✅ Always returns _MM_DENORMALS_ZERO_ON. I.e. denormals are available.
770   * - _MM_SET_DENORMALS_ZERO_MODE
771     - ⚫ Not available. Fixed to _MM_DENORMALS_ZERO_ON.
772   * - _mm_monitor
773     - ⚫ Not available.
774   * - _mm_mwait
775     - ⚫ Not available.
776
777The following table highlights the availability and expected performance of different SSSE3 intrinsics. Refer to `Intel Intrinsics Guide on SSSE3 <https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSSE3>`_.
778
779.. list-table:: x86 SSSE3 intrinsics available via #include <tmmintrin.h>
780   :widths: 20 30
781   :header-rows: 1
782
783   * - Intrinsic name
784     - WebAssembly SIMD support
785   * - _mm_abs_epi8
786     - ⚠️ emulated with a SIMD shift+xor+add
787   * - _mm_abs_epi16
788     - ⚠️ emulated with a SIMD shift+xor+add
789   * - _mm_abs_epi32
790     - ⚠️ emulated with a SIMD shift+xor+add
791   * - _mm_alignr_epi8
792     - ⚠️ emulated with a SIMD or+two shifts
793   * - _mm_hadd_epi16
794     - ⚠️ emulated with a SIMD add+two shuffles
795   * - _mm_hadd_epi32
796     - ⚠️ emulated with a SIMD add+two shuffles
797   * - _mm_hadds_epi16
798     - ⚠️ emulated with a SIMD adds+two shuffles
799   * - _mm_hsub_epi16
800     - ⚠️ emulated with a SIMD sub+two shuffles
801   * - _mm_hsub_epi32
802     - ⚠️ emulated with a SIMD sub+two shuffles
803   * - _mm_hsubs_epi16
804     - ⚠️ emulated with a SIMD subs+two shuffles
805   * - _mm_maddubs_epi16
806     - �� scalarized
807   * - _mm_mulhrs_epi16
808     - �� scalarized (TODO: emulatable in SIMD?)
809   * - _mm_shuffle_epi8
810     - �� scalarized (TODO: use wasm_v8x16_swizzle when available)
811   * - _mm_sign_epi8
812     - ⚠️ emulated with a SIMD complex shuffle+cmp+xor+andnot
813   * - _mm_sign_epi16
814     - ⚠️ emulated with a SIMD shr+cmp+xor+andnot
815   * - _mm_sign_epi32
816     - ⚠️ emulated with a SIMD shr+cmp+xor+andnot
817
818⚫ The SSSE3 functions that deal with 64-bit wide MMX registers are not available:
819 -  _mm_abs_pi8, _mm_abs_pi16, _mm_abs_pi32, _mm_alignr_pi8, _mm_hadd_pi16, _mm_hadd_pi32, _mm_hadds_pi16, _mm_hsub_pi16, _mm_hsub_pi32, _mm_hsubs_pi16, _mm_maddubs_pi16, _mm_mulhrs_pi16, _mm_shuffle_pi8, _mm_sign_pi8, _mm_sign_pi16 and _mm_sign_pi32
820
821Any code referencing these intrinsics will not compile.
822
823The following table highlights the availability and expected performance of different SSE4.1 intrinsics. Refer to `Intel Intrinsics Guide on SSE4.1 <https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE4_1>`_.
824
825.. list-table:: x86 SSE4.1 intrinsics available via #include <smmintrin.h>
826   :widths: 20 30
827   :header-rows: 1
828
829   * - Intrinsic name
830     - WebAssembly SIMD support
831   * - _mm_blend_epi16
832     - �� emulated with a general shuffle
833   * - _mm_blend_pd
834     - �� emulated with a general shuffle
835   * - _mm_blend_ps
836     - �� emulated with a general shuffle
837   * - _mm_blendv_epi8
838     - ⚠️ emulated with a SIMD shr+and+andnot+or
839   * - _mm_blendv_pd
840     - ⚠️ emulated with a SIMD shr+and+andnot+or
841   * - _mm_blendv_ps
842     - ⚠️ emulated with a SIMD shr+and+andnot+or
843   * - _mm_ceil_pd
844     - ❌ scalarized
845   * - _mm_ceil_ps
846     - ❌ scalarized
847   * - _mm_ceil_sd
848     - ❌ scalarized
849   * - _mm_ceil_ss
850     - ❌ scalarized
851   * - _mm_cmpeq_epi64
852     - ❌ scalarized
853   * - _mm_cvtepi16_epi32
854     - ✅ wasm_i32x4_widen_low_i16x8
855   * - _mm_cvtepi16_epi64
856     - ❌ scalarized
857   * - _mm_cvtepi32_epi64
858     - ❌ scalarized
859   * - _mm_cvtepi8_epi16
860     - ✅ wasm_i16x8_widen_low_i8x16
861   * - _mm_cvtepi8_epi32
862     - ❌ scalarized
863   * - _mm_cvtepi8_epi64
864     - ❌ scalarized
865   * - _mm_cvtepu16_epi32
866     - ✅ wasm_i32x4_widen_low_u16x8
867   * - _mm_cvtepu16_epi64
868     - ❌ scalarized
869   * - _mm_cvtepu32_epi64
870     - ❌ scalarized
871   * - _mm_cvtepu8_epi16
872     - ✅ wasm_i16x8_widen_low_u8x16
873   * - _mm_cvtepu8_epi32
874     - ❌ scalarized
875   * - _mm_cvtepu8_epi64
876     - ❌ scalarized
877   * - _mm_dp_pd
878     - ⚠️ emulated with SIMD mul+add+setzero+2xblend
879   * - _mm_dp_ps
880     - ⚠️ emulated with SIMD mul+add+setzero+2xblend
881   * - _mm_extract_epi32
882     - ✅ wasm_i32x4_extract_lane
883   * - _mm_extract_epi64
884     - ✅ wasm_i64x2_extract_lane
885   * - _mm_extract_epi8
886     - ✅ wasm_u8x16_extract_lane
887   * - _mm_extract_ps
888     - ✅ wasm_i32x4_extract_lane
889   * - _mm_floor_pd
890     - ❌ scalarized
891   * - _mm_floor_ps
892     - ❌ scalarized
893   * - _mm_floor_sd
894     - ❌ scalarized
895   * - _mm_floor_ss
896     - ❌ scalarized
897   * - _mm_insert_epi32
898     - ✅ wasm_i32x4_replace_lane
899   * - _mm_insert_epi64
900     - ✅ wasm_i64x2_replace_lane
901   * - _mm_insert_epi8
902     - ✅ wasm_i8x16_replace_lane
903   * - _mm_insert_ps
904     - ⚠️ emulated with generic non-SIMD-mapping shuffles
905   * - _mm_max_epi32
906     - ✅ wasm_i32x4_max
907   * - _mm_max_epi8
908     - ✅ wasm_i8x16_max
909   * - _mm_max_epu16
910     - ✅ wasm_u16x8_max
911   * - _mm_max_epu32
912     - ✅ wasm_u32x4_max
913   * - _mm_min_epi32
914     - ✅ wasm_i32x4_min
915   * - _mm_min_epi8
916     - ✅ wasm_i8x16_min
917   * - _mm_min_epu16
918     - ✅ wasm_u16x8_min
919   * - _mm_min_epu32
920     - ✅ wasm_u32x4_min
921   * - _mm_minpos_epu16
922     - �� scalarized
923   * - _mm_mpsadbw_epu8
924     - �� scalarized
925   * - _mm_mul_epi32
926     - ❌ scalarized
927   * - _mm_mullo_epi32
928     - ✅ wasm_i32x4_mul
929   * - _mm_packus_epi32
930     - ✅ wasm_u16x8_narrow_i32x4
931   * - _mm_round_pd
932     - �� scalarized
933   * - _mm_round_ps
934     - �� scalarized
935   * - _mm_round_sd
936     - �� scalarized
937   * - _mm_round_ss
938     - �� scalarized
939   * - _mm_stream_load_si128
940     - �� wasm_v128_load. VM must guess type. :raw-html:`<br />` Unaligned load on x86 CPUs.
941   * - _mm_test_all_ones
942     - ❌ scalarized
943   * - _mm_test_all_zeros
944     - ❌ scalarized
945   * - _mm_test_mix_ones_zeros
946     - ❌ scalarized
947   * - _mm_testc_si128
948     - ❌ scalarized
949   * - _mm_test_nzc_si128
950     - ❌ scalarized
951   * - _mm_testz_si128
952     - ❌ scalarized
953
954The following table highlights the availability and expected performance of different SSE4.2 intrinsics. Refer to `Intel Intrinsics Guide on SSE4.2 <https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE4_2>`_.
955
956.. list-table:: x86 SSE4.1 intrinsics available via #include <smmintrin.h>
957   :widths: 20 30
958   :header-rows: 1
959
960   * - Intrinsic name
961     - WebAssembly SIMD support
962   * - _mm_cmpgt_epi64
963     - ❌ scalarized
964
965⚫ The SSE4.2 functions that deal with string comparisons and CRC calculations are not available:
966 - _mm_cmpestra, _mm_cmpestrc, _mm_cmpestri, _mm_cmpestrm, _mm_cmpestro, _mm_cmpestrs, _mm_cmpestrz, _mm_cmpistra, _mm_cmpistrc, _mm_cmpistri, _mm_cmpistrm, _mm_cmpistro, _mm_cmpistrs, _mm_cmpistrz, _mm_crc32_u16, _mm_crc32_u32, _mm_crc32_u64, _mm_crc32_u8
967
968Any code referencing these intrinsics will not compile.
969
970The following table highlights the availability and expected performance of different AVX intrinsics. Refer to `Intel Intrinsics Guide on AVX <https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX>`_.
971
972.. list-table:: x86 AVX intrinsics available via #include <immintrin.h>
973   :widths: 20 30
974   :header-rows: 1
975
976   * - Intrinsic name
977     - WebAssembly SIMD support
978   * - _mm_broadcast_ss
979     - ✅ wasm_v32x4_load_splat
980   * - _mm_cmp_pd
981     - ⚠️ emulated with 1-2 SIMD cmp+and/or
982   * - _mm_cmp_ps
983     - ⚠️ emulated with 1-2 SIMD cmp+and/or
984   * - _mm_cmp_sd
985     - ⚠️ emulated with 1-2 SIMD cmp+and/or+move
986   * - _mm_cmp_ss
987     - ⚠️ emulated with 1-2 SIMD cmp+and/or+move
988   * - _mm_maskload_pd
989     - ⚠️ emulated with SIMD load+shift+and
990   * - _mm_maskload_ps
991     - ⚠️ emulated with SIMD load+shift+and
992   * - _mm_maskstore_pd
993     - ❌ scalarized
994   * - _mm_maskstore_ps
995     - ❌ scalarized
996   * - _mm_permute_pd
997     - �� emulated with a general shuffle
998   * - _mm_permute_ps
999     - �� emulated with a general shuffle
1000   * - _mm_permutevar_pd
1001     - �� scalarized
1002   * - _mm_permutevar_ps
1003     - �� scalarized
1004   * - _mm_testc_pd
1005     - �� emulated with complex SIMD+scalar sequence
1006   * - _mm_testc_ps
1007     - �� emulated with complex SIMD+scalar sequence
1008   * - _mm_testnzc_pd
1009     - �� emulated with complex SIMD+scalar sequence
1010   * - _mm_testnzc_ps
1011     - �� emulated with complex SIMD+scalar sequence
1012   * - _mm_testz_pd
1013     - �� emulated with complex SIMD+scalar sequence
1014   * - _mm_testz_ps
1015     - �� emulated with complex SIMD+scalar sequence
1016
1017Only the 128-bit wide instructions from AVX instruction set are available. 256-bit wide AVX instructions are not provided.
1018
1019
1020======================================================
1021Compiling SIMD code targeting ARM NEON instruction set
1022======================================================
1023
1024Emscripten supports compiling existing codebases that use ARM NEON by
1025passing the `-mfpu=neon` directive to the compiler, and including the
1026header `<arm_neon.h>`.
1027
1028In terms of performance, it is very important to note that only
1029instructions which operate on 128-bit wide vectors are supported
1030cleanly. This means that nearly any instruction which is not of a "q"
1031variant (i.e. "vaddq" as opposed to "vadd") will be scalarized.
1032
1033These are pulled from `SIMDe repository on Github
1034<https://github.com/simd-everywhere/simde>`_. To update emscripten
1035with the latest SIMDe version, run `tools/simde_update.py`.
1036
1037The following table highlights the availability of various 128-bit
1038wide intrinsics.
1039
1040Similarly to above, the following legend is used:
1041 - ✅ Wasm SIMD has a native opcode that matches the NEON instruction, should yield native performance
1042 - �� while the Wasm SIMD spec does not provide a proper performance guarantee, given a suitably smart enough compiler and a runtime VM path, this intrinsic should be able to generate the identical native NEON instruction.
1043 - ⚠️ the underlying NEON instruction is not available, but it is emulated via at most few other Wasm SIMD instructions, causing a small penalty.
1044 - ❌ the underlying NEON instruction is not exposed by the Wasm SIMD specification, so it must be emulated via a slow path, e.g. a sequence of several slower SIMD instructions, or a scalar implementation.
1045 - ⚫ the given NEON intrinsic is not available. Referencing the intrinsic will cause a compiler error.
1046
1047For detailed information on each intrinsic function, refer to `NEON Intrinsics Reference
1048<https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics>`_.
1049
1050.. list-table:: NEON Intrinsics
1051   :widths: 20 30
1052   :header-rows: 1
1053
1054   * - Intrinsic name
1055     - Wasm SIMD Support
1056   * - vaba
1057     - ⚫ Not implemented, will trigger compiler error
1058   * - vabal
1059     - ⚫ Not implemented, will trigger compiler error
1060   * - vabd
1061     - ⚫ Not implemented, will trigger compiler error
1062   * - vabdl
1063     - ⚫ Not implemented, will trigger compiler error
1064   * - vabs
1065     - native
1066   * - vadd
1067     - native
1068   * - vaddl
1069     - ⚫ Not implemented, will trigger compiler error
1070   * - vaddlv
1071     - ⚫ Not implemented, will trigger compiler error
1072   * - vaddv
1073     - ⚫ Not implemented, will trigger compiler error
1074   * - vaddw
1075     - ❌ Will be emulated with slow instructions, or scalarized
1076   * - vand
1077     - native
1078   * - vbic
1079     - ⚫ Not implemented, will trigger compiler error
1080   * - vbsl
1081     - native
1082   * - vcagt
1083     - ⚠ Does not have direct implementation, but is emulated using fast NEON instructions
1084   * - vceq
1085     - �� Depends on a smart enough compiler, but should be near native
1086   * - vceqz
1087     - ⚠ Does not have direct implementation, but is emulated using fast NEON instructions
1088   * - vcge
1089     - native
1090   * - vcgez
1091     - ⚠ Does not have direct implementation, but is emulated using fast NEON instructions
1092   * - vcgt
1093     - native
1094   * - vcgtz
1095     - ⚠ Does not have direct implementation, but is emulated using fast NEON instructions
1096   * - vcle
1097     - native
1098   * - vclez
1099     - ⚠ Does not have direct implementation, but is emulated using fast NEON instructions
1100   * - vcls
1101     - ⚫ Not implemented, will trigger compiler error
1102   * - vclt
1103     - native
1104   * - vcltz
1105     - ⚠ Does not have direct implementation, but is emulated using fast NEON instructions
1106   * - vcnt
1107     - ⚫ Not implemented, will trigger compiler error
1108   * - vclz
1109     - ⚫ Not implemented, will trigger compiler error
1110   * - vcombine
1111     - ❌ Will be emulated with slow instructions, or scalarized
1112   * - vcreate
1113     - ❌ Will be emulated with slow instructions, or scalarized
1114   * - vdot
1115     - ❌ Will be emulated with slow instructions, or scalarized
1116   * - vdot_lane
1117     - ❌ Will be emulated with slow instructions, or scalarized
1118   * - vdup
1119     - ⚫ Not implemented, will trigger compiler error
1120   * - vdup_n
1121     - native
1122   * - veor
1123     - native
1124   * - vext
1125     - ❌ Will be emulated with slow instructions, or scalarized
1126   * - vget_lane
1127     - native
1128   * - vhadd
1129     - ⚫ Not implemented, will trigger compiler error
1130   * - vhsub
1131     - ⚫ Not implemented, will trigger compiler error
1132   * - vld1
1133     - native
1134   * - vld2
1135     - ⚫ Not implemented, will trigger compiler error
1136   * - vld3
1137     - �� Depends on a smart enough compiler, but should be near native
1138   * - vld4
1139     - �� Depends on a smart enough compiler, but should be near native
1140   * - vmax
1141     - native
1142   * - vmaxv
1143     - ⚫ Not implemented, will trigger compiler error
1144   * - vmin
1145     - native
1146   * - vminv
1147     - ⚫ Not implemented, will trigger compiler error
1148   * - vmla
1149     - ⚠ Does not have direct implementation, but is emulated using fast NEON instructions
1150   * - vmlal
1151     - ❌ Will be emulated with slow instructions, or scalarized
1152   * - vmls
1153     - ⚫ Not implemented, will trigger compiler error
1154   * - vmlsl
1155     - ⚫ Not implemented, will trigger compiler error
1156   * - vmovl
1157     - native
1158   * - vmul
1159     - native
1160   * - vmul_n
1161     - ⚠ Does not have direct implementation, but is emulated using fast NEON instructions
1162   * - vmull
1163     - ⚠ Does not have direct implementation, but is emulated using fast NEON instructions
1164   * - vmull_n
1165     - ⚠ Does not have direct implementation, but is emulated using fast NEON instructions
1166   * - vmull_high
1167     - ❌ Will be emulated with slow instructions, or scalarized
1168   * - vmvn
1169     - native
1170   * - vneg
1171     - native
1172   * - vorn
1173     - ⚫ Not implemented, will trigger compiler error
1174   * - vorr
1175     - native
1176   * - vpadal
1177     - ❌ Will be emulated with slow instructions, or scalarized
1178   * - vpadd
1179     - ❌ Will be emulated with slow instructions, or scalarized
1180   * - vpaddl
1181     - ❌ Will be emulated with slow instructions, or scalarized
1182   * - vpmax
1183     - ❌ Will be emulated with slow instructions, or scalarized
1184   * - vpmin
1185     - ❌ Will be emulated with slow instructions, or scalarized
1186   * - vpminnm
1187     - ⚫ Not implemented, will trigger compiler error
1188   * - vqabs
1189     - ⚫ Not implemented, will trigger compiler error
1190   * - vqabsb
1191     - ⚫ Not implemented, will trigger compiler error
1192   * - vqadd
1193     - �� Depends on a smart enough compiler, but should be near native
1194   * - vqaddb
1195     - ⚫ Not implemented, will trigger compiler error
1196   * - vqdmulh
1197     - ❌ Will be emulated with slow instructions, or scalarized
1198   * - vqneg
1199     - ⚫ Not implemented, will trigger compiler error
1200   * - vqnegb
1201     - ⚫ Not implemented, will trigger compiler error
1202   * - vqrdmulh
1203     - ⚫ Not implemented, will trigger compiler error
1204   * - vqrshl
1205     - ⚫ Not implemented, will trigger compiler error
1206   * - vqrshlb
1207     - ⚫ Not implemented, will trigger compiler error
1208   * - vqshl
1209     - ⚫ Not implemented, will trigger compiler error
1210   * - vqshlb
1211     - ⚫ Not implemented, will trigger compiler error
1212   * - vqsub
1213     - ⚫ Not implemented, will trigger compiler error
1214   * - vqsubb
1215     - ⚫ Not implemented, will trigger compiler error
1216   * - vqtbl1
1217     - ⚠ Does not have direct implementation, but is emulated using fast NEON instructions
1218   * - vqtbl2
1219     - ⚠ Does not have direct implementation, but is emulated using fast NEON instructions
1220   * - vqtbl3
1221     - ⚠ Does not have direct implementation, but is emulated using fast NEON instructions
1222   * - vqtbl4
1223     - ⚠ Does not have direct implementation, but is emulated using fast NEON instructions
1224   * - vqtbx1
1225     - ❌ Will be emulated with slow instructions, or scalarized
1226   * - vqtbx2
1227     - ❌ Will be emulated with slow instructions, or scalarized
1228   * - vqtbx3
1229     - ❌ Will be emulated with slow instructions, or scalarized
1230   * - vqtbx4
1231     - ❌ Will be emulated with slow instructions, or scalarized
1232   * - vrbit
1233     - ⚠ Does not have direct implementation, but is emulated using fast NEON instructions
1234   * - vreinterpret
1235     - �� Depends on a smart enough compiler, but should be near native
1236   * - vrev16
1237     - native
1238   * - vrev32
1239     - native
1240   * - vrev64
1241     - native
1242   * - vrhadd
1243     - ⚠ Does not have direct implementation, but is emulated using fast NEON instructions
1244   * - vrshl
1245     - ❌ Will be emulated with slow instructions, or scalarized
1246   * - vrshr_n
1247     - ❌ Will be emulated with slow instructions, or scalarized
1248   * - vrsra_n
1249     - ❌ Will be emulated with slow instructions, or scalarized
1250   * - vset_lane
1251     - native
1252   * - vshl
1253     - scalaried
1254   * - vshl_n
1255     - ❌ Will be emulated with slow instructions, or scalarized
1256   * - vshr_n
1257     - ⚠ Does not have direct implementation, but is emulated using fast NEON instructions
1258   * - vsra_n
1259     - ❌ Will be emulated with slow instructions, or scalarized
1260   * - vst1
1261     - native
1262   * - vst1_lane
1263     - �� Depends on a smart enough compiler, but should be near native
1264   * - vst2
1265     - ⚫ Not implemented, will trigger compiler error
1266   * - vst3
1267     - �� Depends on a smart enough compiler, but should be near native
1268   * - vst4
1269     - �� Depends on a smart enough compiler, but should be near native
1270   * - vsub
1271     - native
1272   * - vsubl
1273     - ⚠ Does not have direct implementation, but is emulated using fast NEON instructions
1274   * - vsubw
1275     - ⚫ Not implemented, will trigger compiler error
1276   * - vtbl1
1277     - ❌ Will be emulated with slow instructions, or scalarized
1278   * - vtbl2
1279     - ❌ Will be emulated with slow instructions, or scalarized
1280   * - vtbl3
1281     - ❌ Will be emulated with slow instructions, or scalarized
1282   * - vtbl4
1283     - ❌ Will be emulated with slow instructions, or scalarized
1284   * - vtbx1
1285     - ❌ Will be emulated with slow instructions, or scalarized
1286   * - vtbx2
1287     - ❌ Will be emulated with slow instructions, or scalarized
1288   * - vtbx3
1289     - ❌ Will be emulated with slow instructions, or scalarized
1290   * - vtbx4
1291     - ❌ Will be emulated with slow instructions, or scalarized
1292   * - vtrn
1293     - ❌ Will be emulated with slow instructions, or scalarized
1294   * - vtrn1
1295     - ❌ Will be emulated with slow instructions, or scalarized
1296   * - vtrn2
1297     - ❌ Will be emulated with slow instructions, or scalarized
1298   * - vtst
1299     - ❌ Will be emulated with slow instructions, or scalarized
1300   * - vuqadd
1301     - ⚫ Not implemented, will trigger compiler error
1302   * - vuqaddb
1303     - ⚫ Not implemented, will trigger compiler error
1304   * - vuzp
1305     - ❌ Will be emulated with slow instructions, or scalarized
1306   * - vuzp1
1307     - ❌ Will be emulated with slow instructions, or scalarized
1308   * - vuzp2
1309     - ❌ Will be emulated with slow instructions, or scalarized
1310   * - vzip
1311     - ❌ Will be emulated with slow instructions, or scalarized
1312   * - vzip1
1313     - ❌ Will be emulated with slow instructions, or scalarized
1314   * - vzip2
1315     - ❌ Will be emulated with slow instructions, or scalarized
1316