.. Porting SIMD code:

.. role:: raw-html(raw)
    :format: html

=======================================
Porting SIMD code targeting WebAssembly
=======================================

Emscripten supports the `WebAssembly SIMD proposal <https://github.com/webassembly/simd/>`_ when using the WebAssembly LLVM backend. To enable SIMD, pass the -msimd128 flag at compile time. This will also turn on LLVM's autovectorization passes, so no source modifications are necessary to benefit from SIMD.

At the source level, the GCC/Clang `SIMD Vector Extensions <https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html>`_ can be used and will be lowered to WebAssembly SIMD instructions where possible. In addition, there is a portable intrinsics header file that can be used.

    .. code-block:: cpp

       #include <wasm_simd128.h>

Separate documentation for the intrinsics header is a work in progress, but its usage is straightforward and its source can be found at `wasm_simd128.h <https://github.com/llvm/llvm-project/blob/master/clang/lib/Headers/wasm_simd128.h>`_. These intrinsics are under active development in parallel with the SIMD proposal and should not be considered any more stable than the proposal itself. Note that most engines will also require an extra flag to enable SIMD. For example, Node requires `--experimental-wasm-simd`.

WebAssembly SIMD is not supported when using the Fastcomp backend.

======================================
Limitations and behavioral differences
======================================

When porting native SIMD code, it should be noted that because of portability concerns, the WebAssembly SIMD specification does not expose the full native instruction sets. In particular the following changes exist:

 - Emscripten does not support x86 or any other native inline SIMD assembly or building .s assembly files, so all code should be written to use SIMD intrinsic functions or compiler vector extensions.
 - WebAssembly SIMD does not have control over managing floating point rounding modes or handling denormals.

 - Cache line prefetch instructions are not available, and calls to these functions will compile, but are treated as no-ops.

 - Asymmetric memory fence operations are not available, but will be implemented as fully synchronous memory fences when SharedArrayBuffer is enabled (-s USE_PTHREADS=1) or as no-ops when multithreading is not enabled (default, -s USE_PTHREADS=0).

SIMD-related bug reports are tracked in the `Emscripten bug tracker with the label SIMD <https://github.com/emscripten-core/emscripten/issues?q=is%3Aopen+is%3Aissue+label%3ASIMD>`_.

=====================================================
Compiling SIMD code targeting x86 SSE instruction set
=====================================================

Emscripten supports compiling existing codebases that use x86 SSE by passing the `-msse` flag to the compiler and including the header `<xmmintrin.h>`.

Currently the SSE1, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2 and 128-bit AVX instruction sets are supported, as detailed in the tables below.

The following table highlights the availability and expected performance of different SSE1 intrinsics. Even if you are directly targeting the native Wasm SIMD opcodes via the wasm_simd128.h header, this table can be useful for understanding the performance limitations that the Wasm SIMD specification has when running on x86 hardware.

For detailed information on each SSE intrinsic function, visit the excellent `Intel Intrinsics Guide on SSE1 <https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE>`_.
The following legend is used to highlight the expected performance of various instructions:
 - ✅ Wasm SIMD has a native opcode that matches the x86 SSE instruction, should yield native performance
 - 💡 while the Wasm SIMD spec does not provide a proper performance guarantee, given a sufficiently smart compiler and a runtime VM path, this intrinsic should be able to generate the identical native SSE instruction.
 - 🟡 there is some information missing (e.g. type or alignment information) for a Wasm VM to be guaranteed to be able to reconstruct the intended x86 SSE opcode. This might cause a penalty depending on the target CPU hardware family, especially on older CPU generations.
 - ⚠️ the underlying x86 SSE instruction is not available, but it is emulated via at most a few other Wasm SIMD instructions, causing a small penalty.
 - ❌ the underlying x86 SSE instruction is not exposed by the Wasm SIMD specification, so it must be emulated via a slow path, e.g. a sequence of several slower SIMD instructions, or a scalar implementation.
 - 💣 the underlying x86 SSE opcode is not available in Wasm SIMD, and the implementation must resort to such a slow emulated path that rethinking the algorithm at a higher level is advised.
 - 💭 the given SSE intrinsic is available to let applications compile, but does nothing.
 - ⚫ the given SSE intrinsic is not available. Referencing the intrinsic will cause a compiler error.

Certain intrinsics in the table below are marked "virtual". This means that there does not actually exist a native x86 SSE instruction set opcode to implement them, but native compilers offer the function as a convenience. Different compilers might generate a different instruction sequence for these.

.. list-table:: x86 SSE intrinsics available via #include <xmmintrin.h>
   :widths: 20 30
   :header-rows: 1

   * - Intrinsic name
     - WebAssembly SIMD support
   * - _mm_set_ps
     - ✅ wasm_f32x4_make
   * - _mm_setr_ps
     - ✅ wasm_f32x4_make
   * - _mm_set_ss
     - emulated with wasm_f32x4_make
   * - _mm_set_ps1 (_mm_set1_ps)
     - ✅ wasm_f32x4_splat
   * - _mm_setzero_ps
     - emulated with wasm_f32x4_const(0)
   * - _mm_load_ps
     - 🟡 wasm_v128_load. VM must guess type. :raw-html:`<br />` Unaligned load on x86 CPUs.
   * - _mm_loadl_pi
     - ❌ No Wasm SIMD support. :raw-html:`<br />` Emulated with scalar loads + shuffle.
   * - _mm_loadh_pi
     - ❌ No Wasm SIMD support. :raw-html:`<br />` Emulated with scalar loads + shuffle.
   * - _mm_loadr_ps
     - Virtual. Simd load + shuffle.
   * - _mm_loadu_ps
     - 🟡 wasm_v128_load. VM must guess type.
   * - _mm_load_ps1 (_mm_load1_ps)
     - Virtual. Simd load + shuffle.
   * - _mm_load_ss
     - ❌ emulated with wasm_f32x4_make
   * - _mm_storel_pi
     - ❌ scalar stores
   * - _mm_storeh_pi
     - ❌ shuffle + scalar stores
   * - _mm_store_ps
     - 🟡 wasm_v128_store. VM must guess type. :raw-html:`<br />` Unaligned store on x86 CPUs.
   * - _mm_stream_ps
     - 🟡 wasm_v128_store. VM must guess type. :raw-html:`<br />` No cache control in Wasm SIMD.
   * - _mm_prefetch
     - 💭 No-op.
   * - _mm_sfence
     - ⚠️ A full barrier in multithreaded builds.
   * - _mm_shuffle_ps
     - 🟡 wasm_v32x4_shuffle. VM must guess type.
   * - _mm_storer_ps
     - Virtual. Shuffle + Simd store.
   * - _mm_store_ps1 (_mm_store1_ps)
     - Virtual. Emulated with shuffle. :raw-html:`<br />` Unaligned store on x86 CPUs.
   * - _mm_store_ss
     - emulated with scalar store
   * - _mm_storeu_ps
     - 🟡 wasm_v128_store. VM must guess type.
   * - _mm_storeu_si16
     - emulated with scalar store
   * - _mm_storeu_si64
     - emulated with scalar store
   * - _mm_movemask_ps
     - No Wasm SIMD support. Emulated in scalar. `simd/#131 <https://github.com/WebAssembly/simd/issues/131>`_
   * - _mm_move_ss
     - 🟡 emulated with a shuffle. VM must guess type.
   * - _mm_add_ps
     - ✅ wasm_f32x4_add
   * - _mm_add_ss
     - ⚠️ emulated with a shuffle
   * - _mm_sub_ps
     - ✅ wasm_f32x4_sub
   * - _mm_sub_ss
     - ⚠️ emulated with a shuffle
   * - _mm_mul_ps
     - ✅ wasm_f32x4_mul
   * - _mm_mul_ss
     - ⚠️ emulated with a shuffle
   * - _mm_div_ps
     - ✅ wasm_f32x4_div
   * - _mm_div_ss
     - ⚠️ emulated with a shuffle
   * - _mm_min_ps
     - TODO: pmin once it works
   * - _mm_min_ss
     - ⚠️ emulated with a shuffle
   * - _mm_max_ps
     - TODO: pmax once it works
   * - _mm_max_ss
     - ⚠️ emulated with a shuffle
   * - _mm_rcp_ps
     - ❌ No Wasm SIMD support. :raw-html:`<br />` Emulated with full precision div. `simd/#3 <https://github.com/WebAssembly/simd/issues/3>`_
   * - _mm_rcp_ss
     - ❌ No Wasm SIMD support. :raw-html:`<br />` Emulated with full precision div+shuffle. `simd/#3 <https://github.com/WebAssembly/simd/issues/3>`_
   * - _mm_sqrt_ps
     - ✅ wasm_f32x4_sqrt
   * - _mm_sqrt_ss
     - ⚠️ emulated with a shuffle
   * - _mm_rsqrt_ps
     - ❌ No Wasm SIMD support. :raw-html:`<br />` Emulated with full precision div+sqrt. `simd/#3 <https://github.com/WebAssembly/simd/issues/3>`_
   * - _mm_rsqrt_ss
     - ❌ No Wasm SIMD support. :raw-html:`<br />` Emulated with full precision div+sqrt+shuffle. `simd/#3 <https://github.com/WebAssembly/simd/issues/3>`_
   * - _mm_unpackhi_ps
     - emulated with a shuffle
   * - _mm_unpacklo_ps
     - emulated with a shuffle
   * - _mm_movehl_ps
     - emulated with a shuffle
   * - _mm_movelh_ps
     - emulated with a shuffle
   * - _MM_TRANSPOSE4_PS
     - emulated with a shuffle
   * - _mm_cmplt_ps
     - ✅ wasm_f32x4_lt
   * - _mm_cmplt_ss
     - ⚠️ emulated with a shuffle
   * - _mm_cmple_ps
     - ✅ wasm_f32x4_le
   * - _mm_cmple_ss
     - ⚠️ emulated with a shuffle
   * - _mm_cmpeq_ps
     - ✅ wasm_f32x4_eq
   * - _mm_cmpeq_ss
     - ⚠️ emulated with a shuffle
   * - _mm_cmpge_ps
     - ✅ wasm_f32x4_ge
   * - _mm_cmpge_ss
     - ⚠️ emulated with a shuffle
   * - _mm_cmpgt_ps
     - ✅ wasm_f32x4_gt
   * - _mm_cmpgt_ss
     - ⚠️ emulated with a shuffle
   * - _mm_cmpord_ps
     - ❌ emulated with 2xcmp+and
   * - _mm_cmpord_ss
     - ❌ emulated with 2xcmp+and+shuffle
   * - _mm_cmpunord_ps
     - ❌ emulated with 2xcmp+or
   * - _mm_cmpunord_ss
     - ❌ emulated with 2xcmp+or+shuffle
   * - _mm_and_ps
     - 🟡 wasm_v128_and. VM must guess type.
   * - _mm_andnot_ps
     - 🟡 wasm_v128_andnot. VM must guess type.
   * - _mm_or_ps
     - 🟡 wasm_v128_or. VM must guess type.
   * - _mm_xor_ps
     - 🟡 wasm_v128_xor. VM must guess type.
   * - _mm_cmpneq_ps
     - ✅ wasm_f32x4_ne
   * - _mm_cmpneq_ss
     - ⚠️ emulated with a shuffle
   * - _mm_cmpnge_ps
     - ⚠️ emulated with not+ge
   * - _mm_cmpnge_ss
     - ⚠️ emulated with not+ge+shuffle
   * - _mm_cmpngt_ps
     - ⚠️ emulated with not+gt
   * - _mm_cmpngt_ss
     - ⚠️ emulated with not+gt+shuffle
   * - _mm_cmpnle_ps
     - ⚠️ emulated with not+le
   * - _mm_cmpnle_ss
     - ⚠️ emulated with not+le+shuffle
   * - _mm_cmpnlt_ps
     - ⚠️ emulated with not+lt
   * - _mm_cmpnlt_ss
     - ⚠️ emulated with not+lt+shuffle
   * - _mm_comieq_ss
     - ❌ scalarized
   * - _mm_comige_ss
     - ❌ scalarized
   * - _mm_comigt_ss
     - ❌ scalarized
   * - _mm_comile_ss
     - ❌ scalarized
   * - _mm_comilt_ss
     - ❌ scalarized
   * - _mm_comineq_ss
     - ❌ scalarized
   * - _mm_ucomieq_ss
     - ❌ scalarized
   * - _mm_ucomige_ss
     - ❌ scalarized
   * - _mm_ucomigt_ss
     - ❌ scalarized
   * - _mm_ucomile_ss
     - ❌ scalarized
   * - _mm_ucomilt_ss
     - ❌ scalarized
   * - _mm_ucomineq_ss
     - ❌ scalarized
   * - _mm_cvtsi32_ss (_mm_cvt_si2ss)
     - ❌ scalarized
   * - _mm_cvtss_si32 (_mm_cvt_ss2si)
     - scalar with complex emulated semantics
   * - _mm_cvttss_si32 (_mm_cvtt_ss2si)
     - scalar with complex emulated semantics
   * - _mm_cvtsi64_ss
     - ❌ scalarized
   * - _mm_cvtss_si64
     - scalar with complex emulated semantics
   * - _mm_cvttss_si64
     - scalar with complex emulated semantics
   * - _mm_cvtss_f32
     - scalar get
   * - _mm_malloc
     - ✅ Allocates memory with specified alignment.
   * - _mm_free
     - ✅ Aliases to free().
   * - _MM_GET_EXCEPTION_MASK
     - ✅ Always returns all exceptions masked (0x1f80).
   * - _MM_GET_EXCEPTION_STATE
     - ❌ Exception state is not tracked. Always returns 0.
   * - _MM_GET_FLUSH_ZERO_MODE
     - ✅ Always returns _MM_FLUSH_ZERO_OFF.
   * - _MM_GET_ROUNDING_MODE
     - ✅ Always returns _MM_ROUND_NEAREST.
   * - _mm_getcsr
     - ✅ Always returns _MM_FLUSH_ZERO_OFF :raw-html:`<br />` | _MM_ROUND_NEAREST | all exceptions masked (0x1f80).
   * - _MM_SET_EXCEPTION_MASK
     - ⚫ Not available. Fixed to all exceptions masked.
   * - _MM_SET_EXCEPTION_STATE
     - ⚫ Not available. Fixed to zero/clear state.
   * - _MM_SET_FLUSH_ZERO_MODE
     - ⚫ Not available. Fixed to _MM_FLUSH_ZERO_OFF.
   * - _MM_SET_ROUNDING_MODE
     - ⚫ Not available. Fixed to _MM_ROUND_NEAREST.
   * - _mm_setcsr
     - ⚫ Not available.
   * - _mm_undefined_ps
     - ✅ Virtual

⚫ The following extensions that the SSE1 instruction set brought to 64-bit wide MMX registers are not available:
 - _mm_avg_pu8, _mm_avg_pu16, _mm_cvt_pi2ps, _mm_cvt_ps2pi, _mm_cvt_pi16_ps, _mm_cvt_pi32_ps, _mm_cvt_pi32x2_ps, _mm_cvt_pi8_ps, _mm_cvt_ps_pi16, _mm_cvt_ps_pi32, _mm_cvt_ps_pi8, _mm_cvt_pu16_ps, _mm_cvt_pu8_ps, _mm_cvtt_ps2pi, _mm_cvtt_pi16_ps, _mm_cvttps_pi32, _mm_extract_pi16, _mm_insert_pi16, _mm_maskmove_si64, _m_maskmovq, _mm_max_pi16, _mm_max_pu8, _mm_min_pi16, _mm_min_pu8, _mm_movemask_pi8, _mm_mulhi_pu16, _m_pavgb, _m_pavgw, _m_pextrw, _m_pinsrw, _m_pmaxsw, _m_pmaxub, _m_pminsw, _m_pminub, _m_pmovmskb, _m_pmulhuw, _m_psadbw, _m_pshufw, _mm_sad_pu8, _mm_shuffle_pi16 and _mm_stream_pi.

Any code referencing these intrinsics will not compile.

The following table highlights the availability and expected performance of different SSE2 intrinsics. Refer to `Intel Intrinsics Guide on SSE2 <https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE2>`_.

.. list-table:: x86 SSE2 intrinsics available via #include <emmintrin.h>
   :widths: 20 30
   :header-rows: 1

   * - Intrinsic name
     - WebAssembly SIMD support
   * - _mm_add_epi16
     - ✅ wasm_i16x8_add
   * - _mm_add_epi32
     - ✅ wasm_i32x4_add
   * - _mm_add_epi64
     - ✅ wasm_i64x2_add
   * - _mm_add_epi8
     - ✅ wasm_i8x16_add
   * - _mm_add_pd
     - ✅ wasm_f64x2_add
   * - _mm_add_sd
     - ⚠️ emulated with a shuffle
   * - _mm_adds_epi16
     - ✅ wasm_i16x8_add_saturate
   * - _mm_adds_epi8
     - ✅ wasm_i8x16_add_saturate
   * - _mm_adds_epu16
     - ✅ wasm_u16x8_add_saturate
   * - _mm_adds_epu8
     - ✅ wasm_u8x16_add_saturate
   * - _mm_and_pd
     - 🟡 wasm_v128_and. VM must guess type.
   * - _mm_and_si128
     - 🟡 wasm_v128_and. VM must guess type.
   * - _mm_andnot_pd
     - 🟡 wasm_v128_andnot. VM must guess type.
   * - _mm_andnot_si128
     - 🟡 wasm_v128_andnot. VM must guess type.
   * - _mm_avg_epu16
     - ✅ wasm_u16x8_avgr
   * - _mm_avg_epu8
     - ✅ wasm_u8x16_avgr
   * - _mm_castpd_ps
     - ✅ no-op
   * - _mm_castpd_si128
     - ✅ no-op
   * - _mm_castps_pd
     - ✅ no-op
   * - _mm_castps_si128
     - ✅ no-op
   * - _mm_castsi128_pd
     - ✅ no-op
   * - _mm_castsi128_ps
     - ✅ no-op
   * - _mm_clflush
     - 💭 No-op. No cache hinting in Wasm SIMD.
   * - _mm_cmpeq_epi16
     - ✅ wasm_i16x8_eq
   * - _mm_cmpeq_epi32
     - ✅ wasm_i32x4_eq
   * - _mm_cmpeq_epi8
     - ✅ wasm_i8x16_eq
   * - _mm_cmpeq_pd
     - ✅ wasm_f64x2_eq
   * - _mm_cmpeq_sd
     - ⚠️ emulated with a shuffle
   * - _mm_cmpge_pd
     - ✅ wasm_f64x2_ge
   * - _mm_cmpge_sd
     - ⚠️ emulated with a shuffle
   * - _mm_cmpgt_epi16
     - ✅ wasm_i16x8_gt
   * - _mm_cmpgt_epi32
     - ✅ wasm_i32x4_gt
   * - _mm_cmpgt_epi8
     - ✅ wasm_i8x16_gt
   * - _mm_cmpgt_pd
     - ✅ wasm_f64x2_gt
   * - _mm_cmpgt_sd
     - ⚠️ emulated with a shuffle
   * - _mm_cmple_pd
     - ✅ wasm_f64x2_le
   * - _mm_cmple_sd
     - ⚠️ emulated with a shuffle
   * - _mm_cmplt_epi16
     - ✅ wasm_i16x8_lt
   * - _mm_cmplt_epi32
     - ✅ wasm_i32x4_lt
   * - _mm_cmplt_epi8
     - ✅ wasm_i8x16_lt
   * - _mm_cmplt_pd
     - ✅ wasm_f64x2_lt
   * - _mm_cmplt_sd
     - ⚠️ emulated with a shuffle
   * - _mm_cmpneq_pd
     - ✅ wasm_f64x2_ne
   * - _mm_cmpneq_sd
     - ⚠️ emulated with a shuffle
   * - _mm_cmpnge_pd
     - ⚠️ emulated with not+ge
   * - _mm_cmpnge_sd
     - ⚠️ emulated with not+ge+shuffle
   * - _mm_cmpngt_pd
     - ⚠️ emulated with not+gt
   * - _mm_cmpngt_sd
     - ⚠️ emulated with not+gt+shuffle
   * - _mm_cmpnle_pd
     - ⚠️ emulated with not+le
   * - _mm_cmpnle_sd
     - ⚠️ emulated with not+le+shuffle
   * - _mm_cmpnlt_pd
     - ⚠️ emulated with not+lt
   * - _mm_cmpnlt_sd
     - ⚠️ emulated with not+lt+shuffle
   * - _mm_cmpord_pd
     - ❌ emulated with 2xcmp+and
   * - _mm_cmpord_sd
     - ❌ emulated with 2xcmp+and+shuffle
   * - _mm_cmpunord_pd
     - ❌ emulated with 2xcmp+or
   * - _mm_cmpunord_sd
     - ❌ emulated with 2xcmp+or+shuffle
   * - _mm_comieq_sd
     - ❌ scalarized
   * - _mm_comige_sd
     - ❌ scalarized
   * - _mm_comigt_sd
     - ❌ scalarized
   * - _mm_comile_sd
     - ❌ scalarized
   * - _mm_comilt_sd
     - ❌ scalarized
   * - _mm_comineq_sd
     - ❌ scalarized
   * - _mm_cvtepi32_pd
     - ❌ scalarized
   * - _mm_cvtepi32_ps
     - ✅ wasm_f32x4_convert_i32x4
   * - _mm_cvtpd_epi32
     - ❌ scalarized
   * - _mm_cvtpd_ps
     - ❌ scalarized
   * - _mm_cvtps_epi32
     - ❌ scalarized
   * - _mm_cvtps_pd
     - ❌ scalarized
   * - _mm_cvtsd_f64
     - ✅ wasm_f64x2_extract_lane
   * - _mm_cvtsd_si32
     - ❌ scalarized
   * - _mm_cvtsd_si64
     - ❌ scalarized
   * - _mm_cvtsd_si64x
     - ❌ scalarized
   * - _mm_cvtsd_ss
     - ❌ scalarized
   * - _mm_cvtsi128_si32
     - ✅ wasm_i32x4_extract_lane
   * - _mm_cvtsi128_si64 (_mm_cvtsi128_si64x)
     - ✅ wasm_i64x2_extract_lane
   * - _mm_cvtsi32_sd
     - ❌ scalarized
   * - _mm_cvtsi32_si128
     - emulated with wasm_i32x4_make
   * - _mm_cvtsi64_sd (_mm_cvtsi64x_sd)
     - ❌ scalarized
   * - _mm_cvtsi64_si128 (_mm_cvtsi64x_si128)
     - emulated with wasm_i64x2_make
   * - _mm_cvtss_sd
     - ❌ scalarized
   * - _mm_cvttpd_epi32
     - ❌ scalarized
   * - _mm_cvttps_epi32
     - ❌ scalarized
   * - _mm_cvttsd_si32
     - ❌ scalarized
   * - _mm_cvttsd_si64 (_mm_cvttsd_si64x)
     - ❌ scalarized
   * - _mm_div_pd
     - ✅ wasm_f64x2_div
   * - _mm_div_sd
     - ⚠️ emulated with a shuffle
   * - _mm_extract_epi16
     - ✅ wasm_u16x8_extract_lane
   * - _mm_insert_epi16
     - ✅ wasm_i16x8_replace_lane
   * - _mm_lfence
     - ⚠️ A full barrier in multithreaded builds.
   * - _mm_load_pd
     - 🟡 wasm_v128_load. VM must guess type. :raw-html:`<br />` Unaligned load on x86 CPUs.
   * - _mm_load1_pd (_mm_load_pd1)
     - 🟡 Virtual. v64x2.load_splat, VM must guess type.
   * - _mm_load_sd
     - ❌ emulated with wasm_f64x2_make
   * - _mm_load_si128
     - 🟡 wasm_v128_load. VM must guess type. :raw-html:`<br />` Unaligned load on x86 CPUs.
   * - _mm_loadh_pd
     - ❌ No Wasm SIMD support. :raw-html:`<br />` Emulated with scalar loads + shuffle.
   * - _mm_loadl_epi64
     - ❌ No Wasm SIMD support. :raw-html:`<br />` Emulated with scalar loads + shuffle.
   * - _mm_loadl_pd
     - ❌ No Wasm SIMD support. :raw-html:`<br />` Emulated with scalar loads + shuffle.
   * - _mm_loadr_pd
     - Virtual. Simd load + shuffle.
   * - _mm_loadu_pd
     - 🟡 wasm_v128_load. VM must guess type.
   * - _mm_loadu_si128
     - 🟡 wasm_v128_load. VM must guess type.
   * - _mm_loadu_si32
     - ❌ emulated with wasm_i32x4_make
   * - _mm_madd_epi16
     - ❌ scalarized
   * - _mm_maskmoveu_si128
     - ❌ scalarized
   * - _mm_max_epi16
     - ✅ wasm_i16x8_max
   * - _mm_max_epu8
     - ✅ wasm_u8x16_max
   * - _mm_max_pd
     - TODO: migrate to wasm_f64x2_pmax
   * - _mm_max_sd
     - ⚠️ emulated with a shuffle
   * - _mm_mfence
     - ⚠️ A full barrier in multithreaded builds.
   * - _mm_min_epi16
     - ✅ wasm_i16x8_min
   * - _mm_min_epu8
     - ✅ wasm_u8x16_min
   * - _mm_min_pd
     - TODO: migrate to wasm_f64x2_pmin
   * - _mm_min_sd
     - ⚠️ emulated with a shuffle
   * - _mm_move_epi64
     - 🟡 emulated with a shuffle. VM must guess type.
   * - _mm_move_sd
     - 🟡 emulated with a shuffle. VM must guess type.
   * - _mm_movemask_epi8
     - ❌ scalarized
   * - _mm_movemask_pd
     - ❌ scalarized
   * - _mm_mul_epu32
     - ❌ scalarized
   * - _mm_mul_pd
     - ✅ wasm_f64x2_mul
   * - _mm_mul_sd
     - ⚠️ emulated with a shuffle
   * - _mm_mulhi_epi16
     - ❌ scalarized
   * - _mm_mulhi_epu16
     - ❌ scalarized
   * - _mm_mullo_epi16
     - ✅ wasm_i16x8_mul
   * - _mm_or_pd
     - 🟡 wasm_v128_or. VM must guess type.
   * - _mm_or_si128
     - 🟡 wasm_v128_or. VM must guess type.
   * - _mm_packs_epi16
     - ❌ scalarized
   * - _mm_packs_epi32
     - ❌ scalarized
   * - _mm_packus_epi16
     - ❌ scalarized
   * - _mm_pause
     - 💭 No-op.
   * - _mm_sad_epu8
     - ❌ scalarized
   * - _mm_set_epi16
     - ✅ wasm_i16x8_make
   * - _mm_set_epi32
     - ✅ wasm_i32x4_make
   * - _mm_set_epi64 (_mm_set_epi64x)
     - ✅ wasm_i64x2_make
   * - _mm_set_epi8
     - ✅ wasm_i8x16_make
   * - _mm_set_pd
     - ✅ wasm_f64x2_make
   * - _mm_set_sd
     - emulated with wasm_f64x2_make
   * - _mm_set1_epi16
     - ✅ wasm_i16x8_splat
   * - _mm_set1_epi32
     - ✅ wasm_i32x4_splat
   * - _mm_set1_epi64 (_mm_set1_epi64x)
     - ✅ wasm_i64x2_splat
   * - _mm_set1_epi8
     - ✅ wasm_i8x16_splat
   * - _mm_set1_pd (_mm_set_pd1)
     - ✅ wasm_f64x2_splat
   * - _mm_setr_epi16
     - ✅ wasm_i16x8_make
   * - _mm_setr_epi32
     - ✅ wasm_i32x4_make
   * - _mm_setr_epi64
     - ✅ wasm_i64x2_make
   * - _mm_setr_epi8
     - ✅ wasm_i8x16_make
   * - _mm_setr_pd
     - ✅ wasm_f64x2_make
   * - _mm_setzero_pd
     - emulated with wasm_f64x2_const
   * - _mm_setzero_si128
     - emulated with wasm_i64x2_const
   * - _mm_shuffle_epi32
     - emulated with a general shuffle
   * - _mm_shuffle_pd
     - emulated with a general shuffle
   * - _mm_shufflehi_epi16
     - emulated with a general shuffle
   * - _mm_shufflelo_epi16
     - emulated with a general shuffle
   * - _mm_sll_epi16
     - ❌ scalarized
   * - _mm_sll_epi32
     - ❌ scalarized
   * - _mm_sll_epi64
     - ❌ scalarized
   * - _mm_slli_epi16
     - wasm_i16x8_shl :raw-html:`<br />` ✅ if shift count is immediate constant.
   * - _mm_slli_epi32
     - wasm_i32x4_shl :raw-html:`<br />` ✅ if shift count is immediate constant.
   * - _mm_slli_epi64
     - wasm_i64x2_shl :raw-html:`<br />` ✅ if shift count is immediate constant.
   * - _mm_slli_si128 (_mm_bslli_si128)
     - emulated with a general shuffle
   * - _mm_sqrt_pd
     - ✅ wasm_f64x2_sqrt
   * - _mm_sqrt_sd
     - ⚠️ emulated with a shuffle
   * - _mm_sra_epi16
     - ❌ scalarized
   * - _mm_sra_epi32
     - ❌ scalarized
   * - _mm_srai_epi16
     - wasm_i16x8_shr :raw-html:`<br />` ✅ if shift count is immediate constant.
   * - _mm_srai_epi32
     - wasm_i32x4_shr :raw-html:`<br />` ✅ if shift count is immediate constant.
   * - _mm_srl_epi16
     - ❌ scalarized
   * - _mm_srl_epi32
     - ❌ scalarized
   * - _mm_srl_epi64
     - ❌ scalarized
   * - _mm_srli_epi16
     - wasm_u16x8_shr :raw-html:`<br />` ✅ if shift count is immediate constant.
   * - _mm_srli_epi32
     - wasm_u32x4_shr :raw-html:`<br />` ✅ if shift count is immediate constant.
   * - _mm_srli_epi64
     - wasm_u64x2_shr :raw-html:`<br />` ✅ if shift count is immediate constant.
   * - _mm_srli_si128 (_mm_bsrli_si128)
     - emulated with a general shuffle
   * - _mm_store_pd
     - 🟡 wasm_v128_store. VM must guess type. :raw-html:`<br />` Unaligned store on x86 CPUs.
   * - _mm_store_sd
     - emulated with scalar store
   * - _mm_store_si128
     - 🟡 wasm_v128_store. VM must guess type. :raw-html:`<br />` Unaligned store on x86 CPUs.
   * - _mm_store1_pd (_mm_store_pd1)
     - Virtual. Emulated with shuffle. :raw-html:`<br />` Unaligned store on x86 CPUs.
   * - _mm_storeh_pd
     - ❌ shuffle + scalar stores
   * - _mm_storel_epi64
     - ❌ scalar store
   * - _mm_storel_pd
     - ❌ scalar store
   * - _mm_storer_pd
     - ❌ shuffle + scalar stores
   * - _mm_storeu_pd
     - 🟡 wasm_v128_store. VM must guess type.
   * - _mm_storeu_si128
     - 🟡 wasm_v128_store. VM must guess type.
   * - _mm_storeu_si32
     - emulated with scalar store
   * - _mm_stream_pd
     - 🟡 wasm_v128_store. VM must guess type. :raw-html:`<br />` No cache control in Wasm SIMD.
   * - _mm_stream_si128
     - 🟡 wasm_v128_store. VM must guess type. :raw-html:`<br />` No cache control in Wasm SIMD.
   * - _mm_stream_si32
     - 🟡 wasm_v128_store. VM must guess type. :raw-html:`<br />` No cache control in Wasm SIMD.
   * - _mm_stream_si64
     - 🟡 wasm_v128_store. VM must guess type. :raw-html:`<br />` No cache control in Wasm SIMD.
   * - _mm_sub_epi16
     - ✅ wasm_i16x8_sub
   * - _mm_sub_epi32
     - ✅ wasm_i32x4_sub
   * - _mm_sub_epi64
     - ✅ wasm_i64x2_sub
   * - _mm_sub_epi8
     - ✅ wasm_i8x16_sub
   * - _mm_sub_pd
     - ✅ wasm_f64x2_sub
   * - _mm_sub_sd
     - ⚠️ emulated with a shuffle
   * - _mm_subs_epi16
     - ✅ wasm_i16x8_sub_saturate
   * - _mm_subs_epi8
     - ✅ wasm_i8x16_sub_saturate
   * - _mm_subs_epu16
     - ✅ wasm_u16x8_sub_saturate
   * - _mm_subs_epu8
     - ✅ wasm_u8x16_sub_saturate
   * - _mm_ucomieq_sd
     - ❌ scalarized
   * - _mm_ucomige_sd
     - ❌ scalarized
   * - _mm_ucomigt_sd
     - ❌ scalarized
   * - _mm_ucomile_sd
     - ❌ scalarized
   * - _mm_ucomilt_sd
     - ❌ scalarized
   * - _mm_ucomineq_sd
     - ❌ scalarized
   * - _mm_undefined_pd
     - ✅ Virtual
   * - _mm_undefined_si128
     - ✅ Virtual
   * - _mm_unpackhi_epi16
     - emulated with a shuffle
   * - _mm_unpackhi_epi32
     - emulated with a shuffle
   * - _mm_unpackhi_epi64
     - emulated with a shuffle
   * - _mm_unpackhi_epi8
     - emulated with a shuffle
   * - _mm_unpackhi_pd
     - emulated with a shuffle
   * - _mm_unpacklo_epi16
     - emulated with a shuffle
   * - _mm_unpacklo_epi32
     - emulated with a shuffle
   * - _mm_unpacklo_epi64
     - emulated with a shuffle
   * - _mm_unpacklo_epi8
     - emulated with a shuffle
   * - _mm_unpacklo_pd
     - emulated with a shuffle
   * - _mm_xor_pd
     - 🟡 wasm_v128_xor. VM must guess type.
   * - _mm_xor_si128
     - 🟡 wasm_v128_xor. VM must guess type.
⚫ The following extensions that the SSE2 instruction set brought to 64-bit wide MMX registers are not available:
 - _mm_add_si64, _mm_movepi64_pi64, _mm_movpi64_epi64, _mm_mul_su32, _mm_sub_si64, _mm_cvtpd_pi32, _mm_cvtpi32_pd, _mm_cvttpd_pi32

Any code referencing these intrinsics will not compile.

The following table highlights the availability and expected performance of different SSE3 intrinsics. Refer to `Intel Intrinsics Guide on SSE3 <https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE3>`_.

.. list-table:: x86 SSE3 intrinsics available via #include <pmmintrin.h>
   :widths: 20 30
   :header-rows: 1

   * - Intrinsic name
     - WebAssembly SIMD support
   * - _mm_lddqu_si128
     - ✅ wasm_v128_load.
   * - _mm_addsub_ps
     - ⚠️ emulated with a SIMD add+mul+const
   * - _mm_hadd_ps
     - ⚠️ emulated with a SIMD add+two shuffles
   * - _mm_hsub_ps
     - ⚠️ emulated with a SIMD sub+two shuffles
   * - _mm_movehdup_ps
     - emulated with a general shuffle
   * - _mm_moveldup_ps
     - emulated with a general shuffle
   * - _mm_addsub_pd
     - ⚠️ emulated with a SIMD add+mul+const
   * - _mm_hadd_pd
     - ⚠️ emulated with a SIMD add+two shuffles
   * - _mm_hsub_pd
     - ⚠️ emulated with a SIMD sub+two shuffles
   * - _mm_loaddup_pd
     - Scalar load + splat.
   * - _mm_movedup_pd
     - emulated with a general shuffle
   * - _MM_GET_DENORMALS_ZERO_MODE
     - ✅ Always returns _MM_DENORMALS_ZERO_OFF. I.e. denormals are available.
   * - _MM_SET_DENORMALS_ZERO_MODE
     - ⚫ Not available. Fixed to _MM_DENORMALS_ZERO_OFF.
   * - _mm_monitor
     - ⚫ Not available.
   * - _mm_mwait
     - ⚫ Not available.

The following table highlights the availability and expected performance of different SSSE3 intrinsics. Refer to `Intel Intrinsics Guide on SSSE3 <https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSSE3>`_.

.. list-table:: x86 SSSE3 intrinsics available via #include <tmmintrin.h>
   :widths: 20 30
   :header-rows: 1

   * - Intrinsic name
     - WebAssembly SIMD support
   * - _mm_abs_epi8
     - ⚠️ emulated with a SIMD shift+xor+add
   * - _mm_abs_epi16
     - ⚠️ emulated with a SIMD shift+xor+add
   * - _mm_abs_epi32
     - ⚠️ emulated with a SIMD shift+xor+add
   * - _mm_alignr_epi8
     - ⚠️ emulated with a SIMD or+two shifts
   * - _mm_hadd_epi16
     - ⚠️ emulated with a SIMD add+two shuffles
   * - _mm_hadd_epi32
     - ⚠️ emulated with a SIMD add+two shuffles
   * - _mm_hadds_epi16
     - ⚠️ emulated with a SIMD adds+two shuffles
   * - _mm_hsub_epi16
     - ⚠️ emulated with a SIMD sub+two shuffles
   * - _mm_hsub_epi32
     - ⚠️ emulated with a SIMD sub+two shuffles
   * - _mm_hsubs_epi16
     - ⚠️ emulated with a SIMD subs+two shuffles
   * - _mm_maddubs_epi16
     - scalarized
   * - _mm_mulhrs_epi16
     - scalarized (TODO: emulatable in SIMD?)
   * - _mm_shuffle_epi8
     - scalarized (TODO: use wasm_v8x16_swizzle when available)
   * - _mm_sign_epi8
     - ⚠️ emulated with a SIMD complex shuffle+cmp+xor+andnot
   * - _mm_sign_epi16
     - ⚠️ emulated with a SIMD shr+cmp+xor+andnot
   * - _mm_sign_epi32
     - ⚠️ emulated with a SIMD shr+cmp+xor+andnot

⚫ The SSSE3 functions that deal with 64-bit wide MMX registers are not available:
 - _mm_abs_pi8, _mm_abs_pi16, _mm_abs_pi32, _mm_alignr_pi8, _mm_hadd_pi16, _mm_hadd_pi32, _mm_hadds_pi16, _mm_hsub_pi16, _mm_hsub_pi32, _mm_hsubs_pi16, _mm_maddubs_pi16, _mm_mulhrs_pi16, _mm_shuffle_pi8, _mm_sign_pi8, _mm_sign_pi16 and _mm_sign_pi32

Any code referencing these intrinsics will not compile.

The following table highlights the availability and expected performance of different SSE4.1 intrinsics. Refer to `Intel Intrinsics Guide on SSE4.1 <https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE4_1>`_.

.. list-table:: x86 SSE4.1 intrinsics available via #include <smmintrin.h>
   :widths: 20 30
   :header-rows: 1

   * - Intrinsic name
     - WebAssembly SIMD support
   * - _mm_blend_epi16
     - emulated with a general shuffle
   * - _mm_blend_pd
     - emulated with a general shuffle
   * - _mm_blend_ps
     - emulated with a general shuffle
   * - _mm_blendv_epi8
     - ⚠️ emulated with a SIMD shr+and+andnot+or
   * - _mm_blendv_pd
     - ⚠️ emulated with a SIMD shr+and+andnot+or
   * - _mm_blendv_ps
     - ⚠️ emulated with a SIMD shr+and+andnot+or
   * - _mm_ceil_pd
     - ❌ scalarized
   * - _mm_ceil_ps
     - ❌ scalarized
   * - _mm_ceil_sd
     - ❌ scalarized
   * - _mm_ceil_ss
     - ❌ scalarized
   * - _mm_cmpeq_epi64
     - ❌ scalarized
   * - _mm_cvtepi16_epi32
     - ✅ wasm_i32x4_widen_low_i16x8
   * - _mm_cvtepi16_epi64
     - ❌ scalarized
   * - _mm_cvtepi32_epi64
     - ❌ scalarized
   * - _mm_cvtepi8_epi16
     - ✅ wasm_i16x8_widen_low_i8x16
   * - _mm_cvtepi8_epi32
     - ❌ scalarized
   * - _mm_cvtepi8_epi64
     - ❌ scalarized
   * - _mm_cvtepu16_epi32
     - ✅ wasm_i32x4_widen_low_u16x8
   * - _mm_cvtepu16_epi64
     - ❌ scalarized
   * - _mm_cvtepu32_epi64
     - ❌ scalarized
   * - _mm_cvtepu8_epi16
     - ✅ wasm_i16x8_widen_low_u8x16
   * - _mm_cvtepu8_epi32
     - ❌ scalarized
   * - _mm_cvtepu8_epi64
     - ❌ scalarized
   * - _mm_dp_pd
     - ⚠️ emulated with SIMD mul+add+setzero+2xblend
   * - _mm_dp_ps
     - ⚠️ emulated with SIMD mul+add+setzero+2xblend
   * - _mm_extract_epi32
     - ✅ wasm_i32x4_extract_lane
   * - _mm_extract_epi64
     - ✅ wasm_i64x2_extract_lane
   * - _mm_extract_epi8
     - ✅ wasm_u8x16_extract_lane
   * - _mm_extract_ps
     - ✅ wasm_i32x4_extract_lane
   * - _mm_floor_pd
     - ❌ scalarized
   * - _mm_floor_ps
     - ❌ scalarized
   * - _mm_floor_sd
     - ❌ scalarized
   * - _mm_floor_ss
     - ❌ scalarized
   * - _mm_insert_epi32
     - ✅ wasm_i32x4_replace_lane
   * - _mm_insert_epi64
     - ✅ wasm_i64x2_replace_lane
   * - _mm_insert_epi8
     - ✅ wasm_i8x16_replace_lane
   * - _mm_insert_ps
     - ⚠️ emulated with generic non-SIMD-mapping shuffles
   * - _mm_max_epi32
     - ✅ wasm_i32x4_max
   * - _mm_max_epi8
     - ✅ wasm_i8x16_max
   * - _mm_max_epu16
     - ✅ wasm_u16x8_max
   * - _mm_max_epu32
     - ✅ wasm_u32x4_max
   * - _mm_min_epi32
     - ✅ wasm_i32x4_min
   * - _mm_min_epi8
     - ✅ wasm_i8x16_min
   * - _mm_min_epu16
     - ✅ wasm_u16x8_min
   * - _mm_min_epu32
     - ✅ wasm_u32x4_min
   * - _mm_minpos_epu16
     - scalarized
   * - _mm_mpsadbw_epu8
     - scalarized
   * - _mm_mul_epi32
     - ❌ scalarized
   * - _mm_mullo_epi32
     - ✅ wasm_i32x4_mul
   * - _mm_packus_epi32
     - ✅ wasm_u16x8_narrow_i32x4
   * - _mm_round_pd
     - scalarized
   * - _mm_round_ps
     - scalarized
   * - _mm_round_sd
     - scalarized
   * - _mm_round_ss
     - scalarized
   * - _mm_stream_load_si128
     - 🟡 wasm_v128_load. VM must guess type. :raw-html:`<br />` Unaligned load on x86 CPUs.
   * - _mm_test_all_ones
     - ❌ scalarized
   * - _mm_test_all_zeros
     - ❌ scalarized
   * - _mm_test_mix_ones_zeros
     - ❌ scalarized
   * - _mm_testc_si128
     - ❌ scalarized
   * - _mm_testnzc_si128
     - ❌ scalarized
   * - _mm_testz_si128
     - ❌ scalarized

The following table highlights the availability and expected performance of different SSE4.2 intrinsics. Refer to `Intel Intrinsics Guide on SSE4.2 <https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE4_2>`_.

.. list-table:: x86 SSE4.2 intrinsics available via #include <nmmintrin.h>
   :widths: 20 30
   :header-rows: 1

   * - Intrinsic name
     - WebAssembly SIMD support
   * - _mm_cmpgt_epi64
     - ❌ scalarized

⚫ The SSE4.2 functions that deal with string comparisons and CRC calculations are not available:
 - _mm_cmpestra, _mm_cmpestrc, _mm_cmpestri, _mm_cmpestrm, _mm_cmpestro, _mm_cmpestrs, _mm_cmpestrz, _mm_cmpistra, _mm_cmpistrc, _mm_cmpistri, _mm_cmpistrm, _mm_cmpistro, _mm_cmpistrs, _mm_cmpistrz, _mm_crc32_u16, _mm_crc32_u32, _mm_crc32_u64, _mm_crc32_u8

Any code referencing these intrinsics will not compile.

The following table highlights the availability and expected performance of different AVX intrinsics. Refer to `Intel Intrinsics Guide on AVX <https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX>`_.

.. list-table:: x86 AVX intrinsics available via #include <immintrin.h>
   :widths: 20 30
   :header-rows: 1

   * - Intrinsic name
     - WebAssembly SIMD support
   * - _mm_broadcast_ss
     - ✅ wasm_v32x4_load_splat
   * - _mm_cmp_pd
     - ⚠️ emulated with 1-2 SIMD cmp+and/or
   * - _mm_cmp_ps
     - ⚠️ emulated with 1-2 SIMD cmp+and/or
   * - _mm_cmp_sd
     - ⚠️ emulated with 1-2 SIMD cmp+and/or+move
   * - _mm_cmp_ss
     - ⚠️ emulated with 1-2 SIMD cmp+and/or+move
   * - _mm_maskload_pd
     - ⚠️ emulated with SIMD load+shift+and
   * - _mm_maskload_ps
     - ⚠️ emulated with SIMD load+shift+and
   * - _mm_maskstore_pd
     - ❌ scalarized
   * - _mm_maskstore_ps
     - ❌ scalarized
   * - _mm_permute_pd
     - emulated with a general shuffle
   * - _mm_permute_ps
     - emulated with a general shuffle
   * - _mm_permutevar_pd
     - scalarized
   * - _mm_permutevar_ps
     - scalarized
   * - _mm_testc_pd
     - emulated with complex SIMD+scalar sequence
   * - _mm_testc_ps
     - emulated with complex SIMD+scalar sequence
   * - _mm_testnzc_pd
     - emulated with complex SIMD+scalar sequence
   * - _mm_testnzc_ps
     - emulated with complex SIMD+scalar sequence
   * - _mm_testz_pd
     - emulated with complex SIMD+scalar sequence
   * - _mm_testz_ps
     - emulated with complex SIMD+scalar sequence

Only the 128-bit wide instructions from the AVX instruction set are available. 256-bit wide AVX instructions are not provided.


======================================================
Compiling SIMD code targeting ARM NEON instruction set
======================================================

Emscripten supports compiling existing codebases that use ARM NEON by passing the `-mfpu=neon` flag to the compiler, and including the header `<arm_neon.h>`.

In terms of performance, it is very important to note that only instructions which operate on 128-bit wide vectors are supported cleanly. This means that nearly any instruction which is not of a "q" variant (i.e. "vaddq" as opposed to "vadd") will be scalarized.

The NEON intrinsics implementation is pulled from the `SIMDe repository on GitHub <https://github.com/simd-everywhere/simde>`_. To update Emscripten with the latest SIMDe version, run `tools/simde_update.py`.

The following table highlights the availability of various 128-bit wide intrinsics.

Similarly to above, the following legend is used:
 - ✅ Wasm SIMD has a native opcode that matches the NEON instruction, should yield native performance
 - 💡 while the Wasm SIMD spec does not provide a proper performance guarantee, given a sufficiently smart compiler and a runtime VM path, this intrinsic should be able to generate the identical native NEON instruction.
 - ⚠️ the underlying NEON instruction is not available, but it is emulated via at most a few other Wasm SIMD instructions, causing a small penalty.
 - ❌ the underlying NEON instruction is not exposed by the Wasm SIMD specification, so it must be emulated via a slow path, e.g.
a sequence of several slower SIMD instructions, or a scalar implementation. 1045 - ⚫ the given NEON intrinsic is not available. Referencing the intrinsic will cause a compiler error. 1046 1047For detailed information on each intrinsic function, refer to `NEON Intrinsics Reference 1048<https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics>`_. 1049 1050.. list-table:: NEON Intrinsics 1051 :widths: 20 30 1052 :header-rows: 1 1053 1054 * - Intrinsic name 1055 - Wasm SIMD Support 1056 * - vaba 1057 - ⚫ Not implemented, will trigger compiler error 1058 * - vabal 1059 - ⚫ Not implemented, will trigger compiler error 1060 * - vabd 1061 - ⚫ Not implemented, will trigger compiler error 1062 * - vabdl 1063 - ⚫ Not implemented, will trigger compiler error 1064 * - vabs 1065 - native 1066 * - vadd 1067 - native 1068 * - vaddl 1069 - ⚫ Not implemented, will trigger compiler error 1070 * - vaddlv 1071 - ⚫ Not implemented, will trigger compiler error 1072 * - vaddv 1073 - ⚫ Not implemented, will trigger compiler error 1074 * - vaddw 1075 - ❌ Will be emulated with slow instructions, or scalarized 1076 * - vand 1077 - native 1078 * - vbic 1079 - ⚫ Not implemented, will trigger compiler error 1080 * - vbsl 1081 - native 1082 * - vcagt 1083 - ⚠ Does not have direct implementation, but is emulated using fast NEON instructions 1084 * - vceq 1085 - Depends on a smart enough compiler, but should be near native 1086 * - vceqz 1087 - ⚠ Does not have direct implementation, but is emulated using fast NEON instructions 1088 * - vcge 1089 - native 1090 * - vcgez 1091 - ⚠ Does not have direct implementation, but is emulated using fast NEON instructions 1092 * - vcgt 1093 - native 1094 * - vcgtz 1095 - ⚠ Does not have direct implementation, but is emulated using fast NEON instructions 1096 * - vcle 1097 - native 1098 * - vclez 1099 - ⚠ Does not have direct implementation, but is emulated using fast NEON instructions 1100 * - vcls 1101 - ⚫ Not implemented, will trigger 
compiler error 1102 * - vclt 1103 - native 1104 * - vcltz 1105 - ⚠ Does not have direct implementation, but is emulated using fast NEON instructions 1106 * - vcnt 1107 - ⚫ Not implemented, will trigger compiler error 1108 * - vclz 1109 - ⚫ Not implemented, will trigger compiler error 1110 * - vcombine 1111 - ❌ Will be emulated with slow instructions, or scalarized 1112 * - vcreate 1113 - ❌ Will be emulated with slow instructions, or scalarized 1114 * - vdot 1115 - ❌ Will be emulated with slow instructions, or scalarized 1116 * - vdot_lane 1117 - ❌ Will be emulated with slow instructions, or scalarized 1118 * - vdup 1119 - ⚫ Not implemented, will trigger compiler error 1120 * - vdup_n 1121 - native 1122 * - veor 1123 - native 1124 * - vext 1125 - ❌ Will be emulated with slow instructions, or scalarized 1126 * - vget_lane 1127 - native 1128 * - vhadd 1129 - ⚫ Not implemented, will trigger compiler error 1130 * - vhsub 1131 - ⚫ Not implemented, will trigger compiler error 1132 * - vld1 1133 - native 1134 * - vld2 1135 - ⚫ Not implemented, will trigger compiler error 1136 * - vld3 1137 - Depends on a smart enough compiler, but should be near native 1138 * - vld4 1139 - Depends on a smart enough compiler, but should be near native 1140 * - vmax 1141 - native 1142 * - vmaxv 1143 - ⚫ Not implemented, will trigger compiler error 1144 * - vmin 1145 - native 1146 * - vminv 1147 - ⚫ Not implemented, will trigger compiler error 1148 * - vmla 1149 - ⚠ Does not have direct implementation, but is emulated using fast NEON instructions 1150 * - vmlal 1151 - ❌ Will be emulated with slow instructions, or scalarized 1152 * - vmls 1153 - ⚫ Not implemented, will trigger compiler error 1154 * - vmlsl 1155 - ⚫ Not implemented, will trigger compiler error 1156 * - vmovl 1157 - native 1158 * - vmul 1159 - native 1160 * - vmul_n 1161 - ⚠ Does not have direct implementation, but is emulated using fast NEON instructions 1162 * - vmull 1163 - ⚠ Does not have direct implementation, but is 
emulated using fast NEON instructions 1164 * - vmull_n 1165 - ⚠ Does not have direct implementation, but is emulated using fast NEON instructions 1166 * - vmull_high 1167 - ❌ Will be emulated with slow instructions, or scalarized 1168 * - vmvn 1169 - native 1170 * - vneg 1171 - native 1172 * - vorn 1173 - ⚫ Not implemented, will trigger compiler error 1174 * - vorr 1175 - native 1176 * - vpadal 1177 - ❌ Will be emulated with slow instructions, or scalarized 1178 * - vpadd 1179 - ❌ Will be emulated with slow instructions, or scalarized 1180 * - vpaddl 1181 - ❌ Will be emulated with slow instructions, or scalarized 1182 * - vpmax 1183 - ❌ Will be emulated with slow instructions, or scalarized 1184 * - vpmin 1185 - ❌ Will be emulated with slow instructions, or scalarized 1186 * - vpminnm 1187 - ⚫ Not implemented, will trigger compiler error 1188 * - vqabs 1189 - ⚫ Not implemented, will trigger compiler error 1190 * - vqabsb 1191 - ⚫ Not implemented, will trigger compiler error 1192 * - vqadd 1193 - Depends on a smart enough compiler, but should be near native 1194 * - vqaddb 1195 - ⚫ Not implemented, will trigger compiler error 1196 * - vqdmulh 1197 - ❌ Will be emulated with slow instructions, or scalarized 1198 * - vqneg 1199 - ⚫ Not implemented, will trigger compiler error 1200 * - vqnegb 1201 - ⚫ Not implemented, will trigger compiler error 1202 * - vqrdmulh 1203 - ⚫ Not implemented, will trigger compiler error 1204 * - vqrshl 1205 - ⚫ Not implemented, will trigger compiler error 1206 * - vqrshlb 1207 - ⚫ Not implemented, will trigger compiler error 1208 * - vqshl 1209 - ⚫ Not implemented, will trigger compiler error 1210 * - vqshlb 1211 - ⚫ Not implemented, will trigger compiler error 1212 * - vqsub 1213 - ⚫ Not implemented, will trigger compiler error 1214 * - vqsubb 1215 - ⚫ Not implemented, will trigger compiler error 1216 * - vqtbl1 1217 - ⚠ Does not have direct implementation, but is emulated using fast NEON instructions 1218 * - vqtbl2 1219 - ⚠ Does not have 
direct implementation, but is emulated using fast NEON instructions 1220 * - vqtbl3 1221 - ⚠ Does not have direct implementation, but is emulated using fast NEON instructions 1222 * - vqtbl4 1223 - ⚠ Does not have direct implementation, but is emulated using fast NEON instructions 1224 * - vqtbx1 1225 - ❌ Will be emulated with slow instructions, or scalarized 1226 * - vqtbx2 1227 - ❌ Will be emulated with slow instructions, or scalarized 1228 * - vqtbx3 1229 - ❌ Will be emulated with slow instructions, or scalarized 1230 * - vqtbx4 1231 - ❌ Will be emulated with slow instructions, or scalarized 1232 * - vrbit 1233 - ⚠ Does not have direct implementation, but is emulated using fast NEON instructions 1234 * - vreinterpret 1235 - Depends on a smart enough compiler, but should be near native 1236 * - vrev16 1237 - native 1238 * - vrev32 1239 - native 1240 * - vrev64 1241 - native 1242 * - vrhadd 1243 - ⚠ Does not have direct implementation, but is emulated using fast NEON instructions 1244 * - vrshl 1245 - ❌ Will be emulated with slow instructions, or scalarized 1246 * - vrshr_n 1247 - ❌ Will be emulated with slow instructions, or scalarized 1248 * - vrsra_n 1249 - ❌ Will be emulated with slow instructions, or scalarized 1250 * - vset_lane 1251 - native 1252 * - vshl 1253 - scalaried 1254 * - vshl_n 1255 - ❌ Will be emulated with slow instructions, or scalarized 1256 * - vshr_n 1257 - ⚠ Does not have direct implementation, but is emulated using fast NEON instructions 1258 * - vsra_n 1259 - ❌ Will be emulated with slow instructions, or scalarized 1260 * - vst1 1261 - native 1262 * - vst1_lane 1263 - Depends on a smart enough compiler, but should be near native 1264 * - vst2 1265 - ⚫ Not implemented, will trigger compiler error 1266 * - vst3 1267 - Depends on a smart enough compiler, but should be near native 1268 * - vst4 1269 - Depends on a smart enough compiler, but should be near native 1270 * - vsub 1271 - native 1272 * - vsubl 1273 - ⚠ Does not have direct 
implementation, but is emulated using fast NEON instructions 1274 * - vsubw 1275 - ⚫ Not implemented, will trigger compiler error 1276 * - vtbl1 1277 - ❌ Will be emulated with slow instructions, or scalarized 1278 * - vtbl2 1279 - ❌ Will be emulated with slow instructions, or scalarized 1280 * - vtbl3 1281 - ❌ Will be emulated with slow instructions, or scalarized 1282 * - vtbl4 1283 - ❌ Will be emulated with slow instructions, or scalarized 1284 * - vtbx1 1285 - ❌ Will be emulated with slow instructions, or scalarized 1286 * - vtbx2 1287 - ❌ Will be emulated with slow instructions, or scalarized 1288 * - vtbx3 1289 - ❌ Will be emulated with slow instructions, or scalarized 1290 * - vtbx4 1291 - ❌ Will be emulated with slow instructions, or scalarized 1292 * - vtrn 1293 - ❌ Will be emulated with slow instructions, or scalarized 1294 * - vtrn1 1295 - ❌ Will be emulated with slow instructions, or scalarized 1296 * - vtrn2 1297 - ❌ Will be emulated with slow instructions, or scalarized 1298 * - vtst 1299 - ❌ Will be emulated with slow instructions, or scalarized 1300 * - vuqadd 1301 - ⚫ Not implemented, will trigger compiler error 1302 * - vuqaddb 1303 - ⚫ Not implemented, will trigger compiler error 1304 * - vuzp 1305 - ❌ Will be emulated with slow instructions, or scalarized 1306 * - vuzp1 1307 - ❌ Will be emulated with slow instructions, or scalarized 1308 * - vuzp2 1309 - ❌ Will be emulated with slow instructions, or scalarized 1310 * - vzip 1311 - ❌ Will be emulated with slow instructions, or scalarized 1312 * - vzip1 1313 - ❌ Will be emulated with slow instructions, or scalarized 1314 * - vzip2 1315 - ❌ Will be emulated with slow instructions, or scalarized 1316
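For NEON code that hits the ⚫ or ❌ rows above, one portable option is to rewrite the hot loop with the GCC/Clang vector extensions mentioned at the top of this page, which are lowered to Wasm SIMD instructions where possible. A minimal sketch, assuming GCC or Clang; the `f32x4` alias and `add4` helper are illustrative names of our own, not part of any header:

```cpp
#include <cassert>

// 128-bit vector of four floats via the GCC/Clang vector extension.
// Under "emcc -msimd128" this becomes a Wasm v128 value; compiled
// natively it maps to a NEON "q" register on ARM or an XMM register
// on x86.
typedef float f32x4 __attribute__((vector_size(16)));

// Element-wise add: lowers to a single f32x4.add Wasm SIMD instruction,
// the same operation as the NEON "q" variant vaddq_f32.
static inline f32x4 add4(f32x4 a, f32x4 b) { return a + b; }
```

Lane access uses plain subscripting (`v[0]`), and the same source compiles unchanged with native clang or gcc, which makes it easy to unit-test such kernels off the Wasm target.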
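Similarly, the SSE4.2 string-comparison and CRC intrinsics listed earlier do not exist under Emscripten at all, so portable codebases typically compile a scalar fallback on the Wasm target (e.g. selected behind `#ifdef __wasm__`). A sketch of such a fallback for the byte-wide CRC update, assuming the CRC-32C (Castagnoli) polynomial that the x86 crc32 instruction implements; `crc32c_u8` is our own illustrative name, not a drop-in from any header:

```cpp
#include <cstdint>

// Bitwise CRC-32C update over one byte, using the reflected Castagnoli
// polynomial 0x82F63B78 -- the same polynomial as _mm_crc32_u8.
// Slow but dependency-free; shown only as a Wasm-side fallback sketch.
static inline uint32_t crc32c_u8(uint32_t crc, uint8_t v) {
  crc ^= v;
  for (int i = 0; i < 8; ++i)
    crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
  return crc;
}
```

In practice a table-driven variant is much faster; the bitwise loop is kept here for brevity.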