# encoding_rs

[![Build Status](https://travis-ci.org/hsivonen/encoding_rs.svg?branch=master)](https://travis-ci.org/hsivonen/encoding_rs)
[![crates.io](https://meritbadge.herokuapp.com/encoding_rs)](https://crates.io/crates/encoding_rs)
[![docs.rs](https://docs.rs/encoding_rs/badge.svg)](https://docs.rs/encoding_rs/)
[![Apache 2 / MIT dual-licensed](https://img.shields.io/badge/license-Apache%202%20%2F%20MIT-blue.svg)](https://github.com/hsivonen/encoding_rs/blob/master/COPYRIGHT)

encoding_rs is an implementation of the (non-JavaScript parts of) the
[Encoding Standard](https://encoding.spec.whatwg.org/) written in Rust and
used in Gecko (starting with Firefox 56).

Additionally, the `mem` module provides various operations for dealing with
in-RAM text (as opposed to data that's coming from or going to an IO boundary).
The `mem` module is a module instead of a separate crate due to internal
implementation detail efficiencies.

## Functionality

Due to the Gecko use case, encoding_rs supports decoding to and encoding from
UTF-16 in addition to supporting the usual Rust use case of decoding to and
encoding from UTF-8. Additionally, the API has been designed to be FFI-friendly
to accommodate the C++ side of Gecko.

Specifically, encoding_rs does the following:

* Decodes a stream of bytes in an Encoding Standard-defined character encoding
  into valid aligned native-endian in-RAM UTF-16 (units of `u16` / `char16_t`).
* Encodes a stream of potentially-invalid aligned native-endian in-RAM UTF-16
  (units of `u16` / `char16_t`) into a sequence of bytes in an Encoding
  Standard-defined character encoding as if the lone surrogates had been
  replaced with the REPLACEMENT CHARACTER before performing the encode.
  (Gecko's UTF-16 is potentially invalid.)
* Decodes a stream of bytes in an Encoding Standard-defined character
  encoding into valid UTF-8.
* Encodes a stream of valid UTF-8 into a sequence of bytes in an Encoding
  Standard-defined character encoding. (Rust's UTF-8 is guaranteed-valid.)
* Does the above in streaming (input and output split across multiple
  buffers) and non-streaming (whole input in a single buffer and whole
  output in a single buffer) variants.
* Avoids copying (borrows) when possible in the non-streaming cases when
  decoding to or encoding from UTF-8.
* Resolves textual labels that identify character encodings in
  protocol text into type-safe objects representing those encodings
  conceptually.
* Maps the type-safe encoding objects onto strings suitable for
  returning from `document.characterSet`.
* Validates UTF-8 (in common instruction set scenarios a bit faster for Web
  workloads than the standard library; hopefully will get upstreamed some
  day) and ASCII.

Additionally, `encoding_rs::mem` does the following:

* Checks if a byte buffer contains only ASCII.
* Checks if a potentially-invalid UTF-16 buffer contains only Basic Latin (ASCII).
* Checks if a valid UTF-8, potentially-invalid UTF-8 or potentially-invalid UTF-16
  buffer contains only Latin1 code points (below U+0100).
* Checks if a valid UTF-8, potentially-invalid UTF-8 or potentially-invalid UTF-16
  buffer or a code point or a UTF-16 code unit can trigger right-to-left behavior
  (suitable for checking if the Unicode Bidirectional Algorithm can be optimized
  out).
* Combined versions of the above two checks.
* Converts valid UTF-8, potentially-invalid UTF-8 and Latin1 to UTF-16.
* Converts potentially-invalid UTF-16 and Latin1 to UTF-8.
* Converts UTF-8 and UTF-16 to Latin1 (if in range).
* Finds the first invalid code unit in a buffer of potentially-invalid UTF-16.
* Makes a mutable buffer of potentially-invalid UTF-16 contain valid UTF-16.
* Copies ASCII from one buffer to another up to the first non-ASCII byte.
* Converts ASCII to UTF-16 up to the first non-ASCII byte.
* Converts UTF-16 to ASCII up to the first non-Basic Latin code unit.

## Integration with `std::io`

Notably, the above feature list doesn't include the capability to wrap
a `std::io::Read`, decode it into UTF-8 and present the result via
`std::io::Read`. The [`encoding_rs_io`](https://crates.io/crates/encoding_rs_io)
crate provides that capability.

## `no_std` Environment

The crate works in a `no_std` environment assuming that `alloc` is present.
The `alloc`-using parts are on the outer edge of the crate, so if there is
interest in using the crate in environments without `alloc` it would be
feasible to add a way to turn off those parts of the API of this crate that
use `Vec`/`String`/`Cow`.

## Decoding Email

For decoding character encodings that occur in email, use the
[`charset`](https://crates.io/crates/charset) crate instead of using this
one directly. (It wraps this crate and adds UTF-7 decoding.)

## Windows Code Page Identifier Mappings

For mappings to and from Windows code page identifiers, use the
[`codepage`](https://crates.io/crates/codepage) crate.

## DOS Encodings

This crate does not support single-byte DOS encodings that aren't required by
the Web Platform, but the [`oem_cp`](https://crates.io/crates/oem_cp) crate does.

## Preparing Text for the Encoders

Normalizing text into Unicode Normalization Form C prior to encoding text into
a legacy encoding minimizes unmappable characters. Text can be normalized to
Unicode Normalization Form C using the
[`unic-normal`](https://crates.io/crates/unic-normal) crate.

The exception is windows-1258, which after normalizing to Unicode Normalization
Form C requires tone marks to be decomposed in order to minimize unmappable
characters. Vietnamese tone marks can be decomposed using the
[`detone`](https://crates.io/crates/detone) crate.
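
To make the point about Normalization Form C concrete, here is a minimal standard-library-only sketch. The function `encode_latin1_strict` is a hypothetical stand-in for a legacy single-byte encoder (it maps code points below U+0100 to the byte of the same value); it is not part of the encoding_rs API, and the normalization step itself is written out by hand rather than performed by a crate.

```rust
/// Hypothetical strict encoder, for illustration only: maps each code point
/// below U+0100 to the byte of the same value and reports the first
/// unmappable character otherwise. This is not an encoding_rs function.
fn encode_latin1_strict(text: &str) -> Result<Vec<u8>, char> {
    text.chars()
        .map(|c| if (c as u32) < 0x100 { Ok(c as u32 as u8) } else { Err(c) })
        .collect()
}

/// Compares the precomposed (NFC) and decomposed spellings of the same text.
fn demo_normalization() {
    // "é" as the single precomposed code point U+00E9 (the NFC form).
    let composed = "caf\u{00E9}";
    // "e" followed by U+0301 COMBINING ACUTE ACCENT (the decomposed form).
    let decomposed = "cafe\u{0301}";

    // The precomposed form fits in the single-byte repertoire.
    assert_eq!(encode_latin1_strict(composed), Ok(vec![b'c', b'a', b'f', 0xE9]));
    // The combining mark has no single-byte mapping, so encoding fails on it.
    assert_eq!(encode_latin1_strict(decomposed), Err('\u{0301}'));
}
```

Calling `demo_normalization()` from `main` exercises both cases: the same visible text is mappable only in its precomposed spelling.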

## Licensing

Please see the file named
[COPYRIGHT](https://github.com/hsivonen/encoding_rs/blob/master/COPYRIGHT).

## Documentation

Generated [API documentation](https://docs.rs/encoding_rs/) is available
online.

There is a [long-form write-up](https://hsivonen.fi/encoding_rs/) about the
design and internals of the crate.

## C and C++ bindings

An FFI layer for encoding_rs is available as a
[separate crate](https://github.com/hsivonen/encoding_c). The crate comes
with a [demo C++ wrapper](https://github.com/hsivonen/encoding_c/blob/master/include/encoding_rs_cpp.h)
using the C++ standard library and [GSL](https://github.com/Microsoft/GSL/) types.

The bindings for the `mem` module are in the
[encoding_c_mem crate](https://github.com/hsivonen/encoding_c_mem).

For the Gecko context, there's a
[C++ wrapper using the MFBT/XPCOM types](https://searchfox.org/mozilla-central/source/intl/Encoding.h#100).

There's a [write-up](https://hsivonen.fi/modern-cpp-in-rust/) about the C++
wrappers.

## Sample programs

* [Rust](https://github.com/hsivonen/recode_rs)
* [C](https://github.com/hsivonen/recode_c)
* [C++](https://github.com/hsivonen/recode_cpp)

## Optional features

There are currently these optional cargo features:

### `simd-accel`

Enables SIMD acceleration using the nightly-dependent `packed_simd_2` crate.

This is an opt-in feature, because enabling this feature _opts out_ of Rust's
guarantees of future compilers compiling old code (aka. "stability story").

Currently, this has not been tested to be an improvement except for these
targets:

* x86_64
* i686
* aarch64
* thumbv7neon

If you use nightly Rust, you use targets whose first component is one of the
above, and you are prepared _to have to revise your configuration when updating
Rust_, you should enable this feature.
Otherwise, please _do not_ enable this
feature.

_Note!_ If you are compiling for a target that does not have 128-bit SIMD
enabled as part of the target definition and you are enabling 128-bit SIMD
using `-C target_feature`, you need to enable the `core_arch` Cargo feature
for `packed_simd_2` to compile a crates.io snapshot of `core_arch` instead of
using the standard-library copy of `core::arch`, because the `core::arch`
module of the pre-compiled standard library has been compiled with the
assumption that the CPU doesn't have 128-bit SIMD. At present this applies
mainly to 32-bit ARM targets whose first component does not include the
substring `neon`.

The encoding_rs side of things has not been properly set up for POWER,
PowerPC, MIPS, etc., SIMD at this time, so even if you were to follow
the advice from the previous paragraph, you probably shouldn't use
the `simd-accel` option on the less mainstream architectures at this
time.

Used by Firefox.

### `serde`

Enables support for serializing and deserializing `&'static Encoding`-typed
struct fields using [Serde][1].

[1]: https://serde.rs/

Not used by Firefox.

### `fast-legacy-encode`

A catch-all option for enabling the fastest legacy encode options. _Does not
affect decode speed or UTF-8 encode speed._

At present, this option is equivalent to enabling the following options:
 * `fast-hangul-encode`
 * `fast-hanja-encode`
 * `fast-kanji-encode`
 * `fast-gb-hanzi-encode`
 * `fast-big5-hanzi-encode`

Adds 176 KB to the binary size.

Not used by Firefox.

### `fast-hangul-encode`

Changes encoding of precomposed Hangul syllables into EUC-KR from binary
search over the decode-optimized tables to lookup by index, making Korean
plain-text encode about 4 times as fast as without this option.

Adds 20 KB to the binary size.

Does _not_ affect decode speed.

Not used by Firefox.

### `fast-hanja-encode`

Changes encoding of Hanja into EUC-KR from linear search over the
decode-optimized table to lookup by index. Since Hanja is practically absent
in modern Korean text, this option doesn't affect performance in the common
case and mainly makes sense if you want to make your application resilient
against denial of service by someone intentionally feeding it a lot of Hanja
to encode into EUC-KR.

Adds 40 KB to the binary size.

Does _not_ affect decode speed.

Not used by Firefox.

### `fast-kanji-encode`

Changes encoding of Kanji into Shift_JIS, EUC-JP and ISO-2022-JP from linear
search over the decode-optimized tables to lookup by index, making Japanese
plain-text encode to legacy encodings 30 to 50 times as fast as without this
option (about 2 times as fast as with `less-slow-kanji-encode`).

Takes precedence over `less-slow-kanji-encode`.

Adds 36 KB to the binary size (24 KB compared to `less-slow-kanji-encode`).

Does _not_ affect decode speed.

Not used by Firefox.

### `less-slow-kanji-encode`

Makes JIS X 0208 Level 1 Kanji (the most common Kanji in Shift_JIS, EUC-JP and
ISO-2022-JP) encode less slow (binary search instead of linear search), making
Japanese plain-text encode to legacy encodings 14 to 23 times as fast as
without this option.

Adds 12 KB to the binary size.

Does _not_ affect decode speed.

Not used by Firefox.
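
The size-versus-speed trade-off that the `fast-*` and `less-slow-*` options describe can be sketched with a toy table. Everything below is hypothetical and standard-library-only: the four-entry `DECODE_TABLE`, the `FIRST` constant and all function names are made up for illustration and do not reflect the crate's actual table layout or internals.

```rust
/// Hypothetical, tiny decode-optimized table: the index is the legacy
/// "pointer" and the value is the code point. Not sorted by code point.
const DECODE_TABLE: [u16; 4] = [0x963F, 0x4E9C, 0x5A03, 0x5516];

/// First code point covered by the toy index table below (an assumption).
const FIRST: u16 = 0x4E00;

/// Default-style encode: linear search over the decode-optimized table.
/// Needs no extra data but is O(n) per character.
fn encode_linear(code_point: u16) -> Option<usize> {
    DECODE_TABLE.iter().position(|&c| c == code_point)
}

/// Extra data for the binary-search variant: (code point, pointer) pairs
/// sorted by code point.
fn build_sorted_pairs() -> Vec<(u16, usize)> {
    let mut pairs: Vec<(u16, usize)> = DECODE_TABLE
        .iter()
        .enumerate()
        .map(|(pointer, &c)| (c, pointer))
        .collect();
    pairs.sort_unstable();
    pairs
}

/// "Less slow"-style encode: O(log n) binary search over the sorted pairs.
fn encode_binary(pairs: &[(u16, usize)], code_point: u16) -> Option<usize> {
    pairs
        .binary_search_by_key(&code_point, |&(c, _)| c)
        .ok()
        .map(|i| pairs[i].1)
}

/// Extra data for the index variant: one slot per code point in the covered
/// range, so memory grows with the range rather than with the entry count.
fn build_index_table() -> Vec<Option<usize>> {
    let last = *DECODE_TABLE.iter().max().unwrap();
    let mut table = vec![None; (last - FIRST) as usize + 1];
    for (pointer, &c) in DECODE_TABLE.iter().enumerate() {
        table[(c - FIRST) as usize] = Some(pointer);
    }
    table
}

/// "Fast"-style encode: O(1) lookup by index.
fn encode_indexed(table: &[Option<usize>], code_point: u16) -> Option<usize> {
    let offset = code_point.checked_sub(FIRST)? as usize;
    table.get(offset).copied().flatten()
}

/// Checks that all three strategies agree on mapped and unmapped input.
fn demo_lookup() {
    let pairs = build_sorted_pairs();
    let index = build_index_table();
    for &c in &[0x5A03u16, 0x4E00, 0x0041] {
        assert_eq!(encode_linear(c), encode_binary(&pairs, c));
        assert_eq!(encode_linear(c), encode_indexed(&index, c));
    }
    assert_eq!(encode_linear(0x5A03), Some(2));
}
```

All three strategies return the same pointers; they differ only in how much encode-specific data they carry and how much work each lookup does, which is the choice the cargo features expose.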

### `fast-gb-hanzi-encode`

Changes encoding of Hanzi in the CJK Unified Ideographs block into GBK and
gb18030 from linear search over a part of the decode-optimized tables followed
by a binary search over another part of the decode-optimized tables to lookup
by index, making Simplified Chinese plain-text encode to the legacy encodings
100 to 110 times as fast as without this option (about 2.5 times as fast as
with `less-slow-gb-hanzi-encode`).

Takes precedence over `less-slow-gb-hanzi-encode`.

Adds 36 KB to the binary size (24 KB compared to `less-slow-gb-hanzi-encode`).

Does _not_ affect decode speed.

Not used by Firefox.

### `less-slow-gb-hanzi-encode`

Makes GB2312 Level 1 Hanzi (the most common Hanzi in gb18030 and GBK) encode
less slow (binary search instead of linear search), making Simplified Chinese
plain-text encode to the legacy encodings about 40 times as fast as without
this option.

Adds 12 KB to the binary size.

Does _not_ affect decode speed.

Not used by Firefox.

### `fast-big5-hanzi-encode`

Changes encoding of Hanzi in the CJK Unified Ideographs block into Big5 from
linear search over a part of the decode-optimized tables to lookup by index,
making Traditional Chinese plain-text encode to Big5 105 to 125 times as fast
as without this option (about 3 times as fast as with
`less-slow-big5-hanzi-encode`).

Takes precedence over `less-slow-big5-hanzi-encode`.

Adds 40 KB to the binary size (20 KB compared to `less-slow-big5-hanzi-encode`).

Does _not_ affect decode speed.

Not used by Firefox.

### `less-slow-big5-hanzi-encode`

Makes Big5 Level 1 Hanzi (the most common Hanzi in Big5) encode less slow
(binary search instead of linear search), making Traditional Chinese
plain-text encode to Big5 about 36 times as fast as without this option.

Adds 20 KB to the binary size.

Does _not_ affect decode speed.

Not used by Firefox.

## Performance goals

For decoding to UTF-16, the goal is to perform at least as well as Gecko's old
uconv. For decoding to UTF-8, the goal is to perform at least as well as
rust-encoding. These goals have been achieved.

Encoding to UTF-8 should be fast. (UTF-8 to UTF-8 encode should be equivalent
to `memcpy` and UTF-16 to UTF-8 should be fast.)

Speed is a non-goal when encoding to legacy encodings. By default, encoding to
legacy encodings should not be optimized for speed at the expense of code size
as long as form submission and URL parsing in Gecko don't become noticeably
too slow in real-world use.

In the interest of binary size, by default, encoding_rs does not have
encode-specific data tables beyond 32 bits of encode-specific data for each
single-byte encoding. Therefore, encoders search the decode-optimized data
tables. This is a linear search in most cases. As a result, by default, encode
to legacy encodings varies from slow to extremely slow relative to other
libraries. Still, with realistic workloads, this seemed fast enough not to be
user-visibly slow on Raspberry Pi 3 (which stood in for a phone for testing)
in the Web-exposed encoder use cases.

See the cargo features above for optionally making CJK legacy encode fast.

A framework for measuring performance is [available separately][2].

[2]: https://github.com/hsivonen/encoding_bench/

## Rust Version Compatibility

It is a goal to support the latest stable Rust, the latest nightly Rust and
the version of Rust that's used for Firefox Nightly.

At this time, there is no firm commitment to support a version older than
what's required by Firefox, and there is no commitment to treat MSRV changes
as semver-breaking, because this crate depends on `cfg-if`, which doesn't
appear to treat MSRV changes as semver-breaking, so it would be useless for
this crate to treat MSRV changes as semver-breaking.

As of 2021-02-04, MSRV appears to be Rust 1.36.0 for using the crate and
1.42.0 for doc tests to pass without errors about the global allocator.

## Compatibility with rust-encoding

A compatibility layer that implements the rust-encoding API on top of
encoding_rs is
[provided as a separate crate](https://github.com/hsivonen/encoding_rs_compat)
(cannot be uploaded to crates.io). The compatibility layer was originally
written with the assumption that Firefox would need it, but it is not currently
used in Firefox.

## Regenerating Generated Code

To regenerate the generated code:

 * Have Python 2 installed.
 * Clone [`https://github.com/hsivonen/encoding_c`](https://github.com/hsivonen/encoding_c)
   next to the `encoding_rs` directory.
 * Clone [`https://github.com/hsivonen/codepage`](https://github.com/hsivonen/codepage)
   next to the `encoding_rs` directory.
 * Clone [`https://github.com/whatwg/encoding`](https://github.com/whatwg/encoding)
   next to the `encoding_rs` directory.
 * Check out revision `f381389` of the `encoding` repo.
 * With the `encoding_rs` directory as the working directory, run
   `python generate-encoding-data.py`.

## Roadmap

- [x] Design the low-level API.
- [x] Provide Rust-only convenience features.
- [x] Provide an stl/gsl-flavored C++ API.
- [x] Implement all decoders and encoders.
- [x] Add unit tests for all decoders and encoders.
- [x] Finish BOM sniffing variants in Rust-only convenience features.
- [x] Document the API.
- [x] Publish the crate on crates.io.
- [x] Create a solution for measuring performance.
- [x] Accelerate ASCII conversions using SSE2 on x86.
- [x] Accelerate ASCII conversions using ALU register-sized operations on
      non-x86 architectures (process a `usize` instead of `u8` at a time).
- [x] Split FFI into a separate crate so that the FFI doesn't interfere with
      LTO in pure-Rust usage.
- [x] Compress CJK indices by making use of sequential code points as well
      as Unicode-ordered parts of indices.
- [x] Make lookups by label or name use binary search that searches from the
      end of the label/name to the start.
- [x] Make labels with non-ASCII bytes fail fast.
- [ ] ~Parallelize UTF-8 validation using [Rayon](https://github.com/nikomatsakis/rayon).~
      (This turned out to be a pessimization in the ASCII case due to memory bandwidth reasons.)
- [x] Provide an XPCOM/MFBT-flavored C++ API.
- [x] Investigate accelerating single-byte encode with a single fast-tracked
      range per encoding.
- [x] Replace uconv with encoding_rs in Gecko.
- [x] Implement the rust-encoding API in terms of encoding_rs.
- [x] Add SIMD acceleration for Aarch64.
- [x] Investigate the use of NEON on 32-bit ARM.
- [ ] ~Investigate Björn Höhrmann's lookup table acceleration for UTF-8 as
      adapted to Rust in rust-encoding.~
- [x] Add actually fast CJK encode options.
- [ ] ~Investigate [Bob Steagall's lookup table acceleration for UTF-8](https://github.com/BobSteagall/CppNow2018/blob/master/FastConversionFromUTF-8/Fast%20Conversion%20From%20UTF-8%20with%20C%2B%2B%2C%20DFAs%2C%20and%20SSE%20Intrinsics%20-%20Bob%20Steagall%20-%20C%2B%2BNow%202018.pdf).~
- [ ] Provide a build mode that works without `alloc` (with lesser API surface).
- [ ] Migrate to `std::simd` once it is stable and declare 1.0.

## Release Notes

### 0.8.28

* Fix error in Serde support introduced as part of `no_std` support.

### 0.8.27

* Make the crate work in a `no_std` environment (with `alloc`).

### 0.8.26

* Fix oversights in edition 2018 migration that broke the `simd-accel` feature.

### 0.8.25

* Do pointer alignment checks in a way where intermediate steps aren't defined to be Undefined Behavior.
* Update the `packed_simd` dependency to `packed_simd_2`.
* Update the `cfg-if` dependency to 1.0.
* Address warnings that have been introduced by newer Rust versions along the way.
* Update to edition 2018, since even prior to 1.0 `cfg-if` updated to edition 2018 without a semver break.

### 0.8.24

* Avoid computing an intermediate (not dereferenced) pointer value in a manner designated as Undefined Behavior when computing pointer alignment.

### 0.8.23

* Remove year from copyright notices. (No features or bug fixes.)

### 0.8.22

* Formatting fix and new unit test. (No features or bug fixes.)

### 0.8.21

* Fixed a panic with invalid UTF-16[BE|LE] input at the end of the stream.

### 0.8.20

* Make `Decoder::latin1_byte_compatible_up_to` return `None` in more
  cases to make the method actually useful. While this could be argued
  to be a breaking change due to the bug fix changing semantics, it does
  not break callers that had to handle the `None` case in a reasonable
  way anyway.

### 0.8.19

* Removed a bunch of bound checks in `convert_str_to_utf16`.
* Added `mem::convert_utf8_to_utf16_without_replacement`.

### 0.8.18

* Added `mem::utf8_latin1_up_to` and `mem::str_latin1_up_to`.
* Added `Decoder::latin1_byte_compatible_up_to`.

### 0.8.17

* Update `bincode` (dev dependency) version requirement to 1.0.

### 0.8.16

* Switch from the `simd` crate to `packed_simd`.

### 0.8.15

* Adjust documentation for `simd-accel` (README-only release).

### 0.8.14

* Made UTF-16 to UTF-8 encode conversion fill the output buffer as
  closely as possible.

### 0.8.13

* Made the UTF-8 to UTF-16 decoder compare the number of code units written
  with the length of the right slice (the output slice) to fix a panic
  introduced in 0.8.11.

### 0.8.12

* Removed the `clippy::` prefix from clippy lint names.

### 0.8.11

* Changed minimum Rust requirement to 1.29.0 (for the ability to refer
  to the interior of a `static` when defining another `static`).
* Explicitly aligned the lookup tables for single-byte encodings and
  UTF-8 to cache lines in the hope of freeing up one cache line for
  other data. (Perhaps the tables were already aligned and this is
  placebo.)
* Added 32 bits of encode-oriented data for each single-byte encoding.
  The change was performance-neutral for non-Latin1-ish Latin legacy
  encodings, improved Latin1-ish and Arabic legacy encode speed
  somewhat (new speed is 2.4x the old speed for German, 2.3x for
  Arabic, 1.7x for Portuguese and 1.4x for French) and improved
  non-Latin1, non-Arabic legacy single-byte encode a lot (7.2x for
  Thai, 6x for Greek, 5x for Russian, 4x for Hebrew).
* Added compile-time options for fast CJK legacy encode options (at
  the cost of binary size (up to 176 KB) and run-time memory usage).
  These options still retain the overall code structure instead of
  rewriting the CJK encoders totally, so the speed isn't as good as
  what could be achieved by using even more memory / making the
  binary even larger.
* Made UTF-8 decode and validation faster.
* Added method `is_single_byte()` on `Encoding`.
* Added `mem::decode_latin1()` and `mem::encode_latin1_lossy()`.

### 0.8.10

* Disabled a unit test that tests a panic condition when the assertion
  being tested is disabled.

### 0.8.9

* Made `--features simd-accel` work with stable-channel compiler to
  simplify the Firefox build system.

### 0.8.8

* Made the `is_foo_bidi()` functions not treat U+FEFF (ZERO WIDTH NO-BREAK
  SPACE aka. BYTE ORDER MARK) as right-to-left.
* Made the `is_foo_bidi()` functions report `true` if the input contains
  Hebrew presentation forms (which are right-to-left but not in a
  right-to-left-roadmapped block).

### 0.8.7

* Fixed a panic in the UTF-16LE/UTF-16BE decoder when decoding to UTF-8.

### 0.8.6

* Temporarily removed the debug assertion added in version 0.8.5 from
  `convert_utf16_to_latin1_lossy`.

### 0.8.5

* If debug assertions are enabled but fuzzing isn't enabled, lossy conversions
  to Latin1 in the `mem` module assert that the input is in the range
  U+0000...U+00FF (inclusive).
* In the `mem` module provide conversions from Latin1 and UTF-16 to UTF-8
  that can deal with insufficient output space. The idea is to use them
  first with an allocation rounded up to jemalloc bucket size and do the
  worst-case allocation only if the jemalloc rounding up was insufficient
  as the first guess.

### 0.8.4

* Fix SSE2-specific, `simd-accel`-specific memory corruption introduced in
  version 0.8.1 in conversions between UTF-16 and Latin1 in the `mem` module.

### 0.8.3

* Removed an `#[inline(never)]` annotation that was not meant for release.

### 0.8.2

* Made non-ASCII UTF-16 to UTF-8 encode faster by manually omitting bound
  checks and manually adding branch prediction annotations.

### 0.8.1

* Tweaked loop unrolling and memory alignment for SSE2 conversions between
  UTF-16 and Latin1 in the `mem` module to increase the performance when
  converting long buffers.

### 0.8.0

* Changed the minimum supported version of Rust to 1.21.0 (semver breaking
  change).
* Flipped around the defaults vs. optional features for controlling the size
  vs. speed trade-off for Kanji and Hanzi legacy encode (semver breaking
  change).
* Added NEON support on ARMv7.
* SIMD-accelerated x-user-defined to UTF-16 decode.
* Made UTF-16LE and UTF-16BE decode a lot faster (including SIMD
  acceleration).

### 0.7.2

* Add the `mem` module.
* Refactor SIMD code which can affect performance outside the `mem`
  module.

### 0.7.1

* When encoding from invalid UTF-16, correctly handle U+DC00 followed by
  another low surrogate.

### 0.7.0

* [Make `replacement` a label of the replacement
  encoding.](https://github.com/whatwg/encoding/issues/70) (Spec change.)
* Remove `Encoding::for_name()`. (`Encoding::for_label(foo).unwrap()` is
  now close enough after the above label change.)
* Remove the `parallel-utf8` cargo feature.
* Add optional Serde support for `&'static Encoding`.
* Performance tweaks for ASCII handling.
* Performance tweaks for UTF-8 validation.
* SIMD support on aarch64.

### 0.6.11

* Make `Encoder::has_pending_state()` public.
* Update the `simd` crate dependency to 0.2.0.

### 0.6.10

* Reserve enough space for NCRs when encoding to ISO-2022-JP.
* Correct max length calculations for multibyte decoders.
* Correct max length calculations before BOM sniffing has been
  performed.
* Correctly calculate max length when encoding from UTF-16 to GBK.

### 0.6.9

* [Don't prepend anything when gb18030 range decode
  fails](https://github.com/whatwg/encoding/issues/110). (Spec change.)

### 0.6.8

* Correctly handle the case where the first buffer contains potentially
  partial BOM and the next buffer is the last buffer.
* Decode byte `7F` correctly in ISO-2022-JP.
* Make UTF-16 to UTF-8 encode write closer to the end of the buffer.
* Implement `Hash` for `Encoding`.

### 0.6.7

* [Map half-width katakana to full-width katakana in ISO-2022-JP
  encoder](https://github.com/whatwg/encoding/issues/105). (Spec change.)
* Give `InputEmpty` correct precedence over `OutputFull` when encoding
  with replacement and the output buffer passed in is too short or the
  remaining space in the output buffer is too small after a replacement.

### 0.6.6

* Correct max length calculation when a partial BOM prefix is part of
  the decoder's state.

### 0.6.5

* Correct max length calculation in various encoders.
* Correct max length calculation in the UTF-16 decoder.
* Derive `PartialEq` and `Eq` for the `CoderResult`, `DecoderResult`
  and `EncoderResult` types.

### 0.6.4

* Avoid panic when encoding with replacement and the destination buffer is
  too short to hold one numeric character reference.

### 0.6.3

* Add support for 32-bit big-endian hosts. (For real this time.)

### 0.6.2

* Fix a panic from subslicing with bad indices in
  `Encoder::encode_from_utf16`. (Due to an oversight, it lacked the fix that
  `Encoder::encode_from_utf8` already had.)
* Micro-optimize error status accumulation in non-streaming case.

### 0.6.1

* Avoid panic near integer overflow in a case that's unlikely to actually
  happen.
* Address Clippy lints.

### 0.6.0

* Make the methods for computing worst-case buffer size requirements check
  for integer overflow.
* Upgrade rayon to 0.7.0.

### 0.5.1

* Reorder methods for better documentation readability.
* Add support for big-endian hosts. (Only 64-bit case actually tested.)
* Optimize the ALU (non-SIMD) case for 32-bit ARM instead of x86_64.

### 0.5.0

* Avoid allocating excessively long buffers in non-streaming decode.
* Fix the behavior of ISO-2022-JP and replacement decoders near the end of the
  output buffer.
* Annotate the result structs with `#[must_use]`.

### 0.4.0

* Split FFI into a separate crate.
* Performance tweaks.
* CJK binary size and encoding performance changes.
* Parallelize UTF-8 validation in the case of long buffers (with optional
  feature `parallel-utf8`).
* Borrow even with ISO-2022-JP when possible.

### 0.3.2

* Fix moving pointers to alignment in ALU-based ASCII acceleration.
* Fix errors in documentation and improve documentation.

### 0.3.1

* Fix UTF-8 to UTF-16 decode for byte sequences beginning with 0xEE.
* Make UTF-8 to UTF-8 decode SSE2-accelerated when feature `simd-accel` is used.
* When decoding and encoding ASCII-only input from or to an ASCII-compatible
  encoding using the non-streaming API, return a borrow of the input.
* Make encode from UTF-16 to UTF-8 faster.

### 0.3

* Change the references to the instances of `Encoding` from `const` to `static`
  to make the referents unique across crates that use the references.
* Introduce non-reference-typed `FOO_INIT` instances of `Encoding` to allow
  foreign crates to initialize `static` arrays with references to `Encoding`
  instances even under Rust's constraints that prohibit the initialization of
  `&'static Encoding`-typed array items with `&'static Encoding`-typed
  `statics`.
* Document that the above two points will be reverted if Rust changes `const`
  to work so that cross-crate usage keeps the referents unique.
* Return `Cow`s from Rust-only non-streaming methods for encode and decode.
* `Encoding::for_bom()` returns the length of the BOM.
* ASCII-accelerated conversions for encodings other than UTF-16LE, UTF-16BE,
  ISO-2022-JP and x-user-defined.
* Add SSE2 acceleration behind the `simd-accel` feature flag. (Requires
  nightly Rust.)
* Fix panic with long bogus labels.
* Map [0xCA to U+05BA in windows-1255](https://github.com/whatwg/encoding/issues/73).
  (Spec change.)
* Correct the [end of the Shift_JIS EUDC range](https://github.com/whatwg/encoding/issues/53).
  (Spec change.)

### 0.2.4

* Polish FFI documentation.

### 0.2.3

* Fix UTF-16 to UTF-8 encode.

### 0.2.2

* Add `Encoder.encode_from_utf8_to_vec_without_replacement()`.

### 0.2.1

* Add `Encoding.is_ascii_compatible()`.

* Add `Encoding::for_bom()`.

* Make `==` for `Encoding` use name comparison instead of pointer comparison,
  because uses of the encoding constants in different crates result in
  different addresses and the constants cannot be turned into statics without
  breaking other things.

### 0.2.0

The initial release.