# encoding_rs

[![Build Status](https://travis-ci.org/hsivonen/encoding_rs.svg?branch=master)](https://travis-ci.org/hsivonen/encoding_rs)
[![crates.io](https://img.shields.io/crates/v/encoding_rs.svg)](https://crates.io/crates/encoding_rs)
[![docs.rs](https://docs.rs/encoding_rs/badge.svg)](https://docs.rs/encoding_rs/)

encoding_rs is an implementation of (the non-JavaScript parts of) the
[Encoding Standard](https://encoding.spec.whatwg.org/) written in Rust.

The Encoding Standard defines the Web-compatible set of character encodings,
which means this crate can be used to decode Web content. encoding_rs is
used in Gecko starting with Firefox 56. Due to the notable overlap between
the legacy encodings on the Web and the legacy encodings used on Windows,
this crate may be of use for non-Web-related situations as well; see below
for links to adjacent crates.

Additionally, the `mem` module provides various operations for dealing with
in-RAM text (as opposed to data that's coming from or going to an IO boundary).
The `mem` module is a module instead of a separate crate due to internal
implementation detail efficiencies.

## Functionality

Due to the Gecko use case, encoding_rs supports decoding to and encoding from
UTF-16 in addition to supporting the usual Rust use case of decoding to and
encoding from UTF-8. Additionally, the API has been designed to be FFI-friendly
to accommodate the C++ side of Gecko.

Specifically, encoding_rs does the following:

* Decodes a stream of bytes in an Encoding Standard-defined character encoding
  into valid aligned native-endian in-RAM UTF-16 (units of `u16` / `char16_t`).
* Encodes a stream of potentially-invalid aligned native-endian in-RAM UTF-16
  (units of `u16` / `char16_t`) into a sequence of bytes in an Encoding
  Standard-defined character encoding as if the lone surrogates had been
  replaced with the REPLACEMENT CHARACTER before performing the encode.
  (Gecko's UTF-16 is potentially invalid.)
* Decodes a stream of bytes in an Encoding Standard-defined character
  encoding into valid UTF-8.
* Encodes a stream of valid UTF-8 into a sequence of bytes in an Encoding
  Standard-defined character encoding. (Rust's UTF-8 is guaranteed-valid.)
* Does the above in streaming (input and output split across multiple
  buffers) and non-streaming (whole input in a single buffer and whole
  output in a single buffer) variants.
* Avoids copying (borrows) when possible in the non-streaming cases when
  decoding to or encoding from UTF-8.
* Resolves textual labels that identify character encodings in
  protocol text into type-safe objects representing those encodings
  conceptually.
* Maps the type-safe encoding objects onto strings suitable for
  returning from `document.characterSet`.
* Validates UTF-8 (in common instruction set scenarios a bit faster for Web
  workloads than the standard library; hopefully will get upstreamed some
  day) and ASCII.

Additionally, `encoding_rs::mem` does the following:

* Checks if a byte buffer contains only ASCII.
* Checks if a potentially-invalid UTF-16 buffer contains only Basic Latin (ASCII).
* Checks if a valid UTF-8, potentially-invalid UTF-8 or potentially-invalid UTF-16
  buffer contains only Latin1 code points (below U+0100).
* Checks if a valid UTF-8, potentially-invalid UTF-8 or potentially-invalid UTF-16
  buffer or a code point or a UTF-16 code unit can trigger right-to-left behavior
  (suitable for checking if the Unicode Bidirectional Algorithm can be optimized
  out).
* Combined versions of the above two checks.
* Converts valid UTF-8, potentially-invalid UTF-8 and Latin1 to UTF-16.
* Converts potentially-invalid UTF-16 and Latin1 to UTF-8.
* Converts UTF-8 and UTF-16 to Latin1 (if in range).
* Finds the first invalid code unit in a buffer of potentially-invalid UTF-16.
* Makes a mutable buffer of potentially-invalid UTF-16 contain valid UTF-16.
* Copies ASCII from one buffer to another up to the first non-ASCII byte.
* Converts ASCII to UTF-16 up to the first non-ASCII byte.
* Converts UTF-16 to ASCII up to the first non-Basic Latin code unit.

## Integration with `std::io`

Notably, the above feature list doesn't include the capability to wrap
a `std::io::Read`, decode it into UTF-8 and present the result via
`std::io::Read`. The [`encoding_rs_io`](https://crates.io/crates/encoding_rs_io)
crate provides that capability.

## `no_std` Environment

The crate works in a `no_std` environment. By default, the `alloc` feature,
which assumes that an allocator is present, is enabled. For a no-allocator
environment, the default features (i.e. `alloc`) can be turned off. This
makes the part of the API that returns `Vec`/`String`/`Cow` unavailable.

## Decoding Email

For decoding character encodings that occur in email, use the
[`charset`](https://crates.io/crates/charset) crate instead of using this
one directly. (It wraps this crate and adds UTF-7 decoding.)

## Windows Code Page Identifier Mappings

For mappings to and from Windows code page identifiers, use the
[`codepage`](https://crates.io/crates/codepage) crate.

## DOS Encodings

This crate does not support single-byte DOS encodings that aren't required by
the Web Platform, but the [`oem_cp`](https://crates.io/crates/oem_cp) crate does.

## Preparing Text for the Encoders

Normalizing text into Unicode Normalization Form C prior to encoding text into
a legacy encoding minimizes unmappable characters. Text can be normalized to
Unicode Normalization Form C using the
[`unic-normal`](https://crates.io/crates/unic-normal) crate.
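
As a rough illustration of why the normalization form matters, the following std-only sketch encodes the same word in precomposed and decomposed form. `encode_latin1_strict`, `COMPOSED` and `DECOMPOSED` are hypothetical names made up for this example and are not part of the encoding_rs API; the toy encoder only stands in for a legacy single-byte encoder, and the two strings are written out by hand rather than produced by a normalization library. The precomposed text maps to bytes, while the combining accent in the decomposed text is unmappable.

```rust
/// Toy single-byte encoder (hypothetical, for illustration only; not an
/// encoding_rs API). Maps each code point below U+0100 to the byte of the
/// same value and returns the first unmappable character as the error.
fn encode_latin1_strict(text: &str) -> Result<Vec<u8>, char> {
    let mut bytes = Vec::with_capacity(text.len());
    for c in text.chars() {
        let code_point = c as u32;
        if code_point < 0x100 {
            bytes.push(code_point as u8);
        } else {
            return Err(c);
        }
    }
    Ok(bytes)
}

/// "café" with a precomposed U+00E9, i.e. already in Normalization Form C.
const COMPOSED: &str = "caf\u{00E9}";

/// The same text with "e" followed by U+0301 COMBINING ACUTE ACCENT,
/// i.e. in decomposed form.
const DECOMPOSED: &str = "cafe\u{0301}";
```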

The exception is windows-1258, which after normalizing to Unicode Normalization
Form C requires tone marks to be decomposed in order to minimize unmappable
characters. Vietnamese tone marks can be decomposed using the
[`detone`](https://crates.io/crates/detone) crate.

## Licensing

TL;DR: ((Apache-2.0 OR MIT) AND BSD-3-Clause) for the code and data combination,
but [crates.io doesn't support
parentheses](https://github.com/rust-lang/crates.io/issues/2595), so the crate
metadata points to a custom file.

Please see the file named
[COPYRIGHT](https://github.com/hsivonen/encoding_rs/blob/master/COPYRIGHT).

The non-test code that isn't generated from the WHATWG data in this crate is
under Apache-2.0 OR MIT. Test code is under CC0.

This crate contains code/data generated from WHATWG-supplied data. The WHATWG
upstream changed its license for portions of specs incorporated into source code
from CC0 to BSD-3-Clause between the initial release of this crate and the present
version of this crate. The in-source licensing legends have been updated for the
parts of the generated code that have changed since the upstream license change.

## Documentation

Generated [API documentation](https://docs.rs/encoding_rs/) is available
online.

There is a [long-form write-up](https://hsivonen.fi/encoding_rs/) about the
design and internals of the crate.

## C and C++ bindings

An FFI layer for encoding_rs is available as a
[separate crate](https://github.com/hsivonen/encoding_c). The crate comes
with a [demo C++ wrapper](https://github.com/hsivonen/encoding_c/blob/master/include/encoding_rs_cpp.h)
using the C++ standard library and [GSL](https://github.com/Microsoft/GSL/) types.

The bindings for the `mem` module are in the
[encoding_c_mem crate](https://github.com/hsivonen/encoding_c_mem).

For the Gecko context, there's a
[C++ wrapper using the MFBT/XPCOM types](https://searchfox.org/mozilla-central/source/intl/Encoding.h#100).

There's a [write-up](https://hsivonen.fi/modern-cpp-in-rust/) about the C++
wrappers.

## Sample programs

* [Rust](https://github.com/hsivonen/recode_rs)
* [C](https://github.com/hsivonen/recode_c)
* [C++](https://github.com/hsivonen/recode_cpp)

## Optional features

There are currently these optional cargo features:

### `simd-accel`

Enables SIMD acceleration using the nightly-dependent `packed_simd_2` crate.

This is an opt-in feature, because enabling this feature _opts out_ of Rust's
guarantees of future compilers compiling old code (aka. "stability story").

Currently, this has not been tested to be an improvement except for these
targets:

* x86_64
* i686
* aarch64
* thumbv7neon

If you use nightly Rust, you use targets whose first component is one of the
above, and you are prepared _to have to revise your configuration when updating
Rust_, you should enable this feature. Otherwise, please _do not_ enable this
feature.

_Note!_ If you are compiling for a target that does not have 128-bit SIMD
enabled as part of the target definition and you are enabling 128-bit SIMD
using `-C target_feature`, you need to enable the `core_arch` Cargo feature
for `packed_simd_2` to compile a crates.io snapshot of `core_arch` instead of
using the standard-library copy of `core::arch`, because the `core::arch`
module of the pre-compiled standard library has been compiled with the
assumption that the CPU doesn't have 128-bit SIMD. At present this applies
mainly to 32-bit ARM targets whose first component does not include the
substring `neon`.

The encoding_rs side of things has not been properly set up for POWER,
PowerPC, MIPS, etc., SIMD at this time, so even if you were to follow
the advice from the previous paragraph, you probably shouldn't use
the `simd-accel` option on the less mainstream architectures at this
time.

Used by Firefox.

### `serde`

Enables support for serializing and deserializing `&'static Encoding`-typed
struct fields using [Serde][1].

[1]: https://serde.rs/

Not used by Firefox.

### `fast-legacy-encode`

A catch-all option for enabling the fastest legacy encode options. _Does not
affect decode speed or UTF-8 encode speed._

At present, this option is equivalent to enabling the following options:
 * `fast-hangul-encode`
 * `fast-hanja-encode`
 * `fast-kanji-encode`
 * `fast-gb-hanzi-encode`
 * `fast-big5-hanzi-encode`

Adds 176 KB to the binary size.

Not used by Firefox.

### `fast-hangul-encode`

Changes encoding precomposed Hangul syllables into EUC-KR from binary
search over the decode-optimized tables to lookup by index making Korean
plain-text encode about 4 times as fast as without this option.

Adds 20 KB to the binary size.

Does _not_ affect decode speed.

Not used by Firefox.

### `fast-hanja-encode`

Changes encoding of Hanja into EUC-KR from linear search over the
decode-optimized table to lookup by index. Since Hanja is practically absent
in modern Korean text, this option doesn't affect performance in the common
case and mainly makes sense if you want to make your application resilient
against denial of service by someone intentionally feeding it a lot of Hanja
to encode into EUC-KR.

Adds 40 KB to the binary size.

Does _not_ affect decode speed.

Not used by Firefox.
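
The trade-off behind the `fast-*-encode` options can be sketched with a toy example. This is only an illustration under stated assumptions: `DECODE_TABLE`, `encode_linear` and `EncodeIndex` are hypothetical names, the four-entry table is made-up data, and none of this reflects the crate's actual table layout. The default-style path searches the decode-optimized table for the character, while the fast path spends extra memory on an encode-specific table that is indexed directly by code point.

```rust
/// Toy decode-optimized table (hypothetical data): the index is the pointer
/// in the legacy encoding and the value is the character it decodes to.
const DECODE_TABLE: [char; 4] = ['\u{4F3D}', '\u{5047}', '\u{4F73}', '\u{50F9}'];

/// Encode by linear search over the decode-optimized table.
/// Needs no extra data, but the cost grows with the table length.
fn encode_linear(c: char) -> Option<u16> {
    DECODE_TABLE.iter().position(|&d| d == c).map(|p| p as u16)
}

/// Encode-specific table: one slot per code point in the covered range,
/// holding the pointer for that code point if there is one.
struct EncodeIndex {
    first: u32,
    pointers: Vec<Option<u16>>,
}

impl EncodeIndex {
    /// Builds the dense reverse table from a decode-optimized table.
    fn new(decode_table: &[char]) -> EncodeIndex {
        let first = decode_table.iter().map(|&c| c as u32).min().unwrap_or(0);
        let last = decode_table.iter().map(|&c| c as u32).max().unwrap_or(0);
        let mut pointers = vec![None; (last - first + 1) as usize];
        for (pointer, &c) in decode_table.iter().enumerate() {
            pointers[(c as u32 - first) as usize] = Some(pointer as u16);
        }
        EncodeIndex { first, pointers }
    }

    /// Encode by direct lookup: constant time regardless of table length.
    fn encode(&self, c: char) -> Option<u16> {
        let offset = (c as u32).checked_sub(self.first)? as usize;
        self.pointers.get(offset).copied().flatten()
    }
}
```

In this toy, the reverse table has 445 slots to cover only four characters, which mirrors in miniature why the fast options add to the binary size.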

### `fast-kanji-encode`

Changes encoding of Kanji into Shift_JIS, EUC-JP and ISO-2022-JP from linear
search over the decode-optimized tables to lookup by index making Japanese
plain-text encode to legacy encodings 30 to 50 times as fast as without this
option (about 2 times as fast as with `less-slow-kanji-encode`).

Takes precedence over `less-slow-kanji-encode`.

Adds 36 KB to the binary size (24 KB compared to `less-slow-kanji-encode`).

Does _not_ affect decode speed.

Not used by Firefox.

### `less-slow-kanji-encode`

Makes JIS X 0208 Level 1 Kanji (the most common Kanji in Shift_JIS, EUC-JP and
ISO-2022-JP) encode less slow (binary search instead of linear search) making
Japanese plain-text encode to legacy encodings 14 to 23 times as fast as
without this option.

Adds 12 KB to the binary size.

Does _not_ affect decode speed.

Not used by Firefox.

### `fast-gb-hanzi-encode`

Changes encoding of Hanzi in the CJK Unified Ideographs block into GBK and
gb18030 from linear search over a part of the decode-optimized tables followed
by a binary search over another part of the decode-optimized tables to lookup
by index making Simplified Chinese plain-text encode to the legacy encodings
100 to 110 times as fast as without this option (about 2.5 times as fast as
with `less-slow-gb-hanzi-encode`).

Takes precedence over `less-slow-gb-hanzi-encode`.

Adds 36 KB to the binary size (24 KB compared to `less-slow-gb-hanzi-encode`).

Does _not_ affect decode speed.

Not used by Firefox.

### `less-slow-gb-hanzi-encode`

Makes GB2312 Level 1 Hanzi (the most common Hanzi in gb18030 and GBK) encode
less slow (binary search instead of linear search) making Simplified Chinese
plain-text encode to the legacy encodings about 40 times as fast as without
this option.

Adds 12 KB to the binary size.

Does _not_ affect decode speed.

Not used by Firefox.

### `fast-big5-hanzi-encode`

Changes encoding of Hanzi in the CJK Unified Ideographs block into Big5 from
linear search over a part of the decode-optimized tables to lookup by index
making Traditional Chinese plain-text encode to Big5 105 to 125 times as fast
as without this option (about 3 times as fast as with
`less-slow-big5-hanzi-encode`).

Takes precedence over `less-slow-big5-hanzi-encode`.

Adds 40 KB to the binary size (20 KB compared to `less-slow-big5-hanzi-encode`).

Does _not_ affect decode speed.

Not used by Firefox.

### `less-slow-big5-hanzi-encode`

Makes Big5 Level 1 Hanzi (the most common Hanzi in Big5) encode less slow
(binary search instead of linear search) making Traditional Chinese
plain-text encode to Big5 about 36 times as fast as without this option.

Adds 20 KB to the binary size.

Does _not_ affect decode speed.

Not used by Firefox.

## Performance goals

For decoding to UTF-16, the goal is to perform at least as well as Gecko's old
uconv. For decoding to UTF-8, the goal is to perform at least as well as
rust-encoding. These goals have been achieved.

Encoding to UTF-8 should be fast. (UTF-8 to UTF-8 encode should be equivalent
to `memcpy` and UTF-16 to UTF-8 should be fast.)

Speed is a non-goal when encoding to legacy encodings. By default, encoding to
legacy encodings should not be optimized for speed at the expense of code size
as long as form submission and URL parsing in Gecko don't become noticeably
too slow in real-world use.

In the interest of binary size, by default, encoding_rs does not have
encode-specific data tables beyond 32 bits of encode-specific data for each
single-byte encoding. Therefore, encoders search the decode-optimized data
tables. This is a linear search in most cases.
As a result, by default, encode
to legacy encodings varies from slow to extremely slow relative to other
libraries. Still, with realistic workloads, this seemed fast enough not to be
user-visibly slow on Raspberry Pi 3 (which stood in for a phone for testing)
in the Web-exposed encoder use cases.

See the cargo features above for optionally making CJK legacy encode fast.

A framework for measuring performance is [available separately][2].

[2]: https://github.com/hsivonen/encoding_bench/

## Rust Version Compatibility

It is a goal to support the latest stable Rust, the latest nightly Rust and
the version of Rust that's used for Firefox Nightly.

At this time, there is no firm commitment to support a version older than
what's required by Firefox, and there is no commitment to treat MSRV changes
as semver-breaking, because this crate depends on `cfg-if`, which doesn't
appear to treat MSRV changes as semver-breaking, so it would be useless for
this crate to treat MSRV changes as semver-breaking.

As of 2021-02-04, MSRV appears to be Rust 1.36.0 for using the crate and
1.42.0 for doc tests to pass without errors about the global allocator.

## Compatibility with rust-encoding

A compatibility layer that implements the rust-encoding API on top of
encoding_rs is
[provided as a separate crate](https://github.com/hsivonen/encoding_rs_compat)
(cannot be uploaded to crates.io). The compatibility layer was originally
written with the assumption that Firefox would need it, but it is not currently
used in Firefox.

## Regenerating Generated Code

To regenerate the generated code:

 * Have Python 2 installed.
 * Clone [`https://github.com/hsivonen/encoding_c`](https://github.com/hsivonen/encoding_c)
   next to the `encoding_rs` directory.
 * Clone [`https://github.com/hsivonen/codepage`](https://github.com/hsivonen/codepage)
   next to the `encoding_rs` directory.
 * Clone [`https://github.com/whatwg/encoding`](https://github.com/whatwg/encoding)
   next to the `encoding_rs` directory.
 * Check out revision `be3337450e7df1c49dca7872153c4c4670dd8256` of the `encoding` repo.
   (Note: `f381389` was the revision of `encoding` used from before the `encoding` repo
   license change. So far, only output changed since then has been updated to
   the new license legend.)
 * With the `encoding_rs` directory as the working directory, run
   `python generate-encoding-data.py`.

## Roadmap

- [x] Design the low-level API.
- [x] Provide Rust-only convenience features.
- [x] Provide an stl/gsl-flavored C++ API.
- [x] Implement all decoders and encoders.
- [x] Add unit tests for all decoders and encoders.
- [x] Finish BOM sniffing variants in Rust-only convenience features.
- [x] Document the API.
- [x] Publish the crate on crates.io.
- [x] Create a solution for measuring performance.
- [x] Accelerate ASCII conversions using SSE2 on x86.
- [x] Accelerate ASCII conversions using ALU register-sized operations on
      non-x86 architectures (process a `usize` instead of `u8` at a time).
- [x] Split FFI into a separate crate so that the FFI doesn't interfere with
      LTO in pure-Rust usage.
- [x] Compress CJK indices by making use of sequential code points as well
      as Unicode-ordered parts of indices.
- [x] Make lookups by label or name use binary search that searches from the
      end of the label/name to the start.
- [x] Make labels with non-ASCII bytes fail fast.
- [ ] ~Parallelize UTF-8 validation using [Rayon](https://github.com/nikomatsakis/rayon).~
      (This turned out to be a pessimization in the ASCII case due to memory bandwidth reasons.)
- [x] Provide an XPCOM/MFBT-flavored C++ API.
- [x] Investigate accelerating single-byte encode with a single fast-tracked
      range per encoding.
- [x] Replace uconv with encoding_rs in Gecko.
- [x] Implement the rust-encoding API in terms of encoding_rs.
- [x] Add SIMD acceleration for Aarch64.
- [x] Investigate the use of NEON on 32-bit ARM.
- [ ] ~Investigate Björn Höhrmann's lookup table acceleration for UTF-8 as
      adapted to Rust in rust-encoding.~
- [x] Add actually fast CJK encode options.
- [ ] ~Investigate [Bob Steagall's lookup table acceleration for UTF-8](https://github.com/BobSteagall/CppNow2018/blob/master/FastConversionFromUTF-8/Fast%20Conversion%20From%20UTF-8%20with%20C%2B%2B%2C%20DFAs%2C%20and%20SSE%20Intrinsics%20-%20Bob%20Steagall%20-%20C%2B%2BNow%202018.pdf).~
- [ ] Provide a build mode that works without `alloc` (with lesser API surface).
- [ ] Migrate to `std::simd` once it is stable and declare 1.0.

## Release Notes

### 0.8.30

* Update the licensing information to take into account the WHATWG data license change.

### 0.8.29

* Make the parts that use an allocator optional.

### 0.8.28

* Fix error in Serde support introduced as part of `no_std` support.

### 0.8.27

* Make the crate work in a `no_std` environment (with `alloc`).

### 0.8.26

* Fix oversights in edition 2018 migration that broke the `simd-accel` feature.

### 0.8.25

* Do pointer alignment checks in a way where intermediate steps aren't defined to be Undefined Behavior.
* Update the `packed_simd` dependency to `packed_simd_2`.
* Update the `cfg-if` dependency to 1.0.
* Address warnings that have been introduced by newer Rust versions along the way.
* Update to edition 2018, since even prior to 1.0 `cfg-if` updated to edition 2018 without a semver break.

### 0.8.24

* Avoid computing an intermediate (not dereferenced) pointer value in a manner designated as Undefined Behavior when computing pointer alignment.

### 0.8.23

* Remove year from copyright notices. (No features or bug fixes.)

### 0.8.22

* Formatting fix and new unit test. (No features or bug fixes.)

### 0.8.21

* Fixed a panic with invalid UTF-16[BE|LE] input at the end of the stream.

### 0.8.20

* Make `Decoder::latin1_byte_compatible_up_to` return `None` in more
  cases to make the method actually useful. While this could be argued
  to be a breaking change due to the bug fix changing semantics, it does
  not break callers that had to handle the `None` case in a reasonable
  way anyway.

### 0.8.19

* Removed a bunch of bound checks in `convert_str_to_utf16`.
* Added `mem::convert_utf8_to_utf16_without_replacement`.

### 0.8.18

* Added `mem::utf8_latin1_up_to` and `mem::str_latin1_up_to`.
* Added `Decoder::latin1_byte_compatible_up_to`.

### 0.8.17

* Update `bincode` (dev dependency) version requirement to 1.0.

### 0.8.16

* Switch from the `simd` crate to `packed_simd`.

### 0.8.15

* Adjust documentation for `simd-accel` (README-only release).

### 0.8.14

* Made UTF-16 to UTF-8 encode conversion fill the output buffer as
  closely as possible.

### 0.8.13

* Made the UTF-8 to UTF-16 decoder compare the number of code units written
  with the length of the right slice (the output slice) to fix a panic
  introduced in 0.8.11.

### 0.8.12

* Removed the `clippy::` prefix from clippy lint names.

### 0.8.11

* Changed minimum Rust requirement to 1.29.0 (for the ability to refer
  to the interior of a `static` when defining another `static`).
* Explicitly aligned the lookup tables for single-byte encodings and
  UTF-8 to cache lines in the hope of freeing up one cache line for
  other data. (Perhaps the tables were already aligned and this is
  placebo.)
* Added 32 bits of encode-oriented data for each single-byte encoding.
  The change was performance-neutral for non-Latin1-ish Latin legacy
  encodings, improved Latin1-ish and Arabic legacy encode speed
  somewhat (new speed is 2.4x the old speed for German, 2.3x for
  Arabic, 1.7x for Portuguese and 1.4x for French) and improved
  non-Latin1, non-Arabic legacy single-byte encode a lot (7.2x for
  Thai, 6x for Greek, 5x for Russian, 4x for Hebrew).
* Added compile-time options for fast CJK legacy encode options (at
  the cost of binary size (up to 176 KB) and run-time memory usage).
  These options still retain the overall code structure instead of
  rewriting the CJK encoders totally, so the speed isn't as good as
  what could be achieved by using even more memory / making the
  binary even larger.
* Made UTF-8 decode and validation faster.
* Added method `is_single_byte()` on `Encoding`.
* Added `mem::decode_latin1()` and `mem::encode_latin1_lossy()`.

### 0.8.10

* Disabled a unit test that tests a panic condition when the assertion
  being tested is disabled.

### 0.8.9

* Made `--features simd-accel` work with stable-channel compiler to
  simplify the Firefox build system.

### 0.8.8

* Made the `is_foo_bidi()` functions not treat U+FEFF (ZERO WIDTH NO-BREAK
  SPACE aka. BYTE ORDER MARK) as right-to-left.
* Made the `is_foo_bidi()` functions report `true` if the input contains
  Hebrew presentation forms (which are right-to-left but not in a
  right-to-left-roadmapped block).

### 0.8.7

* Fixed a panic in the UTF-16LE/UTF-16BE decoder when decoding to UTF-8.

### 0.8.6

* Temporarily removed the debug assertion added in version 0.8.5 from
  `convert_utf16_to_latin1_lossy`.

### 0.8.5

* If debug assertions are enabled but fuzzing isn't enabled, lossy conversions
  to Latin1 in the `mem` module assert that the input is in the range
  U+0000...U+00FF (inclusive).
* In the `mem` module provide conversions from Latin1 and UTF-16 to UTF-8
  that can deal with insufficient output space. The idea is to use them
  first with an allocation rounded up to jemalloc bucket size and do the
  worst-case allocation only if the jemalloc rounding up was insufficient
  as the first guess.

### 0.8.4

* Fix SSE2-specific, `simd-accel`-specific memory corruption introduced in
  version 0.8.1 in conversions between UTF-16 and Latin1 in the `mem` module.

### 0.8.3

* Removed an `#[inline(never)]` annotation that was not meant for release.

### 0.8.2

* Made non-ASCII UTF-16 to UTF-8 encode faster by manually omitting bound
  checks and manually adding branch prediction annotations.

### 0.8.1

* Tweaked loop unrolling and memory alignment for SSE2 conversions between
  UTF-16 and Latin1 in the `mem` module to increase the performance when
  converting long buffers.

### 0.8.0

* Changed the minimum supported version of Rust to 1.21.0 (semver breaking
  change).
* Flipped around the defaults vs. optional features for controlling the size
  vs. speed trade-off for Kanji and Hanzi legacy encode (semver breaking
  change).
* Added NEON support on ARMv7.
* SIMD-accelerated x-user-defined to UTF-16 decode.
* Made UTF-16LE and UTF-16BE decode a lot faster (including SIMD
  acceleration).

### 0.7.2

* Add the `mem` module.
* Refactor SIMD code which can affect performance outside the `mem`
  module.

### 0.7.1

* When encoding from invalid UTF-16, correctly handle U+DC00 followed by
  another low surrogate.

### 0.7.0

* [Make `replacement` a label of the replacement
  encoding.](https://github.com/whatwg/encoding/issues/70) (Spec change.)
* Remove `Encoding::for_name()`. (`Encoding::for_label(foo).unwrap()` is
  now close enough after the above label change.)
* Remove the `parallel-utf8` cargo feature.
* Add optional Serde support for `&'static Encoding`.
* Performance tweaks for ASCII handling.
* Performance tweaks for UTF-8 validation.
* SIMD support on aarch64.

### 0.6.11

* Make `Encoder::has_pending_state()` public.
* Update the `simd` crate dependency to 0.2.0.

### 0.6.10

* Reserve enough space for NCRs when encoding to ISO-2022-JP.
* Correct max length calculations for multibyte decoders.
* Correct max length calculations before BOM sniffing has been
  performed.
* Correctly calculate max length when encoding from UTF-16 to GBK.

### 0.6.9

* [Don't prepend anything when gb18030 range decode
  fails](https://github.com/whatwg/encoding/issues/110). (Spec change.)

### 0.6.8

* Correctly handle the case where the first buffer contains potentially
  partial BOM and the next buffer is the last buffer.
* Decode byte `7F` correctly in ISO-2022-JP.
* Make UTF-16 to UTF-8 encode write closer to the end of the buffer.
* Implement `Hash` for `Encoding`.

### 0.6.7

* [Map half-width katakana to full-width katakana in ISO-2022-JP
  encoder](https://github.com/whatwg/encoding/issues/105). (Spec change.)
* Give `InputEmpty` correct precedence over `OutputFull` when encoding
  with replacement and the output buffer passed in is too short or the
  remaining space in the output buffer is too small after a replacement.

### 0.6.6

* Correct max length calculation when a partial BOM prefix is part of
  the decoder's state.

### 0.6.5

* Correct max length calculation in various encoders.
* Correct max length calculation in the UTF-16 decoder.
* Derive `PartialEq` and `Eq` for the `CoderResult`, `DecoderResult`
  and `EncoderResult` types.

### 0.6.4

* Avoid panic when encoding with replacement and the destination buffer is
  too short to hold one numeric character reference.

### 0.6.3

* Add support for 32-bit big-endian hosts. (For real this time.)

### 0.6.2

* Fix a panic from subslicing with bad indices in
  `Encoder::encode_from_utf16`. (Due to an oversight, it lacked the fix that
  `Encoder::encode_from_utf8` already had.)
* Micro-optimize error status accumulation in non-streaming case.

### 0.6.1

* Avoid panic near integer overflow in a case that's unlikely to actually
  happen.
* Address Clippy lints.

### 0.6.0

* Make the methods for computing worst-case buffer size requirements check
  for integer overflow.
* Upgrade rayon to 0.7.0.

### 0.5.1

* Reorder methods for better documentation readability.
* Add support for big-endian hosts. (Only 64-bit case actually tested.)
* Optimize the ALU (non-SIMD) case for 32-bit ARM instead of x86_64.

### 0.5.0

* Avoid allocating excessively long buffers in non-streaming decode.
* Fix the behavior of ISO-2022-JP and replacement decoders near the end of the
  output buffer.
* Annotate the result structs with `#[must_use]`.

### 0.4.0

* Split FFI into a separate crate.
* Performance tweaks.
* CJK binary size and encoding performance changes.
* Parallelize UTF-8 validation in the case of long buffers (with optional
  feature `parallel-utf8`).
* Borrow even with ISO-2022-JP when possible.

### 0.3.2

* Fix moving pointers to alignment in ALU-based ASCII acceleration.
* Fix errors in documentation and improve documentation.

### 0.3.1

* Fix UTF-8 to UTF-16 decode for byte sequences beginning with 0xEE.
* Make UTF-8 to UTF-8 decode SSE2-accelerated when feature `simd-accel` is used.
* When decoding and encoding ASCII-only input from or to an ASCII-compatible
  encoding using the non-streaming API, return a borrow of the input.
* Make encode from UTF-16 to UTF-8 faster.

### 0.3

* Change the references to the instances of `Encoding` from `const` to `static`
  to make the referents unique across crates that use the references.
* Introduce non-reference-typed `FOO_INIT` instances of `Encoding` to allow
  foreign crates to initialize `static` arrays with references to `Encoding`
  instances even under Rust's constraints that prohibit the initialization of
  `&'static Encoding`-typed array items with `&'static Encoding`-typed
  `statics`.
* Document that the above two points will be reverted if Rust changes `const`
  to work so that cross-crate usage keeps the referents unique.
* Return `Cow`s from Rust-only non-streaming methods for encode and decode.
* `Encoding::for_bom()` returns the length of the BOM.
* ASCII-accelerated conversions for encodings other than UTF-16LE, UTF-16BE,
  ISO-2022-JP and x-user-defined.
* Add SSE2 acceleration behind the `simd-accel` feature flag. (Requires
  nightly Rust.)
* Fix panic with long bogus labels.
* Map [0xCA to U+05BA in windows-1255](https://github.com/whatwg/encoding/issues/73).
  (Spec change.)
* Correct the [end of the Shift_JIS EUDC range](https://github.com/whatwg/encoding/issues/53).
  (Spec change.)

### 0.2.4

* Polish FFI documentation.

### 0.2.3

* Fix UTF-16 to UTF-8 encode.

### 0.2.2

* Add `Encoder.encode_from_utf8_to_vec_without_replacement()`.

### 0.2.1

* Add `Encoding.is_ascii_compatible()`.

* Add `Encoding::for_bom()`.

* Make `==` for `Encoding` use name comparison instead of pointer comparison,
  because uses of the encoding constants in different crates result in
  different addresses and the constants cannot be turned into statics without
  breaking other things.

### 0.2.0

The initial release.