1# encoding_rs
2
3[![Build Status](https://travis-ci.org/hsivonen/encoding_rs.svg?branch=master)](https://travis-ci.org/hsivonen/encoding_rs)
4[![crates.io](https://meritbadge.herokuapp.com/encoding_rs)](https://crates.io/crates/encoding_rs)
5[![docs.rs](https://docs.rs/encoding_rs/badge.svg)](https://docs.rs/encoding_rs/)
6[![Apache 2 / MIT dual-licensed](https://img.shields.io/badge/license-Apache%202%20%2F%20MIT-blue.svg)](https://github.com/hsivonen/encoding_rs/blob/master/COPYRIGHT)
7
8encoding_rs is an implementation of (the non-JavaScript parts of) the
9[Encoding Standard](https://encoding.spec.whatwg.org/) written in Rust and
10used in Gecko (starting with Firefox 56).
11
12Additionally, the `mem` module provides various operations for dealing with
13in-RAM text (as opposed to data that's coming from or going to an IO boundary).
14The `mem` module is a module instead of a separate crate because it shares
15internal implementation details with the rest of the crate for efficiency.
16
17## Functionality
18
19Due to the Gecko use case, encoding_rs supports decoding to and encoding from
20UTF-16 in addition to supporting the usual Rust use case of decoding to and
21encoding from UTF-8. Additionally, the API has been designed to be FFI-friendly
22to accommodate the C++ side of Gecko.
23
24Specifically, encoding_rs does the following:
25
26* Decodes a stream of bytes in an Encoding Standard-defined character encoding
27  into valid aligned native-endian in-RAM UTF-16 (units of `u16` / `char16_t`).
28* Encodes a stream of potentially-invalid aligned native-endian in-RAM UTF-16
29  (units of `u16` / `char16_t`) into a sequence of bytes in an Encoding
30  Standard-defined character encoding as if the lone surrogates had been
31  replaced with the REPLACEMENT CHARACTER before performing the encode.
32  (Gecko's UTF-16 is potentially invalid.)
33* Decodes a stream of bytes in an Encoding Standard-defined character
34  encoding into valid UTF-8.
35* Encodes a stream of valid UTF-8 into a sequence of bytes in an Encoding
36  Standard-defined character encoding. (Rust's UTF-8 is guaranteed-valid.)
37* Does the above in streaming (input and output split across multiple
38  buffers) and non-streaming (whole input in a single buffer and whole
39  output in a single buffer) variants.
40* Avoids copying (borrows) when possible in the non-streaming cases when
41  decoding to or encoding from UTF-8.
42* Resolves textual labels that identify character encodings in
43  protocol text into type-safe objects representing those encodings
44  conceptually.
45* Maps the type-safe encoding objects onto strings suitable for
46  returning from `document.characterSet`.
47* Validates UTF-8 (in common instruction set scenarios a bit faster for Web
48  workloads than the standard library; hopefully will get upstreamed some
49  day) and ASCII.
50
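The lone-surrogate handling described above can be illustrated with the standard library alone. This is a minimal sketch, not encoding_rs code: the function name `utf16_to_string_lossy` is hypothetical, and it only shows the "replace lone surrogates with the REPLACEMENT CHARACTER first" step, after which any encoder sees valid text.

```rust
/// Converts potentially-invalid UTF-16 into a `String`, replacing every
/// lone surrogate with U+FFFD. (Hypothetical helper, standard library only.)
fn utf16_to_string_lossy(units: &[u16]) -> String {
    char::decode_utf16(units.iter().copied())
        // `decode_utf16` yields `Err` for each unpaired surrogate.
        .map(|result| result.unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect()
}
```

For instance, the input `[0x0061, 0xD800, 0x0062]` becomes `"a\u{FFFD}b"`, while a well-formed surrogate pair is kept as a single scalar value.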
51Additionally, `encoding_rs::mem` does the following:
52
53* Checks if a byte buffer contains only ASCII.
54* Checks if a potentially-invalid UTF-16 buffer contains only Basic Latin (ASCII).
55* Checks if a valid UTF-8, potentially-invalid UTF-8 or potentially-invalid UTF-16
56  buffer contains only Latin1 code points (below U+0100).
57* Checks if a valid UTF-8, potentially-invalid UTF-8 or potentially-invalid UTF-16
58  buffer or a code point or a UTF-16 code unit can trigger right-to-left behavior
59  (suitable for checking if the Unicode Bidirectional Algorithm can be optimized
60  out).
61* Combined versions of the above two checks.
62* Converts valid UTF-8, potentially-invalid UTF-8 and Latin1 to UTF-16.
63* Converts potentially-invalid UTF-16 and Latin1 to UTF-8.
64* Converts UTF-8 and UTF-16 to Latin1 (if in range).
65* Finds the first invalid code unit in a buffer of potentially-invalid UTF-16.
66* Makes a mutable buffer of potentially-invalid UTF-16 contain valid UTF-16.
67* Copies ASCII from one buffer to another up to the first non-ASCII byte.
68* Converts ASCII to UTF-16 up to the first non-ASCII byte.
69* Converts UTF-16 to ASCII up to the first non-Basic Latin code unit.
70
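To make a few of the operations listed above concrete, here is a plain, unoptimized sketch using only the standard library. The function names are hypothetical and are not the `mem` module's API; the real module is expected to be considerably faster.

```rust
/// Returns true if every UTF-16 code unit is below U+0100 (the Latin1 range).
fn utf16_is_latin1(buffer: &[u16]) -> bool {
    buffer.iter().all(|&unit| unit < 0x100)
}

/// Converts Latin1 to UTF-8: each byte maps to the code point of the same value.
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| char::from(b)).collect()
}

/// Copies ASCII from `src` to `dst` up to the first non-ASCII byte (or until
/// `dst` is full) and returns the number of bytes copied.
fn copy_ascii_prefix(src: &[u8], dst: &mut [u8]) -> usize {
    let len = src
        .iter()
        .zip(dst.iter())
        .take_while(|(byte, _)| byte.is_ascii())
        .count();
    dst[..len].copy_from_slice(&src[..len]);
    len
}
```

Returning the number of bytes copied lets a caller resume with a slower, general-purpose path at the first non-ASCII byte.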
71## Licensing
72
73Please see the file named
74[COPYRIGHT](https://github.com/hsivonen/encoding_rs/blob/master/COPYRIGHT).
75
76## API Documentation
77
78Generated [API documentation](https://docs.rs/encoding_rs/) is available
79online.
80
81## C and C++ bindings
82
83An FFI layer for encoding_rs is available as a
84[separate crate](https://github.com/hsivonen/encoding_c). The crate comes
85with a [demo C++ wrapper](https://github.com/hsivonen/encoding_c/blob/master/include/encoding_rs_cpp.h)
86using the C++ standard library and [GSL](https://github.com/Microsoft/GSL/) types.
87
88For the Gecko context, there's a
89[C++ wrapper using the MFBT/XPCOM types](https://searchfox.org/mozilla-central/source/intl/Encoding.h#100).
90
91These bindings do not cover the `mem` module.
92
93## Sample programs
94
95* [Rust](https://github.com/hsivonen/recode_rs)
96* [C](https://github.com/hsivonen/recode_c)
97* [C++](https://github.com/hsivonen/recode_cpp)
98
99## Optional features
100
101There are currently three optional cargo features:
102
103### `simd-accel`
104
105Enables SSE2 acceleration on x86 and x86_64 and NEON acceleration on Aarch64.
106Requires nightly Rust. _Enabling this cargo feature is recommended when
107building for x86, x86_64 or Aarch64 on nightly Rust._ The intention is for the
108functionality enabled by this feature to become the normal on-by-default
109behavior once explicit SIMD becomes available on all Rust release channels.
110
111Enabling this feature breaks the build unless the target is x86 with SSE2
112(Rust's default 32-bit x86 target, `i686`, has SSE2, but Linux distros may
113use an x86 target without SSE2, i.e. `i586` in `rustup` terms), x86_64 or
114Aarch64.
115
116### `serde`
117
118Enables support for serializing and deserializing `&'static Encoding`-typed
119struct fields using [Serde][1].
120
121[1]: https://serde.rs/
122
123### `no-static-ideograph-encoder-tables`
124
125Makes the binary size smaller at the expense of ideograph _encode_ speed for
126Chinese and Japanese legacy encodings. (Does _not_ affect decode speed.)
127
128The speed resulting from enabling this feature is believed to be acceptable
129for Web browser-exposed encoder use cases. However, the result is likely
130unacceptable for other applications that need to produce output in Chinese or
131Japanese legacy encodings. (But applications really should always be using
132UTF-8 for output.)
133
134## Performance goals
135
136For decoding to UTF-16, the goal is to perform at least as well as Gecko's old
137uconv. For decoding to UTF-8, the goal is to perform at least as well as
138rust-encoding.
139
140Encoding to UTF-8 should be fast. (UTF-8 to UTF-8 encode should be equivalent
141to `memcpy` and UTF-16 to UTF-8 should be fast.)
142
143Speed is a non-goal when encoding to legacy encodings. Encoding to legacy
144encodings should not be optimized for speed at the expense of code size as long
145as form submission and URL parsing in Gecko don't become noticeably slow
146in real-world use.
147
148Currently, by default, encoding_rs builds with limited encoder-specific
149acceleration tables for GB2312 Level 1 Hanzi, Big5 Level 1 Hanzi and JIS X
1500208 Level 1 Kanji. These tables use binary search and strike a balance
151between not having encoder-specific tables at all (doing linear search
152over the decode-optimized tables) and having larger directly-indexable
153encoder-side tables. It is not clear that anyone wants this in-between
154approach, and it may be changed in the future.
155
156In the interest of binary size, Firefox builds with the
157`no-static-ideograph-encoder-tables` cargo feature, which omits
158the encoder-specific tables and performs linear search over the
159decode-optimized tables. With realistic workloads, this seemed fast enough
160not to be user-visibly slow on Raspberry Pi 3 (which stood in for a phone
161for testing) in the Web-exposed encoder use cases.
162
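The trade-off between the two lookup strategies described above can be sketched as follows. The tables here are toy data made up for illustration, and the function names are hypothetical; the crate's actual table layout is not shown in this README.

```rust
/// Toy decode-optimized table: position (pointer) -> code point.
const DECODE_TABLE: [u16; 6] = [0x4E9C, 0x5516, 0x5A03, 0x963F, 0x54C0, 0x611B];

/// Encoding by linear search over the decode table: no extra data is
/// needed, but each lookup is O(n).
fn encode_linear(code_point: u16) -> Option<usize> {
    DECODE_TABLE.iter().position(|&c| c == code_point)
}

/// Toy encoder-side table: (code point, pointer) pairs sorted by code point.
const ENCODE_TABLE: [(u16, u16); 6] = [
    (0x4E9C, 0),
    (0x54C0, 4),
    (0x5516, 1),
    (0x5A03, 2),
    (0x611B, 5),
    (0x963F, 3),
];

/// Encoding by binary search: costs an extra table, but each lookup is
/// O(log n).
fn encode_binary(code_point: u16) -> Option<usize> {
    ENCODE_TABLE
        .binary_search_by_key(&code_point, |&(c, _)| c)
        .ok()
        .map(|i| usize::from(ENCODE_TABLE[i].1))
}
```

Both functions return the same pointer for any code point in the table; they differ only in binary size and lookup cost.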
163A framework for measuring performance is [available separately][2].
164
165[2]: https://github.com/hsivonen/encoding_bench/
166
167## Rust Version Compatibility
168
169It is a goal to support the latest stable Rust, the latest nightly Rust and
170the version of Rust that's used for Firefox Nightly (currently 1.19.0).
171These are tested on Travis.
172
173Additionally, beta and the oldest known to work Rust version (currently
1741.15.0) are tested on Travis. The oldest Rust known to work is tested as
175a canary so that when the oldest known to work no longer works, the change
176can be documented here. At this time, there is no firm commitment to support
177a version older than what's required by Firefox, but there isn't an active
178plan to make changes that would make 1.15.0 no longer work, either.
179
180## Compatibility with rust-encoding
181
182A compatibility layer that implements the rust-encoding API on top of
183encoding_rs is
184[provided as a separate crate](https://github.com/hsivonen/encoding_rs_compat)
185(cannot be uploaded to crates.io). The compatibility layer was originally
186written with the assumption that Firefox would need it, but it is not currently
187used in Firefox.
188
189## Roadmap
190
191- [x] Design the low-level API.
192- [x] Provide Rust-only convenience features.
193- [x] Provide an stl/gsl-flavored C++ API.
194- [x] Implement all decoders and encoders.
195- [x] Add unit tests for all decoders and encoders.
196- [x] Finish BOM sniffing variants in Rust-only convenience features.
197- [x] Document the API.
198- [x] Publish the crate on crates.io.
199- [x] Create a solution for measuring performance.
200- [x] Accelerate ASCII conversions using SSE2 on x86.
201- [x] Accelerate ASCII conversions using ALU register-sized operations on
202      non-x86 architectures (process a `usize` instead of `u8` at a time).
203- [x] Split FFI into a separate crate so that the FFI doesn't interfere with
204      LTO in pure-Rust usage.
205- [x] Compress CJK indices by making use of sequential code points as well
206      as Unicode-ordered parts of indices.
207- [x] Make lookups by label or name use binary search that searches from the
208      end of the label/name to the start.
209- [x] Make labels with non-ASCII bytes fail fast.
210- [ ] Parallelize UTF-8 validation using [Rayon](https://github.com/nikomatsakis/rayon).
211- [x] Provide an XPCOM/MFBT-flavored C++ API.
212- [ ] Investigate accelerating single-byte encode with a single fast-tracked
213      range per encoding.
214- [x] Replace uconv with encoding_rs in Gecko.
215- [x] Implement the rust-encoding API in terms of encoding_rs.
216- [x] Add SIMD acceleration for Aarch64.
217- [ ] Investigate the use of NEON on 32-bit ARM.
218- [ ] Investigate Björn Höhrmann's lookup table acceleration for UTF-8 as
219      adapted to Rust in rust-encoding.
220
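As an illustration of two roadmap items above (binary search that compares labels from the end, and failing fast on non-ASCII labels), here is a minimal sketch. The table is a hypothetical five-entry miniature and `lookup_label` is not the crate's API. Comparing from the end is presumably useful because many real labels share long prefixes.

```rust
/// Miniature label table, sorted by the labels' bytes read from end to start
/// ("1nital" < "8-ftu" < "iicsa" < "kbg" < "sijs").
const LABELS: [(&str, &str); 5] = [
    ("latin1", "windows-1252"),
    ("utf-8", "UTF-8"),
    ("ascii", "windows-1252"),
    ("gbk", "GBK"),
    ("sjis", "Shift_JIS"),
];

/// Resolves a label to an encoding name, or `None` if it is unknown.
fn lookup_label(label: &[u8]) -> Option<&'static str> {
    // No label contains non-ASCII bytes, so such input can fail fast.
    if !label.is_ascii() {
        return None;
    }
    let text = std::str::from_utf8(label).ok()?;
    // Labels match ASCII case-insensitively after trimming ASCII whitespace.
    let normalized = text
        .trim_matches(|c: char| c.is_ascii_whitespace())
        .to_ascii_lowercase();
    LABELS
        .binary_search_by(|(candidate, _)| {
            // Compare back to front, matching the table's sort order.
            candidate.bytes().rev().cmp(normalized.bytes().rev())
        })
        .ok()
        .map(|i| LABELS[i].1)
}
```

With this table, `lookup_label(b" UTF-8\n")` resolves to `"UTF-8"`, and a label containing a non-ASCII byte is rejected before any table access.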
221## Release Notes
222
223### 0.7.2
224
225* Add the `mem` module.
226* Refactor SIMD code which can affect performance outside the `mem`
227  module.
228
229### 0.7.1
230
231* When encoding from invalid UTF-16, correctly handle U+DC00 followed by
232  another low surrogate.
233
234### 0.7.0
235
236* [Make `replacement` a label of the replacement
237  encoding.](https://github.com/whatwg/encoding/issues/70) (Spec change.)
238* Remove `Encoding::for_name()`. (`Encoding::for_label(foo).unwrap()` is
239  now close enough after the above label change.)
240* Remove the `parallel-utf8` cargo feature.
241* Add optional Serde support for `&'static Encoding`.
242* Performance tweaks for ASCII handling.
243* Performance tweaks for UTF-8 validation.
244* SIMD support on aarch64.
245
246### 0.6.11
247
248* Make `Encoder::has_pending_state()` public.
249* Update the `simd` crate dependency to 0.2.0.
250
251### 0.6.10
252
253* Reserve enough space for NCRs when encoding to ISO-2022-JP.
254* Correct max length calculations for multibyte decoders.
255* Correct max length calculations before BOM sniffing has been
256  performed.
257* Correctly calculate max length when encoding from UTF-16 to GBK.
258
259### 0.6.9
260
261* [Don't prepend anything when gb18030 range decode
262  fails](https://github.com/whatwg/encoding/issues/110). (Spec change.)
263
264### 0.6.8
265
266* Correctly handle the case where the first buffer contains potentially
267  partial BOM and the next buffer is the last buffer.
268* Decode byte `7F` correctly in ISO-2022-JP.
269* Make UTF-16 to UTF-8 encode write closer to the end of the buffer.
270* Implement `Hash` for `Encoding`.
271
272### 0.6.7
273
274* [Map half-width katakana to full-width katakana in ISO-2022-JP
275  encoder](https://github.com/whatwg/encoding/issues/105). (Spec change.)
276* Give `InputEmpty` correct precedence over `OutputFull` when encoding
277  with replacement and the output buffer passed in is too short or the
278  remaining space in the output buffer is too small after a replacement.
279
280### 0.6.6
281
282* Correct max length calculation when a partial BOM prefix is part of
283  the decoder's state.
284
285### 0.6.5
286
287* Correct max length calculation in various encoders.
288* Correct max length calculation in the UTF-16 decoder.
289* Derive `PartialEq` and `Eq` for the `CoderResult`, `DecoderResult`
290  and `EncoderResult` types.
291
292### 0.6.4
293
294* Avoid panic when encoding with replacement and the destination buffer is
295  too short to hold one numeric character reference.
296
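The failure mode fixed in 0.6.4 can be pictured with a small sketch of encoding with replacement, where an unmappable character is written as a decimal numeric character reference such as `&#8364;`. This is an assumption-laden illustration, not the crate's code: the function name is hypothetical, Latin1 stands in for a legacy encoding, and `None` stands in for an "output full" result.

```rust
/// Encodes `text` as Latin1 into `dst`, writing characters above U+00FF as
/// decimal numeric character references. Returns the number of bytes
/// written, or `None` if `dst` is too short (instead of panicking).
fn encode_latin1_with_ncr(text: &str, dst: &mut [u8]) -> Option<usize> {
    let mut written = 0usize;
    for c in text.chars() {
        let code = u32::from(c);
        if code < 0x100 {
            // `get_mut` returns `None` when the buffer is exhausted.
            *dst.get_mut(written)? = code as u8;
            written += 1;
        } else {
            let ncr = format!("&#{};", code);
            let end = written.checked_add(ncr.len())?;
            // Range `get_mut` fails cleanly if the whole NCR does not fit.
            dst.get_mut(written..end)?.copy_from_slice(ncr.as_bytes());
            written = end;
        }
    }
    Some(written)
}
```

The bounds-checked `get_mut` calls are the point of the sketch: a destination that cannot hold a complete reference produces an ordinary result rather than a slice-index panic.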
297### 0.6.3
298
299* Add support for 32-bit big-endian hosts. (For real this time.)
300
301### 0.6.2
302
303* Fix a panic from subslicing with bad indices in
304  `Encoder::encode_from_utf16`. (Due to an oversight, it lacked the fix that
305  `Encoder::encode_from_utf8` already had.)
306* Micro-optimize error status accumulation in non-streaming case.
307
308### 0.6.1
309
310* Avoid panic near integer overflow in a case that's unlikely to actually
311  happen.
312* Address Clippy lints.
313
314### 0.6.0
315
316* Make the methods for computing worst-case buffer size requirements check
317  for integer overflow.
318* Upgrade rayon to 0.7.0.
319
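A worst-case buffer size computation that checks for integer overflow might look like the sketch below. The function names are hypothetical and the bound is the generic one for UTF-16 to UTF-8 (at most three bytes per code unit), not necessarily the formula the crate uses.

```rust
/// Worst-case UTF-8 length for `utf16_len` UTF-16 code units, or `None` if
/// the computation would overflow `usize`.
fn max_utf8_len_from_utf16(utf16_len: usize) -> Option<usize> {
    // A BMP code unit (or a replaced lone surrogate) needs at most three
    // bytes; a surrogate pair is two units for four bytes.
    utf16_len.checked_mul(3)
}

/// Encodes potentially-invalid UTF-16 to UTF-8, sizing the output up front.
fn encode_utf16_to_utf8(src: &[u16]) -> Option<Vec<u8>> {
    let capacity = max_utf8_len_from_utf16(src.len())?;
    let mut out = Vec::with_capacity(capacity);
    for result in char::decode_utf16(src.iter().copied()) {
        let c = result.unwrap_or(char::REPLACEMENT_CHARACTER);
        let mut buf = [0u8; 4];
        out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
    }
    Some(out)
}
```

Returning `Option<usize>` pushes the overflow decision to the caller, which can then refuse the input instead of allocating a wrapped-around, too-small buffer.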
320### 0.5.1
321
322* Reorder methods for better documentation readability.
323* Add support for big-endian hosts. (Only 64-bit case actually tested.)
324* Optimize the ALU (non-SIMD) case for 32-bit ARM instead of x86_64.
325
326### 0.5.0
327
328* Avoid allocating excessively long buffers in non-streaming decode.
329* Fix the behavior of ISO-2022-JP and replacement decoders near the end of the
330  output buffer.
331* Annotate the result structs with `#[must_use]`.
332
333### 0.4.0
334
335* Split FFI into a separate crate.
336* Performance tweaks.
337* CJK binary size and encoding performance changes.
338* Parallelize UTF-8 validation in the case of long buffers (with optional
339  feature `parallel-utf8`).
340* Borrow even with ISO-2022-JP when possible.
341
342### 0.3.2
343
344* Fix moving pointers to alignment in ALU-based ASCII acceleration.
345* Fix errors in documentation and improve documentation.
346
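For readers unfamiliar with ALU-based ASCII acceleration, the idea is to test a whole `usize` worth of bytes at once instead of one `u8` at a time. The sketch below is a safe approximation with a hypothetical function name; it copies each chunk into a word rather than moving pointers to alignment, so it does not reproduce the crate's implementation.

```rust
const WORD: usize = std::mem::size_of::<usize>();
/// A mask with the high bit of every byte set (0x8080...80).
const HIGH_BITS: usize = usize::MAX / 0xFF * 0x80;

/// Returns true if `bytes` contains only ASCII, checking one machine word
/// per iteration and falling back to bytewise checks for the tail.
fn is_ascii_wordwise(bytes: &[u8]) -> bool {
    let mut chunks = bytes.chunks_exact(WORD);
    for chunk in &mut chunks {
        let mut word_bytes = [0u8; WORD];
        word_bytes.copy_from_slice(chunk);
        // Any set high bit means at least one byte is >= 0x80.
        if (usize::from_ne_bytes(word_bytes) & HIGH_BITS) != 0 {
            return false;
        }
    }
    chunks.remainder().iter().all(|&b| b < 0x80)
}
```

Byte order does not matter for this test, since the mask treats every byte position the same way.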
347### 0.3.1
348
349* Fix UTF-8 to UTF-16 decode for byte sequences beginning with 0xEE.
350* Make UTF-8 to UTF-8 decode SSE2-accelerated when feature `simd-accel` is used.
351* When decoding and encoding ASCII-only input from or to an ASCII-compatible
352  encoding using the non-streaming API, return a borrow of the input.
353* Make encode from UTF-16 to UTF-8 faster.
354
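Returning a borrow for ASCII-only input can be sketched with `Cow`. This example uses Latin1 (each byte maps to the code point of the same value) as a stand-in for an ASCII-compatible encoding; `decode_latin1` is a hypothetical name, not the crate's non-streaming API.

```rust
use std::borrow::Cow;

/// Decodes Latin1 to UTF-8, borrowing the input when it is pure ASCII.
fn decode_latin1<'a>(bytes: &'a [u8]) -> Cow<'a, str> {
    if bytes.is_ascii() {
        // ASCII is byte-for-byte valid UTF-8, so no copy is needed.
        Cow::Borrowed(std::str::from_utf8(bytes).expect("ASCII is valid UTF-8"))
    } else {
        // Otherwise each byte maps to the code point of the same value.
        Cow::Owned(bytes.iter().map(|&b| char::from(b)).collect())
    }
}
```

Callers that only read the result can treat both variants as `&str`, so the allocation is paid only when the input actually contains non-ASCII bytes.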
355### 0.3
356
357* Change the references to the instances of `Encoding` from `const` to `static`
358  to make the referents unique across crates that use the references.
359* Introduce non-reference-typed `FOO_INIT` instances of `Encoding` to allow
360  foreign crates to initialize `static` arrays with references to `Encoding`
361  instances even under Rust's constraints that prohibit the initialization of
362  `&'static Encoding`-typed array items with `&'static Encoding`-typed
363  `statics`.
364* Document that the above two points will be reverted if Rust changes `const`
365  to work so that cross-crate usage keeps the referents unique.
366* Return `Cow`s from Rust-only non-streaming methods for encode and decode.
367* `Encoding::for_bom()` returns the length of the BOM.
368* ASCII-accelerated conversions for encodings other than UTF-16LE, UTF-16BE,
369  ISO-2022-JP and x-user-defined.
370* Add SSE2 acceleration behind the `simd-accel` feature flag. (Requires
371  nightly Rust.)
372* Fix panic with long bogus labels.
373* Map [0xCA to U+05BA in windows-1255](https://github.com/whatwg/encoding/issues/73).
374  (Spec change.)
375* Correct the [end of the Shift_JIS EUDC range](https://github.com/whatwg/encoding/issues/53).
376  (Spec change.)
377
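Returning the BOM length alongside the encoding, as `Encoding::for_bom()` does since 0.3, lets the caller skip the BOM without re-deriving its size. A minimal standalone sketch (hypothetical function `sniff_bom`, returning an encoding name rather than an `Encoding` reference) could look like this:

```rust
/// Returns the encoding signalled by a BOM at the start of `buffer`,
/// together with the BOM's length in bytes, or `None` if there is no BOM.
fn sniff_bom(buffer: &[u8]) -> Option<(&'static str, usize)> {
    if buffer.starts_with(b"\xEF\xBB\xBF") {
        Some(("UTF-8", 3))
    } else if buffer.starts_with(b"\xFF\xFE") {
        Some(("UTF-16LE", 2))
    } else if buffer.starts_with(b"\xFE\xFF") {
        Some(("UTF-16BE", 2))
    } else {
        None
    }
}
```

A caller would typically continue with `&buffer[len..]`, where `len` is the second element of the returned pair.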
378### 0.2.4
379
380* Polish FFI documentation.
381
382### 0.2.3
383
384* Fix UTF-16 to UTF-8 encode.
385
386### 0.2.2
387
388* Add `Encoder.encode_from_utf8_to_vec_without_replacement()`.
389
390### 0.2.1
391
392* Add `Encoding.is_ascii_compatible()`.
393
394* Add `Encoding::for_bom()`.
395
396* Make `==` for `Encoding` use name comparison instead of pointer comparison,
397  because uses of the encoding constants in different crates result in
398  different addresses and the constants cannot be turned into statics without
399  breaking other things.
400
401### 0.2.0
402
403The initial release.
404