[Encoding][doc] 0.2.33
======================

[![Encoding on Travis CI][travis-image]][travis]

[travis-image]: https://travis-ci.org/lifthrasiir/rust-encoding.png
[travis]: https://travis-ci.org/lifthrasiir/rust-encoding

Character encoding support for Rust. (also known as `rust-encoding`)
It is based on [WHATWG Encoding Standard](http://encoding.spec.whatwg.org/),
and also provides an advanced interface for error detection and recovery.

## Documentation

[Complete Documentation][doc]

[doc]: https://lifthrasiir.github.io/rust-encoding/

## Usage

Put this in your `Cargo.toml`:

```toml
[dependencies]
encoding = "0.2"
```

Then put this in your crate root:

```rust
extern crate encoding;
```

## Overview

To encode a string:

~~~~ {.rust}
use encoding::{Encoding, EncoderTrap};
use encoding::all::ISO_8859_1;

assert_eq!(ISO_8859_1.encode("caf\u{e9}", EncoderTrap::Strict),
           Ok(vec![99,97,102,233]));
~~~~

To encode a string with unrepresentable characters:

~~~~ {.rust}
use encoding::{Encoding, EncoderTrap};
use encoding::all::ISO_8859_2;

// Strict reports an error at the first unrepresentable character.
assert!(ISO_8859_2.encode("Acme\u{a9}", EncoderTrap::Strict).is_err());
// Replace substitutes `?` (byte 63).
assert_eq!(ISO_8859_2.encode("Acme\u{a9}", EncoderTrap::Replace),
           Ok(vec![65,99,109,101,63]));
// Ignore drops the character.
assert_eq!(ISO_8859_2.encode("Acme\u{a9}", EncoderTrap::Ignore),
           Ok(vec![65,99,109,101]));
// NcrEscape emits the decimal numeric character reference `&#169;`.
assert_eq!(ISO_8859_2.encode("Acme\u{a9}", EncoderTrap::NcrEscape),
           Ok(vec![65,99,109,101,38,35,49,54,57,59]));
~~~~

To decode a byte sequence:

~~~~ {.rust}
use encoding::{Encoding, DecoderTrap};
use encoding::all::ISO_8859_1;

assert_eq!(ISO_8859_1.decode(&[99,97,102,233], DecoderTrap::Strict),
           Ok("caf\u{e9}".to_string()));
~~~~

To decode a byte sequence with invalid sequences:

~~~~ {.rust}
use encoding::{Encoding, DecoderTrap};
use encoding::all::ISO_8859_6;

assert!(ISO_8859_6.decode(&[65,99,109,101,169], DecoderTrap::Strict).is_err());
assert_eq!(ISO_8859_6.decode(&[65,99,109,101,169], DecoderTrap::Replace),
           Ok("Acme\u{fffd}".to_string()));
assert_eq!(ISO_8859_6.decode(&[65,99,109,101,169], DecoderTrap::Ignore),
           Ok("Acme".to_string()));
~~~~

To encode or decode the input into an already allocated buffer:

~~~~ {.rust}
use encoding::{Encoding, EncoderTrap, DecoderTrap};
use encoding::all::{ISO_8859_2, ISO_8859_6};

let mut bytes = Vec::new();
let mut chars = String::new();

assert!(ISO_8859_2.encode_to("Acme\u{a9}", EncoderTrap::Ignore, &mut bytes).is_ok());
assert!(ISO_8859_6.decode_to(&[65,99,109,101,169], DecoderTrap::Replace, &mut chars).is_ok());

assert_eq!(bytes, [65,99,109,101]);
assert_eq!(chars, "Acme\u{fffd}");
~~~~

A practical example of custom encoder traps:

~~~~ {.rust}
use encoding::{Encoding, ByteWriter, EncoderTrap, DecoderTrap};
use encoding::types::RawEncoder;
use encoding::all::ASCII;

// hexadecimal numeric character reference replacement
fn hex_ncr_escape(_encoder: &mut RawEncoder, input: &str, output: &mut ByteWriter) -> bool {
    let escapes: Vec<String> =
        input.chars().map(|ch| format!("&#x{:x};", ch as isize)).collect();
    let escapes = escapes.concat();
    output.write_bytes(escapes.as_bytes());
    true
}
static HEX_NCR_ESCAPE: EncoderTrap = EncoderTrap::Call(hex_ncr_escape);

let orig = "Hello, 世界!".to_string();
let encoded = ASCII.encode(&orig, HEX_NCR_ESCAPE).unwrap();
assert_eq!(ASCII.decode(&encoded, DecoderTrap::Strict),
           Ok("Hello, &#x4e16;&#x754c;!".to_string()));
~~~~

Getting the encoding from the string label, as specified in the WHATWG Encoding Standard:

~~~~ {.rust}
use encoding::{Encoding, DecoderTrap};
use encoding::label::encoding_from_whatwg_label;
use encoding::all::WINDOWS_949;

let euckr = encoding_from_whatwg_label("euc-kr").unwrap();
assert_eq!(euckr.name(), "windows-949");
assert_eq!(euckr.whatwg_name(), Some("euc-kr")); // for the sake of compatibility
let broken = &[0xbf, 0xec, 0xbf, 0xcd, 0xff, 0xbe, 0xd3];
assert_eq!(euckr.decode(broken, DecoderTrap::Replace),
           Ok("\u{c6b0}\u{c640}\u{fffd}\u{c559}".to_string()));

// corresponding Encoding native API:
assert_eq!(WINDOWS_949.decode(broken, DecoderTrap::Replace),
           Ok("\u{c6b0}\u{c640}\u{fffd}\u{c559}".to_string()));
~~~~

## Types and Stuffs

There are three main entry points to Encoding.

**`Encoding`** is a single character encoding.
It contains `encode` and `decode` methods for converting `String` to `Vec<u8>` and vice versa.
For error handling, they receive **traps** (`EncoderTrap` and `DecoderTrap` respectively)
which replace any error with some string (e.g. `U+FFFD`) when decoding
or some byte sequence (e.g. `?`) when encoding.
You can also use the `EncoderTrap::Strict` and `DecoderTrap::Strict` traps to stop on an error.
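
To make the trap behaviour concrete, here is a minimal, std-only sketch of what each built-in encoder trap does when a character cannot be represented. It does not use this crate: `Trap` and `encode_latin1` are hypothetical names invented for illustration, and ISO 8859-1 is chosen only because its mapping (U+0000 to U+00FF map to the byte of the same value) fits in one line.

```rust
/// Hypothetical stand-in for `EncoderTrap`; not the crate's real type.
#[derive(Clone, Copy)]
pub enum Trap {
    Strict,    // stop at the first unrepresentable character
    Replace,   // substitute `?`
    Ignore,    // drop the character
    NcrEscape, // emit a decimal numeric character reference such as `&#19990;`
}

/// Encodes `input` as ISO 8859-1, consulting `trap` for every character
/// above U+00FF. Illustration only; the crate's encoders are more general.
pub fn encode_latin1(input: &str, trap: Trap) -> Result<Vec<u8>, String> {
    let mut out = Vec::with_capacity(input.len());
    for ch in input.chars() {
        let cp = ch as u32;
        if cp <= 0xFF {
            out.push(cp as u8);
            continue;
        }
        match trap {
            Trap::Strict => return Err(format!("unrepresentable character U+{:04X}", cp)),
            Trap::Replace => out.push(b'?'),
            Trap::Ignore => {}
            Trap::NcrEscape => out.extend_from_slice(format!("&#{};", cp).as_bytes()),
        }
    }
    Ok(out)
}
```

With this sketch, `encode_latin1("a\u{4e16}", Trap::Replace)` gives `Ok(vec![97, 63])`, while `Trap::Strict` returns an `Err` for the same input.
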

There are two ways to get `Encoding`:

* `encoding::all` has static items for every supported encoding.
  You should use them when the encoding would not change or only a handful of them are required.
  Combined with link-time optimization, any unused encoding would be discarded from the binary.

* `encoding::label` has functions to dynamically get an encoding from a given string ("label").
  They will return a static reference to the encoding,
  whose type is also known as `EncodingRef`.
  It is useful when a list of required encodings is not available in advance,
  but it will result in a larger binary and missed optimization opportunities.
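
The trade-off between the two can be sketched without the crate itself. In the std-only miniature below, `MiniEncoding`, `MiniEncodingRef` and `from_label` are hypothetical stand-ins, not this crate's API; the label normalization assumes the WHATWG rule of trimming ASCII whitespace and comparing ASCII case-insensitively. Because `from_label` mentions every static item, none of them can be dropped at link time, whereas code that names only `UTF_8` directly leaves the others unreferenced.

```rust
/// Hypothetical miniature of an encoding object; not the crate's `Encoding` trait.
pub trait MiniEncoding: Sync {
    fn name(&self) -> &'static str;
}

pub struct Utf8;
pub struct Utf16Le;
pub struct Utf16Be;

impl MiniEncoding for Utf8 {
    fn name(&self) -> &'static str { "utf-8" }
}
impl MiniEncoding for Utf16Le {
    fn name(&self) -> &'static str { "utf-16le" }
}
impl MiniEncoding for Utf16Be {
    fn name(&self) -> &'static str { "utf-16be" }
}

// One static item per encoding, in the spirit of `encoding::all`.
pub static UTF_8: Utf8 = Utf8;
pub static UTF_16LE: Utf16Le = Utf16Le;
pub static UTF_16BE: Utf16Be = Utf16Be;

/// A static reference to some encoding, in the spirit of `EncodingRef`.
pub type MiniEncodingRef = &'static dyn MiniEncoding;

/// Looks an encoding up by label at run time.
pub fn from_label(label: &str) -> Option<MiniEncodingRef> {
    // ASCII whitespace: TAB, LF, FF, CR and SPACE.
    let trimmed = label.trim_matches(|c: char| matches!(c, '\t' | '\n' | '\x0C' | '\r' | ' '));
    let enc: MiniEncodingRef = match trimmed.to_ascii_lowercase().as_str() {
        "utf-8" | "utf8" => &UTF_8,
        "utf-16" | "utf-16le" => &UTF_16LE,
        "utf-16be" => &UTF_16BE,
        _ => return None,
    };
    Some(enc)
}
```

Here `from_label(" UTF-16 ").map(|e| e.name())` is `Some("utf-16le")`, and an unknown label yields `None`.
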

**`RawEncoder`** is an experimental incremental encoder.
At each step of `raw_feed`, it receives a string slice
and emits any encoded bytes to a generic `ByteWriter` (normally `Vec<u8>`).
It stops at the first error, if any, and returns a `CodecError` struct in that case.
The caller is responsible for calling `raw_finish` at the end of the encoding process.

**`RawDecoder`** is an experimental incremental decoder.
At each step of `raw_feed`, it receives a slice of a byte sequence
and emits any decoded characters to a generic `StringWriter` (normally `String`).
Otherwise it is identical to `RawEncoder`.
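
The feed-then-finish protocol can be illustrated with a std-only sketch of an incremental UTF-8 decoder. It is not the crate's implementation: `Utf8Sketch` and `DecodeError` are hypothetical, and the methods return a plain `Result` rather than the `CodecError`-based values of the real `raw_feed` and `raw_finish`. What it does show is why the final call matters: a multi-byte sequence may be split across two feeds, and only the finishing step can tell a split sequence from a truncated one.

```rust
use std::mem;
use std::str;

/// Hypothetical error type; the real API reports a `CodecError` instead.
#[derive(Debug, PartialEq)]
pub struct DecodeError;

/// Hypothetical incremental UTF-8 decoder. The method names mirror
/// `raw_feed` and `raw_finish`, but the signatures are simplified.
#[derive(Default)]
pub struct Utf8Sketch {
    pending: Vec<u8>, // bytes of an incomplete sequence carried between calls
}

impl Utf8Sketch {
    /// Decodes as much of `input` as possible, stopping at the first error.
    pub fn raw_feed(&mut self, input: &[u8], output: &mut String) -> Result<(), DecodeError> {
        // Prepend whatever the previous call could not finish.
        let mut buf = mem::take(&mut self.pending);
        buf.extend_from_slice(input);
        match str::from_utf8(&buf) {
            Ok(s) => {
                output.push_str(s);
                Ok(())
            }
            Err(e) => {
                let valid = e.valid_up_to();
                // `from_utf8` guarantees that `buf[..valid]` is valid UTF-8.
                output.push_str(str::from_utf8(&buf[..valid]).unwrap());
                match e.error_len() {
                    // The input ended mid-sequence: keep the tail for the next call.
                    None => {
                        self.pending = buf[valid..].to_vec();
                        Ok(())
                    }
                    // A definitely invalid sequence: report it.
                    Some(_) => Err(DecodeError),
                }
            }
        }
    }

    /// Must be called once at the end; leftover bytes mean truncated input.
    pub fn raw_finish(&mut self, _output: &mut String) -> Result<(), DecodeError> {
        if self.pending.is_empty() {
            Ok(())
        } else {
            self.pending.clear();
            Err(DecodeError)
        }
    }
}
```

Feeding the UTF-8 bytes of `"caf\u{e9}"` as `[99, 97, 102, 0xC3]` followed by `[0xA9]` first appends `"caf"` and then the final `é`; feeding only the first chunk and then calling `raw_finish` reports an error.
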

One should prefer `Encoding::{encode,decode}` as a primary interface.
`RawEncoder` and `RawDecoder` are experimental and can change substantially.
See the additional documentation on the `encoding::types` module for more information on them.

## Supported Encodings

Encoding covers all encodings specified by the WHATWG Encoding Standard and some more:

* 7-bit strict ASCII (`ascii`)
* UTF-8 (`utf-8`)
* UTF-16 in little endian (`utf-16` or `utf-16le`) and big endian (`utf-16be`)
* All single-byte encodings in the WHATWG Encoding Standard:
    * IBM code page 866
    * ISO 8859-{2,3,4,5,6,7,8,10,13,14,15,16}
    * KOI8-R, KOI8-U
    * MacRoman (`macintosh`), Macintosh Cyrillic encoding (`x-mac-cyrillic`)
    * Windows code pages 874, 1250, 1251, 1252 (instead of ISO 8859-1), 1253,
      1254 (instead of ISO 8859-9), 1255, 1256, 1257, 1258
* All multi-byte encodings in the WHATWG Encoding Standard:
    * Windows code page 949 (`euc-kr`, since the strict EUC-KR is hardly used)
    * EUC-JP and Windows code page 932 (`shift_jis`,
      since it's the most widespread extension to Shift_JIS)
    * ISO-2022-JP with asymmetric JIS X 0212 support
      (Note: this is not yet up to date with the current standard)
    * GBK
    * GB 18030
    * Big5-2003 with HKSCS-2008 extensions
* Encodings that were originally specified by the WHATWG Encoding Standard:
    * HZ
* ISO 8859-1 (distinct from Windows code page 1252)

Parenthesized names refer to the encoding's primary name assigned by the WHATWG Encoding Standard.

Many legacy character encodings lack a proper specification,
and even those that have a specification are highly dependent on the actual implementation.
Consequently one should be careful when picking a desired character encoding.
The only standards reliable in this regard are the WHATWG Encoding Standard and
[vendor-provided mappings from the Unicode consortium](http://www.unicode.org/Public/MAPPINGS/).
Whenever in doubt, look at the source code and specifications for detailed explanations.
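
As one concrete case of why the choice matters, the std-only sketch below decodes the same bytes as ISO 8859-1 and as Windows code page 1252. The two agree everywhere except the range 0x80 to 0x9F. The function names are hypothetical and the code does not use this crate; the table is the 0x80 to 0x9F slice of windows-1252 as I understand the WHATWG index, which assigns the five bytes left undefined in the vendor mapping (0x81, 0x8D, 0x8F, 0x90, 0x9D) to the C1 control of the same value.

```rust
/// windows-1252 mapping for bytes 0x80..=0x9F (assumed to match the WHATWG index).
const CP1252_HIGH: [char; 32] = [
    '\u{20AC}', '\u{0081}', '\u{201A}', '\u{0192}', '\u{201E}', '\u{2026}', '\u{2020}', '\u{2021}',
    '\u{02C6}', '\u{2030}', '\u{0160}', '\u{2039}', '\u{0152}', '\u{008D}', '\u{017D}', '\u{008F}',
    '\u{0090}', '\u{2018}', '\u{2019}', '\u{201C}', '\u{201D}', '\u{2022}', '\u{2013}', '\u{2014}',
    '\u{02DC}', '\u{2122}', '\u{0161}', '\u{203A}', '\u{0153}', '\u{009D}', '\u{017E}', '\u{0178}',
];

/// ISO 8859-1: every byte maps to the code point of the same value.
pub fn decode_iso_8859_1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// windows-1252: identical to ISO 8859-1 outside 0x80..=0x9F.
pub fn decode_windows_1252(bytes: &[u8]) -> String {
    bytes
        .iter()
        .map(|&b| match b {
            0x80..=0x9F => CP1252_HIGH[(b - 0x80) as usize],
            _ => b as char,
        })
        .collect()
}
```

Byte 0x80 therefore decodes to the control character U+0080 under ISO 8859-1 but to the euro sign U+20AC under windows-1252, while `[99, 97, 102, 233]` decodes to `"caf\u{e9}"` under both.
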