Name | Date | Size | #Lines | LOC
---|---|---|---|---
benches/ | 03-May-2022 | - | 26 | 21
src/ | 03-May-2022 | - | 572 | 339
.cargo-checksum.json | 03-May-2022 | 89 | 1 | 1
.cargo_vcs_info.json | 01-Jan-1970 | 74 | 6 | 5
.gitignore | 09-Jul-2016 | 88 | 10 | 9
COPYING | 09-Jul-2016 | 126 | 4 | 2
Cargo.toml | 01-Jan-1970 | 1.2 KiB | 33 | 30
Cargo.toml.orig-cargo | 09-Jun-2019 | 687 | 20 | 17
LICENSE-MIT | 09-Jul-2016 | 1.1 KiB | 22 | 17
README.md | 09-Jun-2019 | 1.4 KiB | 55 | 38
UNLICENSE | 09-Jul-2016 | 1.2 KiB | 25 | 20

README.md
utf8-ranges
===========
This crate converts contiguous ranges of Unicode scalar values to UTF-8 byte
ranges. This is useful when constructing byte-based automata from Unicode.
Stated differently, this lets one embed UTF-8 decoding as part of one's
automaton.

[![Linux build status](https://api.travis-ci.org/BurntSushi/utf8-ranges.png)](https://travis-ci.org/BurntSushi/utf8-ranges)
[![](http://meritbadge.herokuapp.com/utf8-ranges)](https://crates.io/crates/utf8-ranges)

Dual-licensed under MIT or the [UNLICENSE](http://unlicense.org).


### Documentation

https://docs.rs/utf8-ranges


### Example

This shows how to convert a scalar value range (e.g., the basic multilingual
plane) to a sequence of byte-based character classes.


```rust
extern crate utf8_ranges;

use utf8_ranges::Utf8Sequences;

fn main() {
    for range in Utf8Sequences::new('\u{0}', '\u{FFFF}') {
        println!("{:?}", range);
    }
}
```

The output:

```text
[0-7F]
[C2-DF][80-BF]
[E0][A0-BF][80-BF]
[E1-EC][80-BF][80-BF]
[ED][80-9F][80-BF]
[EE-EF][80-BF][80-BF]
```

These ranges can then be used to build an automaton. Namely:

1. Every sequence of bytes matches exactly one of the sequences of
   ranges or none of them.
2. Every matching sequence of bytes is guaranteed to be valid UTF-8. (Erroneous
   encodings of surrogate codepoints in UTF-8 cannot match any of the byte
   ranges above.)
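
To make the two properties concrete, here is a minimal sketch that matches a byte slice against the six sequences from the output above. It does not use the crate: the table `BMP_SEQUENCES` is copied by hand from that output, and the names `ByteRange`, `BMP_SEQUENCES`, and `matching_sequence` are hypothetical helpers introduced only for illustration. A real automaton would be built from the ranges rather than scanned linearly like this.

```rust
/// One byte range, inclusive on both ends (illustrative type, not from the crate).
type ByteRange = (u8, u8);

/// The six sequences shown in the output above for U+0000..=U+FFFF,
/// copied by hand rather than generated by `Utf8Sequences`.
const BMP_SEQUENCES: &[&[ByteRange]] = &[
    &[(0x00, 0x7F)],
    &[(0xC2, 0xDF), (0x80, 0xBF)],
    &[(0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)],
    &[(0xE1, 0xEC), (0x80, 0xBF), (0x80, 0xBF)],
    &[(0xED, 0xED), (0x80, 0x9F), (0x80, 0xBF)],
    &[(0xEE, 0xEF), (0x80, 0xBF), (0x80, 0xBF)],
];

/// Returns the index of the sequence that `bytes` matches in full, if any.
/// A match requires the same length and every byte inside its range.
fn matching_sequence(bytes: &[u8]) -> Option<usize> {
    BMP_SEQUENCES.iter().position(|seq| {
        seq.len() == bytes.len()
            && seq
                .iter()
                .zip(bytes)
                .all(|(&(lo, hi), &b)| lo <= b && b <= hi)
    })
}
```

Under these assumptions, `matching_sequence("a".as_bytes())` yields `Some(0)`, while the invalid surrogate encoding `&[0xED, 0xA0, 0x80]` yields `None`, because `0xA0` falls outside the `[80-9F]` range that follows `ED`.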