bstr
====
This crate provides extension traits for `&[u8]` and `Vec<u8>` that enable
their use as byte strings, where byte strings are _conventionally_ UTF-8.
Unlike the standard library's `String` and `str` types, byte strings are
not required to be valid UTF-8, but may be fully or partially valid UTF-8.

[![Build status](https://github.com/BurntSushi/bstr/workflows/ci/badge.svg)](https://github.com/BurntSushi/bstr/actions)
[![](https://meritbadge.herokuapp.com/bstr)](https://crates.io/crates/bstr)


### Documentation

https://docs.rs/bstr


### When should I use byte strings?

See this part of the documentation for more details:
https://docs.rs/bstr/0.2.*/bstr/#when-should-i-use-byte-strings.

The short story is that byte strings are useful when it is inconvenient or
incorrect to require valid UTF-8.


### Usage

Add this to your `Cargo.toml`:

```toml
[dependencies]
bstr = "0.2"
```


### Examples

The following examples exhibit both the API features of byte strings and
the I/O convenience functions provided for reading line-by-line quickly.

This first example simply shows how to efficiently iterate over lines in
stdin, and print out lines containing a particular substring:

```rust
use std::error::Error;
use std::io::{self, Write};

use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    // Each `line` is a `&[u8]` that still includes its line terminator.
    stdin.lock().for_byte_line_with_terminator(|line| {
        if line.contains_str("Dimension") {
            stdout.write_all(line)?;
        }
        // Returning `Ok(true)` continues to the next line.
        Ok(true)
    })?;
    Ok(())
}
```

This example shows how to count all of the words (Unicode-aware) in stdin,
line-by-line:

```rust
use std::error::Error;
use std::io;

use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut words = 0;
    stdin.lock().for_byte_line_with_terminator(|line| {
        words += line.words().count();
        Ok(true)
    })?;
    println!("{}", words);
    Ok(())
}
```

This example shows how to convert a stream on stdin to uppercase without
performing UTF-8 validation _and_ while amortizing allocation. On standard
ASCII text, this is quite a bit faster than what you can (easily) do with
standard library APIs. (N.B. Any invalid UTF-8 bytes are passed through
unchanged.)

```rust
use std::error::Error;
use std::io::{self, Write};

use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    // Reuse one buffer across lines so allocation is amortized.
    let mut upper = vec![];
    stdin.lock().for_byte_line_with_terminator(|line| {
        upper.clear();
        line.to_uppercase_into(&mut upper);
        stdout.write_all(&upper)?;
        Ok(true)
    })?;
    Ok(())
}
```

This example shows how to extract the first 10 visual characters (as grapheme
clusters) from each line, where invalid UTF-8 sequences are generally treated
as a single character and are passed through correctly:

```rust
use std::error::Error;
use std::io::{self, Write};

use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    stdin.lock().for_byte_line_with_terminator(|line| {
        // Byte offset just past the 10th grapheme, or the whole line if
        // it has fewer than 10 graphemes.
        let end = line
            .grapheme_indices()
            .map(|(_, end, _)| end)
            .take(10)
            .last()
            .unwrap_or(line.len());
        stdout.write_all(line[..end].trim_end())?;
        stdout.write_all(b"\n")?;
        Ok(true)
    })?;
    Ok(())
}
```


### Cargo features

This crate comes with a few features that control standard library, serde
and Unicode support.

* `std` - **Enabled** by default. This provides APIs that require the standard
  library, such as `Vec<u8>`.
* `unicode` - **Enabled** by default. This provides APIs that require sizable
  Unicode data compiled into the binary. This includes, but is not limited to,
  grapheme/word/sentence segmenters. When this is disabled, basic support such
  as UTF-8 decoding is still included.
* `serde1` - **Disabled** by default. Enables implementations of serde traits
  for the `BStr` and `BString` types.
* `serde1-nostd` - **Disabled** by default.
  Enables implementations of serde
  traits for the `BStr` type only, intended for use without the standard
  library. Generally, you either want `serde1` or `serde1-nostd`, not both.


### Minimum Rust version policy

This crate's minimum supported `rustc` version (MSRV) is `1.41.1`.

In general, this crate will be conservative with respect to the minimum
supported version of Rust. MSRV may be bumped in minor version releases.


### Future work

Since this is meant to be a core crate, getting a `1.0` release is a priority.
My hope is to move to `1.0` within the next year and commit to its API so that
`bstr` can be used as a public dependency.

A large part of the API surface area was taken from the standard library, so
from an API design perspective, a good portion of this crate should be on solid
ground already. The main differences from the standard library are in how the
various substring search routines work. The standard library provides generic
infrastructure for supporting different types of searches with a single method,
whereas this library prefers to define new methods for each type of search and
drop the generic infrastructure.

Some _probable_ future considerations for APIs include, but are not limited to:

* A convenience layer on top of the `aho-corasick` crate.
* Unicode normalization.
* More sophisticated support for dealing with Unicode case, perhaps by
  combining the use cases supported by [`caseless`](https://docs.rs/caseless)
  and [`unicase`](https://docs.rs/unicase).
* Add facilities for dealing with OS strings and file paths, probably via
  simple conversion routines.

Here are some examples that are _probably_ out of scope for this crate:

* Regular expressions.
* Unicode collation.

The exact scope isn't quite clear, but I expect we can iterate on it.
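
To make the search-API difference described earlier in this section concrete,
here is a minimal sketch of the standard library side only: a single generic
`find` method on `str` that accepts several pattern types. The helper name
`std_searches` is hypothetical and exists only for illustration. `bstr`
instead exposes a separate method per search type on `ByteSlice` (for example
`find`, `find_byte` and `find_char`); those are not shown here so that the
sketch compiles without any dependencies.

```rust
/// Runs three different kinds of search through the one generic
/// `str::find` method. (Hypothetical helper, for illustration only.)
fn std_searches(haystack: &str) -> [Option<usize>; 3] {
    [
        // Substring pattern.
        haystack.find("bar"),
        // Single `char` pattern.
        haystack.find('b'),
        // Closure pattern: first ASCII digit.
        haystack.find(|c: char| c.is_ascii_digit()),
    ]
}

fn main() {
    // Prints: [Some(4), Some(4), Some(8)]
    println!("{:?}", std_searches("foo bar 42"));
}
```

All three calls go through the same method and differ only in the argument
type, which is the generic infrastructure this crate chooses to drop in favor
of distinctly named methods.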

In general, as stated below, this crate brings lots of related APIs together
into a single crate while simultaneously attempting to keep the total number of
dependencies low. Indeed, every dependency of `bstr`, except for `memchr`, is
optional.


### High level motivation

Strictly speaking, the `bstr` crate provides very little that can't already be
achieved with the standard library `Vec<u8>`/`&[u8]` APIs and the ecosystem of
library crates. For example:

* The standard library's
  [`Utf8Error`](https://doc.rust-lang.org/std/str/struct.Utf8Error.html)
  can be used for incremental lossy decoding of `&[u8]`.
* The
  [`unicode-segmentation`](https://unicode-rs.github.io/unicode-segmentation/unicode_segmentation/index.html)
  crate can be used for iterating over graphemes (or words), but is only
  implemented for `&str` types. One could use `Utf8Error` above to implement
  grapheme iteration with the same semantics as what `bstr` provides (automatic
  Unicode replacement codepoint substitution).
* The [`twoway`](https://docs.rs/twoway) crate can be used for
  fast substring searching on `&[u8]`.

So why create `bstr`? Part of the point of the `bstr` crate is to provide a
uniform API of coupled components instead of relying on users to piece together
loosely coupled components from the crate ecosystem. For example, if you wanted
to perform a search and replace in a `Vec<u8>`, then writing the code to do
that with the `twoway` crate is not that difficult, but it's still additional
glue code you have to write. This work adds up depending on what you're doing.
Consider, for example, trimming and splitting, along with their different
variants.

In other words, `bstr` is partially a way of pushing back against the
micro-crate ecosystem that appears to be evolving. Namely, it is a goal of
`bstr` to keep its dependency list lightweight.
For example, `serde` is an
optional dependency because there is no feasible alternative. In service of
this philosophy, currently, the only required dependency of `bstr` is `memchr`.


### License

This project is licensed under either of

 * Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or
   https://www.apache.org/licenses/LICENSE-2.0)
 * MIT license ([LICENSE-MIT](LICENSE-MIT) or
   https://opensource.org/licenses/MIT)

at your option.

The data in `src/unicode/data/` is licensed under the Unicode License Agreement
([LICENSE-UNICODE](https://www.unicode.org/copyright.html#License)), although
this data is only used in tests.