bstr
====
This crate provides extension traits for `&[u8]` and `Vec<u8>` that enable
their use as byte strings, where byte strings are _conventionally_ UTF-8.
Unlike the standard library's `String` and `str` types, byte strings are not
required to be valid UTF-8, but may be fully or partially valid UTF-8.

[![Build status](https://github.com/BurntSushi/bstr/workflows/ci/badge.svg)](https://github.com/BurntSushi/bstr/actions)
[![](https://meritbadge.herokuapp.com/bstr)](https://crates.io/crates/bstr)


### Documentation

https://docs.rs/bstr


### When should I use byte strings?

See this part of the documentation for more details:
https://docs.rs/bstr/0.2.*/bstr/#when-should-i-use-byte-strings.

The short story is that byte strings are useful when it is inconvenient or
incorrect to require valid UTF-8.
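
As a rough illustration of that inconvenience, the sketch below uses only the
standard library to show the two options `str` offers for input that is not
valid UTF-8: reject it, or decode it lossily and alter the bytes. The helper
names `decode_strict` and `lossy_changes_bytes` are hypothetical and exist
only for this example.

```rust
use std::str;

/// Strict decoding: rejects the input if any byte sequence is invalid UTF-8.
/// (Hypothetical helper for illustration.)
fn decode_strict(bytes: &[u8]) -> Option<&str> {
    str::from_utf8(bytes).ok()
}

/// Reports whether lossy decoding would change the underlying bytes.
/// (Hypothetical helper for illustration.)
fn lossy_changes_bytes(bytes: &[u8]) -> bool {
    String::from_utf8_lossy(bytes).as_bytes() != bytes
}
```

For Latin-1 encoded input such as `b"caf\xE9"`, `decode_strict` returns `None`
and `lossy_changes_bytes` returns `true`, so neither path preserves the data
as-is. Operating on the `&[u8]` directly avoids both problems.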


### Usage

Add this to your `Cargo.toml`:

```toml
[dependencies]
bstr = "0.2"
```


### Examples

The following examples exhibit both the API features of byte strings and
the I/O convenience functions provided for reading line-by-line quickly.

This first example simply shows how to efficiently iterate over lines in
stdin, and print out lines containing a particular substring:

```rust
use std::error::Error;
use std::io::{self, Write};

use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    // Buffer stdout to avoid flushing on every line.
    let mut stdout = io::BufWriter::new(io::stdout());

    // Each `line` is a `&[u8]` that includes its line terminator, if any.
    stdin.lock().for_byte_line_with_terminator(|line| {
        if line.contains_str("Dimension") {
            stdout.write_all(line)?;
        }
        // Returning `true` continues on to the next line.
        Ok(true)
    })?;
    Ok(())
}
```

This example shows how to count all of the words (Unicode-aware) in stdin,
line-by-line:

```rust
use std::error::Error;
use std::io;

use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut words = 0;
    stdin.lock().for_byte_line_with_terminator(|line| {
        // `words()` iterates over the Unicode words in the line.
        words += line.words().count();
        Ok(true)
    })?;
    println!("{}", words);
    Ok(())
}
```

This example shows how to convert a stream on stdin to uppercase without
performing UTF-8 validation _and_ while amortizing allocation. On standard
ASCII text, this is quite a bit faster than what you can (easily) do with
standard library APIs. (N.B. Any invalid UTF-8 bytes are passed through
unchanged.)

```rust
use std::error::Error;
use std::io::{self, Write};

use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    // Reuse a single buffer across lines to amortize allocation.
    let mut upper = vec![];
    stdin.lock().for_byte_line_with_terminator(|line| {
        upper.clear();
        line.to_uppercase_into(&mut upper);
        stdout.write_all(&upper)?;
        Ok(true)
    })?;
    Ok(())
}
```

This example shows how to extract the first 10 visual characters (as grapheme
clusters) from each line, where invalid UTF-8 sequences are generally treated
as a single character and are passed through correctly:

```rust
use std::error::Error;
use std::io::{self, Write};

use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    stdin.lock().for_byte_line_with_terminator(|line| {
        // `grapheme_indices` yields `(start, end, grapheme)` tuples. Keep the
        // end offset of the 10th grapheme, or of the last one on shorter
        // lines. An empty line has no graphemes, so fall back to its length.
        let end = line
            .grapheme_indices()
            .map(|(_, end, _)| end)
            .take(10)
            .last()
            .unwrap_or(line.len());
        stdout.write_all(line[..end].trim_end())?;
        stdout.write_all(b"\n")?;
        Ok(true)
    })?;
    Ok(())
}
```


### Cargo features

This crate comes with a few features that control standard library, serde
and Unicode support.

* `std` - **Enabled** by default. This provides APIs that require the standard
  library, such as `Vec<u8>`.
* `unicode` - **Enabled** by default. This provides APIs that require sizable
  Unicode data compiled into the binary. This includes, but is not limited to,
  grapheme/word/sentence segmenters. When this is disabled, basic support such
  as UTF-8 decoding is still included.
* `serde1` - **Disabled** by default. Enables implementations of serde traits
  for the `BStr` and `BString` types.
* `serde1-nostd` - **Disabled** by default. Enables implementations of serde
  traits for the `BStr` type only, intended for use without the standard
  library. Generally, you either want `serde1` or `serde1-nostd`, not both.


### Minimum Rust version policy

This crate's minimum supported `rustc` version (MSRV) is `1.41.1`.

In general, this crate will be conservative with respect to the minimum
supported version of Rust. MSRV may be bumped in minor version releases.


### Future work

Since this is meant to be a core crate, getting a `1.0` release is a priority.
My hope is to move to `1.0` within the next year and commit to its API so that
`bstr` can be used as a public dependency.

A large part of the API surface area was taken from the standard library, so
from an API design perspective, a good portion of this crate should be on solid
ground already. The main differences from the standard library are in how the
various substring search routines work. The standard library provides generic
infrastructure for supporting different types of searches with a single method,
whereas this library prefers to define new methods for each type of search and
drop the generic infrastructure.
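
To make the contrast concrete, here is a small sketch of the standard
library side only: the single generic `str::find` method accepts a substring,
a `char`, or a predicate. The function name `find_three_ways` is hypothetical.
As far as I recall from the `bstr` documentation, the corresponding searches
there are spelled as separate methods such as `find`, `find_char` and
`find_byte`, but treat those names as something to verify against the docs.

```rust
/// Searches `haystack` three ways through the one generic `str::find` method.
/// (Hypothetical helper for illustration.)
fn find_three_ways(haystack: &str) -> [Option<usize>; 3] {
    [
        // Substring pattern.
        haystack.find("bar"),
        // Single-character pattern.
        haystack.find('b'),
        // Predicate pattern.
        haystack.find(char::is_whitespace),
    ]
}
```

For the input `"foo bar"`, this returns `[Some(4), Some(4), Some(3)]`.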

Some _probable_ future considerations for APIs include, but are not limited to:

* A convenience layer on top of the `aho-corasick` crate.
* Unicode normalization.
* More sophisticated support for dealing with Unicode case, perhaps by
  combining the use cases supported by [`caseless`](https://docs.rs/caseless)
  and [`unicase`](https://docs.rs/unicase).
* Add facilities for dealing with OS strings and file paths, probably via
  simple conversion routines.

Here are some examples that are _probably_ out of scope for this crate:

* Regular expressions.
* Unicode collation.

The exact scope isn't quite clear, but I expect we can iterate on it.

In general, as stated below, this crate brings lots of related APIs together
into a single crate while simultaneously attempting to keep the total number of
dependencies low. Indeed, every dependency of `bstr`, except for `memchr`, is
optional.


### High level motivation

Strictly speaking, the `bstr` crate provides very little that can't already be
achieved with the standard library `Vec<u8>`/`&[u8]` APIs and the ecosystem of
library crates. For example:

* The standard library's
  [`Utf8Error`](https://doc.rust-lang.org/std/str/struct.Utf8Error.html)
  can be used for incremental lossy decoding of `&[u8]`.
* The
  [`unicode-segmentation`](https://unicode-rs.github.io/unicode-segmentation/unicode_segmentation/index.html)
  crate can be used for iterating over graphemes (or words), but is only
  implemented for `&str` types. One could use `Utf8Error` above to implement
  grapheme iteration with the same semantics as what `bstr` provides (automatic
  Unicode replacement codepoint substitution).
* The [`twoway`](https://docs.rs/twoway) crate can be used for
  fast substring searching on `&[u8]`.
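
As a minimal sketch of the first point, incremental lossy decoding with
`Utf8Error` might look like the following. It relies only on
`std::str::from_utf8`, `Utf8Error::valid_up_to` and `Utf8Error::error_len`.
The function name `decode_lossy` is hypothetical, and this is not how `bstr`
itself is implemented.

```rust
use std::str;

/// Lossily decodes `bytes`, substituting U+FFFD for each invalid sequence.
/// (Hypothetical helper for illustration.)
fn decode_lossy(mut bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len());
    loop {
        match str::from_utf8(bytes) {
            // Everything that remains is valid, so we are done.
            Ok(valid) => {
                out.push_str(valid);
                return out;
            }
            Err(err) => {
                let (valid, rest) = bytes.split_at(err.valid_up_to());
                // The prefix was just validated, so this cannot fail.
                out.push_str(str::from_utf8(valid).unwrap());
                out.push('\u{FFFD}');
                match err.error_len() {
                    // Skip the invalid sequence and keep decoding.
                    Some(len) => bytes = &rest[len..],
                    // The input ended in the middle of a sequence.
                    None => return out,
                }
            }
        }
    }
}
```

For example, `decode_lossy(b"foo\xFFbar")` yields `"foo\u{FFFD}bar"`. Even for
something this small, the loop and its edge cases are code you would otherwise
have to write and test yourself.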

So why create `bstr`? Part of the point of the `bstr` crate is to provide a
uniform API of coupled components instead of relying on users to piece together
loosely coupled components from the crate ecosystem. For example, if you wanted
to perform a search and replace in a `Vec<u8>`, then writing the code to do
that with the `twoway` crate is not that difficult, but it's still additional
glue code you have to write. This work adds up depending on what you're doing.
Consider, for example, trimming and splitting, along with their different
variants.
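
To give a sense of the glue code in question, here is a minimal sketch of
search and replace on raw bytes using only the standard library. Note that it
swaps in a naive scan via `slice::windows` rather than the `twoway` crate, and
that `find` and `replace_all` are hypothetical helpers written for this
example, not part of any crate's API.

```rust
/// Finds the first occurrence of `needle` in `haystack` using a naive scan.
/// (Hypothetical helper; a real implementation would use a faster searcher.)
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    // `windows(0)` panics, so handle the empty needle explicitly.
    if needle.is_empty() {
        return Some(0);
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Replaces every non-overlapping occurrence of `needle` with `with`.
/// (Hypothetical helper for illustration.)
fn replace_all(haystack: &[u8], needle: &[u8], with: &[u8]) -> Vec<u8> {
    // An empty needle would match everywhere; return the input unchanged.
    if needle.is_empty() {
        return haystack.to_vec();
    }
    let mut out = Vec::with_capacity(haystack.len());
    let mut rest = haystack;
    while let Some(i) = find(rest, needle) {
        // Copy the bytes before the match, then the replacement.
        out.extend_from_slice(&rest[..i]);
        out.extend_from_slice(with);
        // Continue searching after the matched bytes.
        rest = &rest[i + needle.len()..];
    }
    out.extend_from_slice(rest);
    out
}
```

With this, `replace_all(b"foo bar foo", b"foo", b"quux")` produces the bytes
of `quux bar quux`, and it works equally well when the haystack contains
invalid UTF-8.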

In other words, `bstr` is partially a way of pushing back against the
micro-crate ecosystem that appears to be evolving. Namely, it is a goal of
`bstr` to keep its dependency list lightweight. For example, `serde` is
included, and only as an optional dependency, because there is no feasible
alternative to it. In service of this philosophy, currently, the only required
dependency of `bstr` is `memchr`.


### License

This project is licensed under either of

 * Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or
   https://www.apache.org/licenses/LICENSE-2.0)
 * MIT license ([LICENSE-MIT](LICENSE-MIT) or
   https://opensource.org/licenses/MIT)

at your option.

The data in `src/unicode/data/` is licensed under the Unicode License Agreement
([LICENSE-UNICODE](https://www.unicode.org/copyright.html#License)), although
this data is only used in tests.
