# tendril

**Warning**: This library is at a very early stage of development, and it
contains a substantial amount of `unsafe` code. Use at your own risk!

[![Build Status](https://travis-ci.org/servo/tendril.svg?branch=master)](https://travis-ci.org/servo/tendril)

[API Documentation](http://doc.servo.org/tendril/index.html)
9
10## Introduction
11
12`Tendril` is a compact string/buffer type, optimized for zero-copy parsing.
13Tendrils have the semantics of owned strings, but are sometimes views into
14shared buffers. When you mutate a tendril, an owned copy is made if necessary.
15Further mutations occur in-place until the string becomes shared, e.g. with
16`clone()` or `subtendril()`.
17
18Buffer sharing is accomplished through thread-local (non-atomic) reference
19counting, which has very low overhead. The Rust type system will prevent
20you at compile time from sending a tendril between threads. (See below
21for thoughts on relaxing this restriction.)
22
23Whereas `String` allocates in the heap for any non-empty string, `Tendril` can
24store small strings (up to 8 bytes) in-line, without a heap allocation.
25`Tendril` is also smaller than `String` on 64-bit platforms — 16 bytes versus
2624. `Option<Tendril>` is the same size as `Tendril`, thanks to
27[`NonZero`][NonZero].
28
29The maximum length of a tendril is 4 GB. The library will panic if you attempt
30to go over the limit.
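The copy-on-write behavior described at the start of this section can be
approximated with the standard library alone. The sketch below is an analogy,
not how `Tendril` is implemented: it uses `Rc<String>` and `Rc::make_mut`, and
the helper name `append` is made up for illustration. Like a tendril, `Rc` uses
non-atomic reference counting, so the compiler rejects attempts to send it to
another thread.

```rust
use std::rc::Rc;

/// Appends `suffix` to a reference-counted string (hypothetical helper).
///
/// If other handles share the buffer, `Rc::make_mut` clones the inner
/// `String` first, so those handles keep seeing the old contents. If this
/// handle is the only one, the existing buffer is mutated in place.
fn append(buf: &mut Rc<String>, suffix: &str) {
    Rc::make_mut(buf).push_str(suffix);
}
```

Calling `append` on a handle that has been cloned copies the buffer once;
calling it again on the now-unique handle mutates in place.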

## Formats and encoding

`Tendril` uses [phantom types](http://rustbyexample.com/generics/phantom.html)
to track a buffer's format. This determines at compile time which
operations are available on a given tendril. For example, `Tendril<UTF8>` and
`Tendril<Bytes>` can be borrowed as `&str` and `&[u8]` respectively.

`Tendril` also integrates with
[rust-encoding](https://github.com/lifthrasiir/rust-encoding) and has
preliminary support for [WTF-8][] buffers.
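To illustrate how a phantom type can gate operations at compile time, here is
a minimal sketch. The names `Buf`, `Utf8`, `Bytes`, `from_text` and
`from_bytes` are hypothetical and only loosely mirror the formats mentioned
above; they are not the library's definitions.

```rust
use std::marker::PhantomData;

// Hypothetical format markers. They carry no data and exist only as types.
struct Utf8;
struct Bytes;

/// A byte buffer tagged at compile time with a format `F`.
struct Buf<F> {
    data: Vec<u8>,
    format: PhantomData<F>,
}

impl Buf<Bytes> {
    fn from_bytes(data: &[u8]) -> Self {
        Buf { data: data.to_vec(), format: PhantomData }
    }

    /// Only byte buffers expose their contents as `&[u8]` in this sketch.
    fn as_bytes(&self) -> &[u8] {
        &self.data
    }
}

impl Buf<Utf8> {
    fn from_text(text: &str) -> Self {
        Buf { data: text.as_bytes().to_vec(), format: PhantomData }
    }

    /// Only UTF-8 buffers can be borrowed as `&str`.
    fn as_str(&self) -> &str {
        // The only constructor takes a `&str`, so the bytes are valid UTF-8.
        std::str::from_utf8(&self.data).expect("buffer was built from a &str")
    }
}
```

With these definitions, calling `as_str()` on a `Buf<Bytes>` fails to compile,
because that method is defined only for `Buf<Utf8>`.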

## C interface

`Tendril` provides a C API, which allows Rust to efficiently exchange buffers
with C or any other language.

```c
#include <stdio.h>   // stdout
#include "tendril.h"

// Assumed declaration: stands in for any function exported by a Rust
// library that takes ownership of a tendril.
void some_rust_library(tendril t);

int main(void) {
    tendril t = TENDRIL_INIT;
    tendril_sprintf(&t, "Hello, %d!\n", 2015);
    tendril_fwrite(&t, stdout);
    some_rust_library(t); // transfer ownership
    return 0;
}
```

See the [API documentation](https://github.com/kmcallister/tendril/blob/master/capi/include/tendril.h#L18)
and the [test program](https://github.com/kmcallister/tendril/blob/master/capi/ctest/test.c).

## Plans for the future

### Ropes

[html5ever][] will use `Tendril` as a zero-copy text representation. It would
be good to preserve this all the way through to Servo's DOM. This would reduce
memory consumption, and possibly speed up text shaping and painting. However,
DOM text may conceivably be larger than 4 GB, and will anyway not be contiguous
in memory around e.g. a character entity reference.

*Solution:* Build a **[rope][] on top of these strings** and use that as
Servo's representation of DOM text. We can perhaps do text shaping and/or
painting in parallel for different chunks of a rope. html5ever can additionally
use this rope type as a replacement for `BufferQueue`.

Because the underlying buffers are reference-counted, the bulk of this rope
is already a [persistent data structure][]. Consider what happens when
appending two ropes to get a "new" rope. A vector-backed rope would copy a
vector of small structs, one for each chunk, and would bump the corresponding
refcounts. But it would not copy any of the string data.

If we want more sharing, then a [2-3 finger tree][] could be a good choice.
We would probably stick with `VecDeque` for ropes under a certain size.
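The append behavior described above can be sketched with standard library
types. This is an assumption-laden illustration, not a proposed design: the
`Rope` type and its methods are hypothetical, and `Rc<str>` stands in for a
tendril as the reference-counted chunk.

```rust
use std::collections::VecDeque;
use std::rc::Rc;

/// A hypothetical rope: a sequence of reference-counted string chunks.
struct Rope {
    chunks: VecDeque<Rc<str>>,
}

impl Rope {
    fn from_chunks(chunks: &[&str]) -> Self {
        Rope {
            chunks: chunks.iter().map(|c| Rc::<str>::from(*c)).collect(),
        }
    }

    /// Builds a "new" rope holding the chunks of `self` followed by those of
    /// `other`. Only the small `Rc` handles are copied, and each copy bumps a
    /// refcount. No string data is copied.
    fn concat(&self, other: &Rope) -> Rope {
        Rope {
            chunks: self.chunks.iter().chain(other.chunks.iter()).cloned().collect(),
        }
    }

    /// Total length in bytes across all chunks.
    fn len(&self) -> usize {
        self.chunks.iter().map(|c| c.len()).sum()
    }

    /// Copies every chunk into one contiguous `String`.
    fn flatten(&self) -> String {
        self.chunks.iter().map(|c| &**c).collect()
    }
}
```

After `a.concat(&b)`, the result and `a` point at the same chunk allocations,
which is what makes the bulk of the structure persistent.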

### UTF-16 compatibility

SpiderMonkey expects text to be in UCS-2 format for the most part. The
semantics of JavaScript strings are difficult to implement on UTF-8. This also
applies to HTML parsing via `document.write`. Also, passing SpiderMonkey a
string that isn't contiguous in memory will incur additional overhead and
complexity, if not a full copy.

*Solution:* Use **WTF-8 in parsing** and in the DOM. Servo will **convert to
contiguous UTF-16 when necessary**. The conversion can easily be parallelized,
if we find a practical need to convert huge chunks of text all at once.
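As a rough sketch of the "convert to contiguous UTF-16" step, the function
below flattens a list of chunks into a single `Vec<u16>` using the standard
`str::encode_utf16`. The function name is hypothetical, and the sketch assumes
well-formed UTF-8 input (`&str`); it does not handle the lone surrogates that
WTF-8 permits.

```rust
/// Converts a sequence of UTF-8 chunks into one contiguous UTF-16 buffer
/// (hypothetical helper).
fn to_contiguous_utf16(chunks: &[&str]) -> Vec<u16> {
    // A UTF-8 sequence never produces more UTF-16 code units than it has
    // bytes, so the total byte length is a safe capacity to reserve.
    let mut out = Vec::with_capacity(chunks.iter().map(|c| c.len()).sum());
    for chunk in chunks {
        out.extend(chunk.encode_utf16());
    }
    out
}
```

Each chunk is encoded independently of the others, which is why the work could
in principle be split across threads.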

### Source span information

Some html5ever API consumers want to know the originating location in the HTML
source file(s) of each token or parse error. An example application would be a
command-line HTML validator with diagnostic output similar to `rustc`'s.

*Solution:* Accept **some metadata along with each input string**. The type of
metadata is chosen by the API consumer; it defaults to `()`, which has size
zero. For any non-inline string, we can provide the associated metadata as well
as a byte offset.

[NonZero]: http://doc.rust-lang.org/core/nonzero/struct.NonZero.html
[html5ever]: https://github.com/servo/html5ever
[WTF-8]: http://simonsapin.github.io/wtf-8/
[rope]: http://en.wikipedia.org/wiki/Rope_%28data_structure%29
[persistent data structure]: http://en.wikipedia.org/wiki/Persistent_data_structure
[2-3 finger tree]: http://staff.city.ac.uk/~ross/papers/FingerTree.html
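The source span proposal could look roughly like the sketch below. All names
(`Input`, `Span`, `locate`) are hypothetical and nothing here exists in the
library; the sketch only shows a metadata type parameter that defaults to `()`
and a lookup that returns the metadata together with a byte offset.

```rust
/// A hypothetical input chunk carrying caller-chosen metadata `M`.
/// `M` defaults to `()`, which occupies no space.
struct Input<M = ()> {
    text: String,
    meta: M,
}

/// Where a piece of text came from: the metadata of its input chunk plus a
/// byte offset into that chunk.
struct Span<'a, M> {
    meta: &'a M,
    offset: usize,
}

impl<M> Input<M> {
    fn new(text: &str, meta: M) -> Self {
        Input { text: text.to_string(), meta }
    }

    /// Finds the first occurrence of `needle` and reports where it came from.
    /// Returns `None` if `needle` does not occur in this chunk.
    fn locate(&self, needle: &str) -> Option<Span<'_, M>> {
        self.text
            .find(needle)
            .map(|offset| Span { meta: &self.meta, offset })
    }
}
```

A validator might pass a file name as the metadata, while a consumer that does
not need spans can use plain `Input` and pay nothing for the unused field.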