• Home
  • History
  • Annotate
Name Date Size #Lines LOC

..03-May-2022-

cmd/vellum/H06-Sep-2018-

data/H03-May-2022-

docs/H03-May-2022-

levenshtein/H06-Sep-2018-

regexp/H06-Sep-2018-

utf8/H06-Sep-2018-

vendor/H06-Sep-2018-

.travis.ymlH A D06-Sep-2018458

CONTRIBUTING.mdH A D06-Sep-2018995

LICENSEH A D06-Sep-201811.1 KiB

README.mdH A D06-Sep-20187.1 KiB

automaton.goH A D06-Sep-20182.2 KiB

builder.goH A D06-Sep-201810 KiB

builder_test.goH A D06-Sep-20185 KiB

common.goH A D06-Sep-20186.4 KiB

common_test.goH A D06-Sep-20181.3 KiB

decoder_v1.goH A D06-Sep-20187.2 KiB

decoder_v1_test.goH A D06-Sep-201812.1 KiB

encoder_v1.goH A D06-Sep-20185.2 KiB

encoder_v1_test.goH A D06-Sep-20188.5 KiB

encoding.goH A D06-Sep-20182.3 KiB

example_test.goH A D06-Sep-20181.5 KiB

fst.goH A D06-Sep-20186.7 KiB

fst_iterator.goH A D06-Sep-20187.7 KiB

fst_iterator_test.goH A D06-Sep-201814.3 KiB

merge_iterator.goH A D06-Sep-20184.8 KiB

merge_iterator_test.goH A D06-Sep-20185.3 KiB

pack.goH A D06-Sep-20181.3 KiB

pack_test.goH A D06-Sep-20181.3 KiB

registry.goH A D06-Sep-20182.6 KiB

registry_test.goH A D06-Sep-20181.3 KiB

transducer.goH A D06-Sep-20181.8 KiB

vellum.goH A D06-Sep-20183.3 KiB

vellum_mmap.goH A D06-Sep-20181.3 KiB

vellum_nommap.goH A D06-Sep-2018790

vellum_test.goH A D06-Sep-201811.8 KiB

writer.goH A D06-Sep-20181.9 KiB

writer_test.goH A D06-Sep-20182 KiB

README.md

1# ![vellum](docs/logo.png) vellum
2
3[![Build Status](https://travis-ci.org/couchbase/vellum.svg?branch=master)](https://travis-ci.org/couchbase/vellum)
4[![Coverage Status](https://coveralls.io/repos/github/couchbase/vellum/badge.svg?branch=master)](https://coveralls.io/github/couchbase/vellum?branch=master)
5[![GoDoc](https://godoc.org/github.com/couchbase/vellum?status.svg)](https://godoc.org/github.com/couchbase/vellum)
6[![Go Report Card](https://goreportcard.com/badge/github.com/couchbase/vellum)](https://goreportcard.com/report/github.com/couchbase/vellum)
7[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
8
9A Go library implementing an FST (finite state transducer) capable of:
10  - mapping between keys ([]byte) and a value (uint64)
11  - enumerating keys in lexicographic order
12
13Some additional goals of this implementation:
14 - bounded memory use while building the FST
15 - streaming out FST data while building
16 - mmap FST runtime to support very large FTSs (optional)
17
18## Usage
19
20### Building an FST
21
22To build an FST, create a new builder using the `New()` method.  This method takes an `io.Writer` as an argument.  As the FST is being built, data will be streamed to the writer as soon as possible.  With this builder you **MUST** insert keys in lexicographic order.  Inserting keys out of order will result in an error.  After inserting the last key into the builder, you **MUST** call `Close()` on the builder.  This will flush all remaining data to the underlying writer.
23
24In memory:
25```go
26  var buf bytes.Buffer
27  builder, err := vellum.New(&buf, nil)
28  if err != nil {
29    log.Fatal(err)
30  }
31```
32
33To disk:
34```go
35  f, err := os.Create("/tmp/vellum.fst")
36  if err != nil {
37    log.Fatal(err)
38  }
39  builder, err := vellum.New(f, nil)
40  if err != nil {
41    log.Fatal(err)
42  }
43```
44
45**MUST** insert keys in lexicographic order:
46```go
47err = builder.Insert([]byte("cat"), 1)
48if err != nil {
49  log.Fatal(err)
50}
51
52err = builder.Insert([]byte("dog"), 2)
53if err != nil {
54  log.Fatal(err)
55}
56
57err = builder.Insert([]byte("fish"), 3)
58if err != nil {
59  log.Fatal(err)
60}
61
62err = builder.Close()
63if err != nil {
64  log.Fatal(err)
65}
66```
67
68### Using an FST
69
70After closing the builder, the data can be used to instantiate an FST.  If the data was written to disk, you can use the `Open()` method to mmap the file.  If the data is already in memory, or you wish to load/mmap the data yourself, you can instantiate the FST with the `Load()` method.
71
72Load in memory:
73```go
74  fst, err := vellum.Load(buf.Bytes())
75  if err != nil {
76    log.Fatal(err)
77  }
78```
79
80Open from disk:
81```go
82  fst, err := vellum.Open("/tmp/vellum.fst")
83  if err != nil {
84    log.Fatal(err)
85  }
86```
87
88Get key/value:
89```go
90  val, exists, err = fst.Get([]byte("dog"))
91  if err != nil {
92    log.Fatal(err)
93  }
94  if exists {
95    fmt.Printf("contains dog with val: %d\n", val)
96  } else {
97    fmt.Printf("does not contain dog")
98  }
99```
100
101Iterate key/values:
102```go
103  itr, err := fst.Iterator(startKeyInclusive, endKeyExclusive)
104  for err == nil {
105    key, val := itr.Current()
106    fmt.Printf("contains key: %s val: %d", key, val)
107    err = itr.Next()
108  }
109  if err != nil {
110    log.Fatal(err)
111  }
112```
113
114### How does the FST get built?
115
116A full example of the implementation is beyond the scope of this README, but let's consider a small example where we want to insert 3 key/value pairs.
117
118First we insert "are" with the value 4.
119
120![step1](docs/demo1.png)
121
122Next, we insert "ate" with the value 2.
123
124![step2](docs/demo2.png)
125
126Notice how the values associated with the transitions were adjusted so that by summing them while traversing we still get the expected value.
127
128At this point, we see that state 5 looks like state 3, and state 4 looks like state 2.  But, we cannot yet combine them because future inserts could change this.
129
130Now, we insert "see" with value 3.  Once it has been added, we now know that states 5 and 4 can longer change.  Since they are identical to 3 and 2, we replace them.
131
132![step3](docs/demo3.png)
133
134Again, we see that states 7 and 8 appear to be identical to 2 and 3.
135
136Having inserted our last key, we call `Close()` on the builder.
137
138![step4](docs/demo4.png)
139
140Now, states 7 and 8 can safely be replaced with 2 and 3.
141
142For additional information, see the references at the bottom of this document.
143
144### What does the serialized format look like?
145
146We've broken out a separate document on the [vellum disk format v1](docs/format.md).
147
148### What if I want to use this on a system that doesn't have mmap?
149
150The mmap library itself is guarded with system/architecture build tags, but we've also added an additional build tag in vellum.  If you'd like to Open() a file based representation of an FST, but not use mmap, you can build the library with the `nommap` build tag.  NOTE: if you do this, the entire FST will be read into memory.
151
152### Can I use this with Unicode strings?
153
154Yes, however this implementation is only aware of the byte representation you choose.  In order to find matches, you must work with some canonical byte representation of the string.  In the future, some encoding-aware traversals may be possible on top of the lower-level byte transitions.
155
156### How did this library come to be?
157
158In my work on the [Bleve](https://github.com/blevesearch/bleve) project I became aware of the power of the FST for many search-related tasks.  The obvious starting point for such a thing in Go was the [mafsa](https://github.com/smartystreets/mafsa) project.  While working with mafsa I encountered some issues.  First, it did not stream data to disk while building.  Second, it chose to use a rune as the fundamental unit of transition in the FST, but I felt using a byte would be more powerful in the end.  My hope is that higher-level encoding-aware traversals will be possible when necessary.  Finally, as I reported bugs and submitted PRs I learned that the mafsa project was mainly a research project and no longer being maintained.  I wanted to build something that could be used in production.  As the project advanced more and more techniques from the [BurntSushi/fst](https://github.com/BurntSushi/fst) were adapted to our implementation.
159
160### Are there tools to work with vellum files?
161
162Under the cmd/vellum subdirectory, there's a command-line tool which
163features subcommands that can allow you to create, inspect and query
164vellum files.
165
166### How can I generate a state transition diagram from a vellum file?
167
168The vellum command-line tool has a "dot" subcommand that can emit
169graphviz dot output data from an input vellum file.  The dot file can
170in turn be converted into an image using graphviz tools.  Example...
171
172    $ vellum dot myFile.vellum > output.dot
173    $ dot -Tpng output.dot -o output.png
174
175## Related Work
176
177Much credit goes to two existing projects:
178 - [mafsa](https://github.com/smartystreets/mafsa)
179 - [BurntSushi/fst](https://github.com/BurntSushi/fst)
180
181Most of the original implementation here started with my digging into the internals of mafsa.  As the implementation progressed, I continued to borrow ideas/approaches from the BurntSushi/fst library as well.
182
183For a great introduction to this topic, please read the blog post [Index 1,600,000,000 Keys with Automata and Rust](http://blog.burntsushi.net/transducers/)
184