# prose [![Build Status](https://travis-ci.org/jdkato/prose.svg?branch=master)](https://travis-ci.org/jdkato/prose) [![Build status](https://ci.appveyor.com/api/projects/status/24bepq85nnnk4scr/branch/master?svg=true)](https://ci.appveyor.com/project/jdkato/prose/branch/master) [![GoDoc](https://godoc.org/github.com/golang/gddo?status.svg)](https://godoc.org/github.com/jdkato/prose) [![Coverage Status](https://coveralls.io/repos/github/jdkato/prose/badge.svg?branch=master)](https://coveralls.io/github/jdkato/prose?branch=master) [![Go Report Card](https://goreportcard.com/badge/github.com/jdkato/prose)](https://goreportcard.com/report/github.com/jdkato/prose) [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/avelino/awesome-go#natural-language-processing)

`prose` is a Go library for text processing (primarily English at the moment) that supports tokenization, part-of-speech tagging, named-entity extraction, and more. The library's functionality is split into subpackages designed for modular use.

See the [GoDoc documentation](https://godoc.org/github.com/jdkato/prose) for more information.

## Install

```console
$ go get github.com/jdkato/prose/...
```

> **NOTE**: When using some vendoring tools, such as `govendor`, you may need to include the `github.com/jdkato/prose/internal/` package in addition to the core package(s). See [#14](https://github.com/jdkato/prose/issues/14) for more information.

## Usage

### Contents

* [Tokenizing](#tokenizing-godoc)
* [Tagging](#tagging-godoc)
* [Transforming](#transforming-godoc)
* [Summarizing](#summarizing-godoc)
* [Chunking](#chunking-godoc)
* [License](#license)

### Tokenizing ([GoDoc](https://godoc.org/github.com/jdkato/prose/tokenize))

Word, sentence, and regexp tokenizers are available. Every tokenizer implements the [same interface](https://godoc.org/github.com/jdkato/prose/tokenize#ProseTokenizer), which makes it easy to customize tokenization in other parts of the library.

```go
package main

import (
	"fmt"

	"github.com/jdkato/prose/tokenize"
)

func main() {
	text := "They'll save and invest more."
	tokenizer := tokenize.NewTreebankWordTokenizer()
	// Tokens: [They 'll save and invest more .]
	for _, word := range tokenizer.Tokenize(text) {
		fmt.Println(word) // prints one token per line
	}
}
```
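
Because every tokenizer shares one interface, a custom tokenizer can stand in wherever a tokenizer is expected. The sketch below is self-contained and does not import `prose`: the local `Tokenizer` interface only mirrors what the `ProseTokenizer` interface is assumed to look like (a single `Tokenize(text string) []string` method), and `FieldsTokenizer` and `CountTokens` are hypothetical names used for illustration. Check the GoDoc page for the actual definition.

```go
package main

import (
	"fmt"
	"strings"
)

// Tokenizer mirrors the assumed shape of prose's ProseTokenizer interface:
// a single method that splits text into tokens.
type Tokenizer interface {
	Tokenize(text string) []string
}

// FieldsTokenizer is a hypothetical tokenizer that splits on whitespace only.
type FieldsTokenizer struct{}

// Tokenize implements Tokenizer using strings.Fields.
func (FieldsTokenizer) Tokenize(text string) []string {
	return strings.Fields(text)
}

// CountTokens accepts any Tokenizer, so implementations are interchangeable.
func CountTokens(t Tokenizer, text string) int {
	return len(t.Tokenize(text))
}

func main() {
	var t Tokenizer = FieldsTokenizer{}
	text := "They'll save and invest more."
	fmt.Println(t.Tokenize(text))     // [They'll save and invest more.]
	fmt.Println(CountTokens(t, text)) // 5
}
```

Unlike the Treebank tokenizer above, this whitespace splitter leaves contractions and trailing punctuation attached to their words, which is why it yields five tokens rather than seven.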

### Tagging ([GoDoc](https://godoc.org/github.com/jdkato/prose/tag))

The `tag` package includes a port of TextBlob's ["fast and accurate" POS tagger](https://github.com/sloria/textblob-aptagger). Below is a comparison of its performance against [NLTK](http://www.nltk.org/)'s implementation of the same tagger on the Treebank corpus:

| Library | Accuracy | 5-Run Average (sec) |
|:--------|---------:|--------------------:|
| NLTK    |    0.893 |               7.224 |
| `prose` |    0.961 |               2.538 |

(See [`scripts/test_model.py`](https://github.com/jdkato/aptag/blob/master/scripts/test_model.py) for more information.)

```go
package main

import (
	"fmt"

	"github.com/jdkato/prose/tag"
	"github.com/jdkato/prose/tokenize"
)

func main() {
	text := "A fast and accurate part-of-speech tagger for Golang."
	words := tokenize.NewTreebankWordTokenizer().Tokenize(text)

	tagger := tag.NewPerceptronTagger()
	for _, tok := range tagger.Tag(words) {
		fmt.Println(tok.Text, tok.Tag)
	}
}
```

### Transforming ([GoDoc](https://godoc.org/github.com/jdkato/prose/transform))

The `transform` package implements a number of functions for changing the case of strings, including `Title`, `Snake`, `Pascal`, and `Camel`.

Additionally, unlike `strings.Title`, `transform.Title` adheres to common guidelines, including styles for both the [AP Stylebook](https://www.apstylebook.com/) and [The Chicago Manual of Style](http://www.chicagomanualofstyle.org/home.html). You can also add your own custom style by defining an [`IgnoreFunc`](https://godoc.org/github.com/jdkato/prose/transform#IgnoreFunc) callback.

Inspiration and test data taken from [python-titlecase](https://github.com/ppannuto/python-titlecase) and [to-title-case](https://github.com/gouch/to-title-case).

```go
package main

import (
	"fmt"
	"strings"

	"github.com/jdkato/prose/transform"
)

func main() {
	text := "the last of the mohicans"
	tc := transform.NewTitleConverter(transform.APStyle)
	fmt.Println(strings.Title(text)) // The Last Of The Mohicans
	fmt.Println(tc.Title(text))      // The Last of the Mohicans
}
```
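
To give a feel for how a callback-driven title style can work, here is a minimal, self-contained sketch that does not import `prose`. The `ignoreFunc` type is modeled on what `transform.IgnoreFunc` is assumed to look like (a word plus a first-or-last flag, returning whether to leave the word alone); `houseStyle` and `title` are hypothetical names, and the real converter handles many more cases (punctuation, hyphens, non-ASCII text).

```go
package main

import (
	"fmt"
	"strings"
)

// ignoreFunc reports whether word should stay lowercase. firstOrLast is true
// for the first and last words of the title. The signature is an assumption
// modeled on transform.IgnoreFunc, not a copy of it.
type ignoreFunc func(word string, firstOrLast bool) bool

// houseStyle is a hypothetical style that keeps a few short words lowercase
// unless they begin or end the title.
func houseStyle(word string, firstOrLast bool) bool {
	if firstOrLast {
		return false
	}
	switch strings.ToLower(word) {
	case "a", "an", "and", "of", "the", "vs":
		return true
	}
	return false
}

// title capitalizes each whitespace-separated word in s unless ignore says
// to leave it lowercase. It assumes ASCII input for simplicity.
func title(s string, ignore ignoreFunc) string {
	words := strings.Fields(s)
	for i, w := range words {
		if ignore(w, i == 0 || i == len(words)-1) {
			words[i] = strings.ToLower(w)
			continue
		}
		words[i] = strings.ToUpper(w[:1]) + w[1:]
	}
	return strings.Join(words, " ")
}

func main() {
	fmt.Println(title("the last of the mohicans", houseStyle)) // The Last of the Mohicans
}
```

Keeping the style decision in a callback means the capitalization loop never changes; only the rule for which words to skip does.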

### Summarizing ([GoDoc](https://godoc.org/github.com/jdkato/prose/summarize))

The `summarize` package includes functions for computing standard readability and usage statistics. It's among the most accurate implementations available because it relies on proper tokenizers, whereas others, like [readability-score](https://github.com/DaveChild/Text-Statistics/blob/master/src/DaveChild/TextStatistics/Text.php#L308), rely on naive regular expressions.

It also includes a TL;DR algorithm for condensing text into a user-indicated number of paragraphs.

```go
package main

import (
	"fmt"

	"github.com/jdkato/prose/summarize"
)

func main() {
	doc := summarize.NewDocument("This is some interesting text.")
	fmt.Println(doc.SMOG(), doc.FleschKincaid())
}
```
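
Readability scores such as these are simple arithmetic over word, sentence, and syllable counts, which is why the quality of the tokenizers matters so much. The sketch below uses the standard published Flesch-Kincaid grade-level and SMOG formulas; whether `summarize` uses exactly these constants is an assumption, and the counts passed in `main` are made up for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// fleschKincaid returns the Flesch-Kincaid grade level from raw counts.
func fleschKincaid(words, sentences, syllables float64) float64 {
	return 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
}

// smog returns the SMOG grade from the number of polysyllabic words
// (three or more syllables) and the number of sentences.
func smog(polysyllables, sentences float64) float64 {
	return 1.0430*math.Sqrt(polysyllables*30.0/sentences) + 3.1291
}

func main() {
	// Hypothetical counts; a real implementation would obtain them from
	// word, sentence, and syllable tokenizers.
	fmt.Printf("%.2f\n", fleschKincaid(100, 5, 150)) // 9.91
	fmt.Printf("%.2f\n", smog(30, 25))               // 9.39
}
```

A miscounted sentence boundary or syllable shifts every term in these formulas, so small tokenization errors translate directly into different grade levels.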

### Chunking ([GoDoc](https://godoc.org/github.com/jdkato/prose/chunk))

The `chunk` package extracts named entities from pre-tagged input using a regular expression that describes the chunks you're looking for.

```go
package main

import (
	"fmt"

	"github.com/jdkato/prose/chunk"
	"github.com/jdkato/prose/tag"
	"github.com/jdkato/prose/tokenize"
)

func main() {
	words := tokenize.TextToWords("Go is an open source programming language created at Google.")
	regex := chunk.TreebankNamedEntities

	tagger := tag.NewPerceptronTagger()
	// Entities: [Go Google]
	for _, entity := range chunk.Chunk(tagger.Tag(words), regex) {
		fmt.Println(entity) // prints one entity per line
	}
}
```
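
One way to picture regex-based chunking is to match the pattern against the sequence of tags rather than the words, then map each match back to the words it covers. The following self-contained sketch illustrates that idea only; it is not the `chunk` package's implementation. The `token` type, the `chunk` function, and the `properNouns` pattern are hypothetical, and the input is hand-tagged with Penn Treebank tags.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// token pairs a word with its part-of-speech tag (hypothetical type).
type token struct {
	Text, Tag string
}

// properNouns matches one or more consecutive proper-noun tags (NNP or NNPS),
// each followed by the single space that chunk appends after every tag.
var properNouns = regexp.MustCompile(`(?:\bNNPS? )+`)

// chunk returns the runs of words whose tag sequence matches re. Tags are
// joined into one string such as "NNP VBZ DT ", and each match is mapped
// back to token indices by counting the spaces that precede its bounds.
func chunk(tokens []token, re *regexp.Regexp) []string {
	var tags strings.Builder
	for _, t := range tokens {
		tags.WriteString(t.Tag + " ")
	}
	s := tags.String()

	var chunks []string
	for _, loc := range re.FindAllStringIndex(s, -1) {
		start := strings.Count(s[:loc[0]], " ")
		end := strings.Count(s[:loc[1]], " ")
		var words []string
		for _, t := range tokens[start:end] {
			words = append(words, t.Text)
		}
		chunks = append(chunks, strings.Join(words, " "))
	}
	return chunks
}

func main() {
	// Hand-tagged input; in practice a POS tagger would produce these tags.
	tokens := []token{
		{"Go", "NNP"}, {"is", "VBZ"}, {"an", "DT"}, {"open", "JJ"},
		{"source", "NN"}, {"programming", "NN"}, {"language", "NN"},
		{"created", "VBN"}, {"at", "IN"}, {"Google", "NNP"}, {".", "."},
	}
	fmt.Println(chunk(tokens, properNouns)) // [Go Google]
}
```

Adjacent proper nouns fall into a single match, so a hand-tagged "New/NNP York/NNP" would come back as one chunk, "New York".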

## License

If not otherwise specified (see below), the source files are distributed under the MIT License found in the [LICENSE](https://github.com/jdkato/prose/blob/master/LICENSE) file.

Additionally, the following files contain their own license information:

- [`tag/aptag.go`](https://github.com/jdkato/prose/blob/master/tag/aptag.go): MIT © Matthew Honnibal.
- [`tokenize/punkt.go`](https://github.com/jdkato/prose/blob/master/tokenize/punkt.go): MIT © Eric Bower.
- [`tokenize/pragmatic.go`](https://github.com/jdkato/prose/blob/master/tokenize/pragmatic.go): MIT © Kevin S. Dias.