| Name | Date | Size |
|:-----|:-----|-----:|
| .. | 03-May-2022 | - |
| chunk/ | 22-Dec-2020 | - |
| internal/ | 22-Dec-2020 | - |
| scripts/ | 22-Dec-2020 | - |
| summarize/ | 22-Dec-2020 | - |
| tag/ | 22-Dec-2020 | - |
| testdata/ | 03-May-2022 | - |
| tokenize/ | 22-Dec-2020 | - |
| transform/ | 22-Dec-2020 | - |
| .codeclimate.yml | 22-Dec-2020 | 211 |
| .gitignore | 22-Dec-2020 | 287 |
| .travis.yml | 22-Dec-2020 | 790 |
| AUTHORS.md | 22-Dec-2020 | 85 |
| LICENSE | 22-Dec-2020 | 1 KiB |
| Makefile | 22-Dec-2020 | 1.5 KiB |
| README.md | 22-Dec-2020 | 6.4 KiB |
| appveyor.yml | 22-Dec-2020 | 386 |
| doc.go | 22-Dec-2020 | 167 |
| go.mod | 22-Dec-2020 | 287 |
| go.sum | 22-Dec-2020 | 1.8 KiB |

README.md

# prose [![Build Status](https://travis-ci.org/jdkato/prose.svg?branch=master)](https://travis-ci.org/jdkato/prose) [![Build status](https://ci.appveyor.com/api/projects/status/24bepq85nnnk4scr/branch/master?svg=true)](https://ci.appveyor.com/project/jdkato/prose/branch/master) [![GoDoc](https://godoc.org/github.com/golang/gddo?status.svg)](https://godoc.org/github.com/jdkato/prose) [![Coverage Status](https://coveralls.io/repos/github/jdkato/prose/badge.svg?branch=master)](https://coveralls.io/github/jdkato/prose?branch=master) [![Go Report Card](https://goreportcard.com/badge/github.com/jdkato/prose)](https://goreportcard.com/report/github.com/jdkato/prose) [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/avelino/awesome-go#natural-language-processing)

`prose` is a Go library for text processing (primarily English at the moment) that supports tokenization, part-of-speech tagging, named-entity extraction, and more. The library's functionality is split into subpackages designed for modular use.

See the [GoDoc documentation](https://godoc.org/github.com/jdkato/prose) for more information.

## Install

```console
$ go get github.com/jdkato/prose/...
```

> **NOTE**: When using some vendoring tools, such as `govendor`, you may need to include the `github.com/jdkato/prose/internal/` package in addition to the core package(s). See [#14](https://github.com/jdkato/prose/issues/14) for more information.

## Usage

### Contents

* [Tokenizing](#tokenizing-godoc)
* [Tagging](#tagging-godoc)
* [Transforming](#transforming-godoc)
* [Summarizing](#summarizing-godoc)
* [Chunking](#chunking-godoc)
* [License](#license)

### Tokenizing ([GoDoc](https://godoc.org/github.com/jdkato/prose/tokenize))

Word, sentence, and regexp tokenizers are available. Every tokenizer implements the [same interface](https://godoc.org/github.com/jdkato/prose/tokenize#ProseTokenizer), which makes it easy to customize tokenization in other parts of the library.

```go
package main

import (
    "fmt"

    "github.com/jdkato/prose/tokenize"
)

func main() {
    text := "They'll save and invest more."
    tokenizer := tokenize.NewTreebankWordTokenizer()
    // Tokens: [They 'll save and invest more .]
    for _, word := range tokenizer.Tokenize(text) {
        fmt.Println(word) // prints one token per line
    }
}
```
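
To illustrate why a shared interface helps, here is a minimal sketch that uses only the standard library. The names `Tokenizer`, `regexpTokenizer`, `newWordTokenizer`, and `countTokens` are hypothetical and are not part of `prose`; the regular expression is a simple stand-in, so its output differs from the Treebank tokenizer's.

```go
package main

import (
    "fmt"
    "regexp"
)

// Tokenizer is a hypothetical stand-in for a shared tokenizer interface:
// anything that can split text into string tokens.
type Tokenizer interface {
    Tokenize(text string) []string
}

// regexpTokenizer returns every non-overlapping match of its pattern.
type regexpTokenizer struct {
    pattern *regexp.Regexp
}

func (t regexpTokenizer) Tokenize(text string) []string {
    return t.pattern.FindAllString(text, -1)
}

// newWordTokenizer matches runs of word characters or single punctuation marks.
func newWordTokenizer() Tokenizer {
    return regexpTokenizer{pattern: regexp.MustCompile(`\w+|[^\w\s]`)}
}

// countTokens accepts any Tokenizer, so callers can swap implementations freely.
func countTokens(t Tokenizer, text string) int {
    return len(t.Tokenize(text))
}

func main() {
    text := "They'll save and invest more."
    tokenizer := newWordTokenizer()
    fmt.Println(tokenizer.Tokenize(text))     // [They ' ll save and invest more .]
    fmt.Println(countTokens(tokenizer, text)) // 8
}
```

Code written against the interface, like `countTokens` here, does not need to change when a different tokenizer is substituted.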

### Tagging ([GoDoc](https://godoc.org/github.com/jdkato/prose/tag))

The `tag` package includes a port of TextBlob's ["fast and accurate" POS tagger](https://github.com/sloria/textblob-aptagger). Below is a comparison of its performance against [NLTK](http://www.nltk.org/)'s implementation of the same tagger on the Treebank corpus:

| Library | Accuracy | 5-Run Average (sec) |
|:--------|---------:|--------------------:|
| NLTK    |    0.893 |               7.224 |
| `prose` |    0.961 |               2.538 |

(See [`scripts/test_model.py`](https://github.com/jdkato/aptag/blob/master/scripts/test_model.py) for more information.)

```go
package main

import (
    "fmt"

    "github.com/jdkato/prose/tag"
    "github.com/jdkato/prose/tokenize"
)

func main() {
    text := "A fast and accurate part-of-speech tagger for Golang."
    words := tokenize.NewTreebankWordTokenizer().Tokenize(text)

    tagger := tag.NewPerceptronTagger()
    // Print each token alongside its part-of-speech tag.
    for _, tok := range tagger.Tag(words) {
        fmt.Println(tok.Text, tok.Tag)
    }
}
```

### Transforming ([GoDoc](https://godoc.org/github.com/jdkato/prose/transform))

The `transform` package implements a number of functions for changing the case of strings, including `Title`, `Snake`, `Pascal`, and `Camel`.

Additionally, unlike `strings.Title`, `transform.Title` adheres to common title-casing guidelines, including styles for both the [AP Stylebook](https://www.apstylebook.com/) and [The Chicago Manual of Style](http://www.chicagomanualofstyle.org/home.html). You can also add your own custom style by defining an [`IgnoreFunc`](https://godoc.org/github.com/jdkato/prose/transform#IgnoreFunc) callback.

Inspiration and test data taken from [python-titlecase](https://github.com/ppannuto/python-titlecase) and [to-title-case](https://github.com/gouch/to-title-case).

```go
package main

import (
    "fmt"
    "strings"

    "github.com/jdkato/prose/transform"
)

func main() {
    text := "the last of the mohicans"
    tc := transform.NewTitleConverter(transform.APStyle)
    fmt.Println(strings.Title(text)) // The Last Of The Mohicans
    fmt.Println(tc.Title(text))      // The Last of the Mohicans
}
```
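
The idea of a style defined by an ignore callback can be sketched with the standard library alone. Everything below is an assumption made for illustration: `ignoreFunc`, `smallWords`, `skipSmallWords`, and `titleCase` are hypothetical names, the word list is far shorter than any real style guide, and the callback signature is not guaranteed to match the library's `IgnoreFunc` (check the GoDoc for that).

```go
package main

import (
    "fmt"
    "strings"
    "unicode"
)

// ignoreFunc is a hypothetical callback: it reports whether a word should stay
// in lower case. firstOrLast is true for the first and last words of the title.
type ignoreFunc func(word string, firstOrLast bool) bool

// smallWords is an illustrative list only, not a complete style guide.
var smallWords = map[string]bool{
    "a": true, "an": true, "and": true, "of": true, "the": true, "to": true,
}

// skipSmallWords leaves small words lower-cased unless they open or close the title.
func skipSmallWords(word string, firstOrLast bool) bool {
    return !firstOrLast && smallWords[word]
}

// titleCase lower-cases the text, then capitalizes the first letter of every
// whitespace-separated word that the ignore callback does not skip.
func titleCase(text string, ignore ignoreFunc) string {
    words := strings.Fields(strings.ToLower(text))
    for i, w := range words {
        if ignore(w, i == 0 || i == len(words)-1) {
            continue
        }
        r := []rune(w)
        r[0] = unicode.ToUpper(r[0])
        words[i] = string(r)
    }
    return strings.Join(words, " ")
}

func main() {
    fmt.Println(titleCase("the last of the mohicans", skipSmallWords)) // The Last of the Mohicans
}
```

Passing a different callback changes the style without touching `titleCase` itself.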

### Summarizing ([GoDoc](https://godoc.org/github.com/jdkato/prose/summarize))

The `summarize` package includes functions for computing standard readability and usage statistics. It's among the most accurate implementations available because it relies on proper tokenizers (whereas others, like [readability-score](https://github.com/DaveChild/Text-Statistics/blob/master/src/DaveChild/TextStatistics/Text.php#L308), rely on naive regular expressions).

It also includes a TL;DR algorithm for condensing text into a user-indicated number of paragraphs.

```go
package main

import (
    "fmt"

    "github.com/jdkato/prose/summarize"
)

func main() {
    doc := summarize.NewDocument("This is some interesting text.")
    // SMOG and Flesch-Kincaid are readability scores for the document.
    fmt.Println(doc.SMOG(), doc.FleschKincaid())
}
```
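
Readability scores are simple functions of word, sentence, and syllable counts, which is why tokenizer accuracy matters. The sketch below applies the commonly published Flesch-Kincaid grade-level and SMOG formulas to counts supplied by hand. The function names and sample counts are hypothetical, and it is an assumption that the library's results match these formulas exactly.

```go
package main

import (
    "fmt"
    "math"
)

// fleschKincaid applies the published Flesch-Kincaid grade-level formula
// to pre-computed counts.
func fleschKincaid(words, sentences, syllables float64) float64 {
    return 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
}

// smog applies the published SMOG grade formula, where polysyllables is the
// number of words with three or more syllables.
func smog(polysyllables, sentences float64) float64 {
    return 1.0430*math.Sqrt(polysyllables*30/sentences) + 3.1291
}

func main() {
    // Hypothetical counts: 100 words, 5 sentences, 150 syllables, 10 polysyllables.
    fmt.Printf("%.2f\n", fleschKincaid(100, 5, 150)) // 9.91
    fmt.Printf("%.2f\n", smog(10, 5))                // 11.21
}
```

A miscounted sentence boundary or word changes every ratio in these formulas, so the quality of the counts drives the quality of the score.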

### Chunking ([GoDoc](https://godoc.org/github.com/jdkato/prose/chunk))

The `chunk` package implements named-entity extraction from pre-tagged input, using a regular expression that describes the chunks you're looking for.

```go
package main

import (
    "fmt"

    "github.com/jdkato/prose/chunk"
    "github.com/jdkato/prose/tag"
    "github.com/jdkato/prose/tokenize"
)

func main() {
    words := tokenize.TextToWords("Go is an open source programming language created at Google.")
    regex := chunk.TreebankNamedEntities

    tagger := tag.NewPerceptronTagger()
    // Entities found: [Go Google]
    for _, entity := range chunk.Chunk(tagger.Tag(words), regex) {
        fmt.Println(entity) // prints one entity per line
    }
}
```
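
Matching a regular expression against tags rather than words can be sketched as follows, using only the standard library. This is not the library's implementation: the `token` type, the `properNouns` pattern, and `chunkByTags` are hypothetical, and the tags in `main` are assigned by hand rather than produced by a tagger.

```go
package main

import (
    "fmt"
    "regexp"
    "strings"
)

// token pairs a word with its part-of-speech tag (hypothetical type for this sketch).
type token struct {
    Text, Tag string
}

// properNouns matches one or more consecutive tags that start with NNP.
// Tags are encoded as "TAG " so a match always ends on a tag boundary.
var properNouns = regexp.MustCompile(`(?:NNP\S* )+`)

// chunkByTags joins the words whose tag sequence matches the pattern.
func chunkByTags(tokens []token, pattern *regexp.Regexp) []string {
    var sb strings.Builder
    starts := make(map[int]int) // byte offset of a tag -> token index
    for i, t := range tokens {
        starts[sb.Len()] = i
        sb.WriteString(t.Tag + " ")
    }
    encoded := sb.String()

    var chunks []string
    for _, loc := range pattern.FindAllStringIndex(encoded, -1) {
        first, ok := starts[loc[0]]
        if !ok {
            continue // the match began in the middle of a tag
        }
        n := strings.Count(encoded[loc[0]:loc[1]], " ") // tags covered by the match
        words := make([]string, 0, n)
        for _, t := range tokens[first : first+n] {
            words = append(words, t.Text)
        }
        chunks = append(chunks, strings.Join(words, " "))
    }
    return chunks
}

func main() {
    // Hand-assigned tags for illustration only.
    tokens := []token{
        {"Go", "NNP"}, {"is", "VBZ"}, {"an", "DT"}, {"open", "JJ"},
        {"source", "NN"}, {"programming", "NN"}, {"language", "NN"},
        {"created", "VBN"}, {"at", "IN"}, {"Google", "NNP"}, {".", "."},
    }
    fmt.Println(chunkByTags(tokens, properNouns)) // [Go Google]
}
```

Because the pattern runs over the tag sequence, adjacent proper nouns such as "New York" come back as a single chunk.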

## License

If not otherwise specified (see below), the source files are distributed under the MIT License found in the [LICENSE](https://github.com/jdkato/prose/blob/master/LICENSE) file.

Additionally, the following files contain their own license information:

- [`tag/aptag.go`](https://github.com/jdkato/prose/blob/master/tag/aptag.go): MIT © Matthew Honnibal.
- [`tokenize/punkt.go`](https://github.com/jdkato/prose/blob/master/tokenize/punkt.go): MIT © Eric Bower.
- [`tokenize/pragmatic.go`](https://github.com/jdkato/prose/blob/master/tokenize/pragmatic.go): MIT © Kevin S. Dias.