| Name | Date | Size | #Lines | LOC |
|------|------|------|--------|-----|
| .. | 03-May-2022 | - | | |
| .github/workflows/ | 17-Dec-2021 | - | 141 | 132 |
| config/ | 17-Dec-2021 | - | 83 | 76 |
| docs/ | 03-May-2022 | - | 2,022 | 1,506 |
| include/ | 17-Dec-2021 | - | 645 | 401 |
| m4/ | 17-Dec-2021 | - | 257 | 234 |
| src/ | 17-Dec-2021 | - | 4,641 | 4,110 |
| tests/ | 03-May-2022 | - | 15,691 | 14,619 |
| .gitignore | 17-Dec-2021 | 1.7 KiB | 166 | 145 |
| AUTHORS | 17-Dec-2021 | 68 | 5 | 4 |
| COPYING | 17-Dec-2021 | 34.3 KiB | 675 | 553 |
| Makefile.am | 17-Dec-2021 | 281 | 12 | 7 |
| NEWS | 17-Dec-2021 | 15.3 KiB | 532 | 434 |
| README | 17-Dec-2021 | 42 | 2 | 1 |
| README.md | 17-Dec-2021 | 5.6 KiB | 112 | 83 |
| TODO | 17-Dec-2021 | 84 | 4 | 2 |
| bootstrap.sh | 17-Dec-2021 | 2.7 KiB | 96 | 39 |
| codemeta.json | 17-Dec-2021 | 3.5 KiB | 115 | 113 |
| configure.ac | 17-Dec-2021 | 3.3 KiB | 145 | 116 |
| dox.cfg | 17-Dec-2021 | 106 KiB | 2,495 | 1,940 |
| ucto-icu.pc.in | 17-Dec-2021 | 267 | 11 | 9 |
| ucto.pc.in | 17-Dec-2021 | 214 | 12 | 10 |

README

Please see README.md for more information

README.md

[![GitHub build](https://github.com/LanguageMachines/ucto/actions/workflows/ucto.yml/badge.svg?branch=master)](https://github.com/LanguageMachines/ucto/actions/)
[![Language Machines Badge](http://applejack.science.ru.nl/lamabadge.php/ucto)](http://applejack.science.ru.nl/languagemachines/)
[![DOI](https://zenodo.org/badge/9028617.svg)](https://zenodo.org/badge/latestdoi/9028617)
[![GitHub release](https://img.shields.io/github/release/LanguageMachines/ucto.svg)](https://GitHub.com/LanguageMachines/ucto/releases/)
[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)

Ucto - A rule-based tokeniser
================================

    Centre for Language and Speech technology, Radboud University Nijmegen
    Induction of Linguistic Knowledge Research Group, Tilburg University

**Website**: https://languagemachines.github.io/ucto/

Ucto tokenizes text files: it separates words from punctuation and splits
the text into sentences. This is one of the first tasks for almost any Natural
Language Processing application. Ucto also offers several other basic
preprocessing steps, such as case conversion, that you can use to make your
text suitable for further processing such as indexing, part-of-speech tagging,
or machine translation.

Ucto comes with tokenisation rules for several languages (packaged separately)
and can be easily extended to suit other languages. It has been incorporated
for tokenizing Dutch text in Frog (https://languagemachines.github.io/frog),
our Dutch morpho-syntactic processor.

The software is intended to be used from the command line by researchers in
Natural Language Processing or related areas, as well as software developers.
An [Ucto Python binding](https://github.com/proycon/python-ucto) is also available
separately.

Features:

- Comes with tokenization rules for English, Dutch, French, Italian, Turkish,
  Spanish, Portuguese and Swedish; easily extendible to other languages. Rules
  consist of regular expressions and lists. They are
  packaged separately as [uctodata](https://github.com/LanguageMachines/uctodata).
- Recognizes units, currencies, abbreviations, and simple dates and times like dd-mm-yyyy.
- Recognizes paired quote spans, sentences, and paragraphs.
- Produces UTF-8 encoded output with NFC normalization, optionally accepting
  other input encodings as well.
- Ligature normalization (can, for instance, undo fi and fl encoded as single codepoints).
- Optional conversion to all lowercase or uppercase.
- Supports [FoLiA XML](https://proycon.github.io/folia).

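To make the normalization features above concrete, here is a minimal Python sketch that uses only the standard `unicodedata` module. It illustrates what NFC normalization and ligature expansion mean at the codepoint level; it is not Ucto's implementation, and using NFKC to expand ligatures is an assumption made for this example (Ucto's own ligature handling may differ).

```python
import unicodedata

# "e" followed by a combining acute accent: two separate codepoints.
decomposed = "cafe\u0301"
# NFC composes them into the single precomposed codepoint U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# U+FB01 is the "fi" ligature stored as a single codepoint.
ligature_word = "\ufb01le"
# NFC leaves the ligature untouched; the compatibility form NFKC expands it.
nfc_word = unicodedata.normalize("NFC", ligature_word)
nfkc_word = unicodedata.normalize("NFKC", ligature_word)

print(len(decomposed), len(composed))        # 5 4
print(nfc_word == ligature_word, nfkc_word)  # True file
```

The point of the example is that NFC alone does not undo ligatures, which is why ligature normalization is listed as a separate feature.
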
Ucto was written by Maarten van Gompel and Ko van der Sloot. Work on Ucto was
funded by NWO, the Netherlands Organisation for Scientific Research, under the
Implicit Linguistics project, the CLARIN-NL program, and the CLARIAH project.

This software is available under the GNU General Public License v3 (see the file
COPYING).

Installation
------------------------------------------------------------

To install ucto, first check whether your distribution's package manager has an up-to-date package for it.
If not, ucto and all its dependencies can be installed easily through our software
distribution [LaMachine](https://proycon.github.io/LaMachine), which includes it.

To compile and install manually from source, provided you have all the
dependencies installed:

    $ bash bootstrap.sh
    $ ./configure
    $ make
    $ sudo make install

You will need current versions of the following dependencies of our software:

* [ticcutils](https://github.com/LanguageMachines/ticcutils) - A shared utility library
* [libfolia](https://github.com/LanguageMachines/libfolia)  - A library for the FoLiA format.
* [uctodata](https://github.com/LanguageMachines/uctodata)  - Data files for ucto, packaged separately

You will also need the following third-party dependencies:

* ``icu`` - A C++ library for Unicode and Globalization support. On Debian/Ubuntu systems, install the package libicu-dev.
* ``libxml2`` - An XML library. On Debian/Ubuntu systems, install the package libxml2-dev.
* ``libtextcat`` - A language detection package. On Debian/Ubuntu systems it is called libexttextcat-dev.
* A sane build environment with a C++ compiler (e.g. gcc 4.9 or above, or clang), autotools, libtool, and pkg-config.

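Before running the build steps, it can help to check that the build tools mentioned above are on your `PATH`. The following Python sketch does that with the standard library. The executable names in `BUILD_TOOLS` are assumptions (they vary between distributions; for example, libtool may be installed as `libtoolize`), and the check says nothing about the library dependencies themselves.

```python
import shutil
from typing import Iterable, List

# Assumed executable names for the build tools; adjust for your system.
BUILD_TOOLS = ["autoconf", "automake", "libtoolize", "pkg-config", "make"]


def missing_tools(tools: Iterable[str]) -> List[str]:
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools(BUILD_TOOLS)
print("missing:", ", ".join(missing) if missing else "none")
```
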
Usage
------------------------------------------------------------

Tokenize an English text file to standard output; tokens will be
space-separated and sentences delimited by ``<utt>``:

    $ ucto -L eng yourfile.txt

The -L flag specifies the language (as a three-letter ISO 639-3 code), provided
a configuration file exists for that language. The configurations are provided
separately, for various languages, in the
[uctodata](https://github.com/LanguageMachines/uctodata) package. Note that
older versions of ucto used different two-letter codes, so you may need to
update the way you invoke ucto.

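If you have scripts that still pass two-letter codes, a small lookup can translate them. The Python sketch below covers only the languages listed under Features. The three-letter values are the ISO 639-3 codes; that older ucto versions used exactly these ISO 639-1 two-letter codes is an assumption, and `to_three_letter` is a hypothetical helper, not part of ucto.

```python
# ISO 639-1 (two-letter) to ISO 639-3 (three-letter) codes.
# Assumption: the older two-letter codes were plain ISO 639-1 codes.
ISO_639_1_TO_3 = {
    "en": "eng", "nl": "nld", "fr": "fra", "it": "ita",
    "tr": "tur", "es": "spa", "pt": "por", "sv": "swe",
}


def to_three_letter(code: str) -> str:
    """Return the ISO 639-3 code for a two- or three-letter language code."""
    code = code.lower()
    if code in ISO_639_1_TO_3:
        return ISO_639_1_TO_3[code]
    if code in ISO_639_1_TO_3.values():
        return code  # already a known three-letter code
    raise ValueError(f"unknown language code: {code!r}")


print(to_three_letter("nl"))  # nld
```
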
To output to a file instead of standard output, just add another
positional argument with the desired output filename.

If you want each sentence on a separate line (i.e. newline-delimited rather than delimited by
``<utt>``), then pass the ``-n`` flag. If each sentence is already on one line
in the input and you want to leave it at that, pass the ``-m`` flag.

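If you post-process the default output yourself, the ``<utt>`` markers can be used to recover sentence boundaries. The Python sketch below assumes the format described above (space-separated tokens, with ``<utt>`` as a separate token closing each sentence); the sample string is hand-written for illustration and is not real ucto output.

```python
from typing import List


def split_sentences(tokenized: str, marker: str = "<utt>") -> List[List[str]]:
    """Split space-separated tokens into sentences at each marker token."""
    sentences: List[List[str]] = []
    current: List[str] = []
    for token in tokenized.split():
        if token == marker:
            if current:
                sentences.append(current)
            current = []
        else:
            current.append(token)
    if current:  # trailing tokens without a closing marker
        sentences.append(current)
    return sentences


# Hand-written sample in the described format.
sample = "Hello , world ! <utt> How are you ? <utt>"
for sentence in split_sentences(sample):
    print(" ".join(sentence))
```

Printing one joined sentence per line, as done here, roughly mirrors what the ``-n`` flag is described as producing directly.
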
To tokenize plain text to [FoLiA XML](https://proycon.github.io/folia), use the ``-X`` flag; you can specify an ID
for the FoLiA document using the ``--id=`` flag:

    $ ucto -L eng -X --id=hamlet hamlet.txt hamlet.folia.xml

Note that in the FoLiA XML output, ucto encodes the class of each token (date, url, smiley, etc.) based
on the rule that matched.

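As a rough illustration of how the token classes can be read back, the Python sketch below parses a FoLiA fragment with the standard library and lists each token with its class. The sample document is hand-written and heavily simplified, not actual ucto output, and the class names in it (`WORD`, `URL`, `PUNCTUATION`) are assumptions chosen for the example. For real work, a dedicated FoLiA library is likely a better choice.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

FOLIA_NS = "http://ilk.uvt.nl/folia"

# Hand-written, simplified sample; real ucto output contains more structure
# (metadata, annotation declarations, paragraphs, and so on).
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="example">
  <text xml:id="example.text">
    <s xml:id="example.s.1">
      <w xml:id="example.s.1.w.1" class="WORD"><t>Visit</t></w>
      <w xml:id="example.s.1.w.2" class="URL"><t>https://example.org</t></w>
      <w xml:id="example.s.1.w.3" class="PUNCTUATION"><t>.</t></w>
    </s>
  </text>
</FoLiA>"""


def token_classes(folia_xml: str) -> List[Tuple[str, Optional[str]]]:
    """Return (token text, token class) pairs for every <w> element."""
    root = ET.fromstring(folia_xml)
    result = []
    for word in root.iter(f"{{{FOLIA_NS}}}w"):
        text = word.find(f"{{{FOLIA_NS}}}t")
        token = text.text if text is not None and text.text else ""
        result.append((token, word.get("class")))
    return result


for token, token_class in token_classes(SAMPLE):
    print(token_class, token)
```
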
For further documentation, consult the [ucto
documentation](https://ucto.readthedocs.io/en/latest/).