• Home
  • History
  • Annotate
Name Date Size #Lines LOC

..03-May-2022-

doc/H03-May-2022-649582

lib/H03-May-2022-9,4387,470

m4/H03-May-2022-2,8612,672

po/H03-May-2022-1,3471,178

python/H20-Sep-2009-609514

src/H03-May-2022-1,3521,113

tests/H03-May-2022-6,5595,383

utils/H03-May-2022-12,98710,376

win32/H20-Sep-2009-601353

ABOUT-NLSH A D22-Mar-200974.7 KiB1,0691,022

AUTHORSH A D11-Mar-200928 21

ChangeLogH A D20-Sep-200934.8 KiB1,125877

INSTALLH A D08-Feb-20089.3 KiB238179

LICENSEH A D20-May-20091.4 KiB3024

Makefile.amH A D20-May-2009468 2619

Makefile.inH A D03-May-202222 KiB714630

NEWSH A D20-Sep-20097.3 KiB244163

READMEH A D19-May-20099.6 KiB236181

THANKSH A D19-Sep-2009930 3025

TODOH A D11-Mar-20091.6 KiB5135

aclocal.m4H A D18-Sep-2009265.6 KiB7,5696,808

config.h.inH A D18-Sep-20097 KiB259178

configureH A D03-May-2022522.8 KiB17,76014,744

configure.acH A D18-Sep-200915.8 KiB572507

tre.pc.inH A D11-Mar-2009247 119

tre.spec.inH A D03-May-20222.8 KiB11586

README

1Introduction
2
3   TRE is a lightweight, robust, and efficient POSIX compliant regexp
4   matching library with some exciting features such as approximate
5   (fuzzy) matching.
6
7   The matching algorithm used in TRE uses linear worst-case time in
8   the length of the text being searched, and quadratic worst-case
9   time in the length of the used regular expression. In other words,
10   the time complexity of the algorithm is O(M^2N), where M is the
11   length of the regular expression and N is the length of the
12   text. The used space is also quadratic on the length of the regex,
13   but does not depend on the searched string. This quadratic
14   behaviour occurs only on pathological cases which are probably very
15   rare in practice.
16
17Features
18
19   TRE is not just yet another regexp matcher. TRE has some features
20   which are not there in most free POSIX compatible
21   implementations. Most of these features are not present in non-free
22   implementations either, for that matter.
23
24Approximate matching
25
26   Approximate pattern matching allows matches to be approximate, that
27   is, allows the matches to be close to the searched pattern under
28   some measure of closeness. TRE uses the edit-distance measure (also
29   known as the Levenshtein distance) where characters can be
30   inserted, deleted, or substituted in the searched text in order to
31   get an exact match. Each insertion, deletion, or substitution adds
32   the distance, or cost, of the match. TRE can report the matches
33   which have a cost lower than some given threshold value. TRE can
34   also be used to search for matches with the lowest cost.
35
36   TRE includes a version of the agrep (approximate grep) command line
37   tool for approximate regexp matching in the style of grep. Unlike
38   other agrep implementations (like the one by Sun Wu and Udi Manber
39   from University of Arizona available here) TRE agrep allows full
40   regexps of any length, any number of errors, and non-uniform costs
41   for insertion, deletion and substitution.
42
43Strict standard conformance
44
45   POSIX defines the behaviour of regexp functions precisely. TRE
46   attempts to conform to these specifications as strictly as
47   possible. TRE always returns the correct matches for subpatterns,
48   for example. Very few other implementations do this correctly. In
49   fact, the only other implementations besides TRE that I am aware of
50   (free or not) that get it right are Rx by Tom Lord, Regex++ by John
51   Maddock, and the AT&T ast regex by Glenn Fowler and Doug McIlroy.
52
53   The standard TRE tries to conform to is the IEEE Std 1003.1-2001,
54   or Open Group Base Specifications Issue 6, commonly referred to as
55   "POSIX".  It can be found online here. The relevant parts are the
56   base specifications on regular expressions (and the rationale) and
57   the description of the regcomp() API.
58
59   For an excellent survey on POSIX regexp matchers, see the testregex
60   pages by Glenn Fowler of AT&T Labs Research.
61
62Predictable matching speed
63
64   Because of the matching algorithm used in TRE, the maximum time
65   consumed by any regexec() call is always directly proportional to
66   the length of the searched string. There is one exception: if back
67   references are used, the matching may take time that grows
68   exponentially with the length of the string. This is because
69   matching back references is an NP complete problem, and almost
70   certainly requires exponential time to match in the worst case.
71
72Predictable and modest memory consumption
73
74   A regexec() call never allocates memory from the heap. TRE
75   allocates all the memory it needs during a regcomp() call, and some
76   temporary working space from the stack frame for the duration of
77   the regexec() call. The amount of temporary space needed is
78   constant during matching and does not depend on the searched
79   string. For regexps of reasonable size TRE needs less than 50K of
80   dynamically allocated memory during the regcomp() call, less than
81   20K for the compiled pattern buffer, and less than two kilobytes of
82   temporary working space from the stack frame during a regexec()
83   call. There is no time/memory tradeoff. TRE is also small in code
84   size; statically linking with TRE increases the executable size
85   less than 30K (gcc-3.2, x86, GNU/Linux).
86
87Wide character and multibyte character set support
88
89   TRE supports multibyte character sets. This makes it possible to
90   use regexps seamlessly with, for example, Japanese locales. TRE
91   also provides a wide character API.
92
93Binary pattern and data support
94
95   TRE provides APIs which allow binary zero characters both in
96   regexps and searched strings. The standard API cannot be easily
97   used to, for example, search for printable words from binary data
98   (although it is possible with some hacking). Searching for patterns
99   which contain binary zeroes embedded is not possible at all with
100   the standard API.
101
102Completely thread safe
103
104   TRE is completely thread safe. All the exported functions are
105   re-entrant, and a single compiled regexp object can be used
106   simultaneously in multiple contexts; e.g. in main() and a signal
107   handler, or in many threads of a multithreaded application.
108
109Portable
110
111   TRE is portable across multiple platforms. Here's a table of
112   platforms and compilers that have been successfully used to compile
113   and run TRE:
114
115      Platform(s)                       | Compiler(s)
116      ----------------------------------+------------
117      AIX 4.3.2 - 5.3.0                 | GCC, C for AIX compiler version 5
118      Compaq Tru64 UNIX V5.1A/B         | Compaq C V6.4-014 - V6.5-011
119      Cygwin 1.3 - 1.5                  | GCC
120      Digital UNIX V4.0                 | DEC C V5.9-005
121      FreeBSD 4 and above               | GCC
122      GNU/Linux systems on x86, x86_64, | GCC
123      ppc64, s390			|
124      HP-UX 10.20- 11.00                | GCC, HP C Compiler
125      IRIX 6.5                          | GCC, MIPSpro Compilers 7.3.1.3m
126      Max OS X				|
127      NetBSD 1.5 and above              | GCC, egcs
128      OpenBSD 3.3 and above             | GCC
129      Solaris 2.7-10 sparc/x86          | GCC, Sun Workshop 6 compilers
130      Windows 98 - XP                   | Microsoft Visual C++ 6.0
131
132   TRE 0.7.5 should compile without changes on all of the above
133   platforms.  Tell me if you are using TRE on a platform that is not
134   listed above, and I'll add it to the list. Also let me know if TRE
135   does not work on a listed platform.
136
137   Depending on the platform, you may need to install libutf8 to get
138   wide character and multibyte character set support.
139
140 Free
141
142   TRE is released under a license which is essentially the same as
143   the "2 clause" BSD-style license used in NetBSD.  See the file
144   LICENSE for details.
145
146Roadmap
147
148   There are currently two features, both related to collating
149   elements, missing from 100% POSIX compliance. These are:
150
151     * Support for collating elements (e.g. [[.<X>.]], where <X> is a
152       collating element). It is not possible to support
153       multi-character collating elements portably, since POSIX does
154       not define a way to determine whether a character sequence is a
155       multi-character collating element or not.
156
157     * Support for equivalence classes, for example [[=<X>=]], where
158       <X> is a collating element. An equivalence class matches any
159       character which has the same primary collation weight as
160       <X>. Again, POSIX provides no portable mechanism for
161       determining the primary collation weight of a collating
162       element.
163
164   Note that other portable regexp implementations don't support
165   collating elements either. The single exception is Regex++, which
166   comes with its own database for collating elements for different
167   locales. Support for collating elements and equivalence classes has
168   not been widely requested and is not very high on the TODO list at
169   the moment.
170
171   These are other features I'm planning to implement real soon now:
172
173     * All the missing GNU extensions enabled in GNU regex, such as
174       [[:<:]] and [[:>:]]
175
176     * A REG_SHORTEST regexec() flag for returning the shortest match
177       instead of the longest match.
178
179     * Perl-compatible syntax
180
181            [:^class:]
182               Matches anything but the characters in class. Note that
183               [^[:class:]] works already, this would be just a
184               convenience shorthand.
185
186            \A
187               Match only at beginning of string
188
189            \Z
190               Match only at end of string, or before newline at the end
191
192            \z
193               Match only at end of string
194
195            \l
196               Lowercase next char (think vi)
197
198            \u
199               Uppercase next char (think vi)
200
201            \L
202               Lowercase till \E (think vi)
203
204            \U
205               Uppercase till \E (think vi)
206
207            (?=pattern)
208               Zero-width positive look-ahead assertions.
209
210            (?!pattern)
211               Zero-width negative look-ahead assertions.
212
213            (?<=pattern)
214               Zero-width positive look-behind assertions.
215
216            (?<!pattern)
217               Zero-width negative look-behind assertions.
218
219   Documentation especially for the nonstandard features of TRE, such
220   as approximate matching, is a work in progress (with "progress"
221   loosely defined...)
222
223   Mailing lists
224
225   tre-general@lists.laurikari.net
226      This list is for any discussion on the TRE software, including
227      reporting bugs, feature requests, requests for help, and other
228      things.
229
230   tre-announce@lists.laurikari.net
231      Subscribe to this list to get announcements of new releases of
232      TRE.  Alternatively, you can subscribe to the freshmeat.net
233      project and get similar announcements that way, if you prefer.
234
235Ville Laurikari    <vl@iki.fi>
236