• Home
  • History
  • Annotate
Name Date Size #Lines LOC

..03-May-2022-

examples/H19-Jun-2017-2,0161,635

lib/Regexp/H19-Jun-2017-3,3931,739

t/H19-Jun-2017-5,6344,992

xt/author/H19-Jun-2017-169

Changelog.iniH A D19-Jun-201719.2 KiB451444

ChangesH A D19-Jun-201720.9 KiB464425

LICENSEH A D19-Jun-201719.8 KiB379305

MANIFESTH A D19-Jun-2017908 4544

MANIFEST.SKIPH A D19-Jun-2017612 4839

META.jsonH A D19-Jun-20171.3 KiB5857

META.ymlH A D19-Jun-2017706 3029

Makefile.PLH A D19-Jun-20171.3 KiB7365

READMEH A D07-Apr-201113.2 KiB488355

TODOH A D17-Apr-2016101 53

README

1This file is the README for Regexp::Assemble version 0.35
2
3INSTALLATION
4
5perl Makefile.PL
6make
7make test
8make install
9
10TESTING
11
12This module requires the following modules for thorough testing:
13
14  Test::More
15  Test::File::Contents
16  Test::Pod
17  Test::Pod::Coverage
18  Test::Warn
19
20The test suite will make allowances for their eventual absence.
21
22It can also make use of Devel::Cover if available.
23
24UNINSTALLATION
25
26This is a pure-Perl module. The following one-liner should print
27out the canonical path of the file:
28
29  perl -MRegexp::Assemble -le 'print $INC{"Regexp/Assemble.pm"}'
30
31Just delete this file. There is also the question of the man page.
32Finding that is left as an exercise to the reader.
33
34BASIC USAGE
35
36use Regexp::Assemble;
37
38my $ra = Regexp::Assemble->new;
39$ra->add( 'ab+c' );
40$ra->add( 'ab+\\d*\\s+c' );
41$ra->add( 'a\\w+\\d+' );
42$ra->add( 'a\\d+' );
43print $ra->re; # prints (?:a(?:b+(?:\d*\s+)?c|(?:\w+)?\d+))
44
45or
46
47my $ra = Regexp::Assemble->new
48    ->add( 'foo', 'bar', 'baz', 'foom' );
49
50print "$_ matches\n" if /$ra/
51    for (qw/word more stuff food rabble bark/);
52
53or
54
55use Regexp::Assemble;
56my @word = qw/flip flop slip slop/;
57
58print Regexp::Assemble->new->add(@word)->as_string;
59    # produces [fs]l[io]p
60
61print Regexp::Assemble->new->add(@word)->reduce(0)->as_string;
62    # produces (?:fl(?:ip|op)|sl(?:ip|op))
63
64See the ./eg directory for some example scripts.
65
66ADVANCED USAGE
67
68If you want to match things with exceptions, you can use a two
69stage process to build a pattern with negative lookbehind. Consider
70the following script:
71
72== example begin ==
73use Regexp::Assemble;
74
75my $set = [
76    {
77        accept => [qw[ .cnn.net .cnn.com ]],
78        refuse => [qw[ ^media video ]],
79    },
80    {
81        accept => [qw[ .yahoo.com ]],
82    },
83];
84
85my $ra = Regexp::Assemble->new;
86
87for my $s( @$set ) {
88    my $refuse = do {
89        if( not exists $s->{refuse} ) {
90            '';
91        }
92        else {
93            '(?<!'
94            . Regexp::Assemble->new->add( @{$s->{refuse}} )->as_string
95            . ')'
96        }
97    };
98    $ra->add( map { s/\./\\./g; "$refuse$_\$" } @{$s->{accept}} );
99}
100
101my $re = $ra->re;
102
103print $ra->as_string, "\n";
104
105while( <> ) {
106    print;
107    chomp;
108    print "\t", (/$re/ ? 'yep' : 'nope'), "\n";
109}
110
111== example end ==
112
113and a datafile to run it on:
114
115== data begin ==
116media.cnn.com
117more.video.cnn.net
118super.media.cnn.com
119video.cnn.net
120video.yahoo.com
121www.cnn.com
122www.cnn.net
123www.yahoo.com
124== data end ==
125
126This lets us match arbitrary hosts within a domain, but at the same
127time excluding a subset of hosts that we wish to ignore.
128
129TRACKING REGULAR EXPRESSION MATCHES
130
131Regexp::Assemble can emit regular expressions that, when used correctly,
132can let you determine which original pattern gave rise to the match.
133This technique is known as tracking.
134
135== example begin ==
136
137use strict;
138use Regexp::Assemble;
139
140my $dispatch = {
141    'a-(\\d+)'        => sub { my $v = shift; print "speed $v->[1]\n"; },
142    'a-(\\d+)-(\\d+)' => sub { my $v = shift; print "pressure $v->[1] over $v->[2]\n"; },
143    'a-(\\w+)-(\\w+)' => sub { my $v = shift; print "message $v->[1] from $v->[2]\n"; },
144};
145
146my $re = Regexp::Assemble->new( track => 1 )->add( keys %$dispatch );
147
148while( <> ) {
149    chomp;
150    if( $re->match($_) ) {
151        $dispatch->{ $re->matched }( $re->mvar() );
152    }
153    else {
154        last if /q/;
155        print "\tignored\n";
156    }
157}
158
159== example end ==
160
161Run this and enter lines like a-234, a-654, a-345-345, a-dog-cat and so
162on. When the pattern matches a string, you can retrieve the pattern
163that caused the match to occur, and dispatch it to a routine that knows
164what to do about it. You can retrieve captured values too. In the above
165example, just remember that $v->[1] eq $1. $v->[0], a.k.a $re->mvar(0)
166happens to be the the same as the input parameter to match (although
167this is worked out from first principles, more or less, not simply by
168copying the parameter).
169
170I initially hoped that $^R would handle this sort of stuff for me, but
171there's a bug. Consider the following pattern:
172
173    a(?{1}) (?: b(?{2}) )?
174
175(whitespace added for clarity). This pattern will match both the strings
176'a' and 'ab', however, in both cases, $^R will be set to 1 aftewards. I
177would have hoped that after matching 'ab', that $^R would be set to 2.
178
179As of perl 5.9.5, this bug has been corrected in the regular expression
180engine, thanks to Yves Orton. Version 0.29 takes this into account, and
181as a result re 'eval' is no longer required in Perl 5.10.
182
183IMPLEMENTATION
184
185Consider a simple pattern 'costructive' we want to use to match against
186strings. This pattern is split into tokens, and is stored in a list:
187
188  [c o n s t r u c t i v e]
189
190At this point, if we want to produce a regular expression, we only need
191to join it up again:
192
193   my $pattern = join( '' => @path);
194   my $re = qr/$pattern/;
195
196Consider a second pattern 'containment'. Split into a list gives:
197
198  [c o n t a i n m e n t]
199
200We then have to merge this second path into the first path. At some
201point, the paths diverge. The first element path the point of
202divergence in the first path is replace by a node (a hash) and the two
203different paths carry on from there:
204
205  [c o n
206        |s => [s t r u c t i v e]
207        \t => [t a i n m e n t]
208  ]
209
210And then 'confinement':
211
212  [c o n
213        |s => [s t r u c t i v e]
214        |t => [t a i n m e n t]
215        \f => [f i n e m e n t]
216  ]
217
218What happens if we add a path that runs out in the middle of a
219previous path?  We add a node, and a "null-path" to indicate that
220the path can both continue on, and can also stop here:
221
222Add 'construct':
223
224  [c o n
225        |s => [s t r u c t
226        |                 | '' => undef
227        |                 \ i => [i v e]
228        |     ]
229        |t => [t a i n m e n t]
230        \f => [f i n e m e n t]
231  ]
232
233It should be obvious to see how the contruct branch will produce the
234pattern /construct(?:ive)?/ . Or for a longer path 'constructively':
235
236  [c o n
237        |s => [s t r u c t
238        |                 | '' => undef
239        |                 \ i => [i v e
240        |                              | '' => undef
241        |                              \ l => [l y]
242        |                        ]
243        |     ]
244        |t => [t a i n m e n t]
245        \f => [f i n e m e n t]
246  ]
247
248This is the state of the internal structure before reduction. When
249traversed it will produce a valid regular expression.
250
251The trick is how to perform the reduction. The key insight is to note
252that for any part of the trunk where the sibling paths do not end in
253a node, it it possible to reverse them, and insert them into their
254own R::A object and see what comes out:
255
256  [t a i n m e n t] =>
257  [t n e m n i a t]
258
259  [f i n e m e n t] =>
260  [t n e m e n i f]
261
262Gives:
263
264  [t n e m
265          | n => [n i a t]
266          \ e => [e n i f]
267  ]
268
269When the algorithm visits the other path (s => [s t r u c t ...]),
270it behaves differently. When a null path is seen, no reduction is
271performed at that node level. The resulting path would otherwise
272begin to admit matches that are are not permitted by any of the
273initial patterns. For instance, with bat, cat, and catty, you can
274hardly try to merge 'bat' and 'cat' to produce [bc]at, otherwise the
275resulting pattern would become [bc]at(ty)?, and that would incorrectly
276match 'batty'.
277
278After having visited the s, t, and f paths, the
279result is that t and f were reduced, and s failed. We therefore
280unreverse everything, and signal that this node cannot participate in
281any more reduction (the failures percolate up the tree back to the
282root).
283
284Unreversing the t, f reduction gives:
285
286  [ t => [t a i n] \
287    f => [f i n e] | m e n t ]
288
289When all is said and done, the final result gives
290
291  [c o n
292        |s => [s t r u c t
293        |                 | '' => undef
294        |                 \ i => [i v e
295        |                              | '' => undef
296        |                              \ l => [l y]
297        |                        ]
298        |     ]
299        [ t => [t a i n]
300          f => [f i n e] m e n t ]
301  ]
302
303When this data structure is traversed to build the pattern, it gives
304
305  con(struct(ive(ly)?)?|(fine|tain)ment)
306
307NB: The capturing syntax is used here, instead of the grouping
308syntax for readability issues only.
309
310On the other hand, if the s path contained only [s t r u c t], then
311the reduction would have gone succeeded. We would have a common
312head [t], shared by all three paths.
313
314  [t
315    | c => [c u r t s]
316    \ n => [n e m
317                 | n => [n i a t]
318                 \ e => [e n i f]
319           ]
320  ]
321
322And then consider that the path [c o u r t] had also been added to
323the object. We would then be able to reduce the t from the above
324reduction, and the t in [c o u r t]
325
326  [c o
327      | n => [n
328      |        | s => [s t r u c t]
329      |        | t => [t a i n m e n t]
330      |        \ f => [f i n e m e n t]
331      |      ]
332      \ u => [u r t]
333  ]
334
335gives
336
337  [c o
338      | n => [n
339      |        | s => [s t r u c]
340      |        \ f => [
341      |                 f => [f i n e]
342      |                 t => [t a i n]
343      |                 m e n
344      |               ]
345      |      ]
346      \ u => [u r]
347   t
348  ]
349
350(Here ends my ASCII art talents).
351
352The above structure would give
353
354  co(n(struc|(fine|tai)men)|ur)t
355
356In a nutshell, that's it. Seems like the code would be simple, huh?
357It turns out that no, there are lots of fiddly edge cases,
358especially sets of paths are the same as other sets of paths
359except for an optional sub-path. The canonical example that the
360test suite deals with is:
361
362  showeriness, showerless, showiness, showless.
363
364The final pattern is
365
366  show(er)?(in|l)ess
367
368If there are bugs to be found, it will be in cases that are even
369more pathological than this, e.g., something like:
370
371  show(er)?(i(a|po)?n|l)ess
372
373(although the above actually *does* work, I tried it)
374
375TESTING STRATEGY USED
376
377The code has been heavily tested using an approach based on
378combinatoric lists known as power sets. For instance, the power
379set of (a,b,c,d) is (assuming a join() on the results):
380
381 a ab abc abcd abd ac acd ad b bc bcd bd c cd d
382
383(along with the empty set). The power set of N elements contains
3842**N elements. (or 2**N-1 of we exclude the empty set).
385
386The testing approach was then to take the power set of the above
387power set and produce regular expressions from each element.
388
389For instance, at some point, we would encounter the set
390
391  abc ac bcd cd d
392
393From this we generate the pattern (?:(?:b?c)?d|ab?c). Once we have
394this pattern, we go back and check that it does in fact match the
395above 5 elements, and furthermore, that it does *not match* the
396remaining 10 elements of the power set not used in this iteration.
397
398And yes, that shook out a couple of bugs.
399
400As of this time, the following search space has been examined
401
402a b c         - complete
403a b c d       - complete
404a b c d e     - runs of 1-11, 20-31 complete, 12-17 partial
405a b c d e f   - runs of 1-5, 61-63 complete
406a b c d e f g - runs of 1-4, 125-127 complete
407
408The code for this is in the eg/stress-test script. Note: it can use
409months of CPU time if you're not careful. It requires the following
410modules:
411
412  Algorithm::Combinatorics
413  Data::PowerSet
414
415OTHER CONSIDERATIONS
416
417When tracking is in use, no reduction is performed.
418
419Pretty-printed (indented), and tracking is handled merely by calling
420different output routines. Each routine emits things in a different
421way, but the underlying structure remains the same. Which is one
422reason why you can't have pretty-printed tracked patterns (Well you
423can, but I haven't written the routine that would do so).
424
425Zero-width lookahead assertions can be added to the pattern. This may
426be a win, but it may also slow things down.
427
428DEBUGGING NOTES
429
430If you are curious, you can dump out the internal data struct with
431the following:
432
433  use Data::Dumper;
434  $Data::Dumper::Terse     = 0;
435  $Data::Dumper::Indent    = 0;
436  $Data::Dumper::Quotekeys = 0;
437  $Data::Dumper::Pair      = '=>';
438
439  print Dumper($r->_path);
440
441A more compact representation can be obtained with
442
443  print $r->dump;
444
445All that said, I'm now reasonably confident that it deals
446correctly with pretty much anything you're likely to throw at it.
447
448Two recent bugs were easy to spot in the code, and the fix was a
449couple of lines. Adding lookahead assertion was pretty simple to,
450even if it did result in a certain amount of code factoring. So I
451think that in general the structure of the code is a good one.
452
453The eg/debugging script offers a good strategy for dealing with
454assemblies that give rise to uncompilable patterns.
455
456STATUS
457
458This module is under active development. The module is managed in
459a Subversion repository, and thus, the latest working copy is
460available at
461
462  http://svnweb.mongueurs.net/Regexp-Assemble/trunk
463
464AUTHOR
465
466David Landgren
467
468I do appreciate getting e-mail, especially about Perl. Please keep in
469mind that I get a lot of spam, and take drastic measures to reduce the
470flow. One of the measures involves a gigantic regular expression that
471contains many thousands of patterns that match hostnames of dynamic
472dialup/residential/home IP addresses. That pattern is of course built
473with this module.
474
475It would be ironic if I rejected your mail coming from such an address.
476Please use your ISP's outbound MX, or pay what it takes to get your
477reverse DNS changed to something else.
478
479COPYRIGHT
480
481This module is copyright (C) David Landgren 2004-2008.
482All rights reserved.
483
484LICENSE
485
486This library is free software; you can redistribute it and/or modify
487it under the same terms as Perl itself.
488