| Name | Date | Size | #Lines | LOC |
|------|------|------|--------|-----|
| .. | 03-May-2022 | - | | |
| .github/workflows/ | 04-Nov-2021 | - | 109 | 88 |
| rebulk/ | 04-Nov-2021 | - | 6,967 | 5,365 |
| .coveragerc | 04-Nov-2021 | 162 | 10 | 10 |
| .gitignore | 04-Nov-2021 | 218 | 27 | 19 |
| CHANGELOG.md | 04-Nov-2021 | 1.8 KiB | 27 | 20 |
| LICENSE | 04-Nov-2021 | 1.1 KiB | 23 | 17 |
| MANIFEST.in | 04-Nov-2021 | 107 | 8 | 7 |
| README.md | 04-Nov-2021 | 18.5 KiB | 565 | 403 |
| pylintrc | 04-Nov-2021 | 12.4 KiB | 389 | 268 |
| pytest.ini | 04-Nov-2021 | 98 | 3 | 2 |
| runtests.py | 04-Nov-2021 | 258.7 KiB | 3,488 | 3,454 |
| setup.cfg | 04-Nov-2021 | 311 | 12 | 9 |
| setup.py | 03-May-2022 | 2.4 KiB | 64 | 49 |
| tox.ini | 04-Nov-2021 | 138 | 8 | 6 |
README.md

ReBulk
======

[![Latest Version](http://img.shields.io/pypi/v/rebulk.svg)](https://pypi.python.org/pypi/rebulk)
[![MIT License](http://img.shields.io/badge/license-MIT-blue.svg)](https://pypi.python.org/pypi/rebulk)
[![Build Status](https://img.shields.io/github/workflow/status/Toilal/rebulk/ci)](https://github.com/Toilal/rebulk/actions?query=workflow%3Aci)
[![Coveralls](http://img.shields.io/coveralls/Toilal/rebulk.svg)](https://coveralls.io/r/Toilal/rebulk?branch=master)
[![semantic-release](https://img.shields.io/badge/%20%20%F0%9F%93%A6%F0%9F%9A%80-semantic--release-e10079.svg)](https://github.com/relekang/python-semantic-release)


ReBulk is a Python library that performs advanced searches in strings
that would be hard to implement using only the [re
module](https://docs.python.org/3/library/re.html) or [string
methods](https://docs.python.org/3/library/stdtypes.html#str).

It includes features such as `Patterns`, `Match` and `Rule` that allow
developers to build a custom and complex string matcher using a readable
and extendable API.

This project is hosted on GitHub: <https://github.com/Toilal/rebulk>

Install
=======

```sh
$ pip install rebulk
```

Usage
=====

Regular expression, string and function based patterns are declared in a
`Rebulk` object. It uses a fluent API to chain `string`, `regex`, and
`functional` methods to define various pattern types.

```python
>>> from rebulk import Rebulk
>>> bulk = Rebulk().string('brown').regex(r'qu\w+').functional(lambda s: (20, 25))
```

When the `Rebulk` object is fully configured, you can call the `matches`
method with an input string to retrieve all `Match` objects found by the
registered patterns.

```python
>>> bulk.matches("The quick brown fox jumps over the lazy dog")
[<brown:(10, 15)>, <quick:(4, 9)>, <jumps:(20, 25)>]
```

If multiple `Match` objects are found at the same position, only the
longest one is kept.

```python
>>> bulk = Rebulk().string('lakers').string('la')
>>> bulk.matches("the lakers are from la")
[<lakers:(4, 10)>, <la:(20, 22)>]
```
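
The "longest match wins" behaviour can be pictured with a small standard-library sketch. It is not ReBulk's implementation: `keep_longest` is a hypothetical helper, and treating any two overlapping spans as conflicting is an assumption made for the illustration.

```python
def keep_longest(spans):
    """Drop every span that overlaps a longer span that was already kept."""
    kept = []
    # Visit longer spans first so that they win conflicts.
    for start, end in sorted(spans, key=lambda s: s[0] - s[1]):
        if all(end <= k_start or start >= k_end for k_start, k_end in kept):
            kept.append((start, end))
    return sorted(kept)


text = "the lakers are from la"
# 'lakers', the 'la' found inside it, and the final 'la'
candidates = [(4, 10), (4, 6), (20, 22)]
resolved = keep_longest(candidates)
print([(text[s:e], (s, e)) for s, e in resolved])
# [('lakers', (4, 10)), ('la', (20, 22))]
```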

String Patterns
===============

String patterns are based on the
[str.find](https://docs.python.org/3/library/stdtypes.html#str.find)
method to find matches, but return all matches in the string.
`ignore_case` can be enabled to ignore case.

```python
>>> Rebulk().string('la').matches("lalalilala")
[<la:(0, 2)>, <la:(2, 4)>, <la:(6, 8)>, <la:(8, 10)>]

>>> Rebulk().string('la').matches("LalAlilAla")
[<la:(8, 10)>]

>>> Rebulk().string('la', ignore_case=True).matches("LalAlilAla")
[<La:(0, 2)>, <lA:(2, 4)>, <lA:(6, 8)>, <la:(8, 10)>]
```

You can define several patterns with a single `string` method call.

```python
>>> Rebulk().string('Winter', 'coming').matches("Winter is coming...")
[<Winter:(0, 6)>, <coming:(10, 16)>]
```
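
For readers curious how a `str.find` based search can return every occurrence, here is a minimal sketch using only the standard library. `find_all` is a hypothetical helper, not part of ReBulk, and resuming the search right after each hit (so hits never overlap) is an assumption.

```python
def find_all(text, needle, ignore_case=False):
    """Return (value, (start, end)) for every occurrence of needle in text."""
    # Lower-casing is enough for ASCII input; it keeps indices aligned.
    haystack = text.lower() if ignore_case else text
    target = needle.lower() if ignore_case else needle
    found = []
    start = haystack.find(target)
    while start != -1:
        end = start + len(target)
        found.append((text[start:end], (start, end)))
        start = haystack.find(target, end)  # resume after the previous hit
    return found


print(find_all("lalalilala", "la"))
print(find_all("LalAlilAla", "la"))
print(find_all("LalAlilAla", "la", ignore_case=True))
```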

Regular Expression Patterns
===========================

Regular expression patterns are based on a compiled regular expression.
The [re.finditer](https://docs.python.org/3/library/re.html#re.finditer)
method is used to find matches.

If the [regex module](https://pypi.python.org/pypi/regex) is available,
rebulk can use it instead of the default [re
module](https://docs.python.org/3/library/re.html). Enable it with the
`REBULK_REGEX_ENABLED=1` environment variable.

```python
>>> Rebulk().regex(r'l\w').matches("lolita")
[<lo:(0, 2)>, <li:(2, 4)>]
```

You can define several patterns with a single `regex` method call.

```python
>>> Rebulk().regex(r'Wint\wr', r'com\w{3}').matches("Winter is coming...")
[<Winter:(0, 6)>, <coming:(10, 16)>]
```
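
The values and spans shown above are what `re.finditer` reports directly. The following sketch reproduces them with the standard `re` module only; `regex_spans` is a hypothetical helper used for illustration, not a ReBulk function.

```python
import re


def regex_spans(text, *patterns, flags=0):
    """Collect (value, span) for every match of every pattern, in pattern order."""
    found = []
    for pattern in patterns:
        for m in re.finditer(pattern, text, flags):
            found.append((m.group(0), m.span()))
    return found


print(regex_spans("lolita", r'l\w'))
# [('lo', (0, 2)), ('li', (2, 4))]
print(regex_spans("Winter is coming...", r'Wint\wr', r'com\w{3}'))
# [('Winter', (0, 6)), ('coming', (10, 16))]
```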

All keyword arguments from
[re.compile](https://docs.python.org/3/library/re.html#re.compile) are
supported.

```python
>>> import re  # import required for flags constant
>>> Rebulk().regex('L[A-Z]KERS', flags=re.IGNORECASE) \
...         .matches("The LaKeRs are from La")
[<LaKeRs:(4, 10)>]

>>> Rebulk().regex('L[A-Z]', 'L[A-Z]KERS', flags=re.IGNORECASE) \
...         .matches("The LaKeRs are from La")
[<La:(20, 22)>, <LaKeRs:(4, 10)>]

>>> Rebulk().regex(('L[A-Z]', re.IGNORECASE), ('L[a-z]KeRs')) \
...         .matches("The LaKeRs are from La")
[<La:(20, 22)>, <LaKeRs:(4, 10)>]
```

If the [regex module](https://pypi.python.org/pypi/regex) is available,
repeated captures are supported automatically.

```python
>>> # If regex module is available, repeated_captures is True by default.
>>> matches = Rebulk().regex(r'(\d+)(?:-(\d+))+').matches("01-02-03-04")
>>> matches[0].children # doctest:+SKIP
[<01:(0, 2)>, <02:(3, 5)>, <03:(6, 8)>, <04:(9, 11)>]

>>> # If regex module is not available, or if repeated_captures is forced to False.
>>> matches = Rebulk().regex(r'(\d+)(?:-(\d+))+', repeated_captures=False) \
...                   .matches("01-02-03-04")
>>> matches[0].children
[<01:(0, 2)+initiator=01-02-03-04>, <04:(9, 11)+initiator=01-02-03-04>]
```
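
The difference between the two outputs comes from how the standard `re` module treats a repeated group: only the last capture is kept. The sketch below shows this with `re` alone, plus one possible way to recover every number by rescanning the matched text. It illustrates the limitation and does not reproduce ReBulk's internals.

```python
import re

text = "01-02-03-04"
m = re.match(r'(\d+)(?:-(\d+))+', text)

# The standard re module keeps only the last capture of a repeated group.
last_only = [(m.group(i), m.span(i)) for i in (1, 2)]
print(last_only)  # [('01', (0, 2)), ('04', (9, 11))]

# Every number can still be recovered by rescanning the matched text.
all_numbers = [(n.group(0), (n.start() + m.start(), n.end() + m.start()))
               for n in re.finditer(r'\d+', m.group(0))]
print(all_numbers)
```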

-   `abbreviations`

    Defined as a list of 2-tuples, where each tuple is an abbreviation.
    It simply replaces `tuple[0]` with `tuple[1]` in the expression.

        >>> Rebulk().regex(r'Custom-separators', abbreviations=[("-", r"[\W_]+")]) \
        ...         .matches("Custom_separators using-abbreviations")
        [<Custom_separators:(0, 17)>]
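
The substitution described for `abbreviations` can be sketched with plain string replacement before compiling the expression. `expand_abbreviations` is a hypothetical helper, and the separator class `[\W_]+` is assumed here for the illustration.

```python
import re


def expand_abbreviations(expression, abbreviations):
    """Replace each abbreviation (tuple[0]) with its expansion (tuple[1])."""
    for short, replacement in abbreviations:
        expression = expression.replace(short, replacement)
    return expression


pattern = expand_abbreviations(r'Custom-separators', [("-", r"[\W_]+")])
print(pattern)  # Custom[\W_]+separators

m = re.search(pattern, "Custom_separators using-abbreviations")
print(m.group(0), m.span())  # Custom_separators (0, 17)
```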

Functional Patterns
===================

Functional patterns are based on the evaluation of a function.

The function should take the same parameter as the `Rebulk.matches`
method, that is, the input string, and must return at least the start
index and end index of the `Match` object.

```python
>>> def func(string):
...     index = string.find('?')
...     if index > -1:
...         return 0, index - 11
>>> Rebulk().functional(func).matches("Why do simple ? Forget about it ...")
[<Why:(0, 3)>]
```

You can also return a dict of keyword arguments for the `Match` object.

You can define several patterns with a single `functional` method call,
and each function can return multiple matches.

Chain Patterns
==============

Chain patterns are an ordered composition of string, functional and
regex patterns. A repeater can be set to define the repetition of each
chain part.

```python
>>> r = Rebulk().regex_defaults(flags=re.IGNORECASE)\
...             .defaults(children=True, formatter={'episode': int, 'version': int})\
...             .chain()\
...             .regex(r'e(?P<episode>\d{1,4})').repeater(1)\
...             .regex(r'v(?P<version>\d+)').repeater('?')\
...             .regex(r'[ex-](?P<episode>\d{1,4})').repeater('*')\
...             .close() # .repeater(1) could be omitted as it's the default behavior
>>> r.matches("This is E14v2-15-16-17").to_dict()  # converts matches to dict
MatchesDict([('episode', [14, 15, 16, 17]), ('version', 2)])
```
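
To see what the chain above composes, here is a rough hand-written equivalent using the standard `re` module. It is only a sketch for comparison: the `rest` group and the `parse` function are hypothetical names, and a real chain keeps each part as its own pattern instead of one large expression.

```python
import re

CHAIN = re.compile(
    r'e(?P<episode>\d{1,4})'          # first part, exactly once
    r'(?:v(?P<version>\d+))?'         # second part, zero or one time
    r'(?P<rest>(?:[ex-]\d{1,4})*)',   # third part, zero or more times
    re.IGNORECASE,
)


def parse(text):
    """Return episodes and version found by the hand-written chain, or None."""
    m = CHAIN.search(text)
    if m is None:
        return None
    episodes = [int(m.group('episode'))]
    # Split the repeated tail back into its individual episode numbers.
    episodes += [int(n) for n in
                 re.findall(r'[ex-](\d{1,4})', m.group('rest'), re.IGNORECASE)]
    version = int(m.group('version')) if m.group('version') else None
    return {'episode': episodes, 'version': version}


print(parse("This is E14v2-15-16-17"))
# {'episode': [14, 15, 16, 17], 'version': 2}
```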

Patterns parameters
===================

All patterns have options that can be given as keyword arguments.

-   `validator`

    Function to validate the `Match` value given by the pattern. Can
    also be a `dict`, to use a `validator` with the pattern named by
    each key.

    ```python
    >>> def check_leap_year(match):
    ...     return int(match.value) in [1980, 1984, 1988]
    >>> matches = Rebulk().regex(r'\d{4}', validator=check_leap_year) \
    ...                   .matches("In year 1982 ...")
    >>> len(matches)
    0
    >>> matches = Rebulk().regex(r'\d{4}', validator=check_leap_year) \
    ...                   .matches("In year 1984 ...")
    >>> len(matches)
    1
    ```

Some base validator functions are available in the `rebulk.validators`
module. Most of those functions have to be configured using
`functools.partial` to map them to a function accepting a single `match`
argument.

-   `formatter`

    Function to convert the `Match` value given by the pattern. Can also
    be a `dict`, to use a `formatter` with the matches named by each
    key.

    ```python
    >>> def year_formatter(value):
    ...     return int(value)
    >>> matches = Rebulk().regex(r'\d{4}', formatter=year_formatter) \
    ...                   .matches("In year 1982 ...")
    >>> isinstance(matches[0].value, int)
    True
    ```

-   `pre_match_processor` / `post_match_processor`

    Function to mutate or invalidate a match generated by a pattern.

    The function has a single parameter, which is the `Match` object. If
    the function returns `False`, the match is considered invalid. If
    the function returns a match instance, it replaces the original
    match in the process.

-   `post_processor`

    Function to change the default output of the pattern. Function
    parameters are the `Matches` list and the `Pattern` object.

-   `name`

    The name of the pattern. It is automatically passed to `Match`
    objects generated by this pattern.

-   `tags`

    A list of strings that qualify this pattern.

-   `value`

    Override the value property for generated `Match` objects. Can also
    be a `dict`, to use a `value` with the pattern named by each key.

-   `validate_all`

    By default, the validator is called for returned `Match` objects
    only. Enable this option to validate them all, parent and children
    included.

-   `format_all`

    By default, the formatter is called for returned `Match` values
    only. Enable this option to format them all, parent and children
    included.

-   `disabled`

    A `function(context)` to disable the pattern if returning `True`.

-   `children`

    If `True`, all children `Match` objects will be retrieved instead of
    a single parent `Match` object.

-   `private`

    If `True`, `Match` objects generated from this pattern are available
    internally only. They will be removed at the end of the
    `Rebulk.matches` method call.

-   `private_parent`

    Force parent matches to be returned and flag them as private.

-   `private_children`

    Force children matches to be returned and flag them as private.

-   `private_names`

    Match names that will be declared as private.

-   `ignore_names`

    Match names that will be ignored in the pattern output, after
    validation.

-   `marker`

    If `True`, `Match` objects generated from this pattern will be
    marker matches instead of standard matches. They won't be included
    in the `Matches` sequence, but will be available in the
    `Matches.markers` sequence (see `Markers` section).
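
The note above about configuring validators with `functools.partial` can be illustrated without ReBulk installed. In this sketch, `Match` is a stand-in namedtuple and `value_between` is a hypothetical validator. Neither is taken from `rebulk.validators`. Only the adaptation technique is the point.

```python
from collections import namedtuple
from functools import partial

# Hypothetical stand-in for a ReBulk Match; only the fields used here.
Match = namedtuple('Match', 'value start end')


def value_between(low, high, match):
    """Configurable validator: needs two settings plus the match itself."""
    return low <= int(match.value) <= high


# partial() fixes the settings, leaving a callable that accepts a single match.
in_the_eighties = partial(value_between, 1980, 1989)

print(in_the_eighties(Match('1984', 8, 12)))  # True
print(in_the_eighties(Match('1992', 8, 12)))  # False
```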

Match
=====

A `Match` object is the result created by a registered pattern.

It has a `value` property defined, and position indices are available
through the `start`, `end` and `span` properties.

In some cases, it contains children `Match` objects in its `children`
property, and each child `Match` object references its parent in its
`parent` property. Also, a `name` property can be defined for the match.

If groups are defined in a regular expression pattern, each group match
will be converted to a single `Match` object. If a group has a name
defined (`(?P<name>group)`), it is set as the `name` property of a child
`Match` object. The whole regexp match (`re.group(0)`) will be converted
to the main `Match` object, and all subgroups (1, 2, ... n) will be
converted to `children` matches of the main `Match` object.

```python
>>> matches = Rebulk() \
...         .regex(r"One, (?P<one>\w+), Two, (?P<two>\w+), Three, (?P<three>\w+)") \
...         .matches("Zero, 0, One, 1, Two, 2, Three, 3, Four, 4")
>>> matches
[<One, 1, Two, 2, Three, 3:(9, 33)>]
>>> for child in matches[0].children:
...     '%s = %s' % (child.name, child.value)
'one = 1'
'two = 2'
'three = 3'
```
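
The parent/children structure described above maps closely onto what the standard `re` module already exposes: group 0 for the whole match and numbered or named groups for the parts. The sketch below builds a plain-dict tree from a regular `re` match. `to_tree` and the dict layout are hypothetical and only mirror the idea.

```python
import re


def to_tree(m):
    """Turn an re match into a parent dict holding one child per matched group."""
    names = {index: name for name, index in m.re.groupindex.items()}
    children = []
    for i in range(1, m.re.groups + 1):
        if m.group(i) is not None:  # skip groups that did not participate
            children.append({'name': names.get(i),
                             'value': m.group(i),
                             'span': m.span(i)})
    return {'value': m.group(0), 'span': m.span(), 'children': children}


m = re.search(r"One, (?P<one>\w+), Two, (?P<two>\w+), Three, (?P<three>\w+)",
              "Zero, 0, One, 1, Two, 2, Three, 3, Four, 4")
tree = to_tree(m)
print(tree['value'], tree['span'])
for child in tree['children']:
    print('%s = %s %s' % (child['name'], child['value'], child['span']))
```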

It's possible to retrieve only children by using the `children`
parameter. You can also customize the way the structure is generated
with the `every`, `private_parent` and `private_children` parameters.

```python
>>> matches = Rebulk() \
...         .regex(r"One, (?P<one>\w+), Two, (?P<two>\w+), Three, (?P<three>\w+)", children=True) \
...         .matches("Zero, 0, One, 1, Two, 2, Three, 3, Four, 4")
>>> matches
[<1:(14, 15)+name=one+initiator=One, 1, Two, 2, Three, 3>, <2:(22, 23)+name=two+initiator=One, 1, Two, 2, Three, 3>, <3:(32, 33)+name=three+initiator=One, 1, Two, 2, Three, 3>]
```

A `Match` object has the following properties that can be given to
`Pattern` objects:

-   `formatter`

    Function to convert the `Match` value given by the pattern. Can also
    be a `dict`, to use a `formatter` with the matches named by each
    key.

    ```python
    >>> def year_formatter(value):
    ...     return int(value)
    >>> matches = Rebulk().regex(r'\d{4}', formatter=year_formatter) \
    ...                   .matches("In year 1982 ...")
    >>> isinstance(matches[0].value, int)
    True
    ```

-   `format_all`

    By default, the formatter is called for returned `Match` values
    only. Enable this option to format them all, parent and children
    included.

-   `conflict_solver`

    A `function(match, conflicting_match)` used to solve conflicts. The
    returned object will be removed from matches by the `ConflictSolver`
    default rule. If the `__default__` string is returned, it falls back
    to the default behavior of keeping the longer match.

Matches
=======

A `Matches` object holds the result of a `Rebulk.matches` method call.
It's a sequence of `Match` objects and it behaves like a list.

All methods accept a `predicate` function to filter `Match` objects
using a callable, and an `index` int to retrieve a single element from
the default returned matches.

It has the following additional methods and properties.

-   `starting(index, predicate=None, index=None)`

    Retrieves a list of `Match` objects that start at the given index.

-   `ending(index, predicate=None, index=None)`

    Retrieves a list of `Match` objects that end at the given index.

-   `previous(match, predicate=None, index=None)`

    Retrieves a list of `Match` objects that are previous and nearest to
    match.

-   `next(match, predicate=None, index=None)`

    Retrieves a list of `Match` objects that are next and nearest to
    match.

-   `tagged(tag, predicate=None, index=None)`

    Retrieves a list of `Match` objects that have the given tag defined.

-   `named(name, predicate=None, index=None)`

    Retrieves a list of `Match` objects that have the given name.

-   `range(start=0, end=None, predicate=None, index=None)`

    Retrieves a list of `Match` objects for the given range, sorted from
    start to end.

-   `holes(start=0, end=None, formatter=None, ignore=None, predicate=None, index=None)`

    Retrieves a list of *hole* `Match` objects for the given range. A
    hole match is created for each range where no match is available.

-   `conflicting(match, predicate=None, index=None)`

    Retrieves a list of `Match` objects that conflict with the given
    match.

-   `chain_before(position, seps, start=0, predicate=None, index=None)`

    Retrieves a list of chained matches, before position, matching
    predicate and separated by characters from seps only.

-   `chain_after(position, seps, end=None, predicate=None, index=None)`

    Retrieves a list of chained matches, after position, matching
    predicate and separated by characters from seps only.

-   `at_match(match, predicate=None, index=None)`

    Retrieves a list of `Match` objects at the same position as match.

-   `at_span(span, predicate=None, index=None)`

    Retrieves a list of `Match` objects from the given (start, end)
    tuple.

-   `at_index(pos, predicate=None, index=None)`

    Retrieves a list of `Match` objects from the given position.

-   `names`

    Retrieves a sequence of all `Match.name` properties.

-   `tags`

    Retrieves a sequence of all `Match.tags` properties.

-   `to_dict(details=False, first_value=False, enforce_list=False)`

    Converts to an ordered dict, with `Match.name` as key and
    `Match.value` as value.

    The returned object is a subclass of
    [OrderedDict](https://docs.python.org/2/library/collections.html#collections.OrderedDict)
    that contains a `matches` property, which is a dict with
    `Match.name` as key and a list of `Match` objects as value.

    If `first_value` is `False` and distinct values are found for the
    same name, the values are wrapped in a list. If `True`, only the
    first value is kept, and the lists of values can be retrieved with
    `values_list`, which is a dict with `Match.name` as key and a list
    of `Match.value` as value.

    If `enforce_list` is `True`, all values are wrapped in a list, even
    if a single value is found.

    If `details` is `True`, `Match.value` objects are replaced with the
    complete `Match` object.

-   `markers`

    A custom `Matches` sequence specialized for marker matches (see
    below).
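
Of the methods listed above, `holes` is perhaps the least obvious. The idea of "one hole per range where no match is available" can be sketched on bare `(start, end)` tuples. This `holes` function is a hypothetical stand-alone helper written for illustration: it ignores the `formatter`, `ignore`, `predicate` and `index` options and is not ReBulk's implementation.

```python
def holes(spans, start, end):
    """Return the (start, end) gaps inside [start, end) not covered by any span."""
    gaps = []
    cursor = start
    for span_start, span_end in sorted(spans):
        if span_start > cursor:
            gaps.append((cursor, min(span_start, end)))
        cursor = max(cursor, span_end)
        if cursor >= end:
            break
    if cursor < end:
        gaps.append((cursor, end))
    return gaps


text = "The quick brown fox jumps over the lazy dog"
matched = [(10, 15), (4, 9), (20, 25)]  # brown, quick, jumps
gaps = holes(matched, 0, len(text))
print([text[s:e] for s, e in gaps])
# ['The ', ' ', ' fox ', ' over the lazy dog']
```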

Markers
=======

If you have defined some patterns with the `marker` property, then
`Matches.markers` points to a special `Matches` sequence that contains
only marker matches. This sequence supports all methods from `Matches`.

Marker matches are not intended to be used in the final result, but can
be used to implement a `Rule`.

Rules
=====

Rules are a convenient and readable way to implement advanced
conditional logic involving several `Match` objects. When a rule is
triggered, it can perform an action on the `Matches` object, like
filtering out, adding additional tags or renaming.

Rules are implemented by extending the abstract `Rule` class. They are
registered using the `Rebulk.rules` method by giving either a `Rule`
instance, a `Rule` class or a module containing `Rule` classes only.

For a rule to be triggered, the `Rule.when` method must return `True`, a
non-empty list of `Match` objects, or any other truthy object. When
triggered, the `Rule.then` method is called to perform the action, with
the `when_response` parameter set to the response of the `Rule.when`
call.

Instead of implementing the `Rule.then` method, you can define the
`consequence` class property with a Consequence class or instance, like
`RemoveMatch`, `RenameMatch` or `AppendMatch`. You can also use a list
of consequences when required: `when_response` must then be iterable,
and the elements of this iterable will be given to each consequence in
the same order.

When many rules are registered, it can be useful to set the `priority`
class variable to define a priority integer between all rule executions
(higher priorities will be executed first). You can also define
`dependency` to declare another `Rule` class as a dependency of the
current rule, meaning that it will be executed before.

For all rules with the same `priority` value, `when` is called on each
of them first, and `then` is called only after all of those `when`
calls.

```python
>>> from rebulk import Rule, RemoveMatch

>>> class FirstOnlyRule(Rule):
...     consequence = RemoveMatch
...
...     def when(self, matches, context):
...         grabbed = matches.named("grabbed", 0)
...         if grabbed and matches.previous(grabbed):
...             return grabbed

>>> rebulk = Rebulk()

>>> rebulk.regex("This match(.*?)grabbed", name="grabbed")
<...Rebulk object ...>
>>> rebulk.regex("if it's(.*?)first match", private=True)
<...Rebulk object at ...>
>>> rebulk.rules(FirstOnlyRule)
<...Rebulk object at ...>

>>> rebulk.matches("This match is grabbed only if it's the first match")
[<This match is grabbed:(0, 21)+name=grabbed>]
>>> rebulk.matches("if it's NOT the first match, This match is NOT grabbed")
[]
```
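
The ordering rules described in this section (higher `priority` first, and within one priority every `when` before any `then`) can be sketched as a tiny scheduler. Everything here is a hypothetical stand-in written for illustration: the `Rule` base class, `execute_rules`, and the two example rules operate on a plain list of strings, and `dependency` handling is left out.

```python
from itertools import groupby


class Rule:
    """Hypothetical stand-in for an abstract rule."""
    priority = 0

    def when(self, matches, context):
        raise NotImplementedError

    def then(self, matches, when_response, context):
        raise NotImplementedError


def execute_rules(rules, matches, context=None):
    """Run rules by descending priority; per priority, all when() then all then()."""
    ordered = sorted(rules, key=lambda r: r.priority, reverse=True)
    for _, group in groupby(ordered, key=lambda r: r.priority):
        triggered = []
        for rule in group:
            response = rule.when(matches, context)
            if response:  # any truthy response triggers the rule
                triggered.append((rule, response))
        for rule, response in triggered:
            rule.then(matches, response, context)


class DropShort(Rule):
    priority = 10

    def when(self, matches, context):
        return [m for m in matches if len(m) < 3]

    def then(self, matches, when_response, context):
        for m in when_response:
            matches.remove(m)


class CountRemaining(Rule):
    def when(self, matches, context):
        return True

    def then(self, matches, when_response, context):
        context['count'] = len(matches)


found = ['la', 'lakers', 'from']
ctx = {}
execute_rules([CountRemaining(), DropShort()], found, ctx)
print(found, ctx)  # ['lakers', 'from'] {'count': 2}
```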