# rematch2

> Match Regular Expressions with a Nicer 'API'

[![Linux Build Status](https://travis-ci.org/r-lib/rematch2.svg?branch=master)](https://travis-ci.org/r-lib/rematch2)
[![Windows Build status](https://ci.appveyor.com/api/projects/status/github/r-lib/rematch2?svg=true)](https://ci.appveyor.com/project/gaborcsardi/rematch2)
[![](http://www.r-pkg.org/badges/version/rematch2)](http://www.r-pkg.org/pkg/rematch2)
[![CRAN RStudio mirror downloads](http://cranlogs.r-pkg.org/badges/rematch2)](http://www.r-pkg.org/pkg/rematch2)
[![Coverage Status](https://img.shields.io/codecov/c/github/r-lib/rematch2/master.svg)](https://codecov.io/github/r-lib/rematch2?branch=master)

A small wrapper around the regular expression matching functions `regexpr`
and `gregexpr` that returns the results in tidy data frames.

---

  - [Installation](#installation)
  - [Rematch vs rematch2](#rematch-vs-rematch2)
  - [Usage](#usage)
      - [First match](#first-match)
      - [All matches](#all-matches)
      - [Match positions](#match-positions)
  - [License](#license)

## Installation


```r
install.packages("rematch2")
```

## Rematch vs rematch2

Note that `rematch2` is not compatible with the original `rematch` package.
There are at least three major changes:
* The order of the arguments for the functions is different. In
  `rematch2` the `text` vector is first, and `pattern` is second.
* In the result, `.match` is the last column instead of the first.
* `rematch2` returns `tibble` data frames. See
  https://github.com/hadley/tibble.

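To make the first two points concrete, here is a minimal sketch of a call that uses the `rematch2` argument order. The sample strings and the group names `letter` and `digit` are made up for illustration.

```r
library(rematch2)

# Text comes first and the pattern second (the original rematch package
# expects them the other way around).
res <- re_match(
  text = c("a1", "b2"),
  pattern = "(?<letter>[a-z])(?<digit>[0-9])"
)

# Capture-group columns come first; .text and .match are the last two.
names(res)
```
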
## Usage

### First match


```r
library(rematch2)
```

With capture groups:

```r
dates <- c("2016-04-20", "1977-08-08", "not a date", "2016",
  "76-03-02", "2012-06-30", "2015-01-21 19:58")
isodate <- "([0-9]{4})-([0-1][0-9])-([0-3][0-9])"
re_match(text = dates, pattern = isodate)
```

```
#> # A tibble: 7 x 5
#>      ``    ``    ``            .text     .match
#>   <chr> <chr> <chr>            <chr>      <chr>
#> 1  2016    04    20       2016-04-20 2016-04-20
#> 2  1977    08    08       1977-08-08 1977-08-08
#> 3  <NA>  <NA>  <NA>       not a date       <NA>
#> 4  <NA>  <NA>  <NA>             2016       <NA>
#> 5  <NA>  <NA>  <NA>         76-03-02       <NA>
#> 6  2012    06    30       2012-06-30 2012-06-30
#> 7  2015    01    21 2015-01-21 19:58 2015-01-21
```

Named capture groups:

```r
isodaten <- "(?<year>[0-9]{4})-(?<month>[0-1][0-9])-(?<day>[0-3][0-9])"
re_match(text = dates, pattern = isodaten)
```

```
#> # A tibble: 7 x 5
#>    year month   day            .text     .match
#>   <chr> <chr> <chr>            <chr>      <chr>
#> 1  2016    04    20       2016-04-20 2016-04-20
#> 2  1977    08    08       1977-08-08 1977-08-08
#> 3  <NA>  <NA>  <NA>       not a date       <NA>
#> 4  <NA>  <NA>  <NA>             2016       <NA>
#> 5  <NA>  <NA>  <NA>         76-03-02       <NA>
#> 6  2012    06    30       2012-06-30 2012-06-30
#> 7  2015    01    21 2015-01-21 19:58 2015-01-21
```
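
Because the result is an ordinary tibble, the usual data frame tools apply to it. The following is a small sketch, not part of the package itself, that keeps only the inputs that matched and converts one capture column to integers. The shortened `dates` vector and the names `res` and `matched` are chosen here for illustration.

```r
library(rematch2)

dates <- c("2016-04-20", "not a date", "2015-01-21 19:58")
isodaten <- "(?<year>[0-9]{4})-(?<month>[0-1][0-9])-(?<day>[0-3][0-9])"
res <- re_match(text = dates, pattern = isodaten)

# Inputs without a match have NA in .match (and in every capture column).
matched <- res[!is.na(res$.match), ]

# Capture columns are character vectors; convert explicitly if numbers
# are needed.
years <- as.integer(matched$year)
years
```
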

A slightly more complex example:

```r
github_repos <- c(
  "metacran/crandb",
  "jeroenooms/curl@v0.9.3",
  "jimhester/covr#47",
  "hadley/dplyr@*release",
  "r-lib/remotes@550a3c7d3f9e1493a2ba",
  "/$&@R64&3"
)
owner_rx   <- "(?:(?<owner>[^/]+)/)?"
repo_rx    <- "(?<repo>[^/@#]+)"
subdir_rx  <- "(?:/(?<subdir>[^@#]*[^@#/]))?"
ref_rx     <- "(?:@(?<ref>[^*].*))"
pull_rx    <- "(?:#(?<pull>[0-9]+))"
release_rx <- "(?:@(?<release>[*]release))"

subtype_rx <- sprintf("(?:%s|%s|%s)?", ref_rx, pull_rx, release_rx)
github_rx  <- sprintf(
  "^(?:%s%s%s%s|(?<catchall>.*))$",
  owner_rx, repo_rx, subdir_rx, subtype_rx
)
re_match(text = github_repos, pattern = github_rx)
```

```
#> # A tibble: 6 x 9
#>        owner    repo subdir                  ref  pull  release  catchall
#>        <chr>   <chr>  <chr>                <chr> <chr>    <chr>     <chr>
#> 1   metacran  crandb
#> 2 jeroenooms    curl                      v0.9.3
#> 3  jimhester    covr                                47
#> 4     hadley   dplyr                                   *release
#> 5      r-lib remotes        550a3c7d3f9e1493a2ba
#> 6                                                               /$&@R64&3
#> # ... with 2 more variables: .text <chr>, .match <chr>
```

### All matches

Extract all names, and also first names and last names:


```r
name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
# The first string starts with two spaces and the second with a tab.
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
not <- re_match_all(notables, name_rex)
not
```

```
#> # A tibble: 2 x 4
#>       first      last                              .text    .match
#>      <list>    <list>                              <chr>    <list>
#> 1 <chr [2]> <chr [2]>   Ben Franklin and Jefferson Davis <chr [2]>
#> 2 <chr [1]> <chr [1]>               "\tMillard Fillmore" <chr [1]>
```


```r
not$first
```

```
#> [[1]]
#> [1] "Ben"       "Jefferson"
#>
#> [[2]]
#> [1] "Millard"
```

```r
not$last
```

```
#> [[1]]
#> [1] "Franklin" "Davis"
#>
#> [[2]]
#> [1] "Fillmore"
```

```r
not$.match
```

```
#> [[1]]
#> [1] "Ben Franklin"    "Jefferson Davis"
#>
#> [[2]]
#> [1] "Millard Fillmore"
```
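
The list columns can be flattened with base R when one row per match is more convenient. This is only a sketch of one possible approach: the names `all_names`, `n_matches` and `flat`, and the `input` index column, are introduced here and are not part of `rematch2`.

```r
library(rematch2)

name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
all_names <- re_match_all(notables, name_rex)

# Number of matches found in each input string.
n_matches <- lengths(all_names$.match)

# One row per match; `input` records which element of `notables` it came from.
flat <- data.frame(
  input = rep(seq_along(notables), n_matches),
  first = unlist(all_names$first),
  last = unlist(all_names$last),
  stringsAsFactors = FALSE
)
flat
```
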

### Match positions

`re_exec` and `re_exec_all` are similar to `re_match` and `re_match_all`,
but they also return match positions. Instead of plain strings, these
functions return match records. A match record has three components,
`match`, `start` and `end`, and each component can be a vector. It is
similar to a data frame in this respect.


```r
pos <- re_exec(notables, name_rex)
pos
```

```
#> # A tibble: 2 x 4
#>        first       last                              .text     .match
#> *     <list>     <list>                              <chr>     <list>
#> 1 <list [3]> <list [3]>   Ben Franklin and Jefferson Davis <list [3]>
#> 2 <list [3]> <list [3]>               "\tMillard Fillmore" <list [3]>
```

Unfortunately R does not allow hierarchical data frames (i.e. a column of a
data frame cannot be another data frame), but `rematch2` defines some
special classes and a `$` operator to make it easier to extract parts
of `re_exec` and `re_exec_all` matches. You simply query the `match`,
`start` or `end` part of a column:


```r
pos$first$match
```

```
#> [1] "Ben"     "Millard"
```

```r
pos$first$start
```

```
#> [1] 3 2
```

```r
pos$first$end
```

```
#> [1] 5 8
```
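
As a quick check on how the positions relate to the text, the sketch below cuts the first names back out of the input with base R's `substring()`. It assumes, as the values above suggest, that `start` and `end` are 1-based and inclusive. The variable `first_again` is introduced here for illustration.

```r
library(rematch2)

name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
# Two leading spaces in the first string, a leading tab in the second.
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
pos <- re_exec(notables, name_rex)

# substring() also takes 1-based, inclusive first/last positions, so the
# start and end vectors can be passed through unchanged.
first_again <- substring(notables, pos$first$start, pos$first$end)
first_again
```
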

`re_exec_all` is very similar, but these queries return lists, with an
arbitrary number of matches for each input string:


```r
allpos <- re_exec_all(notables, name_rex)
allpos
```

```
#> # A tibble: 2 x 4
#>        first       last                              .text     .match
#>       <list>     <list>                              <chr>     <list>
#> 1 <list [3]> <list [3]>   Ben Franklin and Jefferson Davis <list [3]>
#> 2 <list [3]> <list [3]>               "\tMillard Fillmore" <list [3]>
```


```r
allpos$first$match
```

```
#> [[1]]
#> [1] "Ben"       "Jefferson"
#>
#> [[2]]
#> [1] "Millard"
```

```r
allpos$first$start
```

```
#> [[1]]
#> [1]  3 20
#>
#> [[2]]
#> [1] 2
```

```r
allpos$first$end
```

```
#> [[1]]
#> [1]  5 28
#>
#> [[2]]
#> [1] 8
```

## License

MIT © Mango Solutions, Gábor Csárdi