# rematch2

> Match Regular Expressions with a Nicer 'API'

[![Linux Build Status](https://travis-ci.org/r-lib/rematch2.svg?branch=master)](https://travis-ci.org/r-lib/rematch2)
[![Windows Build status](https://ci.appveyor.com/api/projects/status/github/r-lib/rematch2?svg=true)](https://ci.appveyor.com/project/gaborcsardi/rematch2)
[![](http://www.r-pkg.org/badges/version/rematch2)](http://www.r-pkg.org/pkg/rematch2)
[![CRAN RStudio mirror downloads](http://cranlogs.r-pkg.org/badges/rematch2)](http://www.r-pkg.org/pkg/rematch2)
[![Coverage Status](https://img.shields.io/codecov/c/github/r-lib/rematch2/master.svg)](https://codecov.io/github/r-lib/rematch2?branch=master)

A small wrapper on regular expression matching functions `regexpr`
and `gregexpr` to return the results in tidy data frames.

---

  - [Installation](#installation)
  - [Rematch vs rematch2](#rematch-vs-rematch2)
  - [Usage](#usage)
     - [First match](#first-match)
     - [All matches](#all-matches)
     - [Match positions](#match-positions)
  - [License](#license)

## Installation


```r
install.packages("rematch2")
```

## Rematch vs rematch2

Note that `rematch2` is not compatible with the original `rematch` package.
There are at least three major changes:
* The order of the arguments for the functions is different. In
  `rematch2` the `text` vector is first, and `pattern` is second.
* In the result, `.match` is the last column instead of the first.
* `rematch2` returns `tibble` data frames. See
  https://github.com/hadley/tibble.
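
The three changes listed above can be checked with a short call. This is a minimal sketch, assuming `rematch2` is installed; the input vector and the pattern are made up for illustration, and `res` is just a placeholder name.

```r
library(rematch2)

# Positional call: the text vector comes first, the pattern second.
res <- re_match(
  c("2016-04-20", "not a date"),
  "(?<year>[0-9]{4})-(?<month>[0-9]{2})"
)

# Capture-group columns come first, followed by .text and, last, .match.
names(res)

# The result is a tibble, which is still a regular data frame underneath.
class(res)
is.data.frame(res)
```

Naming the arguments, as in `re_match(text = ..., pattern = ...)`, avoids depending on their order when porting code from the original `rematch`.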

## Usage

### First match


```r
library(rematch2)
```

With capture groups:

```r
dates <- c("2016-04-20", "1977-08-08", "not a date", "2016",
  "76-03-02", "2012-06-30", "2015-01-21 19:58")
isodate <- "([0-9]{4})-([0-1][0-9])-([0-3][0-9])"
re_match(text = dates, pattern = isodate)
```

```
#> # A tibble: 7 x 5
#>   ``    ``    ``    .text            .match
#>   <chr> <chr> <chr> <chr>            <chr>
#> 1 2016  04    20    2016-04-20       2016-04-20
#> 2 1977  08    08    1977-08-08       1977-08-08
#> 3 <NA>  <NA>  <NA>  not a date       <NA>
#> 4 <NA>  <NA>  <NA>  2016             <NA>
#> 5 <NA>  <NA>  <NA>  76-03-02         <NA>
#> 6 2012  06    30    2012-06-30       2012-06-30
#> 7 2015  01    21    2015-01-21 19:58 2015-01-21
```

Named capture groups:

```r
isodaten <- "(?<year>[0-9]{4})-(?<month>[0-1][0-9])-(?<day>[0-3][0-9])"
re_match(text = dates, pattern = isodaten)
```

```
#> # A tibble: 7 x 5
#>   year  month day   .text            .match
#>   <chr> <chr> <chr> <chr>            <chr>
#> 1 2016  04    20    2016-04-20       2016-04-20
#> 2 1977  08    08    1977-08-08       1977-08-08
#> 3 <NA>  <NA>  <NA>  not a date       <NA>
#> 4 <NA>  <NA>  <NA>  2016             <NA>
#> 5 <NA>  <NA>  <NA>  76-03-02         <NA>
#> 6 2012  06    30    2012-06-30       2012-06-30
#> 7 2015  01    21    2015-01-21 19:58 2015-01-21
```

A slightly more complex example:

```r
github_repos <- c(
  "metacran/crandb",
  "jeroenooms/curl@v0.9.3",
  "jimhester/covr#47",
  "hadley/dplyr@*release",
  "r-lib/remotes@550a3c7d3f9e1493a2ba",
  "/$&@R64&3"
)
owner_rx <- "(?:(?<owner>[^/]+)/)?"
repo_rx <- "(?<repo>[^/@#]+)"
subdir_rx <- "(?:/(?<subdir>[^@#]*[^@#/]))?"
ref_rx <- "(?:@(?<ref>[^*].*))"
pull_rx <- "(?:#(?<pull>[0-9]+))"
release_rx <- "(?:@(?<release>[*]release))"

subtype_rx <- sprintf("(?:%s|%s|%s)?", ref_rx, pull_rx, release_rx)
github_rx <- sprintf(
  "^(?:%s%s%s%s|(?<catchall>.*))$",
  owner_rx, repo_rx, subdir_rx, subtype_rx
)
re_match(text = github_repos, pattern = github_rx)
```

```
#> # A tibble: 6 x 9
#>   owner      repo    subdir ref                  pull  release  catchall
#>   <chr>      <chr>   <chr>  <chr>                <chr> <chr>    <chr>
#> 1 metacran   crandb
#> 2 jeroenooms curl           v0.9.3
#> 3 jimhester  covr                                47
#> 4 hadley     dplyr                                     *release
#> 5 r-lib      remotes        550a3c7d3f9e1493a2ba
#> 6                                                               /$&@R64&3
#> # ... with 2 more variables: .text <chr>, .match <chr>
```

### All matches

Extract all names, and also first names and last names:


```r
name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
not <- re_match_all(notables, name_rex)
not
```

```
#> # A tibble: 2 x 4
#>   first     last      .text                            .match
#>   <list>    <list>    <chr>                            <list>
#> 1 <chr [2]> <chr [2]> Ben Franklin and Jefferson Davis <chr [2]>
#> 2 <chr [1]> <chr [1]> "\tMillard Fillmore"             <chr [1]>
```


```r
not$first
```

```
#> [[1]]
#> [1] "Ben"       "Jefferson"
#>
#> [[2]]
#> [1] "Millard"
```

```r
not$last
```

```
#> [[1]]
#> [1] "Franklin" "Davis"
#>
#> [[2]]
#> [1] "Fillmore"
```

```r
not$.match
```

```
#> [[1]]
#> [1] "Ben Franklin"    "Jefferson Davis"
#>
#> [[2]]
#> [1] "Millard Fillmore"
```

### Match positions

`re_exec` and `re_exec_all` are similar to `re_match` and `re_match_all`,
but they also return match positions. These functions return match
records.
A match record has three components: `match`, `start` and `end`.
Each component can be a vector, so a match record is similar to a
data frame in this respect.


```r
pos <- re_exec(notables, name_rex)
pos
```

```
#> # A tibble: 2 x 4
#>   first      last       .text                            .match
#> * <list>     <list>     <chr>                            <list>
#> 1 <list [3]> <list [3]> Ben Franklin and Jefferson Davis <list [3]>
#> 2 <list [3]> <list [3]> "\tMillard Fillmore"             <list [3]>
```

Unfortunately R does not allow hierarchical data frames (i.e. a column of a
data frame cannot be another data frame), but `rematch2` defines some
special classes and a `$` operator to make it easier to extract parts
of `re_exec` and `re_exec_all` matches. You simply query the `match`,
`start` or `end` part of a column:


```r
pos$first$match
```

```
#> [1] "Ben"     "Millard"
```

```r
pos$first$start
```

```
#> [1] 3 2
```

```r
pos$first$end
```

```
#> [1] 5 8
```

`re_exec_all` is very similar, but these queries return lists, with
an arbitrary number of matches per element:


```r
allpos <- re_exec_all(notables, name_rex)
allpos
```

```
#> # A tibble: 2 x 4
#>   first      last       .text                            .match
#>   <list>     <list>     <chr>                            <list>
#> 1 <list [3]> <list [3]> Ben Franklin and Jefferson Davis <list [3]>
#> 2 <list [3]> <list [3]> "\tMillard Fillmore"             <list [3]>
```


```r
allpos$first$match
```

```
#> [[1]]
#> [1] "Ben"       "Jefferson"
#>
#> [[2]]
#> [1] "Millard"
```

```r
allpos$first$start
```

```
#> [[1]]
#> [1]  3 20
#>
#> [[2]]
#> [1] 2
```

```r
allpos$first$end
```

```
#> [[1]]
#> [1]  5 28
#>
#> [[2]]
#> [1] 8
```

## License

MIT © Mango Solutions, Gábor Csárdi