<!-- README.md is generated from README.Rmd. Please edit that file -->

# rvest <img src="man/figures/logo.png" align="right" height="139"/>

<!-- badges: start -->

[![CRAN
status](https://www.r-pkg.org/badges/version/rvest)](https://cran.r-project.org/package=rvest)
[![R-CMD-check](https://github.com/tidyverse/rvest/workflows/R-CMD-check/badge.svg)](https://github.com/tidyverse/rvest/actions)
[![Codecov test
coverage](https://codecov.io/gh/tidyverse/rvest/branch/master/graph/badge.svg)](https://app.codecov.io/gh/tidyverse/rvest?branch=master)

<!-- badges: end -->

## Overview

rvest helps you scrape (or harvest) data from web pages. It is designed
to work with [magrittr](https://github.com/tidyverse/magrittr) to make
it easy to express common web scraping tasks, inspired by libraries like
[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) and
[RoboBrowser](http://robobrowser.readthedocs.io/en/latest/readme.html).

If you’re scraping multiple pages, I highly recommend using rvest in
concert with [polite](https://dmi3kno.github.io/polite/). The polite
package ensures that you’re respecting the site’s
[robots.txt](https://en.wikipedia.org/wiki/Robots_exclusion_standard)
and not hammering the site with too many requests.
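As a rough illustration of that recommendation, the sketch below wraps polite's `bow()` and `scrape()` in a small helper. The helper name `scrape_politely()` and the user-agent string are assumptions made for this example, not part of rvest or polite, and the code assumes the polite package is installed.

``` r
library(rvest)

# Hypothetical helper (not part of rvest or polite): fetch one page while
# letting polite check robots.txt and rate-limit the request.
scrape_politely <- function(url, user_agent = "my-scraper (me@example.com)") {
  # bow() introduces the scraper to the host and reads its robots.txt
  session <- polite::bow(url, user_agent = user_agent)
  # scrape() performs the rate-limited request; for HTML pages the result
  # can be passed on to rvest functions such as html_elements()
  polite::scrape(session)
}

# Usage (requires network access, so it is left commented out):
# page <- scrape_politely("https://rvest.tidyverse.org/articles/starwars.html")
# page %>% html_elements("section")
```

When looping over many pages, calling a helper like this once per URL keeps the request logic in one place.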

## Installation

``` r
# The easiest way to get rvest is to install the whole tidyverse:
install.packages("tidyverse")

# Alternatively, install just rvest:
install.packages("rvest")
```

## Usage

``` r
library(rvest)

# Start by reading an HTML page with read_html():
starwars <- read_html("https://rvest.tidyverse.org/articles/starwars.html")

# Then find elements that match a CSS selector or XPath expression
# using html_elements(). In this example, each <section> corresponds
# to a different film.
films <- starwars %>% html_elements("section")
films
#> {xml_nodeset (7)}
#> [1] <section><h2 data-id="1">\nThe Phantom Menace\n</h2>\n<p>\nReleased: 1999 ...
#> [2] <section><h2 data-id="2">\nAttack of the Clones\n</h2>\n<p>\nReleased: 20 ...
#> [3] <section><h2 data-id="3">\nRevenge of the Sith\n</h2>\n<p>\nReleased: 200 ...
#> [4] <section><h2 data-id="4">\nA New Hope\n</h2>\n<p>\nReleased: 1977-05-25\n ...
#> [5] <section><h2 data-id="5">\nThe Empire Strikes Back\n</h2>\n<p>\nReleased: ...
#> [6] <section><h2 data-id="6">\nReturn of the Jedi\n</h2>\n<p>\nReleased: 1983 ...
#> [7] <section><h2 data-id="7">\nThe Force Awakens\n</h2>\n<p>\nReleased: 2015- ...

# Then use html_element() to extract one element per film. Here
# the title is given by the text inside <h2>.
title <- films %>%
  html_element("h2") %>%
  html_text2()
title
#> [1] "The Phantom Menace"      "Attack of the Clones"
#> [3] "Revenge of the Sith"     "A New Hope"
#> [5] "The Empire Strikes Back" "Return of the Jedi"
#> [7] "The Force Awakens"

# Or use html_attr() to get data out of attributes. html_attr() always
# returns a string, so we convert it to an integer using a readr function.
episode <- films %>%
  html_element("h2") %>%
  html_attr("data-id") %>%
  readr::parse_integer()
episode
#> [1] 1 2 3 4 5 6 7
```
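The difference between `html_elements()` and `html_element()` is easiest to see when some elements are missing. The sketch below builds a small inline document with `minimal_html()` (available in rvest 1.0.0 and later), so it runs without network access. The HTML fragment, including the `released` class, is made up for illustration and is not the markup of the real Star Wars page.

``` r
library(rvest)

# Made-up fragment: the second film has no release date paragraph
html <- minimal_html('
  <section><h2 data-id="4">A New Hope</h2><p class="released">1977-05-25</p></section>
  <section><h2 data-id="5">The Empire Strikes Back</h2></section>
')

films <- html %>% html_elements("section")

# html_element() returns exactly one result per input element,
# using NA where nothing matches
released_each <- films %>%
  html_element(".released") %>%
  html_text2()

# html_elements() returns every match, so films without a match
# simply contribute nothing
released_all <- films %>%
  html_elements(".released") %>%
  html_text2()

# html_attr() returns strings; base R's as.integer() is used here
# instead of readr to keep the sketch dependency-free
episode <- films %>%
  html_element("h2") %>%
  html_attr("data-id") %>%
  as.integer()
```

Because `html_element()` keeps one entry per film, its results line up with `episode` and can be combined into a data frame; the output of `html_elements()` cannot be aligned this way once a match is missing.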

If the page contains tabular data you can convert it directly to a data
frame with `html_table()`:

``` r
html <- read_html("https://en.wikipedia.org/w/index.php?title=The_Lego_Movie&oldid=998422565")

html %>%
  html_element(".tracklist") %>%
  html_table()
#> # A tibble: 29 × 4
#>    No.   Title                       `Performer(s)`                       Length
#>    <chr> <chr>                       <chr>                                <chr>
#>  1 1.    "\"Everything Is Awesome\"" "Tegan and Sara featuring The Lonel… 2:43
#>  2 2.    "\"Prologue\""              ""                                   2:28
#>  3 3.    "\"Emmett's Morning\""      ""                                   2:00
#>  4 4.    "\"Emmett Falls in Love\""  ""                                   1:11
#>  5 5.    "\"Escape\""                ""                                   3:26
#>  6 6.    "\"Into the Old West\""     ""                                   1:00
#>  7 7.    "\"Wyldstyle Explains\""    ""                                   1:21
#>  8 8.    "\"Emmett's Mind\""         ""                                   2:17
#>  9 9.    "\"The Transformation\""    ""                                   1:46
#> 10 10.   "\"Saloons and Wagons\""    ""                                   3:38
#> # … with 19 more rows
```
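To try `html_table()` without fetching a live page, a minimal sketch can parse an inline table built with `minimal_html()`. The markup below is a simplified stand-in that reuses two rows from the track listing above; it is not the actual Wikipedia HTML.

``` r
library(rvest)

# Simplified stand-in for the track listing (not the real Wikipedia markup)
page <- minimal_html("
  <table class='tracklist'>
    <tr><th>No.</th><th>Title</th><th>Length</th></tr>
    <tr><td>1.</td><td>Everything Is Awesome</td><td>2:43</td></tr>
    <tr><td>2.</td><td>Prologue</td><td>2:28</td></tr>
  </table>
")

tracks <- page %>%
  html_element(".tracklist") %>%
  html_table()

# The <th> cells in the first row become the column names,
# and each remaining <tr> becomes one row
names(tracks)
nrow(tracks)
```

Selecting a single table with `html_element()` yields one data frame; calling `html_table()` on the result of `html_elements("table")` would instead return a list with one data frame per table.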

## Code of Conduct

Please note that the rvest project is released with a [Contributor Code
of Conduct](https://rvest.tidyverse.org/CODE_OF_CONDUCT.html). By
contributing to this project, you agree to abide by its terms.