Name | Date | Size | #Lines | LOC
---|---|---|---|---
R/ | 30-Jul-2021 | - | 1,611 | 886
build/ | 03-May-2022 | - | |
demo/ | 06-Jan-2021 | - | 91 | 63
inst/ | 15-Oct-2021 | - | 601 | 455
man/ | 30-Jul-2021 | - | 706 | 607
tests/ | 19-Dec-2020 | - | 1,268 | 993
vignettes/ | 15-Oct-2021 | - | 384 | 292
DESCRIPTION | 16-Oct-2021 | 1.1 KiB | 33 | 32
LICENSE | 19-Dec-2020 | 43 B | 3 | 2
MD5 | 16-Oct-2021 | 4 KiB | 75 | 74
NAMESPACE | 15-Oct-2021 | 1.9 KiB | 80 | 78
NEWS.md | 15-Oct-2021 | 7.8 KiB | 206 | 136
README.md | 15-Oct-2021 | 4.5 KiB | 113 | 87

README.md

<!-- README.md is generated from README.Rmd. Please edit that file -->

# rvest <img src="man/figures/logo.png" align="right" height="139"/>

<!-- badges: start -->

[![CRAN
status](https://www.r-pkg.org/badges/version/rvest)](https://cran.r-project.org/package=rvest)
[![R-CMD-check](https://github.com/tidyverse/rvest/workflows/R-CMD-check/badge.svg)](https://github.com/tidyverse/rvest/actions)
[![Codecov test
coverage](https://codecov.io/gh/tidyverse/rvest/branch/master/graph/badge.svg)](https://app.codecov.io/gh/tidyverse/rvest?branch=master)

<!-- badges: end -->

## Overview

rvest helps you scrape (or harvest) data from web pages. It is designed
to work with [magrittr](https://github.com/tidyverse/magrittr) to make
it easy to express common web scraping tasks, inspired by libraries like
[beautiful soup](https://www.crummy.com/software/BeautifulSoup/) and
[RoboBrowser](http://robobrowser.readthedocs.io/en/latest/readme.html).

If you’re scraping multiple pages, I highly recommend using rvest in
concert with [polite](https://dmi3kno.github.io/polite/). The polite
package ensures that you’re respecting the
[robots.txt](https://en.wikipedia.org/wiki/Robots_exclusion_standard)
and not hammering the site with too many requests.

## Installation

``` r
# The easiest way to get rvest is to install the whole tidyverse:
install.packages("tidyverse")

# Alternatively, install just rvest:
install.packages("rvest")
```

## Usage

``` r
library(rvest)

# Start by reading an HTML page with read_html():
starwars <- read_html("https://rvest.tidyverse.org/articles/starwars.html")

# Then find elements that match a CSS selector or XPath expression
# using html_elements(). In this example, each <section> corresponds
# to a different film
films <- starwars %>% html_elements("section")
films
#> {xml_nodeset (7)}
#> [1] <section><h2 data-id="1">\nThe Phantom Menace\n</h2>\n<p>\nReleased: 1999 ...
#> [2] <section><h2 data-id="2">\nAttack of the Clones\n</h2>\n<p>\nReleased: 20 ...
#> [3] <section><h2 data-id="3">\nRevenge of the Sith\n</h2>\n<p>\nReleased: 200 ...
#> [4] <section><h2 data-id="4">\nA New Hope\n</h2>\n<p>\nReleased: 1977-05-25\n ...
#> [5] <section><h2 data-id="5">\nThe Empire Strikes Back\n</h2>\n<p>\nReleased: ...
#> [6] <section><h2 data-id="6">\nReturn of the Jedi\n</h2>\n<p>\nReleased: 1983 ...
#> [7] <section><h2 data-id="7">\nThe Force Awakens\n</h2>\n<p>\nReleased: 2015- ...

# Then use html_element() to extract one element per film. Here
# the title is given by the text inside <h2>
title <- films %>%
  html_element("h2") %>%
  html_text2()
title
#> [1] "The Phantom Menace"      "Attack of the Clones"
#> [3] "Revenge of the Sith"     "A New Hope"
#> [5] "The Empire Strikes Back" "Return of the Jedi"
#> [7] "The Force Awakens"

# Or use html_attr() to get data out of attributes. html_attr() always
# returns a string so we convert it to an integer using a readr function
episode <- films %>%
  html_element("h2") %>%
  html_attr("data-id") %>%
  readr::parse_integer()
episode
#> [1] 1 2 3 4 5 6 7
```

If the page contains tabular data you can convert it directly to a data
frame with `html_table()`:

``` r
html <- read_html("https://en.wikipedia.org/w/index.php?title=The_Lego_Movie&oldid=998422565")

html %>%
  html_element(".tracklist") %>%
  html_table()
#> # A tibble: 29 × 4
#>    No.   Title                       `Performer(s)`                       Length
#>    <chr> <chr>                       <chr>                                <chr>
#>  1 1.    "\"Everything Is Awesome\"" "Tegan and Sara featuring The Lonel… 2:43
#>  2 2.    "\"Prologue\""              ""                                   2:28
#>  3 3.    "\"Emmett's Morning\""      ""                                   2:00
#>  4 4.    "\"Emmett Falls in Love\""  ""                                   1:11
#>  5 5.    "\"Escape\""                ""                                   3:26
#>  6 6.    "\"Into the Old West\""     ""                                   1:00
#>  7 7.    "\"Wyldstyle Explains\""    ""                                   1:21
#>  8 8.    "\"Emmett's Mind\""         ""                                   2:17
#>  9 9.    "\"The Transformation\""    ""                                   1:46
#> 10 10.   "\"Saloons and Wagons\""    ""                                   3:38
#> # … with 19 more rows
```

## Code of Conduct

Please note that the rvest project is released with a [Contributor Code
of Conduct](https://rvest.tidyverse.org/CODE_OF_CONDUCT.html). By
contributing to this project, you agree to abide by its terms.
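
The Usage steps above depend on live web pages. To try the same functions offline, here is a minimal sketch, assuming rvest 1.0.0 or later (which provides `minimal_html()`). The inline HTML, the film and track names, and the variable names are invented for illustration and are not part of the package's documentation.

``` r
library(rvest)

# Hypothetical page built in memory, so no network access is needed.
page <- minimal_html('
  <section><h2 data-id="1">First film</h2><p>Released: 1999</p></section>
  <section><h2 data-id="2">Second film</h2><p>Released: 2002</p></section>
  <table>
    <tr><th>No.</th><th>Title</th></tr>
    <tr><td>1</td><td>Opening</td></tr>
    <tr><td>2</td><td>Closing</td></tr>
  </table>
')

# html_elements() returns every match: one node per <section>.
films <- page %>% html_elements("section")

# html_element() returns exactly one result per input node.
titles <- films %>%
  html_element("h2") %>%
  html_text2()

# html_attr() returns character strings, so convert explicitly.
ids <- films %>%
  html_element("h2") %>%
  html_attr("data-id") %>%
  as.integer()

# html_table() converts a <table> node into a tibble.
tracks <- page %>%
  html_element("table") %>%
  html_table()
```

Base R's `as.integer()` is used here instead of `readr::parse_integer()` only to keep the sketch free of extra dependencies; either should work for well-formed ids.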