A relatively new form of data collection is to scrape data off the internet. This involves simply directing R (or Python) to a specific web site (technically, a URL) and then collecting certain elements of one or more web pages. A web page consists of HTML elements (or “nodes”), and as long as you know which element you want to grab, it is relatively straightforward for R to do so.
You can download all the code and data for this module as an RStudio project:
Scraping with rvest
The most popular package for web scraping in R is rvest. The only challenge in using this package (and any other scraper package) is finding the name/identity of the part of the web page you are interested in.
Scraping with rvest starts with telling R which web page (URL) you want to scrape (parts of). R will then read the web page (so you have to be online for this to work), and you can then tell R which specific elements of the page you want to retain. Let’s consider a simple example.
Scraping a Wikipedia Table
Suppose we want a dataframe of the latest population counts for each official country. A table with the relevant data exists on this Wikipedia page. This is a nice, small scraping example in that what we are really interested in (the table of population counts) is just one part of a bigger web page.
We start by having R read in the relevant web page using the read_html function in rvest:
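A minimal sketch of this step is shown below; the URL is my assumption of the Wikipedia page referred to above, and the object name pop_page is just a convenient choice:

```r
library(rvest)

# Assumed: the Wikipedia page listing countries by population
url <- "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"

# Read the raw html of the page into R
pop_page <- read_html(url)
```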
Ok - so far so good. Now comes the hard part: What is the name of the table we are interested in? We can clearly see it on the Wikipedia page - but what is that table’s name in the underlying html code? There are multiple ways to get at this. First of all, we can simply query the scraped page using the html_nodes function in order to list all nodes that are tables:
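For example, using the pop_page object created above:

```r
# List every table node on the page
pop_page %>% html_nodes("table")
```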
This is a little bit of a mess, but if you inspect the output of the first table (“wikitable”) in the R console, you will see that this is the table we want. A more elegant approach is to look at the actual html code underlying the web page. In Google’s Chrome browser you can hit F12 (or Fn+F12 on some keyboards) to get to the browser’s Developer Tools. You can then click on the Elements browser (the little square with a cursor in it) and then hover over different parts of the web page to see the relevant part of the html code. The image below shows that the population table has the “wikitable” class. You can also use a browser plug-in for the same functionality. A popular Chrome extension for this purpose is Selectorgadget.
Having now identified the name/class of what we need we can pick that element out of the web page (here using pipe syntax):
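A sketch of this step, using the class name identified above (the data frame name WorldPop matches the text further below):

```r
# Grab the table with class "wikitable" and turn it into a data frame
WorldPop <- pop_page %>%
  html_node(".wikitable") %>%
  html_table(fill = TRUE)
```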
First we grab the relevant table (note that you need to add a “.” in front of the class name - or you could write (“table.wikitable”)). The last function html_table turns the html table into an R data frame (with NAs added where values are missing):
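For example, inspecting the first few rows:

```r
head(WorldPop)
```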
Python Version
Reading html tables can be accomplished using nothing more than pandas:
Here df contains the same data as the WorldPop data frame above.
Scraping user reviews from WebMD
Let us now try a more ambitious scraping project: scraping a large number of online user reviews. We will scrape reviews from prescription drug users on the website WebMD. This is a general health web site with articles, blog posts and other information aimed at helping users manage their health. One of the features of the web site is to host online reviews of a large number of prescription drugs.
Suppose we want to scrape user reviews for the drug Abilify. This is a drug prescribed to treat various mental health and mood disorders (e.g., bipolar disorder and schizophrenia). We start by navigating to the review page, which is here. Here we make a number of observations. First of all, not all reviews are listed on one page - they are distributed across a large number of pages with only 5 reviews per page. Second, a user review consists of multiple components: a 5 point rating for each of three measures, a text review (called “comment”), the date of the review, the condition of the reviewer and some demographics (gender and age).
In order to get going on this scraping project, we use the following sequential strategy:
1. First figure out how to scrape one page of reviews and write your own R function that performs this action.
2. Find out how many pages of reviews there are in total and what the naming convention is of each page.
3. Run the function defined in 1. on each page in 2. and bind all reviews together into one large data frame.
Ok - let’s start with the first point, i.e., how to scrape one page. We pick the first page of reviews. Using Chrome Developer Tools or Selectorgadget, we find that each user review (including user characteristics) is contained in a class called “userPost”. We start by scraping all userPost classes:
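A sketch of this step; the review-page URL is left as a placeholder (it is the link given above), and the object names are my own:

```r
library(rvest)

# Placeholder: the URL of the first page of Abilify reviews linked above
review_url <- "..."

# Read the page and grab every node with class "userPost"
review_page <- read_html(review_url)
reviews     <- review_page %>% html_nodes(".userPost")
```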
The resulting node set has five elements - one for each of the five user reviews on the first page:
Now we can simply pick out the relevant parts of each userPost class. For reviewer info, review date and reviewer condition we use the following (having first found the relevant class names using Developer Tools or Selectorgadget):
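The class names below are hypothetical placeholders - the real ones have to be read off the page - so this only sketches the pattern:

```r
# ".reviewer-info", ".review-date" and ".condition" are hypothetical class names;
# replace them with the ones you find with Developer Tools or Selectorgadget
reviewer_info <- reviews %>% html_nodes(".reviewer-info") %>% html_text()
review_date   <- reviews %>% html_nodes(".review-date") %>% html_text()
condition     <- reviews %>% html_nodes(".condition") %>% html_text()
```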
which gives us the corresponding text strings.
Notice that the text string for reviewer condition contains a lot of superfluous html tags. We can get rid of them by extracting the last part of the string:
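One way to do this (the exact split pattern depends on what the scraped string looks like, so treat this as a sketch):

```r
# Keep only the part after the last ">" in each condition string
condition <- sapply(strsplit(condition, ">"), function(x) trimws(tail(x, 1)))
```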
Finally we can pull out the actual reviewer comments:
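Again, the class name below is a hypothetical placeholder:

```r
# ".comment" is a hypothetical class name for the review text
comments <- review_page %>%
  html_nodes(".comment") %>%
  html_text()

length(comments)
```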
Notice that we get 10 elements here - not 5. This is because reviews on WebMD are listed both in short form (where you only see the first part of the review) and full form (where you see the full review). We only want to keep the full review which is simply every other element of the comments (i.e., the 2nd, 4th, 6th, 8th and 10th element). We pick those out and then remove the part of the text string that says “Comment” and “Hide Full Comment”:
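A sketch of that step:

```r
# Keep every other element (the full reviews) ...
full_comments <- comments[seq(2, length(comments), by = 2)]

# ... and strip the "Comment" and "Hide Full Comment" text
full_comments <- trimws(gsub("Hide Full Comment|Comment", "", full_comments))
```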
We also want to retain the star ratings on each of the three measures (“Effectiveness”, “Ease of Use”, “Satisfaction”). The html class for these is “current-rating”, so we pick out those first:
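For example:

```r
# Grab the star ratings (three per review, so 15 strings for this page)
ratings <- review_page %>%
  html_nodes(".current-rating") %>%
  html_text()
```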
We only care about the last number in the text string. There are different ways of getting at that - here we use a regular expression to extract the number from the text string:
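One way to do this (a sketch; the example string in the comment is only illustrative):

```r
# Keep the last number in each rating string, e.g. "Current Rating: 4" -> 4
rating_score <- as.numeric(sub(".*?(\\d+)\\s*$", "\\1", ratings))
```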
To match each rating with the correct category we also retain the category names:
For later use it will be useful to store this data as a data frame with one row per review and a column for each of the three measures. We can do this with the following code (note that the spread command turns a “tall and thin” data frame into a “short and wide” data frame):
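A sketch of this step, assuming rating_score from above and a vector rating_category holding the corresponding measure names, both ordered review by review:

```r
library(dplyr)
library(tidyr)

ratings_df <- data.frame(
  review_id = rep(1:5, each = 3),   # 5 reviews with 3 measures each
  category  = rating_category,
  score     = rating_score
) %>%
  spread(category, score)           # one row per review, one column per measure
```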
Finally, we may also be interested in the number of users who found a particular review helpful:
At this point we have figured out how to scrape everything we need from one page. Let’s put all this into a function that we can call repeatedly:
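A condensed sketch of such a function, pulling together the steps above (the selectors marked as hypothetical must be replaced with the actual class names, and the order of the three rating measures is assumed):

```r
scrape_review_page <- function(url) {
  page <- read_html(url)

  # Reviewer information (".reviewer-info", ".review-date" and ".condition"
  # are hypothetical selectors - use the class names found on the actual page)
  info      <- page %>% html_nodes(".reviewer-info") %>% html_text()
  date      <- page %>% html_nodes(".review-date") %>% html_text()
  condition <- page %>% html_nodes(".condition") %>% html_text()

  # Full review text: every other element, with the boilerplate text removed
  comments <- page %>% html_nodes(".comment") %>% html_text()
  comments <- comments[seq(2, length(comments), by = 2)]
  comments <- trimws(gsub("Hide Full Comment|Comment", "", comments))

  # Star ratings: keep the last number in each rating string
  ratings <- page %>% html_nodes(".current-rating") %>% html_text()
  scores  <- as.numeric(sub(".*?(\\d+)\\s*$", "\\1", ratings))

  # Assumes the three measures appear in a fixed order for every review
  data.frame(
    info          = info,
    date          = date,
    condition     = condition,
    comment       = comments,
    effectiveness = scores[seq(1, length(scores), by = 3)],
    ease_of_use   = scores[seq(2, length(scores), by = 3)],
    satisfaction  = scores[seq(3, length(scores), by = 3)],
    stringsAsFactors = FALSE
  )
}
```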
Now we have to find out how many pages of reviews we need to scrape. To figure this out, note that each page lists the total number of reviews available (as an html node called “postPaging”):
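For example (assuming “postPaging” is a class name):

```r
# The element holding the total review count
paging_text <- review_page %>%
  html_node(".postPaging") %>%
  html_text(trim = TRUE)
```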
We get this number from the first page of reviews. We also break up the URL into two components. The first part is the base URL for the relevant drug. The second contains the page index for each page of reviews (called “pageIndex”). This is just an integer that counts up one at a time. This way we can easily traverse all review pages in increments of one.
To extract the review count from this string, we split it and grab the relevant number:
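A sketch (exactly which piece holds the number depends on how the paging text is formatted):

```r
# Split the paging text into words and keep the first piece that is a number
pieces    <- strsplit(paging_text, " ")[[1]]
n_reviews <- as.numeric(pieces[grepl("^[0-9]+$", pieces)][1])
```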
Over how many pages are these 1810 reviews distributed? Well, this is easy since there are 5 reviews per page (we just have to be careful to get the right count in case the total isn’t divisible by 5):
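For example:

```r
# 5 reviews per page; round up in case the total isn't divisible by 5
n_pages <- ceiling(n_reviews / 5)
```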
Ok - now we are done - we can scrape all pages:
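A sketch of that loop; the way pageIndex enters the URL (and whether it starts at 0 or 1) should be checked against the site, so treat the URL construction as an assumption:

```r
# Base URL for the drug's review pages (placeholder - see the text above)
base_url <- "..."

all_reviews <- lapply(0:(n_pages - 1), function(i) {
  scrape_review_page(paste0(base_url, "&pageIndex=", i))
})
all_reviews <- dplyr::bind_rows(all_reviews)
```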
At this point, I would put all the code into one function:
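A compact sketch of such a wrapper, reusing scrape_review_page() from above (the URL construction is the same assumption as before):

```r
scrape_drug_reviews <- function(base_url) {
  # Read the first page to find the total number of reviews
  first_page <- read_html(paste0(base_url, "&pageIndex=0"))
  paging     <- first_page %>% html_node(".postPaging") %>% html_text(trim = TRUE)
  pieces     <- strsplit(paging, " ")[[1]]
  n_reviews  <- as.numeric(pieces[grepl("^[0-9]+$", pieces)][1])
  n_pages    <- ceiling(n_reviews / 5)

  # Scrape every page and bind the results into one data frame
  dplyr::bind_rows(lapply(0:(n_pages - 1), function(i) {
    scrape_review_page(paste0(base_url, "&pageIndex=", i))
  }))
}
```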
We can now run the scraper on as many drugs as we like:
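For example (the base URLs are placeholders for the drugs you want):

```r
# Placeholder base URLs for each drug's review pages
drug_urls <- c(drug_a = "...", drug_b = "...")

all_drugs <- lapply(drug_urls, scrape_drug_reviews)
```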
Python Version
More involved scraping tasks like the one above can be accomplished using the beautifulsoup module in Python. When scraping with beautifulsoup you will often need to pretend to be a browser - otherwise the website might not allow you access. First we set up a browser and then scrape the contents of the web page:
The content of the web page is now contained in the object soup. At this point we are back to where we were above using rvest. You simply pick out the relevant content that you are interested in retaining. The key function in beautifulsoup is find_all (this is similar to html_nodes used in R above). The inputs to the function are the tags you want to find and any specific attributes (e.g., a class or id). In the following we will find and retain each reviewer’s condition, information and review comments:
The scraped information contains html tags that we don’t want, so we extract only the text and put the result in a list:
Recall from above that the review text for this web page has two text fields for each review - one short and one longer containing the full review. This means that the 5 reviews on the first page result in a list of length 10. Every other element is either a short or long review, so we extract 5 short reviews and 5 full reviews from this list.
Finally, we collect everything and save it in a pandas data frame:
You can now repeat what we did above: put this into a function (maybe add other information you want to scrape) and then repeat the function for each web page.
Copyright © 2020 Karsten T. Hansen, All rights reserved.
rvest is a new package that makes it easy to scrape (or harvest) data from html web pages, inspired by libraries like beautiful soup. It is designed to work with magrittr so that you can express complex operations as elegant pipelines composed of simple, easily understood pieces. Install it with:
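```r
install.packages("rvest")
```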
rvest in action
To see rvest in action, imagine we’d like to scrape some information about The Lego Movie from IMDB. We start by downloading and parsing the file with html():
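A sketch of that step (the IMDB URL for The Lego Movie is filled in here for illustration):

```r
library(rvest)

# Download and parse the IMDB page for The Lego Movie
# (in current versions of rvest, read_html() replaces html())
lego_movie <- html("http://www.imdb.com/title/tt1490017/")
```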
To extract the rating, we start with selectorgadget to figure out which css selector matches the data we want: strong span. (If you haven’t heard of selectorgadget, make sure to read vignette('selectorgadget') - it’s the easiest way to determine which selector extracts the data that you’re interested in.) We use html_node() to find the first node that matches that selector, extract its contents with html_text(), and convert it to numeric with as.numeric():
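Following the steps just described:

```r
lego_movie %>%
  html_node("strong span") %>%
  html_text() %>%
  as.numeric()
```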
We use a similar process to extract the cast, using html_nodes() to find all nodes that match the selector:
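A sketch of this step; the css selector below is illustrative and would in practice be found with selectorgadget:

```r
lego_movie %>%
  html_nodes("#titleCast .itemprop span") %>%
  html_text()
```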
The titles and authors of recent message board postings are stored in the third table on the page. We can use html_nodes() and [[ to find it, then coerce it to a data frame with html_table():
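One way to do this, following that description:

```r
lego_movie %>%
  html_nodes("table") %>%
  .[[3]] %>%
  html_table()
```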
Other important functions
- If you prefer, you can use xpath selectors instead of css: html_nodes(doc, xpath = '//table//td').
- Extract the tag names with html_tag(), text with html_text(), a single attribute with html_attr() or all attributes with html_attrs().
- Detect and repair text encoding problems with guess_encoding() and repair_encoding().
- Navigate around a website as if you’re in a browser with html_session(), jump_to(), follow_link(), back(), and forward(). Extract, modify and submit forms with html_form(), set_values() and submit_form(). (This is still a work in progress, so I’d love your feedback.)
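As a small illustration of the extraction helpers (assuming doc is a parsed page, as in the first bullet above):

```r
links <- html_nodes(doc, "a")   # all link nodes on the page
html_text(links)                # the text of each link
html_attr(links, "href")        # the href attribute of each link
```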
To see these functions in action, check out package demos with demo(package = 'rvest').