rvest Package Update: Powerful New Web Scraping Capabilities
Web scraping in R is evolving with new experimental features in the rvest package. The latest update includes functionality that may eventually remove the need for the RSelenium package, offering a more streamlined approach to interactive web scraping.
The New read_html_live Function
Traditional web scraping with rvest’s standard read_html() function sometimes hits roadblocks when websites actively block scraping attempts. The new experimental function read_html_live() offers a solution: it loads the page in a headless browser, which gives you browser-like capabilities without a visible browser window.
When using read_html_live(), you create a live connection to a webpage that allows for interactions similar to what RSelenium offers, including:
- Clicking on elements
- Finding HTML elements
- Pressing keyboard keys
- Typing text into fields
- Scrolling through pages
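As a rough sketch of how these interactions look in code, the helper below strings several of them together on an already opened live page. The function name demo_interactions() and the selectors passed to it are hypothetical placeholders, not taken from any real site; only the page methods themselves (type, press, scroll_by, click) come from rvest.

```r
# Hypothetical helper: run a few interactions on a live page object.
# `page` is assumed to be the object returned by rvest::read_html_live().
# `input_css` and `button_css` are placeholder CSS selectors.
demo_interactions <- function(page, input_css, button_css, query) {
  page$type(input_css, query)     # type text into a field
  page$press(input_css, "Enter")  # press a keyboard key
  page$scroll_by(top = 1000)      # scroll down by 1000 pixels
  page$click(button_css)          # click an element
  invisible(page)
}

# Example call (placeholder URL and selectors):
# page <- rvest::read_html_live("https://example.com")
# demo_interactions(page, "#search", "#load-more", "honda civic")
```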
Practical Application: Scraping Car Listings
Using Kelley Blue Book as an example, we can scrape data about used cars under $20,000. The traditional read_html approach fails with an error, but read_html_live successfully establishes a connection.
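One way to express this "try the static request first, fall back to a live session" pattern is sketched below. The helper name read_page() and its static and live arguments are assumptions made for illustration, and the URL in the usage comment is a placeholder rather than the actual listing address.

```r
# Hypothetical helper: try a plain HTTP read first, then fall back to a
# live (headless browser) session if the static request errors out.
read_page <- function(url,
                      static = rvest::read_html,
                      live = rvest::read_html_live) {
  tryCatch(
    static(url),
    error = function(e) {
      message("read_html() failed: ", conditionMessage(e))
      live(url)
    }
  )
}

# Example call (placeholder URL, substitute the listing page you want):
# page <- read_page("https://example.com/used-cars")
```

Note that the two branches return different kinds of objects (a parsed document versus a live page), so any later calls to page methods such as page$click() only apply when the fallback was used.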
When working with read_html_live, it’s important to close any previously open sessions with:
page$session$close()
The returned object is an R6 object of class ‘LiveHTML’ and contains methods for interacting with the page. Despite being a custom class, you can still use standard rvest functions such as html_elements() on it.
Handling Dynamic Content
One advantage of this approach is the ability to handle dynamically loaded content. For example, when scraping car listings that load as you scroll:
- First identify elements using CSS selectors
- Extract initial content with html_text() from rvest
- Use page$scroll_by() to scroll down and load more content
- Re-run the extraction to get the complete dataset
This solves a common problem with traditional scraping where only initially loaded content is accessible.
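The steps above can be sketched as a small loop that scrolls and re-extracts until no new items appear. This is a minimal sketch under stated assumptions: scrape_with_scrolling() is a hypothetical helper, the selector is a placeholder, and the fixed pause is a crude stand-in for waiting until new content has loaded.

```r
# Hypothetical helper: scroll a live page until the number of matched
# elements stops growing, then return the extracted text.
# `css` is a placeholder selector for one listing element.
scrape_with_scrolling <- function(page, css, step = 2000, max_scrolls = 10,
                                  wait = 1,
                                  extract = function(p) {
                                    rvest::html_text(rvest::html_elements(p, css))
                                  }) {
  items <- extract(page)                 # initially loaded content
  for (i in seq_len(max_scrolls)) {
    page$scroll_by(top = step)           # scroll down to trigger loading
    Sys.sleep(wait)                      # give new content time to appear
    updated <- extract(page)             # re-run the extraction
    if (length(updated) <= length(items)) break  # nothing new was loaded
    items <- updated
  }
  items
}

# Example call (placeholder selector):
# titles <- scrape_with_scrolling(page, "h2.listing-title")
```

Stopping when the item count no longer grows avoids scrolling forever, at the cost of stopping early if a page loads slowly; a longer wait or a larger max_scrolls may be needed on some sites.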
Navigation Between Pages
The update also allows for navigation between pages using the click method:
page$click("button[aria-label='Next page']")
After clicking, you can extract data from the new page. A change in the URL (such as ?firstRecord=25 appearing for page 2) confirms that the navigation succeeded.
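Combining the click with an extraction step gives a simple pagination loop. The sketch below assumes a hypothetical helper scrape_pages() and a placeholder item selector; the default next-page selector is the one shown in the text, and the fixed pause is only a rough way to let the next page render.

```r
# Hypothetical helper: collect text from several result pages of a live session.
# `item_css` is a placeholder selector for the elements to extract.
scrape_pages <- function(page, item_css,
                         next_css = "button[aria-label='Next page']",
                         n_pages = 3, wait = 2,
                         extract = function(p) {
                           rvest::html_text(rvest::html_elements(p, item_css))
                         }) {
  results <- vector("list", n_pages)
  for (i in seq_len(n_pages)) {
    results[[i]] <- extract(page)  # data from the current page
    if (i < n_pages) {
      page$click(next_css)         # move to the next page of results
      Sys.sleep(wait)              # crude pause so the new page can render
    }
  }
  results
}

# Example call (placeholder selector):
# listings <- scrape_pages(page, "h2.listing-title", n_pages = 2)
```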
Live Viewing of Sessions
For debugging or monitoring purposes, the page$view() method opens a new browser tab showing the current state of your scraping session. This feature, while still slightly unstable, provides visual confirmation of your scraping progress.
Closing Sessions
When finished with scraping, properly close the session with:
page$session$close()
This disconnects the websocket and frees up resources.
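To avoid leaving sessions open when a script fails partway through, the close call can be registered with on.exit(). The wrapper below is a minimal sketch; with_live_page() and its open argument are hypothetical names introduced here, not part of rvest.

```r
# Hypothetical wrapper: open a live page, run `fn` on it, and always close
# the session afterwards, even if `fn` throws an error.
with_live_page <- function(url, fn, open = rvest::read_html_live) {
  page <- open(url)
  on.exit(page$session$close(), add = TRUE)  # runs on normal exit and on error
  fn(page)
}

# Example call (placeholder URL and selector):
# titles <- with_live_page("https://example.com", function(page) {
#   rvest::html_text(rvest::html_elements(page, "h2"))
# })
```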
Future Potential
While still experimental, these new rvest features show great promise for simplifying web scraping workflows in R. The ability to handle dynamic content and interact with pages without a separate browser-automation package such as RSelenium could significantly streamline data collection.
As development continues and stability improves, these capabilities may change how R users approach web scraping, potentially making rvest the main package needed for most web scraping projects.