Data Scraping Techniques: Converting Web Tables and PDFs to Usable Spreadsheets
Data scraping has become an essential skill for journalists and researchers who need to extract information from websites and PDFs that don’t offer easy download options. This technique allows professionals to transform difficult-to-use data formats into workable spreadsheets for analysis.
What is Data Scraping?
Data scraping is the process of extracting information from web pages or documents and converting it into a format that can be easily analyzed, like spreadsheets or text documents. Many government agencies publish data online to fulfill transparency requirements, but often in formats that make analysis difficult. Scraping bridges this gap.
Scraping Web Tables
One of the most accessible methods for scraping tables from websites is Google Sheets’ IMPORTHTML function. This formula pulls data directly from a web page and updates automatically when the source changes.
The basic formula structure is:
=IMPORTHTML("URL", "table", element_number)
Where:
- URL is the web address containing the data
- “table” specifies you’re importing a table (the function also accepts “list” for HTML lists)
- element_number identifies which table on the page to import (numbering starts at 1)
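For example, to pull the second table from a page, the formula would look like this (the URL here is purely illustrative):

=IMPORTHTML("https://example.com/city-budget", "table", 2)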
Once the formula is pasted into cell A1 of a Google Sheet, the data populates automatically. To work with this data without disrupting the live connection, users should:
- Select all data by clicking the blank rectangle in the upper left corner of the grid, where the row numbers and column letters meet
- Copy the selection
- Create a new sheet tab
- Use “Paste Special > Values Only” to strip out the formula
This creates an editable copy while preserving the original scraped data.
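The same snapshot idea can be scripted outside Google Sheets. Below is a minimal Python sketch using pandas, assuming the page’s tables parse cleanly; the URL and filename are placeholders, and note that pandas numbers tables from 0 while IMPORTHTML starts at 1:

import pandas as pd  # also requires an HTML parser such as lxml or html5lib

# Fetch every <table> element on the page; read_html returns a list of DataFrames.
url = "https://example.com/city-budget"  # placeholder URL
tables = pd.read_html(url)

# Save the first table as a static, formula-free copy (the "values only" step).
tables[0].to_csv("snapshot.csv", index=False)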
Scraping PDFs
PDFs present special challenges for data extraction. Even when organizations are asked to provide data in spreadsheet format, they often deliver PDFs instead. Two effective tools for converting PDF tables to spreadsheets are:
1. Tabula
This desktop application allows users to:
- Import PDF files locally (enhancing security)
- Select specific tables within the document
- Export selected data as CSV files
Since the extraction happens locally rather than online, Tabula is particularly useful for sensitive documents.
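For scripted workflows, a community-maintained Python wrapper called tabula-py exposes the same extraction engine (it requires a Java runtime). A minimal sketch, assuming a local file named report.pdf:

import tabula  # pip install tabula-py; requires Java to be installed

# Extract every table in the PDF locally; nothing is uploaded anywhere.
# read_pdf returns a list of pandas DataFrames, one per detected table.
tables = tabula.read_pdf("report.pdf", pages="all")
print(f"Found {len(tables)} tables")

# Alternatively, convert all detected tables straight to a CSV file.
tabula.convert_into("report.pdf", "report.csv", output_format="csv", pages="all")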
2. PDF2Excel
This browser-based tool offers a simpler but less secure approach:
- Upload PDFs through the website
- Automatically convert tables to Excel format
- Download the resulting spreadsheet
While convenient, PDF2Excel temporarily stores uploaded documents on its servers, making it less suitable for confidential information.
Best Practices
When working with scraped data, several practices help maintain data integrity:
- Always title spreadsheets clearly for easy retrieval
- Label sheets as “Original” and “Edited” to preserve raw data
- Create data moats (empty rows or columns) to separate totals from individual entries
- Verify totals against the original source
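That last check is straightforward to script. Here is a minimal sketch, assuming the scraped data sits in a file named scraped.csv with an Amount column; the column name and the published figure are illustrative:

import pandas as pd

PUBLISHED_TOTAL = 1_234_567  # hypothetical total copied from the source document

df = pd.read_csv("scraped.csv")
computed = df["Amount"].sum()  # "Amount" is an illustrative column name

if computed == PUBLISHED_TOTAL:
    print("Totals match the original source.")
else:
    print(f"Mismatch: computed {computed}, source reports {PUBLISHED_TOTAL}")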
These techniques ensure that even when government agencies or other sources publish data in challenging formats, analysts can still access and work with the information effectively.
As public records become increasingly important for accountability and research, data scraping provides a crucial bridge between published information and meaningful analysis.