How to Scrape Tables from Websites and Convert to Pandas DataFrames
Web scraping is a powerful technique for data collection, and one of the most common needs is extracting tabular data. In this guide, we’ll explore how to scrape tables from websites and convert them into Pandas DataFrames for easy data manipulation and export to Excel or CSV files.
Required Libraries
To get started with table scraping, you’ll need to import the following libraries:
- requests – for making HTTP requests
- pandas – for data manipulation
- BeautifulSoup – for parsing HTML
Here’s the code to import these libraries:
import requests
import pandas as pd
from bs4 import BeautifulSoup
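If any of these aren’t installed yet, all three are available from PyPI; note that BeautifulSoup is published as the beautifulsoup4 package, so the install command is pip install requests pandas beautifulsoup4.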
Basic Table Scraping Example
Let’s start with a simple example of scraping a table containing running personal best times.
Step 1: Parse the HTML
First, we need to parse the HTML using BeautifulSoup:
# html holds the page's HTML source (e.g., response.text from a requests call)
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table')
Step 2: Extract Table Headers
Next, we extract the table headers using a list comprehension:
headers = [th.get_text(strip=True) for th in table.find_all('th')]
Step 3: Extract Table Rows
Then, we extract the table rows, skipping the header row:
rows = []
for tr in table.find_all('tr')[1:]:
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:
        rows.append(cells)
Step 4: Create a Pandas DataFrame
Finally, we create a Pandas DataFrame using the extracted headers and rows:
df = pd.DataFrame(rows, columns=headers)
This gives us a clean DataFrame with the table data, which can be easily manipulated or exported.
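Putting the four steps together, here is a complete, runnable sketch. The HTML snippet is invented sample data, purely for illustration:

import pandas as pd
from bs4 import BeautifulSoup

# Invented sample table of running personal bests
html = """
<table>
  <tr><th>Distance</th><th>Time</th></tr>
  <tr><td>5K</td><td>22:15</td></tr>
  <tr><td>10K</td><td>46:30</td></tr>
  <tr><td>Half marathon</td><td>1:43:20</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table')

headers = [th.get_text(strip=True) for th in table.find_all('th')]
rows = []
for tr in table.find_all('tr')[1:]:
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:
        rows.append(cells)

df = pd.DataFrame(rows, columns=headers)
print(df)

Running this prints a two-column DataFrame with Distance and Time, ready for manipulation or export.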
Advanced Table Scraping: Real-World Example
Let’s tackle a more complex example by scraping concert data from Wikipedia.
Handling Multiple Tables on a Page
When dealing with multiple tables on a page, we need a strategy to identify the specific table we want:
url = 'https://en.wikipedia.org/wiki/WorldWired_Tour'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'html.parser')

# Get all tables on the page
tables = soup.find_all('table')

# Find the specific table by looking for a header containing '2017'
table = None
for table_header in soup.find_all('th'):
    if '2017' in table_header.get_text():
        table = table_header.find_parent('table')
        break
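With the table located, it can be converted to a DataFrame using the same header-and-rows technique as before. Here is a minimal sketch, assuming the header cells sit in the table’s first row; merged (rowspan) cells can leave some data rows short, so short rows are padded here, though a table with leading merged cells may need more careful, rowspan-aware parsing:

# Extract headers from the first row only, to avoid picking up row-level th cells
header_row = table.find('tr')
headers = [th.get_text(strip=True) for th in header_row.find_all('th')]

rows = []
for tr in table.find_all('tr')[1:]:
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:
        # Pad short rows (caused by merged cells) to match the header count;
        # the resulting gaps are dealt with during cleaning below
        rows.append(cells + [None] * (len(headers) - len(cells)))

df = pd.DataFrame(rows, columns=headers)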
Data Cleaning Challenges
Real-world data often requires cleaning. Here are some techniques demonstrated in our example:
1. Renaming columns
df.rename(columns={'Date 2017': 'date'}, inplace=True)
2. Removing unwanted characters
df['date'] = df['date'].str.replace(r'[\[\]]', '', regex=True)
3. Forward filling missing values
When values are missing due to merged cells in the original table:
df['city'] = df['city'].ffill()
df['country'] = df['country'].ffill()
df['venue'] = df['venue'].ffill()
4. Standardizing column names
df.columns = df.columns.str.strip()
df.columns = df.columns.str.replace(' ', '_').str.lower()
5. Moving misplaced data to correct columns
Sometimes data appears in the wrong columns and needs to be relocated:
import re

def is_attendance(val):
    # Attendance figures look like comma-separated thousands, e.g. 12,345
    if pd.isna(val):
        return False
    pattern = r'\d{1,3}(?:,\d{3})+'
    return bool(re.search(pattern, str(val)))

for col in ['venue', 'opening_act']:
    mask = df[col].apply(is_attendance)
    df.loc[mask, 'attendance'] = df.loc[mask, col]
    df.loc[mask, col] = None
Exporting the Data
After cleaning, we can export the DataFrame to a CSV file:
df.to_csv('concert_data.csv', index=False)
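Since the goal stated at the outset also includes Excel, the same DataFrame can be written to a spreadsheet with to_excel (for .xlsx files, pandas needs an engine such as openpyxl installed):

df.to_excel('concert_data.xlsx', index=False)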
Best Practices for Table Scraping
- Understand the HTML structure – Inspect the page to understand how tables are structured
- Target specific tables – Develop strategies to identify the exact table you need (a reusable sketch combining these practices follows this list)
- Always extract headers first – Get the column names before processing rows
- Skip header rows – When extracting rows, skip the first row if it contains headers
- Clean your data – Real-world tables often require significant cleaning
- Handle missing or misplaced data – Develop strategies for fixing data inconsistencies
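As a closing illustration, here is one way to bundle several of these practices into a reusable helper. This is a sketch of our own, not code from the examples above; the function name and the marker-text strategy are choices you can adapt:

import requests
import pandas as pd
from bs4 import BeautifulSoup

def scrape_table(url, marker_text):
    # Fetch the page and fail loudly on HTTP errors
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')

    # Target the specific table by looking for identifying text
    table = None
    for candidate in soup.find_all('table'):
        if marker_text in candidate.get_text():
            table = candidate
            break
    if table is None:
        raise ValueError(f'No table containing {marker_text!r} found')

    # Extract headers first, then skip the header row when reading data
    header_row = table.find('tr')
    headers = [th.get_text(strip=True) for th in header_row.find_all('th')]
    rows = []
    for tr in table.find_all('tr')[1:]:
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if cells:
            rows.append(cells + [None] * (len(headers) - len(cells)))

    df = pd.DataFrame(rows, columns=headers)
    # Standardize column names up front
    df.columns = df.columns.str.strip().str.replace(' ', '_').str.lower()
    return df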
Conclusion
Scraping tables from websites and converting them to Pandas DataFrames provides a powerful way to collect and analyze data. While the process may seem complex at first, following the step-by-step approach outlined in this guide will help you master this valuable skill. Remember that real-world data often requires cleaning, so be prepared to apply various data manipulation techniques to get your data into a usable format.