How to Scrape Tables from Websites and Convert to Pandas DataFrames

Web scraping is a powerful technique for data collection, and one of the most common needs is extracting tabular data. In this guide, we’ll explore how to scrape tables from websites and convert them into Pandas DataFrames, allowing for easy data manipulation and export to Excel or CSV files.

Required Libraries

To get started with table scraping, you’ll need to import the following libraries:

  • requests – for making HTTP requests
  • pandas – for data manipulation
  • BeautifulSoup – for parsing HTML

Here’s the code to import these libraries:

import requests
import pandas as pd
from bs4 import BeautifulSoup

Basic Table Scraping Example

Let’s start with a simple example of scraping a table containing running personal best times.

Step 1: Parse the HTML

First, we need to parse the HTML using BeautifulSoup:

# 'html' holds the page source, e.g. from requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table')

Step 2: Extract Table Headers

Next, we extract the table headers using a list comprehension:

headers = [th.get_text(strip=True) for th in table.find_all('th')]

Step 3: Extract Table Rows

Then, we extract the table rows, skipping the header row:

rows = []
for tr in table.find_all('tr')[1:]:
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:
        rows.append(cells)

Step 4: Create a Pandas DataFrame

Finally, we create a Pandas DataFrame using the extracted headers and rows:

df = pd.DataFrame(rows, columns=headers)

This gives us a clean DataFrame with the table data, which can be easily manipulated or exported.
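Putting the four steps together, here’s a self-contained sketch. The small running-times table is made up for illustration, standing in for HTML fetched from a real page:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample HTML standing in for a fetched page
html = """
<table>
  <tr><th>Distance</th><th>Time</th></tr>
  <tr><td>5K</td><td>22:15</td></tr>
  <tr><td>10K</td><td>46:30</td></tr>
</table>
"""

# Step 1: parse the HTML and locate the table
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table')

# Step 2: extract the headers
headers = [th.get_text(strip=True) for th in table.find_all('th')]

# Step 3: extract the data rows, skipping the header row
rows = []
for tr in table.find_all('tr')[1:]:
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:
        rows.append(cells)

# Step 4: build the DataFrame
df = pd.DataFrame(rows, columns=headers)
print(df)
```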

Advanced Table Scraping: Real-World Example

Let’s tackle a more complex example by scraping concert data from Wikipedia.

Handling Multiple Tables on a Page

When dealing with multiple tables on a page, we need a strategy to identify the specific table we want:

url = 'https://en.wikipedia.org/wiki/WorldWired_Tour'
response = requests.get(url)
response.raise_for_status()  # fail fast on HTTP errors
html = response.text
soup = BeautifulSoup(html, 'html.parser')

# Get all tables on the page
tables = soup.find_all('table')

# Find the specific table by looking for a header containing '2017'
table = None
for table_header in soup.find_all('th'):
    if '2017' in table_header.get_text():
        table = table_header.find_parent('table')
        break
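The same header-matching strategy can be exercised on a made-up two-table snippet, which makes it easy to verify that the loop skips the first table and isolates the one we want:

```python
from bs4 import BeautifulSoup

# Hypothetical page containing two tables; we want the one
# whose header mentions '2017'
html = """
<table><tr><th>Other</th></tr><tr><td>x</td></tr></table>
<table>
  <tr><th>Date 2017</th><th>City</th></tr>
  <tr><td>February 3</td><td>Mexico City</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# Walk every header cell; the first match identifies our table
table = None
for table_header in soup.find_all('th'):
    if '2017' in table_header.get_text():
        table = table_header.find_parent('table')
        break

headers = [th.get_text(strip=True) for th in table.find_all('th')]
print(headers)
```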

Data Cleaning Challenges

Real-world data often requires cleaning. Here are some techniques demonstrated in our example:

1. Renaming columns

df.rename(columns={'Date 2017': 'date'}, inplace=True)

2. Removing unwanted characters

df['date'] = df['date'].str.replace(r'[\[\]]', '', regex=True)

3. Forward filling missing values

When values are missing due to merged cells in the original table:

df['city'] = df['city'].ffill()
df['country'] = df['country'].ffill()
df['venue'] = df['venue'].ffill()
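Forward filling propagates the last seen value downward, which mirrors how a merged cell spans several rows in the original table. A minimal sketch with made-up tour rows:

```python
import pandas as pd

# Hypothetical rows: merged cells left city/venue blank on repeat rows
df = pd.DataFrame({
    'city': ['Mexico City', None, 'Copenhagen'],
    'venue': ['Foro Sol', None, 'Royal Arena'],
})

# Fill each blank with the value from the row above
df[['city', 'venue']] = df[['city', 'venue']].ffill()
print(df['city'].tolist())
```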

4. Standardizing column names

df.columns = df.columns.str.strip()
df.columns = df.columns.str.replace(' ', '_').str.lower()

5. Moving misplaced data to correct columns

Sometimes data appears in the wrong columns and needs to be relocated:

import re

def is_attendance(val):
    """Return True if val looks like a comma-grouped attendance figure."""
    if pd.isna(val):
        return False
    pattern = r'\d{1,3}(?:,\d{3})+'
    return bool(re.search(pattern, str(val)))

for col in ['venue', 'opening_act']:
    mask = df[col].apply(is_attendance)
    df.loc[mask, 'attendance'] = df.loc[mask, col]
    df.loc[mask, col] = None
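The relocation logic can be checked on a toy DataFrame. In this made-up example, one attendance figure has landed in the venue column and gets moved back:

```python
import re
import pandas as pd

def is_attendance(val):
    # Matches comma-grouped numbers like '55,327'
    if pd.isna(val):
        return False
    return bool(re.search(r'\d{1,3}(?:,\d{3})+', str(val)))

# Hypothetical rows: row 1's attendance figure sits in 'venue'
df = pd.DataFrame({
    'venue': ['Foro Sol', '55,327'],
    'attendance': ['63,729', None],
})

# Move misplaced figures into 'attendance' and blank the source cell
mask = df['venue'].apply(is_attendance)
df.loc[mask, 'attendance'] = df.loc[mask, 'venue']
df.loc[mask, 'venue'] = None
print(df)
```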

Exporting the Data

After cleaning, we can export the DataFrame to a CSV file:

df.to_csv('concert_data.csv', index=False)

Best Practices for Table Scraping

  1. Understand the HTML structure – Inspect the page to understand how tables are structured
  2. Target specific tables – Develop strategies to identify the exact table you need
  3. Always extract headers first – Get the column names before processing rows
  4. Skip header rows – When extracting rows, skip the first row if it contains headers
  5. Clean your data – Real-world tables often require significant cleaning
  6. Handle missing or misplaced data – Develop strategies for fixing data inconsistencies

Conclusion

Scraping tables from websites and converting them to Pandas DataFrames provides a powerful way to collect and analyze data. While the process may seem complex at first, following the step-by-step approach outlined in this guide will help you master this valuable skill. Remember that real-world data often requires cleaning, so be prepared to apply various data manipulation techniques to get your data into a usable format.