How to Extract Data from Websites Using Beautiful Soup and Pandas
Web scraping is a powerful technique that allows developers to extract data from websites in a structured format. This tutorial explains the process of extracting tabular data from a hockey website using Python libraries.
Setting Up the Environment
To begin web scraping, you’ll need three Python libraries, which can be installed with pip install beautifulsoup4 requests pandas:
- Beautiful Soup – for parsing HTML content
- Pandas – for data manipulation and analysis
- Requests – for making HTTP requests
The basic imports look like this:
from bs4 import BeautifulSoup
import requests
import pandas as pd
Fetching the Web Page
The first step is to retrieve the web page content using the requests library:
url = "[hockey website URL]"
response = requests.get(url)
page = response.text
Once you have the page content, you need to create a Beautiful Soup object to parse the HTML:
soup = BeautifulSoup(page, 'html.parser')
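The fetch above assumes the request always succeeds. A slightly more defensive version might look like the sketch below. The helper name fetch_soup, the 10-second timeout, and the inline sample HTML are illustrative assumptions, not part of the original example.

```python
from bs4 import BeautifulSoup
import requests


def fetch_soup(url, timeout=10):
    """Download a page and return it parsed (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')


# Parsing works the same on any HTML string, which is handy for offline testing.
sample_page = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
sample_soup = BeautifulSoup(sample_page, 'html.parser')
print(sample_soup.title.string)  # Demo
print(sample_soup.p.text)        # Hello
```

Calling fetch_soup(url) with the real page address would then replace the three fetching lines and the parsing line shown earlier.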
Locating the Data Table
The target data is contained in an HTML table with the class “table”. To find this table:
table = soup.find_all('table', class_='table')[0]
The [0] index is used because there might be multiple tables with the same class, and we’re interested in the first one.
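Indexing with [0] raises an IndexError when no table matches, for example if the page layout changes. One possible alternative, sketched here on made-up HTML that stands in for the real page, is find(), which returns None when nothing matches and so allows an explicit check.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real page
html = """
<table class="table">
  <tr><th>Team Name</th><th>Year</th></tr>
  <tr><td>Chicago Black Hawks</td><td>1991</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# find() returns the first match, or None when nothing matches
table = soup.find('table', class_='table')
if table is None:
    raise ValueError("No <table class='table'> found on the page")

# A CSS selector is an equivalent way to express the same lookup
same_table = soup.select_one('table.table')

missing = soup.find('table', class_='does-not-exist')
print(missing)  # None
```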
Extracting Table Headers
Next, extract the table headers to create the columns for your pandas DataFrame:
headers = table.find_all('th')
df = pd.DataFrame(columns=[header.text.strip() for header in headers])
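Header cells often carry surrounding whitespace and line breaks from the page markup, which would end up inside the column names. The sketch below, using made-up HTML, compares the raw text with the stripped text; the exact whitespace on the real page is an assumption.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up HTML with the kind of whitespace that indented markup produces
html = """
<table class="table">
  <tr>
    <th>
      Team Name
    </th>
    <th>
      Year
    </th>
  </tr>
</table>
"""
table = BeautifulSoup(html, 'html.parser').find('table', class_='table')
headers = table.find_all('th')

raw_names = [header.text for header in headers]
clean_names = [header.get_text(strip=True) for header in headers]

print(raw_names[0] == 'Team Name')  # False: surrounding whitespace is kept
print(clean_names)                  # ['Team Name', 'Year']

df = pd.DataFrame(columns=clean_names)
```

Clean column names make later lookups such as df['Year'] behave as expected.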
Extracting Table Rows
To extract the data from each row:
rows = table.find_all('tr')
for row in rows:
    data = row.find_all('td')
    if data:  # Skip header row
        row_data = [td.text.strip() for td in data]
        # Add to DataFrame
        length = len(df)
        df.loc[length] = row_data
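Appending with df.loc one row at a time works, but for larger tables it is usually simpler to collect the rows in a plain list and build the DataFrame once at the end. The following is a minimal sketch of that variant on made-up HTML; the team names and numbers are placeholders, not real data.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up HTML standing in for the real table
html = """
<table class="table">
  <tr><th>Team Name</th><th>Year</th><th>Wins</th></tr>
  <tr><td> Team A </td><td> 1990 </td><td> 10 </td></tr>
  <tr><td> Team B </td><td> 1991 </td><td> 12 </td></tr>
</table>
"""
table = BeautifulSoup(html, 'html.parser').find('table', class_='table')
columns = [th.get_text(strip=True) for th in table.find_all('th')]

records = []
for row in table.find_all('tr'):
    cells = row.find_all('td')
    if not cells:
        continue  # Header row: it has <th> cells but no <td> cells
    records.append([td.get_text(strip=True) for td in cells])

# Build the DataFrame once, after the loop
df = pd.DataFrame(records, columns=columns)
print(df.shape)  # (2, 3)
```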
Understanding the Data Structure
The extracted hockey data includes several columns:
- Team name (e.g., Chicago Black Hawks)
- Year (e.g., 1991)
- Wins (e.g., 26)
- Losses (e.g., 29)
- Win percentage (e.g., 0.45)
- Goals for (e.g., 247)
- Goals against (e.g., 236)
- Plus/minus (e.g., 21)
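Every scraped cell arrives as a string, so numeric columns need converting before any arithmetic. The sketch below uses the example values listed above; the column names are assumptions and may not match the headers on the actual page.

```python
import pandas as pd

# One row built from the example values above; column names are assumptions
df = pd.DataFrame(
    [["Chicago Black Hawks", "1991", "26", "29", "0.45", "247", "236", "21"]],
    columns=["Team Name", "Year", "Wins", "Losses", "Win %",
             "Goals For", "Goals Against", "+/-"],
)

numeric_columns = [c for c in df.columns if c != "Team Name"]
# errors="coerce" turns unparseable cells (e.g. empty strings) into NaN
df[numeric_columns] = df[numeric_columns].apply(pd.to_numeric, errors="coerce")

print(df["Wins"].iloc[0] + df["Losses"].iloc[0])  # 55
```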
Data Manipulation
Once you have the data in a pandas DataFrame, you can manipulate it as needed:
- Remove unnecessary rows (e.g., df = df.drop(0) to remove the first row)
- Filter specific columns
- Perform calculations or aggregations
- Export to various formats (CSV, Excel, etc.)
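The operations in this list can be sketched as follows. The DataFrame here holds made-up values purely for illustration, and the column names are assumptions.

```python
import pandas as pd

# Made-up values purely for illustration
df = pd.DataFrame({
    "Team Name": ["Team A", "Team B", "Team A"],
    "Year": [1990, 1990, 1991],
    "Wins": [10, 12, 14],
    "Losses": [8, 6, 4],
})

# Remove the row labelled 0, then renumber the index from 0
trimmed = df.drop(0).reset_index(drop=True)

# Keep only some columns
subset = trimmed[["Team Name", "Wins"]]

# Aggregate: total wins per team across all rows of the original frame
wins_per_team = df.groupby("Team Name")["Wins"].sum()
print(wins_per_team["Team A"])  # 24

# With no path argument, to_csv returns the CSV text instead of writing a file
csv_text = subset.to_csv(index=False)
```

Passing a file name to to_csv, such as a hypothetical "teams.csv", writes the same text to disk instead.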
Conclusion
Web scraping with Beautiful Soup and pandas provides a powerful way to extract structured data from websites. This example demonstrated how to extract a data table from a hockey website, but the same principles can be applied to many other websites with tabular data.
Remember that when scraping websites, it’s important to respect the website’s terms of service and robots.txt file, and to avoid making too many requests in a short period of time.
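As a rough illustration of those two habits, the sketch below checks URLs against robots.txt rules with the standard library's urllib.robotparser and pauses between requests. The robots.txt lines, the "my-scraper" user agent, the example.com URLs, and the fetch_all helper are all hypothetical.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would load the site's own
# file, e.g. with parser.set_url(...) followed by parser.read()
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]
parser = RobotFileParser()
parser.parse(robots_lines)

allowed = parser.can_fetch("my-scraper", "https://example.com/pages/")
blocked = parser.can_fetch("my-scraper", "https://example.com/private/data")
print(allowed, blocked)  # True False


def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each permitted URL, pausing between calls."""
    results = []
    for url in urls:
        if not parser.can_fetch("my-scraper", url):
            continue  # Skip URLs that robots.txt disallows
        results.append(fetch(url))
        time.sleep(delay_seconds)  # Spread requests out over time
    return results
```

Here fetch can be any callable that takes a URL, so the pacing logic can be tried out without touching the network.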