How to Extract Data from Websites Using Beautiful Soup and Pandas

Web scraping is a powerful technique that allows developers to extract data from websites in a structured format. This tutorial explains the process of extracting tabular data from a hockey website using Python libraries.

Setting Up the Environment

To begin web scraping, you’ll need three Python libraries installed and imported:

Beautiful Soup – for parsing HTML content
Pandas – for data manipulation and analysis
Requests – for making HTTP requests

The basic imports look like this:

from bs4 import BeautifulSoup
import requests
import pandas as pd

Fetching the Web Page

The first step is to retrieve the web page content using the requests library:

url = "[hockey website URL]"
response = requests.get(url)
page = response.text
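A slightly more defensive version of this request might look like the sketch below. The helper name `fetch_html`, the 10-second timeout, and the User-Agent string are illustrative assumptions rather than part of the original example.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML of `url` as text, raising on HTTP errors.

    `fetch_html` is a hypothetical helper; the timeout is an assumed value.
    """
    # Assumed User-Agent string; replace it with one that identifies your script.
    headers = {"User-Agent": "table-scraper-tutorial/0.1"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

Calling `page = fetch_html(url)` would then stand in for the two lines that build `response` and `page`. Failing early on a bad status code avoids searching an error page for a table that is not there.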

Once you have the page content, you need to create a Beautiful Soup object to parse the HTML:

soup = BeautifulSoup(page, 'html.parser')

Locating the Data Table

The target data is contained in an HTML table with the class “table”. To find this table:

table = soup.find_all('table', class_='table')[0]

find_all returns a list of every matching table, even when only one matches. The [0] index selects the first of them, which is the one we’re interested in.
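The lookup can be tried offline on a small inline document. The HTML below is made up for illustration and only mimics a page with two tables sharing the class; the explicit empty-list check is an added safeguard, not something the original example does.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the downloaded page.
sample_page = """
<table class="table">
  <tr><th>Team Name</th><th>Year</th></tr>
  <tr><td>Chicago Black Hawks</td><td>1991</td></tr>
</table>
<table class="table other">
  <tr><th>Unrelated</th></tr>
</table>
"""

soup = BeautifulSoup(sample_page, "html.parser")

# find_all returns every match as a list; class_ matches any one of an element's classes.
tables = soup.find_all("table", class_="table")
if not tables:
    # Without this check, tables[0] would raise an IndexError on a page with no match.
    raise ValueError("No table with class 'table' found; check the page markup.")

table = tables[0]

# find returns the first match directly, or None if nothing matches.
first_table = soup.find("table", class_="table")
```

Here `tables` holds both tables, while `table` and `first_table` refer to the first one.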

Extracting Table Headers

Next, extract the table headers to create the columns for your pandas DataFrame:

headers = table.find_all('th')
# Strip surrounding whitespace so the column names are clean
df = pd.DataFrame(columns=[header.text.strip() for header in headers])
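Header cells in real pages often carry newlines and indentation from the HTML source, which end up in the column names unless they are stripped. The sketch below shows the difference on a made-up table; the markup and header names are assumptions for illustration.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up table header with the kind of whitespace found in real markup.
sample_table = """
<table class="table">
  <tr>
    <th>
      Team Name
    </th>
    <th> Year </th>
    <th>Wins</th>
  </tr>
</table>
"""

table = BeautifulSoup(sample_table, "html.parser").find("table")

# Raw text keeps the surrounding newlines and spaces.
raw_titles = [th.text for th in table.find_all("th")]

# get_text(strip=True) removes leading and trailing whitespace.
clean_titles = [th.get_text(strip=True) for th in table.find_all("th")]

# An empty DataFrame whose columns are the cleaned header names.
df = pd.DataFrame(columns=clean_titles)
```

Clean column names matter later, because a lookup such as `df["Year"]` fails if the actual column name is `" Year "`.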

Extracting Table Rows

To extract the data from each row:

rows = table.find_all('tr')
for row in rows:
    data = row.find_all('td')
    if data:  # Skip header row
        row_data = [td.text.strip() for td in data]
        # Add to DataFrame
        length = len(df)
        df.loc[length] = row_data
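Appending with df.loc one row at a time works, but it can become slow on large tables. An alternative, shown here as a sketch rather than the original approach, is to collect the rows in a plain list and build the DataFrame once at the end. The sample HTML and team names are made up.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up table standing in for the scraped one.
sample_table = """
<table class="table">
  <tr><th>Team Name</th><th>Year</th><th>Wins</th></tr>
  <tr><td> Team A </td><td> 1990 </td><td> 44 </td></tr>
  <tr><td> Team B </td><td> 1990 </td><td> 31 </td></tr>
</table>
"""

table = BeautifulSoup(sample_table, "html.parser").find("table")
columns = [th.get_text(strip=True) for th in table.find_all("th")]

records = []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    if not cells:
        continue  # the header row has <th> cells only, so it yields no <td>
    records.append([td.get_text(strip=True) for td in cells])

# Build the DataFrame in one step from the collected rows.
df = pd.DataFrame(records, columns=columns)
```

Both approaches produce the same table; this one avoids resizing the DataFrame on every iteration.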

Understanding the Data Structure

The extracted hockey data includes the following columns:

Team name (e.g., Chicago Black Hawks)
Year (e.g., 1991)
Wins (e.g., 36)
Losses (e.g., 29)
Win percentage (e.g., 0.45)
Goals for (e.g., 257)
Goals against (e.g., 236)
Plus/minus (e.g., 21)
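Every value scraped from HTML arrives as a string, so numeric columns usually need converting before any arithmetic. A minimal sketch, assuming made-up rows and column names rather than the site's actual headers:

```python
import pandas as pd

# Made-up rows; every value is a string, as it would be straight after scraping.
df = pd.DataFrame(
    [["Team A", "1990", "44", "0.55"], ["Team B", "1990", "31", "0.388"]],
    columns=["Team Name", "Year", "Wins", "Win %"],
)

# Assumed column names; adjust them to the headers actually scraped.
numeric_columns = ["Year", "Wins", "Win %"]

# errors="coerce" turns unparseable cells (such as blanks) into NaN instead of raising.
df[numeric_columns] = df[numeric_columns].apply(pd.to_numeric, errors="coerce")
```

After the conversion, `df["Wins"].sum()` adds the numbers instead of concatenating the strings.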

Data Manipulation

Once you have the data in a pandas DataFrame, you can manipulate it as needed:

Remove unnecessary rows (e.g., df = df.drop(0) to remove the row with index label 0, which is the first row in a freshly built DataFrame)
Filter specific columns
Perform calculations or aggregations
Export to various formats (CSV, Excel, etc.)
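The operations listed above can be sketched on a small DataFrame. The data and the derived column name are made up for illustration.

```python
import pandas as pd

# Made-up data standing in for the scraped, type-converted table.
df = pd.DataFrame(
    {
        "Team Name": ["Team A", "Team B", "Team A"],
        "Year": [1990, 1990, 1991],
        "Wins": [44, 31, 36],
        "Losses": [24, 30, 29],
    }
)

# Remove the row with index label 0, then renumber the remaining rows.
trimmed = df.drop(0).reset_index(drop=True)

# Filter specific columns.
subset = df[["Team Name", "Wins"]]

# Perform a calculation: a derived column (hypothetical name).
df["Win/Loss Diff"] = df["Wins"] - df["Losses"]

# Perform an aggregation: total wins per team.
wins_by_team = df.groupby("Team Name")["Wins"].sum()

# Export: to_csv with no path returns the CSV text; pass a filename to write a file.
csv_text = df.to_csv(index=False)
```

Passing `index=False` keeps the row numbers out of the exported file. For Excel output, `df.to_excel` works similarly but needs an extra engine package such as openpyxl installed.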

Conclusion

Web scraping with Beautiful Soup and pandas provides a powerful way to extract structured data from websites. This example demonstrated how to extract a data table from a hockey website, but the same principles can be applied to many other websites with tabular data.

Remember that when scraping websites, it’s important to respect the website’s terms of service and robots.txt file, and to avoid making too many requests in a short period of time.
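One way to act on this is to consult robots.txt before fetching and to pause between requests. The sketch below uses the standard library's urllib.robotparser on made-up rules so that it runs offline; the user agent name, the URLs, and the one-second fallback delay are assumptions.

```python
import time
from urllib import robotparser

# Made-up robots.txt rules, parsed from memory so the sketch runs offline.
sample_robots = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
""".strip().splitlines()

parser = robotparser.RobotFileParser()
parser.parse(sample_robots)

user_agent = "table-scraper-tutorial"  # assumed name for this script

# can_fetch reports whether the rules allow this user agent to request a URL.
allowed = parser.can_fetch(user_agent, "https://example.com/pages/teams")
blocked = parser.can_fetch(user_agent, "https://example.com/private/data")

# Use the site's Crawl-delay if it declares one, else an assumed 1 second.
delay = parser.crawl_delay(user_agent) or 1


def polite_pause(seconds=delay):
    """Sleep between consecutive requests to avoid hammering the server."""
    time.sleep(seconds)
```

Against a live site, `parser.set_url(...)` followed by `parser.read()` would load the real robots.txt instead of the inline rules. Note that robots.txt is separate from the terms of service, which still need to be read by a person.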