Complete Guide to UFC Data Scraping with Python
Web scraping is a powerful technique that uses automated software to extract data from websites. While it might sound complicated, Python makes it relatively easy with libraries like requests, Selenium, and Beautiful Soup.
Understanding Web Scraping Fundamentals
According to Google, web scraping is “the technique that uses automated software such as bots or crawlers to extract data from websites like text and links and it extracts this information using the underlying HTML.” Python makes this process straightforward with powerful libraries.
The first step for web scraping is understanding which library to use for your specific needs:
- Requests library: Use when you don’t need to interact directly with the web page and a simple HTTP request will provide all the HTML you need.
- Selenium library: Use when you need to interact with the web page (clicking, typing, hovering) in ways that change the underlying HTML; a brief sketch of this case follows the list below.
- Beautiful Soup library: Used alongside either of the above to parse the returned HTML and extract, or "scrape", the pieces of data you need.
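The project below only needs the first and third of these. For the Selenium case, a minimal, hypothetical sketch (the URL and CSS selector are placeholders, not part of this project) might look like this:

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

# Placeholder URL and selector: an illustration of the Selenium workflow,
# not part of the UFC project described in this guide.
driver = webdriver.Chrome()                    # requires Chrome and a matching driver
driver.get("https://example.com/dynamic-page")
driver.find_element(By.CSS_SELECTOR, "button.load-more").click()  # the interaction changes the HTML
soup = BeautifulSoup(driver.page_source, "html.parser")           # hand the rendered HTML to Beautiful Soup
driver.quit()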
Setting Up the UFC Data Scraping Project
For this project, we’ll scrape fighter data from the UFC stats website. The goal is to automatically extract all the meaningful fight information from event links. After analyzing the website structure, we can determine that we only need the requests library rather than Selenium, as we can access all the data with direct HTTP requests.
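One quick way to verify this (an informal check rather than part of the original write-up) is to fetch the page with a plain HTTP request and confirm that the events table is already present in the raw HTML:

import requests

# If the table rows appear in the raw response, the page does not rely on
# JavaScript rendering, so Selenium is unnecessary.
html = requests.get("http://ufcstats.com/statistics/events/completed?page=all").text
print("b-statistics__table-row" in html)   # expected: True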
Initial Setup and Imports
First, we import the necessary libraries:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
Connecting to the UFC Website
We start by connecting to the UFC webpage and extracting all page information:
response = requests.get("http://ufcstats.com/statistics/events/completed?page=all")
soup = BeautifulSoup(response.content, "html.parser")
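Before parsing, it can be worth confirming that the request actually succeeded; this defensive check is an addition, not part of the original script:

# Optional: stop early if the page could not be fetched
response.raise_for_status()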
Extracting Event URLs
Next, we need to extract all the UFC event URLs from the main page:
# Each completed event appears as a table row containing a link to its page
table_rows = soup.find_all("tr", class_="b-statistics__table-row")

ufc_events_url = []
for i in range(2, len(table_rows)):  # skip the first two rows at the top of the table
    event_link = table_rows[i].find("a").get("href")
    ufc_events_url.append(event_link)
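A quick sanity check (illustrative only) is to look at how many event links were collected:

print(len(ufc_events_url))    # number of completed events found
print(ufc_events_url[:3])     # first few event page URLs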
Creating a Data Structure for Fight Data
We’ll create a pandas DataFrame to store all the fight data:
fighter_data = pd.DataFrame({
    "event": ["-"], "date": ["-"], "location": ["-"], "wl": ["-"],
    "fighter_a": ["-"], "fighter_b": ["-"], "kd": ["-"], "str": ["-"],
    "td": ["-"], "sub": ["-"], "weight_class": ["-"], "method": ["-"],
    "round": ["-"], "time": ["-"], "perf": [0], "sub_bonus": [0],
    "ko_bonus": [0], "fight_bonus": [0], "belt": [0]
})
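The single placeholder row here mainly serves to establish the column names up front; each scraped fight is then appended to this DataFrame as a new row (a sketch of that step appears after the event loop below).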
Scraping Individual Event Data
For each event URL, we extract detailed information about the fights:
- Event title, date, and location
- Fighter names and matchup details
- Fight results, method, round, and time
- Weight class information
- Performance bonus information
The process involves navigating the HTML structure of each page, identifying the relevant tags, and extracting the text and attribute information.
Extracting Fight Details
Looping over the event URLs, we request each page, parse the HTML, and then iterate over the rows of its fight table:
for i in range(len(ufc_events_url)):
    url = ufc_events_url[i]
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    # Get event title
    event_title = soup.find("span", class_="b-content__title-highlight").get_text(strip=True)

    # Get date and location
    date_location = soup.find_all("li", class_="b-list__box-list-item")
    event_loc = {}
    for tag in date_location:
        title_text = tag.find("i").get_text(strip=True)
        all_text = tag.get_text(strip=True)
        event_loc[title_text] = all_text.replace(title_text, "")

    # Extract fights
    fight_table = soup.find_all("tr", class_="b-fight-details__table-row")

    # Process each fight
    for tr in fight_table:
        # Save all TD tags
        tds = tr.find_all("td")
        # Extract fight information
        # ... (code for extracting specific fight details)
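The per-fight extraction itself is elided above because it depends on the exact column layout of the results table. As a rough, hypothetical sketch (the new_row name, the "Date:"/"Location:" dictionary keys, and the concat step are assumptions for illustration, not the original code), the body of the "for tr in fight_table" loop might assemble one row per fight and append it to the DataFrame:

# Hypothetical sketch of the body of the per-fight loop above
new_row = {col: "-" for col in fighter_data.columns}
new_row["event"] = event_title
new_row["date"] = event_loc.get("Date:", "-")          # key strings are assumptions
new_row["location"] = event_loc.get("Location:", "-")
# ... fill wl, fighter_a, fighter_b, kd, str, td, sub, weight_class,
#     method, round and time from the text of the cells in tds ...
fighter_data = pd.concat([fighter_data, pd.DataFrame([new_row])], ignore_index=True)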
Handling Special Cases: Performance Bonuses and Title Fights
The UFC awards various bonuses for exceptional performances, including:
- Performance of the Night bonus
- Knockout of the Night bonus
- Submission of the Night bonus
- Fight of the Night bonus (given to both fighters)
Additionally, championship fights are indicated with a belt icon. We extract this information from image tags in the weight class column:
# Bonus and belt icons appear as <img> tags; the icon filename identifies the award
img_list = tds[4].find_all("img")
bonus_dict = {"perf.png": 0, "sub.png": 0, "ko.png": 0, "fight.png": 0, "belt.png": 0}
if img_list:
    for img in img_list:
        src = img.get("src")
        key = src.split("/")[-1]   # e.g. ".../belt.png" -> "belt.png"
        bonus_dict[key] = 1
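Those flags then need to end up in the bonus columns of the row being assembled. Continuing the hypothetical new_row sketch from the event loop above, one possible mapping from icon filename to the DataFrame columns defined earlier is:

# Hypothetical mapping from icon filename to the bonus/belt columns of the DataFrame
icon_to_column = {
    "perf.png": "perf",
    "sub.png": "sub_bonus",
    "ko.png": "ko_bonus",
    "fight.png": "fight_bonus",
    "belt.png": "belt",
}
for icon, column in icon_to_column.items():
    new_row[column] = bonus_dict[icon]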
Saving the Data
After extracting all fight data, we save it to a CSV file:
fighter_data.to_csv("UFC_events_data.csv", index=False)
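One small cleanup worth considering (an assumption about the workflow, not something shown in the original script): the DataFrame was initialized with a placeholder row of "-" values, and if that row is still present it can be dropped before writing the file:

# Assumed cleanup: remove the "-" placeholder row created when the DataFrame was initialized
fighter_data = fighter_data.iloc[1:].reset_index(drop=True)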
Efficiency of Web Scraping
This automated approach is significantly more efficient than manual data collection. The script can process 725 UFC events (containing 8,094 fights) in just three minutes, compared to manual collection, which could take days or weeks.
Conclusion
Web scraping with Python provides a powerful way to collect large amounts of data quickly and efficiently. By understanding the structure of websites and using the right libraries, you can automate data collection tasks that would otherwise be extremely time-consuming if done manually. The UFC data collected in this project is ready for exploratory data analysis and can provide valuable insights into fighter performance, fight outcomes, and UFC event statistics.