Complete Guide to UFC Data Scraping with Python
Web scraping is a powerful technique that uses automated software to extract data from websites. While it might sound complicated, Python makes it relatively easy with libraries like requests, Selenium, and Beautiful Soup.
Understanding Web Scraping Fundamentals
According to Google, web scraping is “the technique that uses automated software such as bots or crawlers to extract data from websites like text and links and it extracts this information using the underlying HTML.” Python makes this process straightforward with powerful libraries.
The first step for web scraping is understanding which library to use for your specific needs:
- Requests library: Use when you don’t need to interact directly with the web page and a simple HTTP request will provide all the HTML you need.
- Selenium library: Use when you need to interact with the web page (clicking, typing, hovering) in ways that change the underlying HTML; a brief sketch of this case follows the list below.
- Beautiful Soup library: Used alongside either of the above to parse the returned HTML and extract, or "scrape", the pieces of data you need.
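The project below only needs the first and third of these. For the Selenium case, a minimal, hypothetical sketch (the URL and CSS selector are placeholders, not part of this project) might look like this:

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

# Placeholder URL and selector: an illustration of the Selenium workflow,
# not part of the UFC project described in this guide.
driver = webdriver.Chrome()                    # requires Chrome and a matching driver
driver.get("https://example.com/dynamic-page")
driver.find_element(By.CSS_SELECTOR, "button.load-more").click()  # the interaction changes the HTML
soup = BeautifulSoup(driver.page_source, "html.parser")           # hand the rendered HTML to Beautiful Soup
driver.quit()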
Setting Up the UFC Data Scraping Project
For this project, we’ll scrape fighter data from the UFC stats website. The goal is to automatically extract all the meaningful fight information from event links. After analyzing the website structure, we can determine that we only need the requests library rather than Selenium, as we can access all the data with direct HTTP requests.
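One quick way to verify this (an informal check rather than part of the original write-up) is to fetch the page with a plain HTTP request and confirm that the events table is already present in the raw HTML:

import requests

# If the table rows appear in the raw response, the page does not rely on
# JavaScript rendering, so Selenium is unnecessary.
html = requests.get("http://ufcstats.com/statistics/events/completed?page=all").text
print("b-statistics__table-row" in html)   # expected: True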
Initial Setup and Imports
First, we import the necessary libraries:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
Connecting to the UFC Website
We start by connecting to the UFC webpage and extracting all page information:
response = requests.get("http://ufcstats.com/statistics/events/completed?page=all")
soup = BeautifulSoup(response.content, "html.parser")
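Before parsing, it can be worth confirming that the request actually succeeded; this defensive check is an addition, not part of the original script:

# Optional: stop early if the page could not be fetched
response.raise_for_status()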
Extracting Event URLs
Next, we need to extract all the UFC event URLs from the main page:
# Each completed event appears as a table row containing a link to its page
table_rows = soup.find_all("tr", class_="b-statistics__table-row")

ufc_events_url = []
for i in range(2, len(table_rows)):  # skip the first two rows at the top of the table
    event_link = table_rows[i].find("a").get("href")
    ufc_events_url.append(event_link)
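A quick sanity check (illustrative only) is to look at how many event links were collected:

print(len(ufc_events_url))    # number of completed events found
print(ufc_events_url[:3])     # first few event page URLs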
Creating a Data Structure for Fight Data
We’ll create a pandas DataFrame to store all the fight data:
fighter_data = pd.DataFrame({
    "event": ["-"], "date": ["-"], "location": ["-"], "wl": ["-"],
    "fighter_a": ["-"], "fighter_b": ["-"], "kd": ["-"], "str": ["-"],
    "td": ["-"], "sub": ["-"], "weight_class": ["-"], "method": ["-"],
    "round": ["-"], "time": ["-"], "perf": [0], "sub_bonus": [0],
    "ko_bonus": [0], "fight_bonus": [0], "belt": [0]
})
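The single placeholder row here mainly serves to establish the column names up front; each scraped fight is then appended to this DataFrame as a new row (a sketch of that step appears after the event loop below).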
Scraping Individual Event Data
For each event URL, we extract detailed information about the fights:
- Event title, date, and location
- Fighter names and matchup details
- Fight results, method, round, and time
- Weight class information
- Performance bonus information
The process involves navigating the HTML structure of each page, identifying the relevant tags, and extracting the text and attribute information.
Extracting Fight Details
Looping over the event URLs, we request each page, parse the HTML, and then iterate over the rows of its fight table:
for i in range(len(ufc_events_url)):
    url = ufc_events_url[i]
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    # Get event title
    event_title = soup.find("span", class_="b-content__title-highlight").get_text(strip=True)

    # Get date and location
    date_location = soup.find_all("li", class_="b-list__box-list-item")
    event_loc = {}
    for tag in date_location:
        title_text = tag.find("i").get_text(strip=True)
        all_text = tag.get_text(strip=True)
        event_loc[title_text] = all_text.replace(title_text, "")

    # Extract fights
    fight_table = soup.find_all("tr", class_="b-fight-details__table-row")

    # Process each fight
    for tr in fight_table:
        # Save all TD tags
        tds = tr.find_all("td")
        # Extract fight information
        # ... (code for extracting specific fight details)
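The per-fight extraction itself is elided above because it depends on the exact column layout of the results table. As a rough, hypothetical sketch (the new_row name, the "Date:"/"Location:" dictionary keys, and the concat step are assumptions for illustration, not the original code), the body of the "for tr in fight_table" loop might assemble one row per fight and append it to the DataFrame:

# Hypothetical sketch of the body of the per-fight loop above
new_row = {col: "-" for col in fighter_data.columns}
new_row["event"] = event_title
new_row["date"] = event_loc.get("Date:", "-")          # key strings are assumptions
new_row["location"] = event_loc.get("Location:", "-")
# ... fill wl, fighter_a, fighter_b, kd, str, td, sub, weight_class,
#     method, round and time from the text of the cells in tds ...
fighter_data = pd.concat([fighter_data, pd.DataFrame([new_row])], ignore_index=True)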
Handling Special Cases: Performance Bonuses and Title Fights
The UFC awards various bonuses for exceptional performances, including:
- Performance of the Night bonus
- Knockout of the Night bonus
- Submission of the Night bonus
- Fight of the Night bonus (given to both fighters)
Additionally, championship fights are indicated with a belt icon. We extract this information from image tags in the weight class column:
# Bonus and belt icons appear as <img> tags; the icon filename identifies the award
img_list = tds[4].find_all("img")
bonus_dict = {"perf.png": 0, "sub.png": 0, "ko.png": 0, "fight.png": 0, "belt.png": 0}
if img_list:
    for img in img_list:
        src = img.get("src")
        key = src.split("/")[-1]   # e.g. ".../belt.png" -> "belt.png"
        bonus_dict[key] = 1
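Those flags then need to end up in the bonus columns of the row being assembled. Continuing the hypothetical new_row sketch from the event loop above, one possible mapping from icon filename to the DataFrame columns defined earlier is:

# Hypothetical mapping from icon filename to the bonus/belt columns of the DataFrame
icon_to_column = {
    "perf.png": "perf",
    "sub.png": "sub_bonus",
    "ko.png": "ko_bonus",
    "fight.png": "fight_bonus",
    "belt.png": "belt",
}
for icon, column in icon_to_column.items():
    new_row[column] = bonus_dict[icon]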
Saving the Data
After extracting all fight data, we save it to a CSV file:
fighter_data.to_csv("UFC_events_data.csv", index=False)
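One small cleanup worth considering (an assumption about the workflow, not something shown in the original script): the DataFrame was initialized with a placeholder row of "-" values, and if that row is still present it can be dropped before writing the file:

# Assumed cleanup: remove the "-" placeholder row created when the DataFrame was initialized
fighter_data = fighter_data.iloc[1:].reset_index(drop=True)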
Efficiency of Web Scraping
This automated approach is significantly more efficient than manual data collection. The script can process 725 UFC events (containing 8,094 fights) in just three minutes, compared to manual collection, which could take days or weeks.
Conclusion
Web scraping with Python provides a powerful way to collect large amounts of data quickly and efficiently. By understanding the structure of websites and using the right libraries, you can automate data collection tasks that would otherwise be extremely time-consuming if done manually. The UFC data collected in this project is ready for exploratory data analysis and can provide valuable insights into fighter performance, fight outcomes, and UFC event statistics.