How to Scrape Website Data Using Selenium and Python
Web scraping is a powerful technique for extracting data from websites. In this article, we’ll explore how to scrape population data from Worldometer using Selenium with Python and save the results to an Excel file.

What is Selenium?

Selenium is a powerful tool for controlling web browsers through programs and performing browser automation. It works with all major browsers, operates across all major operating systems, and supports scripts written in various programming languages including Python and Java.

Setting Up Your Environment

Before starting, you’ll need to install Selenium in your Python environment. You can do this using pip:

pip install selenium

You’ll also need webdriver-manager (which downloads a matching ChromeDriver automatically), pandas, and openpyxl (the engine pandas uses to write Excel files):

pip install webdriver-manager pandas openpyxl

The Code Structure

Let’s break down the components of our web scraping script:

1. Import the necessary libraries

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

2. Configure Chrome Options

Setting up options allows us to configure how Chrome behaves during scraping:

options = Options()
options.add_experimental_option("detach", True) # Keeps the browser open after scraping finishes

3. Initialize the Chrome WebDriver

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

4. Open the Target Website

url = "https://www.worldometers.info/world-population/india-population/"
driver.get(url)

5. Wait for Elements to Load

Using WebDriverWait makes the script wait until the table is present in the DOM before we try to read it, rather than failing on a page that hasn’t finished rendering:

wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_all_elements_located((By.XPATH, '//table')))

6. Extract Data from the Table

We extract headers and rows from the population table:

table = driver.find_element(By.XPATH, '//table')  # grabs the first table on the page
headers = [header.text for header in table.find_elements(By.TAG_NAME, 'th')]
rows = []
for row in table.find_elements(By.TAG_NAME, 'tr')[1:]:  # skip the header row
    columns = [column.text for column in row.find_elements(By.TAG_NAME, 'td')]
    rows.append(columns)

7. Create a DataFrame and Save to Excel

df = pd.DataFrame(rows, columns=headers)
df.to_excel('india_population.xlsx', index=False)

Error Handling

It’s important to implement proper error handling in your scraping script. Initialize driver to None first, wrap the main code in a try-except block to catch any exceptions, and always quit the browser in a finally clause:

driver = None
try:
    # Your scraping code here
except Exception as e:
    print(f"An error occurred: {e}")
finally:
    if driver is not None:
        driver.quit()

Running the Script

When you run the script, it will open a Chrome browser, navigate to the Worldometer page for India’s population, scrape the data from the population table, and save it to an Excel file named ‘india_population.xlsx’.

Conclusion

Selenium is a versatile tool for web scraping, especially when dealing with dynamic websites that require browser interaction. The combination of Selenium with pandas makes it easy to extract, manipulate, and save data for further analysis.
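As a sketch of the kind of post-processing pandas enables: figures scraped from a page like Worldometer typically arrive as strings with thousands separators, so a common first step is converting them to numeric types. The column names and values below are hypothetical, purely for illustration.

```python
import pandas as pd

# Hypothetical scraped rows: numbers arrive as strings with commas.
df = pd.DataFrame(
    {"Year": ["2023", "2022"], "Population": ["1,428,627,663", "1,417,173,173"]}
)

# Strip the separators and convert to integer types so the columns
# can be sorted, summed, and plotted.
df["Year"] = df["Year"].astype(int)
df["Population"] = df["Population"].str.replace(",", "").astype(int)

print(df.dtypes)
```

After this conversion the data is ready for analysis or for a cleaner Excel export.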

Remember that when scraping websites, it’s important to review the website’s terms of service and robots.txt file to ensure you’re not violating any usage policies. Additionally, implement appropriate delays in your scraping to avoid overwhelming the target server.
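Python’s standard library can help with both checks. Below is a minimal sketch using urllib.robotparser; the robots.txt content and URLs are made up for illustration (it parses a local string rather than fetching the real file, and the Crawl-delay directive is non-standard but widely honored).

```python
from urllib.robotparser import RobotFileParser
import time

# A hypothetical robots.txt, parsed offline for illustration.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Check whether a given path may be fetched by a generic crawler.
print(rp.can_fetch("*", "https://example.com/world-population/"))  # True
print(rp.can_fetch("*", "https://example.com/private/page"))       # False

# Honor the advertised crawl delay (fall back to 1 second if none is set).
delay = rp.crawl_delay("*") or 1
# time.sleep(delay)  # call this between successive page loads
```

In a real script you would point RobotFileParser at the site’s actual robots.txt with set_url() and read(), and sleep for the delay between page requests.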