How to Scrape Text and Images from Websites Using Python

Web scraping is an essential skill for data collection, especially when creating datasets for fine-tuning machine learning models. In this comprehensive guide, we’ll explore how to scrape both text data and images from websites using Python and the Beautiful Soup library.

Text Data Scraping

When it comes to extracting structured data from websites, tables are common targets. Here’s how to scrape tabular data:

Setting Up Your Environment

First, import the necessary libraries:

from bs4 import BeautifulSoup
import requests
import pandas as pd

Beautiful Soup is designed specifically for scraping purposes and offers various functionalities to extract data easily. If you’re using Google Colab, Beautiful Soup is pre-installed. For local installations, use:

pip install beautifulsoup4

Sending Requests to the Website

Define your target URL and send a request:

url = "your_target_url"
response = requests.get(url)

Always check that the response status code is 200, which indicates a successful request. A 404 means the page was not found, while codes such as 403 or 429 suggest the site may be blocking or rate-limiting your requests.
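As a sketch, the status check can be wrapped in a small helper; the function name and messages below are our own, not part of any library:

```python
import requests

def describe_status(code: int) -> str:
    """Map an HTTP status code to a short diagnostic message."""
    if code == 200:
        return "success"
    if code == 404:
        return "not found - check the URL"
    if code in (403, 429):
        return "possibly blocked or rate-limited"
    return f"unexpected status: {code}"

# Typical usage (replace with a real target URL):
# response = requests.get("https://example.com/page", timeout=10)
# print(describe_status(response.status_code))
```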

Parsing HTML Content

Pass the response to Beautiful Soup for parsing:

soup = BeautifulSoup(response.text, 'html.parser')

Navigating to Specific Elements

To extract specific elements like tables, use browser developer tools (inspect element) to identify the HTML structure. For tables, you can use:

tables = soup.find_all('table')
first_table = tables[0]

You can also find elements by their class attributes:

target_table = soup.find('table', class_='wikitable sortable')

Extracting Table Headers

Headers are typically in <th> tags:

headers = [header.text.strip() for header in target_table.find_all('th')]

Creating a DataFrame

Initialize an empty DataFrame with the extracted headers:

df = pd.DataFrame(columns=headers)

Extracting Table Data

Loop through table rows and extract data from <td> tags:

for row in target_table.find_all('tr')[1:]:  # Skip header row
    row_data = [cell.text.strip() for cell in row.find_all('td')]
    if row_data:  # Skip empty rows
        df.loc[len(df)] = row_data

Saving the Data

Save the scraped data as a CSV file:

df.to_csv('scraped_data.csv', index=False)

Image Scraping

Scraping images follows a similar approach but requires handling file downloads.

Setting Up for Image Scraping

Import additional libraries for file operations:

import os
from urllib.parse import urljoin
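The urljoin helper resolves relative image paths against the page URL, which matters because many sites reference images with paths like /images/photo.jpg rather than full URLs. A quick illustration (the URLs here are placeholders):

```python
from urllib.parse import urljoin

# A root-relative path replaces everything after the domain:
print(urljoin("https://example.com/articles/page.html", "/images/photo.jpg"))
# https://example.com/images/photo.jpg

# A plain relative path resolves against the page's directory:
print(urljoin("https://example.com/articles/page.html", "thumb.png"))
# https://example.com/articles/thumb.png
```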

Creating a Directory for Images

Create a folder to store downloaded images:

folder_name = 'downloaded_images'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

Finding Image Elements

Locate all image tags on the page:

images = soup.find_all('img')
print(f'Found {len(images)} images')

Downloading Images

Loop through image tags and download each image:

counter = 0
for image in images:
    try:
        img_url = image.get('src')
        if not img_url:
            continue

        # Create full URL if relative path
        full_url = urljoin(url, img_url)

        # Download image
        img_data = requests.get(full_url).content

        # Save image
        img_name = f'image_{counter}.jpg'
        img_path = os.path.join(folder_name, img_name)

        with open(img_path, 'wb') as f:
            f.write(img_data)

        counter += 1
        print(f'Downloaded: {img_name}')

    except Exception as e:
        print(f'Error downloading image: {e}')

print(f'Total images downloaded: {counter}')

Best Practices for Web Scraping

  • Always check the website’s robots.txt file and terms of service before scraping
  • Add delays between requests to avoid overloading the server
  • Use headers to identify your scraper appropriately
  • Implement error handling for robust scraping
  • Consider using proxy servers for large-scale scraping operations
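The first four practices above can be combined into a small polite-fetching loop. This is a minimal sketch under our own assumptions: the URLs, the User-Agent string, and the one-second delay are placeholders you should adapt:

```python
import time
import requests

# Hypothetical list of pages to fetch politely.
urls = ["https://example.com/page1", "https://example.com/page2"]

# Identify your scraper with a descriptive User-Agent header.
headers = {"User-Agent": "MyDatasetBot/1.0 (contact@example.com)"}

for page_url in urls:
    try:
        response = requests.get(page_url, headers=headers, timeout=10)
        response.raise_for_status()  # Raise on 4xx/5xx responses
        # ... parse response.text with Beautiful Soup here ...
    except requests.RequestException as e:
        # Error handling keeps one bad page from stopping the run.
        print(f"Skipping {page_url}: {e}")
    time.sleep(1)  # Pause between requests to avoid overloading the server
```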

Conclusion

Web scraping is a powerful technique for data collection that can be applied to various use cases, particularly when creating custom datasets for machine learning model training. With tools like Beautiful Soup, Python makes it relatively straightforward to extract both textual and image data from websites.

By mastering these techniques, you can create custom datasets from publicly available information and use them for analysis, machine learning, or other data-driven applications.
