Modern Web Scraping with LLMs: Using Crawl4AI to Extract Structured Data
Web scraping has evolved significantly with the advent of Large Language Models (LLMs). Traditional tools like Beautiful Soup have been around for years, but they require hand-written parsing rules and selectors to extract information from HTML. Modern approaches using LLMs can simplify this process considerably.
Crawl4AI is an open-source package that leverages LLMs to extract information directly from web pages. Unlike traditional scrapers, which only see raw HTML, LLMs can understand content contextually, making extraction more intuitive and scalable.
Setting Up Crawl4AI
To get started with Crawl4AI, you’ll need to:
- Create a virtual environment:
conda create -n name_of_environment python=3.x
- Install the necessary packages: Crawl4AI, OpenAI, and python-dotenv for managing API keys
- Install the LiteLLM proxy to interface with different LLMs through a consistent API
- Install the Playwright browsers (via playwright install) if required
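The setup steps above can be condensed into a few commands. This is a sketch: the environment name and Python version are placeholders, and your exact package list may differ.

```shell
# Placeholder environment name and Python version -- adjust as needed.
conda create -n scraper-env python=3.11 -y
conda activate scraper-env

# Crawler, LLM client, dotenv for API keys, and the LiteLLM proxy.
pip install crawl4ai openai python-dotenv litellm

# Download the headless browsers Playwright drives (skip if already present).
playwright install
```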
The Cost Consideration
A critical factor rarely discussed is the cost associated with using LLMs for web scraping. In the demonstration, scraping a single webpage with tables consumed approximately 150,000 tokens across 25 requests, costing about eight cents using DeepSeek models.
While this might seem negligible for small-scale operations, costs escalate quickly when scraping millions of pages. Fortunately, Crawl4AI also provides options for web scraping without an LLM, which can be far more cost-effective for large-scale operations.
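Those figures allow a quick back-of-envelope projection. The per-page numbers come straight from the demonstration; the linear extrapolation is a deliberate simplification, since real costs vary with page size and model.

```python
# Figures from the demonstration: one page with tables took ~150,000 tokens
# across 25 requests and cost about $0.08 using DeepSeek models.
TOKENS_PER_PAGE = 150_000
COST_PER_PAGE = 0.08  # USD

def projected_cost(pages: int) -> float:
    """Naive linear extrapolation of total scraping cost in USD."""
    return pages * COST_PER_PAGE

print(projected_cost(1_000_000))  # a million pages -> 80000.0 USD
```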
Implementing Web Scraping with LLMs
The implementation process is straightforward:
- Specify the URL(s) to scrape
- Provide detailed instructions about what information to extract
- Configure your LLM provider (DeepSeek, Gemini, etc.)
- Define the output schema for structured data extraction
- Configure browser settings
- Execute the scraping process
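The steps above can be sketched in code. Class names follow crawl4ai's documented API, but signatures have changed between versions, and the provider string, API key, schema, and URL below are all placeholders, not the author's exact setup.

```python
# Sketch of the Crawl4AI LLM-extraction flow; verify parameter names
# against the crawl4ai version you have installed.
import asyncio
import json

# 1. Output schema the LLM must follow (plain JSON Schema).
PRODUCT_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "string"},
    },
    "required": ["name", "price"],
}

async def scrape(url: str) -> list[dict]:
    # Imported lazily so the schema above is usable without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler
    from crawl4ai.extraction_strategy import LLMExtractionStrategy

    strategy = LLMExtractionStrategy(
        provider="deepseek/deepseek-chat",  # any LiteLLM-style provider string
        api_token="YOUR_API_KEY",           # placeholder -- load from .env
        schema=PRODUCT_SCHEMA,
        extraction_type="schema",
        instruction="Extract every product name and price on the page.",
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, extraction_strategy=strategy)
        return json.loads(result.extracted_content)

# Usage (requires crawl4ai, a Playwright browser, and an API key):
# products = asyncio.run(scrape("https://example.com/products"))
```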
The beauty of using LLMs for web scraping is the ability to generate structured outputs in specific formats like JSON, which can be directly integrated into databases.
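As a small illustration of that database step, the sketch below loads a made-up JSON payload into SQLite; the field names and sample values are hypothetical, standing in for whatever schema you define.

```python
import json
import sqlite3

# Made-up sample of the kind of JSON an LLM extraction might return.
extracted = '[{"name": "Widget", "price": "$9.99"}, {"name": "Gadget", "price": "$19.99"}]'

rows = json.loads(extracted)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, price TEXT)")
# Named placeholders map each dict's keys onto the table columns.
conn.executemany("INSERT INTO products VALUES (:name, :price)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
print(count)  # 2
```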
Model Selection and Performance
Different models offer varying performance and cost profiles:
- DeepSeek V3 provided accurate results but took about 93 seconds to process a page
- Gemini Flash processed the same page in about 60 seconds
Interestingly, the experiments revealed that prompts need to be tailored to specific models: a prompt that works well with one model may not produce the desired results with another, even one from the same provider.
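One pragmatic way to handle this is a per-model instruction table, so each model gets a prompt known to work for it. The model identifiers and prompt wording below are purely illustrative, not prompts from the experiments.

```python
# Illustrative per-model prompts; model IDs and wording are placeholders.
INSTRUCTIONS = {
    "deepseek/deepseek-chat": (
        "Return a JSON array with one object per table row. "
        "Do not add commentary."
    ),
    "gemini/gemini-1.5-flash": (
        "Extract the table on this page. Output ONLY valid JSON, "
        "no markdown fences."
    ),
}

def instruction_for(model: str) -> str:
    # Fall back to a generic instruction for models not yet tuned.
    return INSTRUCTIONS.get(model, "Extract all tabular data as JSON.")

print(instruction_for("some/other-model"))
```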
Key Takeaways
When implementing LLM-based web scraping:
- Consider the cost implications, especially for large-scale operations
- Validate extracted data to ensure accuracy
- Customize prompts for each specific model
- Explore Crawl4AI’s non-LLM options for cost-sensitive applications
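For that non-LLM route, Crawl4AI ships a CSS-selector-based strategy (JsonCssExtractionStrategy) that takes a declarative schema instead of a prompt. The selectors below are placeholders for whatever page structure you target, and the class name should be checked against your installed version.

```python
# Declarative extraction schema for Crawl4AI's non-LLM path; every selector
# here is a placeholder for the real page structure.
SCHEMA = {
    "name": "Products",
    "baseSelector": "table tr",  # one extracted item per matching element
    "fields": [
        {"name": "name", "selector": "td:nth-child(1)", "type": "text"},
        {"name": "price", "selector": "td:nth-child(2)", "type": "text"},
    ],
}

def build_strategy():
    # Lazy import so the schema can be inspected without crawl4ai installed.
    from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
    return JsonCssExtractionStrategy(SCHEMA)
```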
With proper implementation, LLM-based web scraping can provide more intuitive and flexible data extraction than traditional methods, allowing logical deductions rather than rigid selector- or regex-based rules.