Revolutionizing Web Data Extraction with LLMs: Insights from Zyte

The field of web data extraction is experiencing a revolution thanks to Large Language Models (LLMs). Ivan Sanchez, a data scientist at Zyte, recently shared valuable insights on how these powerful AI models are transforming the web scraping industry, making previously complex extraction tasks trivial while presenting new opportunities and challenges.

The Evolution of Web Data Extraction at Zyte

Zyte provides comprehensive web data extraction services that go beyond simple scraping. The company offers solutions for:

  • Data extraction
  • Web crawling
  • Anti-ban systems and proxy management
  • CAPTCHA solving technologies
  • Custom extraction projects
  • Legal compliance services

Before the rise of LLMs, Zyte relied on standard schemas – predefined data structures for common extraction targets like product information. These schemas worked with deep neural networks that classified HTML nodes into specific data types (price, description, etc.).

However, many customers had specialized needs beyond standard schemas. For example, one food retailer wanted to know whether specific ingredients were present in product descriptions, while another needed to determine if delivery was free based on complex conditions (like minimum order values).

LLM-Powered Extraction Approaches

Sanchez outlined three main approaches Zyte is implementing to leverage LLMs:

1. Direct LLM Extraction

The most straightforward approach is using LLMs directly for each page. The process works by:

  1. Receiving a URL and schema request
  2. Obtaining the HTML via a browser
  3. Simplifying the HTML through a preprocessing pipeline
  4. Having the LLM extract the structured data

While effective, this approach can be expensive at scale: at roughly half a cent per page, extracting a million pages costs about $5,000, which becomes significant when processing many millions of pages.
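To make the flow concrete, here is a minimal sketch of such a pipeline in Python. It assumes an OpenAI-compatible client, an illustrative schema and model name, and a crude stand-in for the preprocessing step; none of it reflects Zyte's actual implementation.

```python
# Minimal sketch of direct LLM extraction, not Zyte's actual pipeline.
# Assumes an OpenAI-compatible API; schema and model name are illustrative.
import json
from bs4 import BeautifulSoup
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PRODUCT_SCHEMA = {"name": "string", "price": "string", "free_delivery": "boolean"}

def simplify_html(html: str) -> str:
    """Crude stand-in for the preprocessing pipeline: drop non-content tags
    and collapse the page to its visible text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    return soup.get_text(" ", strip=True)

def extract(url: str, html: str, schema: dict = PRODUCT_SCHEMA) -> dict:
    prompt = (
        "Extract the following fields from this product page and reply with JSON only.\n"
        f"Schema: {json.dumps(schema)}\n\n"
        f"URL: {url}\n"
        f"Page text:\n{simplify_html(html)}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",                      # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # request structured output
    )
    return json.loads(response.choices[0].message.content)
```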

2. LLM-Generated Code for Extraction

A more cost-effective approach involves using LLMs to generate extraction code:

  1. Providing 5-6 URLs as examples of what needs extraction
  2. Having the LLM generate custom code based on those examples
  3. Storing the code and using it for extracting data from many additional pages

This approach dramatically reduces costs but requires addressing challenges like code safety, detecting when pages change, and verifying code accuracy.
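A rough sketch of the code-generation route, under the same caveats (OpenAI-compatible client, illustrative prompts, and no real sandboxing), might look like this:

```python
# Rough sketch of LLM-generated extraction code; prompts, model name, and the
# lack of sandboxing are simplifications, not Zyte's implementation.
import json
from openai import OpenAI

client = OpenAI()

CODEGEN_PROMPT = """Write a Python function `parse(html: str) -> dict` that extracts
the fields {schema} from pages like the examples below, using the `parsel` library.
Reply with code only.

{examples}
"""

def generate_parser(example_htmls: list[str], schema: dict) -> str:
    examples = "\n\n".join(
        f"EXAMPLE {i + 1}:\n{html[:4000]}" for i, html in enumerate(example_htmls)
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative
        messages=[{
            "role": "user",
            "content": CODEGEN_PROMPT.format(schema=json.dumps(schema), examples=examples),
        }],
    )
    return response.choices[0].message.content  # store this alongside the target site

def run_parser(code: str, html: str) -> dict:
    # WARNING: exec-ing model-generated code is exactly the code-safety risk
    # mentioned above; a real system would sandbox and validate this step.
    namespace: dict = {}
    exec(code, namespace)
    return namespace["parse"](html)
```

The generated parser is paid for once and then reused across many pages, which is where the cost savings come from.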

3. Hybrid Approach

Zyte is also exploring a hybrid method that combines the strengths of both approaches:

  1. Using LLM-generated code to identify and select the relevant parts of a page
  2. Allowing that code to call LLMs when needed for complex analyses

This provides the flexibility of direct LLM extraction with the cost-efficiency of code generation.
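One way to picture the hybrid split, with illustrative selectors, field names, and model choice rather than anything Zyte has published, is code that handles the mechanical fields itself and delegates only the judgement call to an LLM:

```python
# Sketch of the hybrid split: plain selector code (the kind the code-generation
# step would produce) handles mechanical fields, and an LLM is called only for
# the delivery question that needs interpretation. All names are illustrative.
from openai import OpenAI
from parsel import Selector

client = OpenAI()

def extract_hybrid(html: str) -> dict:
    sel = Selector(text=html)
    # Deterministic part: cheap, reusable selectors.
    record = {
        "name": sel.css("h1.product-title::text").get(),
        "price": sel.css("span.price::text").get(),
    }
    # LLM part: only the small delivery-policy snippet is sent to the model.
    delivery_text = " ".join(sel.css("div.delivery-info ::text").getall())
    answer = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative
        messages=[{
            "role": "user",
            "content": "Based on this text, is delivery free (possibly only above "
                       f"a minimum order value)? Answer yes or no.\n\n{delivery_text}",
        }],
    )
    record["free_delivery"] = answer.choices[0].message.content.strip().lower().startswith("yes")
    return record
```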

Optimization Strategies

To make LLM extraction more cost-effective, Zyte implements several optimization techniques:

  • Using smaller models: Fine-tuned 8B parameter models instead of 165B parameter models
  • Sending only essential text: Truncating HTML to relevant sections before sending to LLMs
  • Batch processing: Running many pages through the model at once to work around memory-bandwidth limitations
  • Context reuse: Improving efficiency in the prefill phase (see the sketch after this list)
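A minimal sketch of those serving-side ideas, assuming vLLM and an open 8B instruct model (the fine-tuned model Zyte actually runs is not named), could look like this:

```python
# Sketch of serving-side optimizations: a small model, batched requests, and a
# shared prompt prefix whose prefill work can be cached and reused.
# Model choice and prompt are assumptions, not Zyte's configuration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative small model
    enable_prefix_caching=True,                # reuse prefill for the shared prefix
)

SHARED_PREFIX = (
    "Extract name, price, currency and free_delivery as JSON from the "
    "simplified product page below.\n\n"
)

def extract_batch(simplified_pages: list[str]) -> list[str]:
    prompts = [SHARED_PREFIX + page for page in simplified_pages]
    params = SamplingParams(temperature=0.0, max_tokens=256)
    outputs = llm.generate(prompts, params)   # the engine batches these internally
    return [out.outputs[0].text for out in outputs]
```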

Advanced LLM Techniques

Sanchez highlighted emerging technologies making code generation more powerful:

  • ReAct prompting: Combining reasoning and actions in a loop – especially effective with powerful models like Gemini 2.5 Pro
  • Self-discovery and debugging: Using print statements to help LLMs understand what’s happening when their code runs

In one case study, Sanchez demonstrated how Gemini 2.5 Pro could analyze complex product pages and discover high-resolution images hidden in nested JSON structures through iterative self-debugging – a task that would be challenging even for human developers.
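The self-debugging loop can be approximated in a few dozen lines: run the model's code, capture its print output and any traceback, and feed both back as the next observation. Everything here (model name, prompts, round limit, the assumption that the reply is plain code) is an illustrative simplification.

```python
# Hedged sketch of a self-debugging loop: execute model-written code, capture
# its prints and errors, and return them to the model so it can revise itself.
import contextlib
import io
import traceback

from openai import OpenAI

client = OpenAI()

def run_and_capture(code: str, html: str) -> str:
    """Execute the candidate code with the page HTML in scope and capture
    everything it prints, plus any traceback."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            namespace: dict = {"html": html}
            exec(code, namespace)  # same sandboxing caveat as above
    except Exception:
        buf.write(traceback.format_exc())
    return buf.getvalue()

def self_debug(task: str, html: str, max_rounds: int = 4) -> str:
    messages = [{"role": "user",
                 "content": f"{task}\nUse print() to inspect the page as you go."}]
    code = ""
    for _ in range(max_rounds):
        reply = client.chat.completions.create(model="gpt-4o", messages=messages)
        code = reply.choices[0].message.content  # assumed to be plain code
        observation = run_and_capture(code, html)
        messages += [
            {"role": "assistant", "content": code},
            {"role": "user",
             "content": f"Output of running your code:\n{observation}\n"
                        "Revise the code if the result is not correct yet."},
        ]
    return code
```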

Challenges and Risks

As agents powered by LLMs become more capable, Sanchez warned of several potential issues:

  • In-context reward hacking: LLMs might find undesirable shortcuts to accomplish goals
  • Memory/RAG poisoning: Vulnerabilities when potentially malicious page content ends up in the agent's memory or retrieval context
  • Improper tool use: Security risks when allowing LLMs to use external tools
  • Ethical and legal concerns: Respecting robots.txt files and website terms of service (a basic robots.txt check is sketched after this list)
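On the last point, the robots.txt portion at least is straightforward to automate with the Python standard library; terms-of-service review still needs separate, usually human, attention.

```python
# Simple robots.txt check using only the standard library. This covers
# robots.txt only; it says nothing about a site's terms of service.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_to_fetch(url: str, user_agent: str = "my-crawler") -> bool:
    parts = urlparse(url)
    robots = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()
    return robots.can_fetch(user_agent, url)

# Example: skip the page entirely if the site disallows it for our agent.
if not allowed_to_fetch("https://example.com/products/123"):
    raise SystemExit("Blocked by robots.txt; skipping this page.")
```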

Looking Ahead

The best approach to web data extraction depends on specific needs. Direct LLM extraction works well for smaller projects, while code generation is more suitable for large-scale operations like scraping major e-commerce platforms.

While LLM agents show immense potential for data extraction, they require careful implementation and monitoring to avoid technical and ethical pitfalls. As these technologies continue to evolve, companies like Zyte are working to harness their capabilities while addressing the associated challenges.
