Revolutionizing Web Data Extraction with LLMs: Insights from Zyte

The field of web data extraction is experiencing a revolution thanks to Large Language Models (LLMs). Ivan Sanchez, a data scientist at Zyte, recently shared valuable insights on how these powerful AI models are transforming the web scraping industry, making previously complex extraction tasks trivial while presenting new opportunities and challenges.

The Evolution of Web Data Extraction at Zyte

Zyte provides comprehensive web data extraction services that go beyond simple scraping. The company offers solutions for:

  • Data extraction
  • Web crawling
  • Anti-ban systems and proxy management
  • CAPTCHA solving technologies
  • Custom extraction projects
  • Legal compliance services

Before the rise of LLMs, Zyte relied on standard schemas – predefined data structures for common extraction targets like product information. These schemas worked with deep neural networks that classified HTML nodes into specific data types (price, description, etc.).

However, many customers had specialized needs beyond standard schemas. For example, one food retailer wanted to know whether specific ingredients were present in product descriptions, while another needed to determine if delivery was free based on complex conditions (like minimum order values).

LLM-Powered Extraction Approaches

Sanchez outlined three main approaches Zyte is implementing to leverage LLMs:

1. Direct LLM Extraction

The most straightforward approach is using LLMs directly for each page. The process works by:

  1. Receiving a URL and schema request
  2. Obtaining the HTML via a browser
  3. Simplifying the HTML through a preprocessing pipeline
  4. Having the LLM extract the structured data

While effective, this approach can be expensive at scale: at roughly half a cent per page, extracting a million pages costs about $5,000, which becomes significant when processing many millions of pages.
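To make the flow concrete, here is a minimal sketch of such a pipeline in Python. It assumes an OpenAI-compatible client, an illustrative schema and model name, and a crude stand-in for the preprocessing step; none of it reflects Zyte's actual implementation.

```python
# Minimal sketch of direct LLM extraction, not Zyte's actual pipeline.
# Assumes an OpenAI-compatible API; schema and model name are illustrative.
import json
from bs4 import BeautifulSoup
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PRODUCT_SCHEMA = {"name": "string", "price": "string", "free_delivery": "boolean"}

def simplify_html(html: str) -> str:
    """Crude stand-in for the preprocessing pipeline: drop non-content tags
    and collapse the page to its visible text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    return soup.get_text(" ", strip=True)

def extract(url: str, html: str, schema: dict = PRODUCT_SCHEMA) -> dict:
    prompt = (
        "Extract the following fields from this product page and reply with JSON only.\n"
        f"Schema: {json.dumps(schema)}\n\n"
        f"URL: {url}\n"
        f"Page text:\n{simplify_html(html)}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",                      # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # request structured output
    )
    return json.loads(response.choices[0].message.content)
```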

2. LLM-Generated Code for Extraction

A more cost-effective approach involves using LLMs to generate extraction code:

  1. Providing 5-6 URLs as examples of what needs extraction
  2. Having the LLM generate custom code based on those examples
  3. Storing the code and using it for extracting data from many additional pages

This approach dramatically reduces costs but requires addressing challenges like code safety, detecting when pages change, and verifying code accuracy.
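A rough sketch of the code-generation route, under the same caveats (OpenAI-compatible client, illustrative prompts, and no real sandboxing), might look like this:

```python
# Rough sketch of LLM-generated extraction code; prompts, model name, and the
# lack of sandboxing are simplifications, not Zyte's implementation.
import json
from openai import OpenAI

client = OpenAI()

CODEGEN_PROMPT = """Write a Python function `parse(html: str) -> dict` that extracts
the fields {schema} from pages like the examples below, using the `parsel` library.
Reply with code only.

{examples}
"""

def generate_parser(example_htmls: list[str], schema: dict) -> str:
    examples = "\n\n".join(
        f"EXAMPLE {i + 1}:\n{html[:4000]}" for i, html in enumerate(example_htmls)
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative
        messages=[{
            "role": "user",
            "content": CODEGEN_PROMPT.format(schema=json.dumps(schema), examples=examples),
        }],
    )
    return response.choices[0].message.content  # store this alongside the target site

def run_parser(code: str, html: str) -> dict:
    # WARNING: exec-ing model-generated code is exactly the code-safety risk
    # mentioned above; a real system would sandbox and validate this step.
    namespace: dict = {}
    exec(code, namespace)
    return namespace["parse"](html)
```

The generated parser is paid for once and then reused across many pages, which is where the cost savings come from.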

3. Hybrid Approach

Zyte is also exploring a hybrid method that combines the strengths of both approaches:

  1. Using LLM-generated code to identify and select the relevant parts of a page
  2. Allowing that code to call LLMs when needed for complex analyses

This provides the flexibility of direct LLM extraction with the cost-efficiency of code generation.
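One way to picture the hybrid split, with illustrative selectors, field names, and model choice rather than anything Zyte has published, is code that handles the mechanical fields itself and delegates only the judgement call to an LLM:

```python
# Sketch of the hybrid split: plain selector code (the kind the code-generation
# step would produce) handles mechanical fields, and an LLM is called only for
# the delivery question that needs interpretation. All names are illustrative.
from openai import OpenAI
from parsel import Selector

client = OpenAI()

def extract_hybrid(html: str) -> dict:
    sel = Selector(text=html)
    # Deterministic part: cheap, reusable selectors.
    record = {
        "name": sel.css("h1.product-title::text").get(),
        "price": sel.css("span.price::text").get(),
    }
    # LLM part: only the small delivery-policy snippet is sent to the model.
    delivery_text = " ".join(sel.css("div.delivery-info ::text").getall())
    answer = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative
        messages=[{
            "role": "user",
            "content": "Based on this text, is delivery free (possibly only above "
                       f"a minimum order value)? Answer yes or no.\n\n{delivery_text}",
        }],
    )
    record["free_delivery"] = answer.choices[0].message.content.strip().lower().startswith("yes")
    return record
```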

Optimization Strategies

To make LLM extraction more cost-effective, Zyte implements several optimization techniques:

  • Using smaller models: Fine-tuned 8B parameter models instead of 165B parameter models
  • Sending only essential text: Truncating HTML to relevant sections before sending to LLMs
  • Batch processing: Running many pages through the model at once to work around memory-bandwidth limitations
  • Context reuse: Improving efficiency in the prefill phase (see the sketch after this list)
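A minimal sketch of those serving-side ideas, assuming vLLM and an open 8B instruct model (the fine-tuned model Zyte actually runs is not named), could look like this:

```python
# Sketch of serving-side optimizations: a small model, batched requests, and a
# shared prompt prefix whose prefill work can be cached and reused.
# Model choice and prompt are assumptions, not Zyte's configuration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative small model
    enable_prefix_caching=True,                # reuse prefill for the shared prefix
)

SHARED_PREFIX = (
    "Extract name, price, currency and free_delivery as JSON from the "
    "simplified product page below.\n\n"
)

def extract_batch(simplified_pages: list[str]) -> list[str]:
    prompts = [SHARED_PREFIX + page for page in simplified_pages]
    params = SamplingParams(temperature=0.0, max_tokens=256)
    outputs = llm.generate(prompts, params)   # the engine batches these internally
    return [out.outputs[0].text for out in outputs]
```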

Advanced LLM Techniques

Sanchez highlighted emerging technologies making code generation more powerful:

  • ReAct prompting: Combining reasoning and actions in a loop – especially effective with powerful models like Gemini 2.5 Pro
  • Self-discovery and debugging: Using print statements to help LLMs understand what’s happening when their code runs

In one case study, Sanchez demonstrated how Gemini 2.5 Pro could analyze complex product pages and discover high-resolution images hidden in nested JSON structures through iterative self-debugging – a task that would be challenging even for human developers.
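The self-debugging loop can be approximated in a few dozen lines: run the model's code, capture its print output and any traceback, and feed both back as the next observation. Everything here (model name, prompts, round limit, the assumption that the reply is plain code) is an illustrative simplification.

```python
# Hedged sketch of a self-debugging loop: execute model-written code, capture
# its prints and errors, and return them to the model so it can revise itself.
import contextlib
import io
import traceback

from openai import OpenAI

client = OpenAI()

def run_and_capture(code: str, html: str) -> str:
    """Execute the candidate code with the page HTML in scope and capture
    everything it prints, plus any traceback."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            namespace: dict = {"html": html}
            exec(code, namespace)  # same sandboxing caveat as above
    except Exception:
        buf.write(traceback.format_exc())
    return buf.getvalue()

def self_debug(task: str, html: str, max_rounds: int = 4) -> str:
    messages = [{"role": "user",
                 "content": f"{task}\nUse print() to inspect the page as you go."}]
    code = ""
    for _ in range(max_rounds):
        reply = client.chat.completions.create(model="gpt-4o", messages=messages)
        code = reply.choices[0].message.content  # assumed to be plain code
        observation = run_and_capture(code, html)
        messages += [
            {"role": "assistant", "content": code},
            {"role": "user",
             "content": f"Output of running your code:\n{observation}\n"
                        "Revise the code if the result is not correct yet."},
        ]
    return code
```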

Challenges and Risks

As agents powered by LLMs become more capable, Sanchez warned of several potential issues:

  • In-context reward hacking: LLMs might find undesirable shortcuts to accomplish goals
  • Memory/RAG poisoning: Vulnerabilities when potentially malicious page content ends up in the agent's memory or retrieval context
  • Improper tool use: Security risks when allowing LLMs to use external tools
  • Ethical and legal concerns: Respecting robots.txt files and website terms of service (a basic robots.txt check is sketched after this list)
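On the last point, the robots.txt portion at least is straightforward to automate with the Python standard library; terms-of-service review still needs separate, usually human, attention.

```python
# Simple robots.txt check using only the standard library. This covers
# robots.txt only; it says nothing about a site's terms of service.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_to_fetch(url: str, user_agent: str = "my-crawler") -> bool:
    parts = urlparse(url)
    robots = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()
    return robots.can_fetch(user_agent, url)

# Example: skip the page entirely if the site disallows it for our agent.
if not allowed_to_fetch("https://example.com/products/123"):
    raise SystemExit("Blocked by robots.txt; skipping this page.")
```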

Looking Ahead

The best approach to web data extraction depends on specific needs. Direct LLM extraction works well for smaller projects, while code generation is more suitable for large-scale operations like scraping major e-commerce platforms.

While LLM agents show immense potential for data extraction, they require careful implementation and monitoring to avoid technical and ethical pitfalls. As these technologies continue to evolve, companies like Zyte are working to harness their capabilities while addressing the associated challenges.
