How to Extract Structured Data from Websites with FireCrawl
FireCrawl is a powerful tool that allows users to transform any website into LLM-ready data within seconds. This open-source solution offers 500 free credits to new users, providing ample opportunity to explore its capabilities. FireCrawl offers four primary functions: scrape, crawl, map, and extract – with the latter being particularly useful for pulling specific data from websites based on custom prompts.
The key difference between scraping and extracting is significant. When scraping a website like “Quotes to Scrape,” FireCrawl returns the content in a more readable markdown format compared to raw HTML. However, the extract function goes further by allowing users to specify exactly what data they want to pull from the site.
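As a minimal Python sketch of the scrape side of that comparison (assuming FireCrawl's hosted v1 API; the endpoint path, `formats` field, and response shape follow FireCrawl's public documentation and may differ in your version):

```python
import json
import os
import urllib.request

# Ask for the page back as clean markdown rather than raw HTML.
payload = {
    "url": "https://quotes.toscrape.com",
    "formats": ["markdown"],
}

# Only hit the network when an API key is configured in the environment.
api_key = os.environ.get("FIRECRAWL_API_KEY")
if api_key:
    req = urllib.request.Request(
        "https://api.firecrawl.dev/v1/scrape",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)
    # Per FireCrawl's docs, the markdown lives under data.markdown.
    print(result["data"]["markdown"][:500])
```

The same request body maps directly onto an HTTP node in a workflow tool, which is how it is used later in this article.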
Understanding FireCrawl’s Extract Function
The extract function works by providing both a URL and a schema that tells the LLM what information to look for. For example, when targeting “Quotes to Scrape,” users can create a schema that identifies both the quote text and author name as strings to be extracted.
One of the most powerful features is FireCrawl’s ability to crawl through multiple pages of a website. By adding an asterisk after the base URL (e.g., quotes.toscrape.com/*), FireCrawl will automatically discover and process every page within that domain, extracting the requested information from each one. In the demonstration, this method pulled 79 quotes from across the entire website rather than just the 10 quotes visible on the homepage.
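Putting the schema and the wildcard together, an extract request might look like the following sketch. The schema shape is a hypothetical illustration for the "Quotes to Scrape" site; the `urls`/`prompt`/`schema` payload and the v1 extract endpoint follow FireCrawl's documented API, but verify against the current docs:

```python
import json
import os
import urllib.request

# Hypothetical schema: ask the LLM to return each quote's text and
# author as strings, collected into an array.
schema = {
    "type": "object",
    "properties": {
        "quotes": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "quote": {"type": "string"},
                    "author": {"type": "string"},
                },
                "required": ["quote", "author"],
            },
        }
    },
    "required": ["quotes"],
}

# The trailing /* asks FireCrawl to crawl and extract from every page
# under the domain, not just the single URL given.
payload = {
    "urls": ["https://quotes.toscrape.com/*"],
    "prompt": "Extract every quote and its author.",
    "schema": schema,
}

# Only submit when an API key is configured in the environment.
api_key = os.environ.get("FIRECRAWL_API_KEY")
if api_key:
    req = urllib.request.Request(
        "https://api.firecrawl.dev/v1/extract",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        # The extract endpoint responds with a job id to poll, not the data.
        print(json.load(resp))
```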
Implementing FireCrawl in N8N for Automation
To leverage FireCrawl in automation workflows, users can implement it within N8N, a popular workflow automation platform. The process involves:
- Creating an HTTP request node in N8N
- Importing the curl command from FireCrawl’s documentation
- Setting up authorization with a FireCrawl API key (users can save this as a reusable credential)
- Configuring the request body with the target URL and schema
- Implementing polling to check when the asynchronous extraction is complete
When setting up the request body, users must define both the prompt (what data to extract) and the schema (the structure of that data). The extraction process is asynchronous, requiring a second HTTP request to check the status using the extraction ID returned from the initial request.
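That polling pattern can be sketched in Python. The status values (`completed`, `failed`) and the `data` field follow FireCrawl's documented extract API; the fetch function is injected as a parameter so the loop itself can be tested without a network call:

```python
import time
from typing import Callable

def wait_for_extraction(extraction_id: str,
                        fetch_status: Callable[[str], dict],
                        poll_interval: float = 2.0,
                        max_attempts: int = 60) -> dict:
    """Poll until the asynchronous extract job completes, then return its data.

    fetch_status should GET FireCrawl's status endpoint for the given job id
    and return the parsed JSON body.
    """
    for _ in range(max_attempts):
        body = fetch_status(extraction_id)
        status = body.get("status")
        if status == "completed":
            return body.get("data", {})
        if status == "failed":
            raise RuntimeError(f"extraction {extraction_id} failed")
        time.sleep(poll_interval)
    raise TimeoutError(f"extraction {extraction_id} still running after {max_attempts} polls")
```

In N8N the equivalent is a second HTTP request node in a loop (often paired with a Wait node), checking the status field of the response until the job reports completion.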
Advanced Features and Use Cases
FireCrawl’s wildcard URL feature demonstrates its power: when tested against the same website without the asterisk wildcard, it returned only the 10 quotes visible on the homepage, while the wildcarded request extracted the quotes from every page of the site.
Potential use cases for FireCrawl in business automation include:
- Processing multiple URLs in batch from a spreadsheet
- Researching companies at scale
- Generating initial outreach messages based on extracted data
- Creating structured datasets from unstructured web content
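The first of these use cases, batch processing from a spreadsheet, can be sketched as a short loop that turns each row of a CSV export into an extract payload. The column name and prompt here are hypothetical; each payload would then be submitted to FireCrawl (or fed into an N8N HTTP node) exactly as described above:

```python
import csv
import io

# Hypothetical spreadsheet export: one company URL per row.
sheet = io.StringIO("url\nhttps://example.com\nhttps://example.org\n")

payloads = []
for row in csv.DictReader(sheet):
    payloads.append({
        # Wildcard so each company's whole site is crawled, not just the homepage.
        "urls": [row["url"].rstrip("/") + "/*"],
        "prompt": "Summarise what this company does and list its main products.",
    })
```

From there, each extracted result can feed a downstream step such as drafting an outreach message or appending a row to a structured dataset.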
FireCrawl offers a 10% discount on the first 12 months using code HERC10.
Conclusion
FireCrawl represents a significant advancement in web data extraction, enabling users to pull structured information from websites without dealing with complex HTML parsing. By combining crawling capabilities with LLM-powered extraction, it can automatically process multiple pages and return only the specific data points requested. When integrated into automation platforms like N8N, FireCrawl becomes even more powerful, enabling scalable data collection workflows across numerous websites.