Building a Web Documentation Assistant: Creating a Free AI Tool with Web Scraping and RAG
This article walks through building an AI assistant that can read and answer questions about any web documentation. The project combines web scraping with Retrieval Augmented Generation (RAG) to create a specialized assistant that answers questions about a specific documentation set as if it were an expert on the subject, using only free tools.
The Project Overview: docs.telles.app
The project combines several key technologies:
- Groq: An API for accessing hosted AI language models, with a free tier
- FileCrawl: An application that performs web scraping by crawling a complete site’s links and sublinks
- RAG (Retrieval Augmented Generation): A technique, implemented here with LangChain libraries, that supplies the model with relevant excerpts from the documentation when it answers a question
- Streamlit: A framework for creating the user interface
The workflow is straightforward. First, the user provides a documentation link to FileCrawl, which downloads all pages as markdown files. The system then indexes these documents for RAG, producing an assistant that specializes in that documentation. Finally, users chat with the assistant and ask it questions about the documentation.
Setting Up the Project
The project begins with creating a folder structure and installing necessary dependencies. Using Poetry as a package manager, we install Streamlit and other required libraries. The basic application structure includes:
- A main application file (app.py)
- Service modules for scraping and RAG
- Presentation modules for the UI
The Streamlit interface features a sidebar for navigation between the chat and scraping modules, plus a collection selector to choose which documentation set to use.
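The sidebar described above could be sketched roughly as follows. This is a minimal sketch, not the project's actual code: the `collections` folder name, the `list_collections` helper, and the `render_sidebar` function are assumptions made for illustration. Only `st.sidebar.radio` and `st.sidebar.selectbox` are standard Streamlit calls.

```python
from __future__ import annotations

from pathlib import Path

# Hypothetical location: one subfolder per scraped documentation set.
COLLECTIONS_DIR = Path("collections")


def list_collections(base_dir: Path = COLLECTIONS_DIR) -> list[str]:
    """Return the names of all collection folders, sorted alphabetically."""
    if not base_dir.is_dir():
        return []
    return sorted(p.name for p in base_dir.iterdir() if p.is_dir())


def render_sidebar() -> tuple[str, str | None]:
    """Draw the sidebar and return (selected page, selected collection)."""
    # Imported here so list_collections stays usable without Streamlit installed.
    import streamlit as st

    page = st.sidebar.radio("Navigation", ["Chat", "Scraping"])
    collections = list_collections()
    # Show the selector only when at least one collection exists.
    collection = st.sidebar.selectbox("Collection", collections) if collections else None
    return page, collection
```

In app.py, the returned page name would decide which presentation module to render, and the selected collection would be passed on to the RAG service.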
The Web Scraping Service
The scraping service utilizes FileCrawl to download documentation. FileCrawl can be used in two ways:
- Through the cloud version (with a free tier allowing 500 pages)
- By running it locally using Docker (with no limitations)
The service performs these key functions:
- Accepts a URL and collection name from the user
- Crawls the website to find all documentation pages
- Downloads each page as a markdown file
- Organizes files in a collection folder for later use
For local deployment, Docker is required to run FileCrawl, providing a free and unlimited alternative to the cloud version.
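The last two steps of the service, saving each page as markdown and grouping the files by collection, might look something like the sketch below. The crawl itself is left out, because the article does not show the crawler's client API. The `pages` mapping (URL to markdown text) is a hypothetical stand-in for whatever the crawler returns, and `slugify` and `save_collection` are illustrative names.

```python
import re
from pathlib import Path


def slugify(url: str) -> str:
    """Turn a page URL into a safe, lowercase file name."""
    path = re.sub(r"^https?://", "", url).strip("/")
    slug = re.sub(r"[^A-Za-z0-9]+", "-", path).strip("-").lower()
    return slug or "index"


def save_collection(
    pages: dict[str, str],
    collection: str,
    base_dir: Path = Path("collections"),  # assumed folder name
) -> list[Path]:
    """Write each page's markdown to <base_dir>/<collection>/<slug>.md."""
    folder = base_dir / collection
    folder.mkdir(parents=True, exist_ok=True)
    written = []
    for url, markdown in pages.items():
        target = folder / f"{slugify(url)}.md"
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written
```

Keeping one folder per collection means the RAG service can later load a documentation set simply by reading every markdown file in that folder.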
The RAG Service
The RAG (Retrieval Augmented Generation) service is the heart of the system, handling these tasks:
- Loading documents from collections
- Splitting documents into manageable chunks
- Creating embeddings using a Hugging Face model
- Storing vectors in a database for efficient retrieval
- Connecting to a language model (Llama-3 8B) via the Groq API
- Creating a chat chain with prompt templates for consistent responses
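To make the chunking and retrieval steps concrete, here is a simplified, dependency-free sketch. It is not the project's LangChain code. It uses fixed-size character chunks with overlap, and a bag-of-words count vector as a toy stand-in for the Hugging Face embedding model and the vector database. All function names are illustrative.

```python
import math
import re
from collections import Counter


def split_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Cut text into chunks of at most chunk_size characters; neighbours share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: word counts instead of a dense vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
```

A real embedding model captures meaning rather than exact word overlap, but the flow is the same: split, embed, then rank chunks by similarity to the question.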
The service uses a prompt template that instructs the AI to answer based solely on the provided documentation, which reduces hallucinations and keeps answers grounded in the source material.
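A template of this kind could look roughly like the sketch below. The wording of the prompt and the `build_prompt` helper are assumptions for illustration; the project's actual template is not shown in the article.

```python
# Hypothetical prompt wording; the key idea is restricting answers to the retrieved excerpts.
PROMPT_TEMPLATE = """You are an assistant for the documentation excerpts below.
Answer using only these excerpts. If the answer is not in them, say that you do not know.

Documentation:
{context}

Question: {question}
Answer:"""


def build_prompt(question: str, chunks: list[str]) -> str:
    """Fill the template with the retrieved chunks and the user's question."""
    context = "\n\n---\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)
```

Telling the model explicitly what to do when the excerpts lack an answer is what discourages it from guessing.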
The User Interface
The Streamlit interface provides two main functions:
- Scraping Interface: Allows users to enter a URL and collection name, then initiates the scraping process
- Chat Interface: Provides a chat window where users can ask questions about the loaded documentation
The system maintains a list of available collections and allows users to switch between them easily.
Limitations and Future Improvements
While functional, the basic implementation has some limitations:
- No Memory: The chat doesn’t maintain context between messages
- Limited Models: Currently uses only models served through Groq
- Basic Interface: The UI could be enhanced for better user experience
These limitations provide opportunities for extending the project based on specific business needs.
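As one example of such an extension, the missing memory could be approached by keeping the last few turns and prepending them to each new prompt. The `ChatMemory` class below is a hypothetical sketch of that idea, not part of the project.

```python
from collections import deque


class ChatMemory:
    """Keeps the most recent question/answer turns for inclusion in the next prompt."""

    def __init__(self, max_turns: int = 5) -> None:
        # A bounded deque drops the oldest turn automatically once it is full.
        self.turns = deque(maxlen=max_turns)

    def add(self, question: str, answer: str) -> None:
        self.turns.append((question, answer))

    def as_text(self) -> str:
        """Render the stored turns as plain text, oldest first."""
        return "\n".join(f"User: {q}\nAssistant: {a}" for q, a in self.turns)
```

Capping the number of turns keeps the prompt within the model's context window, at the cost of forgetting older parts of the conversation.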
Practical Applications
This tool addresses a significant problem with current AI systems: their limited knowledge of recent or specialized documentation. Because the assistant retrieves its answers from a specific documentation set rather than relying on what the model learned in training, users can get accurate, up-to-date information that general AI models might not have access to.
Potential applications include:
- Internal company documentation assistants
- Programming library specialists
- Product support chatbots
- Research assistants for specific domains
Conclusion
By combining web scraping with RAG techniques, this project demonstrates how to create specialized AI assistants without relying on costly API services. The modular design allows for customization and extension to meet various business needs, providing a practical solution for working with specialized documentation through conversational AI.