Building a Web Documentation Assistant: Creating a Free AI Tool with Web Scraping and RAG

This article explores how to build an AI assistant that can read and answer questions about any web documentation. The project combines web scraping with Retrieval Augmented Generation (RAG) to create a specialized assistant that answers questions about a specific documentation set as if it were an expert on the subject, all using free tools.

The Project Overview: docs.telles.app

The project combines several key technologies:

Groq: A free-to-use API for accessing hosted AI language models
FileCrawl: An application that performs web scraping by crawling a complete site’s links and sublinks
RAG (Retrieval Augmented Generation): A technique, implemented here with LangChain libraries, that enhances AI responses with context retrieved from the documentation
Streamlit: A framework for creating the user interface

The workflow is straightforward: First, provide a documentation link to FileCrawl, which downloads all pages as markdown files. Then, the system processes these documents through RAG, creating an assistant that specializes in that specific documentation. Users can then chat with this assistant, asking questions about the documentation.

Setting Up the Project

The project begins with creating a folder structure and installing necessary dependencies. Using Poetry as a package manager, we install Streamlit and other required libraries. The basic application structure includes:

A main application file (app.py)
Service modules for scraping and RAG
Presentation modules for the UI

The Streamlit interface features a sidebar for navigation between the chat and scraping modules, plus a collection selector to choose which documentation set to use.
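
The collection selector needs a list of the documentation sets that have already been scraped. Below is a minimal sketch of how that list could be produced, assuming each collection is stored as a subfolder of a directory named collections. The function name list_collections and the directory name are illustrative assumptions, not taken from the original project.

```python
from pathlib import Path


def list_collections(base_dir: str = "collections") -> list[str]:
    """Return the names of the collection folders under base_dir, sorted."""
    root = Path(base_dir)
    if not root.is_dir():
        # Nothing has been scraped yet, so there is nothing to select.
        return []
    # Only subfolders count as collections; loose files are ignored.
    return sorted(p.name for p in root.iterdir() if p.is_dir())
```

In the Streamlit app, the returned list could be passed as the options of a sidebar select box (for example st.sidebar.selectbox), with the empty case handled by pointing the user to the scraping page first.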

The Web Scraping Service

The scraping service utilizes FileCrawl to download documentation. FileCrawl can be used in two ways:

Through the cloud version (with a free tier allowing 500 pages)
By running it locally using Docker (with no limitations)

The service performs these key functions:

Accepts a URL and collection name from the user
Crawls the website to find all documentation pages
Downloads each page as a markdown file
Organizes files in a collection folder for later use
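
The last two steps above can be sketched in plain Python. This example does not call FileCrawl itself; it assumes the crawl has already produced a list of (url, markdown) pairs, which is an assumption about the shape of the data rather than FileCrawl's actual response format. The helper names url_to_filename and save_collection are hypothetical.

```python
import re
from pathlib import Path
from urllib.parse import urlparse


def url_to_filename(url: str) -> str:
    """Turn a page URL into a safe markdown file name."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}".strip("/")
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", raw).strip("-").lower()
    return f"{slug or 'index'}.md"


def save_collection(
    name: str,
    pages: list[tuple[str, str]],
    base_dir: str = "collections",
) -> list[Path]:
    """Write each (url, markdown) pair into base_dir/name as a .md file."""
    folder = Path(base_dir) / name
    folder.mkdir(parents=True, exist_ok=True)
    written = []
    for url, markdown in pages:
        path = folder / url_to_filename(url)
        path.write_text(markdown, encoding="utf-8")
        written.append(path)
    return written
```

Deriving the file name from the URL keeps a re-scrape of the same page from creating duplicates: the new content simply overwrites the old file.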

For local deployment, Docker is required to run FileCrawl, providing a free and unlimited alternative to the cloud version.

The RAG Service

The RAG (Retrieval Augmented Generation) service is the heart of the system, handling these tasks:

Loading documents from collections
Splitting documents into manageable chunks
Creating embeddings using a Hugging Face model
Storing vectors in a database for efficient retrieval
Connecting to a language model (Llama-3 8B) via the Groq API
Creating a chat chain with prompt templates for consistent responses
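
To make the splitting and retrieval steps concrete, here is a dependency-free sketch of the idea. It is not the LangChain code used in the project: the fixed-size splitter stands in for a LangChain text splitter, and the word-count vectors stand in for the Hugging Face embeddings and the vector database. All function names are illustrative.

```python
import math
import re
from collections import Counter


def split_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of chunk_size characters that overlap by `overlap`."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += chunk_size - overlap
    return chunks


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real system uses a neural model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]
```

The overlap between neighboring chunks is there so that a sentence cut at a chunk boundary still appears whole in at least one chunk. The retrieved chunks are what gets passed to the language model as context.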

The service uses a prompt template that instructs the AI to answer based solely on the provided documentation, which reduces hallucinations and improves accuracy.
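
A template of this kind might look like the sketch below. The exact wording is an assumption, not the project's actual prompt, and the plain string formatting stands in for LangChain's prompt template classes; build_prompt is a hypothetical helper.

```python
PROMPT_TEMPLATE = """You are an assistant for the documentation below.
Answer using only the information in the context.
If the answer is not in the context, say that you do not know.

Context:
{context}

Question: {question}
Answer:"""


def build_prompt(question: str, chunks: list[str]) -> str:
    """Fill the template with the retrieved chunks and the user's question."""
    # A visible separator helps the model tell the chunks apart.
    context = "\n\n---\n\n".join(chunk.strip() for chunk in chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question.strip())
```

The explicit instruction to admit when the context lacks the answer is what discourages the model from filling gaps with invented details.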

The User Interface

The Streamlit interface provides two main functions:

Scraping Interface: Allows users to enter a URL and collection name, then initiates the scraping process
Chat Interface: Provides a chat window where users can ask questions about the loaded documentation

The system maintains a list of available collections and allows users to switch between them easily.

Limitations and Future Improvements

While functional, the basic implementation has some limitations:

No Memory: The chat doesn’t maintain context between messages
Limited Models: Currently only uses Groq’s models
Basic Interface: The UI could be enhanced for better user experience

These limitations provide opportunities for extending the project based on specific business needs.
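
As one example of such an extension, the missing memory could be addressed by keeping the most recent exchanges and prepending them to each new prompt. The sketch below is a hypothetical starting point, not part of the original project; the ChatMemory class and its method names are assumptions.

```python
from collections import deque


class ChatMemory:
    """Keep the last max_turns question/answer pairs of a conversation."""

    def __init__(self, max_turns: int = 3):
        # A bounded deque silently drops the oldest turn when it is full.
        self._turns = deque(maxlen=max_turns)

    def add_turn(self, user: str, assistant: str) -> None:
        self._turns.append((user, assistant))

    def as_text(self) -> str:
        """Render the stored turns as text that can be added to a prompt."""
        lines = []
        for user, assistant in self._turns:
            lines.append(f"User: {user}")
            lines.append(f"Assistant: {assistant}")
        return "\n".join(lines)
```

Capping the number of stored turns keeps the prompt from growing without bound, which matters when the model has a limited context window.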

Practical Applications

This tool addresses a significant problem with current AI systems: their limited knowledge of recent or specialized documentation. Because the assistant retrieves its context from a specific documentation set at question time, rather than relying on what the model learned during training, users can get accurate, up-to-date information that general AI models might not have access to.

Potential applications include:

Internal company documentation assistants
Programming library specialists
Product support chatbots
Research assistants for specific domains

Conclusion

By combining web scraping with RAG techniques, this project demonstrates how to create specialized AI assistants without relying on costly API services. The modular design allows for customization and extension to meet various business needs, providing a practical solution for working with specialized documentation through conversational AI.