Building a Secure Web Scraping API for Data Extraction: A Practical Architecture
This article describes a web scraping solution that provides both authenticated and public access to structured data from online sources. The architecture implements token-based authentication and falls back to offline data retrieval when the online sources are unavailable.
Architecture Overview
The system follows a straightforward but effective architecture. Users authenticate through a login process that generates a JWT token. This token then grants access to the web scraping application. When a web scraping endpoint is called, the system processes the target website and returns structured data to the user. If the online source is unavailable, the system automatically falls back to offline data stored in pre-populated CSV files.
Authentication System
Security is implemented through a token-based authentication system. The architecture includes:
- User login endpoints that generate JWT tokens
- Protected routes that require valid authentication
- Public routes for limited testing without authentication
- User management with status tracking
The authentication module maintains user data including username, status, and access permissions. For demonstration purposes, user credentials are stored in a mock database and documented in the README file.
Route Structure
The API implements two types of routes:
- Protected Routes: Require authentication tokens for access
- Public Routes: Allow limited access for testing without authentication
The main data extraction endpoints accept parameters including year, data category (options), and subcategories (sub-options) to refine the information being requested.
Data Categories
The system organizes data into several primary categories:
- Production
- Processing
- Importation
- Exportation
- Commercialization
Many categories contain subcategories that provide more specific data sets. The API is designed to handle these hierarchical relationships automatically, adjusting the scraping parameters based on the selected options.
Online and Offline Processing
A key feature of the architecture is its dual-mode operation:
Online Mode
When online, the system performs real-time web scraping, parsing HTML tables from the source website and converting the data to JSON format for API responses. The system intelligently handles the website’s structure, including navigation through various options and sub-options.
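The HTML-table-to-JSON step can be sketched with the standard library's `html.parser`, assuming the page's first table has a header row followed by data rows. Real scrapers typically use a dedicated parser such as BeautifulSoup; this version only illustrates the transformation.

```python
import json
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect the cell text of every <tr> encountered in the page."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

def table_to_json(html: str) -> str:
    """Turn a header row plus data rows into a JSON array of objects."""
    parser = TableExtractor()
    parser.feed(html)
    header, *rows = parser.rows
    return json.dumps([dict(zip(header, row)) for row in rows])
```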
Offline Mode
The offline mode accesses pre-downloaded CSV files stored in the data folder. These files are organized by category and year, ensuring continuity of service even when the source website is unavailable. The offline processor reads these files and returns the data in the same format as the online service.
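The offline reader reduces to loading a CSV keyed by category and year and emitting the same row-of-dicts shape the online parser produces. The `data/<option>/<year>.csv` layout below is an assumption; the document only states that files are organized by category and year.

```python
import csv, io
from pathlib import Path

def rows_from_csv(text: str) -> list[dict]:
    """Parse CSV text into the same list-of-dicts shape the online scraper returns."""
    return list(csv.DictReader(io.StringIO(text)))

def read_offline(option: str, year: int, data_dir: Path = Path("data")) -> list[dict]:
    # Assumed layout: data/<option>/<year>.csv; adjust to the real folder structure.
    path = data_dir / option / f"{year}.csv"
    return rows_from_csv(path.read_text(encoding="utf-8"))
```

Because both code paths return the same structure, callers never need to know which mode served the request.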
Implementation Details
The codebase is organized into separate modules for:
- Authentication
- Routing
- Services (including web scraping logic)
- Utilities (parameter handling and URL formatting)
- Data storage (CSV files for offline mode)
Configuration settings include token expiration times (set to 60 minutes) and other system parameters.
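A settings object makes those parameters explicit in one place. Only the 60-minute token lifetime comes from the description above; the remaining fields and their defaults are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    """Central configuration; in practice these would be loaded from the environment."""
    token_expire_minutes: int = 60        # token lifetime stated above
    secret_key: str = "change-me"         # assumption: injected, never committed
    data_dir: str = "data"                # offline CSV location
    request_timeout_seconds: float = 10.0 # hypothetical scraping timeout

settings = Settings()
```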
API Usage
The API supports various query patterns, including:
- Filtering by year (historical through current)
- Selecting specific categories (production, importation, etc.)
- Targeting subcategories for more specific data
Responses are formatted as JSON, making the data easily consumable by client applications.
This architecture demonstrates a practical approach to web scraping that balances reliability, security, and data accessibility while providing fallback mechanisms for service continuity.