Building a Secure Web Scraping API for Data Extraction: A Practical Architecture
This article describes a web scraping solution that provides both authenticated and public access to structured data from online sources. The architecture implements token-based authentication and falls back to offline data retrieval when the online sources are unavailable.
Architecture Overview
The system follows a straightforward but effective architecture. Users authenticate through a login process that generates a JWT token. This token then grants access to the web scraping application. When a web scraping endpoint is called, the system processes the target website and returns structured data to the user. If the online source is unavailable, the system automatically falls back to offline data stored in pre-populated CSV files.
Authentication System
Security is implemented through a token-based authentication system. The architecture includes:
- User login endpoints that generate JWT tokens
- Protected routes that require valid authentication
- Public routes for limited testing without authentication
- User management with status tracking
The authentication module maintains user data including username, status, and access permissions. For demonstration purposes, user credentials are stored in a mock database and documented in the README file.
Route Structure
The API implements two types of routes:
- Protected Routes: Require authentication tokens for access
- Public Routes: Allow limited access for testing without authentication
The main data extraction endpoints accept parameters including year, data category (options), and subcategories (sub-options) to refine the information being requested.
Data Categories
The system organizes data into several primary categories:
- Production
- Processing
- Importation
- Exportation
- Commercialization
Many categories contain subcategories that provide more specific data sets. The API is designed to handle these hierarchical relationships automatically, adjusting the scraping parameters based on the selected options.
Online and Offline Processing
A key feature of the architecture is its dual-mode operation:
Online Mode
When online, the system performs real-time web scraping, parsing HTML tables from the source website and converting the data to JSON format for API responses. The system intelligently handles the website’s structure, including navigation through various options and sub-options.
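The HTML-table-to-JSON step can be sketched with the standard library's `html.parser`, assuming the page's first table has a header row followed by data rows. Real scrapers typically use a dedicated parser such as BeautifulSoup; this version only illustrates the transformation.

```python
import json
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect the cell text of every <tr> encountered in the page."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

def table_to_json(html: str) -> str:
    """Turn a header row plus data rows into a JSON array of objects."""
    parser = TableExtractor()
    parser.feed(html)
    header, *rows = parser.rows
    return json.dumps([dict(zip(header, row)) for row in rows])
```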
Offline Mode
The offline mode accesses pre-downloaded CSV files stored in the data folder. These files are organized by category and year, ensuring continuity of service even when the source website is unavailable. The offline processor reads these files and returns the data in the same format as the online service.
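The offline reader reduces to loading a CSV keyed by category and year and emitting the same row-of-dicts shape the online parser produces. The `data/<option>/<year>.csv` layout below is an assumption; the document only states that files are organized by category and year.

```python
import csv, io
from pathlib import Path

def rows_from_csv(text: str) -> list[dict]:
    """Parse CSV text into the same list-of-dicts shape the online scraper returns."""
    return list(csv.DictReader(io.StringIO(text)))

def read_offline(option: str, year: int, data_dir: Path = Path("data")) -> list[dict]:
    # Assumed layout: data/<option>/<year>.csv; adjust to the real folder structure.
    path = data_dir / option / f"{year}.csv"
    return rows_from_csv(path.read_text(encoding="utf-8"))
```

Because both code paths return the same structure, callers never need to know which mode served the request.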
Implementation Details
The codebase is organized into separate modules for:
- Authentication
- Routing
- Services (including web scraping logic)
- Utilities (parameter handling and URL formatting)
- Data storage (CSV files for offline mode)
Configuration settings include token expiration times (set to 60 minutes) and other system parameters.
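A settings object makes those parameters explicit in one place. Only the 60-minute token lifetime comes from the description above; the remaining fields and their defaults are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    """Central configuration; in practice these would be loaded from the environment."""
    token_expire_minutes: int = 60        # token lifetime stated above
    secret_key: str = "change-me"         # assumption: injected, never committed
    data_dir: str = "data"                # offline CSV location
    request_timeout_seconds: float = 10.0 # hypothetical scraping timeout

settings = Settings()
```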
API Usage
The API supports various query patterns, including:
- Filtering by year (historical through current)
- Selecting specific categories (production, importation, etc.)
- Targeting subcategories for more specific data
Responses are formatted as JSON, making the data easily consumable by client applications.
This architecture demonstrates a practical approach to web scraping that balances reliability, security, and data accessibility while providing fallback mechanisms for service continuity.