Advanced Web Scraping: Bypassing Custom CAPTCHA Systems on Chilean Business Data Sites

Web scraping often involves overcoming various security measures designed to prevent automated data extraction. In this article, we explore an advanced technique for bypassing a custom CAPTCHA system on a Chilean business information website.

The Challenge: Custom CAPTCHA Protection

The target website uses a custom CAPTCHA system rather than a standard implementation. The system generates an image containing a verification code that users must enter to access company information. What makes it interesting is that the same code is also delivered to the browser, Base64-encoded, where the page’s JavaScript handles it.

On inspection, we found that the CAPTCHA system calls an internal method that passes a JSON parameter. This parameter is not a standard opaque token. It is a Base64-encoded string that looks encrypted at first glance but, once decoded, contains the verification code alongside other data.
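
The decoding step can be sketched as follows. This is a minimal illustration, assuming the method's response is a JSON object and that the token sits under a key called captcha_token. That key name is hypothetical; the article does not give the real one.

```python
import base64
import json


def decode_captcha_token(response_text: str, field: str = "captcha_token") -> str:
    """Return the decoded text of the Base64 token found in a JSON response.

    `field` is a hypothetical key name; replace it with the key that the
    real response uses.
    """
    payload = json.loads(response_text)
    token = payload[field]
    # Base64 input must be a multiple of four characters long; restore any
    # padding that was stripped in transit.
    token += "=" * (-len(token) % 4)
    raw = base64.b64decode(token)
    # latin-1 maps every byte to exactly one character, so positions in the
    # resulting string match byte offsets even if the data is not valid UTF-8.
    return raw.decode("latin-1")
```

Decoding with latin-1 rather than UTF-8 is a deliberate choice here: it never fails on arbitrary bytes, which keeps character positions stable for the offset-based extraction described later.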

Understanding the Pattern

The key breakthrough came from pattern analysis. After decoding the strings from several generated CAPTCHAs and comparing them with the images, we found that:

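The verification code displayed in the image (e.g., “9986” or “7677”) is consistently found at positions 36-40 of the decoded Base64 string; since the codes are four characters long, this corresponds to the zero-based slice [36:40] in Python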
The string also contains other metadata, including timestamps of when the CAPTCHA was generated
The colors used in the CAPTCHA image make traditional OCR approaches difficult
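
This kind of pattern analysis can be automated. The sketch below is an illustration rather than the original tooling: given several decoded strings paired with the code read manually from each image, it reports the offsets at which the code appears in every sample. The function name and the sample format are assumptions made for the example.

```python
def common_offsets(samples):
    """Return the sorted offsets at which each code occurs in its decoded string.

    `samples` is a list of (decoded_string, code_shown_in_image) pairs.
    Only offsets shared by every sample are returned.
    """
    shared = None
    for decoded, code in samples:
        # All positions where this sample's code occurs in its decoded string.
        found = {
            i
            for i in range(len(decoded) - len(code) + 1)
            if decoded.startswith(code, i)
        }
        # Keep only the offsets that every sample so far agrees on.
        shared = found if shared is None else shared & found
    return sorted(shared or [])
```

A single surviving offset across many samples is good evidence of a fixed layout; several surviving offsets mean more samples are needed to tell them apart.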

The Solution: Pattern-Based Extraction

Rather than attempting OCR on the colored CAPTCHA images, we developed a more reliable approach:

Initialize a session using the requests library to maintain cookies
Load the initial page to establish the session
Call the method that generates the CAPTCHA
Extract the Base64-encoded string from the response
Decode the string and take the four characters at positions 36-40 (the slice [36:40] in Python) to get the verification code
Submit this verification code together with the business ID (“rut”)

This approach allows us to bypass the CAPTCHA protection entirely through pattern analysis rather than image processing.
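
The steps above can be sketched as a single function. Everything site-specific here is a placeholder: the base URL, the three paths, the captcha_token key, and the form field names are hypothetical, since the article does not publish them. The sketch also assumes that the token is sent back with the form and that codes are always four digits, as in the examples.

```python
import base64

# Hypothetical placeholders: the article does not publish the real URLs,
# paths, or field names, so substitute the ones observed on the target site.
BASE_URL = "https://example.invalid"
START_PATH = "/lookup"
CAPTCHA_PATH = "/captcha"
SUBMIT_PATH = "/lookup/result"

# Positions 36-40 read as a zero-based, end-exclusive slice (assumption).
CODE_SLICE = slice(36, 40)


def solve_captcha(token: str) -> str:
    """Decode the Base64 token and return the four-character code."""
    padded = token + "=" * (-len(token) % 4)
    decoded = base64.b64decode(padded).decode("latin-1")
    code = decoded[CODE_SLICE]
    # Sanity check, assuming codes are always four ASCII digits as in the
    # article's examples; fail loudly if the layout ever changes.
    if len(code) != 4 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"unexpected code at offset 36: {code!r}")
    return code


def fetch_company_page(session, rut: str, timeout: float = 10.0) -> str:
    """Run the full flow and return the HTML of the result page.

    `session` is expected to behave like `requests.Session`.
    """
    # Steps 1-2: load the initial page so the server sets its session cookies.
    session.get(BASE_URL + START_PATH, timeout=timeout).raise_for_status()

    # Steps 3-4: request a new CAPTCHA and read the token from the JSON reply.
    captcha = session.post(BASE_URL + CAPTCHA_PATH, timeout=timeout)
    captcha.raise_for_status()
    token = captcha.json()["captcha_token"]  # hypothetical key

    # Steps 5-6: derive the code and submit it with the business ID.
    form = {"rut": rut, "code": solve_captcha(token), "captcha_token": token}
    result = session.post(BASE_URL + SUBMIT_PATH, data=form, timeout=timeout)
    result.raise_for_status()
    return result.text
```

With the requests library installed, passing a requests.Session() as the session argument keeps cookies across the three calls. The explicit check in solve_captcha is there so that a change in the token layout raises an error instead of silently submitting a wrong code.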

Extracting Business Information

Once the CAPTCHA is bypassed, we can extract valuable business information including:

Company name
Business ID number
Active status
Business start date
Economic activities and their respective codes

The extracted data is returned in a structured format that can be integrated with other systems.
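
One possible shape for that structured format is sketched below, using the fields listed above. The class and attribute names are illustrative assumptions, not the names used in the original implementation.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Activity:
    """One economic activity and its code."""
    code: str
    description: str


@dataclass
class CompanyRecord:
    """Illustrative container for the fields extracted from the result page."""
    name: str
    rut: str
    active: bool
    start_date: Optional[date]
    activities: List[Activity] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the record, writing the start date in ISO 8601 form."""
        data = asdict(self)  # nested Activity objects become plain dicts
        data["start_date"] = (
            self.start_date.isoformat() if self.start_date else None
        )
        # ensure_ascii=False keeps accented characters readable in the output.
        return json.dumps(data, ensure_ascii=False)
```

Keeping the record as a dataclass makes the fields explicit and gives a single place to control serialization when the data is passed to other systems.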

Implementation

The implementation is written in Python, with Django as the backend framework. The scraping process runs through these steps:

Creating a session with the requests library
Loading the initial page
Generating and solving the CAPTCHA using the pattern-based approach
Submitting the form with the correct verification code
Parsing the returned HTML to extract the business information
Returning the structured data
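
The last two steps might look like the sketch below. It assumes, purely for illustration, that the result page presents its data in HTML table rows; the real page layout is not described in this article. The parser uses only the standard library, and the Django wrapper takes a hypothetical fetch_html callable that stands in for the session, CAPTCHA, and form submission steps.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            if self._cell is not None and self._row is not None:
                # Collapse runs of whitespace left over from page formatting.
                self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr":
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_rows(html: str):
    """Return every non-empty table row in `html` as a list of strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


def make_company_view(fetch_html):
    """Build a Django view around `fetch_html(rut) -> str` (hypothetical)."""
    def company_view(request, rut):
        # Imported lazily so the parser above can be used without Django.
        from django.http import JsonResponse
        return JsonResponse({"rut": rut, "rows": parse_rows(fetch_html(rut))})
    return company_view
```

Passing the fetching step in as a callable keeps the view easy to test, because the parser can be exercised with saved HTML without any network access.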

Conclusion

This case study demonstrates how pattern analysis can be more effective than traditional OCR approaches when dealing with custom CAPTCHA systems. By understanding the underlying structure of the security mechanism, we can develop more reliable and efficient web scraping solutions that provide valuable business intelligence from public data sources.

Advanced web scraping often requires a combination of technical skills and investigative analysis to overcome increasingly sophisticated protection mechanisms. The approach outlined here can be adapted to similar scenarios where traditional CAPTCHA-solving services might fail.