How to Extract and Process CNPJ Data from Brazil’s Open Government Database

Brazil’s open CNPJ database covers every company ever registered in the country. It includes active and inactive companies, businesses under the simplified tax regime (Simples Nacional), and data on company partners (sócios), including their names and partially masked CPF numbers. Contact information is generally available, but it often belongs to the company’s service provider rather than to the company itself.

Companies in the database are classified using CNAE codes (economic activity codes), which identify the business segment of each company. This makes it possible to categorize and filter businesses by their industry or activity type.
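As an illustration of that kind of filtering, the sketch below selects records whose economic activity code starts with a given prefix. Prefix matching is used on the assumption that the codes are hierarchical, so a shorter prefix selects a broader segment. The record layout, the `activity_code` field name, and the sample values are hypothetical placeholders, not the official schema.

```python
def filter_by_activity(records, code_prefix):
    """Return the records whose activity code starts with code_prefix."""
    return [
        record
        for record in records
        if str(record.get("activity_code", "")).startswith(code_prefix)
    ]


# Placeholder records; field names and values are illustrative only.
companies = [
    {"name": "Alpha Software", "activity_code": "6201501"},
    {"name": "Beta Bakery", "activity_code": "1091102"},
    {"name": "Gamma Systems", "activity_code": "6202300"},
]

it_companies = filter_by_activity(companies, "62")
print([c["name"] for c in it_companies])  # ['Alpha Software', 'Gamma Systems']
```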

Finding the Open CNPJ Database

To find the data, search Google for “dados abertos CNPJ” (Portuguese for “CNPJ open data”). The first result should point to the national database. The site also provides technical documentation that explains the available data fields and their meanings.

When you navigate to the open data section, you’ll find a directory organized by year and month, with monthly updates dating back to May 2023. There’s no fixed release date each month, so it’s advisable to check periodically for new updates.
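Because the listing is organized by year and month, picking the newest folder can be automated. The sketch below assumes folder names follow a `YYYY-MM/` pattern (an assumption; check it against the actual listing) and that you already have the names as strings. `latest_directory` is a hypothetical helper, not part of any library.

```python
import re

# Assumed naming pattern: four-digit year, dash, two-digit month, optional slash.
_DIR_PATTERN = re.compile(r"^(\d{4})-(\d{2})/?$")


def latest_directory(names):
    """Return the newest 'YYYY-MM' folder name in names, or None if none match."""
    dated = []
    for name in names:
        cleaned = name.strip()
        match = _DIR_PATTERN.match(cleaned)
        if match is None:
            continue  # ignore entries that are not year-month folders
        year, month = int(match.group(1)), int(match.group(2))
        if 1 <= month <= 12:
            dated.append(((year, month), cleaned.rstrip("/")))
    if not dated:
        return None
    # Tuples compare by (year, month) first, so max() yields the newest folder.
    return max(dated)[1]


print(latest_directory(["2023-05/", "2024-01/", "2023-12/", "temp/"]))  # 2024-01
```

Since there is no fixed release date, a scheduled job could call a helper like this periodically and compare the result with the last period it processed.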

Understanding the Database Structure

The main files in the database include:

  • Companies (basic information)
  • Establishments (detailed company data)
  • Partners (sócios) information

These files come compressed in ZIP format and need to be extracted before processing. The database contains several components:

  • Basic company database
  • Partner (sócios) data
  • Legal nature (natureza jurídica) codes
  • Establishment data (complete company information)
  • Simples Nacional data (simplified tax regime)
  • Partner database information
  • Supporting tables for municipalities
  • Qualification codes for partners, individuals, and legal entities
  • CNAE codes (economic activity classifications)
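Since every file ships as a ZIP archive, extraction is the first processing step. A minimal local sketch using Python’s standard `zipfile` module follows; `extract_zip` and the paths in the usage comment are hypothetical names chosen for illustration.

```python
import zipfile
from pathlib import Path


def extract_zip(zip_path, dest_dir):
    """Extract all members of zip_path into dest_dir and return the file paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        # Directory entries end with "/"; report only the extracted files.
        return [dest / name for name in archive.namelist() if not name.endswith("/")]


# Example usage (paths are placeholders):
# files = extract_zip("downloads/sample.zip", "extracted/sample")
```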

Automating Data Extraction with S3

A practical approach to managing this data involves using an S3-compatible storage system. The process begins by setting up a connection to the S3 service using the boto3 library. For development purposes, this can be configured without stringent security measures, though production implementations should include proper IAM user restrictions.
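One way to set up that connection is sketched below. The environment variable names (`S3_ACCESS_KEY`, `S3_SECRET_KEY`, `S3_ENDPOINT_URL`) and both helper functions are assumptions made for this example; only `boto3.client` and its `endpoint_url`, `aws_access_key_id`, and `aws_secret_access_key` parameters come from boto3 itself.

```python
import os


def s3_settings(env=None):
    """Collect S3 connection settings from environment-style variables."""
    env = os.environ if env is None else env
    settings = {
        "aws_access_key_id": env["S3_ACCESS_KEY"],
        "aws_secret_access_key": env["S3_SECRET_KEY"],
    }
    # endpoint_url is only needed for S3-compatible services, not for AWS itself.
    if env.get("S3_ENDPOINT_URL"):
        settings["endpoint_url"] = env["S3_ENDPOINT_URL"]
    return settings


def build_s3_client(settings):
    """Create a boto3 S3 client from a settings dict."""
    import boto3  # imported lazily so s3_settings works without boto3 installed

    return boto3.client("s3", **settings)


# Example usage (requires boto3 and the variables above to be set):
# client = build_s3_client(s3_settings())
```

Keeping credentials out of the source code makes it easier to move from relaxed development settings to a restricted IAM user later.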

The automation routine performs several key functions:

  1. Verifies access to the S3 bucket
  2. Identifies the latest directory of CNPJ data
  3. Checks if the data has already been processed
  4. Lists and filters ZIP files for download
  5. Downloads files and uploads them to S3
  6. Extracts (unzips) the files to another S3 directory
  7. Creates a completion marker when processing finishes
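Steps 3, 4, 5, and 7 above can be sketched as follows. This is an illustration under stated assumptions, not the routine’s actual code: the marker name `_COMPLETED`, the `zips/` key layout, and the `download` callback (which returns a file’s bytes) are all hypothetical. The `client` is expected to be a boto3 S3 client; `list_objects_v2` and `put_object` are standard boto3 calls. Steps 1, 2, and 6 are left out to keep the example short.

```python
MARKER_NAME = "_COMPLETED"  # hypothetical name for the completion marker object


def list_keys(client, bucket, prefix):
    """Return all object keys under prefix, following pagination."""
    keys = []
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        page = client.list_objects_v2(**kwargs)
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
        if not page.get("IsTruncated"):
            return keys
        kwargs["ContinuationToken"] = page["NextContinuationToken"]


def process_period(client, bucket, period, file_names, download):
    """Upload the ZIP files for one period, then write a completion marker.

    Returns the uploaded keys, or an empty list if the period was already done.
    """
    prefix = f"{period}/"
    marker_key = prefix + MARKER_NAME
    if marker_key in list_keys(client, bucket, prefix):  # step 3: already processed
        return []
    zip_names = sorted(n for n in file_names if n.lower().endswith(".zip"))  # step 4
    uploaded = []
    for name in zip_names:
        key = f"{prefix}zips/{name}"
        # Step 5: download() returns bytes here; large files would need streaming.
        client.put_object(Bucket=bucket, Key=key, Body=download(name))
        uploaded.append(key)
    client.put_object(Bucket=bucket, Key=marker_key, Body=b"")  # step 7
    return uploaded
```

Writing the marker last means a crashed run never looks complete, so the next execution will process the period again.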

This routine is designed to be resilient, allowing for interrupted executions to resume from where they left off. The process typically takes about an hour to complete for a full dataset.
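One simple way to get this resume behavior is to compare the files that should exist with the keys already in the bucket and process only the difference. The helper below is a hypothetical sketch of that idea; the key layout (`prefix` plus file name) is an assumption.

```python
def pending_uploads(zip_names, existing_keys, prefix):
    """Return the ZIP names whose target key is not yet in the bucket, sorted."""
    done = set(existing_keys)
    return [name for name in sorted(zip_names) if f"{prefix}{name}" not in done]


remaining = pending_uploads(
    ["b.zip", "a.zip", "c.zip"],
    ["2024-01/zips/a.zip"],
    "2024-01/zips/",
)
print(remaining)  # ['b.zip', 'c.zip']
```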

Using MinIO for Local Development

For local development and testing, MinIO provides an excellent S3-compatible storage solution. After setting up MinIO with the appropriate ports (9000 for API access and 9001 for web interface), you can run the extraction routine to download the CNPJ data and store it locally.
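The sketch below shows how a local MinIO instance might be wired in. The endpoint uses the API port mentioned above; the bucket name `cnpj-data` and the `minioadmin` credentials in the usage comment are placeholders that assume an unmodified development setup. `ensure_bucket` is a hypothetical helper built on the standard boto3 calls `list_buckets` and `create_bucket`.

```python
MINIO_ENDPOINT = "http://localhost:9000"  # API port; 9001 serves the web interface


def ensure_bucket(client, bucket):
    """Create the bucket if it is missing; return True when it was created."""
    existing = {b["Name"] for b in client.list_buckets().get("Buckets", [])}
    if bucket in existing:
        return False
    client.create_bucket(Bucket=bucket)
    return True


# Example wiring (requires boto3 and a running MinIO; credentials are placeholders):
# import boto3
# client = boto3.client(
#     "s3",
#     endpoint_url=MINIO_ENDPOINT,
#     aws_access_key_id="minioadmin",
#     aws_secret_access_key="minioadmin",
# )
# ensure_bucket(client, "cnpj-data")
```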

Once downloaded, all ZIP files and their extracted contents will be available in your bucket, organized by date and ready for further processing.

Next Steps: Database Integration

With the data successfully extracted and stored, the next logical step is to import this information into a database system like PostgreSQL for more efficient querying and analysis. This setup creates a powerful foundation for building applications that leverage Brazil’s comprehensive business registry data.
