Web Scraping Data for Consumer Price Index: Experiences from African National Statistical Offices

In a webinar hosted by the UN Regional Hub for Africa on May 27, 2025, three National Statistical Offices (NSOs) shared their experiences implementing web scraping techniques to collect price data for Consumer Price Index (CPI) calculations. The webinar featured presentations from Kenya, Uganda, and Ivory Coast, with technical support from Luigi Palumbo of the Bank of Italy and Federico Polidoro of the World Bank.

Background and Context

The webinar addressed three main themes in web scraping for CPI data collection:

  • Targeted web scraping – focusing on specific predefined products
  • Bulk web scraping – collecting prices without predefined specifications
  • Classification issues – matching web-scraped products to official CPI classification systems

The participating NSOs had been receiving training since late 2023, following workshops in Rwanda and Italy. The training program emphasized practical implementation and continuous improvement rather than one-off learning events.

Kenya’s Experience: Targeted Web Scraping for Grocery Prices

Peter Kamel from the Kenya National Bureau of Statistics presented their approach to targeted web scraping for grocery items:

Methodology

  • Used the R programming language with the Selenium and RCurl packages
  • Focused on Carrefour’s online store after obtaining formal permission
  • Used the robotstxt package to respect each site’s robots.txt rules and keep scraping ethical
  • Collected data daily, capturing product names, prices, brands, packaging, and discounts
  • Currently hosts scripts on local computers, with plans to move to cloud servers
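
The ethical-scraping check above can be sketched with Python's standard-library `urllib.robotparser`; the rules, site, and user-agent names below are illustrative, not Carrefour's actual robots.txt:

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

# Illustrative rules -- a real scraper would download the site's /robots.txt.
rules = """User-agent: *
Disallow: /checkout/
Allow: /
"""
allowed(rules, "nso-cpi-bot", "https://example-store.test/food/sugar-1kg")   # True
allowed(rules, "nso-cpi-bot", "https://example-store.test/checkout/basket")  # False
```

Running this check before every fetch, and honoring a crawl delay, is the Python equivalent of what the robots.txt tooling in R provides.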

Results

  • Collected over 120,000 data points from food stores, with 100,000 from Carrefour
  • By comparison, traditional field collection gathers about 64,000 items per quarter
  • Developed price indices for specific products, such as sugar and wheat, that correlated well with the officially published CPI

Challenges and Future Plans

  • Need for more training on server hosting and automation
  • Plans to integrate web-scraped data into the current CPI production pipeline
  • Aims to begin regular compilation of parallel indices for comparison
  • Looking to extend methods to other sectors like real estate

Uganda’s Experience: Bulk Web Scraping for Smartphone Prices

Robert Tomashemi and Edgar Neympa from the Uganda Bureau of Statistics shared their work on web scraping smartphone prices:

Methodology

  • Used Python with Beautiful Soup library for weekly data collection since August 2023
  • Implemented a four-stage process: data collection, data processing, data filtering/augmentation, and index calculation
  • Applied Large Language Models (LLMs) to classify products and extract features like RAM, storage, and camera specifications
  • Automated the process to run weekly without human intervention
  • Tested multiple index calculation methods including Time Product Dummy method, weighted stratum approach, and hedonic pricing models
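
Of the index methods listed, the Time Product Dummy (TPD) approach is the most compact to illustrate: regress log prices on period and product dummies, then exponentiate the period coefficients. A minimal numpy sketch of the textbook formula (not UBOS's production code):

```python
import numpy as np

def tpd_index(prices, periods, products):
    """Time Product Dummy index: regress log prices on period and product
    dummies; the index for period t is exp(delta_t), first period as base."""
    periods = np.asarray(periods)
    products = np.asarray(products)
    y = np.log(np.asarray(prices, dtype=float))
    t_levels = np.unique(periods)            # sorted periods; first is the base
    p_levels = np.unique(products)
    n_t, n_p = len(t_levels), len(p_levels)
    X = np.ones((len(y), 1 + (n_t - 1) + (n_p - 1)))  # column 0 = intercept
    for j, t in enumerate(t_levels[1:]):     # period dummies (base dropped)
        X[:, 1 + j] = (periods == t)
    for k, p in enumerate(p_levels[1:]):     # product dummies (one dropped)
        X[:, n_t + k] = (products == p)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    deltas = np.concatenate([[0.0], beta[1:n_t]])
    return dict(zip(t_levels, np.exp(deltas)))

# Two products, both 10% dearer in period 1 -> index 1.0, then 1.1
idx = tpd_index([100, 110, 200, 220], [0, 1, 0, 1], ["A", "A", "B", "B"])
```

A virtue of TPD for web-scraped data is that it handles products entering and leaving the market, which matters given the churn in smartphone listings noted below.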

Results

  • Processed 67,000 product listings over an eight-month period
  • Identified 3,527 unique smartphones using AI classification
  • Computed monthly price changes showing fluctuations between -2.3% and +12.3%

Challenges

  • Smartphone brands constantly changing or disappearing from the market
  • Need for effective data cleaning to exclude non-smartphone products
  • Limitations of brand-based classification when products leave the market

Ivory Coast’s Experience: Classification Issues in Web Scraping

Frank from the National Statistics Agency of Ivory Coast (ANSTAT) focused on addressing classification challenges:

Methodology

  • Developed Python scripts for multiple websites including Jumia
  • Collected data covering 8 of the 12 COICOP divisions (Classification of Individual Consumption According to Purpose)
  • Implemented text embeddings to convert product descriptions into vectors
  • Created a vector database using Qdrant to store manually coded products
  • Used cosine distance calculations to match new products with existing classifications
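
The matching step can be sketched without a vector database: normalize the embeddings and assign each new product the COICOP code of its nearest manually coded neighbor by cosine similarity. A toy numpy stand-in for the Qdrant lookup (the 3-dimensional vectors and labels are invented for illustration):

```python
import numpy as np

def classify(query_vec, ref_vecs, ref_labels):
    """Assign the COICOP code of the nearest reference product by cosine
    similarity -- a plain-numpy stand-in for a vector-database search."""
    q = query_vec / np.linalg.norm(query_vec)
    R = ref_vecs / np.linalg.norm(ref_vecs, axis=1, keepdims=True)
    sims = R @ q                       # cosine similarity to each reference
    best = int(np.argmax(sims))
    return ref_labels[best], float(sims[best])

# Toy 3-d "embeddings"; real ones would come from a text-embedding model.
refs = np.array([[0.9, 0.1, 0.0],    # manually coded: rice  -> division 01
                 [0.0, 0.2, 0.9]])   # manually coded: phone -> division 08
labels = ["01", "08"]
code, score = classify(np.array([0.8, 0.2, 0.1]), refs, labels)  # code == "01"
```

In production the similarity score can double as a confidence threshold: matches below a cutoff are routed to manual coding, which also grows the reference database over time.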

Results

  • Achieved 88.54% accuracy in automated classification
  • Achieved a recall of 75.94% against manually coded data

Future Plans

  • Expanding the manually coded database to improve classification accuracy
  • Implementing Delta Lake for data storage
  • Applying the methodology to other surveys beyond CPI

Key Takeaways and Considerations

Several important points emerged during the discussion:

  • Legal and ethical considerations: All three NSOs emphasized the importance of obtaining permission from website owners before scraping data
  • Limitations for informal markets: Web scraping works best for well-organized markets and structured retail environments
  • Weighting challenges: Web-scraped data typically lacks expenditure-weight information, so weights must be approximated from other sources
  • Product selection: Web scraping is particularly valuable for fast-evolving products like smartphones and electronics that are difficult to track with traditional methods
  • Quality considerations: Traditional data collection may still be preferable for items where expiration dates, freshness, or physical condition are important factors
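
On the weighting point: when no expenditure weights are available, the standard fallback at the elementary level is the Jevons index, an unweighted geometric mean of price relatives. This formula is general CPI practice rather than something specific to the webinar; a minimal sketch:

```python
import math

def jevons(p0, p1):
    """Jevons elementary index: the unweighted geometric mean of price
    relatives p1/p0 -- the usual elementary formula when no weights exist."""
    relatives = [b / a for a, b in zip(p0, p1)]
    return math.exp(sum(math.log(r) for r in relatives) / len(relatives))

# Two items up 10%, one unchanged: index = (1.1 * 1.1 * 1.0) ** (1/3)
index = jevons([100, 50, 80], [110, 55, 80])
```

The geometric mean makes the index insensitive to which period is the base, which is why it is generally preferred over an unweighted arithmetic mean for elementary aggregates.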

The webinar demonstrated how African NSOs are successfully implementing modern data collection techniques to improve CPI compilation, with plans to continue developing these methodologies and expanding their application to other sectors.
