Web Scraping Data for Consumer Price Index: Experiences from African National Statistical Offices

In a webinar hosted by the UN Regional Hub for Africa on May 27, 2025, three National Statistical Offices (NSOs) shared their experiences implementing web scraping techniques to collect price data for Consumer Price Index (CPI) calculations. The webinar featured presentations from Kenya, Uganda, and Ivory Coast, with technical support from Luigi Palumbo of the Bank of Italy and Federico Polidoro of the World Bank.

Background and Context

The webinar addressed three main themes in web scraping for CPI data collection:

  • Targeted web scraping – focusing on specific predefined products
  • Bulk web scraping – collecting prices without predefined specifications
  • Classification issues – matching web-scraped products to official CPI classification systems

The participating NSOs had been receiving training since late 2023, following workshops in Rwanda and Italy. The training program emphasized practical implementation and continuous improvement rather than one-off learning events.

Kenya’s Experience: Targeted Web Scraping for Grocery Prices

Peter Kamel from the Kenya National Bureau of Statistics presented their approach to targeted web scraping for grocery items:

Methodology

  • Used the R programming language with the Selenium and RCurl packages
  • Focused on Carrefour’s online store after obtaining formal permission
  • Used the robotstxt package to respect each site’s robots.txt rules and keep scraping ethical
  • Collected data daily, capturing product names, prices, brands, packaging, and discounts
  • Currently hosts scripts on local computers, with plans to move to cloud servers
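
The ethical-scraping check above can be sketched with Python's standard-library `urllib.robotparser`; the rules, site, and user-agent names below are illustrative, not Carrefour's actual robots.txt:

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

# Illustrative rules -- a real scraper would download the site's /robots.txt.
rules = """User-agent: *
Disallow: /checkout/
Allow: /
"""
allowed(rules, "nso-cpi-bot", "https://example-store.test/food/sugar-1kg")   # True
allowed(rules, "nso-cpi-bot", "https://example-store.test/checkout/basket")  # False
```

Running this check before every fetch, and honoring a crawl delay, is the Python equivalent of what the robots.txt tooling in R provides.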

Results

  • Collected over 120,000 data points from food stores, with 100,000 from Carrefour
  • By comparison, traditional field collection gathers about 64,000 items per quarter
  • Developed price indices for specific products, such as sugar and wheat, that correlated well with the officially published CPI

Challenges and Future Plans

  • Need for more training on server hosting and automation
  • Plans to integrate web-scraped data into the current CPI production pipeline
  • Aims to begin regular compilation of parallel indices for comparison
  • Looking to extend methods to other sectors like real estate

Uganda’s Experience: Bulk Web Scraping for Smartphone Prices

Robert Tomashemi and Edgar Neympa from the Uganda Bureau of Statistics shared their work on web scraping smartphone prices:

Methodology

  • Used Python with Beautiful Soup library for weekly data collection since August 2023
  • Implemented a four-stage process: data collection, data processing, data filtering/augmentation, and index calculation
  • Applied Large Language Models (LLMs) to classify products and extract features like RAM, storage, and camera specifications
  • Automated the process to run weekly without human intervention
  • Tested multiple index calculation methods including Time Product Dummy method, weighted stratum approach, and hedonic pricing models
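
Of the index methods listed, the Time Product Dummy (TPD) approach is the most compact to illustrate: regress log prices on period and product dummies, then exponentiate the period coefficients. A minimal numpy sketch of the textbook formula (not UBOS's production code):

```python
import numpy as np

def tpd_index(prices, periods, products):
    """Time Product Dummy index: regress log prices on period and product
    dummies; the index for period t is exp(delta_t), first period as base."""
    periods = np.asarray(periods)
    products = np.asarray(products)
    y = np.log(np.asarray(prices, dtype=float))
    t_levels = np.unique(periods)            # sorted periods; first is the base
    p_levels = np.unique(products)
    n_t, n_p = len(t_levels), len(p_levels)
    X = np.ones((len(y), 1 + (n_t - 1) + (n_p - 1)))  # column 0 = intercept
    for j, t in enumerate(t_levels[1:]):     # period dummies (base dropped)
        X[:, 1 + j] = (periods == t)
    for k, p in enumerate(p_levels[1:]):     # product dummies (one dropped)
        X[:, n_t + k] = (products == p)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    deltas = np.concatenate([[0.0], beta[1:n_t]])
    return dict(zip(t_levels, np.exp(deltas)))

# Two products, both 10% dearer in period 1 -> index 1.0, then 1.1
idx = tpd_index([100, 110, 200, 220], [0, 1, 0, 1], ["A", "A", "B", "B"])
```

A virtue of TPD for web-scraped data is that it handles products entering and leaving the market, which matters given the churn in smartphone listings noted below.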

Results

  • Processed 67,000 product listings over an eight-month period
  • Identified 3,527 unique smartphones using AI classification
  • Computed monthly price changes showing fluctuations between -2.3% and +12.3%

Challenges

  • Smartphone brands constantly changing or disappearing from the market
  • Need for effective data cleaning to exclude non-smartphone products
  • Limitations of brand-based classification when products leave the market

Ivory Coast’s Experience: Classification Issues in Web Scraping

Frank from the National Statistics Agency of Ivory Coast (ANSTAT) focused on addressing classification challenges:

Methodology

  • Developed Python scripts for multiple websites including Jumia
  • Collected data covering 8 of the 12 COICOP divisions (Classification of Individual Consumption According to Purpose)
  • Implemented text embeddings to convert product descriptions into vectors
  • Created a vector database using Qdrant to store manually coded products
  • Used cosine distance calculations to match new products with existing classifications
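
The matching step can be sketched without a vector database: normalize the embeddings and assign each new product the COICOP code of its nearest manually coded neighbor by cosine similarity. A toy numpy stand-in for the Qdrant lookup (the 3-dimensional vectors and labels are invented for illustration):

```python
import numpy as np

def classify(query_vec, ref_vecs, ref_labels):
    """Assign the COICOP code of the nearest reference product by cosine
    similarity -- a plain-numpy stand-in for a vector-database search."""
    q = query_vec / np.linalg.norm(query_vec)
    R = ref_vecs / np.linalg.norm(ref_vecs, axis=1, keepdims=True)
    sims = R @ q                       # cosine similarity to each reference
    best = int(np.argmax(sims))
    return ref_labels[best], float(sims[best])

# Toy 3-d "embeddings"; real ones would come from a text-embedding model.
refs = np.array([[0.9, 0.1, 0.0],    # manually coded: rice  -> division 01
                 [0.0, 0.2, 0.9]])   # manually coded: phone -> division 08
labels = ["01", "08"]
code, score = classify(np.array([0.8, 0.2, 0.1]), refs, labels)  # code == "01"
```

In production the similarity score can double as a confidence threshold: matches below a cutoff are routed to manual coding, which also grows the reference database over time.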

Results

  • Achieved 88.54% accuracy in automated classification
  • Achieved a recall of 75.94% against manually coded data

Future Plans

  • Expanding the manually coded database to improve classification accuracy
  • Implementing Delta Lake for data storage
  • Applying the methodology to other surveys beyond CPI

Key Takeaways and Considerations

Several important points emerged during the discussion:

  • Legal and ethical considerations: All three NSOs emphasized the importance of obtaining permission from website owners before scraping data
  • Limitations for informal markets: Web scraping works best for well-organized markets and structured retail environments
  • Weighting challenges: Web-scraped data typically lacks expenditure-weight information, so weights must be approximated from other sources
  • Product selection: Web scraping is particularly valuable for fast-evolving products like smartphones and electronics that are difficult to track with traditional methods
  • Quality considerations: Traditional data collection may still be preferable for items where expiration dates, freshness, or physical condition are important factors
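
On the weighting point: when no expenditure weights are available, the standard fallback at the elementary level is the Jevons index, an unweighted geometric mean of price relatives. This formula is general CPI practice rather than something specific to the webinar; a minimal sketch:

```python
import math

def jevons(p0, p1):
    """Jevons elementary index: the unweighted geometric mean of price
    relatives p1/p0 -- the usual elementary formula when no weights exist."""
    relatives = [b / a for a, b in zip(p0, p1)]
    return math.exp(sum(math.log(r) for r in relatives) / len(relatives))

# Two items up 10%, one unchanged: index = (1.1 * 1.1 * 1.0) ** (1/3)
index = jevons([100, 50, 80], [110, 55, 80])
```

The geometric mean makes the index insensitive to which period is the base, which is why it is generally preferred over an unweighted arithmetic mean for elementary aggregates.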

The webinar demonstrated how African NSOs are successfully implementing modern data collection techniques to improve CPI compilation, with plans to continue developing these methodologies and expanding their application to other sectors.
