How to Scrape IMDB Top 250 Movies Data Directly into Power BI

Web scraping lets data analysts pull information from websites straight into their analysis tools. In this guide, we’ll walk through scraping IMDB’s Top 250 Movies list directly into Power BI, then cleaning it up for analysis and visualization.

Getting Started with the Web Data Source

The first step is acquiring the data from IMDB’s website. Begin by copying the URL of the IMDB Top 250 Movies page (https://www.imdb.com/chart/top/). In Power BI, select ‘Get Data’, choose ‘Web’ as your data source, and paste the URL. Power BI will connect to the page and list the tables it detects in the Navigator window.

Selecting and Transforming the Right Table

After the web content loads, you’ll need to identify which table contains the movie data. In this case, ‘Table 1’ contains the required information. Select it and click ‘Transform Data’ to open the Power Query Editor where you can reshape the dataset.

Renaming and Organizing Columns

The raw data needs proper column names for clarity. Rename the columns to represent their contents accurately:

  • Column for movie titles: ‘Title’
  • Column for ratings: ‘Rating’
  • Column for vote counts: ‘Votes’
  • Column for release years: ‘Year’
  • Column for duration: ‘Duration’
  • Column for content rating: ‘Content Rating’

Remove any unnecessary columns to streamline your dataset.

Converting Duration to Minutes

Movie durations on IMDB appear in a format like ‘2h 45m’. To make this data more useful for analysis, create a new column that converts these durations into total minutes. This means parsing the hours and minutes components separately, multiplying the hours by 60, and adding the minutes, so that ‘2h 45m’ becomes 165.
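
The conversion logic can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the Power Query step itself (which would be written in M). The function name duration_to_minutes is hypothetical, and the sketch assumes every value looks like ‘2h 45m’, ‘2h’, or ‘45m’.

```python
import re


def duration_to_minutes(text: str) -> int:
    """Convert a duration such as '2h 45m' into total minutes.

    Assumption: the hours part and the minutes part are each optional,
    but at least one of them is present.
    """
    match = re.fullmatch(r"\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?\s*", text)
    if match is None or not (match.group(1) or match.group(2)):
        raise ValueError(f"unrecognized duration: {text!r}")
    hours = int(match.group(1) or 0)    # missing hours count as 0
    minutes = int(match.group(2) or 0)  # missing minutes count as 0
    return hours * 60 + minutes


print(duration_to_minutes("2h 45m"))
```

Handling the hours-only and minutes-only cases matters because not every film has both parts in its duration.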

Extracting Movie IDs

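IMDB assigns each movie a unique identifier that appears in its URL (for example, tt0111161 in https://www.imdb.com/title/tt0111161/). Extracting these IDs allows you to create direct links back to the movie pages. Use Power Query’s text functions to isolate the ID: Text.BeforeDelimiter trims everything after the ID, and pairing it with Text.AfterDelimiter (or using Text.BetweenDelimiters instead) also removes the leading ‘/title/’ segment.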
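
As a rough illustration of the same idea outside Power Query, the Python sketch below pulls the ID out of a link with a regular expression. The function name extract_title_id is hypothetical, and the sketch assumes the scraped link contains a ‘/title/tt…’ segment, whether it is a full or a relative URL.

```python
import re
from typing import Optional


def extract_title_id(url: str) -> Optional[str]:
    """Return the IMDB title ID (e.g. 'tt0111161') found in a URL, or None.

    Assumption: the ID is 'tt' followed by digits, directly after '/title/'.
    """
    match = re.search(r"/title/(tt\d+)", url)
    return match.group(1) if match else None


print(extract_title_id("https://www.imdb.com/title/tt0111161/?ref_=chttp_t_1"))
```

Returning None for links without an ID makes it easy to spot rows where the extraction failed instead of silently producing a wrong value.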

Adding Thumbnail Images

To enhance your visualization, you can add a poster thumbnail for each movie. This requires inspecting the page source to identify the image URL pattern, then creating a custom column that constructs the appropriate URL for each movie. For the images to render in visuals, set the column’s data category to ‘Image URL’ after loading.

Final Data Formatting

Before loading the data to your model, perform these final transformations:

  1. Extract the rank number from the beginning of each title
  2. Clean up the movie titles by removing the rank prefix
  3. Set appropriate data types for each column (whole numbers for rank, text for titles, etc.)
  4. Create a direct URL link to each movie’s IMDB page using the movie ID
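
Steps 1, 2, and 4 can be sketched in Python as follows. This is an illustration of the logic rather than the Power Query implementation. The function names are hypothetical, and the sketch assumes the scraped title looks like ‘1. The Shawshank Redemption’, with the rank followed by a period and a space.

```python
import re
from typing import Tuple


def split_rank_and_title(raw: str) -> Tuple[int, str]:
    """Split '1. The Shawshank Redemption' into (1, 'The Shawshank Redemption').

    Only the first 'number + period' prefix is treated as the rank, so
    titles that themselves start with digits are left intact.
    """
    match = re.fullmatch(r"\s*(\d+)\.\s+(.*\S)\s*", raw)
    if match is None:
        raise ValueError(f"no rank prefix found in: {raw!r}")
    return int(match.group(1)), match.group(2)


def build_imdb_url(title_id: str) -> str:
    """Build the movie page URL from a title ID such as 'tt0111161'."""
    return f"https://www.imdb.com/title/{title_id}/"


print(split_rank_and_title("1. The Shawshank Redemption"))
print(build_imdb_url("tt0111161"))
```

Converting the rank to an integer at this point corresponds to step 3: the rank then sorts numerically rather than alphabetically.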

Loading and Configuring in Power BI

After loading the data into Power BI, configure your fields appropriately:

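  • Set the default summarization of rank, duration, and other numeric fields to ‘Don’t summarize’ so they are not summed or averaged in visuals
  • Apply the ‘Web URL’ conditional formatting option to the Title column, based on the link column, so each title is clickable and opens its IMDB page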
  • Arrange columns in a logical order for presentation

With these steps completed, you now have a clean dataset of IMDB’s Top 250 Movies with rankings, ratings, release years, durations, and direct links to the original pages, all ready for analysis and visualization in Power BI.
