Advanced Web Scraping: Extracting Director Information from IMDb Using Power Query

Web scraping is a useful technique for gathering structured data from websites. This guide walks through an advanced Power Query implementation that extracts detailed director information for the movies on IMDb’s Top 250 list.

Building on a previously collected dataset of the Top 250 movies, this tutorial focuses on obtaining director images and biographies to enrich that dataset.

Identifying the Data Source

The first step in our process involves locating the exact source of the information we need. When examining a director’s page (like Frank Darabont’s profile), we need to identify where the biography and profile image are stored in the page source.

Inspecting the page source shows that the information we need is stored as a JSON object embedded within the HTML. Because this data is already structured, extracting it is more straightforward than parsing the raw HTML.

Connecting to the Data Source

To begin the extraction process:

Connect to the web page using Power Query’s web connector
Specify HTML as the connection type
Use the Web.Contents function to ensure the query remains refreshable in service environments
Convert the source to text format using Text.FromBinary
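
The connection steps above might look like the following minimal sketch in Power Query M. The person ID nm0000000 is a placeholder rather than a real profile, and UTF-8 is an assumption about the page encoding.

```powerquery
let
    // Placeholder URL: replace nm0000000 with a real person ID before running
    Source = Web.Contents("https://www.imdb.com/name/nm0000000/"),
    // Web.Contents returns binary, so decode it to text (UTF-8 assumed)
    Html = Text.FromBinary(Source, TextEncoding.Utf8)
in
    Html
```

The result is the whole page as a single text value, which the later steps search for the embedded JSON.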

Extracting the JSON Data

The embedded JSON contains all the information we need, but we must extract it precisely from the surrounding HTML. The process involves:

Converting the text into lines
Locating the specific line containing our JSON data
Extracting the text between specific delimiters
Appending any closing brackets that the extraction cut off, so that the text is valid JSON
Parsing the result as a JSON document with Json.Document

This approach allows us to transform unstructured web content into a structured format that Power Query can easily manipulate.
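
The steps above can be sketched as follows. To keep the example self-contained, it runs on a tiny stand-in for the page text instead of a live request. The delimiters ("props": and ,"other") and the JSON layout are invented for illustration; the real page will need delimiters chosen by inspecting its source.

```powerquery
let
    // Small stand-in for the downloaded page text; real pages are much longer
    Html = "<html>#(lf)<script type=""application/json"">{""props"":{""person"":{""id"":""nm0000000"",""name"":""Example Director""},""other"":{""x"":1}}}</script>#(lf)</html>",
    // Split the page into individual lines
    PageLines = Lines.FromText(Html),
    // Keep the first line that contains the (assumed) JSON marker
    JsonLine = List.First(List.Select(PageLines, each Text.Contains(_, """props"":"))),
    // Cut out the text between the (assumed) start and end delimiters
    Fragment = Text.BetweenDelimiters(JsonLine, """props"":", ",""other"""),
    // The cut dropped one closing brace, so add it back to make valid JSON
    JsonText = Fragment & "}",
    // Parse the text into a record that Power Query can navigate
    Parsed = Json.Document(JsonText)
in
    Parsed
```

In this sample the fragment ends right after the person object, so a single closing brace restores the enclosing object. How many brackets need to be appended depends on where the chosen end delimiter falls in the real JSON.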

Parsing and Structuring the Data

Once we have the JSON document, we can extract the specific fields we need:

Person ID: The unique identifier for each director
Director Name: The full name of the director
Image URL: The link to the director’s profile image
Biography: The director’s professional background and career highlights

By expanding these records and filtering out unnecessary information, we create a clean, structured dataset containing only the relevant director details.
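
As a rough illustration of this step, the sketch below selects four fields from a parsed record and renames them to match the list above. The source field names (id, name, image, bio) and the sample values are hypothetical; the names in the real JSON should be taken from the page itself.

```powerquery
let
    // Stand-in for the parsed JSON; field names and values are hypothetical
    Parsed = Json.Document("{""person"":{""id"":""nm0000000"",""name"":""Example Director"",""image"":""https://example.com/director.jpg"",""bio"":""Sample biography."",""extra"":""ignored""}}"),
    Person = Parsed[person],
    // Keep only the fields we care about and drop everything else
    Selected = Record.SelectFields(Person, {"id", "name", "image", "bio"}),
    // Give the fields descriptive names
    Renamed = Record.RenameFields(Selected, {
        {"id", "Person ID"},
        {"name", "Director Name"},
        {"image", "Image URL"},
        {"bio", "Biography"}
    }),
    // Turn the single record into a one-row table
    Result = Table.FromRecords({Renamed})
in
    Result
```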

Creating Dynamic URLs for Service Refresh

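The Power BI service can refresh a web query only when it can identify the data source from a static base URL, so the URL for each director should not be assembled as one concatenated string. Keeping the query refreshable involves: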

Creating a base URL variable
Defining a path pattern that incorporates the person ID
Using Web.Contents with the RelativePath option instead of hardcoded URLs

With the base URL fixed and only the relative path varying, the query requests the correct page for each director and can still be refreshed in the Power BI service.
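
A minimal sketch of such a function is shown below. The function name GetDirectorHtml is hypothetical, and the name/<person ID>/ path pattern and UTF-8 encoding are assumptions to verify against the actual pages.

```powerquery
let
    // Static base URL that the service can use to identify the data source
    BaseUrl = "https://www.imdb.com",
    // Hypothetical helper: returns the page text for one person ID
    GetDirectorHtml = (personId as text) as text =>
        let
            // Only the relative path changes from one director to the next
            Response = Web.Contents(BaseUrl, [RelativePath = "name/" & personId & "/"]),
            Html = Text.FromBinary(Response, TextEncoding.Utf8)
        in
            Html
in
    GetDirectorHtml
```

Because the first argument of Web.Contents is a constant, the dynamic part of the address lives entirely in RelativePath.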

Final Data Processing

The final step is to expand the retrieved director information and merge it into our existing dataset. The result is a collection of director names, images, and biographies that enriches our analysis of IMDb’s Top 250 movies.
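
The sketch below shows one way this integration could look. The Movies table, its column names, and the GetDirector function are stand-ins invented for illustration; here GetDirector is a stub that returns sample values so the example runs without any web requests.

```powerquery
let
    // Stand-in for the existing dataset; column names are assumptions
    Movies = Table.FromRecords({
        [Title = "Example Movie A", PersonId = "nm0000001"],
        [Title = "Example Movie B", PersonId = "nm0000002"]
    }),
    // Stub in place of the real scraping function
    GetDirector = (personId as text) as record =>
        [
            #"Director Name" = "Director " & personId,
            #"Image URL" = "https://example.com/" & personId & ".jpg",
            Biography = "Sample biography."
        ],
    // Call the function once per row; each cell holds a record
    AddDirector = Table.AddColumn(Movies, "Director", each GetDirector([PersonId])),
    // Expand the record into separate columns
    Expanded = Table.ExpandRecordColumn(
        AddDirector,
        "Director",
        {"Director Name", "Image URL", "Biography"}
    )
in
    Expanded
```

Since several movies can share a director, calling the function on a distinct list of person IDs and merging the results back may reduce the number of requests.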

The scraping process can be slow, especially for larger datasets, because each director’s page must be retrieved separately. The result, however, is a rich, structured dataset that would be difficult to assemble manually.

Conclusion

By extracting embedded JSON with Power Query, we’ve shown how to obtain detailed information about movie directors from IMDb. The same technique can be adapted to other web scraping tasks where the target data is embedded as JSON within HTML pages.

Scraping does take patience, since larger datasets can take significant time to process, but the payoff is comprehensive, structured data ready for analysis.