Building a News Classifier: How to Fine-Tune LLMs with Web-Scraped Data

Fine-tuning large language models (LLMs) for specific classification tasks has become an essential skill in the current AI landscape. In this comprehensive guide, we’ll walk through the process of creating a news article classifier that can automatically categorize content into topics like politics, business, health, science, and climate.

Why Automatic News Classification Matters

Automatically categorizing news articles has several practical applications:

  • Eliminating manual categorization of content
  • Optimizing advertisement placement based on content type
  • Powering recommendation engines that suggest similar content

Setting Up Web Scraping with Apify

The first step in our project is gathering domain-specific data. We’ll use Apify, a tool that simplifies web scraping through its intuitive APIs, built-in JavaScript rendering, and automatic proxy rotation.

After creating a free account on Apify, we can access the Scraping API. This service makes it easy to crawl websites without worrying about getting blocked, having to implement exponential back-offs, or dealing with other common scraping headaches.
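For orientation, here is a minimal sketch of what calling Apify from Python can look like with the official apify-client package. The actor name, input fields, and output keys shown here are assumptions; adapt them to the actor and schema you actually run.

```python
# A minimal sketch using the official apify-client package (pip install apify-client).
# The actor name ("apify/website-content-crawler"), input fields, and output keys
# are assumptions -- adjust them to the actor you actually use.
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Start a crawl of one category page and wait for the run to finish
run = client.actor("apify/website-content-crawler").call(
    run_input={"startUrls": [{"url": "https://example-news-site.com/politics"}]}
)

# Iterate over the items the actor stored in its default dataset
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item.get("url"), str(item.get("text", ""))[:100])
```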

Understanding HTML Basics for Web Scraping

Before diving into scraping, let’s review some HTML fundamentals that will help us target and extract the right content:

  • HTML tags: Elements like <h1> for headings and <p> for paragraphs
  • Anchor tags: The <a> elements that create hyperlinks
  • Div tags: Container elements that help structure content
  • Classes and IDs: Attributes that identify specific elements

These components will help us locate and extract article content while filtering out advertisements, navigation elements, and other irrelevant parts of web pages.
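To make these ideas concrete, here is a small, self-contained sketch of selecting elements by tag, class, and ID with BeautifulSoup. The class and ID names are invented for illustration; real sites will use their own.

```python
# A minimal sketch of targeting elements by tag, class, and id with BeautifulSoup.
# The class/id names ("article-body", "main-headline", "ad-banner") are hypothetical.
from bs4 import BeautifulSoup

html = """
<div id="main-headline"><h1>Example headline</h1></div>
<div class="article-body"><p>First paragraph.</p><p>Second paragraph.</p></div>
<div class="ad-banner"><p>Buy now!</p></div>
"""

soup = BeautifulSoup(html, "html.parser")

headline = soup.find(id="main-headline").get_text(strip=True)

# Only paragraphs inside the article body, skipping the ad container entirely
body = " ".join(p.get_text(strip=True) for p in soup.select("div.article-body p"))

print(headline)
print(body)
```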

Building Our News Scraper

To gather our training data, we’ll create a scraper that:

  1. Visits category pages (politics, business, health, etc.)
  2. Extracts article links from these pages
  3. Opens each link to collect the full article text
  4. Maps the text to its respective category

We’ll implement this using Python and libraries like requests and BeautifulSoup. The scraper will handle pagination by iteratively requesting more content as needed until we’ve collected our target number of articles per category.
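A simplified sketch of that loop might look like the following. The base URL, link pattern, and selectors are placeholders rather than any specific site's real structure.

```python
# A rough sketch of the scraping loop, assuming category pages expose article
# links in <a> tags and that article text lives in <p> tags. The URL pattern
# and href filter are placeholders for whatever site you actually scrape.
import requests
from bs4 import BeautifulSoup

CATEGORIES = ["politics", "business", "health", "science", "climate"]
BASE_URL = "https://example-news-site.com"  # placeholder

def get_article_links(category: str) -> list[str]:
    resp = requests.get(f"{BASE_URL}/{category}", timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Collect links that look like articles; the "/article/" filter is site-specific
    return [a["href"] for a in soup.find_all("a", href=True) if "/article/" in a["href"]]

def get_article_text(url: str) -> str:
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return " ".join(p.get_text(strip=True) for p in soup.find_all("p"))

rows = []
for category in CATEGORIES:
    for link in get_article_links(category):
        rows.append({"text": get_article_text(link), "label": category})
```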

Understanding Large Language Models

Before fine-tuning our model, it’s important to understand how LLMs work:

LLMs are mathematical functions that predict which word (more precisely, which token) comes next in a piece of text. During training, they learn to generate coherent language by repeatedly predicting the next token given everything that came before it. This process is called causal language modeling.

To adapt LLMs for specific tasks like classification, we use fine-tuning—a process where we take a pre-trained model and further train it on a smaller, task-specific dataset.
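As a quick illustration of next-token prediction, the snippet below asks a small causal language model for its most likely continuation of a prompt. GPT-2 is used here purely as a lightweight stand-in; the same mechanics apply to larger models.

```python
# A tiny illustration of next-token prediction with a small causal LM (GPT-2
# used only as a lightweight stand-in for larger models).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The central bank raised interest", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence_length, vocab_size)

# The model's most likely next token after the prompt
next_token_id = logits[0, -1].argmax()
print(tokenizer.decode(next_token_id))
```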

The Fine-Tuning Process

Our fine-tuning workflow consists of several key steps:

1. Defining Parameters and Reading Data

We’ll set up parameters like dataset path, test/train split ratio, and column names for our labels and text.
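A minimal way to hold these parameters is a small config object. The path and column names below are illustrative defaults, not fixed requirements.

```python
# A minimal sketch of the run parameters, assuming the scraped data was saved
# as a CSV with "text" and "label" columns; the path and names are placeholders.
from dataclasses import dataclass

@dataclass
class Config:
    dataset_path: str = "data/news_articles.csv"
    text_column: str = "text"
    label_column: str = "label"
    test_size: float = 0.2      # fraction of rows held out for testing
    random_seed: int = 42

config = Config()
```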

2. Cleaning the Dataset

Even though our scraped data is relatively clean, we’ll implement a cleaner class that can handle HTML removal and other text normalization tasks.
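Here is one possible shape for such a cleaner, combining HTML stripping with whitespace normalization; extend it with whatever extra normalization your scraped text actually needs.

```python
# A simple cleaner sketch: strips leftover HTML tags and normalizes whitespace.
import re
from bs4 import BeautifulSoup

class TextCleaner:
    def remove_html(self, text: str) -> str:
        return BeautifulSoup(text, "html.parser").get_text(separator=" ")

    def normalize_whitespace(self, text: str) -> str:
        return re.sub(r"\s+", " ", text).strip()

    def clean(self, text: str) -> str:
        return self.normalize_whitespace(self.remove_html(text))

cleaner = TextCleaner()
print(cleaner.clean("<p>Markets  rallied</p> <span>today.</span>"))
# -> "Markets rallied today."
```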

3. Wrangling the Data

This includes:

  • Converting text category labels to numeric IDs using a label encoder
  • Splitting the data into training and testing sets
  • Converting our pandas DataFrame to a Hugging Face dataset
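Put together, the wrangling step might look roughly like this, assuming the data was saved as a CSV with text and label columns:

```python
# A sketch of the wrangling step with scikit-learn and Hugging Face datasets.
# The CSV path and column names mirror the assumptions above.
import pandas as pd
from datasets import Dataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("data/news_articles.csv")  # placeholder path

# Map text categories (e.g. "politics") to the integer ids the model expects
encoder = LabelEncoder()
df["labels"] = encoder.fit_transform(df["label"])

train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["labels"], random_state=42
)

train_ds = Dataset.from_pandas(train_df, preserve_index=False)
test_ds = Dataset.from_pandas(test_df, preserve_index=False)
```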

4. Initializing the Tokenizer and Model

We’ll use Hugging Face’s transformers library to load a pre-trained model (in this case, Meta’s Llama 3.1B) and prepare it for fine-tuning by setting up the tokenizer.
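A sketch of that initialization is below. The checkpoint name is an assumption (any Llama variant, or another causal LM you have access to, follows the same pattern), and note that Llama checkpoints on the Hub are gated.

```python
# A sketch of loading the tokenizer and a sequence-classification head on top
# of a pre-trained LLM. The checkpoint name is an assumption -- substitute the
# exact model you have access to.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.2-1B"  # assumed checkpoint; gated, requires access
NUM_LABELS = 5  # politics, business, health, science, climate

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=NUM_LABELS)
model.config.pad_token_id = tokenizer.pad_token_id

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# Tokenize the splits from the wrangling step, dropping the raw string columns
# so only input_ids, attention_mask, and labels remain:
# train_ds = train_ds.map(tokenize, batched=True, remove_columns=["text", "label"])
# test_ds = test_ds.map(tokenize, batched=True, remove_columns=["text", "label"])
```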

5. Training the Model

We’ll configure training parameters like batch size, learning rate, and the number of epochs, then use Hugging Face’s Trainer class to fine-tune our model.
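Continuing from the sketches above, the training setup might look like this. The batch size, learning rate, and epoch count are illustrative starting points rather than the exact values used in this project.

```python
# A sketch of the training configuration; hyperparameters are starting points only.
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="news-classifier",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=2e-5,
    num_train_epochs=3,
    eval_strategy="epoch",   # named evaluation_strategy in older transformers releases
    save_strategy="epoch",
    logging_steps=50,
)

trainer = Trainer(
    model=model,               # from the initialization step
    args=training_args,
    train_dataset=train_ds,    # tokenized training split
    eval_dataset=test_ds,      # tokenized testing split
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)

trainer.train()
```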

6. Evaluating the Model

We’ll assess our model’s performance using classification metrics like accuracy on both the training and testing datasets.
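One straightforward way to do that, reusing the trainer from the previous step, is to compare accuracy on both splits:

```python
# A sketch of computing accuracy on both splits with scikit-learn,
# reusing the trainer and datasets from the previous steps.
import numpy as np
from sklearn.metrics import accuracy_score

def split_accuracy(dataset):
    output = trainer.predict(dataset)               # logits plus true labels
    preds = np.argmax(output.predictions, axis=-1)  # predicted class ids
    return accuracy_score(output.label_ids, preds)

print("train accuracy:", split_accuracy(train_ds))
print("test accuracy:", split_accuracy(test_ds))
```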

7. Model Inference

Finally, we’ll demonstrate how to use our fine-tuned model to classify new, unseen articles.
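For example, continuing with the in-memory model and tokenizer, a text-classification pipeline makes single-article inference a one-liner; the headline below is invented.

```python
# A sketch of classifying a new article with the fine-tuned model.
from transformers import pipeline

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

print(classifier("The senate passed a new budget bill after weeks of negotiation."))
# -> e.g. [{'label': 'LABEL_0', 'score': 0.97}]; map the label id back to a
#    category name with the label encoder from the wrangling step (encoder.classes_)
```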

Results and Improvements

With our initial dataset of 100 articles per category, we achieved 85% accuracy on the training set but only 50% on the testing set—a sign of overfitting. When expanding to 1,000 articles per category, the testing accuracy improved to 73%, showing how critical data volume is for effective fine-tuning.

Saving and Sharing Your Model

After training, we can save our model locally or push it to the Hugging Face Hub for easy access and sharing. This makes deployment and integration into other applications much simpler.
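A minimal sketch of both options follows; the repository name is a placeholder, and pushing requires being logged in to the Hub (for example via huggingface-cli login).

```python
# Save the fine-tuned model and tokenizer locally
model.save_pretrained("news-classifier")
tokenizer.save_pretrained("news-classifier")

# Optional: share on the Hugging Face Hub under your own account (placeholder repo name)
model.push_to_hub("your-username/news-classifier")
tokenizer.push_to_hub("your-username/news-classifier")
```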

Conclusion

By combining web scraping with LLM fine-tuning, we’ve created a powerful news classifier that can automatically categorize articles with high accuracy. The same approach can be adapted for other text classification tasks or extended to different domains by changing the data source and category labels.

This workflow demonstrates how accessible AI model customization has become—allowing developers to create specialized tools that leverage the power of large language models without starting from scratch.
