Automated News Article Extraction and Translation: Building a Telegram Bot
This article walks through an automation workflow that extracts and translates news articles. Given an article URL, it scrapes the page from a major news website, extracts the title, main image, and article content, and then translates the text into another language, all through a simple Telegram bot interface.
How the News Article Extractor Works
The system was designed primarily for news articles from major Indian publications including Hindustan Times, NDTV, Times of India, and Deccan Herald. The original use case was for a content creator who needed to translate English news content into Kannada, but the system is flexible enough to work with various languages.
When a user sends a news article URL to the Telegram bot, the workflow is triggered and performs several key functions:
- Identifies the source website by examining the URL
- Extracts the main image from the article
- Pulls the article title and converts it from HTML to text
- Extracts the full article content, removing HTML tags and special characters
- Translates both title and content to the target language (Kannada in the demonstration)
- Returns all components as a formatted response in Telegram
Technical Implementation
The workflow is built with several interconnected components:
1. Telegram Trigger
The process begins when a user sends a message to the bot. This activates the workflow and passes the message content (the news URL) to the next component.
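As a rough illustration of this step, the sketch below pulls the first URL out of an incoming Telegram update. It assumes the update arrives as a dictionary shaped like the Bot API's Update object (a "message" with "text" and "chat" fields). The function name extract_url and the simple URL pattern are hypothetical choices made for this example, not part of the original workflow.

```python
import re
from typing import Optional

# Deliberately simple pattern: the first run of non-space characters
# starting with http:// or https://.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_url(update: dict) -> Optional[str]:
    """Return the first URL in a Telegram update's message text, or None."""
    message = update.get("message") or {}
    text = message.get("text") or ""
    match = URL_PATTERN.search(text)
    return match.group(0) if match else None


# Example update, trimmed to the fields used here (values are made up).
update = {
    "update_id": 1,
    "message": {
        "chat": {"id": 12345},
        "text": "Please translate https://www.ndtv.com/india-news/example-story",
    },
}
print(extract_url(update))
```

Returning None for messages without a URL gives the workflow a clear point at which to reply with a usage hint instead of continuing.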
2. URL Router
A switch mechanism examines the URL to determine which news source it belongs to. Each source has its own extraction path since different news sites use different HTML structures.
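A minimal sketch of such a router follows. It matches on the URL's hostname rather than on the raw string, so a domain name that merely appears in a path or as part of another domain does not trigger a route. The SOURCES mapping, the source keys, and the route function are assumptions made for illustration; the original workflow may match URLs differently.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical mapping from site domain to an internal source key.
SOURCES = {
    "hindustantimes.com": "hindustan_times",
    "ndtv.com": "ndtv",
    "timesofindia.indiatimes.com": "times_of_india",
    "deccanherald.com": "deccan_herald",
}


def route(url: str) -> Optional[str]:
    """Return the source key for a URL, or None if the site is unsupported."""
    host = (urlparse(url).hostname or "").lower()
    for domain, key in SOURCES.items():
        # Accept the domain itself or any subdomain such as "www.".
        if host == domain or host.endswith("." + domain):
            return key
    return None


print(route("https://www.hindustantimes.com/india-news/example"))
print(route("https://example.com/story"))
```

Each key would then select the extraction path written for that site's HTML structure.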
3. HTML Extraction
For each news source, the system:
- Makes an HTTP request to download the full HTML content of the page
- Uses regular expressions to locate and extract the main image URL
- Identifies and extracts the article title (typically in H1 tags)
- Locates and extracts the main article content
- Cleans up the extracted content by removing HTML tags, links, and special characters
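The extraction steps above can be sketched as follows. The HTTP request is left out so the example runs offline against a small sample page. The three patterns (an og:image meta tag, the first H1, and a div with class "article-body") are assumptions chosen for illustration; each real site needs patterns written for its own markup, and the function names here are hypothetical.

```python
import html
import re
from typing import Optional

# Assumed patterns for a made-up layout; real sites differ.
IMAGE_RE = re.compile(
    r'<meta\s+property="og:image"\s+content="([^"]+)"', re.IGNORECASE
)
TITLE_RE = re.compile(r"<h1[^>]*>(.*?)</h1>", re.IGNORECASE | re.DOTALL)
# Non-greedy match: stops at the first </div>, so nested divs would cut it short.
BODY_RE = re.compile(
    r'<div\s+class="article-body"[^>]*>(.*?)</div>', re.IGNORECASE | re.DOTALL
)
BLOCK_END_RE = re.compile(r"</(?:p|div|li|h[1-6])>|<br\s*/?>", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def first_group(pattern: re.Pattern, page: str) -> Optional[str]:
    """Return the first captured group of the first match, or None."""
    match = pattern.search(page)
    return match.group(1) if match else None


def to_text(fragment: str) -> str:
    """Strip tags, decode entities such as &amp;, and collapse whitespace."""
    spaced = BLOCK_END_RE.sub(" ", fragment)  # keep paragraphs apart
    without_tags = TAG_RE.sub("", spaced)     # drop links and other tags
    return " ".join(html.unescape(without_tags).split())


def extract_article(page: str) -> dict:
    """Pull the image URL, title, and cleaned body text out of a page."""
    return {
        "image": first_group(IMAGE_RE, page),
        "title": to_text(first_group(TITLE_RE, page) or ""),
        "content": to_text(first_group(BODY_RE, page) or ""),
    }


sample = """
<html><head>
<meta property="og:image" content="https://example.com/lead.jpg">
</head><body>
<h1 class="headline">Rain &amp; floods hit the city</h1>
<div class="article-body"><p>Heavy rain fell on <a href="/x">Monday</a>.</p>
<p>Schools were closed.</p></div>
</body></html>
"""
print(extract_article(sample))
```

Regular expressions are brittle against layout changes, which is one reason each source keeps its own extraction path.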
4. Translation Engine
A language model-based translation system takes the extracted content and converts it to the target language. The system:
- Detects the source language
- Translates the title and article content
- Formats the response with proper language indicators
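One way to express these three tasks is a single prompt with the target language as a parameter. The sketch below only builds the prompt text; the call to the language model is omitted because the original does not say which model or API is used. The prompt wording, the reply format, and the name build_translation_prompt are all assumptions for illustration.

```python
def build_translation_prompt(
    title: str, content: str, target_language: str = "Kannada"
) -> str:
    """Assemble one prompt asking a language model to translate an article."""
    return (
        "Detect the language of the article below, then translate its title "
        f"and content into {target_language}.\n"
        "Reply in exactly this format:\n"
        "Source language: <detected language>\n"
        f"Title ({target_language}): <translated title>\n"
        f"Content ({target_language}): <translated content>\n\n"
        f"Title: {title}\n"
        f"Content: {content}"
    )


print(build_translation_prompt("Rain hits the city", "Heavy rain fell on Monday."))
```

Asking for a fixed reply format makes it easier to split the model's answer back into a title and a body before sending them on.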
5. Response Delivery
The system sends back three main components to the user through Telegram:
- The main article image
- The translated title
- The full translated article content
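A practical detail at this step is that the Telegram Bot API limits a text message to 4096 characters, so a full article may need to be split across several messages. The sketch below builds the sequence of Bot API calls (sendPhoto, then sendMessage for the title and for each chunk of content) without actually sending them. The helper names split_text and build_calls are hypothetical, and the original workflow may handle long articles differently.

```python
from typing import List

MAX_TEXT = 4096  # Bot API limit for the text of a single sendMessage call


def split_text(text: str, limit: int = MAX_TEXT) -> List[str]:
    """Split text into chunks of at most `limit` characters, at spaces when possible."""
    chunks = []
    remaining = text.strip()
    while len(remaining) > limit:
        # Look for the last space that keeps the chunk within the limit.
        cut = remaining.rfind(" ", 0, limit + 1)
        if cut <= 0:
            cut = limit  # no usable space: fall back to a hard cut
        chunks.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        chunks.append(remaining)
    return chunks


def build_calls(chat_id: int, image_url: str, title: str, content: str) -> List[dict]:
    """Describe the Bot API calls needed to deliver image, title, and content."""
    calls = [
        {"method": "sendPhoto", "params": {"chat_id": chat_id, "photo": image_url}},
        {"method": "sendMessage", "params": {"chat_id": chat_id, "text": title}},
    ]
    for chunk in split_text(content):
        calls.append(
            {"method": "sendMessage", "params": {"chat_id": chat_id, "text": chunk}}
        )
    return calls


calls = build_calls(12345, "https://example.com/lead.jpg", "Title", "word " * 2000)
print([call["method"] for call in calls])
```

Each dictionary maps onto one HTTP request to the Bot API; keeping the construction separate from the sending makes the splitting logic easy to check on its own.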
Practical Applications
This automation tool has numerous practical applications for content creators, news aggregators, or anyone needing to quickly access and translate news content. Some potential uses include:
- News monitoring across language barriers
- Content creation for multilingual audiences
- Research that requires translation of multiple news sources
- Automated content repurposing for blogs and social media
While demonstrated with Kannada translation, the system can be adapted to other target languages by adjusting the prompts in the translation module, provided the underlying language model handles the chosen language well.
Customization Possibilities
The workflow can be expanded beyond the four demonstrated news sources. Since different websites have unique HTML structures, adding support for additional news sources requires creating appropriate extraction paths for each site’s specific layout.
The translation component is likewise not tied to the English-to-Kannada pair used in the demonstration; other source and target languages can be configured, which makes the workflow useful for multilingual content management more broadly.
For those interested in web scraping, content automation, or multilingual content management, this project shows how a chain of simple steps (fetching a page, extracting its parts, translating them, and delivering the result) can replace work that would otherwise require significant manual effort.