Building a Web Scraping Tool with JavaScript and Puppeteer

Web scraping is a powerful technique for extracting data from websites. This article will guide you through creating a robust web scraping tool using JavaScript and Puppeteer to collect job listings and export them to CSV format.

Setting Up Your Project

To begin creating your web scraping tool, you’ll need to set up a new project directory and initialize it with npm. Create a folder named ‘Scraping’ and open it in your code editor. Then open a terminal and run the following command:

npm init

After completing the initialization, modify your package.json file to set the type as ‘module’ and define a start script:

"type": "module",
"scripts": {
"start": "node index.js"
}
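
For reference, the complete package.json at this point might look roughly like this (your name and version fields will reflect whatever you entered during npm init):

{
  "name": "scraping",
  "version": "1.0.0",
  "type": "module",
  "scripts": {
    "start": "node index.js"
  }
}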

Installing Puppeteer

Puppeteer is a Node.js library that provides a high-level API to control Chrome or Chromium, typically in headless mode. Install it using npm:

npm install puppeteer

Next, create an index.js file in your project directory and import Puppeteer:

import puppeteer from 'puppeteer';

Setting Up Browser Automation

Now that Puppeteer is installed, we can start writing our scraping logic. Because we set the type to ‘module’ in package.json, Node allows top-level await, so we can launch a browser instance and open a new page directly:

const browser = await puppeteer.launch();
const page = await browser.newPage();

Headless browsers identify themselves in their default user agent, and many sites block requests that look automated. Setting a realistic user-agent string reduces the chance of being flagged as a bot (the exact version matters less than it looking like a normal desktop browser):

await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36');
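
A user agent alone may not be enough; some sites also check for Puppeteer’s default 800x600 viewport or missing headers. As an optional hardening step, this sketch sets both using standard Puppeteer methods (the specific values are arbitrary examples):

// Use a common desktop resolution instead of Puppeteer's default 800x600
await page.setViewport({ width: 1280, height: 800 });

// Send an Accept-Language header like a regular browser would
await page.setExtraHTTPHeaders({ 'accept-language': 'en-US,en;q=0.9' });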

Navigating to the Target Website

For our example, we’ll scrape job listings from a job portal. Define the URL you want to scrape and navigate to it:

const url = 'https://www.naukri.com/software-development-jobs';
await page.goto(url);
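
By default, page.goto resolves on the page’s load event, which can fire before client-rendered content exists. If the listings are loaded by JavaScript, it may help to wait for network activity to settle using Puppeteer’s waitUntil option, as in this sketch:

// Wait until there are at most 2 in-flight network requests for 500 ms,
// and allow up to 60 seconds before timing out
await page.goto(url, { waitUntil: 'networkidle2', timeout: 60000 });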

Extracting Job Data

Once the page loads, we need to wait for the job listing elements to appear and then extract the data we want. We’ll use CSS selectors to target specific elements on the page; note that class names like .jobCard are specific to the site’s current markup and may break if the site is redesigned:

// Wait for job cards to load
await page.waitForSelector('.jobCard');
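
Note that waitForSelector throws if the element never appears (the default timeout is 30 seconds). In an unattended scraper you may prefer to catch that failure and shut down cleanly rather than crash; a minimal sketch:

try {
  await page.waitForSelector('.jobCard', { timeout: 30000 });
} catch (err) {
  // Selector never appeared: the markup may have changed, or we were blocked
  console.error('Job cards did not load:', err.message);
  await browser.close();
  process.exit(1);
}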

Now we can extract the job data using the page.$$eval function, which allows us to select multiple elements and process them:

const jobs = await page.$$eval('.jobCard', cards => {
  return cards.map(card => {
    // Extract job title and posting URL
    const titleEl = card.querySelector('.title');
    const title = titleEl ? titleEl.innerText : '';
    const url = titleEl ? titleEl.href : '';

    // Extract company name
    const companyEl = card.querySelector('.company');
    const company = companyEl ? companyEl.innerText : '';

    // Extract experience requirement
    const expEl = card.querySelector('.expWidth');
    const experience = expEl ? expEl.innerText : '';

    // Extract location
    const locEl = card.querySelector('.locWidth');
    const location = locEl ? locEl.innerText : '';

    return {
      title,
      url,
      company,
      experience,
      location
    };
  });
});

After extracting the data, we can log it to verify what we’ve collected:

console.log(jobs);

Exporting Data to CSV

To make our scraped data more useful, we’ll export it to a CSV file. First, install the json2csv package:

npm install json2csv

Then import the Parser class and Node’s fs module, convert the JSON data to CSV, and write it to a file:

import { Parser } from 'json2csv';
import fs from 'fs';

const parser = new Parser();
const csv = parser.parse(jobs);

fs.writeFileSync('jobs.csv', csv);
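
By default, the Parser infers columns from the keys of the first object. If you want a fixed column order or friendlier headers, json2csv’s Parser also accepts a fields option; a short sketch:

const parser = new Parser({
  fields: [
    { label: 'Job Title', value: 'title' },
    { label: 'URL', value: 'url' },
    { label: 'Company', value: 'company' },
    { label: 'Experience', value: 'experience' },
    { label: 'Location', value: 'location' }
  ]
});
const csv = parser.parse(jobs);
fs.writeFileSync('jobs.csv', csv);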

Finally, close the browser when we’re done:

await browser.close();
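
Putting the pieces together, a complete index.js might look like the sketch below (URL and selectors as above, which remain specific to the target site). Wrapping the work in try/finally guarantees the browser closes even if a step throws, avoiding orphaned Chromium processes:

import puppeteer from 'puppeteer';
import { Parser } from 'json2csv';
import fs from 'fs';

const url = 'https://www.naukri.com/software-development-jobs';

const browser = await puppeteer.launch();
try {
  const page = await browser.newPage();
  await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36');

  await page.goto(url, { waitUntil: 'networkidle2' });
  await page.waitForSelector('.jobCard');

  // Pull one record per job card; missing elements become empty strings
  const jobs = await page.$$eval('.jobCard', cards =>
    cards.map(card => ({
      title: card.querySelector('.title')?.innerText ?? '',
      url: card.querySelector('.title')?.href ?? '',
      company: card.querySelector('.company')?.innerText ?? '',
      experience: card.querySelector('.expWidth')?.innerText ?? '',
      location: card.querySelector('.locWidth')?.innerText ?? ''
    }))
  );

  const csv = new Parser().parse(jobs);
  fs.writeFileSync('jobs.csv', csv);
  console.log(`Wrote ${jobs.length} jobs to jobs.csv`);
} finally {
  await browser.close();
}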

Running the Scraper

To run your web scraping tool, use the start script you defined earlier:

npm start

After execution, you’ll find a jobs.csv file in your project directory containing all the scraped job listings data. You can open this file in Excel or any spreadsheet program to view and analyze the data.

Ethical Considerations

When creating web scraping tools, it’s important to be mindful of a few ethical considerations:

  • Always check a website’s robots.txt file and terms of service to ensure scraping is allowed
  • Implement rate limiting to avoid overloading the target server (a simple delay helper is sketched after this list)
  • Only collect publicly available data
  • Use the data responsibly and in accordance with applicable laws
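
To make the rate-limiting point concrete, here is a minimal sketch: a delay helper awaited between successive page fetches, assuming an existing Puppeteer page. The pageUrls list and the 2-second pause are hypothetical placeholders to tune for the site you are scraping:

// Resolve after the given number of milliseconds
const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

// Hypothetical list of paginated listing URLs
const pageUrls = [
  'https://www.naukri.com/software-development-jobs',
  'https://www.naukri.com/software-development-jobs-2'
];

for (const pageUrl of pageUrls) {
  await page.goto(pageUrl, { waitUntil: 'networkidle2' });
  // ...extract jobs from this page as shown earlier...
  await delay(2000); // pause 2 seconds between requests to be polite
}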

With these guidelines in mind, web scraping can be a powerful technique for data collection and analysis across various domains.
