Get the entire HTML of a page with Playwright
Scraping and saving the entire HTML of a webpage with Playwright is a straightforward task. The script below does the following:
- start by requiring the dependencies
- set the list of urls
- start the loop
- start the browser
- go to a url from the list
- get the HTML
- console log the HTML
- slugify the url to create a usable and readable filename
- add the date and file extension to the filename
- save the HTML file in the ./html/ directory
- and done…
const playwright = require('playwright-chromium');
const fs = require('fs');

// urls
const siteList = [
  { site: "https://www.example.com/" },
  { site: "https://www.yahoo.com/" },
  { site: "https://www.coinbase.com/" },
  { site: "https://kraken.com/" },
];

const siteLoop = async () => {
  for (const url of siteList) {
    console.log('Visit:::::', url.site);
    const goToUrl = url.site;

    // start a browser for this url (headless: false opens a visible window)
    const browser = await playwright.chromium.launch({ headless: false });
    const context = await browser.newContext();
    const page = await context.newPage();

    // go to the url and grab the full HTML
    await page.goto(goToUrl);
    const html = await page.content();
    console.log(html);

    // slugify the url so it can be used as a readable filename
    const slugify = str =>
      str
        .toLowerCase()
        .trim()
        .replace(/[^\w\s-]/g, '')
        .replace(/[\s_-]+/g, '-')
        .replace(/^-+|-+$/g, '');

    // add the date and the .html extension to the filename
    const today = new Date();
    const date = today.getFullYear() + '-' + (today.getMonth() + 1) + '-' + today.getDate();
    const fileName = slugify(goToUrl) + '-' + date + '.html';

    // save the HTML file in the ./html/ directory
    const dir = './html/';
    if (!fs.existsSync(dir)) {
      fs.mkdirSync(dir);
    }
    const path = dir + fileName;
    fs.writeFileSync(path, html);

    await browser.close();
  }
};

siteLoop();
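If you are wondering what filenames to expect: slugify strips every character that is not a word character, whitespace or a hyphen, so the protocol, dots and slashes simply disappear. A quick sketch for one of the urls from the list above (the date part depends on the day you run the script):

slugify('https://www.example.com/')
// -> 'httpswwwexamplecom'
// which ends up on disk as something like ./html/httpswwwexamplecom-2024-5-1.html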
Scraping an XML Sitemap with URLs
An alternative to inline URLs is to use an XML Sitemap as the list of URLs.
- Start by creating a sitemap.json file in your project root directory
- Copy the nodes from the XML Sitemap and convert them to JSON format
- Only copy the <url> … </url> nodes you need
- Use an online XML-to-JSON converter to convert the XML nodes to JSON format
- Save the copied JSON in sitemap.json (see the example after this list)
- Copy the script below and run it
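For reference, the script below reads the loc property of every entry, so the converted sitemap.json should end up as an array of objects that each contain at least a loc key. The URLs here are just placeholders:

[
  { "loc": "https://www.example.com/" },
  { "loc": "https://www.example.com/about/" },
  { "loc": "https://www.example.com/contact/" }
]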
const playwright = require('playwright-chromium');
const fs = require('fs');

// urls from the converted XML Sitemap
// (the inline siteList array from the first script could be used here instead)
const siteList = require('./sitemap.json');

const siteLoop = async () => {
  for (const url of siteList) {
    console.log('Visit:::::', url.loc);
    const goToUrl = url.loc;

    // start a browser for this url (headless: false opens a visible window)
    const browser = await playwright.chromium.launch({ headless: false });
    const context = await browser.newContext();
    const page = await context.newPage();

    // go to the url and grab the full HTML
    await page.goto(goToUrl);
    const html = await page.content();
    console.log(html);

    // slugify the url so it can be used as a readable filename
    const slugify = str =>
      str
        .toLowerCase()
        .trim()
        .replace(/[^\w\s-]/g, '')
        .replace(/[\s_-]+/g, '-')
        .replace(/^-+|-+$/g, '');

    // add the date and the .html extension to the filename
    const today = new Date();
    const date = today.getFullYear() + '-' + (today.getMonth() + 1) + '-' + today.getDate();
    const fileName = slugify(goToUrl) + '-' + date + '.html';

    // save the HTML file in the ./html/ directory
    const dir = './html/';
    if (!fs.existsSync(dir)) {
      fs.mkdirSync(dir);
    }
    const path = dir + fileName;
    fs.writeFileSync(path, html);

    await browser.close();
  }
};

siteLoop();
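Both scripts launch and close a full browser for every URL, which keeps each visit isolated but gets slow for longer lists. A possible variation (not part of the original scripts) is to launch the browser once and only open a fresh context per URL. A rough sketch, with the file writing elided:

const playwright = require('playwright-chromium');

const scrapeAll = async (urls) => {
  const browser = await playwright.chromium.launch({ headless: false });
  try {
    for (const url of urls) {
      // one isolated context (cookies, cache) per url, reusing the same browser
      const context = await browser.newContext();
      const page = await context.newPage();
      await page.goto(url.loc);
      const html = await page.content();
      // ...slugify the url, build the filename and write the file exactly as above
      await context.close();
    }
  } finally {
    // close the browser even if one of the pages fails to load
    await browser.close();
  }
};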
Result: scraping entire HTML pages / XML Sitemap
After running either script, you should end up with a directory full of HTML files like this:
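The exact names depend on the URLs and on the day you run the script, but with the inline list from the first script the ./html/ directory would look roughly like this (the dates are just an example):

html/
  httpswwwexamplecom-2024-5-1.html
  httpswwwyahoocom-2024-5-1.html
  httpswwwcoinbasecom-2024-5-1.html
  httpskrakencom-2024-5-1.html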