Crawling HTML to generate a PDF with bookmarks

I started from the article "The front end uses the Puppeteer crawler to generate the PDF of the React.js Little Book and merge it". Unfortunately, the final PDF it produces has no bookmarks, so this post mainly adds bookmark support on top of that approach.

The example site being crawled is React.js (the React.js Little Book). This is for learning and discussion only.

Generate a PDF from web pages

Use Puppeteer to crawl the web pages and generate a PDF for each one.

Puppeteer documentation (Chinese)

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://news.ycombinator.com', {waitUntil: 'networkidle2'});
  await page.pdf({path: 'hn.pdf', format: 'A4'});

  await browser.close();
})();
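
The example above saves a single page. For a multi-chapter book the same calls can be repeated per URL. The following is only a sketch under assumptions: `savePagesAsPdf`, its parameters and the zero-padded file names are hypothetical and not taken from the original project. Only `newPage`, `goto`, `pdf` and `close` are Puppeteer calls.

```javascript
const path = require('path');

// Hypothetical helper (not from the original project).
// browser: a Puppeteer Browser, e.g. from `await puppeteer.launch()`
// urls:    chapter URLs in reading order
// outDir:  directory that receives one PDF per chapter
async function savePagesAsPdf(browser, urls, outDir) {
  const files = [];
  for (let i = 0; i < urls.length; i++) {
    const page = await browser.newPage();
    try {
      await page.goto(urls[i], { waitUntil: 'networkidle2' });
      // Zero-padded names keep the files in chapter order for the merge step.
      const file = path.join(outDir, `${String(i + 1).padStart(3, '0')}.pdf`);
      await page.pdf({ path: file, format: 'A4' });
      files.push(file);
    } finally {
      // Close the tab even if loading or printing failed.
      await page.close();
    }
  }
  return files;
}

// Possible usage (inside an async function, with puppeteer installed):
//   const browser = await puppeteer.launch();
//   const files = await savePagesAsPdf(browser, chapterUrls, __dirname);
//   await browser.close();
```

Passing the browser in as a parameter keeps launching and closing it in one place and makes the helper easy to try with a stub.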

Merge the PDFs

pdf-merge: merges the individual PDF files into one

It depends on pdftk.

How to add bookmarks to a PDF

pdftk: a command-line tool for PDF processing

  • After installation, add its bin directory to the PATH environment variable

Add bookmarks to a PDF with update_info_utf8:

pdftk 'd:\OpenSource\My\genpfdforrsb\React Little book(No bookmarks).pdf' update_info_utf8 'd:\OpenSource\My\genpfdforrsb\bookmarks.txt' output 'd:\OpenSource\My\genpfdforrsb\React Little book.pdf'

What the bookmark file is

The bookmarks are described in bookmarks.txt, the file passed to update_info_utf8 in the command above.

Bookmark format:

BookmarkBegin
BookmarkTitle: PDF Reference (Version 1.5)
BookmarkLevel: 1
BookmarkPageNumber: 1
BookmarkBegin
BookmarkTitle: Contents
BookmarkLevel: 2
BookmarkPageNumber: 3
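
As an illustration of this format, the records can be produced from plain objects. This is a minimal sketch: `formatBookmarks` and the field names `title`, `level` and `page` are hypothetical. It joins lines with `\n` by default, while the snippet later in this post writes `\r\n`, so the line ending is left as a parameter.

```javascript
// Hypothetical helper: turns bookmark objects into the text layout shown above.
// entries: [{ title, level, page }], eol: line ending to use between fields.
function formatBookmarks(entries, eol = '\n') {
  return entries
    .map(({ title, level, page }) =>
      [
        'BookmarkBegin',
        `BookmarkTitle: ${title}`,
        `BookmarkLevel: ${level}`,
        `BookmarkPageNumber: ${page}`,
      ].join(eol) + eol
    )
    .join('');
}

// Reproduces the two records from the format example.
const text = formatBookmarks([
  { title: 'PDF Reference (Version 1.5)', level: 1, page: 1 },
  { title: 'Contents', level: 2, page: 3 },
]);
console.log(text);
```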

Determine the bookmark page numbers

pdfjs-dist: reads the page count of each individual PDF. The counts are used to work out the page numbers written to bookmarks.txt.
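
The arithmetic is a running sum: each PDF starts on the page right after the previous one ends. A small worked sketch follows, where `startPages` is a hypothetical name and the page counts are made up for illustration.

```javascript
// Hypothetical helper: given the page count of each PDF in merge order,
// return the 1-based page on which each PDF starts in the merged file.
function startPages(pageCounts) {
  const starts = [];
  let next = 1; // the first PDF starts on page 1
  for (const count of pageCounts) {
    starts.push(next);
    next += count; // the following PDF starts after this one's pages
  }
  return starts;
}

// Three PDFs with 3, 5 and 2 pages start on pages 1, 4 and 9.
console.log(startPages([3, 5, 2])); // [ 1, 4, 9 ]
```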

Generate bookmarks

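const fs = require('fs');

// result is assumed to hold the documents loaded with pdfjs-dist,
// and titleArr the matching chapter titles (both built earlier)
const pageArr = result.map(c => c.numPages);
let txt = ''
let pageIndex = 1 // page on which the current PDF starts in the merged file (1-based)
for (let index = 0; index < pageArr.length; index++) {
    let temp = `BookmarkBegin\r\nBookmarkTitle: ${titleArr[index]}\r\nBookmarkLevel: 1\r\nBookmarkPageNumber: ${pageIndex}\r\n`
    txt += temp
    pageIndex += pageArr[index] // the next PDF starts right after this one
}
fs.writeFileSync('bookmarks.txt', txt);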

Add bookmarks

Following the pdf-merge source code, add a runshell.js that executes the pdftk command from Node.

runshell.js is as follows:

'use strict';
const child = require('child_process');
const Promise = require('bluebird');
const exec = Promise.promisify(child.exec);

// exec is already promisified, so its promise can be returned directly
module.exports = (scripts) => exec(scripts);

Execute pdftk update_info_utf8

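// fs (Node's built-in module) and mergepdf (defined elsewhere in the project) are assumed to be in scope
const runshell = require('./runshell'); // assumes runshell.js sits next to this file

const nobkname = 'React Little book(No bookmarks).pdf'
const hasbkname = 'React Little book.pdf'
mergepdf(nobkname).then(buffer => {
    console.log('starting add bookmarks!')
    // Return the promise so the cleanup below waits for pdftk to finish
    return runshell(`pdftk "${__dirname}/${nobkname}" update_info_utf8 "${__dirname}/bookmarks.txt" output "${__dirname}/${hasbkname}"`)
}).then(() => {
    console.log('completed add bookmarks!')
    // Remove the intermediate files once the bookmarked PDF exists
    fs.unlinkSync(`${__dirname}/${nobkname}`);
    fs.unlinkSync(`${__dirname}/bookmarks.txt`);
    console.log('all completed!')
}).catch(err => {
    console.error('failed to add bookmarks:', err)
})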
  • The file paths must be wrapped in double quotes, because the file names contain spaces
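
One way to sidestep the quoting issue is to skip the shell entirely. The sketch below swaps in Node's built-in `child_process.execFile`, which is not what the original project uses. `buildPdftkArgs` and `addBookmarks` are hypothetical names, and pdftk is assumed to be on the PATH.

```javascript
const { execFile } = require('child_process');
const { promisify } = require('util');
const path = require('path');

const execFileAsync = promisify(execFile);

// Hypothetical helper: builds the argument list for
//   pdftk <input> update_info_utf8 <bookmarks> output <output>
function buildPdftkArgs(dir, input, bookmarks, output) {
  return [
    path.join(dir, input),
    'update_info_utf8',
    path.join(dir, bookmarks),
    'output',
    path.join(dir, output),
  ];
}

// Runs pdftk without a shell: each array element is passed as one argument,
// so paths containing spaces need no extra quoting.
function addBookmarks(dir, input, bookmarks, output) {
  return execFileAsync('pdftk', buildPdftkArgs(dir, input, bookmarks, output));
}

// Possible usage:
//   addBookmarks(__dirname, 'React Little book(No bookmarks).pdf',
//                'bookmarks.txt', 'React Little book.pdf')
//     .then(() => console.log('completed add bookmarks!'));
```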

Source code: genpfdforrsb

Known problem

The page numbers printed in the merged PDF are not consecutive. Each part keeps the page numbers of its original single PDF.

Tags: node.js React

Posted on Sun, 01 Dec 2019 01:32:37 -0800 by Hopps