Export WordPress Posts to PDF

I’ve been working on a project recently to investigate training an LLM (LocalGPT, in this case) to help analyze a document library. (More on that in the future.) For ingest, it handles PDF files really well. I needed some well-known (by me) content to kick the tires for initial prototyping, so I decided to dump all the posts from this blog to PDF.

It turns out that wasn’t an exceptionally easy thing to do. Although there were several WordPress plug-ins that purported to be able to do it for a small fee, I felt like it should be fairly straightforward to do with some Python.

It was.

I’ve posted the code as a gist here, or you can copy/paste it below. One thing to pay attention to: because the script fetches the rendered HTML of each post and then converts it to PDF, the current blog theme is applied. Depending on the complexity of the theme, that can lead to interesting output, so I strongly recommend temporarily switching to a very basic theme before using this code.

I hope you find this useful.

import pdfkit
import requests

'''
The current theme will be applied to outputs, so it is recommended to switch to a simple theme before exporting.
'''

# Generate a PDF from an individual post URL
def url_to_pdf(url, output_filename):
    try:
        # Fetch HTML content from the URL
        response = requests.get(url)
        response.raise_for_status()  # Raises an HTTPError if the response was an unsuccessful status code

        # Convert HTML to PDF and save it
        pdfkit.from_string(response.text, output_filename)
        print(f"PDF successfully saved as {output_filename}")
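    except requests.exceptions.RequestException as e:
        # Report fetch failures instead of silently skipping the post
        print(f"Error fetching {url}: {e}")
    except Exception as e:
        # pdfkit raises if wkhtmltopdf is missing or the conversion fails
        print(f"Error converting {url} to PDF: {e}")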

# Get all post URLs via the WordPress REST API
def list_all_posts(website_url):
    posts_list = []
    api_url = f"{website_url}/wp-json/wp/v2/posts"
    page = 1
    while True:
        # Fetch a page of posts
        response = requests.get(api_url, params={'per_page': 100, 'page': page})
        # Stop on any non-200 response; WordPress answers with an error
        # (typically HTTP 400) once the page number runs past the last page
        if response.status_code != 200:
            break
        posts = response.json()
        # If the page is empty, break the loop
        if not posts:
            break
        # Collect the permalink of each post
        for post in posts:
            posts_list.append(post['link'])
        page += 1
    return posts_list


# Usage
website_url = "https://blog.yoursite.org"
pl = list_all_posts(website_url)
# Use a running index to generate unique file names
# (enumerate avoids shadowing the built-in ord())
for index, post_url in enumerate(pl):
    outfile = f"p{index}.pdf"
    url_to_pdf(post_url, outfile)
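
If switching themes isn't practical, one possible workaround is to skip the themed page entirely and build a bare HTML document from the fields the REST API already returns for each post (`slug`, `title.rendered`, and `content.rendered`). What follows is only a rough sketch of that idea, not part of the script above: the helper names `post_filename` and `post_to_html` are made up for illustration, and I'm assuming the post dictionaries look the way the standard `/wp-json/wp/v2/posts` endpoint returns them.

```python
import re


def post_filename(post, index):
    """Build a file name such as '0007-my-post.pdf' from a post's slug.

    `post` is assumed to be one item from the /wp-json/wp/v2/posts response.
    """
    # Keep only characters that are safe in file names on most systems
    slug = re.sub(r"[^A-Za-z0-9_-]+", "-", post.get("slug", "")).strip("-")
    # Fall back to a generic name if the slug is missing or empty
    return f"{index:04d}-{slug or 'post'}.pdf"


def post_to_html(post):
    """Wrap a post's rendered body in a minimal page with no theme markup."""
    # Both fields are already HTML, so they are inserted as-is
    title = post["title"]["rendered"]
    body = post["content"]["rendered"]
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        f"<title>{title}</title></head>\n"
        f"<body><h1>{title}</h1>\n{body}\n</body></html>"
    )
```

To use it, `list_all_posts` would need to keep the whole post dictionary rather than just `post['link']`, and the conversion step would become something like `pdfkit.from_string(post_to_html(post), post_filename(post, index))`. The output loses the blog's styling completely, which is fine for feeding an LLM but probably not what you want for a printable archive.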