I’ve been working on a project recently to investigate training an LLM (LocalGPT, in this case) to help analyze a document library. (More on that in the future.) For ingest, it handles PDF files really well. I needed some content I know well to kick the tires for initial prototyping, so I decided to dump all the posts from this blog to PDF.
It turns out that wasn’t an exceptionally easy thing to do. Although several WordPress plug-ins purported to do it for a small fee, I felt like it should be fairly straightforward to do with some Python.
It was.
I’ve posted the code as a gist here. Or you can copy/paste it below. One thing to pay attention to is that, because it renders the HTML of each post and then converts it to PDF, the current blog theme is applied. Depending on the complexity of the theme, it can lead to interesting outputs so I strongly recommend temporarily switching to some kind of very basic theme before using this code.
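If switching themes isn't practical, one possible workaround is to skip the themed page entirely. The WordPress REST API returns each post's title and body as `title.rendered` and `content.rendered`, so those can be wrapped in a bare HTML page before conversion. The sketch below is only an illustration of that idea, not part of the script in this post; `post_to_plain_html` and `sample_post` are hypothetical names, and the sample data is made up.

```python
def post_to_plain_html(post):
    """Wrap one post's rendered title and body in a bare HTML page.

    `post` is assumed to be one item from the JSON list returned by
    /wp-json/wp/v2/posts, which carries 'title' and 'content' objects
    with a 'rendered' field each.
    """
    title = post["title"]["rendered"]
    body = post["content"]["rendered"]
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        f"<title>{title}</title></head>\n"
        f"<body><h1>{title}</h1>\n{body}\n</body></html>"
    )


# Hypothetical sample shaped like a WordPress REST API post object.
sample_post = {
    "title": {"rendered": "Hello world"},
    "content": {"rendered": "<p>First post.</p>"},
}
page_html = post_to_plain_html(sample_post)
print(page_html)
```

The resulting string could then be handed to `pdfkit.from_string(page_html, "post.pdf")` in place of the fetched page. The trade-off is that anything the theme normally supplies (styles, featured images, author and date lines) is left out unless you add it to the wrapper yourself.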
I hope you find this useful.
import pdfkit
import requests

'''
The current theme will be applied to outputs, so it is recommended
to switch to a simple theme before exporting.
'''


# Generate PDF from individual post URL
def url_to_pdf(url, output_filename):
    try:
        # Fetch HTML content from the URL
        response = requests.get(url)
        # Raises an HTTPError if the response was an unsuccessful status code
        response.raise_for_status()

        # Convert HTML to PDF and save it
        pdfkit.from_string(response.text, output_filename)
        print(f"PDF successfully saved as {output_filename}")
    except requests.exceptions.RequestException as e:
        #print(f"Error fetching the URL: {e}")
        pass
    except Exception as e:
        #print(f"Error converting HTML to PDF: {e}")
        pass


# Get all post URLs via WordPress API
def list_all_posts(website_url):
    posts_list = []
    api_url = f"{website_url}/wp-json/wp/v2/posts"
    page = 1

    while True:
        # Fetch a page of posts
        response = requests.get(api_url, params={'per_page': 100, 'page': page})

        # If the response status code is not 200, break the loop
        if response.status_code != 200:
            #print(f"Finished or encountered an error. HTTP status code: {response.status_code}")
            break

        posts = response.json()

        # If the page is empty, break the loop
        if not posts:
            break

        # Collect the URL of each post
        for post in posts:
            posts_list.append(post['link'])
            #print(post['link'])

        page += 1

    return posts_list


# Usage
website_url = "https://blog.yoursite.org"
pl = list_all_posts(website_url)

# Use an integer to generate unique file names.
# (Named count rather than ord to avoid shadowing the built-in ord().)
count = 0
for p in pl:
    outfile = "p" + str(count) + ".pdf"
    url_to_pdf(p, outfile)
    count += 1
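The numbered file names (`p0.pdf`, `p1.pdf`, ...) are unique but don't say which post is which. As an optional variation, here is a minimal sketch that derives the name from the last segment of the post URL instead. It assumes the blog uses "pretty" permalinks that end in the post slug; `pdf_name_for` and its `fallback` parameter are hypothetical names, not part of the original script.

```python
from urllib.parse import urlparse


def pdf_name_for(url, fallback="post"):
    """Build a PDF file name from the last path segment of a post URL."""
    path = urlparse(url).path.strip("/")
    slug = path.split("/")[-1] if path else ""
    # Keep only characters that are safe in file names on most systems.
    safe = "".join(c if c.isalnum() or c in "-_" else "-" for c in slug)
    return f"{safe or fallback}.pdf"


print(pdf_name_for("https://blog.yoursite.org/2024/01/15/my-first-post/"))
# my-first-post.pdf
print(pdf_name_for("https://blog.yoursite.org/"))
# post.pdf
```

In the usage loop, `outfile = pdf_name_for(p)` would take the place of the counter-based name. With plain permalinks such as `?p=123` the path is empty and every post would fall back to the same name, so the integer counter is the safer choice on those sites.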