PDF Generator Best Practices

By Ka Wing Ho, Associate Security Consultant at PrivasecRED

In this blog, I explain the dangers of using server-side PDF generation technologies without properly sanitising user input.

Background

Have you ever used a web application with an “Export to PDF” function? This is not to be confused with “Print to PDF”, by the way; that’s just something your web browser offers (i.e. Ctrl+P).

Chances are that you have used this functionality at some point. It could be exporting search results or, if you’re an administrative user, exporting logs and statistics about your users and traffic. Similar functionality has existed for quite a while now in the form of “Export to CSV”, where users generate on-demand plaintext reports with variable data such as dates, columns, number of items, etc.

Why use PDF then? Well… CSV isn’t exactly the most readable format.

Exporting to PDF allows users to distribute and archive data in a productive manner, trading import capability for readability.

As is the norm for our current technological age, if you’ve got a problem, you’ll probably find numerous software solutions for it! Here are a handful of PDF generation technologies that have cropped up in the past decade or so:

Software/Library    Type
EO.PDF              Library
DinkToPDF           Library
PDFKit              Library
Node-html-pdf       Library
Go-wkhtmltopdf      Library
Flying Saucer       Library
DomPDF              Library
WeasyPrint          Library
wkhtmltopdf         Headless Browser
Headless Chrome     Headless Browser
PhantomJS           Headless Browser
MicroStrategy       Standalone Software
PrinceXML           Standalone Software

These technologies operate by filling in some form of HTML/XML template with whatever data you’ve queried from the backend database and then generating a nicely formatted PDF report for your end users or admins to look at! Great!

Functionality Abuse

Skipping over the minute details, all it really takes for things to go south is for your input to contain some malicious HTML such as <iframe src=/etc/passwd>, or an XML External Entity (XXE), and the backend PDF generation software will happily attach external resources to your generated PDF. In a twist of irony, CSV doesn’t seem like such a bad export format after all…
In case you were wondering what that looks like…
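
To make the failure mode concrete, here is a minimal Python sketch of the template-filling step. The template string, the function names and the sample payload placement are illustrative assumptions rather than code from any particular product; the point is only that unescaped input becomes live markup for whichever renderer consumes the HTML next.

```python
import html

# Hypothetical report template; real generators use richer templating engines.
TEMPLATE = "<html><body><h1>Report for {name}</h1></body></html>"


def render_unsafe(name: str) -> str:
    """Insert user input verbatim, so any tags in it become real markup."""
    return TEMPLATE.format(name=name)


def render_escaped(name: str) -> str:
    """HTML-escape user input so the renderer treats it as plain text."""
    return TEMPLATE.format(name=html.escape(name, quote=True))


payload = "<iframe src=/etc/passwd>"
print(render_unsafe(payload))   # the iframe tag survives intact
print(render_escaped(payload))  # angle brackets are turned into entities
```

Escaping at the point of insertion is only one layer; it does not help if the application deliberately accepts user-supplied HTML, in which case the generator’s own restrictions matter more.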

Possible impact

The impact of such functionality abuse depends on various factors such as:

  • The software library used and the chosen implementation methods
  • The capabilities the parser gives to the template language at execution time
    • Plain-HTML rendering only?
    • Full JavaScript-execution allowed?
  • The underlying environment the software runs under
    • On-premises server?
    • EC2 instance hosted on AWS?

Here is a list of possible impacts:

  • Reading of local files on the backend server: This could lead to unauthorized access if an attacker could read sensitive documents such as configuration files, password files and SSH keys.
  • Exposure of the application server’s origin IP address: An attacker could bypass any CDN-level web application firewalls in place to launch attacks against your server directly.
  • Forged communications with other private applications/services in the internal network: The request for external resources can be abused to point inwards instead, which may allow an attacker to reach services inside your internal network.
  • Disclosure of Access Keys/Tokens due to Cloud Provider Instance Metadata Access: If the service is running in a cloud environment, the internal network may also contain a Metadata endpoint which leaks sensitive information, sometimes including access credentials.
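
The last two impacts both rely on the generator fetching a URL that points inwards. As a rough sketch of the kind of check an application could run before letting a resource URL reach the generator, the Python below rejects non-web schemes and flags internal addresses. The function names and the allowed-scheme set are assumptions made for illustration; a real deployment would also need to resolve hostnames and apply the address check to the resolved IP.

```python
import ipaddress
from urllib.parse import urlsplit

# Assumption: reports only ever need to load resources over HTTP(S).
ALLOWED_SCHEMES = {"http", "https"}


def scheme_is_allowed(url: str) -> bool:
    """Reject file://, scheme-less paths and anything else not on the list."""
    return urlsplit(url).scheme.lower() in ALLOWED_SCHEMES


def is_internal_ip(ip_text: str) -> bool:
    """True for private, loopback and link-local addresses.

    Link-local covers 169.254.169.254, the address commonly used by
    cloud instance metadata services.
    """
    ip = ipaddress.ip_address(ip_text)
    return ip.is_private or ip.is_loopback or ip.is_link_local


print(scheme_is_allowed("file:///etc/passwd"))
print(is_internal_ip("169.254.169.254"))
```

A check like this is easy to get subtly wrong (redirects and DNS rebinding are the usual gaps), so it is best treated as one layer alongside network-level restrictions rather than a complete defence.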

It’s not a Bug, It’s a Feature!

Unfortunately, this isn’t something that developers can (or will) easily fix, as there are legitimate use cases for some of these HTML/XML tags that cannot be ignored. The current approach involves offloading the responsibility onto the user to control what content can be embedded, rather than stopping the embedding of content altogether.

Let’s put it this way: Car manufacturers build cars with seatbelts in them. If you got injured because you didn’t wear your seatbelt whilst driving, do you blame the car manufacturer? 

Prevention Strategies

The following examples demonstrate measures that have been taken by various providers:
  • PrinceXML: Offers the --no-network and --no-local-files options when executing Prince
  • MicroStrategy: Allows users to configure a whitelist of URLs from which content may be loaded
  • wkhtmltopdf: Offers the --disable-javascript flag to prevent JavaScript execution
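
As a sketch of how the command-line options above might be wired into an application, the Python below builds the argument lists for Prince and wkhtmltopdf and runs them with a timeout. The binary names, the argument order, the helper names and the 30-second timeout are assumptions for illustration; check each tool’s documentation for the exact invocation on your platform.

```python
import subprocess
from typing import List


def prince_command(input_html: str, output_pdf: str) -> List[str]:
    """Prince invocation with network and local file access switched off."""
    return ["prince", "--no-network", "--no-local-files",
            input_html, "-o", output_pdf]


def wkhtmltopdf_command(input_html: str, output_pdf: str) -> List[str]:
    """wkhtmltopdf invocation with JavaScript execution switched off."""
    return ["wkhtmltopdf", "--disable-javascript", input_html, output_pdf]


def run_generator(command: List[str], timeout_seconds: int = 30) -> None:
    """Run the generator without a shell; raise on failure or timeout."""
    subprocess.run(command, check=True, timeout=timeout_seconds)


print(prince_command("report.html", "report.pdf"))
print(wkhtmltopdf_command("report.html", "report.pdf"))
```

Passing the arguments as a list, rather than as a single shell string, keeps file names supplied by the application from being interpreted by a shell.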

Conclusion/Takeaway

Regardless of which PDF generation software you use, the following recommendations should always be followed:

  • Read through the software documentation for features that can be abused and enable offered safety mechanisms to help mitigate issues
  • Properly sanitise and canonicalise all user input
  • Strip identifying metadata (for example, the Producer and Creator fields) from the generated PDFs to prevent technology fingerprinting
  • Execute the PDF generation software within a sandboxed environment to minimize impact
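
To illustrate what canonicalising input can mean in practice, here is a small Python sketch that resolves a user-supplied asset path and refuses anything that ends up outside an allowed directory. The function name and the idea of a single allowed base directory are assumptions for the example; the same principle (normalise first, validate second) applies to URLs and other inputs.

```python
import os


def resolve_inside(base_dir: str, user_path: str) -> str:
    """Return the canonical path for user_path, or raise if it escapes base_dir."""
    base = os.path.realpath(base_dir)
    # realpath collapses ".." segments and follows symlinks before we compare.
    candidate = os.path.realpath(os.path.join(base, user_path))
    if os.path.commonpath([base, candidate]) != base:
        raise ValueError("path escapes the allowed directory")
    return candidate
```

For instance, a value such as ../../etc/passwd, or an absolute path such as /etc/passwd, resolves to a location outside the base directory and is rejected, while a plain file name resolves to a path beneath it.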

References

Corben Leo – Discovery of vulnerability in PrinceXML products
https://www.corben.io/XSS-to-XXE-in-Prince/

Triskele Labs – Discovery of vulnerability in MicroStrategy products
https://triskelelabs.com/extracting-your-aws-access-keys-through-a-pdf-file/

AWS – Introduced the IMDSv2 to mitigate SSRF attacks accessing AWS secrets
https://aws.amazon.com/blogs/security/defense-in-depth-open-firewalls-reverse-proxies-ssrf-vulnerabilities-ec2-instance-metadata-service/

PrinceXML Documentation: Security Best Practices
https://www.princexml.com/doc/server-integration/#security

MicroStrategy: Security Best Practices
https://community.microstrategy.com/s/article/Securing-PDF-and-Excel-Export-with-Whitelists

The Daily Swig – Benchmarks against several PDF generation libraries
https://portswigger.net/daily-swig/html-to-pdf-converters-open-to-denial-of-service-ssrf-directory-traversal-attacks
