PDF Generator Best Practices

By Ka Wing Ho, Associate Security Consultant at PrivasecRED

In this blog, I explain the dangers of using server-side PDF generation technologies without properly sanitising user input.

Background

Have you ever used a web application with an “Export to PDF” function? This is not to be confused with “Print to PDF”, which is just something your web browser offers (i.e. Ctrl+P).

Chances are that you have used this functionality at some point. It could be exporting search results or, if you’re an administrative user, exporting logs and statistics about your users and traffic. Similar functionality has existed for quite a while now in the form of “Export to CSV”, where users generate on-demand plaintext reports with variable data such as dates, columns and number of items.

Why use PDF then? Well… CSV isn’t exactly the most readable format.

Exporting to PDF allows users to distribute and archive data in a productive manner, trading import capability for readability.

As is the norm for our current technological age, if you’ve got a problem, you’ll probably find numerous software solutions for it! Here are a handful of PDF generation technologies that have cropped up in the past decade or so:

Software/Library	Type
EO.PDF	Library
DinkToPDF	Library
PDFKit	Library
Node-html-pdf	Library
Go-wkhtmltopdf	Library
Flying Saucer	Library
DomPDF	Library
WeasyPrint	Library
wkhtmltopdf	Headless Browser
Headless Chrome	Headless Browser
PhantomJS	Headless Browser
MicroStrategy	Standalone Software
PrinceXML	Standalone Software

These technologies operate by filling in some form of HTML/XML template with whatever data you’ve queried from the backend database and then generating a nicely formatted PDF report for your end users or admins to look at! Great!

Functionality Abuse

Skipping over the minute details, all it really takes for things to go south is for your input to contain some malicious HTML such as <iframe src=/etc/passwd>, or an XML External Entity (XXE) payload, and the backend PDF generation software will happily embed the referenced resources in your generated PDF. In a twist of irony, CSV doesn’t seem like such a bad export format after all…
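To make the injection point concrete, here is a minimal Python sketch. The template, the function names and the "name" field are all hypothetical and only stand in for whatever your application feeds into its report template. It contrasts interpolating user input verbatim with HTML-escaping it first, which is a baseline measure and not a complete defence on its own.

```python
import html

# Hypothetical report template; real generators use richer HTML/XML templates.
REPORT_TEMPLATE = "<html><body><h1>Report for {name}</h1></body></html>"


def render_unsafe(name: str) -> str:
    """Interpolate user input verbatim, so injected markup reaches the renderer."""
    return REPORT_TEMPLATE.format(name=name)


def render_escaped(name: str) -> str:
    """HTML-escape user input so the renderer treats it as text, not markup."""
    return REPORT_TEMPLATE.format(name=html.escape(name, quote=True))


payload = "<iframe src=/etc/passwd>"
print(render_unsafe(payload))   # the iframe tag survives and would be rendered
print(render_escaped(payload))  # the tag is neutralised into &lt;iframe ...&gt;
```

If the escaped HTML is what gets handed to the PDF generator, the payload shows up in the report as literal text instead of being fetched and embedded.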

Possible impact

The impact of such functionality abuse depends on various factors such as:

The software library used and the chosen implementation methods
The capabilities given to the templated language by the parser during execution time
- Plain-HTML rendering only?
- Full JavaScript-execution allowed?
The underlying environment the software runs under
- On-premises server?
- EC2 instance hosted on AWS?

Here is a list of possible impacts:

Reading of local files on the backend server: This could lead to unauthorized access if an attacker could read sensitive documents such as configuration files, password files and SSH keys.
Exposure of the application server’s origin IP address: An attacker could bypass any CDN-level web application firewalls in place to launch attacks against your server directly.
Forged communications with other private applications/services in the internal network (server-side request forgery, or SSRF): The request for external resources can be abused to point inwards instead, which may allow an attacker to reach inside your internal network.
Disclosure of Access Keys/Tokens due to Cloud Provider Instance Metadata Access: If the service is running in a cloud environment, the internal network may also contain a Metadata endpoint which leaks sensitive information, sometimes including access credentials.
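Several of these impacts come down to the renderer fetching a URL it should never touch. As a rough sketch (not a complete SSRF defence), the check below rejects non-HTTP schemes and any host that resolves to a loopback, private or link-local address, which covers the 169.254.169.254 address that EC2 uses for instance metadata. The function names and the injectable `resolve` parameter are my own assumptions for illustration. Keep in mind that the PDF generator may resolve the hostname again or follow redirects, so network-level controls are still needed.

```python
import ipaddress
import socket
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}


def is_internal_address(host: str) -> bool:
    """Return True if the IP literal falls in a range that is not publicly routable."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return True  # Unparseable address: treat as unsafe.
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def is_safe_resource_url(url: str, resolve=socket.getaddrinfo) -> bool:
    """Allow only http(s) URLs whose host resolves exclusively to public addresses."""
    parsed = urlparse(url)
    if parsed.scheme not in ALLOWED_SCHEMES or not parsed.hostname:
        return False
    try:
        infos = resolve(parsed.hostname, None)
    except OSError:
        return False  # Resolution failed: refuse rather than guess.
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return bool(infos) and all(not is_internal_address(info[4][0]) for info in infos)
```

A check like this would run against every URL found in user-supplied content before the HTML reaches the generator. For example, file:///etc/passwd is rejected on scheme alone, and http://169.254.169.254/latest/meta-data/ is rejected because the address is link-local.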

It’s not a Bug, It’s a Feature!

Unfortunately, this isn’t something that developers can (or will) easily fix, as there are legitimate use cases for some of these HTML/XML tags that cannot be ignored. The current approach involves offloading the responsibility onto the user to control what content can be embedded, rather than stopping the embedding of content altogether.

Let’s put it this way: Car manufacturers build cars with seatbelts in them. If you got injured because you didn’t wear your seatbelt whilst driving, do you blame the car manufacturer?

Prevention Strategies

The following examples demonstrate measures that have been taken by various providers:

PrinceXML: Offers the --no-network and --no-local-files options when executing Prince
MicroStrategy: Allows users to configure a whitelist of URLs from which content may be loaded
wkhtmltopdf: Offers the --disable-javascript flag to prevent JavaScript execution
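As an illustration of switching these options on, here is a small Python sketch that builds the command lines for Prince and wkhtmltopdf using only the flags listed above. The helper names, file paths and timeout value are assumptions of mine, and your installation may need different binary names or additional options. Passing the arguments as a list (rather than a shell string) also keeps user-controlled file names out of shell parsing.

```python
import subprocess
from typing import List


def prince_command(html_path: str, pdf_path: str) -> List[str]:
    """Build a Prince invocation with network and local file access disabled."""
    return ["prince", "--no-network", "--no-local-files", html_path, "-o", pdf_path]


def wkhtmltopdf_command(html_path: str, pdf_path: str) -> List[str]:
    """Build a wkhtmltopdf invocation with JavaScript execution disabled."""
    return ["wkhtmltopdf", "--disable-javascript", html_path, pdf_path]


def generate_pdf(command: List[str], timeout_seconds: int = 30) -> None:
    """Run the generator without a shell; raise on failure or if it runs too long."""
    subprocess.run(command, check=True, timeout=timeout_seconds)
```

Calling generate_pdf(prince_command("report.html", "report.pdf")) would then run Prince with both restrictions applied, assuming the prince binary is on the PATH.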

Conclusion/Takeaway

Regardless of which PDF generation software you use, the following recommendations should always be followed:

Read through the software documentation for features that can be abused and enable offered safety mechanisms to help mitigate issues
Properly sanitise and canonicalise all user input
Sanitise EXIF data and document metadata (such as the Producer and Creator fields) from the generated PDFs to prevent technology fingerprinting
Execute the PDF generation software within a sandboxed environment to minimise impact

References

Corben Leo – Discovery of vulnerability in PrinceXML products
https://www.corben.io/XSS-to-XXE-in-Prince/

Triskele Labs – Discovery of vulnerability in MicroStrategy products
https://triskelelabs.com/extracting-your-aws-access-keys-through-a-pdf-file/

AWS – Introduced the IMDSv2 to mitigate SSRF attacks accessing AWS secrets
https://aws.amazon.com/blogs/security/defense-in-depth-open-firewalls-reverse-proxies-ssrf-vulnerabilities-ec2-instance-metadata-service/

PrinceXML Documentation: Security Best Practices
https://www.princexml.com/doc/server-integration/#security

MicroStrategy: Security Best Practices
https://community.microstrategy.com/s/article/Securing-PDF-and-Excel-Export-with-Whitelists

The Daily Swig – Benchmarks against several PDF generation libraries
https://portswigger.net/daily-swig/html-to-pdf-converters-open-to-denial-of-service-ssrf-directory-traversal-attacks
