By Ka Wing Ho, Associate Security Consultant at PrivasecRED
In this blog, I explain the dangers of using server-side PDF generation technologies without properly sanitising user input.
Background
Have you ever used a web application with an “Export to PDF” function? This is not to be confused with “Print to PDF”, which is just something your web browser offers (i.e. Ctrl+P).
Chances are that you have used this functionality at some point. It could be exporting search results or, if you’re an administrative user, exporting logs and statistics about your users and traffic. Similar functionality has existed for quite a while in the form of “Export to CSV”, where users generate on-demand plaintext reports with variable data such as dates, columns and number of items.
Why use PDF then? Well… CSV isn’t exactly the most readable format.
Exporting to PDF allows users to distribute and archive data in a productive manner, trading import capability for readability.
As is the norm for our current technological age, if you’ve got a problem, you’ll probably find numerous software solutions for it! Here are a handful of PDF generation technologies that have cropped up in the past decade or so:
| Software/Library | Type |
| --- | --- |
| EO.PDF | Library |
| DinkToPDF | Library |
| PDFKit | Library |
| Node-html-pdf | Library |
| Go-wkhtmltopdf | Library |
| Flying Saucer | Library |
| DomPDF | Library |
| WeasyPrint | Library |
| wkhtmltopdf | Headless Browser |
| Headless Chrome | Headless Browser |
| PhantomJS | Headless Browser |
| MicroStrategy | Standalone Software |
| PrinceXML | Standalone Software |
These technologies operate by filling in some form of HTML/XML template with whatever data you’ve queried from the backend database and then generating a nicely formatted PDF report for your end users or admins to look at! Great!
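As a rough illustration of that template-filling step, here is a minimal sketch using Python's standard `string.Template`. The template text, the `render_report` helper and the field names are all hypothetical and not taken from any of the products listed above. The resulting HTML string is what would then be handed to whichever converter is in use.

```python
from string import Template

# Hypothetical report template; real products use their own template formats.
REPORT_TEMPLATE = Template(
    "<html><body>"
    "<h1>Report for $username</h1>"
    "<p>Items exported: $item_count</p>"
    "</body></html>"
)


def render_report(username: str, item_count: int) -> str:
    """Fill the template with backend data and return the HTML string."""
    # Values are substituted verbatim: nothing in this step escapes HTML.
    return REPORT_TEMPLATE.substitute(username=username, item_count=item_count)


html_doc = render_report("alice", 3)
print(html_doc)
```

Note that the substitution is purely textual, so whatever is stored in `username` ends up in the markup exactly as it was supplied.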
Functionality Abuse
If user-supplied input is placed into the template without sanitisation, an attacker can inject an HTML tag such as
<iframe src=/etc/passwd>
or an XML External Entity (XXE), and the backend PDF generation software will happily attach external resources to your generated PDF. In a twist of irony, CSV doesn’t seem like such a bad export format after all…
Possible impact
The impact of such functionality abuse depends on various factors such as:
- The software library used and the chosen implementation methods
- The capabilities the parser gives the templating language at execution time
  - Plain-HTML rendering only?
  - Full JavaScript execution allowed?
- The underlying environment the software runs under
  - On-premises server?
  - EC2 instance hosted on AWS?
Here is a list of possible impacts:
- Reading of local files on the backend server: This could lead to unauthorized access if an attacker could read sensitive documents such as configuration files, password files and SSH keys.
- Exposure of the application server’s origin IP address: An attacker could bypass any CDN-level web application firewalls in place and launch attacks against your server directly.
- Forged communications with other private applications/services in the internal network: The request for external resources can be pointed inwards instead, which may allow an attacker to reach services inside your internal network.
- Disclosure of Access Keys/Tokens due to Cloud Provider Instance Metadata Access: If the service is running in a cloud environment, the internal network may also contain a metadata endpoint, which can leak sensitive information and sometimes access credentials.
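To make the "pointing inwards" impacts more concrete, here is a minimal sketch of a check that flags resource URLs aimed at local files, internal addresses or the cloud metadata endpoint. The `is_risky_resource` helper is hypothetical and deliberately incomplete: it only inspects IP literals and the name `localhost`, whereas a real check would also have to resolve hostnames and handle redirects. The address `169.254.169.254` is the link-local address used by the AWS instance metadata service.

```python
import ipaddress
from urllib.parse import urlsplit


def is_risky_resource(url: str) -> bool:
    """Return True if a resource URL should not be fetched by the renderer."""
    parts = urlsplit(url)
    # Only plain web schemes are allowed; file:// and bare paths are refused.
    if parts.scheme not in ("http", "https"):
        return True
    host = parts.hostname
    if host is None:
        return True
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # A hostname rather than an IP literal. A real check would resolve it
        # and test every resulting address; this sketch only flags localhost.
        return host.lower() == "localhost"
    # Loopback, private ranges and link-local addresses (which include the
    # 169.254.169.254 metadata address) all point inwards.
    return address.is_loopback or address.is_private or address.is_link_local


for candidate in (
    "file:///etc/passwd",
    "http://169.254.169.254/latest/meta-data/",
    "http://10.0.0.5/admin",
    "https://example.com/logo.png",
):
    print(candidate, is_risky_resource(candidate))
```

A check like this only covers the URLs the application can see before rendering. It does not replace the network-level and sandboxing controls discussed later.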
It’s not a Bug, It’s a Feature!
Unfortunately, this isn’t something that developers can (or will) easily fix, as there are legitimate use cases for some of these HTML/XML tags that cannot be ignored. The current approach involves offloading the responsibility onto the user to control what content can be embedded, rather than stopping the embedding of content altogether.
Let’s put it this way: Car manufacturers build cars with seatbelts in them. If you got injured because you didn’t wear your seatbelt whilst driving, do you blame the car manufacturer?
Prevention Strategies
- PrinceXML: Offers the --no-network and --no-local-files options when executing Prince
- MicroStrategy: Allows users to configure a whitelist of URLs for content purposes
- wkhtmltopdf: Offers the --disable-javascript flag to prevent JavaScript execution
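The command-line options above could be wired into an application roughly as follows. This is only a sketch: it assumes the `prince` and `wkhtmltopdf` executables are available on the PATH, the helper names and file names are hypothetical, and the commands are built as argument lists rather than executed.

```python
from typing import List


def prince_command(input_html: str, output_pdf: str) -> List[str]:
    """Build a Prince invocation with network and local file access disabled."""
    return ["prince", "--no-network", "--no-local-files", input_html, "-o", output_pdf]


def wkhtmltopdf_command(input_html: str, output_pdf: str) -> List[str]:
    """Build a wkhtmltopdf invocation with JavaScript execution disabled."""
    return ["wkhtmltopdf", "--disable-javascript", input_html, output_pdf]


# Hypothetical file names, for illustration only.
print(prince_command("report.html", "report.pdf"))
print(wkhtmltopdf_command("report.html", "report.pdf"))
```

Passing such a list to `subprocess.run(command, check=True)` avoids going through a shell, so file names containing spaces or shell metacharacters are not interpreted as extra arguments.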
Conclusion/Takeaway
Regardless of which PDF generation software you use, the following recommendations should always be followed:
- Read through the software documentation for features that can be abused and enable offered safety mechanisms to help mitigate issues
- Properly sanitise and canonicalise all user input
- Sanitise EXIF data/metadata from the generated PDFs to prevent technology fingerprinting
- Execute the PDF generation software within a sandboxed environment to minimize impact
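As one possible reading of the "sanitise and canonicalise" recommendation, the sketch below normalises a value to a canonical Unicode form and then HTML-escapes it before it is placed into a template. The `sanitise_field` helper is hypothetical, and escaping like this only suits values placed in element text or quoted attribute values, so treat it as one layer rather than a complete defence.

```python
import html
import unicodedata


def sanitise_field(value: str) -> str:
    """Canonicalise a user-supplied value, then escape it for use in HTML."""
    # Canonicalise first, so look-alike characters (for example full-width
    # angle brackets) are folded to their ASCII forms before escaping.
    canonical = unicodedata.normalize("NFKC", value)
    # Escape &, <, > and both quote characters.
    return html.escape(canonical, quote=True)


print(sanitise_field("<iframe src=/etc/passwd>"))
# → &lt;iframe src=/etc/passwd&gt;
```

The order matters here: escaping before normalising would let a full-width "＜" pass through the escape step untouched and only become a real "<" afterwards.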
References
Corben Leo – Discovery of vulnerability in PrinceXML products
https://www.corben.io/XSS-to-XXE-in-Prince/
Triskele Labs – Discovery of vulnerability in MicroStrategy products
https://triskelelabs.com/extracting-your-aws-access-keys-through-a-pdf-file/
AWS – Introduced the IMDSv2 to mitigate SSRF attacks accessing AWS secrets
https://aws.amazon.com/blogs/security/defense-in-depth-open-firewalls-reverse-proxies-ssrf-vulnerabilities-ec2-instance-metadata-service/
PrinceXML Documentation: Security Best Practices
https://www.princexml.com/doc/server-integration/#security
MicroStrategy: Security Best Practices
https://community.microstrategy.com/s/article/Securing-PDF-and-Excel-Export-with-Whitelists
The Daily Swig – Benchmarks against several PDF generation libraries
https://portswigger.net/daily-swig/html-to-pdf-converters-open-to-denial-of-service-ssrf-directory-traversal-attacks