humanmade / whitelist-html
Safe HTML for WordPress. Adds esc_*()-like function to allow some HTML.
This package is auto-updated.
Last update: 2020-06-09 13:18:25 UTC
README
Introduces an esc_*()
-like function for when you need to allow some HTML.
Rationale
Background
Best practices when working with any sort of data is to escape your output, and
do it as late as possible. Values can't usually know about where they're going
to be used, so you need to escape based on whatever context you're
outputting into. These are things like esc_attr()
for HTML attribute values,
esc_html()
for text in HTML, and so on.
Even if values could know about their output context, it's still possible for users to craft malicious output if you're not escaping properly. For this reason, you need to do sanitization on input (to ensure your value is correct), as well as escaping on output (to ensure the value is output into the context correctly).
Right now across every WordPress site, there's a glaring hole in escaping, and hence in security.
When translating strings in WordPress, the most common functions to use are
__()
(translate and return) or _e()
(translate and output). Where possible,
these need to be escaped too, to ensure that translations don't accidentally
break your output. For this reason, esc_html_e()
, esc_attr_e()
, etc are
offered as convenience functions.
However, this falls down when you need to have HTML in the translation. Translation best practices say to include as much information as possible for translators when you translate a string. This means including HTML tags in the string so translators can understand how the sentence is formed.
It's possible to do "clever" hacks to get around this with placeholders, for example:
$text = sprintf( esc_html__( 'This is some text %1$swith a link%2$s'), '<a href="http://example.com/">', '</a>' );
Note though that this is much harder for translators to understand, since they can't intuitively tell what's going on without checking the code. Even with translator comments, it's still harder to understand. There's also no guarantee that this is secure. You could swap the placeholders, or leave out pieces. Best practice states that we should instead have the following:
$text = sprintf( esc_html__( 'This is some text <a href="%1$s">with a link</a>'), 'http://example.com/' );
Right now, the policy is essentially to treat translated strings with HTML as trusted. Not only does this push the burden off to translation validators in GlotPress, but it means you're no longer in control of your output. This is an attack vector waiting to be exploited.
How do we solve this?
WordPress contains functions specifically designed to help with this problem. After all, people can submit comments or posts with HTML in them, but WP can handle this fine. WordPress handles this through a library called kses, which sanitizes HTML down to a small, safe subset of HTML. Posts can have more HTML tags than comments can, since they're usually semi-trusted users.
kses is great, but is not typically used outside of large HTML blocks like post or comment content. The reason for this is often stated as performance. It's well-known that kses is pretty slow, since it has to essentially disassemble the HTML, then reconstruct it with the allowed tags.
However, Zack Tollman wrote a fantastic post that calls into question this accepted knowledge of kses performance. Zack's findings show that while kses is worse with performance on longer pieces of content (like post content), it's actually closer to being on-par with other escaping for short strings. This is even more evident when reducing the allowed elements down from the default to just the elements you need.
clean_html
This library provides a nice, easy, performant way to perform sanitization on
translated strings. Rather than requiring you to work with the internals of
kses, it's much closer to functions like esc_html
.
Security is only useful if it's also usable. For the most part, clean_html
can be used in exactly the same way developers are used to using other escaping
functions.
A quick example to demonstrate how easy it is:
<!-- Previously --> <p><?php _e( 'This is a terrific use of <code>WP_Error</code>.' ) ?></p> <!-- Secure version --> <p><?php print_clean_html( __( 'This is a terrific use of <code>WP_Error</code>.' ), 'code' ) ?></p>
Even if a malicious translator changed this to include a link to a spam site (or
worse), this would be caught and stripped by clean_html
.
Taking our original example from above, we can modify it to only allow a
tags:
$text = clean_html( sprintf( __( 'This is some text <a href="%1$s">with a link</a>'), 'http://example.com/' ), 'a' );
It's that easy. You can do this with multiple elements as well, using a comma-separated string or list of elements:
$text = clean_html( sprintf( __( 'This is <code>some</code> text <a href="%1$s">with a link</a>'), 'http://example.com/' ), 'a, code' // or array( 'a', 'code' ) );
If you need custom attributes, you can use kses-style attribute specifiers. These can be mixed too:
$text = clean_html( sprintf( __( 'This is <span class="x">some</span> text <a href="%1$s">with a link</a>'), 'http://example.com/' ), array( 'a', 'span' => array( 'class' => true, ), ) );
Performance Test
In a quick test, the string
'hello with a <a href="wak://example.com">malicious extra link!<///q><o>b'
was
run through both clean_html
(with only a
) and esc_html
with 10,000
iterations. While the two functions don't perform the same task, they're both
escaping functions, so it's useful to compare performance to understand whether
this approach can be used in production code.
In an unscientific trial, this gave figures of 0.96s for clean_html
and
1.07s for esc_html
for 10,000 trials each. This indicates that
clean_html
is at least on the order of other escaping functions.
Using this Library
Two steps to using this library:
- Add this library in as a git submodule.
- Load
clean-html.php
before you need to use it. We recommend inmu-plugins
, but you can also load it in viawp-config.php
if you want it earlier.
Done. Start using the function.