Sometimes, however, it is desirable to permit a limited dialect of HTML tags. To that end, it is necessary
to sanitize the input by removing only potentially malicious tags and attributes (or, because security
is easier to achieve that way, by allowing only tags and attributes that cannot
be used maliciously).
Some applications take the approach of using a proprietary markup language instead of HTML. A similar
topic was discussed in Chapter 6 in the section “Using a Custom Markup Language to Generate
SE-Friendly HTML,” but to a different end: enhancing on-page HTML optimization. It can also
be used to ensure that content is sanitized. In this case, you would escape
or strip the HTML, then use a translation function and a limited set of proprietary tags
to permit only certain functionality. This is
the approach of many forum web applications such as vBulletin and phpBB. Indeed, for specific
applications where users are constantly engaged in dialog and willing to learn the proprietary markup
language, this makes sense. However, for such things as comments or a guest book, HTML provides
a common denominator that most users know, and allowing a restrictive dialect is probably more
prudent with regard to usability. That is the solution discussed here.
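For comparison, the proprietary-markup approach described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name renderCustomMarkup() and the bracket tags [b] and [i] are hypothetical, chosen to resemble the markup that forum applications commonly use, and are not taken from this chapter.

```php
<?php
// Hypothetical helper: escape all HTML, then translate a small,
// fixed set of bracket tags into their HTML equivalents.
function renderCustomMarkup(string $input): string
{
    // Neutralize every HTML tag the user typed.
    $escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

    // Translate only the (assumed) proprietary tags [b] and [i].
    $translations = [
        '#\[b\](.*?)\[/b\]#is' => '<b>$1</b>',
        '#\[i\](.*?)\[/i\]#is' => '<i>$1</i>',
    ];

    return preg_replace(
        array_keys($translations),
        array_values($translations),
        $escaped
    );
}
```

Because everything is escaped first, the only HTML that can reach the page is what the translation table emits, which is why this approach is easy to secure but requires users to learn the custom tags.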
As usual, in order to keep your code tidy, group the HTML sanitizing functionality into a separate file.
Go through the following quick exercise, where you create and use this new little library. The code is
Sanitizing User Input
Create a new file named
folder, and write
// sanitizes the HTML code in $inputHTML
$allowed_tags = array('<h1>', '<b>', '<i>', '<a>',
'<ul>', '<li>', '<pre>', '<hr>',
$_allowed_tags = implode('', $allowed_tags);
$inputHTML = strip_tags($inputHTML, $_allowed_tags);
return preg_replace('#<(.*?)>#ise', "'<' . removeBadAttributes('\\1') . '>'" ,
// removes the unallowed attributes from $inputHTML
// define the list of unallowed attributes
$bad_attributes = 'onerror|onmousemove|onmouseout|onmouseover|' .
// remove the bad attributes and return the result
return stripslashes(preg_replace("#($bad_attributes)(\s*)(?==)#is" ,
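The listing above relies on the /e modifier of preg_replace(), which was deprecated in PHP 5.5 and removed in PHP 7. A minimal sketch of the same idea for current PHP versions follows, using preg_replace_callback() instead. Several details are assumptions rather than the chapter's code: the name sanitizeHTML() is hypothetical, the tag and attribute lists contain only the entries visible above (a real list would be longer), and the replacement text 'SANITIZED' is a placeholder.

```php
<?php
// Hypothetical name; sanitizes the HTML code in $inputHTML.
function sanitizeHTML(
    string $inputHTML,
    array $allowedTags = ['<h1>', '<b>', '<i>', '<a>', '<ul>', '<li>', '<pre>', '<hr>']
): string {
    // strip_tags() keeps only the tags listed in its second argument.
    $inputHTML = strip_tags($inputHTML, implode('', $allowedTags));

    // Pass the contents of every remaining tag through removeBadAttributes().
    // The callback takes the place of the removed /e modifier.
    return preg_replace_callback(
        '#<(.*?)>#is',
        function (array $matches): string {
            return '<' . removeBadAttributes($matches[1]) . '>';
        },
        $inputHTML
    );
}

// Renames the unallowed attributes found inside a single tag.
function removeBadAttributes(string $tagContents): string
{
    // Only the attributes visible in the listing above (assumed incomplete).
    $badAttributes = 'onerror|onmousemove|onmouseout|onmouseover';

    // Match a bad attribute name that is followed by "=" and rename it,
    // so that the browser no longer treats it as an event handler.
    return preg_replace(
        "#($badAttributes)(\s*)(?==)#is",
        'SANITIZED', // assumed placeholder text
        $tagContents
    );
}
```

With these definitions, sanitizeHTML('<b onmouseover="evil()">hi</b>') keeps the b element but renames its onmouseover attribute, while a tag outside the allowed list, such as script, is removed by strip_tags() before the attribute check runs.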
Chapter 8: Black Hat SEO