There are quotation marks and quotation marks, apostrophes and apostrophes, etc.: they're just not all alike. A web application (sometimes) needs to deal with this situation.
In ASCII you only have one single quotation mark (which doubles as the apostrophe), one backtick, one double quotation mark, etc. In Unicode, and therefore in UTF-8, you have (of course) the same ones as ASCII plus some which are outside the ASCII set. These allow for better-looking typography, and also for distinctions between (for example) the left double quotation mark (U+201C) and the right double quotation mark (U+201D).
Along the same lines, the horizontal ellipsis is a single character in Unicode (U+2026), whereas in ASCII you have to type three dots (...) to get the same effect.
All these new and good-looking characters are nice, but they can also prove problematic when you are using some piece of software which doesn't work that well with UTF-8. For instance, htmldoc works with latin1: when you convert from UTF-8 to ISO-8859-1 you just lose those characters, because ISO-8859-1 has no code points for them. In theory you can just not use them, but if you have a web application that takes its input from a form, users could just paste those characters in (it's especially common when pasting from Microsoft Word or other word processing software).
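To make the loss concrete, here is a small Python sketch of what happens when text containing curly quotes and an ellipsis is forced into ISO-8859-1. It only illustrates the encoding step in general; it says nothing about how htmldoc itself behaves, and the sample string and variable names are made up for the example.

```python
# Sample input as a user might paste it from a word processor:
# left/right double quotes, an ellipsis and a right single quote.
text = "\u201cHello\u201d \u2026 it\u2019s"

# A strict conversion refuses the text outright, because ISO-8859-1
# has no code points for these characters.
try:
    text.encode("latin-1")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# A lenient conversion succeeds, but the characters are either
# replaced by question marks or silently dropped.
lossy = text.encode("latin-1", errors="replace").decode("latin-1")
dropped = text.encode("latin-1", errors="ignore").decode("latin-1")

print(strict_ok)  # False
print(lossy)      # ?Hello? ? it?s
print(dropped)    # Hello  its
```

Either way the punctuation is gone, which is why converting it to something ASCII can represent is the more useful option.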
So, you need to get rid of them, that is, convert them to their ASCII equivalents so that you still see something useful. If you use Catalyst, there is Catalyst::Plugin::Params::Demoronize, by Mike Eldridge, a plugin which does the proper conversion of form parameters (it can also deal with the same problem for the Windows-1252 charset if necessary).
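The idea behind this kind of conversion can be sketched in a few lines of Python. To be clear, this is not the plugin's code, and the mapping table below is my own assumption, not the one Catalyst::Plugin::Params::Demoronize ships with; the names `ASCII_EQUIVALENTS`, `to_ascii_punctuation` and `clean_params` are hypothetical.

```python
# Hypothetical mapping from "smart" punctuation to ASCII equivalents.
# This is an illustrative subset, not the plugin's actual table.
ASCII_EQUIVALENTS = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # horizontal ellipsis
}

# str.maketrans accepts single-character keys and string values of any length.
_TABLE = str.maketrans(ASCII_EQUIVALENTS)


def to_ascii_punctuation(value: str) -> str:
    """Replace the mapped characters and leave everything else untouched."""
    return value.translate(_TABLE)


def clean_params(params: dict) -> dict:
    """Apply the conversion to every value of a flat dict of form parameters."""
    return {key: to_ascii_punctuation(value) for key, value in params.items()}


print(clean_params({"title": "A \u2013 B", "body": "\u201cWait\u2026\u201d"}))
# {'title': 'A - B', 'body': '"Wait..."'}
```

Doing the replacement on the incoming parameters, before anything else touches them, means the rest of the application never sees the problematic characters.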
This plugin works just fine for me, even though it suffers from a couple of minor problems. First, it still uses NEXT as opposed to MRO::Compat, but I have already submitted a patch to the author for that. Second, the name just sucks, even though there are historical reasons for it; that said, I'm still not able to think of a better one. ;-)