Why (and how to) convert common UTF-8 characters to ASCII


There are quotation marks and quotation marks, apostrophes and apostrophes, and so on: they're just not all alike. A web application sometimes needs to deal with this.

In ASCII you only have one apostrophe (single quotation mark), one backtick, one double quotation mark, etc. In Unicode (and therefore in UTF-8) you have, of course, the same ones as ASCII plus many which are outside the ASCII set. These allow for more eye-candy characters, and also for distinctions such as the one between the left double quotation mark and the right double quotation mark.

Along the same lines, the horizontal ellipsis is a single character in Unicode (U+2026), while in ASCII you have to type three dots (...) to get the same thing.

All these new and good-looking characters are nice, but they can also prove problematic when you are using some piece of software which doesn't work that well with UTF-8. For instance, htmldoc works with latin1: when you convert from UTF-8 to ISO-8859-1 you just lose those characters. In theory you can just not use them, but if you have a web application that takes its input from a form, users can paste those characters in (this is especially common when pasting from Microsoft Word or other word processing software).
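
To make the loss concrete, here is a minimal sketch in Python (used purely for illustration; the sample string is made up). It says nothing about how htmldoc itself handles its input; it only shows what happens to these characters in a lossy UTF-8 to ISO-8859-1 conversion, while an accented letter survives.

```python
# Made-up sample of text pasted from a word processor:
# curly double quotes, an accented letter, an ellipsis, a curly apostrophe.
text = "\u201cCaf\u00e9\u201d \u2026 it\u2019s"

# A strict conversion fails, because the curly quotes and the ellipsis
# have no code point in ISO-8859-1.
try:
    text.encode("iso-8859-1")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# A lossy conversion replaces each unencodable character with "?".
# The accented letter (U+00E9) exists in ISO-8859-1 and is kept as byte 0xE9.
lossy = text.encode("iso-8859-1", errors="replace")

print(strict_ok)
print(lossy)
```

Every typographic character ends up as a question mark, which is exactly the kind of output you don't want to hand to users.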

So, you need to get rid of them, that is, convert them to their ASCII equivalents so that you still see something useful. If you use Catalyst, there is Catalyst::Plugin::Params::Demoronize, by Mike Eldridge, a plugin which does the proper conversion of form parameters (it can also deal with the same problem in the Windows-1252 charset if necessary).
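
The general idea of such a conversion can be sketched as a small lookup table. The following Python snippet is only an illustration under stated assumptions: it is not the plugin's code, the function name `downgrade_punctuation` and the table `ASCII_EQUIVALENTS` are hypothetical, and the table is a hand-picked subset (the dash entries in particular are my own additions, not taken from the plugin).

```python
# Hypothetical, hand-picked mapping of typographic characters to ASCII.
# A real table would likely cover more characters.
ASCII_EQUIVALENTS = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark (curly apostrophe)
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash (illustrative choice)
    "\u2014": "--",   # em dash (illustrative choice)
    "\u2026": "...",  # horizontal ellipsis
}
_TABLE = str.maketrans(ASCII_EQUIVALENTS)


def downgrade_punctuation(text: str) -> str:
    """Replace the mapped characters with ASCII; leave everything else untouched."""
    return text.translate(_TABLE)


sample = "\u201cCaf\u00e9\u201d \u2026 it\u2019s"
cleaned = downgrade_punctuation(sample)

# The accented letter was left alone, so it can still go to latin1 without loss.
latin1_bytes = cleaned.encode("iso-8859-1")

print(cleaned)
print(latin1_bytes)
```

Because only the listed characters are touched, accented letters stay as they are and can still be converted to latin1 afterwards without losing anything.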

This plugin works just fine for me, even though it suffers from a couple of minor problems. First, it still uses NEXT instead of MRO::Compat, but I have already submitted a patch to the author for that. Second, the name just sucks, even though it has historical reasons; that said, I still can't think of a better one. ;-)

2 Comments

For a more general-purpose sort of ASCII downgrade, check out Text::Unidecode. It's pretty cool.

Hi!

I'm already using Text::Unidecode in other situations. That module, however, converts to ASCII everything that lies outside of it, which is not what I want here: accented letters, which I can then convert to latin1, should remain in UTF-8.

Michele.

About this Entry

This page contains a single entry by Michele Beltrame published on February 18, 2010 11:33 AM.