ASCII transliteration of Unicode strings

It's sometimes useful, or even necessary, to represent strings containing accented letters, or other letters outside the US-ASCII set, as pure ASCII. For instance:

perché   ==> perche

This transliteration might be desirable for various reasons, mainly to use the string in places where only ASCII is supported (or preferred). Some folks call this process "deaccent", as it's commonly used to remove accents from words in order to make comparisons possible. In practice, accents are not the only problem, and you'll want to handle things like:

straße   ==> strasse
Tromsø   ==> Tromso

There's a CPAN module which can help here: Text::Unidecode by Sean M. Burke.

use utf8;
use Modern::Perl;
use Text::Unidecode;

for my $word (qw/Tromsø perché straße/) {
    # ASCII representation
    say unidecode($word);
}

This will print, as expected:

Tromso
perche
strasse

As the module documentation itself points out, the transliteration is not meticulous, so the results are not always good. However, Text::Unidecode works nicely with Western European languages, along with some others.
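
Since making comparisons possible is one of the main reasons for doing this, here is a minimal sketch of a loose comparison built on unidecode. The helper names ascii_key and same_word are hypothetical and not part of Text::Unidecode, and lowercasing the result is an extra assumption made only for this example:

```perl
use strict;
use warnings;
use utf8;
use feature 'say';
use Text::Unidecode;

# Hypothetical helper (not part of Text::Unidecode): build an
# ASCII-only, lowercase key suitable for loose comparisons.
sub ascii_key {
    my ($string) = @_;
    return lc(unidecode($string));
}

# Two spellings are treated as the same word when their keys match.
sub same_word {
    my ($left, $right) = @_;
    return ascii_key($left) eq ascii_key($right);
}

say same_word('Perché', 'perche')  ? 'match' : 'no match';
say same_word('Straße', 'STRASSE') ? 'match' : 'no match';
say same_word('Tromsø', 'Oslo')    ? 'match' : 'no match';
```

The first two lines should print "match" and the last one "no match". Keep in mind that such a key is lossy by design, so distinct words can end up with the same key.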

4 Comments

If all you want is to strip accents from Unicode, or convert Unicode to ASCII equivalents, I think it would be better to use modules intended for that job, like Text::Unaccent (which uses the unac C library to strip accents) or Text::Unidecode (which does US-ASCII transliterations of Unicode text).

See also the "This Is America, Take Your Unicode Somewhere Else" blog post by Ted Dziuba.

Did you compare it with Text::Unaccent?

Just curious.

BTW, s/but Sean M. Burke/by Sean M. Burke/

Hi Pedro!

Text::Unaccent does its (good) work removing accents. That is, out of:

perché
Tromsø
straße

you get:

perche
Tromso
straße

The 3 Unicode characters are actually:

LATIN SMALL LETTER E WITH ACUTE
LATIN SMALL LETTER O WITH STROKE
LATIN SMALL LETTER SHARP S

so unaccent does a good job of decomposing the first two, removing the "ACUTE" and the "STROKE", but can't do anything for the SHARP S, which is a character of its own and can't be decomposed.
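
For reference, the character names listed above can be looked up with the core charnames module. This is only a minimal sketch: the helper name char_name is hypothetical, and the characters are written as code point escapes so the script does not depend on the encoding of its own source file:

```perl
use strict;
use warnings;
use feature 'say';
use charnames ();

# Hypothetical helper: return the official Unicode name of one character.
sub char_name {
    my ($char) = @_;
    return charnames::viacode(ord($char));
}

# U+00E9 (e with acute), U+00F8 (o with stroke), U+00DF (sharp s)
for my $char ("\x{E9}", "\x{F8}", "\x{DF}") {
    say sprintf('U+%04X %s', ord($char), char_name($char));
}
```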

Michele.

Also a reply for jnareb.openid.pl...

You are quite right: the article (thanks to my friend Gianni, who explained a lot of things to me) is now fixed so that it no longer refers to Unicode::Normalize. Sorry you had to read the "bad" version. ;-)

Michele.

About this Entry

This page contains a single entry by Michele Beltrame published on October 15, 2009 9:01 AM.
