Graphemes, code points, characters and bytes

The origins of Unicode date back to 1987, but it wasn't until the late '90s that it became well known, and general adoption really picked up after 2000. General adoption was possible mainly thanks to UTF-8, the encoding (dating back to 1993, by the way) which provides full compatibility with the US-ASCII character set. Anyway, this is a story that most of us know, and by now it's clear to most that characters do not map to bytes anymore. Here's a small Perl 5 example of this:

# Perl 5
use v5.20;
use Encode qw/encode/;

# Encode output as UTF-8 so the accented letter prints correctly
binmode STDOUT, ':encoding(UTF-8)';

my $snoopy = "cit\N{LATIN SMALL LETTER E WITH ACUTE}";
say '==> ' . $snoopy;
say 'Characters (code points): ' . length $snoopy;
say 'Bytes in UTF-8: ' . length encode('UTF-8', $snoopy);
# The UTF-16 encoder prepends a 2-byte byte order mark (BOM)
say 'Bytes in UTF-16: ' . length encode('UTF-16', $snoopy);
say 'Bytes in ISO-8859-1: ' . length encode('ISO-8859-1', $snoopy);

The output is (as expected):

==> cité
Characters (code points): 4
Bytes in UTF-8: 5
Bytes in UTF-16: 10
Bytes in ISO-8859-1: 4

OK, this is well known. However, if you assume that you are safe once you think in characters instead of bytes, well, you're wrong. Here are two examples, one in JavaScript (ECMAScript) and one in Perl 5:

/* JavaScript */
var snoopy = "cit\u00E9";
var lucy = "cit\u0065\u0301";

// .length counts UTF-16 code units; for these strings (no code points
// above U+FFFF) that is the same as the number of code points
window.document.write('Code points ' + snoopy + ': ' + snoopy.length + '<br>');
window.document.write('Code points ' + lucy + ': ' + lucy.length + '<br>');

# Perl 5
use v5.20;
binmode STDOUT, ':encoding(UTF-8)';

my $snoopy = "cit\N{LATIN SMALL LETTER E WITH ACUTE}";
my $lucy = "cit\N{LATIN SMALL LETTER E}\N{COMBINING ACUTE ACCENT}";

say "Code points $snoopy: " . length $snoopy;
say "Code points $lucy: " . length $lucy;

The output of both these scripts is:

Code points cité: 4
Code points cité: 5

Ach! What happened here, with the same string (apparently of 4 characters) having two different lengths?!? First of all, we should ditch the concept of character, which is way too vague (not to mention that in some contexts it still means a byte), and use the concepts of code point and grapheme. A code point is any "thing" that the Unicode Consortium assigned a code to, while a grapheme (formally, a grapheme cluster) is the visual unit you actually see on the computer screen.

Both strings in our example have 4 graphemes. However, snoopy contains an é using a latin small letter e with acute (U+00E9), while in lucy the accented e is made up using two different code points: latin small letter e (U+0065) and combining acute accent (U+0301); since the accent is combining, it joins with the letter before it in a single grapheme.
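To make the difference visible, here is a minimal sketch that lists the code points of each string. The helper name codePoints is made up for this illustration; Array.from and codePointAt are standard ECMAScript 6, and padStart assumes a slightly newer runtime.

```javascript
// Hypothetical helper (not from the article): list the code points
// of a string as U+XXXX labels.
function codePoints(str) {
  // Array.from iterates a string by code point, not by UTF-16 code unit.
  return Array.from(str, function (ch) {
    var hex = ch.codePointAt(0).toString(16).toUpperCase();
    return 'U+' + hex.padStart(4, '0');
  });
}

var snoopy = "cit\u00E9";
var lucy = "cit\u0065\u0301";

console.log(codePoints(snoopy).join(' ')); // U+0063 U+0069 U+0074 U+00E9
console.log(codePoints(lucy).join(' '));   // U+0063 U+0069 U+0074 U+0065 U+0301
```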

Comparison is a problem as well: the two strings will not be equal when compared, and this might not be what you expect.

This is a non-problem in languages such as Perl 6:

# Perl 6
# These are like "length" in JavaScript and Perl 5
say $snoopy.codes;     # 4
say $lucy.codes;       # 5

# These actually count the graphemes
say $snoopy.graphs;    # 4
say $lucy.graphs;      # 4

If you don't have Perl 6 on hand, you need to normalize strings, which means bringing them to the same Unicode normalization form. In JavaScript this is possible only starting from ECMAScript 6. Even though current browsers (Firefox 34, Chrome 39 at the time of this article) do not fully support it (not surprisingly, as the standard will be finalized in 2015), Unicode normalization is (luckily) already there. Let's see some examples:

/* JavaScript */
var snoopy = "cit\u00E9";
var lucy = "cit\u0065\u0301";

window.document.write('NFC code points ' + snoopy + ': ' + snoopy.normalize('NFC').length + '<br>');
window.document.write('NFC code points ' + lucy + ': ' + lucy.normalize('NFC').length + '<br>');
window.document.write('NFD code points ' + snoopy + ': ' + snoopy.normalize('NFD').length + '<br>');
window.document.write('NFD code points ' + lucy + ': ' + lucy.normalize('NFD').length + '<br>');

# Perl 5
use Unicode::Normalize;

# $snoopy and $lucy are the strings defined in the previous example
say "NFC code points $snoopy: " . length NFC($snoopy);
say "NFC code points $lucy: " . length NFC($lucy);
say "NFD code points $snoopy: " . length NFD($snoopy);
say "NFD code points $lucy: " . length NFD($lucy);

The output should be:

NFC code points cité: 4
NFC code points cité: 4
NFD code points cité: 5
NFD code points cité: 5

We're using a couple of normalization forms here. One is NFD (canonical decomposition), where all the code points are decomposed: in this case, the é always ends up made of 2 code points. The second one is NFC (canonical decomposition followed by canonical composition), where each character is composed into a single code point wherever possible: in this case, the é ends up made of one code point. Note the "wherever possible": not every combining sequence has a precomposed code point, so even in NFC the number of code points can differ from the number of graphemes.
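The caveat about NFC can be checked with a short sketch. The example string is my own choice, not from the article: an x followed by a combining acute accent, for which Unicode defines no precomposed code point.

```javascript
// "e" + combining acute has a precomposed form (U+00E9);
// "x" + combining acute does not, so NFC cannot shrink it.
var eAcute = "e\u0301";
var xAcute = "x\u0301";

console.log(eAcute.normalize('NFC').length); // 1
console.log(xAcute.normalize('NFC').length); // 2
console.log(xAcute.normalize('NFD').length); // 2
```

Both strings are a single grapheme on screen, yet only the first one becomes a single code point in NFC.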

In this specific case, since snoopy is fully composed and lucy is fully decomposed, you could (de)compose only one of the strings. This should, however, be avoided, since you likely don't know what's in the strings you get, so always normalize both.
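As a minimal sketch of "always normalize both", a comparison could look like the following. The function name sameText is hypothetical, and choosing NFC over NFD here is arbitrary: what matters is that both sides go through the same form.

```javascript
// Hypothetical helper: compare two strings after bringing both to NFC.
function sameText(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

var snoopy = "cit\u00E9";
var lucy = "cit\u0065\u0301";

console.log(snoopy === lucy);        // false
console.log(sameText(snoopy, lucy)); // true
```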

Please note that there's much more behind normalization: you can take a look here for more information.

So it's now clear enough how to find the length of a string in bytes, code points and graphemes. But what should be the default way of determining a string's length? There's no single answer to this: most languages return the number of code points (or, like JavaScript, of UTF-16 code units), while others such as Perl 6 return the number of graphemes.

If you have a database field which can hold up to a certain number of characters, it probably means code points, so you should use those to check the length of a string. If you are determining the length of some user input, you likely want to use graphemes: a user would not understand a "please enter a maximum of 4 characters" error when entering cité. The length in bytes is necessary when you are working with memory or disk space: of course, it should be measured on the string encoded in the character encoding you plan to use.
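The three measures can be computed side by side in JavaScript. This is only a sketch under a few assumptions: the helper name lengths is made up, UTF-8 is assumed as the target encoding, and the runtime must provide TextEncoder and Intl.Segmenter (both are much newer than the browsers mentioned above).

```javascript
// Hypothetical helper returning the three lengths discussed above.
function lengths(str) {
  var segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return {
    // Size of the string once encoded as UTF-8 (assumed target encoding)
    bytesUtf8: new TextEncoder().encode(str).length,
    // Array.from splits by code point, not by UTF-16 code unit
    codePoints: Array.from(str).length,
    // One segment per grapheme cluster
    graphemes: Array.from(segmenter.segment(str)).length
  };
}

console.log(lengths("cit\u00E9"));       // { bytesUtf8: 5, codePoints: 4, graphemes: 4 }
console.log(lengths("cit\u0065\u0301")); // { bytesUtf8: 6, codePoints: 5, graphemes: 4 }
```

Only the grapheme count is the same for both spellings, which is why it is the one to show to users.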

It's worth noting that an approach such as "well, I'll just write cité in my code instead of using all those ugly code points" is not recommended. First of all, most of the time you are not the one writing the text: you take input from somewhere. Then, by writing this code:

var str1 = "cité";
var str2 = "cité";

window.document.write('Code points ' + str1 + ': ' + str1.length + '<br>');
window.document.write('Code points ' + str2 + ': ' + str2.length + '<br>');

I've been able to get this result:

Code points cité: 4
Code points cité: 5

You should be able to copy and paste the above code and get an identical result, because my browser and blog software didn't normalize it (which is scary enough, but useful in this particular case).

2 Comments

It's worth noting that the number of codepoints in NFC is not always going to be the number of graphemes — it depends mostly on whether all of the code point sequences in your string are representable as precomposed characters. It's not a thing to be counted on.

About this Entry

This page contains a single entry by Michele Beltrame published on December 12, 2014 11:45 AM.
