The origins of Unicode date back to 1987, but it wasn’t until the late ’90s that it became well known, and general adoption really picked up after the year 2000. That adoption was possible mainly thanks to UTF-8, the encoding (dating back to 1993, by the way) which provides full compatibility with the US-ASCII character set. Anyway, this is a history most of us know, and by now it’s clear to most that characters do not map to bytes anymore. Here’s a small Perl 5 example for this:
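A sketch of the idea (the sample word cité and the variable name are just examples):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                      # string literals in this source file are UTF-8
use Encode qw(encode);

my $word = "cité";             # four characters, as far as the eye can tell

print length($word), "\n";                    # length in characters
print length(encode('UTF-8', $word)), "\n";   # length in bytes, once encoded in UTF-8
```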
The output is (as expected):
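```
4
5
```

Four characters, five bytes once encoded in UTF-8. Now consider these two scripts, which apparently deal with the very same 4-character word (call them snoopy.pl and lucy.pl; in this sketch the accented letter is spelled with explicit escapes, which is precisely what makes the difference between the two literals visible):

```perl
# snoopy.pl
use strict;
use warnings;

my $snoopy = "cit\x{e9}";      # cité
print length($snoopy), "\n";
```

```perl
# lucy.pl
use strict;
use warnings;

my $lucy = "cite\x{301}";      # cité again
print length($lucy), "\n";
```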
The output of both these scripts is:
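```
4
5
```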
Ach! What happened here, with what is apparently the same 4-character string not having the same length?!? First of all, we should ditch the concept of character, which is way too vague (not to mention that in some contexts it still means a byte), and use the concepts of code point and grapheme instead. A code point is any “thing” the Unicode Consortium has assigned a code to, while a grapheme is the visual thing you actually see on the computer screen.
Both strings in our example have 4 graphemes. However, snoopy contains an é made of a single code point, latin small letter e with acute (U+00E9), while in lucy the accented e is made up of two code points: latin small letter e (U+0065) followed by combining acute accent (U+0301); since the accent is combining, it joins with the letter before it into a single grapheme.
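To see what is actually inside the two strings, something like this will do (a sketch using the core charnames module to print each code point along with its official name):

```perl
use strict;
use warnings;
use charnames ();

my %string = (
    snoopy => "cit\x{e9}",
    lucy   => "cite\x{301}",
);

for my $name ('snoopy', 'lucy') {
    print "$name:\n";
    # one line per code point, with its number and its Unicode name
    printf "  U+%04X %s\n", ord($_), charnames::viacode(ord $_)
        for split //, $string{$name};
}
```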
Comparison is a problem as well: the two strings will not compare as equal, and this might not be what you expect.
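For instance, in Perl 5 (reusing the two strings from above):

```perl
use strict;
use warnings;

my $snoopy = "cit\x{e9}";
my $lucy   = "cite\x{301}";

# eq compares code point by code point, so the two strings differ
print $snoopy eq $lucy ? "equal\n" : "different\n";   # prints "different"
```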
This is a non-problem in languages such as Perl 6:
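A sketch of how it looks there (.chars counts graphemes, while .NFD and .NFC give the decomposed and composed forms):

```perl6
my $snoopy = "cit\x[E9]";      # é as a single code point
my $lucy   = "cite\x[0301]";   # e followed by a combining acute accent

say $snoopy.chars, ' ', $lucy.chars;           # lengths in graphemes
say $snoopy eq $lucy;                          # strings compare grapheme by grapheme
say $snoopy.NFD.elems, ' ', $lucy.NFD.elems;   # code points after decomposition
say $snoopy.NFC.elems, ' ', $lucy.NFC.elems;   # code points after (re)composition
```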
The output should be:
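```
4 4
True
5 5
4 4
```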
We’re using a couple of normalization forms here. One is NFD (canonical decomposition), where all the code points are decomposed: in this case, the é always ends up made of two code points. The other is NFC (canonical decomposition followed by canonical composition), where you get a string whose characters are made of a single code point wherever possible; not every combining sequence can be represented as a single code point, so even in NFC form the number of graphemes might differ from the number of code points. In this case, the é ends up made of a single code point.
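For instance, an x with a combining low line (U+0332) under it has no precomposed code point, so it stays two code points even in NFC form. A quick Perl 5 check, using the core Unicode::Normalize module:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# one grapheme, but no single code point exists for it
my $s = "x\x{332}";
print length(NFC($s)), "\n";   # still 2
```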
In this specific case, since snoopy is fully composed and lucy is fully decomposed, you could (de)compose only one of the strings. This should, however, be avoided, since you usually don’t know what’s inside the strings you get - so always normalize both.
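In Perl 5, that means normalizing both sides before comparing, for instance with the core Unicode::Normalize module:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

my $snoopy = "cit\x{e9}";
my $lucy   = "cite\x{301}";

# bring both strings to the same normalization form before comparing
print NFC($snoopy) eq NFC($lucy) ? "equal\n" : "different\n";   # prints "equal"
```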
Please note that there’s much more behind normalization: you can take a look here for more information.
So it’s now clear enough how to get the length of a string in bytes, code points and graphemes. But what should be the default way of determining a string’s length? There’s no unique answer to this: most languages return the number of code points, while others, such as Perl 6, return the number of graphemes.
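In Perl 5, for instance, the three counts can be obtained along these lines (\X matches one extended grapheme cluster):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $lucy = "cite\x{301}";

my $bytes       = length(encode('UTF-8', $lucy));   # 6
my $code_points = length($lucy);                    # 5
my $graphemes   = () = $lucy =~ /\X/g;              # 4

print "$bytes $code_points $graphemes\n";
```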
If you have a database field which can hold up to a certain number of characters, that probably means code points, so you should count those when checking the length of a string. If you are measuring the length of some user input, you likely want graphemes: a user would not understand a “please enter a maximum of 4 characters” error when entering cité. The length in bytes matters when you are working with memory or disk space: of course, the length in bytes should be computed on the string encoded in the character set you plan to use.
It’s worth noting that an approach such as “well, I’ll just write cité in my code instead of using all those ugly code points” is not recommended. First of all, most of the time you are not the one writing the string: you get it as input from somewhere else. Then, by writing this code:
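(a sketch: the é in the literal below is meant to be the decomposed form, e (U+0065) followed by a combining acute accent (U+0301))

```perl
use strict;
use warnings;
use utf8;

# this literal was pasted in, and its é is e + a combining acute accent
my $word = "cité";
print length($word), "\n";
```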
I’ve been able to get this result:
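```
5
```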
You should be able to copy and paste the above code and get an identical result, because my browser and blog software didn’t normalize it (which is scary enough, but useful in this particular case).