The origins of Unicode date back to 1987, but it wasn't until the late '90s that it became well known, and general adoption really picked up after the year 2000. General adoption was possible mainly thanks to UTF-8, the encoding (dating back to 1993, by the way) which provided full compatibility with the US-ASCII character set. Anyway, this is a history most of us know, and by now it's clear to most that characters no longer map to bytes. Here's a small Perl 5 example:

# Perl 5
use v5.20;
use Encode qw/encode/;

binmode STDOUT, ':encoding(UTF-8)';   # so the accented e prints correctly

my $snoopy = "cit\N{LATIN SMALL LETTER E WITH ACUTE}";
say '==> ' . $snoopy;

# length() counts code points on a character string...
say 'Characters (code points): ' . length $snoopy;

# ...and bytes on an encoded (byte) string
say 'Bytes in UTF-8: ' . length encode('UTF-8', $snoopy);
say 'Bytes in UTF-16: ' . length encode('UTF-16', $snoopy);
say 'Bytes in ISO-8859-1: ' . length encode('ISO-8859-1', $snoopy);
The output is (as expected):
==> cité
Characters (code points): 4
Bytes in UTF-8: 5
Bytes in UTF-16: 10
Bytes in ISO-8859-1: 4

OK, this is well known. However, if you assume that thinking in characters instead of bytes keeps you safe, well, you're wrong. Here are two examples, one in JavaScript (ECMAScript) and one in Perl 5:

/* JavaScript */
var snoopy = "cit\u00E9";
var lucy = "cit\u0065\u0301";

window.document.write('Code points ' + snoopy + ': ' + snoopy.length);
window.document.write('Code points ' + lucy + ': ' + lucy.length);

# Perl 5
my $snoopy = "cit\N{LATIN SMALL LETTER E WITH ACUTE}";
my $lucy = "cit\N{LATIN SMALL LETTER E}\N{COMBINING ACUTE ACCENT}";

say "Code points $snoopy: " . length $snoopy;
say "Code points $lucy: " . length $lucy;

The output of both these scripts is:

Code points cité: 4
Code points cité: 5

Ach! What happened here, with two apparently identical 4-character strings having different lengths?!? First of all, we should ditch the concept of character, which is way too vague (not to mention that in some contexts it still means a byte), and use the concepts of code point and grapheme instead. A code point is any "thing" the Unicode Consortium has assigned a numeric code to, while a grapheme is the visual unit you actually see on the computer screen.

Both strings in our example have 4 graphemes. However, snoopy contains an é using LATIN SMALL LETTER E WITH ACUTE (U+00E9), while in lucy the accented e is made up of two different code points: LATIN SMALL LETTER E (U+0065) and COMBINING ACUTE ACCENT (U+0301); since the accent is combining, it joins with the letter before it into a single grapheme.
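
You can see this easily by dumping the code points of each string; here's a quick Perl 5 sketch (continuing the example above):

# Perl 5 - show the code points of each string
say join ' ', map { sprintf 'U+%04X', ord } split //, $snoopy;
# U+0063 U+0069 U+0074 U+00E9
say join ' ', map { sprintf 'U+%04X', ord } split //, $lucy;
# U+0063 U+0069 U+0074 U+0065 U+0301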

Comparison is a problem as well: the two strings will not compare as equal, and this might not be what you expect.
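
A quick check in Perl 5 (continuing the example above) confirms it:

# Perl 5
say $snoopy eq $lucy ? 'equal' : 'not equal';   # not equal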

This is a non-problem in languages such as Perl 6:

# Perl 6
# These are like "length" in JavaScript and Perl 5
say $snoopy.codes;     # 4
say $lucy.codes;       # 5

# These actually count the graphemes
say $snoopy.graphs;    # 4
say $lucy.graphs;      # 4

If you don't have Perl 6 at hand, you need to normalize the strings, which means bringing them to the same Unicode normalization form. In JavaScript this is possible only starting from ECMAScript 6. Even though current browsers (Firefox 34, Chrome 39 at the time of this article) do not fully support it (not surprisingly, as the standard will be finalized in 2015), Unicode normalization is (luckily) already there. Let's see some examples:

/* JavaScript */
window.document.write('NFC code points ' + snoopy + ': ' + snoopy.normalize('NFC').length);
window.document.write('NFC code points ' + lucy + ': ' + lucy.normalize('NFC').length);
window.document.write('NFD code points ' + snoopy + ': ' + snoopy.normalize('NFD').length);
window.document.write('NFD code points ' + lucy + ': ' + lucy.normalize('NFD').length);

# Perl 5
use Unicode::Normalize;
say "NFC code points $snoopy: " . length NFC($snoopy);
say "NFC code points $lucy: " . length NFC($lucy);
say "NFD code points $snoopy: " . length NFD($snoopy);
say "NFD code points $lucy: " . length NFD($lucy);

The output should be:

NFC code points cité: 4
NFC code points cité: 4
NFD code points cité: 5
NFD code points cité: 5

We're using a couple of normalization forms here. One is NFD (canonical decomposition), where all the code points are decomposed: in this case, the é always ends up made of 2 code points. The other is NFC (canonical decomposition followed by canonical composition), where characters are composed into single code points where possible: in this case, the é ends up made of one code point. Note the "where possible": not all combining sequences can be represented as a single code point, so even in NFC form the number of graphemes may differ from the number of code points.

In this specific case, since snoopy is fully composed and lucy is fully decomposed, you could (de)compose only one of the strings. This should, however, be avoided, since you likely don't know what's inside the strings you receive - so always normalize both.
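
For instance, an equality check in Perl 5 then becomes (a sketch, using the Unicode::Normalize module loaded above):

# Perl 5
say NFC($snoopy) eq NFC($lucy) ? 'equal' : 'not equal';   # equal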

Please note that there's much more behind normalization: you can take a look here for more information.

So it's now clear enough how to determine the length of a string in bytes, code points and graphemes. But what should be the default way of determining a string length? There's no unique answer: most languages return the number of code points, while others, such as Perl 6, return the number of graphemes.

If you have a database field which can hold up to a certain number of characters, that probably means code points, so you should use those to check the length of a string. If you are measuring user input, you likely want graphemes: a user would not understand a "please enter a maximum of 4 characters" error when entering cité. The length in bytes is what you need when working with memory or disk space: of course, it should be determined on the string encoded in the character set you plan to use.
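
In Perl 5, for instance, you can count graphemes with the \X regex escape, which matches an extended grapheme cluster (a quick sketch):

# Perl 5 - count graphemes
my $graphemes = () = $lucy =~ /\X/g;
say $graphemes;   # 4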

It's worth noting that an approach such as "well, I'll just write cité in my code instead of using all those ugly code points" is not recommended. First of all, most of the time you are not the one writing the string: you take input from somewhere. Then, by writing this code:

var str1 = "cité";
var str2 = "cité";

window.document.write(str1 + ' - ' + str1.length + '<br/>');
window.document.write(str2 + ' - ' + str2.length + '<br/>');

I've been able to get this result:

cité - 4
cité - 5

You should be able to copy and paste the above code and get an identical result, because my browser and blog software didn't normalize it (which is scary enough, but useful in this particular case).
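
If you need to compare strings like these in JavaScript, the same normalization trick applies (a sketch, using the ES6 normalize() method shown earlier):

/* JavaScript */
var equal = str1.normalize('NFC') === str2.normalize('NFC');   // true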

And here comes the 2012 diary as well. This time we chose the French coast of the Channel: from Cognac up to Dunkirk, with a few detours.

If my customers read this, they won't remain customers for long. :-) But then, it's roughly the truth: my production servers tend to be a horrible mess, and I sometimes wonder how they manage to work without too many issues. Or, at least, that used to be the truth.

Until very recently I used sometimes Apache and sometimes lighttpd, relying on FastCGI to make them talk to the Catalyst web applications. I was at least wise enough not to have the FastCGI processes spawned by the web server, but to manage them myself instead: I sometimes used daemontools and sometimes Gentoo's init system for that.

At one point I decided I wanted something which:

  1. was more straightforward to manage
  2. standardized all my production deployments
  3. consumed less memory
  4. could maybe provide a bit of speed-up

After a bit of research I decided to try nginx as the web server and uWSGI to manage application start and stop.

Configuration was all in all fairly easy, but there were a couple of caveats, so I'll go through the entire process.

uWSGI

uWSGI is a great and lightning-fast piece of software which can be used to spawn processes for any application supporting the PSGI interface. Catalyst supports it out of the box.

You should find a myapp.psgi file in the top directory of your application. If it doesn't exist (mainly because you created your app before Catalyst began to support PSGI), you can easily create it yourself:

use strict;
use warnings;

use lib './lib';
use MyApp;

# Wrap the application in the default Plack middlewares
# and return the PSGI code reference
my $app = MyApp->apply_default_middlewares(MyApp->psgi_app);
$app;

uWSGI comes pre-packaged for many distributions, or can be downloaded and compiled. Once you have it installed, you can launch your application as follows:

uwsgi --master --daemonize /var/log/uwsgi/myapp/log --plugins psgi --socket 127.0.0.1:8787 \
    --processes 2 --psgi /usr/local/catalyst/MyApp/myapp.psgi \
    --pidfile /var/run/uwsgi_myapp/myapp.pid

Please note that uWSGI has a ton of options, so you should take a look at the documentation. The example above launches a master process which then spawns 2 worker processes, which are the instances of your application (--psgi /usr/local/catalyst/MyApp/myapp.psgi). The server is bound to a TCP socket (localhost, port 8787). The remaining options tell uWSGI to run as a daemon, to keep a log file and to write the process PID to a file.

You can use threads instead of processes if you wish, or even both.
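
For instance, something like this (just a sketch; --threads is a standard uWSGI option) spawns 2 worker processes with 4 threads each:

# 2 worker processes with 4 threads each (sketch)
uwsgi --master --daemonize /var/log/uwsgi/myapp/log --plugins psgi --socket 127.0.0.1:8787 \
    --processes 2 --threads 4 --psgi /usr/local/catalyst/MyApp/myapp.psgi \
    --pidfile /var/run/uwsgi_myapp/myapp.pid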

The operating system's init system is an ideal candidate for launching the uWSGI master processes: Gentoo Linux, for instance, has a nice uWSGI configuration system which is straightforward to use (even though I had to patch it a bit in order for it to work properly for my use case).

nginx

Fast and with a minimal memory footprint, with uwsgi support out of the box, nginx is a great web server. It is also surprisingly easy to configure, much more so than its rivals! Here's what you need for a virtual host which talks to your uWSGI server:

server {
    server_name www.domain.it;
    
    access_log /var/log/nginx/www.domain.it.access_log main;
    error_log /var/log/nginx/www.domain.it.error_log info;
    
    location / {
        include uwsgi_params;
        uwsgi_pass 127.0.0.1:8787;
        uwsgi_modifier1 5;
    }
    
    location /myapp-static {
        alias /usr/local/catalyst/MyApp/root/static;
    }
}

This configuration maps your web application to the root location (/). The uwsgi_params file contains the parameters which nginx passes on to uWSGI, which are typically the following:

uwsgi_param  QUERY_STRING       $query_string;
uwsgi_param  REQUEST_METHOD     $request_method;
uwsgi_param  CONTENT_TYPE       $content_type;
uwsgi_param  CONTENT_LENGTH     $content_length;

uwsgi_param  REQUEST_URI        $request_uri;
uwsgi_param  PATH_INFO          $document_uri;
uwsgi_param  DOCUMENT_ROOT      $document_root;
uwsgi_param  SERVER_PROTOCOL    $server_protocol;
uwsgi_param  HTTPS              $https if_not_empty;

uwsgi_param  REMOTE_ADDR        $remote_addr;
uwsgi_param  REMOTE_PORT        $remote_port;
uwsgi_param  SERVER_PORT        $server_port;
uwsgi_param  SERVER_NAME        $server_name;

...and it works like a charm! That's all! ...except, what happens if you don't want to map your application to / but, say, to /app instead? It is entirely possible, but there is a caveat.

There is something in Catalyst which messes the URLs up when you don't map the application to the root (this also happens with the reverse proxy configuration, while Mojolicious, for instance, works perfectly). It's probably just a matter of writing a Plack middleware for nginx: there is one here, but it's not yet on CPAN and I didn't try it. Instead, I modified the nginx configuration as follows:

rewrite ^/app$ /app/ permanent;
location /app/ {
    include uwsgi_params_stripped;
    # Strip /app from PATH_INFO, or Catalyst will break
    set $app_path_info "";
    if ( $document_uri ~ ^/app(.*)$ ) {
        set $app_path_info $1;
    }
    uwsgi_param  SCRIPT_NAME        "/app/";
    uwsgi_param  PATH_INFO          $app_path_info;
    uwsgi_pass 127.0.0.1:8787;
    uwsgi_modifier1 5;
}

An extra SCRIPT_NAME parameter is passed, while PATH_INFO is modified. You also need to include a uwsgi_params_stripped file, which omits PATH_INFO so the parameter isn't passed twice:

uwsgi_param  QUERY_STRING       $query_string;
uwsgi_param  REQUEST_METHOD     $request_method;
uwsgi_param  CONTENT_TYPE       $content_type;
uwsgi_param  CONTENT_LENGTH     $content_length;

uwsgi_param  REQUEST_URI        $request_uri;
uwsgi_param  DOCUMENT_ROOT      $document_root;
uwsgi_param  SERVER_PROTOCOL    $server_protocol;
uwsgi_param  HTTPS              $https if_not_empty;

uwsgi_param  REMOTE_ADDR        $remote_addr;
uwsgi_param  REMOTE_PORT        $remote_port;
uwsgi_param  SERVER_PORT        $server_port;
uwsgi_param  SERVER_NAME        $server_name;

Note: you can also use FastCGI or a plain HTTP reverse proxy to make the uWSGI server and nginx talk, but the native uwsgi protocol support is the most efficient way to do it.
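
If you go the reverse proxy route instead, the mapping would look something like this (a sketch: uWSGI must then speak HTTP itself, for instance via its --http-socket option, instead of the uwsgi protocol):

location / {
    proxy_pass http://127.0.0.1:8787;
}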

And what about the lengthy administrative tasks (old file deletion, mail queue processing, ...) your application might have to do? The easiest way with Catalyst is to create an action (with restricted, maybe IP-based, access) which you execute either by hand or with a cronjob, as in the sketch below. If one of these tasks requires, say, 15 minutes, you need to configure nginx not to time out while waiting for a response from the application - but you surely don't want to set the gateway timeout to 15 minutes for all your users.
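
As for the action itself, here's a minimal sketch (the controller name, the path and the IP check are just assumptions to illustrate the idea):

# Perl 5 - lib/MyApp/Controller/Admin.pm (hypothetical)
package MyApp::Controller::Admin;
use Moose;
use namespace::autoclean;

BEGIN { extends 'Catalyst::Controller' }

sub cleanup : Path('/admin/cleanup') {
    my ( $self, $c ) = @_;

    # Only allow calls from localhost, e.g. from a cronjob
    # hitting 127.0.0.1 with curl or wget
    unless ( $c->req->address eq '127.0.0.1' ) {
        $c->res->status(403);
        $c->res->body('Forbidden');
        return;
    }

    # ... the lengthy task goes here ...
    $c->res->body('Done');
}

__PACKAGE__->meta->make_immutable;

1;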

The solution to the timeout problem is easy. Just configure another mapping, bound to localhost, with the appropriate settings:

    uwsgi_read_timeout 900; # 15 minutes
    # Maybe disable buffering so if you are sending status messages
    # with $c->res->write() you see them as they are sent
    uwsgi_buffering off;
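
Putting it together, the local-only mapping might look like this (a sketch: the listen port is an assumption, and the location could of course also be a path on the main vhost):

server {
    listen 127.0.0.1:8788;   # local-only vhost for admin/cron tasks

    location / {
        include uwsgi_params;
        uwsgi_pass 127.0.0.1:8787;
        uwsgi_modifier1 5;

        uwsgi_read_timeout 900; # 15 minutes
        uwsgi_buffering off;
    }
}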

This short HOWTO explains how to set up the excellent nginx to work with an SSL certificate issued by a CA. The whole process is fairly easy, but not completely straightforward.

I'm assuming the host name for which the certificate will be set up is www.domain.ext and the operating system is Gentoo Linux (the process shouldn't be too different on another OS, though). Also, in my example I'm assuming the certificate is a PositiveSSL from Comodo: any other equivalent certificate should not make much difference.

First of all, make sure you have OpenSSL and that nginx is compiled with SSL support. In order to create your private key and the certificate request, I suggest you cd to your web server configuration directory:

cd /etc/nginx

before generating the needed files with these two commands:

openssl genrsa -des3 -out www.domain.ext.key 2048
openssl req -new -key www.domain.ext.key -out www.domain.ext.csr

When, after issuing the second command, you are asked for the Common Name, be sure to enter the name of the host where you want to use your certificate, i.e.:

www.domain.ext

This will only work for https://www.domain.ext, and not for https://domain.ext or https://anyotherthing.domain.ext. Wildcard certificates exist, but they're more expensive: they may not seem that useful, but they are needed, for instance, for SSL name-based virtual hosts (which have some caveats of their own, though).

OK, now you have the certificate request file, www.domain.ext.csr: go to your CA and upload it. After the verifications (which in most cases boil down to the verification of an e-mail address inside the domain), you'll get a download link for the certificate, likely a ZIP file. This file contains the certificate (a file named domain.ext.crt or something similar) and maybe the CA "intermediate" certificate (which in the case of PositiveSSL is named positive_bundle.crt).

At this point you have all the needed files, but a couple of actions still need to be performed. If you entered a password when creating the private key with OpenSSL, you'll now most likely want to remove it, otherwise nginx will prompt you for it at every start (which is not so handy):

cp www.domain.ext.key www.domain.ext.key.orig
openssl rsa -in www.domain.ext.key.orig -out www.domain.ext.key

If the file you received from the CA also contained one or more intermediate certificates, you'll need to concatenate them, because nginx wants a single file:

cat www.domain.ext.crt positive_bundle.crt > www.domain.ext.pem

Be sure to put your server certificate at the beginning of the concatenated PEM file, as in the example above: otherwise, nginx will pick the wrong one up.

For the sake of security, you'd better make all these files readable only by their owner:

# Also chown or nginx won't be able to read the files
chown nginx:nginx *.pem *.key *.csr *.crt *.orig
chmod 600 *.pem *.key *.csr *.crt *.orig

The final step is the configuration of the web server. nginx is incredibly powerful but also extraordinarily easy to manage. Open nginx.conf and add something similar to the following (have a look at the nginx documentation for more options):

server {
        listen 15.15.15.15:443;
        server_name www.domain.ext;

        ssl on;
        ssl_certificate /etc/nginx/www.domain.ext.pem;
        ssl_certificate_key /etc/nginx/www.domain.ext.key;

        access_log /var/log/nginx/www.domain.ext.access_log main;
        error_log /var/log/nginx/www.domain.ext.error_log info;

        root  /usr/local/domains/www.domain.ext;
}

You should be all set and ready to go now!

In this article I'm going to explain some of the problems I face when upgrading libraries, language interpreters and other pieces of software which power the web applications I use in production. I'll then be showing how Perl, Catalyst, DBIx::Class and many of the other CPAN modules I use cleverly solve most of these issues.

The issues with software upgrades

When choosing the instruments for building a new web application (or any software, in general), a programmer usually bases the decision on aspects such as knowledge of the language, availability of the needed libraries, speed of development, speed of the compiled code, and a few others. There is however an important aspect which often doesn't get properly evaluated, and which basically boils down to the question: what is going to happen to my application in 5 years or so?


This question actually needs to be broken down into at least four parts:

  1. What is going to happen when a new version of the language (interpreter, compiler) I use is released?
  2. What is going to happen when a new version of the framework/libraries I use is released?
  3. What is going to happen when the server where the application lives gets updated?
  4. Do I really need to update libraries/language/system/other software?

Minor releases of language interpreters or compilers (question 1) don't usually feature incompatible changes: if they do, that's probably a bug. Major releases, instead, can. For instance, PHP 5 had some incompatible changes compared to PHP 4 (even though they were just a few). You're not forced to upgrade, but you might actually want to: a configuration option (or an instruction at the top of the source code) which enables or disables the old behaviour is desirable for situations such as this.

New versions of libraries/modules/frameworks (question 2) sometimes bring incompatible changes, mainly due to the deprecation of features: you can't support legacy things forever, it's a fact. It's however important to have a good deprecation -> removal cycle: this warns the users of a library well in advance, so they have plenty of time to patch their software and can decide when to do it. Since libraries are developed by a lot of different people, this aspect is covered better or worse depending on the developer.

If you are hosted in a data center on a managed server (which sometimes gets upgraded even if you don't ask for it), or if you decide it's time to update your old system, then you need an answer to question 3. It is basically the sum of 1 and 2, with some more possible incompatibilities with system tools, etc. You should choose a provider which notifies you months ahead of big upgrades to their systems.

So, should you upgrade (question 4)? My opinion is yes, you should do your best to keep an up-to-date system of stable, distribution-quality software, because you're likely to get the latest security patches and the best performance. However, there's no reason to hurry an upgrade, except for serious security issues: take your time, as a rushed upgrade is much worse than leaving a working system as is.

The (smart) solutions with Catalyst and Perl

I have had some Perl applications using Catalyst in production since 2007 or so: Perl was upgraded several times (from version 5.8.8 up to 5.16.1 as of today); libraries were upgraded countless times; the operating system was updated regularly. After all of this, the applications still work with almost no changes in 5 years!

First of all, the main libraries I use (Catalyst and DBIx::Class, plus some Catalyst plugins and tens of other CPAN modules) have an outstanding deprecation policy, which lets me know way beforehand which API features will be removed or changed; also, the code modifications I needed to make were always small enough not to be a real issue.

Perl 5 itself does a pretty good job when it comes to maintaining backwards compatibility. When a new major release comes out (e.g. 5.14 => 5.16), backwards compatibility is the default, as you have to specifically enable the new features with something like:

use v5.12;

# And it's scoped lexically, so you can
# upgrade PARTS of your software
sub mysub {
    use v5.14;
    ...
    {
        use v5.16;
        ...
    }
}

Thanks to these clever features, which actually solve most of the issues for you, upgrading the software underneath your application while keeping the application working (with the added benefits of the upgrade, too!) becomes a much smaller problem. To give an example, I recently upgraded a server with a Catalyst application from perl 5.14.2 to 5.16.1: this involved the reinstallation of some 476 CPAN modules after the upgrade; when it was finished, the application was restarted and it continued to run exactly as before, without a single change.

Also, if you don't want to update your perl interpreter when the operating system gets updated, you're not forced to use the perl bundled with the system: take a look at perlbrew, and you'll have your own interpreter in your user directory (you don't even need root access to compile and install it), fully independent and fully managed by you.
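
Getting started takes just a few commands (a sketch using perlbrew's documented installer and commands; the perl version is just an example):

# Install perlbrew in your home directory, then build and enable a private perl
curl -L https://install.perlbrew.pl | bash
perlbrew install perl-5.16.1
perlbrew switch perl-5.16.1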

Summing it all up, Perl and its ecosystem are proving to be very trustworthy, and this in turn makes the applications very trustworthy as well, with all the derived benefits!
