Internationalization: An Introduction
Chances are most professional programmers will have to deal with software internationalization at some stage in their careers.
Internationalization and localization of software are two sides of the same thing: Localizing an application is customizing its language, currency, sort orders, measurement systems, and more for a particular group’s needs. Typically this group is “people of a particular country” but other groups can be defined too. Internationalization is making your software accommodate a range of localizations.
Details of what localization requires vary depending on the nature of the software and the people you are localizing for, but it can potentially affect a surprising number of things in the program. Obvious examples are character encodings and language support. Less obvious are all kinds of number representations, dates (including different calendars), non-decimal currencies, sort orders of lists depending on the local alphabet, varying styles of naming people and locations, different time zones, different standard paper sizes and more.
Depending on what kind of program you’re writing and for whom, the range of internationalization considerations will vary. The golden rules for internationalization are surprisingly simple:
- Keep it simple.
- Don’t make assumptions.
Of course, we can’t do much at all without making any assumptions. The key is to have a general idea of what kinds of things will change and to keep your program organized so that it’s simple to change those things.
Character Encoding
It’s worth knowing a thing or two about character encodings. If you’re allowing translation of your software, or even just text input and output in a user’s native language, then the translation team and the users will be wanting to write some pretty impressive-looking symbols, some of which you might not have ever seen before.
Characters are symbols on the screen, and if you don’t know it by now then I have some shocking news for you: Computers can’t store symbols, only numbers. Symbols are stored by “encoding” them as numbers.
One of the oldest character encoding systems still used today is called ASCII, and in that system the letter “A” is encoded by the number 65, “B” by 66, and so on. The computer stores a “65” in binary, but what it displays is “symbol number 65”, which it might dig up from an array of bitmap images. The 65th image in that array gets displayed and guess what, it looks just like a letter “A”. Computers have been faking the display of symbols like this for years, and it’s amazing how some people finish their degrees thinking computers really can store letters and not realizing how it works.
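To see the “letters are really numbers” idea for yourself, here is a minimal sketch in C. The function names code_of and print_codes are made up for this illustration, and the sketch assumes an ASCII-compatible character set, which is nearly universal but not guaranteed by the C standard.

```c
#include <assert.h>
#include <stdio.h>

/* Returns the number the computer actually stores for a character.
   Assumption: the execution character set is ASCII-compatible. */
int code_of(char c)
{
    return (unsigned char)c;
}

/* Prints each character of a string next to the number that encodes it. */
void print_codes(const char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++)
        printf("'%c' is stored as %d\n", *s, code_of(*s));
}
```

Calling print_codes("AB") on an ASCII-based system should report 65 for the first character and 66 for the second, matching the numbers quoted above.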
The bad news is, it doesn’t stay this simple. ASCII works great for Australian and American symbols, but even 8-bit extended ASCII sets don’t cover all the symbols you’ll need through Europe, let alone further East where we get even more different sets of symbols like Arabic and Chinese and Japanese characters.
There are quite a few character encoding systems around to solve these problems, but one stands out because its goal is to unify all the different systems into one: Unicode. When this article was written, Unicode could represent almost 100,000 different symbols, and it is still growing, with the goal of unifying every representable character in one encoding.
Now, most (but not all!) C compilers use 8 bits to represent a “char”. This is because they were originally designed for working with ASCII, so 8 bits was more than enough. Now we’re wanting to display Unicode, so guess how many bits we’ll need to support all current and future Unicode symbols? The standard says to allow 32 bits for a full, all-out, no symbols left missing character.
If you’re working a lot with individual characters, then perhaps “UTF-32”, the 32-bit representation of Unicode characters, is the way to go. Fortunately, for efficiency’s sake, there are three different standard Unicode encodings:
- UTF-32: 32 bits per character, always.
- UTF-16: 16 bits per character, usually. Rare characters need more.
- UTF-8: 8 bits per character, usually. Many characters need more.
The UTF-16 and UTF-8 systems let you code text in a more compact way, but to still be able to represent everything they need to have various escape mechanisms in them to kick over to different parts of the range of possible characters. These mean that you can no longer assume the number of bytes stored for a string will even be a multiple of the number of symbols displayed on the screen. Luckily, most modern string libraries have functions to help with this sort of thing.
There are pros and cons to each encoding. UTF-8 doesn’t require you to detect endianness in files, but it usually takes more bytes to represent Chinese writing: most Chinese characters need three bytes in UTF-8, compared with two in UTF-16.
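The size trade-off between the encodings can be worked out directly from a character’s Unicode number (its “code point”). The following is a rough sketch rather than a full encoder: the function names are invented for this example, and it does not reject the code points reserved for UTF-16 surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to store one code point in UTF-8 (0 if out of range). */
int utf8_bytes(uint32_t cp)
{
    if (cp < 0x80)      return 1;  /* plain ASCII */
    if (cp < 0x800)     return 2;  /* accented Latin, Greek, Cyrillic, Arabic... */
    if (cp < 0x10000)   return 3;  /* most Chinese and Japanese characters */
    if (cp <= 0x10FFFF) return 4;  /* everything else */
    return 0;
}

/* Bytes needed to store one code point in UTF-16 (0 if out of range). */
int utf16_bytes(uint32_t cp)
{
    if (cp < 0x10000)   return 2;  /* one 16-bit unit */
    if (cp <= 0x10FFFF) return 4;  /* two 16-bit units (a surrogate pair) */
    return 0;
}

/* Total UTF-8 size in bytes of an array of n code points. */
size_t utf8_total(const uint32_t *cps, size_t n)
{
    size_t total = 0;
    assert(cps != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        total += (size_t)utf8_bytes(cps[i]);
    return total;
}
```

For example, the two Chinese characters U+4E2D and U+6587 take six bytes in UTF-8 but four in UTF-16, while plain English text takes one byte per letter in UTF-8 and two in UTF-16.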
UTF-8 is worth special mention because it incorporates traditional ASCII text. If you write a C string with no special characters, like "This is a string", then it’s usable right away as a UTF-8 string. That makes updating old programs suddenly much easier. As this article points out, strlen() will only ever return the number of bytes used to store a string, not the number of symbols it really contains. There are functions like wcswidth() that can check the number of columns you’ll need to display a string on a terminal, and equivalents for each GUI toolkit which tell you the number of pixels needed given the font and the Unicode string.
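The gap between bytes and symbols is easy to demonstrate. Below is a minimal sketch of a code point counter for UTF-8; utf8_count is a name made up for this example, and it assumes the input is valid UTF-8. It works because every byte of the form 10xxxxxx is a continuation of the previous character rather than the start of a new one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts the Unicode code points in a UTF-8 string by skipping
   continuation bytes (10xxxxxx). Assumes the input is valid UTF-8. */
size_t utf8_count(const char *s)
{
    size_t count = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    return count;
}
```

With the word “résumé” written as the bytes "r\xC3\xA9sum\xC3\xA9", strlen() reports 8 while utf8_count() reports 6. Even that is only a count of code points: combining accents and wide characters mean the number of columns on screen can differ again, which is why the width functions mentioned above exist.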
It’s worth noting that most modern graphical toolkits handle Unicode quite well. Gtk2 has Unicode support, and Qt, my personal favorite, offers its own string class which stores all strings internally in UTF-16. Current versions of both can handle right-to-left writing.
According to MSDN, “MFC supports the Unicode standard for encoding wide characters on Windows NT, Windows 2000, and Windows XP platforms. Unicode applications cannot run on Windows 98 platforms.” So as with the Unixy options, working with libraries from this side of the decade is a must if you want Unicode to be easy. MacOS X also uses Unicode as standard, although to the best of my knowledge that wasn’t true of MacOS 9.
Essentially, whether UTF-8, UTF-16 or UTF-32, everyone is converging on Unicode. And you should be too! It’s easy to do, especially if you make sure your string libraries are up to date and let them do the hard work for you. Once you support Unicode, your translators can work in nearly any language and even without translating your program, users can write text into your programs in their language of choice, including Japanese Katakana and even just fancy English words like “résumé”.
Language Translation
If you’re lucky, then translation is as simple as marking all the strings in your program as translatable, and letting a software tool dig those strings out into a format for translators to write translations to. The same software package will then select the appropriate translation whenever your program goes to output a string. Easy!
So when you would previously have written a string like this:
printf("My hovercraft is full of eels!");
You should now write it like one of these examples:
GNU programs including Gtk:
printf("%s", gettext("My hovercraft is full of eels!"));
Qt programs:
printf("%s", qPrintable(tr("My hovercraft is full of eels!", "(optional context info here)")));
MFC programs, “.rc” file:
STRINGTABLE DISCARDABLE
BEGIN
IDS_HOVERCRAFT_EELS "My hovercraft is full of eels!"
END
and “.cpp” file something like this (wow my MFC is rusty, feel free to correct me!):
CString msg;
msg.LoadString(IDS_HOVERCRAFT_EELS);
_tprintf(_T("%s"), (LPCTSTR)msg);
One way or another, those pesky strings get stored away in their own file where it’s easy for translators to go nuts with them.
The MFC uses integers to index which string is wanted (and #define
values like IDS_STRING_PURPOSE to refer to these numbers). All
the other methods I’m aware of simply use the string itself, together
with some optional context information, as the index into the
translations table.
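To make the “string itself is the index” idea concrete, here is a toy lookup in C. It is only a sketch of the concept, not how gettext or Qt store their catalogues: the struct, the translate function and the German entries are all invented for this illustration. One behavior it does share with gettext is that a string with no translation comes back unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One entry of a hypothetical translation table.
   The original (source-language) string acts as the key. */
struct translation {
    const char *original;
    const char *translated;
};

/* A made-up German catalogue, for illustration only. */
static const struct translation german[] = {
    { "My hovercraft is full of eels!", "Mein Luftkissenfahrzeug ist voller Aale!" },
    { "Files processed: %d\n",          "Verarbeitete Dateien: %d\n" },
};

/* Looks up text in a table of n entries. If there is no translation,
   the original string is returned so the program still shows something. */
const char *translate(const struct translation *table, size_t n, const char *text)
{
    assert(text != NULL);
    for (size_t i = 0; i < n; i++)
        if (strcmp(table[i].original, text) == 0)
            return table[i].translated;
    return text;
}
```

A call such as translate(german, 2, "My hovercraft is full of eels!") returns the German entry, while an unknown string like "Quit" falls through untranslated. Real tools use faster lookups than a linear search, but the keying scheme is the same.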
A consideration when accommodating localization of graphical applications is that the size required to display text will change significantly depending on the translation. In the MFC, you will need to allow space, or let the translation team mess with the on-screen layout that goes with particular translations. Both Gtk and Qt avoid this problem using automatic layout management, which takes information like “I want these fields aligned” and “these fields can be flipped from left to right if required by localization”, and adapts the GUI to match the constraints you set as well as the sizes of the translated text so it all fits and looks neat.
There are all kinds of traps to watch out for with making your strings translatable. The best way to avoid most of them is to keep your program simple so it doesn’t mess with its strings too much. Add in your number, do what you need to, but don’t get cute like this:
printf("%d file%s processed.\n", number, number==1?"":"s");
If you can do with just one string, it’s better, like this:
printf("Files processed: %d\n", number);
Forms are near universal so if you make it “form-like” it’s nice and easy for the translators. If you really want to get chatty then include both strings in full (I’m ignoring stylistic preferences regarding “if” statements here):
printf(number==1?"%d file processed.\n":"%d files processed.\n", number);
This way the translators at least have both strings to work with, and you won’t get complaints until you translate it to a language where they want different wording again for 3 files, which can happen! Don’t forget that even English gets tricky with numbers, for example “1st”, “2nd”, “3rd”, “4th”, “5th”, and it stays using “th”… until you get to “21st”. If you can organize the display so a very plain number works then it’s really worth doing that.
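To show how much logic hides behind something as small as “1st” and “21st”, here is a sketch of the English rule in C. The function name ordinal_suffix is made up for this example. Note the special case the paragraph above doesn’t mention: 11, 12 and 13 take “th” even though they end in 1, 2 and 3.

```c
#include <assert.h>
#include <string.h>

/* Returns the English ordinal suffix for a non-negative number:
   1st, 2nd, 3rd, 4th ... 11th, 12th, 13th ... 21st, 22nd, and so on. */
const char *ordinal_suffix(int n)
{
    assert(n >= 0);
    int last_two = n % 100;
    if (last_two >= 11 && last_two <= 13)
        return "th";              /* 11th, 12th, 13th, 111th ... */
    switch (n % 10) {
    case 1:  return "st";
    case 2:  return "nd";
    case 3:  return "rd";
    default: return "th";
    }
}
```

None of this carries over to other languages, so a translator cannot fix it by editing a string. That is a good argument for arranging the display so a very plain number works.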
Avoid saying the same thing in different ways. Say the same thing in the same way. Avoid using different ways to say the same thing. Aside from repeating myself, I’ve just given the translators three strings to translate instead of just one. Of course, using the same words to say two different things isn’t ideal either.
In general, it’s not hard to mark up a program to be translatable. The main rule to remember is to keep the contents of your translatable strings as simple and obvious as possible.
There are endless pitfalls to translation, such as having to worry about the gender of the word “New…” in two similar menu items, depending on what that menu suggests is being created and what gender the word has. Mostly, keep it simple and it’ll be easy to fix when the translators tell you it needs fixing.
Measurement Systems
Are you using Metric or Imperial? Or something else altogether? Make sure it’s possible to set which units you use.
If you’re changing units a lot, I suggest you choose one set for internally representing data in your program, and just convert to/from other units for display and data input. This has two advantages: Conversions are only done on input and output, so there are fewer combinations to consider when performing calculations on the data; and saved data is always in the same units, so if Americans with Imperial units load a drawing from Australia with Metric units, they’ll see the same distances regardless of what they call them.
Where you can’t save in a standardized unit, make sure the selected unit is always saved with the data so it’s unambiguous.
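As a sketch of the “one internal unit” approach, the following C fragment keeps every length in millimetres and converts only at the edges of the program. The enum and function names are invented for this illustration, and millimetres are an arbitrary choice of internal unit.

```c
#include <assert.h>

/* Units the user may pick for input and display (hypothetical list). */
enum length_unit { UNIT_MILLIMETRE, UNIT_INCH };

/* Millimetres per display unit. One inch is defined as exactly 25.4 mm. */
static double mm_per_unit(enum length_unit unit)
{
    assert(unit == UNIT_MILLIMETRE || unit == UNIT_INCH);
    return unit == UNIT_INCH ? 25.4 : 1.0;
}

/* Call on data entry: convert what the user typed into internal millimetres. */
double to_internal_mm(double value, enum length_unit unit)
{
    return value * mm_per_unit(unit);
}

/* Call on display: convert internal millimetres into the user's chosen unit. */
double from_internal_mm(double mm, enum length_unit unit)
{
    return mm / mm_per_unit(unit);
}
```

All calculations and saved files then deal only in millimetres, so a length entered as 2 inches is stored as 50.8 and shown as 50.8 to a Metric user or 2 to an Imperial one.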
Watch Your Assumptions
Paper sizes, currency conversion, and more can change. So make everything settable that isn’t truly a universal constant. Easy.
The format of people’s names and addresses can vary a lot. You might need to assume some things like a “first name” and a “last name” field, but allow for the possibility that these might be blank. Some people just have one name. If you throw up a warning for a field left blank, that’s fine, but make it clear from the warning that clicking “yes this is OK” will accept the entry. Think about why you are separating first and last name. Is it to give a short name for “Dear Joe,” in form letters? If so, then why not have a single “full name” field and a single “short name for letters” field? This will make handling “Dear Esteemed Professor” easy if it is required. Or if you do need separate first and last names then you can write a function to fetch a short name that starts with their first name and fetches more if that’s blank. There are lots of ways to work around different formats of name, and the same general principles apply to addresses. Think about what rules you are imposing and why. Half the time checks like “first name field is blank” get in the way of legitimate data entry rather than preventing mistakes. For postal addresses, the main thing is to allow for long addresses and postal codes that aren’t in a format you’re familiar with.
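The “fetch a short name” function suggested above might look something like this in C. The struct layout and the names person and greeting_name are assumptions made for this sketch; a real program would pick the fields that suit its own data.

```c
#include <assert.h>
#include <string.h>

/* A hypothetical record where either name field may be blank or missing. */
struct person {
    const char *first_name;
    const char *last_name;
};

/* True if a field is missing or empty. */
static int is_blank(const char *s)
{
    return s == NULL || s[0] == '\0';
}

/* Picks a name for "Dear ...": the first name if there is one,
   otherwise the last name, otherwise an empty string so the caller
   can fall back to a generic greeting. */
const char *greeting_name(const struct person *p)
{
    assert(p != NULL);
    if (!is_blank(p->first_name))
        return p->first_name;
    if (!is_blank(p->last_name))
        return p->last_name;
    return "";
}
```

The point of the design is that a blank field is treated as ordinary data rather than an error, so someone with a single name still gets a sensible letter.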
Round-Up
Internationalization is fairly easy if you keep the program output simple, stay aware of the character set being used (preferably Unicode), and don’t make too many assumptions about what users will give you as data. For simple programs it can be as simple as tagging your strings with the appropriate conventions for the tools you’re using.
If you see terms like “i18n” or “l10n” around, these are just abbreviations for “internationalization” and “localization”. The number represents how many letters were skipped. Personally I think these abbreviations are over-used but you’ll see them around so it’s worth knowing what they mean.
Thanks to Jiri Baum for proofreading this article.
5 Comments
Personally, I’d say that if you’re worried about file size you should be compressing anyway, and if you aren’t then it’s not a problem
LZW should do a good job on UTF-8, because UTF-8 involves short sequences of bytes and that’s exactly what LZW handles well…
Jiri
Comment by jiri — On 5-3-2006 at 5:26:38 PM
One thing you might see in software that uses the “gettext” method is that they are so lazy that they can’t be bothered writing “gettext” everywhere, and so define:
#define _(string) gettext(string)
then write _("Here’s some text to translate.") instead of gettext("Here’s some text to translate.").
Comment by Paul Harrison — On 6-3-2006 at 10:06:07 AM
Something to consider about your string tables is that they will be read and translated by
non-programmers. On some of the games I’ve worked on, it was even done by contractors
who never saw the game. So we had to include a lot of context and explanation for each
string, which ended up becoming part of our string source file format.
Then I had a lot of trouble convincing programmers and designers that filling in those fields
wasn’t a complete waste of their time.
Comment by brumby — On 6-3-2006 at 6:01:18 PM
Good article Sarah.
I agree with your comments on “keep it simple” and avoiding saying the same thing in different ways.
This is important not just for internationalization, but also for basic comprehensibility.
A further point - avoid using words like “comprehensibility” - Words that are, at best, marginally correct or idiomatic in English can make translation a nightmare!
Comment by Ian George — On 20-3-2006 at 8:41:02 PM