Internationalization-puzzles: Daily programming puzzles just like Advent of Code

(from the first puzzle)

> The venerable SMS system uses a message limit of 160 bytes. This was designed so that a message could fit in exactly one packet, thus being really cheap and fast to handle on first-generation mobile phone networks. Although the approach makes sense for technical reasons, it unfairly penalizes people who use non-latin (e.g. Russian, Greek, Japanese) alphabets - in most encodings, they need more bytes per character than latin alphabets.

Except that obviously the system is going to use an encoding that makes sense for the local language. It was long remarked that Chinese Twitter users enjoyed a less restrictive limit. [Practically no limit at all, since, as this puzzle notes, Twitter limited by the character instead of the byte.]

You need two bytes per character in Chinese in the common legacy encodings (unless you really want to use UTF-8).

    是她吗？                         -  8 bytes
    Is that her?                     - 12 bytes
    是的                             -  4 bytes
    Yes                              -  3 bytes
    我做了很多宝宝的表情包             - 22 bytes
    I made a lot of stickers of her  - 31 bytes

This doesn't look like a penalty to me. If we did switch the Chinese into UTF-8, it would take about as much space as the English.
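
The byte counts above can be reproduced with a short sketch. It assumes GBK as the two-byte Chinese encoding, which the comment does not actually name; `samples` and `byte_len` are illustrative names only.

```python
# Sketch: compare encoded sizes of the Chinese and English phrases.
# Assumption: "two bytes per character" refers to an encoding like GBK.
samples = [
    ("是她吗？", "Is that her?"),
    ("是的", "Yes"),
    ("我做了很多宝宝的表情包", "I made a lot of stickers of her"),
]


def byte_len(text: str, encoding: str) -> int:
    """Return the number of bytes `text` occupies in `encoding`."""
    return len(text.encode(encoding))


# One row per phrase pair: (GBK bytes, UTF-8 bytes, English ASCII bytes).
rows = [
    (byte_len(zh, "gbk"), byte_len(zh, "utf-8"), byte_len(en, "ascii"))
    for zh, en in samples
]

for (zh, en), (gbk, utf8, ascii_len) in zip(samples, rows):
    print(f"{zh}: {gbk} bytes (GBK), {utf8} bytes (UTF-8) | {en}: {ascii_len} bytes")
```

Under that assumption the Chinese phrases come to 8, 4 and 22 bytes in GBK and 12, 6 and 33 bytes in UTF-8, against 12, 3 and 31 bytes for the English.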

gmokki 4 months ago

SMS in Europe was max 140 bytes, and there were various custom 7-bit encodings for most Western languages. SMS also supported UCS-2, i.e. Unicode with fixed 16-bit code points, which cannot represent modern emoji but covers all normal languages; whether your phone had the fonts was another matter.
When concatenating SMS messages, a User Data Header had to be added, taking a minimum of 6 bytes. That reduces the bytes available from 140 to 134, which allows only 153 or 67 characters per segment for 7-bit or UCS-2 messages respectively.
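
The arithmetic behind those numbers, as a rough sketch (the constant and function names are made up for illustration, and the 6-byte header is the minimum quoted above):

```python
# Sketch: characters that fit in an SMS payload for each encoding.
SMS_PAYLOAD_BYTES = 140  # user data available in a single SMS
UDH_BYTES = 6            # minimum User Data Header size for concatenation


def capacity(payload_bytes: int) -> tuple[int, int]:
    """Return (7-bit characters, UCS-2 characters) that fit in the payload."""
    septets = payload_bytes * 8 // 7  # 7 bits per character, packed
    ucs2_chars = payload_bytes // 2   # 2 bytes per character
    return septets, ucs2_chars


single = capacity(SMS_PAYLOAD_BYTES)                   # (160, 70)
concatenated = capacity(SMS_PAYLOAD_BYTES - UDH_BYTES)  # (153, 67)

print("single message:", single)
print("per concatenated segment:", concatenated)
```

This also shows where the familiar 160 comes from: it is 160 seven-bit characters packed into 140 bytes, not 160 bytes.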

fodkodrasz 4 months ago

In the first puzzle, the first line is in Hungarian, but the Hungarian letters that i18n usually struggles with most are not even there: ü, ö, ű, ő. Before UTF-8 gained widespread adoption, these characters regularly got messed up when they were passed through multiple systems.

I still see basic accented characters like éá messed up sometimes, which is especially a shame in 2025.

amarillion 4 months ago

That's true, those characters are from the iso-latin-2 set, but iso-latin-1 was/is more dominant, so there was a lot of potential for confusion.
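
A small sketch of that confusion, assuming the text was written as ISO-8859-2 (Latin-2) and read back as ISO-8859-1 (Latin-1); `misread` is a hypothetical helper, not part of the puzzle:

```python
# Sketch: what happens when bytes written in one encoding are decoded as another.
def misread(text: str, written_as: str = "iso-8859-2", read_as: str = "iso-8859-1") -> str:
    """Encode `text` with one codec and decode the resulting bytes with another."""
    return text.encode(written_as).decode(read_as)


hungarian = "üöűőéá"
garbled = misread(hungarian)
print(hungarian, "->", garbled)  # ű and ő change, the other letters survive

# The same helper shows the UTF-8 read as Latin-1 case.
print("é ->", misread("é", written_as="utf-8", read_as="iso-8859-1"))
```

Here ü, ö, é and á sit at the same byte values in both sets and come through intact, while ű and ő turn into û and õ.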

edarchis 4 months ago

Starts with I18N in SMS. There should be a trigger warning on those. SMS is dreadful in itself. In an international setting, it's a nightmare. But once people wonder why their costs exploded after they changed their welcome message to include an accented character...

Timwi 4 months ago

The programming challenge does not require interfacing with actual SMS, nor knowing or using any part of the SMS protocol.
