Monday, 24 January 2022

Unicode Trivia U+07C1

Codepoint: U+07C1 "NKO DIGIT ONE"
Block: U+07C0..07FF "NKo"

As reported by Dr Dianne White Oyler in "A Cultural Revolution in Africa: Literacy in the Republic of Guinea since Independence" (2001), the N'Ko script was developed by Souleymane Kanté in 1949, partly in response to

a 1944 challenge posed by the Lebanese journalist Kamal Marwa in an Arabic-language publication, Nahnu fi Afrikiya [We Are in Africa]. Marwa argued that Africans were inferior because they possessed no indigenous written form of communication. His statement that "African voices [languages] are like those of the birds, impossible to transcribe" reflected the prevailing views of many colonial Europeans. Although the journalist acknowledged that the Vai had created a syllabary, he discounted its cultural relevancy because he deemed it incomplete. [Page 588]

Kanté discarded both Arabic and Latin scripts as unable to transcribe all the characteristics of the Mande languages. Having developed a completely novel alphabet instead,

he called together children and illiterates and asked them to draw a line in the dirt; he noticed that seven out of ten drew the line from right to left. For that reason he chose a right-to-left orientation. In all Mande languages the pronoun n- means "I" and the verb ko represents the verb "to say". [Page 589]

So "N'Ko" means "I say" in all the target languages.

The right-to-left mantra extends not only to words, but to digits and numbers too. The ten digits zero to nine (U+07C0 "NKO DIGIT ZERO" to U+07C9 "NKO DIGIT NINE") face right:

N'Ko digits (top), Western Arabic (middle), Eastern Arabic (bottom)

This is particularly noticeable with U+07C1 "NKO DIGIT ONE": '߁'

Not only that, but the least significant digits of multi-digit N'Ko numbers are on the left, unlike almost all other writing systems. Latin, Greek, Arabic and Hebrew numbers place the least significant digit on the right, even though the latter two scripts are written right-to-left.

Consider the improbable phrase "There are 12345 eggs":

There are 12345 eggs = English

Υπάρχουν 12345 αυγά = Greek

 يوجد ١٢٣٤٥ بيضة = Arabic

יש 12345 ביצים = Hebrew

߁߂߃߄߅ ߞߟߌ߫ ߦߋ߫ ߦߋ߲߬ = N’Ko

In case of tofu:

Note that the codepoints for "1", "2", "3", "4" and "5" occur in ascending memory order in all cases. For example:
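As a rough check from a JavaScript console, the digit strings can be dumped in logical (memory) order. This is only a sketch; the `samples` table and the helper names are invented for the illustration.

```javascript
// "12345" written with the digits of three scripts, in logical (memory) order.
// The object and function names are illustrative only.
const samples = {
  western: "12345",
  arabicIndic: "\u0661\u0662\u0663\u0664\u0665",
  nko: "\u07C1\u07C2\u07C3\u07C4\u07C5",
};

// Format each codepoint of a string as "U+XXXX".
function codepoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// True if every codepoint is exactly one greater than the one before it.
function isAscending(text) {
  const values = Array.from(text, (ch) => ch.codePointAt(0));
  return values.every((v, i) => i === 0 || v === values[i - 1] + 1);
}

for (const [script, text] of Object.entries(samples)) {
  console.log(script, codepoints(text).join(" "), isAscending(text));
}
```

The stored order is the same for all three; only the rendering differs.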


At first, I wasn't sure how much "support" the Unicode standard gives for this type of anomaly. UCD's sister project CLDR (Common Locale Data Repository) has very little to say about N'Ko. There is scope for algorithmic number formatting, but I didn't find anything specific.

However, after a bit of thought I realised that, because directionality is a property of each codepoint and not of the script of the codepoints, digit ordering in N'Ko works "out of the box".

Consider these bidirectional class fields ("bc") from the UCD:

  • Latin
    • "A" (U+0041 "LATIN CAPITAL LETTER A") =  "L" = strong left-to-right
    • "1" (U+0031 "DIGIT ONE") = "EN" = European number (left-to-right)
  • Greek
    • "α" (U+03B1 "GREEK SMALL LETTER ALPHA") =  "L" = strong left-to-right
  • Arabic
    • "ا" (U+0627 "ARABIC LETTER ALEF") = "AL" =  Arabic letter (right-to-left)
    • "١" (U+0661 "ARABIC-INDIC DIGIT ONE") = "AN" =  Arabic number (left-to-right)
  • Hebrew
    • "א" (U+05D0 "HEBREW LETTER ALEF") = "R" = strong right-to-left
  • N'Ko
    • "ߊ" (U+07CA "NKO LETTER A") = "R" = strong right-to-left
    • "߁" (U+07C1 "NKO DIGIT ONE") = "R" = strong right-to-left

Unlike the other digits, N'Ko digits are marked as strongly right-to-left. The only other examples in Unicode 14.0 I could find were the Adlam digits (the Adlam script dates from 1989).

Another interesting codepoint from the Unicode "NKo" block is U+07F7 "NKO SYMBOL GBAKURUNEN":

It's a decorative punctuation symbol used to mark the end of a major section of text and represents the three stones holding a cooking pot over a fire:

[source]

Finally, there can't be many alphabets that have their own day: April 14.

[Many thanks to Coleman Donaldson for help with the N'Ko language]

Sunday, 23 January 2022

Unicode Trivia U+0780

Codepoint: U+0780 "THAANA LETTER HAA"
Block: U+0780..07BF "Thaana"

The Thaana script is used to write the Maldivian language. According to Wikipedia, it's an abugida with no inherent vowel. According to the ISO standard, it's a right-to-left-written alphabet (as indicated by the hundreds digit of its numeric ISO-15924 code "170").

It first appeared in about 1705 CE and seems to have been developed with obfuscation in mind. The alphabet order is arbitrary and the consonant letterforms are derived from numeric figures:

On the top row, in white, are the 24 basic consonants in Thaana alphabetical order. These are the 24 consecutive Unicode codepoints U+0780 "THAANA LETTER HAA" to U+0797 "THAANA LETTER CHAVIYANI".

The second row shows the Arabic-Indic digits one to nine in blue and the Dhives Akuru digits one to six in red. Dhives Akuru was a Maldivian script used before Thaana. The main part of the alphabet looks very much like a simple replacement cipher.

An early version of the Thaana script, Gabulhi Thaana, was written scriptio continua, that is, without inter-word spacing or punctuation. This sounds like an absolute nightmare but was quite common in classical Greek and Latin. Before mechanical printing, Arabic was also written without spacing. This is, perhaps, why many writing systems have distinct letterforms for final letters in words.

According to "Scripts of Maldives", the early Thaana script, Gabulhi Thaana, got its name from the Maldivian word "gabulhi" meaning the in-between stage of a coconut, when it is neither fully ripe nor quite tender. Hence the idea of "immature" or "not fully-formed".

Saturday, 22 January 2022

Unicode Trivia U+0753

Codepoint: U+0753 "ARABIC LETTER BEH WITH THREE DOTS POINTING UPWARDS BELOW AND TWO DOTS ABOVE"
Block: U+0750..077F "Arabic Supplement"

Syriac is not the only script that makes extensive use of diacritics. The spread of the Arabic script throughout the world means it is used for diverse languages, many of which have sounds not found in Arabic. Part of the "Arabic Supplement" block contains a column "Extended Arabic letters" with the annotation:

These are primarily used in Arabic-script orthographies of African languages.

One codepoint, U+0753, has the somewhat precise name of "ARABIC LETTER BEH WITH THREE DOTS POINTING UPWARDS BELOW AND TWO DOTS ABOVE". When I render that codepoint using "Noto Sans Arabic" on my PC, I get this:

Noto Sans Arabic (2.004)

When I render it with a default local font, I get:

Arial (7.00)

Spot the difference!

There's definitely a discrepancy in the orientation of the lower dots, but which is correct? I came up with four possibilities:

  1. I have old/corrupt font files installed on my PC
  2. The name of the Unicode codepoint is incorrect
  3. The orientation of the lower dots doesn't really matter, so there is no issue
  4. One of the font glyphs is incorrect

Initially, I did indeed think it was an old version of Noto Sans Arabic installed on my machine. But I updated my local version of Noto Sans Arabic to 2.009 with the same results. Google web font specimens confirmed the issue is with Noto Sans Arabic in general:

Three of the four specimens suggest the name of the Unicode codepoint is probably correct. I checked that there are no similarly-named codepoints; there is no "ARABIC LETTER BEH WITH THREE DOTS POINTING DOWNWARDS BELOW AND TWO DOTS ABOVE".

I couldn't really imagine that, carefully named as it is, the orientation of the lower dots in U+0753 was unimportant.

I then checked Unicode Updates and Errata but found no references to this or nearby codepoints.

So the finger of suspicion fell on the glyph within the Noto Sans Arabic font being incorrect. FontForge confirmed this:


I looked through the issues reported for Noto fonts, but found nothing, so I submitted a new one.

Of course, this has only a passing connection to the Unicode standard. But one can easily imagine the amount of noise that has to be ploughed through by the committee along the lines of "My text doesn't get displayed how I expected" just to get to genuine issues with the Unicode standard itself.

According to Wiktionary, U+0753 "ARABIC LETTER BEH WITH THREE DOTS POINTING UPWARDS BELOW AND TWO DOTS ABOVE" is

The third letter of the Hausa alphabet in ajami script, equivalent to Latin script c.

I was initially a bit suspicious of this. Both Omniglot and Wikipedia suggest that the three dots go above that letter, making it more like U+062B "ARABIC LETTER THEH". However, Richard Ishida points out that there are lots of subtle local variations and the initial Unicode proposal shows an "ARABIC LETTER BEH WITH THREE DOTS POINTING UPWARDS BELOW AND TWO DOTS ABOVE" in Figure 5. The proposal cites "Using Arabic Script in Writing the Languages of the Peoples of Muslim Africa" (1992) by Mohamed Chtatou:

"Figure 5" (Chtatou, 1992)

Richard Ishida again:

Unicode policy for the Arabic script is to encode fully precomposed characters rather than to use combining characters for ijam.

It would appear that the task of supporting more obscure and/or infrequent Arabic script glyphs in Unicode (and in fonts) can only get harder.

Friday, 21 January 2022

Unicode Trivia U+0740

Codepoint: U+0740 "SYRIAC FEMININE DOT"
Block: U+0700..074F "Syriac"

Syriac has got to be one of the dottiest scripts in Unicode. The fact that there's a 232-page book devoted to Syriac diacritics says a lot:

[source]

The dot is used for everything in Syriac from tense to gender, number, and pronunciation, and unsurprisingly represents one of the biggest obstacles to learning the language.

Section 9.3 of the Core Specification 14.0.0 gives an introduction to some of these complexities. Within the sub-section concerning exceptions to the diabolical diacritical rules, is this:

The feminine dot is usually placed to the left of a final taw.

This refers to codepoint U+0740 "SYRIAC FEMININE DOT". According to Richard Ishida, this non-spacing mark:

[...] is a feminine marker used with "ܬ" [U+072C SYRIAC LETTER TAW] to indicate a feminine suffix. East Syriac fonts should render as two dots below the base letter, whereas West Syriac fonts render as a single dot to the left of the base.

So far as I can tell, this is the only diacritic currently in Unicode that distinguishes (or elucidates) the underlying word's gender.

Below are variations of "ܩܛܠܬ" (= kill) distinguished solely by diacritics ("ܩ̇ܛܠܬ", "ܩܛ̣ܠܬ" and "ܩܛܠܬ݀") rendered with "Noto Sans Syriac":

Notice U+0740 "SYRIAC FEMININE DOT" at the end (left) of the last line.

[Thanks to Richard Ishida and, indirectly, J F Coakley at Jericho Press]

Thursday, 20 January 2022

Unicode Trivia U+0640

Codepoint: U+0640 "ARABIC TATWEEL"
Block: U+0600..06FF "Arabic"

The FIFA World Cup Qatar 2022 logo applies kashida to the Latin script word for "Qatar":

[source]

This is the elongation of the connection between the "t" and "a". In Unicode, U+0640 "ARABIC TATWEEL" (alias "kashida") can be used to represent this elongation. Tatweels are usually only used in Arabic (or similar) scripts, so it's a nice cross-cultural reference in this context.

Here's an extreme example in the form of an Arabic script basmala:

[source]

Tatweels could be considered typographical formatting, but, because a tatweel character was part of ISO/IEC 8859-6 at position 0xE0, it was "inherited" by Unicode as a separate graphical codepoint.

Arabic tatweels are similar to Latin hyphens when used for text justification, but the rules are obviously very different. An excellent history of the topic is given by Titus Nemeth.

[At this point, my complete lack of understanding of Arabic will shine through. Apologies.]

Like its Unicode block-neighbour Hebrew, Arabic script is a right-to-left abjad. The name "Qatar" in Arabic is made up of three Arabic consonants:

  • U+0642 "ARABIC LETTER QAF"
  • U+0637 "ARABIC LETTER TAH"
  • U+0631 "ARABIC LETTER REH"

قطر

If, as part of text justification or for aesthetic effect, we want to widen the word, we could insert a tatweel between the tah and reh:

قطـر

In fact, we can add more tatweels in sequence:

قطـــــر

This is, of course, an artificial example; words of only three consonants are rarely stretched.
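Because a tatweel is an ordinary codepoint, stretching a word is plain string manipulation. Here is a minimal sketch; `stretch` and `unstretch` are made-up helper names, and the code does nothing to decide where a tatweel is typographically acceptable.

```javascript
const TATWEEL = "\u0640"; // ARABIC TATWEEL

// "Qatar" in Arabic: qaf, tah, reh (logical order).
const qatar = "\u0642\u0637\u0631";

// Insert `count` tatweels before the letter at `index`.
// Indexing by UTF-16 code unit is fine here: every character is in the BMP.
function stretch(word, index, count) {
  return word.slice(0, index) + TATWEEL.repeat(count) + word.slice(index);
}

// Remove every tatweel, recovering the unstretched spelling.
function unstretch(word) {
  return word.split(TATWEEL).join("");
}

const wide = stretch(qatar, 2, 5);
console.log(wide, wide.length);         // 8 code units: 3 letters + 5 tatweels
console.log(unstretch(wide) === qatar); // true
```

Stripping tatweels like this is also a common first step before searching or comparing Arabic text.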

Straight line tatweels are not the only mechanism that can be used to justify Arabic text. Others include:

  1. Whitespace
  2. Letterform lengthening/shortening
  3. Ligature variation



Wednesday, 19 January 2022

Unicode Trivia U+05D0

Codepoint: U+05D0 "HEBREW LETTER ALEF"
Block: U+0590..05FF "Hebrew"

Hebrew is usually written in an abjad script, right-to-left. Abjads are also known as consonant alphabets because they lack "letters" for vowel sounds. Diacritics indicating vowels are used for poetry, religious texts and teaching Hebrew.

When we type the 22 consonants, "alef" (U+05D0) to "taw" (U+05EA), a text renderer should render them right-to-left:

But how does it know that?

Within the UCD fields for U+05D0, we see:

bc = R

This means that the "bidirectionality class" for U+05D0 is "any strong right-to-left (non-Arabic-type) character" (UAX #44). This, together with the fiendishly complex bidirectional algorithm (UAX #9), allows text renderers to render arbitrary sequences of mixed-script codepoints correctly.

The U+05D0 "HEBREW LETTER ALEF" codepoint is marked as Hebrew script:

sc = Hebr

but the Unicode bidirectional algorithm does not rely on script-level properties. That is, Unicode says that "alef" is usually rendered "right-to-left", not that "alef" is part of the "Hebrew" script and the "Hebrew" script is usually rendered "right-to-left".

There is no concept of lowercase and uppercase letters in Hebrew; the script is unicameral.

Finally, and appropriately, five Hebrew letters have "final forms" when they appear at the end of words:


This is similar to Greek sigma, but without the special-case handling necessary to overcome the lack of an uppercase final sigma.

Tuesday, 18 January 2022

Unicode Trivia U+058D

Codepoint: U+058D "RIGHT-FACING ARMENIAN ETERNITY SIGN"
Block: U+0530..058F "Armenian"

The Armenian alphabet was devised in about 405 CE by Mesrop Maštoc' to give Armenians access to Christian texts. It was probably developed from the Greek alphabet with influences from Syriac and possibly Ge'ez scripts. It has uppercase and lowercase letterforms:

Armenian is written with quite distinctive punctuation. See Section 7.6 of the Unicode Core Specification.

Before Unicode, Armenian script was encoded in one of a set of ASCII-like character encodings called ArmSCII. In all three main variants of ArmSCII, there is a slot for the Armenian eternity symbol. The sign comes in two versions; right- and left-facing:

Right- and left-facing Armenian eternity signs [source]

According to Michael Everson:

The Armenian Eternity Sign is the ancient national symbol of Armenia. Its glyph may have either a clockwise or an anti-clockwise orientation, which is composed with curves running from the centre of the symbol. Typically, the sign has eight such curves, a number which symbolizes revival, rebirth, and recurrence. 

The sign is known to be distinguished with both right and left rotations, which represent (more or less) activity and passivity, similarly to the svаsti sign used in Hinduism and Buddhism.

Personally, I find the "left- and right-facing" nomenclature somewhat confusing, but it made it into the Unicode standard:

֍ U+058D "RIGHT-FACING ARMENIAN ETERNITY SIGN"

֎ U+058E "LEFT-FACING ARMENIAN ETERNITY SIGN"

The latter codepoint has an annotation saying it "maps to AST 34.005:1997" which is ArmSCII-7.

The fact that two codepoints were to be added to Unicode (even though only one existed in ArmSCII) was the topic of some debate within the Unicode committee around 2010, along with where to actually place the two codepoints: either the "Armenian" block (the winner!) or "Miscellaneous Pictographic Symbols".

A similar discussion was had about the placement of the Armenian currency sign, dram (U+058F). It could have been placed in the "Currency Symbols" block, but it was positioned at the end of the "Armenian" block because it is "similar to the Armenian letter D" (section 7.1.1).

Monday, 17 January 2022

Unicode Trivia U+051C

Codepoint: U+051C "CYRILLIC CAPITAL LETTER WE"
Block: U+0500..052F "Cyrillic Supplement"

Cyrillic is a script, not an alphabet. There are many alphabets in the Cyrillic script for different languages.

The Kurdish language is usually written today using a Latin-based alphabet (Celadet Alî Bedirxan, 1932) or a modified Perso-Arabic alphabet (Sa’id Kaban Sedqi, 1928). In the past, a Cyrillic alphabet (Heciyê Cindî, 1946) was also used:

Аа Бб Вв Гг Г’г’ Дд Ее Әә Ә’ә’ Жж Зз Ии Йй Кк К’к’ Лл Мм Нн Оо Ӧӧ Пп П’п’ Рр Р’р’ Сс Тт Т’т’ Уу Фф Хх Һһ Һ’һ’ Чч Ч’ч’ Шш Щщ Ьь Ээ Ԛԛ Ԝԝ

The last two letters are not the Latin letters Q and W, they are the Kurdish Cyrillic letters Qa and We:

  • 'Ԛ' (U+051A "CYRILLIC CAPITAL LETTER QA")
  • 'ԛ' (U+051B "CYRILLIC SMALL LETTER QA")
  • 'Ԝ' (U+051C "CYRILLIC CAPITAL LETTER WE")
  • 'ԝ' (U+051D "CYRILLIC SMALL LETTER WE")

They were added to the "standard" Cyrillic alphabet to capture Kurdish sounds not found elsewhere.

Cyrillic We (U+051C) is a homoglyph of Latin W (U+0057), and vice versa. In Unicode parlance, they are "confusable".

[source]

Of course, Cyrillic We and Latin W are semantically different letters, even though they may look identical. The Unicode standard deals primarily with codepoints and not their visual representation, so having two distinct codepoints in this case makes sense. Other examples in the Unicode repertoire are less clear-cut.

There is a serious side to Unicode homoglyphs: there is a very real technology security threat associated with them. For this reason, the Unicode Consortium publishes a partial list of confusables with every release, along with mitigation guidelines.
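To get a feel for the problem, here is a crude mixed-script check using JavaScript's Unicode property escapes. It is only a sketch: `SCRIPT_TESTS` and `scriptsUsed` are invented names, only three scripts are covered, and this is not the detection mechanism described in the Unicode guidelines.

```javascript
// Scripts this sketch knows about; a real checker would need far more.
const SCRIPT_TESTS = {
  Cyrillic: /\p{Script=Cyrillic}/u,
  Greek: /\p{Script=Greek}/u,
  Latin: /\p{Script=Latin}/u,
};

// Return the sorted names of the known scripts that appear in `text`.
function scriptsUsed(text) {
  const found = new Set();
  for (const ch of text) {
    for (const [script, pattern] of Object.entries(SCRIPT_TESTS)) {
      if (pattern.test(ch)) found.add(script);
    }
  }
  return [...found].sort();
}

const genuine = "Wikipedia";      // starts with U+0057 LATIN CAPITAL LETTER W
const spoofed = "\u051Cikipedia"; // starts with U+051C CYRILLIC CAPITAL LETTER WE

console.log(genuine === spoofed);  // false, however similar they look
console.log(scriptsUsed(genuine)); // [ 'Latin' ]
console.log(scriptsUsed(spoofed)); // [ 'Cyrillic', 'Latin' ]
```

A single word that mixes Cyrillic and Latin is at least worth a second look.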

Sunday, 16 January 2022

Unicode Trivia U+047C

Codepoint: U+047C "CYRILLIC CAPITAL LETTER OMEGA WITH TITLO"
Block: U+0400..04FF "Cyrillic"

It is perhaps not surprising, given the history of writing, that there are so many references to religious aspects of letterforms in the Unicode standard. Take U+047C "CYRILLIC CAPITAL LETTER OMEGA WITH TITLO" as an example:

Taken from Unicode Cyrillic Chart

It sits in the "Historic letters" column of the "Cyrillic" block. A look at the official Unicode charts reveals the following annotations:

  • [alias] Cyrillic "beautiful omega"
  • [note] despite its name, this character does not have a titlo, nor is it composed of an omega plus a diacritic
  • [see also] A64C Ꙍ cyrillic capital letter broad omega

Apparently, this glyph (or something that looks similar) is used in Church Slavonic religious texts for the interjection "Oh!" However, there has been some discussion within the Unicode community about this codepoint and its lowercase version:

These characters were originally encoded in the Unicode standard with an erroneous name and representation. After the UTC ruling on Everson et al. (2006), the representation was corrected and an annotation was added to U+047C, reading “despite its name, this character does not have a titlo, nor is it composed of an omega plus a diacritic”. However, no annotation was added to the lowercase form U+047D.

The character that is encoded here is a ligature of the Cyrillic broad (or wide) Omega (encoded at U+A64C and U+A64D) and the ‘great apostrof’, a stylized diacritical mark consisting of the soft breathing (encoded at U+0486) and the Cyrillic kamora (encoded at U+0311). The broad Omega (U+A64D) can occur by itself, without this diacritical mark, in pre-1700 printed Church Slavic books, though not in modern liturgical texts. Functionally, the character with the diacritical mark is analogous to the Greek character ὦ, which also consists of an Omega, a soft breathing mark and a Perispomene. Both the Greek and Church Slavic characters have identical functions: to record the exclamation ‘Oh!’ Since U+047C and U+047D were encoded without a canonical decomposition, though they are linguistically decomposable, they should not be decomposed to avoid an encoding ambiguity. However, in our opinion, the annotation as written does not make this clear.
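The decomposition point in that passage is easy to see from a JavaScript console. In this small sketch (the `codepoints` helper is my own), U+047C is normalized alongside the Greek character it is compared with, U+1F66 "GREEK SMALL LETTER OMEGA WITH PSILI AND PERISPOMENI":

```javascript
// Format each codepoint of a string as "U+XXXX".
function codepoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

const cyrillicOmega = "\u047C"; // CYRILLIC CAPITAL LETTER OMEGA WITH TITLO
const greekOmega = "\u1F66";    // GREEK SMALL LETTER OMEGA WITH PSILI AND PERISPOMENI

// No canonical decomposition: NFD leaves the Cyrillic letter as one codepoint.
console.log(codepoints(cyrillicOmega.normalize("NFD")));

// The Greek letter splits into omega plus two combining marks.
console.log(codepoints(greekOmega.normalize("NFD")));
```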

There has been a suggestion to rename (or alias) U+047C to "CYRILLIC LETTER BROAD OH" with the observation:

In addition, the Unicode note “beautiful omega” should refer to A64C, not to this character.

At the time of writing there are no name aliases in the UCD for any of these codepoints.

It all goes to show that:

  1. Naming codepoints is a perilous task.
  2. The complexity of the competing interests makes errors inevitable.
  3. If mistakes are made, the Unicode stability policy makes fixing them difficult or unappealing.
  4. Unicode annotations can be more revealing than the raw UCD data.

Saturday, 15 January 2022

Unicode Trivia U+03C2

Codepoint: U+03C2 "GREEK SMALL LETTER FINAL SIGMA"
Block: U+0370..03FF "Greek and Coptic"

The modern lowercase (minuscule) Greek alphabet is encoded in the Unicode range U+03B1 to U+03C9 in the "Greek and Coptic" block:

αβγδεζηθικλμνξοπρςστυφχψω

The uppercase versions of these letters are in the range U+0391 to U+03A9:

ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡ΢ΣΤΥΦΧΨΩ

Tofu alert! Something nasty happens between the capitals rho "Ρ" and sigma "Σ". Here are those ranges rendered in a table:


The grey square is U+03A2: a "reserved" codepoint. This character is also reserved in earlier character sets (e.g. ISO/IEC 8859-7 of 1987) so it's not an anomaly of Unicode. It's the gap where "GREEK CAPITAL LETTER FINAL SIGMA" would sit if it actually existed.
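The gap can also be found programmatically. The sketch below (`unassignedIn` is a made-up helper) scans a range for codepoints whose general category is "Cn" (unassigned), using the Unicode property escapes that modern JavaScript engines support:

```javascript
// List the codepoints in [first, last] whose General_Category is Cn (unassigned).
function unassignedIn(first, last) {
  const gaps = [];
  for (let cp = first; cp <= last; cp++) {
    if (/\p{Cn}/u.test(String.fromCodePoint(cp))) {
      gaps.push("U+" + cp.toString(16).toUpperCase().padStart(4, "0"));
    }
  }
  return gaps;
}

console.log(unassignedIn(0x0391, 0x03A9)); // capitals: only U+03A2 is missing
console.log(unassignedIn(0x03B1, 0x03C9)); // small letters: no gaps
```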

Consider the titlecase Greek word for "Stasis":

Στασις

If we convert this to uppercase by using the following in the Chrome browser's console

"Στασις".toUpperCase()

we get

'ΣΤΑΣΙΣ'

as expected. All three sigmas (initial "Σ", medial "σ" and final "ς") get mapped to capital sigma "Σ":

ΣΤΑΣΙΣ

The UCD lowercase mapping of U+03A3 "GREEK CAPITAL LETTER SIGMA" only mentions U+03C3 "GREEK SMALL LETTER SIGMA". So one would think (like I naively did) that converting "ΣΤΑΣΙΣ" to lowercase would produce

στασισ

but if we use the browser console again

"ΣΤΑΣΙΣ".toLowerCase()

we actually get

'στασις'

(with a final sigma) which is correct but pleasantly unexpected.

The official reason the string mapping is correct is that final sigmas are "special" according to the Unicode standard. There's a file in the UCD named SpecialCasing.txt. Below is the relevant snippet from that text file:

# Special case for final form of sigma
03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA

This rule kicks in when the "Final_Sigma" condition is true independent of language.

In reality, every implementation of a lowercase mapping function must have special logic to handle final sigmas. Seriously.

As an example, here's the relevant functionality from the Chrome browser source code

// Really special case 1: upper case sigma.  This letter
// converts to two different lower case sigmas depending on
// whether or not it occurs at the end of a word.
if (next != 0 && Letter::Is(next)) {
  result[0] = 0x03C3;
} else {
  result[0] = 0x03C2;
}

This is a gotcha that's obviously bitten people more than once. Even the Unicode Consortium acknowledges it's a can of wormς


Friday, 14 January 2022

Unicode Trivia U+030C

Codepoint: U+030C "COMBINING CARON"
Block: U+0300..036F "Combining Diacritical Marks"

Imagine the role of Bjorn in a heavy metal ABBA tribute band (really?) who styles his name:

Bǰörn

There's a caron over the "j" and a (heavy metal) umlaut over the "o". That's five Unicode codepoints:

U+0042, U+01F0, U+00F6, U+0072, U+006E

When his name is converted to all uppercase for the tour poster:

BJ̌ÖRN

it mysteriously becomes six Unicode codepoints:

U+0042, U+004A, U+030C, U+00D6, U+0052, U+004E

This is because although Unicode has the codepoint U+01F0 "LATIN SMALL LETTER J WITH CARON", it has no single codepoint for "LATIN CAPITAL LETTER J WITH CARON". The case mapping algorithm uses data from the UCD to map U+01F0 to the pair U+004A/U+030C.

U+030C is the codepoint for "COMBINING CARON".

If we convert the name back to titlecase, we get:

Bǰörn

This looks the same (hopefully) but is also made up of six codepoints:

U+0042, U+006A, U+030C, U+00F6, U+0072, U+006E

The "O WITH DIAERESIS" round-tripped okay, but not the "J WITH CARON". What's going on?

Case mapping and case folding are very knotty problems. There are plenty of edge-cases in Unicode where converting to/from uppercase/lowercase and back again does not produce the original input. You cannot perform case-insensitive matching by simply converting both strings to uppercase and comparing for equality. Nor does converting to lowercase (or titlecase) work either.
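The failed round trip is easy to reproduce in a browser console. JavaScript has no built-in titlecase function, so this sketch goes to uppercase and back to lowercase instead; the `codepoints` helper and the variable names are my own.

```javascript
// Format each codepoint of a string as "U+XXXX".
function codepoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

const styled = "B\u01F0\u00F6rn";      // B, j-with-caron, o-with-diaeresis, r, n
const upper = styled.toUpperCase();    // J-with-caron has no single codepoint
const lowerAgain = upper.toLowerCase();

console.log(codepoints(styled).length);     // 5
console.log(codepoints(upper).join(" "));   // U+0042 U+004A U+030C U+00D6 U+0052 U+004E
console.log(codepoints(lowerAgain).length); // 6: "j" + combining caron survives

// Lowercasing directly and lowercasing via uppercase give different strings.
console.log(lowerAgain === styled.toLowerCase()); // false
```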

If we look at the UCD entry for U+01F0 "LATIN SMALL LETTER J WITH CARON", we see:

  1. dm = 006A 030C
  2. uc = 004A 030C
  3. lc = #
  4. tc = 004A 030C
  5. cf = 006A 030C

This can be interpreted as:

  1. The "Decomposition Mapping" is "U+006A U+030C". That's lowercase "j" followed by a combining caron.
  2. The (non-simple) "Uppercase Mapping" is "U+004A U+030C". That's uppercase "J" followed by a combining caron.
  3. The (non-simple) "Lowercase Mapping" is the unaltered codepoint, i.e. "U+01F0". That's the single codepoint "LATIN SMALL LETTER J WITH CARON".
  4. The (non-simple) "Titlecase Mapping" is "U+004A U+030C". That's uppercase "J" followed by a combining caron.
  5. The (non-simple) "Case Folding" is "U+006A U+030C". That's lowercase "j" followed by a combining caron.

From these definitions, one can imagine algorithms for:

  1. Decomposing strings into normalized forms (NFC/NFD/NFKC/NFKD) to avoid ambiguity, although there are still lots of additional complications.
  2. Converting strings to uppercase.
  3. Converting strings to lowercase.
  4. Converting strings to titlecase.
  5. Comparing strings in a case-insensitive way.

Further complications occur when more than one diacritic is attached to a letter. And then there's the question of ordering (collating) text with diacritics...
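As a sketch of the first and last items in that list: normalizing to NFC puts the "j" and the caron back together, and a comparison key built from normalization plus case conversion makes the two spellings match. JavaScript has no built-in case folding, so `caselessKey` (a made-up helper) only approximates it with an uppercase-then-lowercase pass; it is not the full default caseless matching defined by Unicode.

```javascript
// Approximate caseless key: decompose, push through both case mappings,
// then decompose again in case the mappings introduced composed characters.
function caselessKey(text) {
  return text.normalize("NFD").toUpperCase().toLowerCase().normalize("NFD");
}

const poster = "BJ\u030C\u00D6RN"; // six codepoints, as on the tour poster
const stage = "B\u01F0\u00F6rn";   // five codepoints, as originally styled

// Naive lowercase comparison fails: one side has U+01F0, the other j + U+030C.
console.log(poster.toLowerCase() === stage.toLowerCase()); // false

// Comparing the normalized keys succeeds.
console.log(caselessKey(poster) === caselessKey(stage)); // true

// NFC recomposes "j" + combining caron into the single codepoint U+01F0.
console.log("bj\u030C\u00F6rn".normalize("NFC") === "b\u01F0\u00F6rn"); // true
```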

Perhaps Bjorn was so busy wondering why there's no umlaut in "umlaut" that he missed a trick. He should have styled himself:


Ƀǰöṝñ

U+0243, U+01F0, U+00F6, U+1E5D, U+00F1

Thursday, 13 January 2022

Unicode Trivia U+02DB

Codepoint: U+02DB "OGONEK"
Block: U+02B0..02FF "Spacing Modifier Letters"

The English language doesn't really have diacritics except in loanwords (such as "café", "naïve", "façade" and "piñata") or in poetry (as with "belovèd"). As a consequence, many English-speakers struggle with the whole concept.

The many scripts and languages supported by Unicode make diacritics a thorny issue here too. Trawling through the UCD comes up with the following major instances:*

  1. ACUTE
    The acute accent:
     
    Ó
    U+00D3 "LATIN CAPITAL LETTER O WITH ACUTE"

  2. DOUBLE ACUTE
    The double acute accent (sometimes called the hungarumlaut):

    Ő
    U+0150 "LATIN CAPITAL LETTER O WITH DOUBLE ACUTE"

  3. GRAVE
    The grave accent:

    Ò
    U+00D2 "LATIN CAPITAL LETTER O WITH GRAVE"

  4. DOUBLE GRAVE
    The double grave accent (mainly used in Serbo-Croatian and Slovenian):

    Ȍ
    U+020C "LATIN CAPITAL LETTER O WITH DOUBLE GRAVE"

  5. CIRCUMFLEX
    The circumflex (easily confused with the inverted breve):

    Ô
    U+00D4 "LATIN CAPITAL LETTER O WITH CIRCUMFLEX"

  6. TILDE
    The tilde (in the Estonian alphabet "õ" is an independent letter):

    Õ
    U+00D5 "LATIN CAPITAL LETTER O WITH TILDE"

  7. DIAERESIS
    The diaeresis or umlaut:
    [In Unicode, the term "DIAERESIS" is preferred over "UMLAUT"]

    Ö
    U+00D6 "LATIN CAPITAL LETTER O WITH DIAERESIS"

  8. STROKE
    The stroke (in some Scandinavian alphabets "ø" is an independent letter):

    Ø
    U+00D8 "LATIN CAPITAL LETTER O WITH STROKE"

  9. MACRON
    The macron or line above:

    Ō
    U+014C "LATIN CAPITAL LETTER O WITH MACRON"

  10. BREVE
    The breve (easily confused with the caron or háček):

    Ŏ
    U+014E "LATIN CAPITAL LETTER O WITH BREVE"

  11. INVERTED BREVE
    The inverted breve or arch (easily confused with the circumflex):

    Ȏ
    U+020E "LATIN CAPITAL LETTER O WITH INVERTED BREVE"

  12. HORN
    The horn (used in Vietnamese):

    Ơ
    U+01A0 "LATIN CAPITAL LETTER O WITH HORN"

  13. CARON
    The caron or háček (easily confused with the breve):
    [Since Unicode 1.1, the term "CARON" is preferred over "HACEK"]

    Ǒ
    U+01D1 "LATIN CAPITAL LETTER O WITH CARON"

  14. DOT ABOVE
    The dot above or overdot:

    Ȯ
    U+022E "LATIN CAPITAL LETTER O WITH DOT ABOVE"

  15. DOT BELOW
    The dot below or underdot:

    Ọ
    U+1ECC "LATIN CAPITAL LETTER O WITH DOT BELOW"

  16. HOOK ABOVE
    The hook above (used in Vietnamese):

    Ỏ
    U+1ECE "LATIN CAPITAL LETTER O WITH HOOK ABOVE"

  17. LONG STROKE OVERLAY
    The long stroke overlay ("ꝋ" was a medieval abbreviation for the Latin obiit "he died"):

    Ꝋ
    U+A74A "LATIN CAPITAL LETTER O WITH LONG STROKE OVERLAY"

  18. LOOP
    The loop ("ꝍ" is used for transliterating medieval Nordic vowels):

    Ꝍ
    U+A74C "LATIN CAPITAL LETTER O WITH LOOP"

  19. BELT
    The belt ("ɬ" is used in IPA for the voiceless alveolar lateral fricative):

    Ɬ
    U+A7AD "LATIN CAPITAL LETTER L WITH BELT"

  20. LINE BELOW
    The line below (or macron below):

    Ḻ
    U+1E3A "LATIN CAPITAL LETTER L WITH LINE BELOW"

  21. STROKE
    The stroke ("ł" is a Polish dark L):

    Ł
    U+0141 "LATIN CAPITAL LETTER L WITH STROKE"

  22. CEDILLA
    The cedilla:

    Ç
    U+00C7 "LATIN CAPITAL LETTER C WITH CEDILLA"

  23. RING ABOVE
    The ring above or overring (used in many Scandinavian languages):

    Å
    U+00C5 "LATIN CAPITAL LETTER A WITH RING ABOVE"

  24. RING BELOW
    The ring below or underring:

    Ḁ
    U+1E00 "LATIN CAPITAL LETTER A WITH RING BELOW"

  25. OGONEK
    The ogonek (usually applied to vowels):

    Ǫ
    U+01EA "LATIN CAPITAL LETTER O WITH OGONEK"

The Polish ogonek (literally "little tail") is applied to the letters "A" and "E":

Ąą Ęę

According to Adam Twardoch, the Polish ogonek isn't simply an accent...

It's much more a character element, just like a stem, a serif or a descent. In a vast majority of cases ogonek should be smoothly connected with the base glyph, it should be a part of the glyph.

Wikimedia Commons

If you search the UCD, you'll find 18 references to "ogonek":

  • U+0104 "LATIN CAPITAL LETTER A WITH OGONEK"
  • U+0105 "LATIN SMALL LETTER A WITH OGONEK"
  • U+0118 "LATIN CAPITAL LETTER E WITH OGONEK"
  • U+0119 "LATIN SMALL LETTER E WITH OGONEK"
  • U+012E "LATIN CAPITAL LETTER I WITH OGONEK"
  • U+012F "LATIN SMALL LETTER I WITH OGONEK"
  • U+0172 "LATIN CAPITAL LETTER U WITH OGONEK"
  • U+0173 "LATIN SMALL LETTER U WITH OGONEK"
  • U+01EA "LATIN CAPITAL LETTER O WITH OGONEK"
  • U+01EB "LATIN SMALL LETTER O WITH OGONEK"
  • U+01EC "LATIN CAPITAL LETTER O WITH OGONEK AND MACRON"
  • U+01ED "LATIN SMALL LETTER O WITH OGONEK AND MACRON"
  • U+02DB "OGONEK"
  • U+0328 "COMBINING OGONEK"
  • U+04BE "CYRILLIC CAPITAL LETTER ABKHASIAN CHE WITH DESCENDER"
    • Was named "CYRILLIC CAPITAL LETTER IE HOOK OGONEK" in Unicode 1.0
  • U+04BF "CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DESCENDER"
    • Was named "CYRILLIC SMALL LETTER IE HOOK OGONEK" in Unicode 1.0
  • U+1AB7 "COMBINING OPEN MARK BELOW"
    • Notes include "see also combining ogonek - 0328"
  • U+1DCE "COMBINING OGONEK ABOVE"

Codepoint U+02DB is interesting. It's in the "Spacing clones of diacritics" column of the "Spacing Modifier Letters" block. This column contains six codepoints (including U+02DB) which fill in the gaps for "standalone" diacritics; that is, codepoints for diacritics that take up space without the need for the combining equivalent being applied to a letter.

So, if you're talking about ogoneks in general and want to include them in text without being attached to another glyph, you can just use U+02DB:

An ogonek looks like “˛”

There will be visible differences between this and a combining ogonek with a standard space (U+0020):

An ogonek looks like “ ̨”

A combining ogonek with a non-breaking space (U+00A0):

An ogonek looks like “ ̨”

And a combining ogonek with a dotted circle (U+25CC):

An ogonek looks like “◌̨”
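The difference between the spacing clone and the combining mark is also visible in the character properties. Here is a minimal sketch, assuming Python's `unicodedata` module; the constant and variable names are mine, purely for illustration.

```python
import unicodedata

SPACING_OGONEK = "\u02DB"    # the standalone "spacing clone"
COMBINING_OGONEK = "\u0328"  # attaches to the preceding character

for ch in (SPACING_OGONEK, COMBINING_OGONEK):
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),   # general category
        unicodedata.combining(ch),  # canonical combining class (0 = not combining)
    )

# Compatibility decomposition of the spacing clone.
clone_decomposed = unicodedata.normalize("NFKD", SPACING_OGONEK)
print([f"U+{ord(c):04X}" for c in clone_decomposed])

# Canonical decomposition of a precomposed letter (U+0104, A with ogonek).
a_decomposed = unicodedata.normalize("NFD", "\u0104")
print([f"U+{ord(c):04X}" for c in a_decomposed])
```

In other words, the UCD itself treats U+02DB as compatibility-equivalent to a space followed by the combining ogonek, which is the first of the alternatives shown above.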

One of the few times you depict ogoneks in isolation is when you're talking about how to depict ogoneks in isolation.

* Good luck with the text rendering in your browser here!

Wednesday, 12 January 2022

Unicode Trivia U+0256

Codepoint: U+0256 "LATIN SMALL LETTER D WITH TAIL"
Block: U+0250..02AF "IPA Extensions"

The "Ð" (U+00D0 "LATIN CAPITAL LETTER ETH") that we first met in the Icelandic alphabet is easily confused with other Unicode codepoints:

  • "Đ" (U+0110 "LATIN CAPITAL LETTER D WITH STROKE")
  • "Ɖ" (U+0189 "LATIN CAPITAL LETTER AFRICAN D")

These three codepoints are considered distinct according to the Unicode standard but are typically rendered (almost) identically.

The lowercase mapping of the African D (which sits in the "Latin Extended-B" block) is "ɖ" (U+0256 "LATIN SMALL LETTER D WITH TAIL") which sits in the "IPA Extensions" block. Note that the lowercase is not called "LATIN SMALL LETTER AFRICAN D" as you might expect. However, the uppercase mapping of "LATIN SMALL LETTER D WITH TAIL" is indeed "LATIN CAPITAL LETTER AFRICAN D", so they're obviously a pair.
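The case mappings are easy to check for yourself. The following is a small sketch using Python's string methods and the `unicodedata` module; `LOOKALIKES` and `lower_of` are names I've made up for the example.

```python
import unicodedata

LOOKALIKES = ("\u00D0", "\u0110", "\u0189")  # capital eth, D with stroke, African D

# Map each capital to its lowercase counterpart.
lower_of = {upper: upper.lower() for upper in LOOKALIKES}

for upper, lower in lower_of.items():
    print(f"U+{ord(upper):04X} {unicodedata.name(upper)}")
    print(f"  -> U+{ord(lower):04X} {unicodedata.name(lower)}")

# The mapping round-trips, even though the names don't match.
print("\u0256".upper() == "\u0189")
```

Three capitals that look alike lowercase to three letters that look nothing alike, which is a handy way to tell them apart.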

At first, I thought the African D naming inconsistency was because the two blocks ("Latin Extended-B" and "IPA Extensions") were added to the Unicode standard at different times. But the UCD tells us the codepoints were both in the original version 1.0, though under differently-named blocks.

An alternative reason for the codepoint naming inconsistency may be the history of the African D itself.

The African D is the sixth letter of the International African Alphabet developed in 1928 by Diedrich Hermann Westermann and others. This alphabet was a precursor to the African Reference Alphabet (1978) and the World Orthography alphabet (1948).

International African Alphabet

The International African Alphabet was itself developed from the International Phonetic Alphabet which has been evolving since 1888. However, the IPA famously does not have uppercase versions of its letters. Fortunately, the International Institute of African Languages and Cultures had already prepared such a mapping in "The Practical Orthography of African Languages" (1928) [transcript]:

[source]

The International Phonetic Alphabet does not have a dedicated block of its own in Unicode as its characters are mainly "borrowed" from other sources (e.g. Latin, Greek and Cyrillic blocks). During the original compilation of Unicode, any phonetic character that wasn't extant elsewhere was added to the new "Standard Phonetic" (later renamed "IPA Extensions") block. The original code charts show U+0189 (p.185 "Extended Latin") and U+0256 (p.189 "Standard Phonetic") with their current names and the expected case mapping between them.

In a Unicode meeting (ISO/IEC JTC 1/SC 2/WG 2 on 1994-06-01 item N 989), a suggestion was made to rename U+0189 to "LATIN CAPITAL LETTER OF LETTER D WITH TAIL", but this was withdrawn for unknown reasons. The alternatives are to rename U+0189 to "LATIN CAPITAL LETTER D WITH TAIL" which is obviously visually incorrect, or to rename U+0256 to "LATIN SMALL LETTER AFRICAN D". The last suggestion is problematic because I'm sure codepoint U+0256 is primarily used in the context of phonetics.

I believe the two codepoints were added to Unicode 1.0 with their current names knowingly inconsistent as a compromise. And compromises are what standards committees are all about.

Tuesday, 11 January 2022

Unicode Trivia U+01BF

Codepoint: U+01BF "LATIN LETTER WYNN"
Block: U+0180..024F "Latin Extended-B"

The Old English (Anglo-Saxon) alphabet (circa 8th to 12th centuries) had 24 letters:

Aa Ææ Bb Cc Dd Ðð Ee Ff Ᵹᵹ Hh Ii Ll Mm Nn Oo Pp Rr Ss Tt Uu Ƿƿ Xx Yy Þþ

The twenty-first letter (uppercase "Ƿ", lowercase "ƿ") is named "wynn". This alphabet was derived from the Latin alphabet, which had no dedicated letter for the /w/ sound, so the digraph "uu" was used instead. Hence the name "double-u".

Later, "uu" was replaced by the runic symbol "ᚹ" (U+16B9 "RUNIC LETTER WUNJO WYNN W") that morphed into the Latin wynn. After the Norman Conquest, French scribes abandoned the wynn, possibly because it was easily confused with "Pp" and/or "Þþ" (thorn), and reverted to "double-u".

There are many infographics about the evolution of the English alphabet, including one by Useful Charts. I've tried to construct one from the perspective of codepoints in Unicode 14.0. I've also ignored minuscules which means that topics such as Carolingian are omitted.

There's an interactive rendition of this table on the "Unicode Tour" web page that accompanies these blog posts.

Unicode Trivia U+0132

Codepoint: U+0132 "LATIN CAPITAL LIGATURE IJ"
Block: U+0100..017F "Latin Extended-A"

[If you haven't already noticed, I'm trying to come up with a mildly interesting fact about one codepoint in every Unicode block. As of Version 14.0, that's 320 blocks.]

The Latin script is the basis of many alphabets. Of the languages whose Latin-script alphabets can (currently) be expressed as single codepoints, here is a selection [source]:

Danish
Aa Bb Cc Dd Ee Ff Gg Hh Ii Jj Kk Ll Mm Nn Oo Pp Qq Rr Ss Tt Uu Vv Ww Xx Yy Zz Ææ Øø Åå

Dutch
Aa Bb Cc Dd Ee Ff Gg Hh Ii Jj Kk Ll Mm Nn Oo Pp Qq Rr Ss Tt Uu Vv Ww Xx Yy IJij Zz

English
Aa Bb Cc Dd Ee Ff Gg Hh Ii Jj Kk Ll Mm Nn Oo Pp Qq Rr Ss Tt Uu Vv Ww Xx Yy Zz

Estonian
Aa Bb Cc Dd Ee Ff Gg Hh Ii Jj Kk Ll Mm Nn Oo Pp Qq Rr Ss Šš Zz Žž Tt Uu Vv Ww Õõ Ää Öö Üü Xx Yy

Faroese
Aa Áá Bb Dd Ðð Ee Ff Gg Hh Ii Íí Jj Kk Ll Mm Nn Oo Óó Pp Rr Ss Tt Uu Úú Vv Yy Ýý Ææ Øø 

Finnish
Aa Bb Cc Dd Ee Ff Gg Hh Ii Jj Kk Ll Mm Nn Oo Pp Qq Rr Ss Tt Uu Vv Xx Yy Zz Ää Öö

French
Aa Bb Cc Dd Ee Ff Gg Hh Ii Jj Kk Ll Mm Nn Oo Pp Qq Rr Ss Tt Uu Vv Ww Xx Yy Zz

German
Aa Bb Cc Dd Ee Ff Gg Hh Ii Jj Kk Ll Mm Nn Oo Pp Qq Rr Ss Tt Uu Vv Ww Xx Yy Zz Ää Öö Üü
ẞß

Icelandic
Aa Áá Bb Dd Ðð Ee Éé Ff Gg Hh Ii Íí Jj Kk Ll Mm Nn Oo Óó Pp Rr Ss Tt Uu Úú Vv Xx Yy Ýý Þþ Ææ Öö

Irish
Aa Bb Cc Dd Ee Ff Gg Hh Ii Jj Kk Ll Mm Nn Oo Pp Qq Rr Ss Tt Uu Vv Ww Xx Yy Zz

Italian
Aa Bb Cc Dd Ee Ff Gg Hh Ii Ll Mm Nn Oo Pp Qq Rr Ss Tt Uu Vv Zz

Latvian
Aa Āā Bb Cc Čč Dd Ee Ēē Ff Gg Ģģ Hh Ii Īī Jj Kk Ķķ Ll Ļļ Mm Nn Ņņ Oo Pp Rr Ss Šš Tt Uu Ūū Vv Zz Žž

Lithuanian
Aa Ąą Bb Cc Čč Dd Ee Ęę Ėė Ff Gg Hh Ii Įį Yy Jj Kk Ll Mm Nn Oo Pp Rr Ss Šš Tt Uu Ųų Ūū Vv Zz Žž

Norwegian
Aa Bb Cc Dd Ee Ff Gg Hh Ii Jj Kk Ll Mm Nn Oo Pp Qq Rr Ss Tt Uu Vv Ww Xx Yy Zz Ææ Øø Åå

Polish
Aa Ąą Bb Cc Ćć Dd Ee Ęę Ff Gg Hh Ii Jj Kk Ll Łł Mm Nn Ńń Oo Óó Pp Rr Ss Śś Tt Uu Ww Yy Zz Źź Żż

Portuguese
Aa Bb Cc Dd Ee Ff Gg Hh Ii Jj Ll Mm Nn Oo Pp Qq Rr Ss Tt Uu Vv Xx Zz

Romanian
Aa Ăă Ââ Bb Cc Dd Ee Ff Gg Hh Ii Îî Jj Kk Ll Mm Nn Oo Pp Qq Rr Ss Şş Tt Ţţ Uu Vv Ww Xx Yy Zz

Sami
Aa Áá Bb Cc Čč Dd Đđ Ee Ff Gg Hh Ii Jj Kk Ll Mm Nn Ŋŋ Oo Pp Rr Ss Šš Tt Ŧŧ Uu Vv Zz Žž

Swedish
Aa Bb Cc Dd Ee Ff Gg Hh Ii Jj Kk Ll Mm Nn Oo Pp Qq Rr Ss Tt Uu Vv Ww Xx Yy Zz Åå Ää Öö

Turkish
Aa Bb Cc Çç Dd Ee Ff Gg Ğğ Hh Iı İi Jj Kk Ll Mm Nn Oo Öö Pp Rr Ss Şş Tt Uu Üü Vv Yy Zz

Some languages use digraphs (or similar) in their alphabets which are not single codepoints in Unicode:

Albanian
Aa Bb Cc Çç Dd DH/dh Ee Ëë Ff Gg GJ/gj Hh Ii Jj Kk Ll LL/ll Mm Nn NJ/nj Oo Pp Qq Rr RR/rr Ss SH/sh Tt TH/th Uu Vv Xx XH/xh Yy Zz ZH/zh

Croatian
Aa Bb Cc Čč Ćć Dd DŽ/dž Đđ Ee Ff Gg Hh Ii Jj Kk Ll LJ/lj Mm Nn NJ/nj Oo Pp Rr Ss Šš Tt Uu Vv Zz Žž

Czech
Aa Bb Cc Čč Dd Ee Ff Gg Hh CH/ch Ii Jj Kk Ll Mm Nn Oo Pp Rr Řř Ss Šš Tt Uu Vv Ww Xx Yy Zz Žž

Hungarian (standard)
Aa Áá Bb Cc CS/cs Dd DZ/dz DZS/dzs Ee Éé Ff Gg GY/gy Hh Ii Íí Jj Kk Ll LY/ly Mm Nn NY/ny Oo Óó Öö Őő Pp Qq Rr Ss SZ/sz Tt TY/ty Uu Úú Üü Űű Vv Zz ZS/zs

Spanish
Aa Bb Cc CH/ch Dd Ee Ff Gg Hh Ii Jj Kk Ll LL/ll Mm Nn Ññ Oo Pp Qq Rr RR/rr Ss Tt Uu Vv Ww Xx Yy Zz

Welsh
Aa Bb Cc CH/ch Dd DD/dd Ee Ff FF/ff Gg NG/ng Hh Ii Jj Ll LL/ll Mm Nn Oo Pp PH/ph Rr RH/rh Ss Tt TH/th Uu Ww Yy

According to Wikipedia, the largest Latin (and European) true alphabet is Slovak with 46 letters:

Slovak
Aa Áá Ää Bb Cc Čč Dd Ďď DZ/dz DŽ/dž Ee Éé Ff Gg Hh CH/ch Ii Íí Jj Kk Ll Ĺĺ Ľľ Mm Nn Ňň Oo Óó Ôô Pp Qq Rr Ŕŕ Ss Šš Tt Ťť Uu Úú Vv Ww Xx Yy Ýý Zz Žž

Italian has just 21.

A number of digraphs (and similar) are encoded in Unicode:

  • DZ, Dz, dz (U+01F1, U+01F2, U+01F3)
  • DŽ, Dž, dž (U+01C4, U+01C5, U+01C6)
  • IJ, ij (U+0132, U+0133)
  • LJ, Lj, lj (U+01C7, U+01C8, U+01C9)
  • NJ, Nj, nj (U+01CA, U+01CB, U+01CC)

Of these, the Dutch "IJ" (U+0132 "LATIN CAPITAL LIGATURE IJ") is unusual in not having a titlecase mapping distinct from its uppercase mapping. Consider the Dutch word "ijsje" meaning ice cream; in uppercase it is "IJSJE" and in titlecase it is "IJsje" not "Ijsje".
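You can see the difference by comparing the uppercase and titlecase mappings of the lowercase digraph codepoints. What follows is just a sketch using Python's built-in `str.upper()` and `str.title()`; the dictionary and the helper `has_distinct_titlecase` are my own names, not anything from the standard.

```python
# Lowercase forms of the encoded digraphs listed above.
LOWERCASE_DIGRAPHS = {
    "dz": "\u01F3",
    "dž": "\u01C6",
    "ij": "\u0133",
    "lj": "\u01C9",
    "nj": "\u01CC",
}


def has_distinct_titlecase(ch):
    """True if the titlecase mapping differs from the uppercase mapping."""
    return ch.title() != ch.upper()


for label, ch in LOWERCASE_DIGRAPHS.items():
    print(
        label,
        f"upper=U+{ord(ch.upper()):04X}",
        f"title=U+{ord(ch.title()):04X}",
        has_distinct_titlecase(ch),
    )

# "ijsje" written with the ligature codepoint versus two separate letters.
word_ligature = "\u0133sje"
word_plain = "ijsje"
print(word_ligature.title(), word_plain.title())
```

Note that the correct Dutch titlecase only falls out automatically when the text uses the single U+0133 codepoint. With a plain "i" followed by "j", a generic titlecase routine has no idea the two letters belong together.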

This quirk is considered by many as evidence that "IJ" should be considered a true letter in its own right. That possibly includes this purveyor of alcohol:

"Slijterij" means "Off-Licence"

However, the replacement of the "IJ" tile with "Y" in Dutch Scrabble may be another nail in the coffin.

Monday, 10 January 2022

Unicode Trivia U+00F0

Codepoint: U+00F0 "LATIN SMALL LETTER ETH"
Block: U+0080..00FF "Latin-1 Supplement"

The Basic Latin Unicode block (U+0000..007F) is fine if you're writing English, but it quickly runs out of steam for other languages. For example, the Icelandic alphabet ("stafrófið") has 32 letters:

Aa Áá Bb Dd Ðð Ee Éé Ff Gg Hh Ii Íí Jj Kk Ll Mm Nn Oo Óó Pp Rr Ss Tt Uu Úú Vv Xx Yy Ýý Þþ Ææ Öö

The fifth letter "ð" is "eth", seen here in a sign in Landmannalaugar:

One of Unicode's founding principles is "universal repertoire" and, indeed, "LATIN SMALL LETTER ETH" has been assigned the unique codepoint U+00F0. Before Unicode, region-specific characters had a tendency to "move around" in codepoint space. For instance, in IBM DOS Code Page 861, small eth was at position 0x8C.
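To see the "moving around" in practice, here is a quick sketch that encodes small eth with a few of the legacy codecs that ship with Python. The choice of codecs is mine: "cp850" (another DOS code page) and "latin-1" are included only for contrast and are not mentioned above.

```python
ETH = "\u00F0"  # LATIN SMALL LETTER ETH

# Encode the same character under several encodings and compare the bytes.
encoded = {
    codec: ETH.encode(codec)
    for codec in ("cp861", "cp850", "latin-1", "utf-8")
}

for codec, raw in encoded.items():
    print(f"{codec:8} {raw.hex()}")
```

The Unicode codepoint stays put at U+00F0 whichever byte encoding is used to store it.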

For every codepoint, the Unicode Character Database maintains a plethora of information. For U+00F0, we can view that data using an online utility:

https://util.unicode.org/UnicodeJsps/character.jsp?a=00F0

Amongst other things, we see that:

  • The official name of that codepoint is "LATIN SMALL LETTER ETH"
  • It belongs to the "Latin-1 Supplement" block (U+0080..00FF)
  • It primarily belongs to the "Latin" script
  • It was introduced in Unicode 1.1
  • Its General Category is "Lowercase Letter"
  • Its uppercase mapping is "Ð" U+00D0
  • Its titlecase mapping is also "Ð" U+00D0
  • etc.
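Some of these properties can also be read programmatically. This is a minimal sketch, assuming Python's built-in `unicodedata` module, which exposes only a subset of the UCD: the block, script and "introduced in" version are not available through it. The `info` dictionary is just my own way of collecting the results.

```python
import unicodedata

ch = "\u00F0"

info = {
    "name": unicodedata.name(ch),
    "category": unicodedata.category(ch),  # "Ll" is Lowercase Letter
    "uppercase": ch.upper(),
    "titlecase": ch.title(),
}

for key, value in info.items():
    print(f"{key:10} {value}")

# Version of the UCD this Python build was compiled against.
print(unicodedata.unidata_version)
```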

The titlecase mapping is somewhat moot as there are no words in Icelandic that begin with "eth". This makes children's "A is for Apple"-style alphabet posters somewhat difficult to produce:

It also means that the uppercase letter "Ð" (U+00D0) only occurs in text where the whole word is uppercase, such as … erm … "STAFRÓFIÐ".

Saturday, 8 January 2022

Unicode Trivia U+0000

[This blog post is part of the Universe series of investigations]

Codepoint: U+0000 "CONTROL CODE <NUL>"
Block: U+0000..007F "Basic Latin"

Unicode is built on top of any number of standards. The history of the first Unicode codepoint (U+0000, a.k.a. control code <NUL>) is an object lesson in the workings of standards committees.

  • The Unicode 1.0 (1991) standard, along with ISO/IEC 10646, bases its first 128 codepoints on ISO/IEC 646.
  • ISO/IEC 646 (1972) was itself based on ASCII.
  • The first major release of the ASCII standard was ASA X3.4-1963.

In the late 1950s, American Telephone and Telegraph Company (AT&T) had stated to the predecessors of the ASA X3 committee a functional requirement for any new standard to have an all-zeroes character named NULL (or IDLE). See Chapter 13 of Coded Character Sets, History and Development by Charles E. Mackenzie, 1980.

This "all-zeroes" requirement was almost certainly due to the practice of leaving gaps in punched tape or cards that could be "overwritten" with other characters later on without having to reissue the whole tape or card deck. Similarly, AT&T also requested an all-ones bit pattern to delete a character by "punching out" all the holes in the row. As ASA X3.4-1963 was a 7-bit character set, this led to the <DEL> control code eventually ending up at U+007F.
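The arithmetic is simple enough to spell out. A tiny sketch, with constant names of my own choosing:

```python
import unicodedata

BITS = 7                 # ASA X3.4-1963 was a 7-bit code
NUL = 0                  # no holes punched
DEL = (1 << BITS) - 1    # every hole punched

print(f"NUL = {NUL:0{BITS}b} = U+{NUL:04X}")
print(f"DEL = {DEL:0{BITS}b} = U+{DEL:04X}")

# Both are control characters (general category "Cc") in Unicode.
print(unicodedata.category(chr(NUL)), unicodedata.category(chr(DEL)))
```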

Further back in time, the "all-zeroes" control code was part of the International Telegraph Alphabet No. 2 (ITA2) code of 1924. This, in turn, was a development of the Baudot (ITA1) code developed in the 1870s and patented in the United States in 1888:

The final three rows of the table above are:

  1. Figure switch
  2. Letter switch
  3. Instrument at rest

where "Instrument at rest" is the idle state for teleprinters, a.k.a. NUL.