Monday 17 January 2022

Unicode Trivia U+051C

Codepoint: U+051C "CYRILLIC CAPITAL LETTER WE"
Block: U+0500..052F "Cyrillic Supplement"

Cyrillic is a script, not an alphabet. There are many alphabets in the Cyrillic script for different languages.

The Kurdish language is usually written today using a Latin-based alphabet (Celadet Alî Bedirxan, 1932) or a modified Perso-Arabic alphabet (Sa’id Kaban Sedqi, 1928). In the past, a Cyrillic alphabet (Heciyê Cindî, 1946) was also used:

Аа Бб Вв Гг Г’г’ Дд Ее Әә Ә’ә’ Жж Зз Ии Йй Кк К’к’ Лл Мм Нн Оо Ӧӧ Пп П’п’ Рр Р’р’ Сс Тт Т’т’ Уу Фф Хх Һһ Һ’һ’ Чч Ч’ч’ Шш Щщ Ьь Ээ Ԛԛ Ԝԝ

The last two letters are not the Latin letters Q and W; they are the Kurdish Cyrillic letters Qa and We:

  • 'Ԛ' (U+051A "CYRILLIC CAPITAL LETTER QA")
  • 'ԛ' (U+051B "CYRILLIC SMALL LETTER QA")
  • 'Ԝ' (U+051C "CYRILLIC CAPITAL LETTER WE")
  • 'ԝ' (U+051D "CYRILLIC SMALL LETTER WE")
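The names in the list above can be checked against the Unicode character database. Here is a minimal Python sketch using the standard `unicodedata` module; the helper name `describe` is made up for this illustration.

```python
import unicodedata


def describe(ch):
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Walk the four Kurdish Cyrillic codepoints listed above.
for cp in range(0x051A, 0x051E):
    print(chr(cp), describe(chr(cp)))
```

The exact names reported depend on the Unicode version bundled with your Python build, but any reasonably recent release should know these four letters.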

They were added to the "standard" Cyrillic letters to capture Kurdish sounds that those letters did not already cover.

Cyrillic We (U+051C) is a homoglyph of Latin W (U+0057), and vice versa. In Unicode parlance, they are "confusable".


Of course, Cyrillic We and Latin W are semantically different letters, even though they may look identical. The Unicode standard deals primarily with codepoints and not their visual representation, so having two distinct codepoints in this case makes sense. Other examples in the Unicode repertoire are less clear-cut.
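To make the "distinct codepoints" point concrete, here is a small Python sketch showing that the two letters never compare equal, even after Unicode normalization. The variable and function names are invented for this example.

```python
import unicodedata

latin_w = "\u0057"      # LATIN CAPITAL LETTER W
cyrillic_we = "\u051C"  # CYRILLIC CAPITAL LETTER WE


def same_after_nfkc(a, b):
    """Compare two strings after NFKC compatibility normalization."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


print(latin_w == cyrillic_we)                 # False: different codepoints
print(same_after_nfkc(latin_w, cyrillic_we))  # False: normalization does not merge them
print(latin_w.encode("utf-8"), cyrillic_we.encode("utf-8"))  # different bytes too
```

Normalization only folds characters that Unicode defines as equivalent, and lookalikes from different scripts are not. Catching them needs a separate confusables check.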

There is a serious side to Unicode homoglyphs: they pose a very real security threat, because a string containing a lookalike letter can pass for a different string. For this reason, the Unicode Consortium publishes a partial list of confusables with every release, along with mitigation guidelines.
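As a rough illustration of one kind of mitigation, the Python sketch below flags strings that mix letters from more than one script. It is a crude heuristic of my own, not the Consortium's algorithm: Python's standard library has no script property, so the code assumes the first word of each character's Unicode name (for example "LATIN" or "CYRILLIC") is a good enough stand-in. The function names are hypothetical.

```python
import unicodedata


def scripts_used(text):
    """Guess the scripts in text from the first word of each letter's name.

    Assumption: the name prefix approximates the script. This is only a
    heuristic and will misjudge some characters.
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def looks_mixed(text):
    """Return True if letters from more than one guessed script appear."""
    return len(scripts_used(text)) > 1


print(looks_mixed("Wikipedia"))        # all Latin letters
print(looks_mixed("\u051Cikipedia"))   # starts with Cyrillic We
```

A check like this would not catch a string written entirely in lookalike letters from a single script, so real systems should rely on the published confusables data and guidelines rather than on this sketch.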
