7294feb8 | 01-Sep-2015 |
John Marino <draco@marino.st> |
UTF-8: Multiple improvements (and detection of possible issue)
This commit started out intending to fix "digit" definition on unicode, which it mostly does, but a lot more happened in the end, namely:
* Digits apparently are not part of the CLDR definition. I added a section in the manual portion of the UTF-8 source file that defines digit classes for generated sections.
* Add numbers classification for entire UTF-8. Currently DragonFly and all BSDs do not support the "number" type. However, localedef understands it (it's supported on Illumos), but currently the number flag value is zero, so it's a no-op. A short-term goal is to have DragonFly be the first BSD with proper number ctype handling.
* Redefine "special" ctype once and for all. There is no definitive agreement on what "special" characters are. According to wiki, which got it from Unicode, it starts with 33 characters (0x20 - 0x2F, 0x3A - 0x40, 0x5B - 0x60, 0x7B - 0x7E). However, localedef objects to <space> because "special" sets the "graph" and "print" flags, and <space> can't be graph. As a result, <space> is not considered "special" here. Moreover, the punctuation in the Latin-1 supplement is "special". The division and multiplication signs are ambiguous, so I set them to special (since plus and minus signs are special). Finally, with the most doubt, the punctuation of the "general punctuation" block is also considered special, although I couldn't find convincing evidence either way. Given the lack of definition, I don't think the "special" classification is really used, especially not in Unicode.
* Fix NON-BREAK_SPACE classification (set as graph and space on previous commit).
* The MICRO character was also warning due to being classified as both lower (in the Greek section) and punctuation, so remove the punct class.
* When possible, don't define graph if digit is defined, and similarly with graph and punct. Both digit and punct also set the graph flag, so having both is redundant.
* Add several new block definitions:
  - Syloti Nagri
  - Common Indic Number Forms
  - Phags-pa
  - Saurashtra
  - Kayah Li
  - Rejang
  - Javanese
  - Cham
  - Tai Viet
  - Meetei Mayek & extension
* Detection of possible bug in localedef. The Tai Tham definitions are producing the wrong code, but there's nothing wrong with the definitions. The 6 unused characters between the two digit definitions should not be graphable, but as soon as one "digit" is defined after the first digit range is defined, all the characters in between are marked as graphable and digits. There are similar "fill-ins", but so far only with Tai Tham. It was detected while comparing the output of all "digit" types against a Python program that does the same, and this error was revealed. It requires further investigation into exactly what is causing it (and thus where the bug is), but right now it's either a bad definition elsewhere that affects Tai Tham or localedef has a bug somewhere (avl lookup?).
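The Python program used for the cross-check is not included in this commit. As a rough sketch of that kind of check, the snippet below lists the runs of decimal digits in the Tai Tham block, assuming that Unicode general category Nd (as reported by Python's unicodedata module) is the reference for "digit". The function name and the block bounds U+1A20..U+1AAF are supplied here for illustration and are not taken from the commit.

```python
import unicodedata


def decimal_digit_ranges(first, last):
    """Return inclusive (start, end) runs of code points in [first, last]
    whose Unicode general category is Nd (decimal digit)."""
    ranges = []
    start = None
    for cp in range(first, last + 1):
        is_digit = unicodedata.category(chr(cp)) == "Nd"
        if is_digit and start is None:
            start = cp                      # a new run of digits begins
        elif not is_digit and start is not None:
            ranges.append((start, cp - 1))  # the run ended at the previous code point
            start = None
    if start is not None:
        ranges.append((start, last))        # run extends to the end of the range
    return ranges


# Assumed block bounds for Tai Tham (illustrative constant, not from the commit).
TAI_THAM = (0x1A20, 0x1AAF)

if __name__ == "__main__":
    for lo, hi in decimal_digit_ranges(*TAI_THAM):
        print(f"U+{lo:04X}..U+{hi:04X}")
```

If the compiled locale reports digits for the code points between the two runs printed here, that matches the "fill-in" behavior described above.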
|
2fd39989 | 30-Aug-2015 |
John Marino <draco@marino.st> |
UTF8 locales: Refine Latin supplement more
The multiplication and division signs were missing, and the control characters were not outlined. Also set superscript 1, 2, 3 as digits. They are not showing up with the iswdigit() function, so that requires further investigation (iswdigit() does work for '0', '1', ... '9', however).
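As a side note for that investigation, the sketch below shows how Unicode itself classifies the superscripts, using Python's unicodedata module rather than the C library. It is only an illustration: the helper name describe is made up here, and the result does not by itself explain the iswdigit() behavior.

```python
import unicodedata

# ASCII '1' followed by SUPERSCRIPT ONE, TWO and THREE from the Latin-1 supplement.
SAMPLES = ["1", "\u00b9", "\u00b2", "\u00b3"]


def describe(ch):
    """Return (general category, str.isdecimal(), str.isdigit()) for one character."""
    return (unicodedata.category(ch), ch.isdecimal(), ch.isdigit())


if __name__ == "__main__":
    for ch in SAMPLES:
        category, is_decimal, is_digit = describe(ch)
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
              f"category={category} isdecimal={is_decimal} isdigit={is_digit}")
```

The superscripts fall in category No ("other number") rather than Nd ("decimal digit"), so a digit test built strictly on decimal digits would not include them.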
|
252b0055 | 15-Aug-2015 |
John Marino <draco@marino.st> |
Add 6 Arabic locales: AE EG JO MA QA SA
The lack of Arabic support on BSD seems to be a glaring omission, so let's help rectify that by adding six locales:
ar_AE: United Arab Emirates
ar_EG: Egypt
ar_JO: Jordan
ar_MA: Morocco
ar_QA: Qatar
ar_SA: Saudi Arabia
There are obviously more (e.g. Iraq, Kuwait, Algeria, Libya, Tunisia, Syria, etc.) but I selected these as being the most likely to be used in my limited opinion.
They are UTF-8 locales only. Most of them are identical to ar_SA except for the monetary definitions (each has a different currency)
|