Sgaw Karen Collation
Karen is collated based on syllables. A Karen syllable encoded in Unicode can be broken into 5 parts for collation:
<consonant><medial><vowel><tone><final>
Only the consonant is always present; one or more of the other parts may be empty in any given syllable. No character reordering is necessary.
Each of these parts of the syllable may be composed of one or more characters as delineated in the following tables.
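The five-part structure suggests a sort key that is simply a tuple of five ranks per syllable. The following is a minimal sketch, not a definitive implementation. It assumes the syllable has already been split into its parts, that parts are compared in the order listed (consonant first, final last), and that words compare syllable by syllable. The names Syllable, syllable_key and word_key are hypothetical, and the rank tables hold only a few rows copied from the tables in this document.

```python
from typing import List, NamedTuple, Sequence, Tuple

# Partial rank tables copied from a few rows of the Sgaw Karen tables in this
# document. The empty string stands for "part not present" (rank 0).
CONSONANT = {"\u1000": 1, "\u1001": 2}       # C1, C2
MEDIAL = {"": 0, "\u103e": 1}                # M0, M1
VOWEL = {"": 0, "\u102b": 1, "\u1036": 2}    # V0, V1, V2
TONE = {"": 0, "\u1038": 3}                  # T0, T3
FINAL = {"": 0, "\u1000\u103a": 1}           # F0, F1


class Syllable(NamedTuple):
    """One syllable, already split into its five parts (hypothetical type)."""
    consonant: str
    medial: str = ""
    vowel: str = ""
    tone: str = ""
    final: str = ""


def syllable_key(s: Syllable) -> Tuple[int, ...]:
    # Assumption: parts are compared in the order the document lists them.
    return (CONSONANT[s.consonant], MEDIAL[s.medial], VOWEL[s.vowel],
            TONE[s.tone], FINAL[s.final])


def word_key(syllables: Sequence[Syllable]) -> List[Tuple[int, ...]]:
    # Assumption: a word sorts by its syllables, left to right.
    return [syllable_key(s) for s in syllables]
```

Because Python compares tuples and lists element by element, these keys can be passed directly to sorted() or list.sort().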
Consonants
C1 | က | U+1000 |
C2 | ခ | U+1001 |
C3 | ဂ | U+1002 |
C4 | ဃ | U+1003 |
C5 | င | U+1004 |
C6 | စ | U+1005 |
C7 | ဆ | U+1006 |
C8 | ဇ | U+1007 |
C9 | ၡ | U+1061 |
C10 | ည | U+100A |
C11 | တ | U+1010 |
C12 | ထ | U+1011 |
C13 | ဒ | U+1012 |
C14 | န | U+1014 |
C15 | ပ | U+1015 |
C16 | ဖ | U+1016 |
C17 | ဘ | U+1018 |
C18 | မ | U+1019 |
C19 | ယ | U+101A |
C20 | ရ | U+101B |
C21 | လ | U+101C |
C22 | ဝ | U+101D |
C23 | သ | U+101E |
C24 | ဟ | U+101F |
C25 | အ | U+1021 |
C26 | ဧ | U+1027 |
Note 1: The correct relative order is not the same as the numeric order of the code points themselves. This observation applies to all tables in this document (vowels, tones, etc.).
Note 2: The character ၡ U+1061 is often incorrectly typed as ရှ U+101B U+103E. If the incorrect sequence is used, the word will sort in a totally unexpected place. Implementers of the algorithm must decide whether to leave it to the typist to get this right or to correct it automatically.
See note 2 in the medial section for an additional issue that should be taken into consideration.
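As Note 1 points out, sorting by code point does not give the table order. A minimal sketch of the difference, assuming a plain rank lookup built from the consonant table above (the names SGAW_CONSONANTS and CONSONANT_RANK are hypothetical):

```python
# Sgaw Karen consonants in table order C1..C26, written as escapes so the
# code points are unambiguous.
SGAW_CONSONANTS = [
    "\u1000", "\u1001", "\u1002", "\u1003", "\u1004", "\u1005", "\u1006",
    "\u1007", "\u1061", "\u100a", "\u1010", "\u1011", "\u1012", "\u1014",
    "\u1015", "\u1016", "\u1018", "\u1019", "\u101a", "\u101b", "\u101c",
    "\u101d", "\u101e", "\u101f", "\u1021", "\u1027",
]

# Map each consonant to its 1-based position in the table (C1 -> 1, ...).
CONSONANT_RANK = {ch: i for i, ch in enumerate(SGAW_CONSONANTS, start=1)}

letters = ["\u1061", "\u100a", "\u1007"]

# Plain sorted() orders by code point: U+1007, U+100A, U+1061.
by_code_point = sorted(letters)

# Sorting by table rank places U+1061 (C9) before U+100A (C10).
by_table = sorted(letters, key=CONSONANT_RANK.__getitem__)
```

The same pattern applies to the medial, vowel, tone and final tables.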
Medials
M0 | ||
M1 | ှ | U+103E |
M2 | ၠ | U+1060 |
M3 | ျ | U+103B |
M4 | ြ | U+103C |
M5 | ွ | U+103D |
M6 | ျှ | U+103B U+103E |
M7 | ြှ | U+103C U+103E |
M8 | ွှ | U+103D U+103E |
Note 1: The case where there is no medial - here and in following tables - is also included so that the relative sequence is clear.
Note 2: Multiple medial marks are never used in Sgaw Karen. However, it is conceivable that someone may use combination medials when transcribing foreign words (Myanmar, in particular). For this reason, I have listed the possible combinations (M6 - M8). I have seen only one example of this which appeared in point number 27 of Reverend David Gilmore's Sgaw Karen Grammar. He there gives an example in which he transcribes the Burmese name Saw Shwe Yaw as စီၤရွှ့ၣ်ယီၣ်.
This usage, like the Myanmar usage it comes from, treats ှ U+103E as part of the initial consonant itself rather than as a medial consonant. In combination with ရ U+101B, it forms a completely different consonant and should be sorted in a totally different place. However, for (I believe) all other characters, simply treating it as a medial would still sort the syllable in the expected place.
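If an implementer chooses to correct the ရ + ှ case automatically, one option is a normalization pass before collation. The sketch below is only one possible approach. The function name normalize_sha is hypothetical, and rewriting the sequence when another medial stands between the two characters (as in the transcription example above) is an assumption that goes beyond what this document prescribes.

```python
import re

# U+101B, optionally one of the medials U+103B..U+103D, then U+103E.
# Assumption: any such sequence is intended as the consonant U+1061.
_MISTYPED_SHA = re.compile("\u101b([\u103b-\u103d]?)\u103e")


def normalize_sha(text: str, enabled: bool = True) -> str:
    """Rewrite U+101B ... U+103E as U+1061, keeping any medial in between."""
    if not enabled:
        return text  # leave it to the typist
    return _MISTYPED_SHA.sub("\u1061\\1", text)
```

The enabled flag reflects the choice left to implementers: whether to fix the input or to trust the typist.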
Vowels
V0 | ||
V1 | ါ | U+102B |
V2 | ံ | U+1036 |
V3 | ၢ | U+1062 |
V4 | ု | U+102F |
V5 | ူ | U+1030 |
V6 | ့ | U+1037 |
V7 | ဲ | U+1032 |
V8 | ိ | U+102D |
V9 | ီ | U+102E |
Note 1: If no vowel is written, and a tone mark is present, then V1 is implied and should be collated accordingly.
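The implied-vowel rule in Note 1 can be sketched as a small lookup wrapper. This assumes the vowel and tone have already been separated. The names VOWEL_RANK and effective_vowel_rank are hypothetical.

```python
# Sgaw Karen vowels in table order V1..V9; the empty string is V0.
VOWEL_RANK = {
    "": 0,
    "\u102b": 1, "\u1036": 2, "\u1062": 3, "\u102f": 4, "\u1030": 5,
    "\u1037": 6, "\u1032": 7, "\u102d": 8, "\u102e": 9,
}


def effective_vowel_rank(vowel: str, tone: str) -> int:
    """Return the vowel rank, treating 'no vowel + tone mark' as V1."""
    if vowel == "" and tone != "":
        return VOWEL_RANK["\u102b"]  # V1 is implied
    return VOWEL_RANK[vowel]
```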
Tones
T0 | ||
T1 | ၢ် | U+1062 U+103A |
T2 | ာ် | U+102C U+103A |
T3 | း | U+1038 |
T4 | ၣ် | U+1063 U+103A |
T5 | ၤ | U+1064 |
Note 1: The multi-character tones are treated as one unit for collation purposes.
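Treating a multi-character tone as one unit amounts to longest-match lookup at the current position. A minimal sketch, assuming the tone table above (the names TONES, TONE_RANK and match_tone are hypothetical):

```python
from typing import Tuple

# Sgaw Karen tones in table order T1..T5.
TONES = ["\u1062\u103a", "\u102c\u103a", "\u1038", "\u1063\u103a", "\u1064"]
TONE_RANK = {tone: i for i, tone in enumerate(TONES, start=1)}

# Try two-character tones before one-character tones.
_LONGEST_FIRST = sorted(TONES, key=len, reverse=True)


def match_tone(text: str, pos: int) -> Tuple[str, int]:
    """Return (tone, new_pos); tone is '' if no tone starts at pos."""
    for tone in _LONGEST_FIRST:
        if text.startswith(tone, pos):
            return tone, pos + len(tone)
    return "", pos
```

Note that U+1062 on its own is the vowel V3. It only counts as part of tone T1 when U+103A follows it, which is why the whole sequence has to be matched at once.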
Final
F0 | ||
F1 | က် | U+1000 U+103A |
F2 | ခ် | U+1001 U+103A |
F3 | င် | U+1004 U+103A |
F4 | စ် | U+1005 U+103A |
F5 | ဆ် | U+1006 U+103A |
F6 | ၡ် | U+1061 U+103A |
F7 | ဇ် | U+1007 U+103A |
F8 | ည် | U+100A U+103A |
F9 | တ် | U+1010 U+103A |
F10 | ထ် | U+1011 U+103A |
F11 | ဒ် | U+1012 U+103A |
F12 | န် | U+1014 U+103A |
F13 | ပ် | U+1015 U+103A |
F14 | ဖ် | U+1016 U+103A |
F15 | ဘ် | U+1018 U+103A |
F16 | မ် | U+1019 U+103A |
F17 | ယ် | U+101A U+103A |
F18 | လ် | U+101C U+103A |
F19 | ဝ် | U+101D U+103A |
F20 | သ် | U+101E U+103A |
Note 1: Finals are marked, as in Myanmar, with the Myanmar Sign Asat U+103A.
Note 2: F11 ဒ် and F16 မ် are usually ligatures rather than finals but can be either. See the next section for more information.
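Putting the tables together, a single syllable can be split into its five parts with a left-to-right scan. The following is a sketch under stated assumptions, not a definitive parser: it assumes the input is exactly one syllable, that finals are a consonant followed by U+103A as Note 1 states, and it does not attempt to resolve the ligature ambiguity mentioned in Note 2. The names Parts and split_syllable are hypothetical.

```python
from typing import NamedTuple

ASAT = "\u103a"
CONSONANTS = set(
    "\u1000\u1001\u1002\u1003\u1004\u1005\u1006\u1007\u1061\u100a\u1010"
    "\u1011\u1012\u1014\u1015\u1016\u1018\u1019\u101a\u101b\u101c\u101d"
    "\u101e\u101f\u1021\u1027"
)
# Consonants that appear in the finals table (F1..F20).
FINAL_CONSONANTS = set(
    "\u1000\u1001\u1004\u1005\u1006\u1061\u1007\u100a\u1010\u1011\u1012"
    "\u1014\u1015\u1016\u1018\u1019\u101a\u101c\u101d\u101e"
)
# Two-character medials and tones are listed first so they match first.
MEDIALS = ["\u103b\u103e", "\u103c\u103e", "\u103d\u103e",
           "\u103e", "\u1060", "\u103b", "\u103c", "\u103d"]
VOWELS = set("\u102b\u1036\u1062\u102f\u1030\u1037\u1032\u102d\u102e")
TONES = ["\u1062\u103a", "\u102c\u103a", "\u1063\u103a", "\u1038", "\u1064"]


class Parts(NamedTuple):
    consonant: str
    medial: str
    vowel: str
    tone: str
    final: str


def split_syllable(s: str) -> Parts:
    """Split one Sgaw Karen syllable into its five collation parts."""
    if not s or s[0] not in CONSONANTS:
        raise ValueError("syllable must start with a consonant")
    consonant, pos = s[0], 1
    medial = next((m for m in MEDIALS if s.startswith(m, pos)), "")
    pos += len(medial)
    vowel = ""
    # U+1062 followed by U+103A is tone T1, not vowel V3.
    if pos < len(s) and s[pos] in VOWELS and not s.startswith(ASAT, pos + 1):
        vowel = s[pos]
        pos += 1
    tone = next((t for t in TONES if s.startswith(t, pos)), "")
    pos += len(tone)
    final = ""
    if len(s) - pos == 2 and s[pos] in FINAL_CONSONANTS and s[pos + 1] == ASAT:
        final = s[pos:]
        pos += 2
    if pos != len(s):
        raise ValueError("unexpected characters at position %d" % pos)
    return Parts(consonant, medial, vowel, tone, final)
```

Splitting running text into syllables in the first place is a separate problem that this sketch leaves open.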
Special Ligatures
The following contractions expand as listed and sort according to their expansion:
မ် (U+1019 U+103A) -> ဒံ (U+1012 U+1036)
ဒ် (U+1012 U+103A) -> မီၤ (U+1019 U+102E U+1064)
Note 1: These ligatures are a consonant followed by the Myanmar Sign Asat U+103A. This is the same way that final consonants are marked, making it ambiguous whether the sequence should be collated as a final consonant or as a ligature. For this reason, I believe the algorithm is going to need to refer to a word list.
The following is a list of all known words that use မ် or ဒ် as a final consonant. There may be more. Please let me know if you know of any.
ကံလိကြဲၢ်မ် ကြဲၣ်မ်
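The word-list approach suggested in Note 1 might look like the sketch below. It expands the two contractions exactly as listed above unless the whole word appears in the exception list. Matching on whole words is an assumption, as are the names LIGATURES, FINAL_CONSONANT_WORDS and expand_ligatures. The two exception words are copied from the list above and the list may be incomplete.

```python
import re
from typing import Collection

# Contractions and their expansions, as listed in this section.
LIGATURES = {
    "\u1019\u103a": "\u1012\u1036",
    "\u1012\u103a": "\u1019\u102e\u1064",
}

# Known words in which the sequence is a true final consonant
# (copied from the list in this document).
FINAL_CONSONANT_WORDS = frozenset({"ကံလိကြဲၢ်မ်", "ကြဲၣ်မ်"})

_LIGATURE_RE = re.compile("|".join(re.escape(k) for k in LIGATURES))


def expand_ligatures(
    word: str, exceptions: Collection[str] = FINAL_CONSONANT_WORDS
) -> str:
    """Expand contractions for collation unless the word is an exception."""
    if word in exceptions:
        return word  # collate the sequence as a final consonant
    # Single pass, so an expansion is never re-scanned for further matches.
    return _LIGATURE_RE.sub(lambda m: LIGATURES[m.group(0)], word)
```

The expanded form is intended only for building the sort key. The stored text would stay as typed.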
Pwo Karen Collation
The Pwo Karen and Sgaw Karen orthographies are quite similar. However, in several important cases, their customary dictionary order varies. Also, a few of the characters they use are different. Much of the discussion in the Sgaw Karen section above relates equally well to Pwo Karen.
Like Sgaw, Pwo should be collated by syllables. Each syllable can have up to five parts. They are:
<consonant><medial><vowel><tone><final>
Only the consonant is always present; one or more of the other parts may be empty in any given syllable. No character reordering is necessary.
Each of these parts of the syllable may be composed of one or more characters as delineated in the following tables.
Consonants
C1 | က | U+1000 |
C2 | ခ | U+1001 |
C3 | ဂ | U+1002 |
C4 | ဎ | U+100E |
C5 | င | U+1004 |
C6 | စ | U+1005 |
C7 | ဆ | U+1006 |
C8 | ဇ | U+1007 |
C9 | ည | U+100A |
C10 | ၡ | U+1061 |
C11 | တ | U+1010 |
C12 | ထ | U+1011 |
C13 | ဒ | U+1012 |
C14 | န | U+1014 |
C15 | ပ | U+1015 |
C16 | ဖ | U+1016 |
C17 | ဘ | U+1018 |
C18 | မ | U+1019 |
C19 | ယ | U+101A |
C20 | ရ | U+101B |
C21 | လ | U+101C |
C22 | ဝ | U+101D |
C23 | ၥ | U+1065 |
C24 | ဟ | U+101F |
C25 | အ | U+1021 |
C26 | ဧ | U+1027 |
C27 | ၦ | U+1066 |
Notes 1 and 2 from the Sgaw Karen section apply here as well.
Note 3: The Pwo character ၦ U+1066 may be incorrectly typed as ပှ U+1015 U+103E. However, the sequence ပှ U+1015 U+103E may also be used legitimately, making it harder for the collation algorithm to handle user input mistakes automatically.
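Because the two orthographies order some letters differently, an implementation needs one rank table per language. A minimal sketch using the two consonant tables in this document (the names CONSONANT_ORDER and consonant_rank are hypothetical):

```python
from typing import Dict

# Consonants in table order for each language, written as escapes.
CONSONANT_ORDER = {
    "sgaw": (
        "\u1000\u1001\u1002\u1003\u1004\u1005\u1006\u1007\u1061\u100a"
        "\u1010\u1011\u1012\u1014\u1015\u1016\u1018\u1019\u101a\u101b"
        "\u101c\u101d\u101e\u101f\u1021\u1027"
    ),
    "pwo": (
        "\u1000\u1001\u1002\u100e\u1004\u1005\u1006\u1007\u100a\u1061"
        "\u1010\u1011\u1012\u1014\u1015\u1016\u1018\u1019\u101a\u101b"
        "\u101c\u101d\u1065\u101f\u1021\u1027\u1066"
    ),
}


def consonant_rank(language: str) -> Dict[str, int]:
    """Build a 1-based rank table (C1 -> 1, ...) for the given language."""
    return {ch: i for i, ch in enumerate(CONSONANT_ORDER[language], start=1)}


letters = ["\u100a", "\u1061"]
sgaw_sorted = sorted(letters, key=consonant_rank("sgaw").__getitem__)
pwo_sorted = sorted(letters, key=consonant_rank("pwo").__getitem__)
```

Here the same two letters come out in opposite orders: U+1061 precedes U+100A in Sgaw (C9, C10), while U+100A precedes U+1061 in Pwo (C9, C10).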
Medials
M0 | ||
M1 | ှ | U+103E |
M2 | ၠ | U+1060 |
M3 | ျ | U+103B |
M4 | ြ | U+103C |
M5 | ွ | U+103D |
M6 | ျှ | U+103B U+103E |
M7 | ြှ | U+103C U+103E |
M8 | ွှ | U+103D U+103E |
Notes 1 and 2 from the Sgaw Karen section apply here as well.
Vowels
V0 | ||
V1 | ါ | U+102B |
V2 | ံ | U+1036 |
V3 | ့ | U+1037 |
V4 | ဲ | U+1032 |
V5 | ၧ | U+1067 |
V6 | ၨ | U+1068 |
V7 | ု | U+102F |
V8 | ူ | U+1030 |
V9 | ိ | U+102D |
V10 | ီ | U+102E |
Like Sgaw Karen, if no vowel is written, and a tone mark is present, then V1 is implied and should be collated accordingly.
Tones
T0 | ||
T1 | ၢ် | U+1062 U+103A |
T2 | ာ် | U+102C U+103A |
T3 | း | U+1038 |
T4 | ၣ် | U+1063 U+103A |
T5 | ၤ | U+1064 |
The multi-character tones are treated as one unit for collation purposes.
Final
Finals are marked, as in Myanmar, with the Myanmar Sign Asat U+103A.
F0 | ||
F1 | က် | U+1000 U+103A |
F2 | ခ် | U+1001 U+103A |
F3 | င် | U+1004 U+103A |
F4 | စ် | U+1005 U+103A |
F5 | ဆ် | U+1006 U+103A |
F6 | ၡ် | U+1061 U+103A |
F7 | ဇ် | U+1007 U+103A |
F8 | ည် | U+100A U+103A |
F9 | တ် | U+1010 U+103A |
F10 | ထ် | U+1011 U+103A |
F11 | ဒ် | U+1012 U+103A |
F12 | န် | U+1014 U+103A |
F13 | ပ် | U+1015 U+103A |
F14 | ဖ် | U+1016 U+103A |
F15 | ဘ် | U+1018 U+103A |
F16 | မ် | U+1019 U+103A |
F17 | ယ် | U+101A U+103A |
F18 | လ် | U+101C U+103A |
F19 | ဝ် | U+101D U+103A |
F20 | သ် | U+101E U+103A |
F11 ဒ် and F16 မ် are usually ligatures rather than finals but can be either. See the Special Ligatures section under Sgaw Karen Collation for more information.
Myanmar Collation
See this paper MyanmarCollation.pdf.
Discussion