Sgaw Karen Collation

Karen is collated based on syllables. A Karen syllable encoded in Unicode can be broken into 5 parts for collation:

<consonant><medial><vowel><tone><final>

Only the consonant is always present; one or more of the other parts may be empty in any given syllable. No character reordering is necessary.

Each of these parts of the syllable may be composed of one or more characters as delineated in the following tables.
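
One way to picture the five-part structure is as a tuple of ranks, one per part, compared left to right. The Python sketch below assumes the parts are weighted in the order shown (consonant first, final last), which the description implies but does not state outright. `SyllableKey` and `word_key` are hypothetical names, and the rank numbers stand in for positions in the tables of this document.

```python
from typing import Iterable, NamedTuple, Tuple


class SyllableKey(NamedTuple):
    """Ranks of the five syllable parts; 0 means the part is absent."""
    consonant: int
    medial: int = 0
    vowel: int = 0
    tone: int = 0
    final: int = 0


def word_key(syllables: Iterable[SyllableKey]) -> Tuple[SyllableKey, ...]:
    """A word sorts by its syllable keys, compared one syllable at a time."""
    return tuple(syllables)


# Three syllables that share consonant C1 and differ in the other parts.
bare = SyllableKey(consonant=1)
with_vowel = SyllableKey(consonant=1, vowel=2)
with_medial = SyllableKey(consonant=1, medial=1)

# Tuple comparison checks the medial before the vowel (assumed priority).
ordered = sorted([with_medial, with_vowel, bare])
```

Under this assumption an absent part (rank 0) always sorts before any written part in the same position, which matches the M0, V0, T0 and F0 rows listed first in each table.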

Consonants

C1 က U+1000
C2 U+1001
C3 U+1002
C4 U+1003
C5 U+1004
C6 U+1005
C7 U+1006
C8 U+1007
C9 U+1061
C10 U+100A
C11 U+1010
C12 U+1011
C13 U+1012
C14 U+1014
C15 U+1015
C16 U+1016
C17 U+1018
C18 U+1019
C19 U+101A
C20 U+101B
C21 U+101C
C22 U+101D
C23 U+101E
C24 U+101F
C25 U+1021
C26 U+1027

Note 1: The correct relative order is not the same as the numeric order of the code points themselves. This observation applies to all tables in this document (vowels, tones, etc.).
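
To illustrate Note 1, the following Python sketch builds a rank lookup from the Sgaw consonant table above and compares it with plain code point sorting. The list is copied from rows C1 to C26; `consonant_rank` is a hypothetical helper name.

```python
# Sgaw Karen consonants in collation order (rows C1..C26 of the table).
SGAW_CONSONANTS = [
    0x1000, 0x1001, 0x1002, 0x1003, 0x1004, 0x1005, 0x1006, 0x1007,
    0x1061, 0x100A, 0x1010, 0x1011, 0x1012, 0x1014, 0x1015, 0x1016,
    0x1018, 0x1019, 0x101A, 0x101B, 0x101C, 0x101D, 0x101E, 0x101F,
    0x1021, 0x1027,
]

# Map each consonant character to its 1-based rank (C1 -> 1, C2 -> 2, ...).
CONSONANT_RANK = {chr(cp): i for i, cp in enumerate(SGAW_CONSONANTS, start=1)}


def consonant_rank(ch: str) -> int:
    """Return the collation rank of a single consonant character."""
    return CONSONANT_RANK[ch]


letters = ["\u100A", "\u1061", "\u1007"]
by_code_point = sorted(letters)                      # U+1007, U+100A, U+1061
by_collation = sorted(letters, key=consonant_rank)   # U+1007, U+1061, U+100A
```

The two orderings differ because U+1061 (C9) has to sort before U+100A (C10) even though its code point is higher.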

Note 2: The character ၡ U+1061 is often incorrectly typed as ရှ U+101B U+103E. If the incorrect sequence is used, the word will sort in an unexpected place: under ရ U+101B with medial M1 rather than under ၡ U+1061. Implementers of the algorithm must decide whether to leave it to the typist to get this right or to correct the sequence automatically.
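
If an implementer chooses to correct the sequence automatically, the fix could be a pre-processing step run before key generation. The Python sketch below is only one possible policy, not a recommendation from this page; `normalize_sha` is a hypothetical name.

```python
WRONG_SHA = "\u101B\u103E"   # U+101B followed directly by U+103E
RIGHT_SHA = "\u1061"         # the single character U+1061


def normalize_sha(text: str) -> str:
    """Replace the mistyped two-character sequence with U+1061.

    Only U+101B immediately followed by U+103E is rewritten, so a
    sequence such as U+101B U+103D U+103E is left untouched.
    """
    return text.replace(WRONG_SHA, RIGHT_SHA)
```

This sketch rewrites every occurrence unconditionally, so it is only appropriate if U+101B U+103E never occurs legitimately in the text being sorted.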

See Note 2 in the Medials section for an additional issue that should be taken into consideration.

Medials

M0
M1 U+103E
M2 U+1060
M3 U+103B
M4 U+103C
M5 U+103D
M6 ျှ U+103B U+103E
M7 ြှ U+103C U+103E
M8 ွှ U+103D U+103E

Note 1: The case where there is no medial - here and in following tables - is also included so that the relative sequence is clear.

Note 2: Multiple medial marks are never used in Sgaw Karen. However, it is conceivable that someone may use combination medials when transcribing foreign words (Myanmar, in particular). For this reason, I have listed the possible combinations (M6 - M8). I have seen only one example of this, which appears in point number 27 of Reverend David Gilmore's Sgaw Karen Grammar. There he transcribes the Burmese name Saw Shwe Yaw as စီၤရွှ့ၣ်ယီၣ်.

This, like the Myanmar usage it comes from, uses ှ U+103E as part of the initial consonant itself rather than as a medial consonant. In combination with ရ U+101B, it forms a completely different consonant and should be sorted in a totally different place. However, for (I believe) all other characters, simply treating ှ U+103E as a medial would result in the word being sorted in the expected order.
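
Because M6 to M8 are two-character sequences that begin with a single-character medial, a parser has to try the longer match first. The Python sketch below shows one way to do that using the medial table above; `match_medial` is a hypothetical name, and it does not attempt the special handling of ရ U+101B discussed in Note 2.

```python
# Medials in collation order; index 0 is M0 (no medial).
MEDIALS = [
    "",                 # M0
    "\u103E",           # M1
    "\u1060",           # M2
    "\u103B",           # M3
    "\u103C",           # M4
    "\u103D",           # M5
    "\u103B\u103E",     # M6
    "\u103C\u103E",     # M7
    "\u103D\u103E",     # M8
]
MEDIAL_RANK = {m: i for i, m in enumerate(MEDIALS) if m}


def match_medial(text: str, pos: int):
    """Return (rank, next_pos) for the medial starting at pos.

    Two-character medials are tried before single characters so that
    U+103D U+103E is read as M8, not as M5 followed by a stray U+103E.
    """
    for length in (2, 1):
        chunk = text[pos:pos + length]
        if len(chunk) == length and chunk in MEDIAL_RANK:
            return MEDIAL_RANK[chunk], pos + length
    return 0, pos  # M0: nothing consumed
```

For example, `match_medial("\u101B\u103D\u103E", 1)` consumes both marks and returns rank 8.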

Vowels

V0
V1 U+102B
V2 U+1036
V3 U+1062
V4 U+102F
V5 U+1030
V6 U+1037
V7 U+1032
V8 U+102D
V9 U+102E

Note 1: If no vowel is written, and a tone mark is present, then V1 is implied and should be collated accordingly.

Tones

T0
T1 ၢ် U+1062 U+103A
T2 ာ် U+102C U+103A
T3 U+1038
T4 ၣ် U+1063 U+103A
T5 U+1064

Note 1: The multi-character tones are treated as one unit for collation purposes.
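
The vowel and tone tables interact in two ways: U+1062 is vowel V3 on its own but part of tone T1 when followed by U+103A, and a tone with no written vowel implies V1. The Python sketch below handles both cases for Sgaw Karen using the tables above; `match_tone` and `parse_vowel_tone` are hypothetical names.

```python
SGAW_VOWELS = ["\u102B", "\u1036", "\u1062", "\u102F", "\u1030",
               "\u1037", "\u1032", "\u102D", "\u102E"]            # V1..V9
SGAW_TONES = ["\u1062\u103A", "\u102C\u103A", "\u1038",
              "\u1063\u103A", "\u1064"]                            # T1..T5
VOWEL_RANK = {v: i for i, v in enumerate(SGAW_VOWELS, start=1)}
TONE_RANK = {t: i for i, t in enumerate(SGAW_TONES, start=1)}


def match_tone(text: str, pos: int):
    """Return (rank, next_pos) for a tone at pos, or None if there is none."""
    for length in (2, 1):  # multi-character tones are one unit
        chunk = text[pos:pos + length]
        if len(chunk) == length and chunk in TONE_RANK:
            return TONE_RANK[chunk], pos + length
    return None


def parse_vowel_tone(text: str, pos: int):
    """Return (vowel_rank, tone_rank, next_pos) starting at pos."""
    vowel = 0
    # Try a tone first so that U+1062 U+103A is read as T1, not as V3.
    found = match_tone(text, pos)
    if found is None:
        ch = text[pos:pos + 1]
        if ch in VOWEL_RANK:
            vowel = VOWEL_RANK[ch]
            pos += 1
        found = match_tone(text, pos)
    if found is None:
        return vowel, 0, pos
    tone, pos = found
    if vowel == 0:
        vowel = 1  # no written vowel plus a tone mark implies V1
    return vowel, tone, pos
```

With these tables, `parse_vowel_tone("\u1000\u1062\u103A", 1)` yields vowel rank 1 (implied) and tone rank 1, while `parse_vowel_tone("\u1000\u1062", 1)` yields vowel rank 3 and no tone.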

Final

F0
F1 က် U+1000 U+103A
F2 ခ် U+1001 U+103A
F3 င် U+1004 U+103A
F4 စ် U+1005 U+103A
F5 ဆ် U+1006 U+103A
F6 ၡ် U+1061 U+103A
F7 ဇ် U+1007 U+103A
F8 ည် U+100A U+103A
F9 တ် U+1010 U+103A
F10 ထ် U+1011 U+103A
F11 ဒ် U+1012 U+103A
F12 န် U+1014 U+103A
F13 ပ် U+1015 U+103A
F14 ဖ် U+1016 U+103A
F15 ဘ် U+1018 U+103A
F16 မ် U+1019 U+103A
F17 ယ် U+101A U+103A
F18 လ် U+101C U+103A
F19 ဝ် U+101D U+103A
F20 သ် U+101E U+103A

Note 1: Finals are marked, as in Myanmar, with the Myanmar Sign Asat U+103A.

Note 2: F11 ဒ် and F16 မ် are usually ligatures rather than finals but can be either. See the next section for more information.

Special Ligatures

The following contractions expand as listed and sort according to their expansion:

မ် (U+1019 U+103A) -> ဒံ (U+1012 U+1036)
ဒ် (U+1012 U+103A) -> မီၤ (U+1019 U+102E U+1064)

Note 1: These ligatures are a consonant followed by the Myanmar Sign Asat U+103A. This is the same way that final consonants are marked, making it ambiguous whether the sequence should be collated as a final consonant or as a ligature. For this reason, I believe the algorithm is going to need to refer to a word list.

The following is a list of all known words which make use of a မ် or ဒ် as final consonants. There may be more. Please let me know if you know of any.

ကံလိကြဲၢ်မ်
ကြဲၣ်မ်
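
The word-list approach suggested in Note 1 could look roughly like the Python sketch below. It assumes the text has already been split into words and that the check is made on whole words; both are assumptions, as are the names `FINAL_CONSONANT_WORDS` and `expand_ligatures`. The expansions and the exception words are copied as listed above.

```python
import re

# Words in which the sequence is a true final consonant (list above).
FINAL_CONSONANT_WORDS = {"ကံလိကြဲၢ်မ်", "ကြဲၣ်မ်"}

# Contractions and their expansions, exactly as listed in this section.
LIGATURE_EXPANSIONS = {
    "\u1019\u103A": "\u1012\u1036",
    "\u1012\u103A": "\u1019\u102E\u1064",
}
_LIGATURE_RE = re.compile("|".join(map(re.escape, LIGATURE_EXPANSIONS)))


def expand_ligatures(word: str) -> str:
    """Expand ligatures for sorting unless the word is a known exception."""
    if word in FINAL_CONSONANT_WORDS:
        return word  # keep the sequence as a final consonant
    # A single regex pass avoids re-expanding text produced by an expansion.
    return _LIGATURE_RE.sub(lambda m: LIGATURE_EXPANSIONS[m.group(0)], word)
```

The expanded string would then be passed to the normal syllable parser, while words on the exception list keep the sequence and are collated with F11 or F16 as the final.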

Pwo Karen Collation

The Pwo Karen and Sgaw Karen orthographies are quite similar. However, in several important cases, their customary dictionary order varies. Also, a few of the characters they use are different. Much of the discussion in the Sgaw Karen section above relates equally well to Pwo Karen.

Like Sgaw, Pwo should be collated by syllables. Each syllable can have up to five parts. They are:

<consonant><medial><vowel><tone><final>

Only the consonant is always present; one or more of the other parts may be empty in any given syllable. No character reordering is necessary.

Each of these parts of the syllable may be composed of one or more characters as delineated in the following tables.

Consonants

C1 က U+1000
C2 U+1001
C3 U+1002
C4 U+100E
C5 U+1004
C6 U+1005
C7 U+1006
C8 U+1007
C9 U+100A
C10 U+1061
C11 U+1010
C12 U+1011
C13 U+1012
C14 U+1014
C15 U+1015
C16 U+1016
C17 U+1018
C18 U+1019
C19 U+101A
C20 U+101B
C21 U+101C
C22 U+101D
C23 U+1065
C24 U+101F
C25 U+1021
C26 U+1027
C27 U+1066

Notes 1 and 2 from the Sgaw Karen section apply here as well.

Note 3: The Pwo character ၦ U+1066 may be incorrectly typed as ပှ U+1015 U+103E. However, the sequence ပှ U+1015 U+103E may also be used legitimately, which makes it harder for the collation algorithm to correct user input mistakes automatically.

Medials

M0
M1 U+103E
M2 U+1060
M3 U+103B
M4 U+103C
M5 U+103D
M6 ျှ U+103B U+103E
M7 ြှ U+103C U+103E
M8 ွှ U+103D U+103E

Notes 1 and 2 from the Sgaw Karen section apply here as well.

Vowels

V0
V1 U+102B
V2 U+1036
V3 U+1037
V4 U+1032
V5 U+1062
V6 U+1068
V7 U+102F
V8 U+1030
V9 U+102D
V10 U+102E

As in Sgaw Karen, if no vowel is written and a tone mark is present, then V1 is implied and should be collated accordingly.
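
Since the two orthographies rank some of the same characters differently, an implementation needs a separate table per language. The Python sketch below compares the Sgaw and Pwo vowel tables from this document; `vowel_rank` is a hypothetical helper name.

```python
# Vowels in collation order, copied from the two vowel tables (V1 onward).
VOWELS = {
    "sgaw": ["\u102B", "\u1036", "\u1062", "\u102F", "\u1030",
             "\u1037", "\u1032", "\u102D", "\u102E"],
    "pwo": ["\u102B", "\u1036", "\u1037", "\u1032", "\u1062",
            "\u1068", "\u102F", "\u1030", "\u102D", "\u102E"],
}
VOWEL_RANK = {
    lang: {v: i for i, v in enumerate(order, start=1)}
    for lang, order in VOWELS.items()
}


def vowel_rank(lang: str, ch: str) -> int:
    """Return the rank of a vowel in the given language's table."""
    return VOWEL_RANK[lang][ch]


pair = ["\u1062", "\u1037"]
sgaw_order = sorted(pair, key=lambda ch: vowel_rank("sgaw", ch))
pwo_order = sorted(pair, key=lambda ch: vowel_rank("pwo", ch))
```

Here U+1062 sorts before U+1037 in Sgaw (V3 before V6) but after it in Pwo (V5 after V3), so the same pair of syllables can come out in opposite orders.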

Tones

T0
T1 ၢ် U+1062 U+103A
T2 ာ် U+102C U+103A
T3 U+1038
T4 ၣ် U+1063 U+103A
T5 U+1064

The multi-character tones are treated as one unit for collation purposes.

Final

Finals are marked, as in Myanmar, with the Myanmar Sign Asat U+103A.

F0
F1 က် U+1000 U+103A
F2 ခ် U+1001 U+103A
F3 င် U+1004 U+103A
F4 စ် U+1005 U+103A
F5 ဆ် U+1006 U+103A
F6 ၡ် U+1061 U+103A
F7 ဇ် U+1007 U+103A
F8 ည် U+100A U+103A
F9 တ် U+1010 U+103A
F10 ထ် U+1011 U+103A
F11 ဒ် U+1012 U+103A
F12 န် U+1014 U+103A
F13 ပ် U+1015 U+103A
F14 ဖ် U+1016 U+103A
F15 ဘ် U+1018 U+103A
F16 မ် U+1019 U+103A
F17 ယ် U+101A U+103A
F18 လ် U+101C U+103A
F19 ဝ် U+101D U+103A
F20 သ် U+101E U+103A

F11 ဒ် and F16 မ် are usually ligatures rather than finals but can be either. See the Special Ligatures section under Sgaw Karen Collation for more information.

Myanmar Collation

See this paper MyanmarCollation.pdf.

development/collation.txt · Last modified: 2013/08/26 16:30 by ben