development:collation [2013/08/26 16:30]
ben [Vowels]
development:collation [2013/08/26 16:30] (current)
ben [Special Ligatures]
 +==== Sgaw Karen Collation ====
 +
 +Karen is collated based on syllables. A Karen syllable encoded in Unicode can be broken
 +into 5 parts for collation:
 +
 +  <consonant><medial><vowel><tone><final>
 +
 +Only the consonant is always present; one or more of the other parts may be empty in any given syllable. No character reordering is necessary.
 +
 +Each of these parts of the syllable may be composed of one or more characters as delineated in the following tables.
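One way to picture the five-part breakdown is as a tuple of ranks. The sketch below is only an illustration: the name `SyllableKey` is hypothetical, the numbers stand for the C/M/V/T/F ranks from the tables in this document, and it assumes the parts are compared in the order listed, with the consonant most significant.

```python
from typing import NamedTuple


class SyllableKey(NamedTuple):
    """Hypothetical sort key: one rank per syllable part (0 = part is empty)."""
    consonant: int
    medial: int = 0
    vowel: int = 0
    tone: int = 0
    final: int = 0


# Tuples compare field by field, left to right, so the consonant rank
# dominates, then the medial, and so on (an assumption of this sketch).
bare_c1 = SyllableKey(consonant=1)
c1_v1 = SyllableKey(consonant=1, vowel=1)
c1_m1 = SyllableKey(consonant=1, medial=1)
bare_c2 = SyllableKey(consonant=2)

ordered = sorted([bare_c2, c1_m1, c1_v1, bare_c1])
```

A whole word could then be keyed as a tuple of such syllable keys, one per syllable.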
 +=== Consonants ===
 +
 +<WRAP column>
 +| C1 | က | U+1000 |
 +| C2 | ခ | U+1001 |
 +| C3 | ဂ | U+1002 |
 +| C4 | ဃ | U+1003 |
 +| C5 | င | U+1004 |
 +| C6 | စ | U+1005 |
 +| C7 | ဆ | U+1006 |
 +| C8 | ဇ | U+1007 |
 +| C9 | ၡ | U+1061 |
 +| C10 | ည | U+100A |
 +| C11 | တ | U+1010 |
 +| C12 | ထ | U+1011 |
 +| C13 | ဒ | U+1012 |
 +</WRAP>
 +
 +<WRAP column>
 +| C14 | န | U+1014 |
 +| C15 | ပ | U+1015 |
 +| C16 | ဖ | U+1016 |
 +| C17 | ဘ | U+1018 |
 +| C18 | မ | U+1019 |
 +| C19 | ယ | U+101A |
 +| C20 | ရ | U+101B |
 +| C21 | လ | U+101C |
 +| C22 | ဝ | U+101D |
 +| C23 | သ | U+101E |
 +| C24 | ဟ | U+101F |
 +| C25 | အ | U+1021 |
 +| C26 | ဧ | U+1027 |
 +</WRAP>
 +
 +**Note 1:** The correct relative order is not the same as the numeric order of the code points themselves. This observation applies to all tables in this document (vowels, tones, etc.).
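To illustrate Note 1, here is a minimal sketch that ranks consonants by their position in the table rather than by code point. The names `SGAW_CONSONANT_ORDER` and `CONSONANT_RANK` are made up for this example; the order itself is copied from the consonant table above.

```python
# Sgaw Karen consonants in collation order (C1..C26), copied from the table.
SGAW_CONSONANT_ORDER = [
    "\u1000", "\u1001", "\u1002", "\u1003", "\u1004", "\u1005", "\u1006",
    "\u1007", "\u1061", "\u100A", "\u1010", "\u1011", "\u1012", "\u1014",
    "\u1015", "\u1016", "\u1018", "\u1019", "\u101A", "\u101B", "\u101C",
    "\u101D", "\u101E", "\u101F", "\u1021", "\u1027",
]

# Rank 1 corresponds to C1, rank 26 to C26.
CONSONANT_RANK = {ch: i for i, ch in enumerate(SGAW_CONSONANT_ORDER, start=1)}

pair = ["\u100A", "\u1061"]  # C10 and C9

# Plain sorting compares code points, which puts U+100A before U+1061.
by_code_point = sorted(pair)

# Sorting by table rank puts U+1061 (C9) before U+100A (C10).
by_table = sorted(pair, key=CONSONANT_RANK.__getitem__)
```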
 +
 +**Note 2:** The character ၡ U+1061 is often incorrectly typed as ရှ U+101B U+103E. If the incorrect sequence is used, the word will sort in a totally unexpected way. Implementers of the algorithm must decide whether to leave it up to the typist to get this right or to correct the sequence automatically.
 +
 +See note 2 in the medial section for an additional issue that should be taken into consideration.
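If an implementer chooses to correct the mistyped sequence automatically, the substitution could be as small as the sketch below. The function name `normalize_sha` is hypothetical, and the sketch assumes that U+101B U+103E never needs to be kept as typed in Sgaw Karen text, which is a policy decision rather than something this page settles.

```python
WRONG_SEQUENCE = "\u101B\u103E"  # the two-character sequence typed by mistake
CORRECT_LETTER = "\u1061"        # the single Sgaw Karen letter (C9)


def normalize_sha(text: str) -> str:
    """Replace the mistyped two-character sequence with U+1061."""
    return text.replace(WRONG_SEQUENCE, CORRECT_LETTER)
```

Running such a step before building sort keys leaves correctly typed text untouched.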
 +
 +=== Medials ===
 +
 +<WRAP column>
 +| M0 | | |
 +| M1 | ှ | U+103E |
 +| M2 | ၠ | U+1060 |
 +| M3 | ျ | U+103B |
 +| M4 | ြ | U+103C |
 +| M5 | ွ | U+103D |
 +
 +| M6 | ျှ | U+103B U+103E |
 +| M7 | ြှ | U+103C U+103E |
 +| M8 | ွှ | U+103D U+103E |
 +</WRAP>
 +
 +**Note 1:** The case where there is no medial - here and in following tables - is also included so that the relative sequence is clear.
 +
 +**Note 2:** Multiple medial marks are never used in Sgaw Karen. However, it is conceivable that someone may use combination medials when transcribing foreign words (Myanmar, in particular). For this reason, I have listed the possible combinations (M6 - M8). I have seen only one example of this, which appears in point number 27 of Reverend David Gilmore's Sgaw Karen Grammar. There he transcribes the Burmese name Saw Shwe Yaw as စီၤရွှ့ၣ်ယီၣ်.
 +
 +This transcription, like the Myanmar spelling it comes from, uses ှ U+103E as part of the initial consonant itself rather than as a medial consonant. In combination with ရ U+101B, it forms a completely different consonant that should be sorted in a totally different place. However, for (I believe) all other characters, this simple approach would still sort the syllable in the expected order.
 +
 +
 +=== Vowels ===
 +
 +<WRAP column>
 +| V0 | | |
 +| V1 | ါ | U+102B |
 +| V2 | ံ | U+1036 |
 +| V3 | ၢ | U+1062 |
 +| V4 | ု | U+102F |
 +</WRAP>
 +
 +<WRAP column>
 +| V5 | ူ | U+1030 |
 +| V6 | ့ | U+1037 |
 +| V7 | ဲ | U+1032 |
 +| V8 | ိ | U+102D |
 +| V9 | ီ | U+102E |
 +</WRAP>
 +
 +**Note 1:** If no vowel is written, and a tone mark is present, then V1 is implied and should be collated accordingly.
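The implied-vowel rule in Note 1 can be sketched as a small adjustment applied once the vowel and tone ranks are known. The function name `effective_vowel_rank` is hypothetical; ranks follow the V and T numbering of the tables, with 0 meaning the part is absent.

```python
def effective_vowel_rank(vowel_rank: int, tone_rank: int) -> int:
    """Return the vowel rank to collate with.

    If no vowel is written (rank 0) but a tone mark is present,
    V1 is implied, so rank 1 is used instead.
    """
    if vowel_rank == 0 and tone_rank != 0:
        return 1
    return vowel_rank
```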
 +=== Tones ===
 +
 +<WRAP column>
 +| T0 | | |
 +| T1 | ၢ် | U+1062 U+103A |
 +| T2 | ာ် | U+102C U+103A |
 +| T3 | း | U+1038 |
 +| T4 | ၣ် | U+1063 U+103A |
 +| T5 | ၤ | U+1064 |
 +</WRAP>
 +
 +**Note 1:** The multi-character tones are treated as one unit for collation purposes.
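Treating multi-character tones as one unit matters because U+1062 on its own is the vowel V3, while U+1062 U+103A is the tone T1. A minimal sketch, assuming a longest-match scan (the names `TONE_RANK` and `match_tone` are invented for this example):

```python
from typing import Tuple

# Tone sequences and their ranks (T1..T5), copied from the table above.
TONE_RANK = {
    "\u1062\u103A": 1,
    "\u102C\u103A": 2,
    "\u1038": 3,
    "\u1063\u103A": 4,
    "\u1064": 5,
}

# Try two-character tones before one-character tones.
_TONES_LONGEST_FIRST = sorted(TONE_RANK, key=len, reverse=True)


def match_tone(text: str, pos: int) -> Tuple[int, int]:
    """Return (rank, length) of the tone starting at pos, or (0, 0) if none."""
    for tone in _TONES_LONGEST_FIRST:
        if text.startswith(tone, pos):
            return TONE_RANK[tone], len(tone)
    return 0, 0
```

A parser built this way would test for a tone at the current position before falling back to the vowel table, so that U+1062 U+103A is never split into a vowel plus a stray Asat.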
 +=== Final ===
 +
 +<WRAP column>
 +| F0 | | |
 +| F1 | က် | U+1000 U+103A |
 +| F2 | ခ် | U+1001 U+103A |
 +| F3 | င် | U+1004 U+103A |
 +| F4 | စ် | U+1005 U+103A |
 +| F5 | ဆ် | U+1006 U+103A |
 +| F6 | ၡ် | U+1061 U+103A |
 +| F7 | ဇ် | U+1007 U+103A |
 +| F8 | ည် | U+100A U+103A |
 +| F9 | တ် | U+1010 U+103A |
 +| F10 | ထ် | U+1011 U+103A |
 +</WRAP>
 +
 +<WRAP column>
 +| F11 | ဒ် | U+1012 U+103A |
 +| F12 | န် | U+1014 U+103A |
 +| F13 | ပ် | U+1015 U+103A |
 +| F14 | ဖ် | U+1016 U+103A |
 +| F15 | ဘ် | U+1018 U+103A |
 +| F16 | မ် | U+1019 U+103A |
 +| F17 | ယ် | U+101A U+103A |
 +| F18 | လ် | U+101C U+103A |
 +| F19 | ဝ် | U+101D U+103A |
 +| F20 | သ် | U+101E U+103A |
 +</WRAP>
 +
 +**Note 1:** Finals are marked, as in Myanmar, with the Myanmar Sign Asat U+103A.
 +
 +**Note 2:** F11 ဒ် and F16 မ် are usually ligatures rather than finals but can be either. See the next section for more information.
 +=== Special Ligatures ===
 +
 +The following contractions expand as listed and sort according to their expansion:
 +
 +  မ် (U+1019 U+103A) -> ဒံ (U+1012 U+1036)
 +  ဒ် (U+1012 U+103A) -> မီၤ (U+1019 U+102E U+1064)
 +
 +**Note 1:** These ligatures are a consonant followed by the Myanmar Sign Asat U+103A. This is the same way that final consonants are marked, making it ambiguous whether the sequence should be collated as a final consonant or as a ligature. For this reason, I believe the algorithm is going to need to refer to a word list.
 +
 +The following is a list of all known words that use မ် or ဒ် as a final consonant rather than as a ligature. There may be more. Please let me know if you know of any.
 +
 +  ကံလိကြဲၢ်မ်
 +  ကြဲၣ်မ်
 +
 +
 +
 +==== Pwo Karen Collation ====
 +
 +The Pwo Karen and Sgaw Karen orthographies are quite similar. However, in several important cases, their customary dictionary order varies. Also, a few of the characters they use are different. Much of the discussion in the Sgaw Karen section above relates equally well to Pwo Karen.
 +
 +Like Sgaw, Pwo should be collated by syllables. Each syllable can have up to five parts. They are:
 +
 +  <consonant><medial><vowel><tone><final>
 +
 +Only the consonant is always present; one or more of the other parts may be empty in any given syllable. No character reordering is necessary.
 +
 +Each of these parts of the syllable may be composed of one or more characters as delineated in the following tables.
 +=== Consonants ===
 +
 +<WRAP column>
 +| C1 | က | U+1000 |
 +| C2 | ခ | U+1001 |
 +| C3 | ဂ | U+1002 |
 +| C4 | ဎ | U+100E |
 +| C5 | င | U+1004 |
 +| C6 | စ | U+1005 |
 +| C7 | ဆ | U+1006 |
 +| C8 | ဇ | U+1007 |
 +| C9 | ည | U+100A |
 +| C10 | ၡ | U+1061 |
 +| C11 | တ | U+1010 |
 +| C12 | ထ | U+1011 |
 +| C13 | ဒ | U+1012 |
 +| C14 | န | U+1014 |
 +</WRAP>
 +
 +<WRAP column>
 +| C15 | ပ | U+1015 |
 +| C16 | ဖ | U+1016 |
 +| C17 | ဘ | U+1018 |
 +| C18 | မ | U+1019 |
 +| C19 | ယ | U+101A |
 +| C20 | ရ | U+101B |
 +| C21 | လ | U+101C |
 +| C22 | ဝ | U+101D |
 +| C23 | ၥ | U+1065 |
 +| C24 | ဟ | U+101F |
 +| C25 | အ | U+1021 |
 +| C26 | ဧ | U+1027 |
 +| C27 | ၦ | U+1066 |
 +</WRAP>
 +
 +Notes 1 and 2 from the Sgaw Karen section apply here as well.
 +
 +**Note 3:** The Pwo character ၦ U+1066 may be incorrectly typed as ပှ U+1015 U+103E. However, ပှ U+1015 U+103E may also be used as a sequence in its own right, which makes it harder for the collation algorithm to correct user input mistakes automatically.
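Because the two languages order some of the same characters differently, an implementation would likely keep one rank table per language. The sketch below uses only a four-consonant excerpt (C8 to C11 from each consonant table on this page); the names `ORDER_EXCERPT` and `sort_consonants` are hypothetical.

```python
from typing import Iterable, List

# Excerpts (C8..C11) from the Sgaw and Pwo consonant tables on this page.
ORDER_EXCERPT = {
    "sgaw": ["\u1007", "\u1061", "\u100A", "\u1010"],
    "pwo": ["\u1007", "\u100A", "\u1061", "\u1010"],
}


def sort_consonants(chars: Iterable[str], language: str) -> List[str]:
    """Sort consonants by their position in the chosen language's table."""
    rank = {ch: i for i, ch in enumerate(ORDER_EXCERPT[language])}
    return sorted(chars, key=rank.__getitem__)
```

With these excerpts, U+1061 sorts before U+100A under the Sgaw order and after it under the Pwo order.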
 +
 +
 +=== Medials ===
 +
 +<WRAP column>
 +| M0 | | |
 +| M1 | ှ | U+103E |
 +| M2 | ၠ | U+1060 |
 +| M3 | ျ | U+103B |
 +| M4 | ြ | U+103C |
 +| M5 | ွ | U+103D |
 +
 +| M6 | ျှ | U+103B U+103E |
 +| M7 | ြှ | U+103C U+103E |
 +| M8 | ွှ | U+103D U+103E |
 +</WRAP>
 +
 +Notes 1 and 2 from the Sgaw Karen section apply here as well.
 +=== Vowels ===
 +
 +<WRAP column>
 +| V0 | | |
 +| V1 | ါ | U+102B |
 +| V2 | ံ | U+1036 |
 +| V3 | ့ | U+1037 |
 +| V4 | ဲ | U+1032 |
 +| V5 | ၧ | U+1067 |
 +| V6 | ၨ | U+1068 |
 +| V7 | ု | U+102F |
 +| V8 | ူ | U+1030 |
 +| V9 | ိ | U+102D |
 +| V10 | ီ | U+102E |
 +</WRAP>
 +
 +As in Sgaw Karen, if no vowel is written and a tone mark is present, then V1 is implied and should be collated accordingly.
 +=== Tones ===
 +
 +| T0 | | |
 +| T1 | ၢ် | U+1062 U+103A |
 +| T2 | ာ် | U+102C U+103A |
 +| T3 | း | U+1038 |
 +| T4 | ၣ် | U+1063 U+103A |
 +| T5 | ၤ | U+1064 |
 +
 +The multi-character tones are treated as one unit for collation purposes.
 +=== Final ===
 +Finals are marked, as in Myanmar, with the Myanmar Sign Asat U+103A.
 +
 +| F0 | | |
 +| F1 | က် | U+1000 U+103A |
 +| F2 | ခ် | U+1001 U+103A |
 +| F3 | င် | U+1004 U+103A |
 +| F4 | စ် | U+1005 U+103A |
 +| F5 | ဆ် | U+1006 U+103A |
 +| F6 | ၡ် | U+1061 U+103A |
 +| F7 | ဇ် | U+1007 U+103A |
 +| F8 | ည် | U+100A U+103A |
 +| F9 | တ် | U+1010 U+103A |
 +| F10 | ထ် | U+1011 U+103A |
 +| F11 | ဒ် | U+1012 U+103A |
 +| F12 | န် | U+1014 U+103A |
 +| F13 | ပ် | U+1015 U+103A |
 +| F14 | ဖ် | U+1016 U+103A |
 +| F15 | ဘ် | U+1018 U+103A |
 +| F16 | မ် | U+1019 U+103A |
 +| F17 | ယ် | U+101A U+103A |
 +| F18 | လ် | U+101C U+103A |
 +| F19 | ဝ် | U+101D U+103A |
 +| F20 | သ် | U+101E U+103A |
 +
 +F11 ဒ် and F16 မ် are usually ligatures rather than finals but can be either. See the next section for more information.
 +
 +==== Myanmar Collation ====
 +
 +See this paper: [[http://developer.mimer.com/collations/myanmar/MyanmarCollation.pdf|MyanmarCollation.pdf]].
development/collation.txt · Last modified: 2013/08/26 16:30 by ben
Public Domain