Why Is Reordering Needed in Burmese Syllables, and How to Fix It with Regex
For example, you can type “ko” as either က___ိ__ု or က__ု__ိ.
But that is not a good idea, right? You cannot do that in English, for example “use” and “ues”.
And even in Unicode 5.1, vowel E signs like သဝထိုးေ are stored after the consonant (က to အ),
“က ” for “ကေ”
so I collected the data and came up with the following ordering list for Unicode 5.1:
-------ွ----------------ါ----ာ----ိ----ီ----ု----ူ----ဲ----ံ----ျ----်----့----
#ps ---း--- is needed at the end according to theory (I skipped it because I had never met anyone who types း before the other signs; maybe I didn't know Ma KOM at that time :P)
This means ----ွ-------- comes earlier than the signs after it in the list.
So what I’ve done is reorder with the following patterns:
([_ွ____ါ_ာ_ိ_ီ_ု_ူ_ဲ_ံ_ျ_်_့]+__)(____)
([____ါ_ာ_ိ_ီ_ု_ူ_ဲ_ံ_ျ_်_့]+__)(__ွ__)
([___ါ_ာ_ိ_ီ_ု_ူ_ဲ_ံ_ျ_်_့]+__)(____)
([__ါ_ာ_ိ_ီ_ု_ူ_ဲ_ံ_ျ_်_့]+__)(____)
([_ါ_ာ_ိ_ီ_ု_ူ_ဲ_ံ_ျ_်_့]+__)(____)
([_ာ_ိ_ီ_ု_ူ_ဲ_ံ_ျ_်_့]+__)(__ါ__)
([_ိ_ီ_ု_ူ_ဲ_ံ_ျ_်_့]+__)(__ာ__)
([_ီ_ု_ူ_ဲ_ံ_ျ_်_့]+__)(__ိ__)
([_ု_ူ_ဲ_ံ_ျ_်_့]+__)(__ီ__)
([_ူ_ဲ_ံ_ျ_်_့]+__)(__ု__)
([_ဲ_ံ_ျ_်_့]+__)(__ူ__)
([_ံ_ျ_်_့]+__)(__ဲ__)
([_ျ_်_့]+__)(__ံ__)
([_်_့]+__)(__ျ__)
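To generate that list in JavaScript, it can be done like this:
// vowel is the ordering list above, as an array of single characters
for (var i = 0; i < vowel.length - 2; i++) {
    // a run of any later signs, followed by vowel[i]
    var re = new RegExp("([" + vowel.slice(i + 1).join("") + "]+)(" + vowel[i] + ")", "g");
}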
In Python, it's easier, in one line:
ps# Don’t tell me it's complicated, I like obfuscated code :-)
["(["+"".join(vowel[i+1:])+"]+)("+x+")" for i,x in enumerate(vowel[:-2])]
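In C/C++, you have to do a bit more:
for(int i=0;i<vowel_len-1;i++){
    wchar_t restr[40]; // stack buffer, so nothing to delete afterwards
    // vowel is a null-terminated wchar_t string; &vowel[i+1] is every sign after vowel[i]
    swprintf(restr, 40, L"([%ls]+)(%lc)", &vowel[i+1], (wint_t)vowel[i]);
    Regex re(restr,true);
    if(re.test(content)){
        re.sub(content,L"\\2\\1",content); // put group 2 in front of group 1
    }
}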
#ps That Regex class is 100% my own regex engine written in C++.
With those rules, replacing / substituting with $2$1 / \2\1 will give the correct order.
Ps# I’ve put extra underscores ___ in the code so it shows clearly on screen; you should remove all of those characters.
And this is what I use in the Burglish Web Input System to correct syntax.
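To see the whole thing end to end, here is a minimal Python sketch of building the patterns and applying the \2\1 substitution. It is only an illustration, not the Burglish code: the ORDER table contains just the signs that are legible in the ordering list above, and the names ORDER, build_rules and reorder are made up for this example.

```python
import re

# Signs in the order they should follow a consonant.
# Assumption: only the signs legible in the list above; a real table may hold more.
ORDER = [
    "\u103d",  # medial wa
    "\u102b",  # tall aa
    "\u102c",  # aa
    "\u102d",  # i
    "\u102e",  # ii
    "\u102f",  # u
    "\u1030",  # uu
    "\u1032",  # ai
    "\u1036",  # anusvara
    "\u103b",  # medial ya
    "\u103a",  # asat
    "\u1037",  # dot below
]


def build_rules(order):
    """One rule per sign: a run of later signs followed by that sign."""
    rules = []
    for i in range(len(order) - 1):
        later = "".join(order[i + 1:])
        rules.append(re.compile("([" + later + "]+)(" + order[i] + ")"))
    return rules


RULES = build_rules(ORDER)


def reorder(text):
    """Apply the rules from first sign to last, swapping the two groups."""
    for rule in RULES:
        text = rule.sub(r"\2\1", text)
    return text


# "ko" typed as ka + u + i comes out as ka + i + u
print(reorder("\u1000\u102f\u102d") == "\u1000\u102d\u102f")  # True
```

The rules are applied in list order, so each sign is pulled in front of any later signs before the next rule runs; text that is already in order is left untouched.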
Cheers,
Soe Min
5 comments:
Friend, thanks for this information; it's quite helpful for searching and sorting.
I notice that you didn't include "း", which is always the last letter. Is this because "း" is always in the correct location?
-->Seth
Yeah, thanks for pointing that out.
I almost forgot that case.
According to theory, "း" has to be included.
But my code is meant to be lightweight, so I think I skipped it at that time.
I will add that one too.
Thanks
> But my code is meant to be lightweight
Yes, I liked your one-line Python example. :D
Just one more question. How do you apply the algorithm? Once to each word? Or once after each consonant, even if it's killed?
For example, with: "ခိုယ့္", do you apply the algorithm:
1) Only once, to --ိ---ု---္---့-
2) Twice, first to --ိ---ု- and second to --္---့-
I ask this because I am thinking of a very simple implementation:
String s = "ခိုယ့္";
String[] options = s.split("[\u1000-\u1020]");
for (String opt : options) {
    //Apply your re-ordering rule to opt
}
This approach "resets" the scan after each consonant, even if it's stacked or killed. Do you think it will work?
-->Seth
>> String[] options = s.split("[\u1000-\u1020]");
If you split into words or put in some delimiter like "|", it will give accurate results.
For stacked characters, extra replace functions are needed. They are defined in the fontmap as fontmap[Myanmar3][order][after].
The actual regex is
(__[__ါ__-__ဲ__်__-__ွ__]__+__)__(___[__က__-__အ__]__)
It just moves the stack indicator \u1039 + consonant in front of the other parts.
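A rough Python sketch of that stack rule, for illustration only. It assumes the second group is meant to be the stack indicator U+1039 followed by a consonant in the range က to အ, as described above, and it does not use the real fontmap table. The names STACK_RULE and move_stack_forward are invented here.

```python
import re

# Group 1: a run of dependent signs (U+102B..U+1032 and U+103A..U+103D).
# Group 2 (assumed): stack indicator U+1039 plus a consonant U+1000..U+1021.
STACK_RULE = re.compile(
    "([\u102b-\u1032\u103a-\u103d]+)(\u1039[\u1000-\u1021])"
)


def move_stack_forward(text):
    """Put the stacked consonant in front of the signs typed before it."""
    return STACK_RULE.sub(r"\2\1", text)


# ka + i + (U+1039 + ka) becomes ka + (U+1039 + ka) + i
print(move_stack_forward("\u1000\u102d\u1039\u1000") == "\u1000\u1039\u1000\u102d")  # True
```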
Ah, thanks, that makes sense. This makes it a lot easier to do things like string comparison.
Cheers,
-->Seth