Why Reordering Is Needed in Burmese Syllables and How to Fix It with Regex (JavaScript, Python, C/C++)

For example, you can type “ko” as either က___ိ__ု or က__ု__ိ

But that is not a good idea, right? You cannot do that in English: “use” and “ues”, for example, are not the same word.
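To make the problem concrete, here is a small Python sketch. The code points (U+1000 for က, U+102D for ိ, U+102F for ု) and the variable names are my own choice for illustration. It shows that the two typing orders contain the same marks yet compare as different strings.

```python
# Two ways of typing "ko": the same marks in a different order.
ko_a = "\u1000\u102D\u102F"  # က + ိ + ု
ko_b = "\u1000\u102F\u102D"  # က + ု + ိ

# Same set of code points, yet the strings compare unequal,
# so searching or sorting treats them as two different words.
same_marks = sorted(ko_a) == sorted(ko_b)
equal_strings = ko_a == ko_b
print(same_marks, equal_strings)  # prints: True False
```

This is why a fixed order for the marks matters for anything that compares strings.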

Even in Unicode 5.1, vowel sign E (သဝေထိုး, ေ) is stored after the consonant (က to အ), although it is drawn to the left of it:
“ကေ” is stored as က followed by ေ.

So I collected data and arrived at the following ordering list for Unicode 5.1:

-------ွ----------------ါ----ာ----ိ----ီ----ု----ူ----ဲ----ံ----ျ----်----့----

#ps In theory ---း--- is needed at the end. I skipped it because I have never found anyone who types း before the other marks (maybe I didn't know Ma KOM at that time :P)

This means ----ွ-------- comes before the marks listed after it.

So what I did was reorder any text that matches the following patterns.

([_ွ____ါ_ာ_ိ_ီ_ု_ူ_ဲ_ံ_ျ_်_့]+__)(____)
([____ါ_ာ_ိ_ီ_ု_ူ_ဲ_ံ_ျ_်_့]+__)(__ွ__)
([___ါ_ာ_ိ_ီ_ု_ူ_ဲ_ံ_ျ_်_့]+__)(____)
([__ါ_ာ_ိ_ီ_ု_ူ_ဲ_ံ_ျ_်_့]+__)(____)
([_ါ_ာ_ိ_ီ_ု_ူ_ဲ_ံ_ျ_်_့]+__)(____)
([_ာ_ိ_ီ_ု_ူ_ဲ_ံ_ျ_်_့]+__)(__ါ__)
([_ိ_ီ_ု_ူ_ဲ_ံ_ျ_်_့]+__)(__ာ__)
([_ီ_ု_ူ_ဲ_ံ_ျ_်_့]+__)(__ိ__)
([_ု_ူ_ဲ_ံ_ျ_်_့]+__)(__ီ__)
([_ူ_ဲ_ံ_ျ_်_့]+__)(__ု__)
([_ဲ_ံ_ျ_်_့]+__)(__ူ__)
([_ံ_ျ_်_့]+__)(__ဲ__)
([_ျ_်_့]+__)(__ံ__)
([_်_့]+__)(__ျ__)

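To build that list in JavaScript, do it like this:
for (var i = 0; i < vowel.length - 2; i++) {
    // Class = every mark that comes after vowel[i] in the ordering list.
    // new RegExp builds the same pattern without needing eval.
    var re = new RegExp("([" + vowel.slice(i + 1).join("") + "]+)(" + vowel[i] + ")", "g");
}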

In Python it is easier, in one line:
ps# Don’t tell me it's complicated, I like obfuscated code : - )
["(["+"".join(vowel[i+1:])+"]+)("+x+")" for i,x in enumerate(vowel[:-2])]

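In C/C++, you have to do a bit more work (swprintf needs <cwchar>):
for(int i=0;i<vowel_len-1;i++){
    wchar_t restr[40];
    // %ls = the marks after vowel[i], %lc = vowel[i] itself
    swprintf(restr, 40, L"([%ls]+)(%lc)", &vowel[i+1], (wint_t)vowel[i]);

    Regex re(restr,true);
    if(re.test(content)){
        // Backslashes are doubled so the engine receives \2\1, not octal escapes
        re.sub(content,L"\\2\\1",content);
    }
}
#ps That Regex class is 100% my own regex engine written in C++.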

With these rules, replacing each match with $2$1 (JavaScript) or \2\1 (Python, C++) gives the correct order.
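Putting the pieces together, a rough end-to-end sketch in Python could look like the following. The names MARKS, build_rules and reorder are made up for this example, and MARKS holds only the entries that are still visible in the ordering list above, with code points that I am assuming for those glyphs. The last two marks get no rule of their own, matching the pattern list.

```python
import re

# Assumption: the visible entries of the ordering list, in that order.
MARKS = [
    "\u103D",  # ွ
    "\u102B",  # ါ
    "\u102C",  # ာ
    "\u102D",  # ိ
    "\u102E",  # ီ
    "\u102F",  # ု
    "\u1030",  # ူ
    "\u1032",  # ဲ
    "\u1036",  # ံ
    "\u103B",  # ျ
    "\u103A",  # ်
    "\u1037",  # ့
]

def build_rules(marks):
    """One rule per mark: a run of later marks followed by this mark."""
    rules = []
    for i in range(len(marks) - 2):
        later = "".join(marks[i + 1:])
        rules.append(re.compile("([" + later + "]+)(" + marks[i] + ")"))
    return rules

RULES = build_rules(MARKS)

def reorder(text):
    """Apply every rule once, in list order, swapping the two groups."""
    for rule in RULES:
        text = rule.sub(r"\2\1", text)
    return text

fixed_ko = reorder("\u1000\u102F\u102D")          # က + ု + ိ
fixed_three = reorder("\u1000\u1037\u102F\u102D")  # က + ့ + ု + ိ
```

Here fixed_ko comes out as က + ိ + ု and fixed_three as က + ိ + ု + ့, and text that is already in order passes through unchanged.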

Ps# I've put extra underscores ___ in the code so it shows clearly on screen; you should remove all of those characters.

And this is what I use in the Burglish Web Input System to correct the syntax.

Cheers,
Soe Min

5 comments:

S'orlok Reaves said...

Friend, thanks for this information; it's quite helpful for searching and sorting.

I notice that you didn't include "း", which is always the last letter. Is this because "း" is always in the correct location?

-->Seth

မာ့ခ္ said...

Yeah, thanks for pointing that out.
I almost forgot that case.
According to theory, "း" has to be included.
But my codes are targeted for light weight, I think I skipped that at that time.

I will add that one too.

Thanks

S'orlok Reaves said...

> But my codes are targeted for light weight
Yes, I liked your one-line Python example. :D

Just one more question. How do you apply the algorithm? Once to each word? Or once after each consonant, even if it's killed?

For example, with: "ခိုယ့္", do you apply the algorithm:
1) Only once, to --ိ---ု---္---့-
2) Twice, first to --ိ---ု- and second to --္---့-

I ask this because I am thinking of a very simple implementation:
String s = "ခိုယ့္";
String[] options = s.split("[\u1000-\u1020]");
for (String opt : options) {
//Apply your re-ordering rule to opt
}

This approach "resets" the scan after each consonant, even if it's stacked or killed. Do you think it will work?

-->Seth

မာ့ခ္ said...

>> String[] options = s.split("[\u1000-\u1020]");

If you split the text into words, or put in a delimiter such as "|", you will get accurate results.

Stacked characters need extra replace functions. They are defined in the font map as fontmap[Myanmar3][order][after].
The actual regex is
(__[__ါ__-__ဲ__်__-__ွ__]__+__)__(___[__က__-__အ__]__)
which just moves the stack indicator \u1039 + consonant in front of the other parts.
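A minimal Python sketch of that stacking rule, under assumptions: I am reading the ranges above as U+102B to U+1032 and U+103A to U+103D for the marks and U+1000 to U+1021 for the consonants, and the names STACK_RULE and move_stack_forward are made up for this example.

```python
import re

# Assumed reading of the regex above with the underscores removed:
# a run of dependent marks, then U+1039 plus a consonant.
STACK_RULE = re.compile(
    "([\u102B-\u1032\u103A-\u103D]+)(\u1039[\u1000-\u1021])"
)

def move_stack_forward(text):
    """Move the stacked consonant in front of the marks that precede it."""
    return STACK_RULE.sub(r"\2\1", text)

sample = "\u1000\u102D\u1039\u1000"  # က + ိ + ္ + က
fixed = move_stack_forward(sample)   # က + ္ + က + ိ
```

The real rule lives in the font map mentioned above, so treat this only as an illustration of the swap.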

S'orlok Reaves said...

Ah, thanks, that makes sense. This makes it a lot easier to do things like string comparison.

Cheers,
-->Seth
