
Encoding the Most-Used Myanmar Characters into the ASCII Range

UTF-8 takes 3 bytes to store each Myanmar character, so it is not an efficient way to store Myanmar text.
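
A quick way to check the 3-byte figure is to encode a short Myanmar string and count the bytes. This is only an illustrative sketch; the sample word is my own choice and is not taken from the dictionary data.

```python
# Myanmar letters sit in the Unicode block U+1000..U+109F,
# which falls in the range UTF-8 encodes with 3 bytes per character.
sample = "\u1019\u103c\u1014\u103a\u1019\u102c"  # the word "Myanmar" in Myanmar script

utf8_bytes = sample.encode("utf-8")
utf16_bytes = sample.encode("utf-16-le")  # little-endian, no byte-order mark

print(len(sample), "characters")
print(len(utf8_bytes), "bytes in UTF-8")
print(len(utf16_bytes), "bytes in UTF-16")
```

Plain ASCII text goes the other way: 1 byte per character in UTF-8 but 2 in UTF-16.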

So I encode these characters into the ASCII range and relocate some ASCII characters to different locations.
No characters are lost in the process.

I tested it on English-to-Burmese dictionary data, and
here are some interesting results.

UTF8 Encoding - 3.62 MB
UTF16 Encoding - 3.59 MB
My Encoding - 1.85 MB :-)

And my encoding still supports mixed English text, just like UTF-8 does :-)

Also, my encoding does not hurt compression:
when I 7-Zip the files, mine still gives the smallest result.

Cheers,
Soe Min

7 comments:

Myint said...

Do you use ASCII range for European characters?

မာ့ခ္ said...

I am not sure what you mean by European characters,

but in my encoding:

a-z are not changed,
some characters from A-Z are moved, and some stay in the same place,
0-9 are moved,
Myanmar characters are in the ASCII range,
\t, \r, \n are in the ASCII range,
the others are not.

rgds,

မာ့ခ္ said...

Also, \x80 to \x7ff and
any other UTF-8 characters outside the Myanmar range are left as is.

Myint said...

That means your Myanmar characters are in the ASCII range, 0-127?

မာ့ခ္ said...

Absolutely! :-)

It also stays compatible with the existing UTF-8 rules :-)

0-0x7f is 1 byte,
0x80-0x7ff is 2 bytes,
0x800-.... is 3 bytes ....

:-)
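
The length rule quoted above can be written out as a small function and checked against Python's own UTF-8 encoder. This is a sketch for illustration only: `utf8_len` is a hypothetical helper name, and the 4-byte case for code points above 0xFFFF is added from the UTF-8 definition rather than from the comment.

```python
def utf8_len(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for one code point."""
    if code_point <= 0x7F:
        return 1   # ASCII range
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3   # includes the Myanmar block U+1000..U+109F
    return 4       # supplementary planes

# Compare the rule with what the built-in encoder produces.
for cp in (0x41, 0x7F, 0x80, 0x7FF, 0x800, 0x1000, 0x10000):
    actual = len(chr(cp).encode("utf-8"))
    print(hex(cp), utf8_len(cp), actual)
```

So any Myanmar character moved below 0x80 drops from 3 bytes to 1, which is where the saving comes from.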

Myint said...

0-0x7F without removing English characters? That's really cool! Don't know how you even do that ;)

မာ့ခ္ said...

You know the character ranges:
0x01 - 0x1f - these are control characters

0x21 onward - some punctuation, not frequently used

and some A-Z characters; capital letters are normally not used very often, right?

For these characters, I add a flag in front of the character,
so each of them becomes 2 bytes,

and I move the Myanmar characters into those locations. :-)

It's easy, just a few tricks :P

cheers,
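
The trick described in that last comment can be sketched roughly as below. This is not the author's actual table: the flag byte, the six Myanmar letters, and the ASCII slots they take over are all hypothetical choices made only to show the idea (frequent Myanmar characters get 1 byte, displaced ASCII characters get a flag plus themselves, everything else stays plain UTF-8).

```python
FLAG = "\x01"  # hypothetical escape marker; the post does not say which byte is used

# Hypothetical table: a few Myanmar letters and the ASCII slots they take over.
MYANMAR = "\u1000\u1019\u1014\u102c\u103a\u103c"
SLOTS = "QXZ#$%"  # rarely used ASCII characters that get displaced

TO_SLOT = dict(zip(MYANMAR, SLOTS))
FROM_SLOT = dict(zip(SLOTS, MYANMAR))
DISPLACED = set(SLOTS) | {FLAG}  # the flag itself must be escaped too


def encode(text: str) -> bytes:
    out = []
    for ch in text:
        if ch in TO_SLOT:
            out.append(TO_SLOT[ch])   # Myanmar letter -> 1 ASCII byte
        elif ch in DISPLACED:
            out.append(FLAG + ch)     # displaced ASCII -> flag + itself (2 bytes)
        else:
            out.append(ch)            # everything else stays plain UTF-8
    return "".join(out).encode("utf-8")


def decode(data: bytes) -> str:
    out = []
    chars = iter(data.decode("utf-8"))
    for ch in chars:
        if ch == FLAG:
            out.append(next(chars))   # the next character is a literal
        else:
            out.append(FROM_SLOT.get(ch, ch))
    return "".join(out)


word = "\u1019\u103c\u1014\u103a\u1019\u102c"
mixed = word + " = Myanmar, QUIZ #1"
print(len(word.encode("utf-8")), "bytes as UTF-8")
print(len(encode(word)), "bytes remapped")
print(decode(encode(mixed)) == mixed)
```

Because every displaced character is still reachable through the flag, the round trip loses nothing, which matches the claim that no characters are eaten by the process.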
