So I encode these into the ASCII range and relocate some ASCII characters to different locations.
No characters are lost in that process.
I tested it on English-to-Burmese dictionary data, and
here are some interesting results.
UTF8 Encoding - 3.62 MB
UTF16 Encoding - 3.59 MB
My Encoding - 1.85 MB :-)
And my encoding still supports mixed English text, like UTF-8 does :-)
Also, my encoding does not get in the way of compression algorithms:
when I 7-Zip those files, mine still gives the smallest result.
Cheers,
Soe Min
7 comments:
Do you use ASCII range for European characters?
I am not sure what European characters are,
but in my encoding
a-z are not changed,
some characters from A-Z are changed, and some are still in the same place,
0-9 are changed,
Myanmar characters are in the ASCII range,
\t, \r, \n are in the ASCII range,
others are not
rgds,
and \x80 to \x7ff, and
other UTF-8 characters outside the Myanmar range, are left as is.
That means your Myanmar characters are in the ASCII 0-128 range?
absolutely! :-)
That is also compatible with the existing UTF-8 rules :-)
0-0x7f is 1 byte,
0x80-0x7ff is 2 bytes,
0x800-.... is 3 bytes ....
:-)
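
The UTF-8 size rules listed above can be checked with a short Python sketch. The function name `utf8_length` is only for illustration. The Myanmar block starts at U+1000, so it falls in the 3-byte range, which is why moving those characters into 1-byte slots saves space.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 uses for one code point."""
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


# Cross-check the rule against Python's own UTF-8 encoder:
# 'A', 'é', MYANMAR LETTER KA (U+1000), and an emoji.
for cp in (0x41, 0xE9, 0x1000, 0x1F600):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
```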
0-0x7F without removing English characters? That's really cool! Don't know how you even do that ;)
You know the character ranges between
0x01 - 0x1f - these are control characters
0x21 ~ some punctuation marks, not frequently used
and some A-Z characters; normally, capital letters are not used very much, right?
For these characters, I added a flag in front of each character,
so each of them becomes 2 bytes,
and I moved the Myanmar characters to these locations. :-)
It's easy, just a few tricks :P
cheers,
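
A rough sketch of the flag trick described above, in Python. The real mapping table is not given in this thread, so everything concrete here is an assumption made for illustration: the flag byte `FLAG` (0x01), the slot list `SLOTS` (control bytes other than \t, \n, \r), and the assignment of Myanmar letters from U+1000 upward to those slots. Characters that are not remapped fall through to plain UTF-8.

```python
FLAG = 0x01  # hypothetical escape byte marking a displaced ASCII character

# Hypothetical slots: control bytes other than FLAG, \t, \n and \r.
SLOTS = [b for b in range(0x02, 0x20) if b not in (0x09, 0x0A, 0x0D)]
# Hypothetical assignment: Myanmar letters from U+1000 upward, one per slot.
TO_SLOT = {chr(0x1000 + i): slot for i, slot in enumerate(SLOTS)}
FROM_SLOT = {slot: ch for ch, slot in TO_SLOT.items()}


def encode(text: str) -> bytes:
    out = bytearray()
    for ch in text:
        if ch in TO_SLOT:
            out.append(TO_SLOT[ch])          # remapped Myanmar letter: 1 byte
        elif ord(ch) == FLAG or ord(ch) in FROM_SLOT:
            out += bytes((FLAG, ord(ch)))    # displaced ASCII: flag + byte
        else:
            out += ch.encode("utf-8")        # everything else: plain UTF-8
    return bytes(out)


def decode(data: bytes) -> str:
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == FLAG:
            out.append(chr(data[i + 1]))     # escaped original ASCII byte
            i += 2
        elif b in FROM_SLOT:
            out.append(FROM_SLOT[b])         # slot byte back to Myanmar letter
            i += 1
        else:
            # Plain UTF-8: the sequence length follows from the lead byte.
            n = 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4
            out.append(data[i:i + n].decode("utf-8"))
            i += n
    return "".join(out)


sample = "cat \u1000\u1004 \x02\t\u103a\u00e9"
packed = encode(sample)
assert decode(packed) == sample                  # nothing is lost
assert len(packed) < len(sample.encode("utf-8"))  # and it is smaller
```

Because UTF-8 multi-byte sequences only use bytes from 0x80 upward, they can never be confused with the flag or with a slot byte, so the round trip stays unambiguous.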