Subject: Re: UCS-2 vs. UCS-4
From: Andrew Dunbar (hippietrail@yahoo.com)
Date: Sat Jun 23 2001 - 08:44:16 CDT
Martin Sevior wrote:
>
> This is an interesting debate. One extra point we should all keep in mind
> is that we probably don't waste much more space going from 16 => 32 bits
> for character representation.
>
> The Piece Table consists of a doubly linked list of Fragments (Frags) of
> various sorts. These Fragments can represent 0 or more characters. Each
> Fragment is a sequence of contiguous text with identical properties.
> Format Marks are represented by Fragments of 0 characters. Struxes (like
> Paragraph breaks) are Fragments of 1 character in size.
>
> Each Fragment is a class which altogether takes up at least a
> few hundred bytes. What I'm saying is that there is considerable
> overhead for each textual character in the PieceTable. There are few
> occasions in AbiWord where the size of the text "in" a Fragment is
> actually larger than the Fragment itself.
>
> This being the case, if we really need to handle > 16 bits I don't think
> we lose much by just making a global change, UCS-2 => UCS-4. It will be
> easier to code and almost certainly be faster.
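A toy Python sketch of the structure Martin describes (invented names; the real AbiWord fragments are C++ classes carrying far more state) makes the overhead point concrete: each node costs an object plus a property set plus two list links, regardless of how little text it actually holds.

```python
class Frag:
    """One node in a doubly linked piece-table list (hypothetical sketch,
    not AbiWord's actual fragment class)."""
    def __init__(self, text="", props=None):
        self.text = text          # contiguous run with identical properties
        self.props = props or {}  # formatting attributes for the whole run
        self.prev = None          # doubly linked list pointers
        self.next = None

class PieceTable:
    def __init__(self):
        self.head = None
        self.tail = None

    def append(self, frag):
        if self.tail is None:
            self.head = self.tail = frag
        else:
            frag.prev = self.tail
            self.tail.next = frag
            self.tail = frag

pt = PieceTable()
pt.append(Frag("Hello ", {"font": "Times"}))
pt.append(Frag("world", {"font": "Times", "bold": True}))
pt.append(Frag("", {"mark": "format"}))  # a 0-character format mark
```

Even in this stripped-down form, the per-fragment bookkeeping easily outweighs a few characters of text, whether those characters are 2 or 4 bytes wide.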
Okay, if we definitely conclude that the text data itself is much
smaller than its metadata, I have to agree. If people provide good
arguments that this is not the case, then we should go with UTF-8 or
UTF-16, and make sure that all places expect this to be multibyte
(or multiword for UTF-16) data.
If UTF-32/UCS-4 does prove to make a significant memory impact, we
must be aware that there will also be a cache impact.
Another thing to keep in mind is that some iconv implementations might
not have support for UCS-2 or UTF-32. But we can always roll our own.
> BTW we currently make LOTS of assumptions of fixed size per character.
>
> Doing a global change UCS-2 => UCS-4 is probably just a single perl
> script and a rewrite of some UT_UCS_* functions to handle 32 bit
> characters.
>
> Hunting through all the abi code to fix fixed-size character assumptions
> would be REALLY hard work.
>
> I'm strongly in favour of doing a UCS-2 => UCS-4 global change should
> support above 16 bits be deemed necessary.
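For what it's worth, the one-off rewrite script Martin has in mind could look something like this Python sketch (the identifier names below are purely illustrative, not AbiWord's actual ones):

```python
import re

# Illustrative substitutions only; a real script would target the
# actual typedefs and UT_UCS_* function names across the AbiWord tree.
REWRITES = [
    (re.compile(r"\bUT_UCS2Char\b"), "UT_UCS4Char"),
    (re.compile(r"\bUT_UCS2_strlen\b"), "UT_UCS4_strlen"),
]

def rewrite(source):
    """Apply every rename to one file's source text."""
    for pattern, replacement in REWRITES:
        source = pattern.sub(replacement, source)
    return source
```

The word-boundary anchors keep the rename from touching longer identifiers that merely contain the old name.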
Another point when creating a full-featured word processor that is
fully multilingual is the difference between a character, a codepoint,
and a glyph. While going to UCS-4 would ensure that each codepoint
is a fixed size, our assumptions that each character is a fixed size
will still be wrong. We must keep in mind combining characters
and lesser-known details, such as the fact that in some languages
changing the case of a "character" can result in a "character"
which uses a different number of codepoints. There are many other
such subtleties in a full implementation.
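A quick sketch in Python, whose strings are sequences of codepoints, shows both pitfalls: case-mapping can change the number of codepoints, and one user-perceived character can span several codepoints.

```python
import unicodedata

# German sharp s: one codepoint lowercase, two codepoints uppercase.
assert len("ß") == 1
assert "ß".upper() == "SS"
assert len("ß".upper()) == 2

# "é" as one precomposed codepoint vs. "e" plus a combining acute:
# the same user-perceived character, different codepoint counts.
precomposed = "\u00e9"
combining = "e\u0301"
assert len(precomposed) == 1
assert len(combining) == 2
assert unicodedata.normalize("NFC", combining) == precomposed
```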
But hey - if it were easy, everybody would be doing it! (:
Andrew.
> On Sat, 23 Jun 2001, Andrew Dunbar wrote:
>
> > Mike Nordell wrote:
> > >
> > > Please see this post as more-or-less brainstorming.
> > >
> > > It seems that currently all (?) of us don't use anything larger than UCS-2,
> > > but in a not too distant future perhaps we will have to use 2^32 for
> > > character representations (makes me wish for plain ASCII and console-mode
> > > again - I sure as hell don't want to keep track of 4 _billion_ chars).
> > >
> > > I don't know if this is a problem already, but if it is, what about creating
> > > a factory for encoding? Like:
> > >
> > > ASCII_Factory
> > > UTF8_Factory
> > > UCS2_Factory
> > > UCS4_Factory
> > >
> > > and let them return objects that can handle (what to the outside looks like
> > > a linked list of "void*") the chars from a document (or piece table or
> > > whatever, I'm not sure at what level this should be implemented)?
> > >
> > > My idea was something like:
> > > Start at ASCII. If someone enters an outside-ASCII-range char, the
> > > document is "upgraded" to the next level that can handle that type of chars.
> > >
> > > When saving, check what max "level" is used, and save using that one.
> > > Example: If someone used 16-bit chars but entered a UCS-4 char, the engine
> > > would "upgrade" the full document [1] to UCS-4. When saving, if those
> > > specific characters were removed, it would "back down" to UCS-2.
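A minimal sketch of that "upgrade level" decision, in Python with invented names: pick the narrowest fixed-width representation that covers the widest codepoint present.

```python
def encoding_level(text):
    """Return the narrowest fixed-width level able to hold every codepoint.

    Hypothetical helper for illustration; thresholds are the ASCII and
    BMP boundaries (0x80 and 0x10000).
    """
    widest = max((ord(ch) for ch in text), default=0)
    if widest < 0x80:
        return "ASCII"
    if widest < 0x10000:
        return "UCS-2"
    return "UCS-4"
```

Run on save, this gives the "back down" behaviour Mike describes: once the last wide character is deleted, the document's level drops again.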
> >
> > I think this is a good idea. The key is that we have a "string" class
> > about which we do not make assumptions regarding character
> > representation.
> > UT_UCSChar is such an assumption currently.
> >
> > However, UTF-8, UTF-16, and UTF-32 can all handle 32-bit codepoints.
> > I'm still not sure if UCS-2 is defined as different from UTF-16 in this
> > regard. UTF-16 definitely handles surrogates, but I'm not sure if
> > they're part of UCS-2 or not. Otherwise the two are the same thing.
> > Correct me if I'm wrong, please. Anyway, UTF-8 is always 8-bit
> > based, so it is a superset of ASCII - no upgrading needed. It is
> > multibyte, meaning that for a 31 (not 32) bit range of characters it
> > can take from 1 to 6 bytes: usually 1 byte for English, 2 bytes for
> > accented characters, 3 bytes for Chinese/Japanese/Korean, and
> > more for really exotic stuff hardly used yet. UCS-2 can handle
> > a range of up to 2^16 characters in sixteen bits. Above that we
> > have "surrogates", which mean we now have to handle possible pairs
> > of sixteen-bit values. This is where our UT_UCSChar is not
> > compatible. UTF-32 always uses a single 32-bit value to hold
> > any character whatsoever.
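These sizes are easy to verify in Python (byte counts per UTF-8 as standardized today, which tops out at 4 bytes, plus the standard surrogate-pair arithmetic that a fixed 16-bit UT_UCSChar cannot express as one unit):

```python
# UTF-8 length grows with the codepoint, staying 1 byte for ASCII.
assert len("A".encode("utf-8")) == 1           # English
assert len("é".encode("utf-8")) == 2           # accented Latin
assert len("中".encode("utf-8")) == 3           # CJK
assert len("\U0001F600".encode("utf-8")) == 4  # outside the BMP

# A codepoint above U+FFFF becomes a surrogate pair in UTF-16:
# two sixteen-bit values for one character.
cp = 0x1F600
offset = cp - 0x10000
high = 0xD800 + (offset >> 10)
low = 0xDC00 + (offset & 0x3FF)
assert (high, low) == (0xD83D, 0xDE00)
```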
> >
> > So if we used UTF-8 internally, we would never need to upgrade, and
> > sizes are always pretty good. But we would need functions which
> > expect to iterate through the string and never use true
> > random access. We can also do this using UCS-2/UTF-32 if
> > we handle surrogates properly. This also means no true
> > random access. If we really do need random access we might
> > be able to have a UTF-8 -> UCS-2 (no surrogates) -> UTF-32
> > system of upgrades.
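To see why variable-width storage rules out true random access, here is a Python sketch that finds the n-th codepoint in raw UTF-8 by scanning lead bytes from the start - every lookup is O(n) in the number of preceding codepoints:

```python
def utf8_seq_len(lead_byte):
    """Length of a UTF-8 sequence from its lead byte (assumed valid)."""
    if lead_byte < 0x80:
        return 1          # ASCII
    if lead_byte < 0xE0:
        return 2
    if lead_byte < 0xF0:
        return 3
    return 4

def nth_codepoint(data, n):
    """Return the n-th codepoint (0-based) of UTF-8 bytes by linear scan.

    There is no way to jump straight to codepoint n in a variable-width
    encoding; every lookup walks all the bytes before it.
    """
    i = 0
    for _ in range(n):
        i += utf8_seq_len(data[i])
    return data[i:i + utf8_seq_len(data[i])].decode("utf-8")

data = "aé中\U0001F600".encode("utf-8")  # 1 + 2 + 3 + 4 bytes
```

The same scan-from-the-start cost applies to UTF-16 once surrogate pairs enter the picture, which is the trade-off being weighed above.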
> >
> > There are also issues involving the concept of a character
> > versus a codepoint versus a glyph which boil down to the
> > reality that we should always treat a single "character" as
> > a string.
> >
> > Properly designed and coded, I don't think this is difficult,
> > and it should mean we can still have an ASCII-only build
> > and a fully multilingual build and keep everyone happy.
> >
> > Andrew Dunbar.
> >
> > --
> > http://linguaphile.sourceforge.net
> >
--
http://linguaphile.sourceforge.net
This archive was generated by hypermail 2b25 : Sat Jun 23 2001 - 08:42:11 CDT