Subject: Re: UCS-2 vs. UCS-4
From: Martin Sevior (msevior@mccubbin.ph.unimelb.edu.au)
Date: Sat Jun 23 2001 - 08:28:37 CDT
This is an interesting debate. One extra point we should all keep in mind
is that we probably don't waste much more space going from 16 => 32 bits
per character.
The Piece Table consists of a doubly linked-list of Fragments (Frags) of
various sorts. These Fragments can represent 0 or more characters. Each
Fragment is a sequence of contiguous text with identical properties.
Format Marks are represented by Fragments of 0 characters. Struxes (like
a paragraph break) are Fragments of one character in size.
Each Fragment is a class instance which altogether must occupy at least a
few hundred bytes. What I'm saying is that there is considerable
overhead for each textual character in the PieceTable. There are few
occasions in AbiWord where the size of the text "in" a Fragment is
actually larger than the Fragment itself.
This being the case, if we really need to handle > 16 bits I don't think
we lose much by just making a global change, UCS-2 => UCS-4. It will be
easier to code and almost certainly faster.
BTW we currently make LOTS of assumptions of a fixed size per character.
Doing a global change UCS-2 => UCS-4 is probably just a single perl
script plus a rewrite of some UT_UCS_* functions to handle 32-bit
characters.
Hunting through all the abi code to fix fixed-size character assumptions
would be REALLY hard work.
I'm strongly in favour of a UCS-2 => UCS-4 global change, should
support above 16 bits be deemed necessary.
Cheers
Martin
On Sat, 23 Jun 2001, Andrew Dunbar wrote:
> Mike Nordell wrote:
> >
> > Please see this post as more-or-less brainstorming.
> >
> > It seems that currently none (?) of us use anything larger than UCS-2,
> > but in a not too distant future perhaps we will have to use 2^32 for
> > character representations (makes me wish for plain ASCII and console-mode
> > again - I sure as hell don't want to keep track of 4 _billion_ chars).
> >
> > I don't know if this is a problem already, but if it is, what about
> > creating a factory for encoding? Like:
> >
> > ASCII_Factory
> > UTF8_Factory
> > UCS2_Factory
> > UCS4_Factory
> >
> > and let them return objects that can handle (what to the outside looks like
> > a linked list of "void*") the chars from a document (or piece table or
> > whatever, I'm not sure at what level this should be implemented)?
> >
> > My idea was something like:
> > Start at ASCII. If someone enters an outside-ASCII-range char the
> > document is "upgraded" to the next level that can handle that type of char.
> >
> > When saving, check what max "level" is used, and save using that one.
> > Example: If someone used 16-bit chars but entered a UCS-4 char, the engine
> > would "upgrade" the full document [1] to UCS-4. When saving, if those
> > specific characters were removed, it would "back down" to UCS-2.
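A rough sketch of that factory idea (every name here is invented for
illustration -- AbiWord has no such classes):

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical interface: the document would hold one of these and swap
// it for a wider implementation when an out-of-range character is
// entered ("upgrading"), or a narrower one on save ("backing down").
struct CharBuffer {
    virtual ~CharBuffer() {}
    virtual uint32_t maxCodepoint() const = 0;  // widest char this level holds
};

struct AsciiBuffer : CharBuffer {
    uint32_t maxCodepoint() const override { return 0x7F; }
};
struct Ucs2Buffer : CharBuffer {
    uint32_t maxCodepoint() const override { return 0xFFFF; }
};
struct Ucs4Buffer : CharBuffer {
    uint32_t maxCodepoint() const override { return 0x7FFFFFFF; }
};

// The "factory": pick the narrowest representation that can hold cp.
std::unique_ptr<CharBuffer> makeBufferFor(uint32_t cp) {
    if (cp <= 0x7F)   return std::make_unique<AsciiBuffer>();
    if (cp <= 0xFFFF) return std::make_unique<Ucs2Buffer>();
    return std::make_unique<Ucs4Buffer>();
}
```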
>
> I think this is a good idea. The key is that we have a "string" class
> about which we do not make assumptions regarding character
> representation.
> UT_UCSChar is such an assumption currently.
>
> However, UTF-8, UTF-16, and UTF-32 can all handle 32-bit codepoints.
> I'm still not sure if UCS-2 is defined differently from UTF-16 in this
> regard. UTF-16 definitely handles surrogates but I'm not sure whether
> they're part of UCS-2 or not. Otherwise the two are the same thing.
> Correct me if I'm wrong please. Anyway, UTF-8 is always 8-bit
> based, so it is a superset of ASCII - no upgrading needed. It is
> multibyte, meaning that for a 31 (not 32) bit range of characters it
> can take from 1 to 6 bytes: usually 1 byte for English, 2 bytes for
> accented characters, 3 bytes for Chinese/Japanese/Korean, and
> more for really exotic stuff hardly used yet. UCS-2 can handle
> a range of up to 2^16 characters in sixteen bits. Above that we
> have "surrogates", which means we now have to handle possible pairs
> of sixteen-bit values. This is where our UT_UCSChar is not
> compatible. UTF-32 always uses a single 32-bit value to hold
> any character whatsoever.
>
> So if we used UTF-8 internally we would never need to upgrade and
> sizes are always pretty good. But we would need functions that
> expect to iterate through the string and never use true
> random access. We can also do this using UCS-2/UTF-32 if
> we handle surrogates properly; this also means no true
> random access. If we really do need random access we might
> be able to have a UTF-8 -> UCS-2 (no surrogates) -> UTF-32
> system of upgrades.
>
> There are also issues involving the concept of a character
> versus a codepoint versus a glyph which boil down to the
> reality that we should always treat a single "character" as
> a string.
>
> Properly designed and coded I don't think this is difficult
> and should mean we can still have an ASCII only build
> and a fully multilingual build and keep everyone happy.
>
> Andrew Dunbar.
>
> --
> http://linguaphile.sourceforge.net
>
>
This archive was generated by hypermail 2b25 : Sat Jun 23 2001 - 08:29:12 CDT