Re: Strings, Was: profile results for new UT_* implementations?

Subject: Re: Strings, Was: profile results for new UT_* implementations?
From: Andrew Dunbar (hippietrail@yahoo.com)
Date: Wed Jun 20 2001 - 08:51:33 CDT

Next message: Frodo Looijaard: "Warnings and errors on /doc documents"
Previous message: Martin Sevior: "Re: Memory leaks"
In reply to: Mike Nordell: "Re: Strings, Was: profile results for new UT_* implementations?"
Next in thread: Joaquín Cuenca Abela: "Re: Strings, Was: profile results for new UT_* implementations?"
Next in thread: Aaron Lehmann: "Re: Strings, Was: profile results for new UT_* implementations?"
Reply: Andrew Dunbar: "Re: Strings, Was: profile results for new UT_* implementations?"

Mike Nordell wrote:
>
> Dom Lachowicz wrote:
>
> > Abi historically has always used UCS-2 internally to represent strings,
> > and as you note, we're beginning to run into problems with that. Dealing
> > with UTF-8 is no more pleasant than dealing with UCS-2 in my experience,
> > but perhaps it is (much) more common in the programming community as a
> > whole.
>
> I'd say dealing with UTF-8 is _much_ more of a hell:
> In a discussion Joaquin and I had in the back of the cab on the way to
> the .dk party, it turned out that whatever format a document has on disk,
> having it in UCS-2 in memory should be _much_ easier to deal with
> (only indexing on unsigned chars) than UTF-8 (indexing on... oh, we can't
> index). At the moment I believe we both felt it was the way to go. At least
> I still feel it's reasonable.

Well, it seems natural to expect UTF-8 to be more of a pain and UCS-2
to be the easy one. But having read lots of discussions, arguments,
and flame wars on the subject, I've found that most end up deciding on
UTF-8 or UTF-32 throughout because they have *no* special cases. UTF-16 has
ugly special cases. It turns out that in the real world, random
access to strings is very rare. 99% of the time they are iterated
through. Random access is possible with UTF-32 but it costs memory.
Random access is *not* possible with UCS-2 once it needs surrogates,
i.e. once it is really UTF-16!
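
To make the "iterate, don't index" point concrete, here is a minimal sketch of walking a UTF-8 string one code point at a time. The names next_code_point and decode_utf8 are hypothetical helpers made up for illustration, not anything in the Abi tree, and the sketch assumes the input is already well-formed UTF-8 (no validation of lead or continuation bytes).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: decode the code point that starts at s[i] and
// advance i past it. Assumes s is well-formed UTF-8.
inline std::uint32_t next_code_point(const std::string& s, std::size_t& i) {
    assert(i < s.size());
    const unsigned char lead = static_cast<unsigned char>(s[i++]);
    if (lead < 0x80) return lead;  // 1-byte sequence (ASCII)

    int extra;          // number of continuation bytes to read
    std::uint32_t cp;   // payload bits collected so far
    if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else                            { extra = 3; cp = lead & 0x07; }

    // Each continuation byte contributes its low 6 bits.
    while (extra-- > 0 && i < s.size()) {
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    }
    return cp;
}

// Hypothetical helper: sequential decode of a whole string.
inline std::vector<std::uint32_t> decode_utf8(const std::string& s) {
    std::vector<std::uint32_t> out;
    for (std::size_t i = 0; i < s.size();) {
        out.push_back(next_code_point(s, i));
    }
    return out;
}
```

The loop never needs to ask for "character number n"; it only moves forward, which is the access pattern that dominates in practice. Finding the n-th character still means scanning from the start, and that is the cost being traded against the memory use of UTF-32.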

> What are the problems? Please don't say we need more than 2^16 chars.

I'm afraid we do. People who chose UCS-2 early on when everybody
thought 2^16 chars was enough are all in a horrible situation now.
It's easy to say we only support 16-bit chars but sooner or later
we'll get popular and Asian users (and who knows who else) will need
the stuff beyond this. The earlier we start worrying about it the
better.
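
For anyone wondering what "the stuff beyond this" costs in a 16-bit world, here is a rough sketch of how a code point above U+FFFF has to be split into a surrogate pair in UTF-16. The function name to_utf16 is made up for illustration; the arithmetic is the standard UTF-16 encoding rule.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
inline std::vector<std::uint16_t> to_utf16(std::uint32_t cp) {
    // Must be a valid scalar value: in range and not itself a surrogate.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    if (cp < 0x10000) {
        // Fits in a single 16-bit unit; this is all UCS-2 can represent.
        return { static_cast<std::uint16_t>(cp) };
    }

    // Beyond 16 bits: subtract 0x10000 to leave a 20-bit value, then put
    // the top 10 bits in a high surrogate and the low 10 bits in a low one.
    const std::uint32_t v = cp - 0x10000;
    return { static_cast<std::uint16_t>(0xD800 + (v >> 10)),
             static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)) };
}
```

So to_utf16(0x4E2D) gives one unit, while to_utf16(0x20000) gives the two units 0xD840 0xDC00. Any code that assumes one 16-bit unit per character silently breaks on the second case, which is exactly the special case I'm worried about.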

But I'm not expert enough to try to convince anyone. Instead, I
refer you to a few mailing list archives where battles have
been raging. Interested parties please read. CJK people please
read:

http://groups.yahoo.com/group/unicode/messages
http://mail.nl.linux.org/linux-utf8/
http://groups.yahoo.com/group/vim-multibyte
http://archive.develooper.com/perl-unicode@perl.org/
http://oss.software.ibm.com/icu/archives/icu/index.html
http://archive.develooper.com/perl6-internals-unicode%40perl.org/
http://archive.develooper.com/perl6-internals%40perl.org/

That should be enough to scare anyone (:

Andrew Dunbar.

-- 
http://linguaphile.sourceforge.net



This archive was generated by hypermail 2b25 : Wed Jun 20 2001 - 08:49:38 CDT