Re: Strings, Was: profile results for new UT_* implementations?

Subject: Re: Strings, Was: profile results for new UT_* implementations?
From: Dom Lachowicz (dominicl@seas.upenn.edu)
Date: Tue Jun 19 2001 - 22:55:05 CDT

Next message: Martin Sevior: "Re: profile results for new UT_* implementations?"
Previous message: Andrew Dunbar: "Re: Memory leaks"
In reply to: Andrew Dunbar: "Re: Strings, Was: profile results for new UT_* implementations?"
Next in thread: Aaron Lehmann: "Re: Strings, Was: profile results for new UT_* implementations?"
Next in thread: Martin Sevior: "Re: profile results for new UT_* implementations?"
Reply: Dom Lachowicz: "Re: Strings, Was: profile results for new UT_* implementations?"
Reply: Aaron Lehmann: "Re: Strings, Was: profile results for new UT_* implementations?"
Reply: Mike Nordell: "Re: Strings, Was: profile results for new UT_* implementations?"

> Actually, I find UT_Bytebuf useful for strings. I use them in the text
> importer and exporter so I can have one set of functions regardless of
> whether I'm handling 8-bit or 16-bit text. And it'll work if and when
> we have to handle 32-bit text.

Regardless of their apparent usefulness, UT_Bytebufs are not strings. They
don't look or behave like strings; they represent a block of memory and
nothing more. So is this useful for importing text? Yes. Optimal? Probably not.
And (AFAIK) the UCS2 string class can properly handle appending a 'char' or a
UCSChar to its buffer, if that has any bearing on the discussion.
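
To illustrate the "block of memory and nothing more" point, here is a rough
sketch in C++. ByteBuf, appendUnit16 and unitCount are hypothetical stand-ins
made up for this example; they are not the UT_Bytebuf API. The buffer only
knows its size in bytes, so the caller has to carry the character width around
separately.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a raw byte buffer (NOT the real UT_Bytebuf API).
using ByteBuf = std::vector<std::uint8_t>;

// Append one 16-bit unit to the buffer in little-endian byte order.
void appendUnit16(ByteBuf& buf, std::uint16_t u)
{
    buf.push_back(static_cast<std::uint8_t>(u & 0xFF));
    buf.push_back(static_cast<std::uint8_t>(u >> 8));
}

// The buffer cannot say how many characters it holds; the caller must
// supply the width of one unit in bytes (1 for 8-bit, 2 for 16-bit text).
std::size_t unitCount(const ByteBuf& buf, std::size_t bytesPerUnit)
{
    return buf.size() / bytesPerUnit;
}
```

Appending 'h' and 'i' as 16-bit units gives a 4-byte buffer. Read with a width
of 2 it holds two units, read with a width of 1 it appears to hold four, and
nothing in the buffer itself says which reading is right.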

> This must have been discussed at some point, but I'll bring it up since
> I've not seen it here yet. I read all of the Unicode mailing lists
> and newsgroups I can and it seems everybody *hates* UCS-2. Except
> maybe Microsoft (: The rest of the world are coming to grips with
> using UTF-8 for interchange, and UTF-32 (UCS-4) internally. If you
> know anything about surrogates you'll understand why. Many people
> believe that using UCS-2, a character can always fit into one UCS-2
> char. Some believe that if they pretend surrogates don't exist
> they can keep using UCS-2. But this is not true. Many characters
> take more than one codepoint even in UTF-32. The major concern with
> UTF-32 is that it doubles the amount of memory needed over UCS-2 ):
>
> What's our position? We're going to have to look into it sooner or
> later and it won't be fun.

Abi has always used UCS-2 internally to represent strings, and as you note,
we're beginning to run into problems with that. Dealing with UTF-8 is no more
pleasant than dealing with UCS-2 in my experience, but perhaps it is (much)
more common in the programming community as a whole. As you have noted, we do
store data as UTF-8 in our file formats.
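
For anyone who hasn't run into surrogates yet, here is a minimal sketch in C++
of why one 16-bit unit is not always one character. The function names are
hypothetical and nothing here comes from the Abi tree; it also assumes the
input is a valid code point (at most 0x10FFFF and not itself a surrogate).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Encode one Unicode code point as UTF-16 code units.
// Code points above U+FFFF need two 16-bit units (a surrogate pair).
// Assumes cp is valid: <= 0x10FFFF and not in the range 0xD800..0xDFFF.
std::vector<std::uint16_t> encodeUtf16(std::uint32_t cp)
{
    std::vector<std::uint16_t> out;
    if (cp < 0x10000) {
        out.push_back(static_cast<std::uint16_t>(cp));
    } else {
        const std::uint32_t v = cp - 0x10000;  // 20 bits remain
        out.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return out;
}

// Count code points in a UTF-16 sequence by counting every unit that is
// not a low surrogate (the second half of a pair).
std::size_t countCodePoints(const std::vector<std::uint16_t>& units)
{
    std::size_t n = 0;
    for (std::uint16_t u : units) {
        if (u < 0xDC00 || u > 0xDFFF) {
            ++n;
        }
    }
    return n;
}
```

With this, U+0041 comes out as a single unit, while U+1D11E (the musical G
clef) comes out as the pair 0xD834 0xDD1E. Code that treats the unit count as
the character count gets the second case wrong.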

So I don't know what position to take. They all look like they suck a lot. My
vote was for the "everyone use English" solution, but that didn't go over too
well ;-) So those persons more knowledgeable than I on the subject are
encouraged to step up to the mike.

And, yes, converting Abi to use anything but UCS2 will be a PITA.

Dom

