Re: String encoding questions


From: David Mandelin (mandelin@cs.wisc.edu)
Date: Thu Aug 23 2001 - 12:26:15 CDT


Thank you. I'm converting to and from Unicode (UCS-2LE) for Win32 OLE
Automation. I do have a few more questions.

Dom Lachowicz wrote:
>
> Quoting David Mandelin <mandelin@cs.wisc.edu>:
>
> > 1. When I see a string variable of type 'char *' in AbiWord, how do I
> > tell if it is ASCII, native, or something else?
> >
> > 2. What is the Right Way to convert a string from ASCII to Unicode (or
> > native to Unicode) in AbiWord? I found 4 ways: (1) UT_Mbtowc, (2)
> > UT_iconv, (3) UT_convert, and (4) XAP_EncodingManager::nativeToU.
> > Methods (1) and (4) convert a character at a time.
> >
> > 3. What exactly is UT_convert supposed to do? I ask because the comment
> > above it is incorrect, it doesn't quite work if you are converting to a
> > wide-char encoding, and it doesn't seem to be used anywhere.
>
> Hello,
>
> It really depends on what you want to do. It's safe to say that any char *
> string that you see floating around in Abi isn't unicode.

ISO-8859 is used in the GUI, right? What about filenames and internal
identifiers? Always ASCII?

> We have 2 string
> classes that I suggest you use if you want better string handling:
>
> UT_String
> UT_UCS2String
>
> Now, for converting things, I really don't recommend ::nativeToU or UT_Mbtowc
> at all. I would use UT_iconv or UT_convert.
>
> UT_convert is a wrapper around iconv because of how much I hate iconv.
> UT_convert does a 1-shot conversion between charsets, and I'm pretty sure that
> it works (the original code was based on working code found in GLIB, then Mike
> Nordell and I rewrote it and Frodo made sure that it worked for use in our
> Pspell spell-checking driver).
>
> I highly recommend UT_convert if you don't need to keep around an iconv_t
> handle or need a 1-shot conversion. If not, I recommend UT_iconv. Do not use
> iconv - use our wrapper functions.
>
> Also, I need to re-integrate parts of a large patch from Andrew Dunbar that
> I had lying around which dealt with UCS-2 strings and character encodings. I'm
> afraid that I'll have to do this by hand, though ;-(

Hmm. So are you saying that UT_convert doesn't fully support UCS-2 yet?
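
For reference, a 1-shot conversion to UCS-2LE can be sketched with plain
POSIX iconv as below. This is only an illustration, not AbiWord's UT_convert:
the helper name convert_once, its signature, and the output buffer sizing
are assumptions of mine, and some platforms declare iconv's input argument
as const char ** rather than char **.

```cpp
#include <iconv.h>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (NOT AbiWord's UT_convert): convert `in` from
// `from_codeset` to `to_codeset` in a single iconv call.
// Returns false on any failure, including an incomplete trailing
// multibyte sequence (which iconv reports as EINVAL).
bool convert_once(const std::string& in, const char* to_codeset,
                  const char* from_codeset, std::string& out) {
    // iconv_open takes the destination codeset first, then the source.
    iconv_t cd = iconv_open(to_codeset, from_codeset);
    if (cd == (iconv_t)-1) return false;

    // Assumption: 4 output bytes per input byte (plus slack) is enough
    // for the single-byte-to-UCS-2 conversions shown here.
    std::string buf(in.size() * 4 + 4, '\0');

    // glibc declares the input pointer as char **, hence the const_cast;
    // iconv does not modify the input bytes.
    char* src = const_cast<char*>(in.data());
    size_t src_left = in.size();
    char* dst = &buf[0];
    size_t dst_left = buf.size();

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);

    // Treat any error or any unconsumed input as failure.
    if (rc == (size_t)-1 || src_left != 0) return false;

    out.assign(buf.data(), buf.size() - dst_left);
    return true;
}
```

Calling convert_once("Hi", "UCS-2LE", "ISO-8859-1", out) should leave the
four bytes 48 00 69 00 in out. The result is kept in a std::string of raw
bytes on purpose: a UCS-2 buffer contains embedded zero bytes, so it cannot
be treated as a NUL-terminated char * string.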

In any case, the comment is inaccurate. The comments on to_codeset and
from_codeset are transposed. The comment says that len=0 is interpreted
as len=strlen(str), but the code uses strlen only if len<0, which can't
happen since len is UT_uint32. It also bothers me that iconv returning
EINVAL and not converting the whole input is an error if bytes_read_arg
is NULL but not otherwise. And it bothers me that most of the function
is in a try block even though only one line of code can throw an
exception.
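
To make the len problem concrete, here is a small sketch. It is not the
actual UT_convert source: the typedef is a stand-in that assumes UT_uint32
is a 32-bit unsigned type, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Stand-in for UT_uint32 (assumed here to be a 32-bit unsigned integer).
typedef std::uint32_t sketch_uint32;

// The check as described above: strlen is used only when len < 0.
// Because len is unsigned, the condition is always false and the
// strlen branch is dead code (compilers may warn about this).
std::size_t length_as_coded(const char* str, sketch_uint32 len) {
    if (len < 0) return std::strlen(str);
    return len;
}

// The behaviour the comment promises: len == 0 means "use strlen(str)".
std::size_t length_as_documented(const char* str, sketch_uint32 len) {
    return len == 0 ? std::strlen(str) : len;
}
```

With these definitions, length_as_coded("abc", 0) returns 0, so a caller
relying on the comment would convert nothing, while
length_as_documented("abc", 0) returns 3.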

I admit, I am picky. ;-) I did finally get it to work, though, so I
guess I'm OK now.


