Re: commit (HEAD): IMPORTANT - 32-bit UT_UCSChar

From: Andrew Dunbar (hippietrail@yahoo.com)
Date: Wed May 08 2002 - 23:38:14 EDT

    --- F J Franklin <F.J.Franklin@sheffield.ac.uk> wrote:
    > > > Support is there but incomplete. Byte sequences longer than
    > > > 3 bytes will cause problems, and there isn't a UTF-8 -> UCS-4
    > > > conversion yet.
    > >
    > > Sorry to keep whining about this but it was all in my lost huge
    > > Unicode patch over a year ago. UTF-8 sequences can be up to 6
    > > bytes long. We should probably leave it up to iconv anyway since
    > > we have to handle things like overlong sequences, illegal
    > > sequences etc. iconv should handle this. I think my implementation
    > > used the ByteBuf class so that it could handle UCS-2 and UCS-4
    > > properly without worrying about all those null bytes looking like
    > > string terminators and stuff.
    >
    > Andrew, Andrew, I know. The reason why only 3-byte sequences are
    > handled is that the routine was written to convert Abi's internal
    > UCS-2. Now that Abi uses UCS-4 internally I'll add the code to
    > handle 6-byte sequences.
    >
    > In general I support the use of iconv for conversion between
    > encodings, but conversion between validated UTF-8 and UCS-4 is
    > trivial and the [UT_]UTF8String class was designed to handle the
    > conversion without resorting to iconv.

    Okay, if it's only to be used with validated UTF-8 that can
    never contain overlong sequences, wrongly converted
    UTF-16 surrogates, etc., then I agree of course. But if
    it's left as a general interface you can almost
    guarantee that sooner or later people are going to use
    it to process strings which contain exactly those
    oddities. Remember, not everybody understands
    the intricacies of Unicode as well as some of us do.
    Unicode solves a lot of problems but there's quite a
    bit of cruft in there where things can go wrong if
    you're not careful.
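
    To make those oddities concrete, here is a minimal sketch of a
    strict UTF-8 -> UCS-4 decoder that refuses overlong sequences,
    encoded UTF-16 surrogates, stray continuation bytes and truncated
    input. It is written in Python for brevity and is not the
    [UT_]UTF8String code; the names utf8_to_ucs4, sequence_length and
    MIN_FOR_LENGTH are made up for illustration. It assumes the
    original 1..6 byte layout mentioned above; the current UTF-8
    definition (RFC 3629) stops at 4 bytes and U+10FFFF, so a modern
    decoder would reject the 5- and 6-byte forms as well.

```python
# Hypothetical sketch, not AbiWord code.
# Smallest code point that legitimately needs each sequence length
# (original 1..6 byte UTF-8 layout, as discussed in this thread).
MIN_FOR_LENGTH = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000,
                  5: 0x200000, 6: 0x4000000}


def sequence_length(lead):
    """Return the sequence length announced by a lead byte, or 0 if invalid."""
    if lead < 0x80:
        return 1
    if lead < 0xC0:
        return 0  # stray continuation byte
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    if lead < 0xF8:
        return 4
    if lead < 0xFC:
        return 5
    if lead < 0xFE:
        return 6
    return 0  # 0xFE and 0xFF never appear in UTF-8


def utf8_to_ucs4(data):
    """Decode bytes to a list of code points, raising ValueError on bad input."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        n = sequence_length(lead)
        if n == 0:
            raise ValueError("illegal lead byte at offset %d" % i)
        if i + n > len(data):
            raise ValueError("truncated sequence at offset %d" % i)
        # Payload bits of the lead byte: all 7 for ASCII, otherwise 7 - n.
        cp = lead if n == 1 else lead & (0x7F >> n)
        for b in data[i + 1:i + n]:
            if b & 0xC0 != 0x80:
                raise ValueError("bad continuation byte at offset %d" % i)
            cp = (cp << 6) | (b & 0x3F)
        # A value that would have fitted in a shorter sequence is overlong.
        if cp < MIN_FOR_LENGTH[n]:
            raise ValueError("overlong sequence at offset %d" % i)
        # Surrogates belong to UTF-16 and must never be encoded in UTF-8.
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("encoded UTF-16 surrogate at offset %d" % i)
        out.append(cp)
        i += n
    return out
```

    With this, utf8_to_ucs4(b"\xc0\x80") (an overlong NUL) and
    utf8_to_ucs4(b"\xed\xa0\x80") (an encoded surrogate) both raise
    ValueError instead of quietly producing a code point. A decoder
    that skips these checks is only safe on input somebody else has
    already validated.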

    > Ciao, Frank
    >
    > ps. BTW, do you know anything about the overheads of using various
    > iconv implementations? or their thread-safety, for that matter?
    > (Genuinely curious/worried...)

    I really like the libiconv implementation. It's very
    elegant. I'm not familiar with the Linux/BSD
    implementations but I'm sure they're efficient too.
    We don't officially support any other iconv, although
    people can force AbiWord to build with other system
    iconvs; that is up to them.
    Unfortunately I don't know about thread-safety issues,
    but I've been in touch with the libiconv maintainer
    before and he seems pretty responsive.

    Andrew Dunbar.

    > Francis James Franklin
    > F.J.Franklin@shef.ac.uk
    >
    > "No, she really likes me. She told me I look like Britney Spears,
    > and why would you say that to somebody you don't like?"
    >
    > --- Elle Woods

    =====
    http://linguaphile.sourceforge.net http://www.abisource.com




    This archive was generated by hypermail 2.1.4 : Wed May 08 2002 - 23:41:05 EDT