Re: Patch: Encoding Manager


Subject: Re: Patch: Encoding Manager
From: Vlad Harchev (hvv@hippo.ru)
Date: Sun Jun 03 2001 - 05:51:51 CDT


On Sun, 3 Jun 2001, Andrew Dunbar wrote:

 Hi Andrew,

> This patch is mainly a large extension to the approximate() method.
> It's certainly not complete yet though...
>
> There's also a couple of minor changes and a few comments that
> should be looked at.
>
> Andrew Dunbar.

 Thank you very much for the patch!

 My comments on it:

Regarding the following modification:
----------------------
+/*
+ TODO I'm pretty sure you can't break Korean at any character.
+ And what about Japanese Katakana and Hiragana?
+*/
 static const _rmap can_break_words_data[]=
 {
-----------------------
 The 'can_break_words_data' list was filled in by me, and it allows Chinese,
Korean and Japanese text to break at any character of a word. We
(AW developers, including Chinese hackers) just assumed that all CJK
languages allow breaking a word at any character. The correctness of that
list is guaranteed for Chinese only. So please edit that list according to
your knowledge. If it's unclear to you how to do that, feel free to ask me
and I will be very glad to help you.
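
 To make the question concrete, here is a minimal sketch of the kind of
lookup such a list feeds. It assumes a plain array of language-code prefixes;
the names 'kBreakAnywhereLangs' and 'canBreakAtAnyChar' are hypothetical and
this is not the actual '_rmap' machinery:

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical list of language-code prefixes whose words may be broken
// at any character. Only "zh" is known to be correct; "ja" and "ko" are
// the entries under review.
static const char* const kBreakAnywhereLangs[] = { "zh", "ja", "ko" };

// Returns true if the language part of the locale (e.g. "zh" in "zh_CN")
// appears in the list above.
bool canBreakAtAnyChar(const char* locale)
{
    if (!locale)
        return false;
    for (const char* lang : kBreakAnywhereLangs) {
        const std::size_t n = std::strlen(lang);
        // Accept "zh", "zh_CN" and "zh-TW", but not a longer code like "zha".
        if (std::strncmp(locale, lang, n) == 0 &&
            (locale[n] == '\0' || locale[n] == '_' || locale[n] == '-'))
            return true;
    }
    return false;
}
```

 With a layout like this, dropping "ko" or "ja" from the array would be the
whole change if it turns out those languages need real word-break rules.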

 As for this:
-------------------
 char XAP_EncodingManager::fallbackChar(UT_UCSChar c) const
 {
+ // TODO shouldn't we return U+FFFD "REPLACEMENT CHARACTER"
+ // TODO or U+25A0 "BLACK SQUARE" for Unicode?
     return '?';
 }
-------------------
 No, the replacement char should be ASCII and fit into a 'char', so '?' seems
to me like the best replacement.
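
 A small sketch of why the return type matters, assuming 'char32_t' as a
stand-in for UT_UCSChar and a hypothetical helper 'toAscii' (neither is the
real AbiWord code): the fallback is written into a buffer of 8-bit chars, so
U+FFFD or U+25A0 simply would not fit into one element.

```cpp
#include <cstddef>
#include <string>

// Stand-in for XAP_EncodingManager::fallbackChar(); the result is always
// ASCII, so it is valid in a 'char' whatever the 8-bit target encoding is.
char fallbackChar(char32_t /*c*/)
{
    return '?';
}

// Hypothetical helper: narrow a UCS-4 string to ASCII, substituting the
// fallback for every character outside the 7-bit range.
std::string toAscii(const std::u32string& in)
{
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char32_t c = in[i];
        out += (c < 0x80) ? static_cast<char>(c) : fallbackChar(c);
    }
    return out;
}
```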

 As for this
---------------
+// Warning:
+// This code forces us to use "GB2312", "BIG5", etc instead
+// of "CP936", "CP950", etc even when our iconv supports
+// the "CPxxx" form and the encodings differ.
+// Be sure this is what you want if you call this function.
 const char* XAP_EncodingManager::charsetFromCodepage(int lid) const
 {
---------------
 Yes, the warning is correct. It would be nice if somebody rewrote it
properly.

 As for the 'approximate' method:
 AFAIR it is never called with 'maxlength' > 1, at least in 0.7.14. It
would be nice to audit all call sites and see where it would really help to
call approximate() directly (the plain text exporter is definitely one).
 And just a small hint: there is Markus Kuhn's Unicode-to-ASCII
approximation table. It's used in libiconv for its special 'translit' target;
see 'translit.def' in libiconv's sources. You can merge your code with his
list (and better yet, update his list too).
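
 To illustrate why 'maxlength' matters, here is a rough sketch of a
table-driven approximation. Everything in it is an assumption for
illustration: the name 'approximateSketch', the signature, and the four
table entries, which are not copied from 'translit.def'.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

struct Approx {
    char32_t    ucs;    // Unicode code point
    const char* ascii;  // ASCII approximation, possibly several chars
};

// Illustrative entries only.
static const Approx kApproxTable[] = {
    { 0x00A9, "(c)" },  // COPYRIGHT SIGN
    { 0x00C6, "AE"  },  // LATIN CAPITAL LETTER AE
    { 0x201C, "\""  },  // LEFT DOUBLE QUOTATION MARK
    { 0x2026, "..." },  // HORIZONTAL ELLIPSIS
};

// Returns an ASCII approximation of c that is at most maxlength chars long.
// Falls back to "?" when no entry exists or the entry does not fit.
std::string approximateSketch(char32_t c, std::size_t maxlength)
{
    if (maxlength == 0)
        return std::string();
    if (c < 0x80)
        return std::string(1, static_cast<char>(c));
    for (const Approx& a : kApproxTable) {
        if (a.ucs != c)
            continue;
        if (std::strlen(a.ascii) <= maxlength)
            return std::string(a.ascii);
        break;  // found, but too long for the caller's buffer
    }
    return std::string(1, '?');
}
```

 With 'maxlength' == 1 every multi-character entry degrades to '?', which is
why callers such as the plain text exporter would have to pass a larger
value to benefit from a bigger table.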

 Best regards,
  -Vlad


