GSoC 2011 - Hyphenation
From AbiWiki
Summary of what I have done in GSoC 2011
So far, my work in GSoC 2011 consists of the following seven parts:
1. How to support more languages
How to support more languages in ISpell
How to support more languages in MySpell
2. How to extend the Enchant function
3. Hyphenation module in Enchant
Read and fully understand the Enchant source code
Reuse Enchant's abstraction layer and add a hyphenation function to it, so that more languages can be added easily
Deal with more languages
Add five backend implementations: ispell, myspell, zemberek, voikko, and uspell
Deal with the spell-checking module
4. Call the hyphenation function in AbiWord
Find split information using enchant_dict_hyphenate
Split a Text_Run so that a word running past the line width is broken while keeping its format
Handle user operations (select, delete, cut, paste)
Let the user choose whether to enable hyphenation
5. Simple implementation of Chinese spell-checking in Enchant
Add a simple spell-check framework for Chinese in Enchant
Add a library to support it
A survey of Chinese spell-checking
6. Code refactoring and debugging
Refactor the code to keep it flexible
Debug coding problems
7. User interface to manage hyphenation
Windows, Linux, and Cocoa
Details:
How to reuse my work
I have created two patch files containing all the code I wrote during GSoC 2011:
Chenxiajian_enchant.diff
Chenxiajian_abiword.diff
Chenxiajian_enchant.diff contains the work done in the Enchant framework to provide an abstraction layer for hyphenation that AbiWord can use.
Chenxiajian_abiword.diff contains the concrete work in AbiWord, which calls that hyphenation function to implement hyphenation.
How to use them?
Just apply the diff files to the SVN source trees of Enchant and AbiWord.
How to support more languages
As mentioned before, we use Enchant to support more languages, through five backends. Take ISpell and MySpell as examples.
The folder “abiword\msvc2008\Debug\” contains the folders used for hyphenation: Spell and mySpell. There are two folders for their dictionaries.
How to support more languages in ISpell
Go into the ISpell folder and you will see a folder named language. Copy your language's hyphenation dictionary into it, and AbiWord will then support hyphenation for that language.
Currently we support de, en, es, and fr.
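A minimal sketch of how a caller might check a language tag against the four languages listed above before requesting hyphenation. The helper name isHyphenationLanguageSupported is hypothetical and not part of the patches, and comparing only the part of the tag before "_" or "-" is an assumption.

```cpp
#include <set>
#include <string>

// Hypothetical helper (not from the patches): reports whether a language tag
// belongs to the languages the text lists as supported: de, en, es, fr.
// Assumption: only the part before '_' or '-' matters, so "en_US" counts as "en".
bool isHyphenationLanguageSupported(const std::string& tag)
{
    static const std::set<std::string> supported = {"de", "en", "es", "fr"};
    const std::string::size_type cut = tag.find_first_of("_-");
    const std::string language = tag.substr(0, cut); // whole tag if no separator
    return supported.count(language) > 0;
}
```

Adding a dictionary for a new language would then also mean adding its code to the set.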
How to support more languages in MySpell
As with ISpell, to support more languages in MySpell, use the myspell folder in the same way.
How to extend the Enchant function
Having read much of the Enchant code, I think Enchant is a very useful framework for functions that need dictionaries, such as spell-checking and hyphenation. To extend Enchant with such a function, we need to do the following:
1 First, add a concrete function pointer to EnchantDict, something like:
char **(*hyphenate) (struct str_enchant_dict * me, const char *const word, size_t len, size_t * out_n_suggs);
2 The function is implemented by the backend:
static char **
ispell_dict_hyphenate (EnchantDict * me, const char *const word, size_t len, size_t * out_n_suggs)
{
    ISpellChecker * checker;
    checker = (ISpellChecker *) me->user_data;
    return checker->hyphenate (word, len, out_n_suggs);
}
3 Finally, each backend connects its function to the dictionary:
dict->hyphenate = ispell_dict_hyphenate;   /* in the ispell backend */
dict->hyphenate = hspell_dict_hyphenate;   /* in the hspell backend */
dict->hyphenate = zemberek_dict_hyphenate; /* in the zemberek backend */
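The three steps can be sketched in a self-contained form. This is only an illustration of the function-pointer pattern, not Enchant code: MockDict stands in for EnchantDict, and toy_dict_hyphenate, mock_dict_hyphenate and make_toy_dict are hypothetical names. The toy backend simply puts a hyphen in the middle of the word.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Stand-in for EnchantDict with only the fields this sketch needs (step 1).
struct MockDict {
    void *user_data;
    char *(*hyphenate)(MockDict *me, const char *word, std::size_t len);
};

// Step 2: a toy backend. It returns a malloc'ed copy of the word with one
// '-' inserted at the midpoint; a real backend would consult a dictionary.
static char *toy_dict_hyphenate(MockDict *, const char *word, std::size_t len)
{
    char *out = static_cast<char *>(std::malloc(len + 2)); // word + '-' + NUL
    if (out == NULL)
        return NULL;
    const std::size_t half = len / 2;
    std::memcpy(out, word, half);
    out[half] = '-';
    std::memcpy(out + half + 1, word + half, len - half);
    out[len + 1] = '\0';
    return out;
}

// Step 3: the backend connects its function to the dictionary.
MockDict make_toy_dict()
{
    MockDict dict;
    dict.user_data = NULL;
    dict.hyphenate = toy_dict_hyphenate;
    return dict;
}

// Front end: dispatches through the pointer, so callers never name a backend.
char *mock_dict_hyphenate(MockDict *dict, const char *word, std::size_t len)
{
    if (dict == NULL || word == NULL || dict->hyphenate == NULL)
        return NULL; // backend does not offer hyphenation
    return dict->hyphenate(dict, word, len);
}
```

Because the front end only checks and calls the pointer, a backend without hyphenation support can leave it as NULL and callers get NULL back.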
Hyphenation module in Enchant
Add a hyphenation function to Enchant
First, I added a hyphenation method to Enchant:
I think hyphenation can be combined with spell-checking, which makes the code more flexible. In my opinion, the hyphenation function should be defined as follows:
EnchantDict *enchant_broker_request_dict (EnchantBroker *broker, const char *const lang); /* same as for spell-checking */
char *enchant_dict_hyphenate (EnchantDict *dict, const char *const word, size_t len);
To implement this in the abstraction layer, we need to add a hyphenation function to EnchantDict as a function pointer, something like:
char* (*hyphenate) (struct str_enchant_dict * me, const char *const word, size_t len, size_t * out_n_suggs);
The function is implemented by the backend. Take “ispell” as an example:
static char *
ispell_dict_hyphenate (EnchantDict * me, const char *const word, size_t len, size_t * out_n_suggs)
{
    ISpellChecker * checker;
    checker = (ISpellChecker *) me->user_data;
    return checker->hyphenate (word, len, out_n_suggs);
}
Finally, each backend sets the connection:
dict->hyphenate = ispell_dict_hyphenate;   /* in the ispell backend */
dict->hyphenate = hspell_dict_hyphenate;   /* in the hspell backend */
dict->hyphenate = zemberek_dict_hyphenate; /* in the zemberek backend */
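A minimal sketch of what a caller could do with the string returned by enchant_dict_hyphenate. The text does not specify the format of that string, so this assumes each possible break is marked with '-' (for example "hy-phen-ation"). The helper name breakOffsets is hypothetical. Offsets are counted in bytes of the unmarked word, and a word that already contains a literal hyphen is not handled.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: converts a marked word such as "hy-phen-ation" into
// the offsets (in the unmarked word) where a line break may be placed.
// Assumption: '-' marks a break point; the real return format may differ.
std::vector<std::size_t> breakOffsets(const std::string& hyphenated, char marker = '-')
{
    std::vector<std::size_t> offsets;
    std::size_t plainIndex = 0; // position in the word with markers removed
    for (char c : hyphenated) {
        if (c == marker) {
            if (plainIndex > 0)          // ignore a marker before the first letter
                offsets.push_back(plainIndex);
        } else {
            ++plainIndex;
        }
    }
    return offsets;
}
```

For "hy-phen-ation" this yields the offsets 2 and 6, which a layout engine could compare against the remaining line width to pick the last break that still fits.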
Add five backends to support hyphenation
The backends are ispell, myspell, zemberek, voikko, and uspell:
Hunspell: uses a separate dictionary, such as hyph_en_us.dic; dictionaries can be downloaded from the internet
Libhyphenation: the dictionaries are provided by the library author and are sometimes limited
Zemberek: for Turkish
Voikko: for Finnish
The changes:
1 Deleted the unneeded connections, such as HSpell
2 Added Hunspell (MySpell) hyphenation code
3 Implemented hyphenation using Hunspell
4 Implemented hyphenation using Zemberek
1 Deleted the unneeded connections, such as HSpell
Hebrew does not need any hyphenation
Yiddish does not need any hyphenation
2 Implement hyphenation using Hunspell
To use libhyphenation, we need to add the following files:
hyphen/hnjalloc.h
hyphen/hnjalloc.c
hyphen/hyph_en_US.dic
hyphen/hyphen.c
hyphen/hyphen.gyp
hyphen/hyphen.h
hyphen/hyphen.patch
hyphen/hyphen.tex
3 Implement hyphenation using Zemberek
This uses dbus_g_proxy_call, in the same way as spell-checking in Zemberek. The hyphenation code is as follows:
char* Zemberek::hyphenate(const char* word)
{
    char** syllables = NULL; /* a G_TYPE_STRV reply is a NULL-terminated string array */
    GError *Error = NULL;
    if (!dbus_g_proxy_call (proxy, "hecele", &Error,
                            G_TYPE_STRING, word, G_TYPE_INVALID,
                            G_TYPE_STRV, &syllables, G_TYPE_INVALID)) {
        g_error_free (Error);
        return NULL;
    }
    /* assumed output format: the syllables joined with "-" */
    char* result = g_strjoinv ("-", syllables);
    g_strfreev (syllables);
    return result;
}
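The call above asks for a G_TYPE_STRV reply, which is a NULL-terminated array of strings, while the function returns a single char*. A minimal sketch, without GLib, of turning such an array into one string. The helper name joinSyllables is hypothetical, and joining the pieces with "-" is an assumption about the wanted output format.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: joins a NULL-terminated array of syllables into one
// string, placing the separator between neighbouring syllables.
std::string joinSyllables(const char* const* syllables, const std::string& separator = "-")
{
    std::string joined;
    if (syllables == NULL)
        return joined; // no reply: empty result
    for (std::size_t i = 0; syllables[i] != NULL; ++i) {
        if (i > 0)
            joined += separator;
        joined += syllables[i];
    }
    return joined;
}
```

With the input array {"mer", "ha", "ba", NULL} the function returns "mer-ha-ba".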
ISpell
I used Libhyphenation in ISpell. The wrapper code looks like this:
static char *
ispell_dict_hyphenate (EnchantDict * me, const char *const word)
{
    ISpellChecker * checker;
    checker = (ISpellChecker *) me->user_data;
    /* test the tag's contents; comparing the pointer itself against "" does not do that */
    if (me->tag != NULL && me->tag[0] != '\0')
        return checker->hyphenate (word, me->tag);
    return checker->hyphenate (word, "en_us"); /* fall back to US English */
}
The concrete code in ISpellChecker is:
char * ISpellChecker::hyphenate(const char * const utf8Word, const char *const tag)
{
    // we must choose the right language tag
    char* param_value = enchant_broker_get_param (m_broker, "enchant.ispell.hyphenation.dictionary.path");
    if (languageMap[tag] != "") {
        string result = Hyphenator(RFC_3066::Language(languageMap[tag]), param_value).hyphenate(utf8Word);
        char* temp = new char[result.length() + 1]; // +1 for the terminating NUL
        strcpy(temp, result.c_str());
        return temp;
    }
    return NULL;
}
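The function above hands a std::string back to C callers as a char*. Such a copy needs one byte more than the string length, for the terminating NUL. A minimal sketch of a copy helper that makes this explicit; the name copyToCString is hypothetical and not part of the patch.

```cpp
#include <cstring>
#include <string>

// Hypothetical helper: returns a new[]-allocated, NUL-terminated copy of the
// text. The caller owns the buffer and must release it with delete[].
char* copyToCString(const std::string& text)
{
    char* copy = new char[text.size() + 1];            // +1 for the terminating NUL
    std::memcpy(copy, text.c_str(), text.size() + 1);  // c_str() includes the NUL
    return copy;
}
```

Since the buffer comes from new[], the caller has to release it with delete[] rather than free(); if the buffer crosses into C code, allocating with malloc instead would be the safer choice.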
MySpell
I used Libhyphenate in MySpell. The wrapper code looks like this:
char* MySpellChecker::hyphenate (const char* const word, size_t len, char* tag)
{
    if (len == (size_t) -1)
        len = strlen(word);
    if (len > MAXWORDLEN
        || !g_iconv_is_valid(m_translate_in)
        || !g_iconv_is_valid(m_translate_out))
        return 0;
    char* result = 0;
    myspell->hyphenate(word, result, tag);
    return result;
}
The concrete code in Hunspell, which MySpellChecker calls, is:
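void Hunspell::hyphenate( const char* const word, char*& result, char* tag )
{
    /* result is taken by reference so that the caller actually receives the buffer */
    result = NULL;
    char ** rep = NULL; /* start as NULL so they are never read uninitialised */
    int * pos = NULL;
    int * cut = NULL;

    /* load the hyphenation dictionary */
    string filePath = "hyph_";
    filePath += tag;
    filePath += ".dic";
    HyphenDict *dict = hnj_hyphen_load(filePath.c_str());
    if (dict == NULL) {
        fprintf(stderr, "Couldn't find file %s\n", filePath.c_str());
        return; /* leave result NULL instead of exiting the host application */
    }

    char *hyphens = new char[BUFSIZE + 1];
    int len = strlen(word);
    /* pass the full length; len-1 would drop the last character of the word */
    if (hnj_hyphen_hyphenate2(dict, word, len, hyphens, NULL, &rep, &pos, &cut)) {
        delete[] hyphens; /* allocated with new[], so free() is wrong here */
        fprintf(stderr, "hyphenation error\n");
        hnj_hyphen_free(dict);
        return;
    }
    hnj_hyphen_free(dict);
    result = hyphens;
}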
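The buffer filled by hnj_hyphen_hyphenate2 is not a hyphenated word but a vector of digit characters, one per letter. A minimal sketch of turning it into a visible result, assuming the libhyphen convention that an odd value at position i marks a possible break after the i-th character. The helper name applyHyphens is hypothetical, and non-standard hyphenation (the rep, pos and cut outputs) is ignored here.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: inserts a marker after every character whose entry in
// the hyphens vector is odd. Assumption: hyphens holds ASCII digits, as
// written by libhyphen, so the lowest bit of the character tells odd from even.
std::string applyHyphens(const std::string& word, const std::string& hyphens, char marker = '-')
{
    std::string out;
    for (std::size_t i = 0; i < word.size(); ++i) {
        out += word[i];
        const bool isLast = (i + 1 == word.size()); // never break after the last letter
        if (!isLast && i < hyphens.size() && (hyphens[i] & 1))
            out += marker;
    }
    return out;
}
```

For the word "example" and the vector "0100000" the result is "ex-ample".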