Tutorial: writing stemming rules in Open Text Summarizer

From: Nadav Rotem (nadavrotem@mail.ru)
Date: Thu Jul 24 2003 - 01:27:48 EDT

Next message: Martin Sevior: "commit: Fix nesting of tables past 2 levels, cut on GUADEC document no longer hangs."

Previous message: William Lachance: "PATCH: win32 platform fix for earlier patch (5209)"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]

In the past few weeks “stemming” support was added to the Open Text
Summarizer.
Stemming is the ability to take a word such as "running" and trace it
back to its original form, the word "run". We use this feature to
group together all of the derivatives of a certain stem. For OTS,
keywords equal ideas; we need this ability to group words together to
recognize that “I ran” and “I am running” express similar ideas.

The stemming process is governed by rules. At the moment there are
two main rule groups: prefix and postfix. Each rule is defined as
[“replace this” : “with that”], a pair of strings separated by a
colon. The <postfix> rules try to match the end of the word while the
<prefix> rules try to match the beginning. The program tries to apply
each of the rules, from top to bottom, until one matches. It will
apply ONLY ONE rule from each group.

The stem rules are defined in en.xml (or, for another language, that
language's code followed by .xml). They look like this:

<prefix>
        <rule>replaceThis:withThis</rule>
</prefix>
<postfix>
        <rule>sses:s</rule>
        <rule>ing:</rule>
        <rule>went:go</rule>
</postfix>

In the example file the program will replace each “sses” at the end of a
word with “s”, remove every “ing” from the end of a word and replace
the word “went” with “go”.

In the example the program will be able to tell that:

stem(“went”) == stem(“going”) == stem(“go”) == “go”
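
The matching procedure described above can be sketched in Python roughly as
follows. This is only an illustration, not the actual OTS implementation. The
names load_rules, apply_group and stem are made up for the example, the
<stemmer> wrapper element is an assumption added so the fragment parses on its
own, and running the prefix group before the postfix group is also an
assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper element so the fragment is well-formed XML on its own.
RULES_XML = """
<stemmer>
  <prefix>
  </prefix>
  <postfix>
    <rule>sses:s</rule>
    <rule>ing:</rule>
    <rule>went:go</rule>
  </postfix>
</stemmer>
"""

def load_rules(xml_text, group):
    """Return the (replace_this, with_that) pairs of one group, in file order."""
    root = ET.fromstring(xml_text)
    rules = []
    for node in root.findall(f"./{group}/rule"):
        old, _, new = (node.text or "").partition(":")
        rules.append((old, new))
    return rules

def apply_group(word, rules, at_end):
    """Apply the first matching rule of a group, and only that one."""
    for old, new in rules:
        if not old:
            continue  # skip malformed rules with nothing to match
        if at_end and word.endswith(old):
            return word[:len(word) - len(old)] + new
        if not at_end and word.startswith(old):
            return new + word[len(old):]
    return word

def stem(word, prefix_rules, postfix_rules):
    # Assumption: the prefix group runs first, then the postfix group.
    word = apply_group(word, prefix_rules, at_end=False)
    return apply_group(word, postfix_rules, at_end=True)

PREFIX = load_rules(RULES_XML, "prefix")
POSTFIX = load_rules(RULES_XML, "postfix")

for w in ("went", "going", "go"):
    print(w, "->", stem(w, PREFIX, POSTFIX))
```

With the three postfix rules from the example, all three words come out as
"go", because "going" loses its "ing" and "went" is rewritten by the last rule.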

As Alan said, “There are some grammar rules for this but because English
is such a bastard language they can be quite unreliable.”

For example, we can't automatically drop the “s” at the end of a word
to remove the plural: first, the word might end with “es”, and second, it
may be a word such as “was”. One trick would be to place “es” before “s”
and “e” in the rule list, and maybe to have at the beginning a list of
words that break our algorithm.
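
The two tricks just mentioned, ordering the rules and keeping a list of
exceptions, might look something like this in Python. The rule list, the
EXCEPTIONS table and the function name stem_plural are all hypothetical and
chosen only for illustration; they are not taken from en.xml.

```python
# Hypothetical exception list: words the suffix rules would mangle.
# It is checked before any rule is tried.
EXCEPTIONS = {"was": "was", "is": "is", "this": "this"}

# Hypothetical postfix rules. "es" is listed before "s" and "e" so that the
# longer ending wins, since only the first matching rule is applied.
POSTFIX_RULES = [("es", ""), ("s", ""), ("e", "")]

def stem_plural(word):
    """Return the exception entry if there is one, else apply the first matching rule."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for old, new in POSTFIX_RULES:
        if word.endswith(old):
            return word[:len(word) - len(old)] + new
    return word

for w in ("houses", "house", "cats", "was"):
    print(w, "->", stem_plural(w))
```

Here "houses" and "house" both reduce to "hous", which is not a real word but
is enough to group them together, while "was" is left alone by the exception
list.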

You can go wild with the list because applying it is O(N), where N is the
number of words in the article. We already have O(N^2) in some other place.

In order to fully support the 24+ languages that OTS supports,
we need to define the rules for each language to make this connection.
I know that for many languages this feature is critical (Russian, for example).

Shalom,
Nadav

