Tutorial: writing stemming rules in Open Text Summarizer

From: Nadav Rotem (nadavrotem@mail.ru)
Date: Thu Jul 24 2003 - 01:27:48 EDT


    In the past few weeks, “stemming” support was added to the Open Text
    Summarizer.
    Stemming is the ability to take a word such as "running" and trace it
    back to its base form, "run". We use this feature to
    group together all of the derivatives of a certain stem. For OTS,
    keywords equal ideas; we need this ability to group words together to
    recognize that “I ran” and “I am running” express similar ideas.

    The stemming process is governed by a set of rules. At the moment there are
    two main rule groups: prefix and postfix. Each rule is defined as
    [“replace this” : “with that”], a pair of strings separated by a
    colon. The <postfix> rules try to match the end of the word, while the
    <prefix> rules try to match the beginning. The program tries each of
    the rules in a group, from top to bottom, until one matches. It will
    apply ONLY ONE rule from each group.
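
    The matching procedure described above can be sketched roughly as
    follows. This is only an illustration in Python, not the OTS source;
    the names parse_rule, apply_group and stem are made up for the sketch,
    and running the prefix group before the postfix group is an assumption.

```python
def parse_rule(rule):
    """Split 'replace this:with that' into a (pattern, replacement) pair."""
    pattern, _, replacement = rule.partition(":")
    return pattern, replacement


def apply_group(word, rules, at_end):
    """Try the rules top to bottom and apply only the first one that matches."""
    for rule in rules:
        pattern, replacement = parse_rule(rule)
        if not pattern:
            continue  # nothing to match; skip a malformed rule
        if at_end and word.endswith(pattern):
            return word[: len(word) - len(pattern)] + replacement
        if not at_end and word.startswith(pattern):
            return replacement + word[len(pattern):]
    return word  # no rule in this group matched


def stem(word, prefix_rules, postfix_rules):
    # Assumption of this sketch: the prefix group runs first, then postfix.
    word = apply_group(word, prefix_rules, at_end=False)
    return apply_group(word, postfix_rules, at_end=True)
```

    An empty “with that” part, as in "ing:", simply removes the matched text.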

    The stem rules are defined in en.xml (or the matching file for any
    other language code); they look like this:

    <prefix>
            <rule>replaceThis:withThis</rule>
    </prefix>
    <postfix>
            <rule>sses:s</rule>
            <rule>ing:</rule>
            <rule>went:go</rule>
    </postfix>

    In the example file the program will replace “sses” at the end of a
    word with “s”, remove “ing” from the end of a word, and replace
    “went” with “go” (as a postfix rule, this also matches any word
    ending in “went”).

    In the example the program will be able to tell that:

    stem(“went”) == stem(“going”) == stem(“go”) == “go”

    As Alan said “There are some grammar rules for this but because English
    is such a bastard language they can be quite unreliable.”

    For example, we can't automatically drop the “s” at the end of a word
    to remove the plural: first, the word might end with “es”, and second,
    it may be a word such as “was”. One trick would be to place “es” before
    “s” and “e”, and maybe to begin with a list of words that break our
    algorithm.
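
    A small sketch of that trick, with the rule order “es”, “s”, “e” and
    an exception list that is checked first. The EXCEPTIONS set and its
    contents are hypothetical, and the rules are written as Python tuples
    here rather than read from en.xml.

```python
# Hypothetical list of words that the suffix rules would damage.
EXCEPTIONS = {"was", "is", "this"}

# Order matters: "es" must be tried before "s" and "e".
POSTFIX_RULES = [("es", ""), ("s", ""), ("e", "")]


def stem(word):
    if word in EXCEPTIONS:
        return word  # leave known problem words untouched
    for suffix, replacement in POSTFIX_RULES:
        if word.endswith(suffix):
            # Only the first matching rule is applied.
            return word[: len(word) - len(suffix)] + replacement
    return word
```

    With this ordering “houses” and “house” both reduce to “hous”, so they
    land in the same group even though the stem is not a real word, while
    “was” is left alone instead of becoming “wa”.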

    You can go wild with the list because stemming is O(N), where N is the
    number of words in the article; a longer list only adds a constant
    cost per word. We already have O(N^2) in some other place.
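
    To illustrate the cost, here is a minimal sketch of grouping the words
    of a text by stem in a single pass. The count_stems name and the
    whitespace tokenization are assumptions of the sketch; each word is
    checked against a fixed rule list exactly once.

```python
from collections import Counter

# Postfix rules from the example file, as (suffix, replacement) pairs.
RULES = [("sses", "s"), ("ing", ""), ("went", "go")]


def stem(word):
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


def count_stems(text):
    # One pass over the words; the work per word depends only on len(RULES).
    return Counter(stem(w) for w in text.lower().split())


counts = count_stems("I went there and I am going again so go")
```

    Here “went”, “going” and “go” are all counted under the stem “go”.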

    In order to fully support the 24+ languages that OTS supports,
    we need to define the rules for each language to make this connection.
    I know that for many languages this feature is critical (Russian, for example).

    Shalom,
    Nadav


