(Bug 3426) How fast is the dictionary algorithm? [2/2]

From: F J Franklin (F.J.Franklin@sheffield.ac.uk)
Date: Tue May 21 2002 - 13:02:58 EDT

  • Next message: David Chart: "Commit: French Dox"

    ---------- Forwarded message ----------
    Date: Tue, 21 May 2002 09:53:31 -0700 (PDT)
    From: Jesse Duke <darkphotn@yahoo.com>
    To: F J Franklin <F.J.Franklin@sheffield.ac.uk>
    Subject: Re5: How fast is the dictionary algorithm?

    > Thanks for the reply. Do you mind if I forward it to the list?

    Be my guest. This one too.

    > Q: How can it possibly be O(1)? "It runs in O(1) time when considering
    > the length of the dictionary file." - doesn't that imply O(log(N))
    > at a minimum? (Genuinely curious, here...)

    It is tricky if you store the dictionary file as a "brute force" list
    instead of using a different data structure.

    Imagine the following list of words;

    ape, age, apple, agoo.

    Every single word starts with an 'a'. It would be redundant to list it
    four times. Just list it once and have it point to the rest of the
    strings, with the first character removed.

    a: {pe, ge, pple, goo}

    Then do this recursively.

    a: { p:{e, ple}, g:{e, oo}}

    Not only is the dictionary file compressed (if it's coded right...), but
    finding out whether or not a word is in the dictionary file takes O(k)
    time, where n is the length of the dictionary file, and k is the length
    of the word you're checking.

    To see if "apeg" is in the dictionary file, it would go to the 'a'
    section, then the 'ap' section (using a pointer), then the 'ape' section
    (using a pointer), then it'd die (null pointer). So 'apeg' is not in the
    dictionary data structure. It runs in O(1) time, where n is the length
    of the dictionary file.

    I'd strongly suggest allocating 'a' through 'z' pointers on most nodes,
    so that a spellchecking algorithm can just access an index to get to the
    next one.

    if('a'<=word_to_check[i] && word_to_check[i]<='z')
         nextpointer = currentpointer[word_to_check[i]-'a'];
    else if('A'<=word_to_check[i] && word_to_check[i]<='Z')
         nextpointer = currentpointer[word_to_check[i]-'A'];
    else
         //search a linked list of special characters such as hyphens...

    Also, storing CAPS and àççéñt (accent) information should be done within
    the nodes themselves, so that foreign spellcheckers pick up both
    "francais" and "français", for example.

    Example:
    "Español" -> "espanol" + [<CAPS> <> <> <> <~> <> <>]

    On a side note, this data structure sorts strings in O(n) time.
    Inserting a string takes O(1) time (where n is the number of strings to
    insert), and there are n strings to insert. N times 1 is in O(n). Not
    only are the strings sorted, but they are compressed at the same time.
    Removing all strings can be done in O(n) time, where n is the number of
    strings, but if they're already compressed, why bother removing them?

    This method can be tweaked to work with integers of arbitrary size (just
    divide the integer up into bytes and have 256 'characters'), sorting
    integers in O(n) time. However, it's slow and wastes memory.
    Chunkensort() would probably work much better for sorting ints. It's the
    same basic idea, but most nodes are stored using other methods to reduce
    memory and processor useage.

    It should be possible to tweak this method to work on floating points.
    Just consider the exponent as the first 'byte', and then sort the
    mantissas as integers.

    Apparently there's a bunch of computer science guys running around saying
    that (n) sorts are either impossible or impractical. They're neither.
    Actually, they're rather common. There's at least two different ones
    floating around the internet, and I've come up with several on my own,
    including some that sort in a fraction of (n) time. If I'm really the
    first one to discover these, then "first post" and have fun, because
    they'll be made public domain.

    The trick to sorting in (n) time is to never compare values. Just
    'magically' throw the number into it's sorted position, then take out all
    the blank space between valid values. The only problem is that for
    anything larger than 8 bits, you have a lot of blank space (4 billion
    blank slots for 32-bit ints), so you have to figure out ways to compress
    the data structure in realtime.

    If Abiword needs any sorting algorithms, let me know.

    Jesse Duke
    darkphotn@yahoo.com

    __________________________________________________
    Do You Yahoo!?
    LAUNCH - Your Yahoo! Music Experience
    http://launch.yahoo.com



    This archive was generated by hypermail 2.1.4 : Tue May 21 2002 - 13:07:04 EDT