(Bug 3426) How fast is the dictionary algorithm? [2/2]

From: F J Franklin (F.J.Franklin@sheffield.ac.uk)
Date: Tue May 21 2002 - 13:02:58 EDT

Next message: David Chart: "Commit: French Dox"

Previous message: F J Franklin: "(Bug 3426) How fast is the dictionary algorithm? [1/2]"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]

---------- Forwarded message ----------
Date: Tue, 21 May 2002 09:53:31 -0700 (PDT)
From: Jesse Duke <darkphotn@yahoo.com>
To: F J Franklin <F.J.Franklin@sheffield.ac.uk>
Subject: Re5: How fast is the dictionary algorithm?

> Thanks for the reply. Do you mind if I forward it to the list?

Be my guest. This one too.

> Q: How can it possibly be O(1)? "It runs in O(1) time when considering
> the length of the dictionary file." - doesn't that imply O(log(N))
> at a minimum? (Genuinely curious, here...)

It is tricky if you store the dictionary file as a "brute force" list
instead of using a different data structure.

Imagine the following list of words;

ape, age, apple, agoo.

Every single word starts with an 'a'. It would be redundant to list it
four times. Just list it once and have it point to the rest of the
strings, with the first character removed.

a: {pe, ge, pple, goo}

Then do this recursively.

a: { p:{e, ple}, g:{e, oo}}

Not only is the dictionary file compressed (if it's coded right...), but
finding out whether or not a word is in the dictionary file takes O(k)
time, where n is the length of the dictionary file, and k is the length
of the word you're checking.

To see if "apeg" is in the dictionary file, it would go to the 'a'
section, then the 'ap' section (using a pointer), then the 'ape' section
(using a pointer), then it'd die (null pointer). So 'apeg' is not in the
dictionary data structure. It runs in O(1) time, where n is the length
of the dictionary file.

I'd strongly suggest allocating 'a' through 'z' pointers on most nodes,
so that a spellchecking algorithm can just access an index to get to the
next one.

if('a'<=word_to_check[i] && word_to_check[i]<='z')
     nextpointer = currentpointer[word_to_check[i]-'a'];
else if('A'<=word_to_check[i] && word_to_check[i]<='Z')
     nextpointer = currentpointer[word_to_check[i]-'A'];
else
     //search a linked list of special characters such as hyphens...

Also, storing CAPS and àççéñt (accent) information should be done within
the nodes themselves, so that foreign spellcheckers pick up both
"francais" and "français", for example.

Example:
"Español" -> "espanol" + [<CAPS> <> <> <> <~> <> <>]

On a side note, this data structure sorts strings in O(n) time.
Inserting a string takes O(1) time (where n is the number of strings to
insert), and there are n strings to insert. N times 1 is in O(n). Not
only are the strings sorted, but they are compressed at the same time.
Removing all strings can be done in O(n) time, where n is the number of
strings, but if they're already compressed, why bother removing them?

This method can be tweaked to work with integers of arbitrary size (just
divide the integer up into bytes and have 256 'characters'), sorting
integers in O(n) time. However, it's slow and wastes memory.
Chunkensort() would probably work much better for sorting ints. It's the
same basic idea, but most nodes are stored using other methods to reduce
memory and processor useage.

It should be possible to tweak this method to work on floating points.
Just consider the exponent as the first 'byte', and then sort the
mantissas as integers.

Apparently there's a bunch of computer science guys running around saying
that (n) sorts are either impossible or impractical. They're neither.
Actually, they're rather common. There's at least two different ones
floating around the internet, and I've come up with several on my own,
including some that sort in a fraction of (n) time. If I'm really the
first one to discover these, then "first post" and have fun, because
they'll be made public domain.

The trick to sorting in (n) time is to never compare values. Just
'magically' throw the number into it's sorted position, then take out all
the blank space between valid values. The only problem is that for
anything larger than 8 bits, you have a lot of blank space (4 billion
blank slots for 32-bit ints), so you have to figure out ways to compress
the data structure in realtime.

If Abiword needs any sorting algorithms, let me know.

Jesse Duke
darkphotn@yahoo.com

__________________________________________________
Do You Yahoo!?
LAUNCH - Your Yahoo! Music Experience
http://launch.yahoo.com

Next message: David Chart: "Commit: French Dox"
Previous message: F J Franklin: "(Bug 3426) How fast is the dictionary algorithm? [1/2]"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]

This archive was generated by hypermail 2.1.4 : Tue May 21 2002 - 13:07:04 EDT