Subject: Re: UCS-2 vs. UCS-4
From: Martin Sevior (msevior@mccubbin.ph.unimelb.edu.au)
Date: Tue Jun 26 2001 - 10:25:06 CDT
On Tue, 26 Jun 2001, Joaquin Cuenca Abela wrote:
>
> --- Thomas Fletcher <thomasf@qnx.com> wrote:
> > On Sat, 23 Jun 2001, Martin Sevior wrote:
> > >
> > > This is an interesting debate. One extra point we should all keep in mind
> > > is that we probably don't waste much more space going from 16 => 32 bits
> > > for character representation.
> >
> > [Other comments about sizes of data structures snipped]
> >
> > Martin,
> >
> > Call me crazy ... but I _totally_ don't believe this statement. For
> > anyone working on documents of any size, our memory consumption is an
> > issue. Deciding to double the per character memory requirements will
> > add up. While some systems are swappable ... we certainly don't want
> > to count out the fact that Abi could be used on smaller devices.
>
> While I agree that we should try to stay as small
> as possible, I agree with Martin.
>
> Suppose a document of 66 chars per line, 30 lines per
> page, 100 pages. If the doc contains no images (only
> lines and lines of chars), we have:
>
> 66 * 30 * 100 = 198000 chars
>
> If we use UCS-2, we will need ~400K to store only the
> text. If we use UCS-4, we will need ~800K
>
> Last time I took a look at files that big (with the test
> that I ran with the perl bindings), gtop was
> telling me that AbiWord was using 10M of memory (and I
> think that the file was not 100 pages long, it was ~50
> pages long, I think).
>
Cool! I get to explain "theory" to Joaquin as to why his experimental
numbers are so big.
OK. For a *single view* of Joaquin's document, assuming there are no changes
of any formatting properties whatsoever, look at this:
Remember all our text is not only stored in a big 16-bit array but it is
also stored in all sorts of classes.
A Frag_strux class and a fl_BlockLayout class per paragraph.
(Every Frag_strux gets its own class, every fl_BlockLayout gets its own
Block.)
One Line class per line. (Every line gets its own class.)
Let's say an average of two runs per line. (Every Run gets its own
class. The text Frags are hard to work out, but let's assume a frag per run
too.)
Assume Joaquin's document has 4 lines per paragraph.
100 pages => 33 lines * 100 + 66 runs * 100 + 132 Frags * 100 + 8 blocks
* 100
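The object counts above can be tallied with a short Python sketch. The
per-page figures (33 lines, 2 runs per line, 4 lines per paragraph) are the
rough assumptions from this post, not measured values, and the function and
constant names are made up for illustration. Frags are left out because they
do not enter the byte estimate later in this post.

```python
# Rough count of layout objects for a plain-text document.
# All figures are the post's assumptions, not measurements.
LINES_PER_PAGE = 33       # figure used in this post's estimate
RUNS_PER_LINE = 2         # assumed average
LINES_PER_PARAGRAPH = 4   # assumed


def layout_object_counts(pages):
    """Return the estimated number of line, run and block objects."""
    lines = LINES_PER_PAGE * pages
    runs = RUNS_PER_LINE * lines
    # 33 lines / 4 lines per paragraph, rounded down to 8 blocks per page
    blocks = (LINES_PER_PAGE // LINES_PER_PARAGRAPH) * pages
    return {"lines": lines, "runs": runs, "blocks": blocks}


if __name__ == "__main__":
    print(layout_object_counts(100))
```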
This is the main source of memory usage. Other sources include the Hash
table for each unique attribute/property combination, the Page classes,
the container classes.
Now look through the header files fp_Line.h, fp_Run.h, fp_TextRun.h,
pf_Frag.h and fl_BlockLayout.h.
Each function listed is worth 4 bytes on a 32-bit CPU.
Each member variable is also worth about 4 bytes.
Don't forget all the static variables in each function either. They have
to get counted in the total memory per class instance too. Finally, there
are classes embedded in these classes that can also grow (like a UT_Vector
of squiggles).
Now a quick glance through fl_BlockLayout makes me guess there are about
200 methods and member variables. That's 800 bytes right there. It's too
hard to add up all the static variables. A sizeof(fl_BlockLayout) would be
the most scientific.
I guess there are around 130 methods and member variables per fp_Run;
that's 520 bytes per run class.
And around 110 methods and member variables per fp_Line, so that's around
440 bytes per line.
OK. So our calculation based only on the layout classes is:
100 pages => 33 lines * 100 * 440 + 66 runs * 100 * 520
             + 8 blocks * 100 * 800
= 1452000 + 3432000 + 640000 = 5524000 bytes for a 100 page doc. Which
is about half of Joaquin's measurement of 10 megabytes for a 50 page
document. Not bad given the crude calculation!
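The arithmetic above can be reproduced with a minimal Python sketch. The
per-object byte sizes are this post's rough guesses (counted from the
headers, not taken from sizeof), and the function and constant names are
hypothetical.

```python
# Crude layout-memory estimate using the guessed per-object sizes.
BYTES_PER_LINE = 440    # guess for one line object
BYTES_PER_RUN = 520     # guess for one run object
BYTES_PER_BLOCK = 800   # guess for one fl_BlockLayout object


def layout_bytes(pages, lines_per_page=33, runs_per_page=66, blocks_per_page=8):
    """Estimated bytes used by line, run and block objects only."""
    per_page = (lines_per_page * BYTES_PER_LINE
                + runs_per_page * BYTES_PER_RUN
                + blocks_per_page * BYTES_PER_BLOCK)
    return pages * per_page


if __name__ == "__main__":
    print(layout_bytes(100))  # 5524000
```

Swapping in real sizeof values for the three constants would turn this
guess into something closer to a measurement.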
Now if the text were stored as UCS-2 that's 440 KBytes; if stored as UCS-4
that's 880 KBytes.
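As a quick check of the text-storage figures, here is a small Python sketch.
It assumes 66 characters per line and the 33 lines per page used in the
layout estimate (Joaquin's 30 lines per page gives his ~400K / ~800K
figures instead). The names are made up for illustration.

```python
# Raw character storage for the example document.
CHARS_PER_LINE = 66
LINES_PER_PAGE = 33   # figure used in this post; Joaquin used 30
PAGES = 100


def text_bytes(bytes_per_char, lines_per_page=LINES_PER_PAGE, pages=PAGES):
    """Bytes needed for the character array alone, ignoring all classes."""
    return CHARS_PER_LINE * lines_per_page * pages * bytes_per_char


ucs2 = text_bytes(2)   # 435600 bytes, roughly 440 KBytes
ucs4 = text_bytes(4)   # 871200 bytes, roughly 880 KBytes
extra = ucs4 - ucs2    # additional cost of switching to UCS-4

if __name__ == "__main__":
    print(ucs2, ucs4, extra)
```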
> So we're using only ~5% of space to save the text (if
> we use UCS-4 we will need 10%). So IMO we can switch
> to UCS-4 without caring about memory consumption (at
> least at first).
>
See above "theory" to confirm Joaquin's measurement :-)
> Of course, if somebody cares enough to change from the
> simplistic "only UCS-4" approach to a more complex
> approach such as Mike's that saves memory &/| speed,
> I will be more than happy, but it seems to me that
> this time would be better spent fixing the remaining
> 90% of memory.
>
This requires careful pruning of the run, line and blocklayout classes. In
the end I don't think it is worth the effort. People with big docs need
lots of memory. I can't imagine anyone wanting to edit more than a 1000
page document in a WYSIWYG WP, but that still "only" requires 100 - 200
megabytes. Quite reasonable on today's workstations. In a year's time it
will be even more reasonable.
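One way to arrive at a range like this is a straight linear extrapolation
from Joaquin's ~10 MB measurement. The sketch below assumes memory grows
linearly with page count and that the test file was somewhere between 50
and 100 pages. Both are assumptions made for illustration, and the
function name is hypothetical.

```python
def extrapolate_mb(measured_mb, measured_pages, target_pages):
    """Scale a memory measurement linearly with the page count."""
    return measured_mb * target_pages / measured_pages


low = extrapolate_mb(10, 100, 1000)   # if the 10 MB file was ~100 pages
high = extrapolate_mb(10, 50, 1000)   # if the 10 MB file was ~50 pages

if __name__ == "__main__":
    print(low, high)  # 100.0 200.0
```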
Cheers
Martin