[Snowball-discuss] Snowball goes Zope

Oleg Bartunov oleg@sai.msu.su
Fri, 15 Feb 2002 21:14:36 +0300 (GMT)

On Fri, 15 Feb 2002, Andreas Jung wrote:

> How large do you define "medium size". Let's say, I have 3 GB of text.
> Can GiST handle that ?

GiST probably can, if you design a data structure and define access methods
for it on top of GiST. GiST itself is just a framework for defining various
data types and indexed access methods. For OpenFTS we invented something like
signature files, but instead of files we use an RD-Tree for storing the
signatures, and the RD-Tree is implemented using GiST. That way we get
indexed access methods. A document is represented by a fixed-length signature,
and its distinct words are encoded in this signature.
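
To make the signature idea concrete, here is a toy sketch in Python. It is
not the OpenFTS or tsearch code: the signature length, the MD5-based hash and
the function names are all assumptions chosen for illustration. Each distinct
word sets one bit in a fixed-length bit string, and a query can only match a
document whose signature has all of the query's bits set.

```python
import hashlib

SIG_BITS = 2048  # assumed signature length, not a value taken from OpenFTS


def word_bit(word, sig_bits=SIG_BITS):
    """Map a word to one bit position (hypothetical hash choice)."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % sig_bits


def signature(words, sig_bits=SIG_BITS):
    """Superimpose the bits of all distinct words into one signature."""
    sig = 0
    for w in set(words):
        sig |= 1 << word_bit(w, sig_bits)
    return sig


def may_contain(doc_sig, query_words, sig_bits=SIG_BITS):
    """True if every query bit is set in the document signature.

    A True result can be a false drop (bit collision), so the document
    itself still has to be checked; a False result is always correct.
    """
    q = signature(query_words, sig_bits)
    return (doc_sig & q) == q


doc = signature("gist rd tree index gist".split())
print(may_contain(doc, ["gist", "tree"]))
```

The more distinct words a document has, the more bits are set in its
signature, and the more often an unrelated query word appears to match.
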
3GB of text says nothing to us. Our method is sensitive to the total
number of distinct words and the average number of distinct words per
document. The full theory behind our search engine is quite complex, but
I'd say that 2-4K distinct words per document is OK; if you
have 18K, it would still work, but slower, because the
probability of a false drop increases. Cost analysis is a very complicated
task. 18K distinct words is a lot. All of the PostgreSQL documentation
(>5MB of text) contains about 17K or so distinct words. After stemming this
number is much lower! Eliminating stop words also decreases this number.
But in any case, our search will be faster than a traditional LIKE
with a sequential scan. The challenge was to develop a full text search
engine integrated with the database and with fast updates. Inverted indices
are in no way suitable for modern web sites with a high rate of updates.
You can get a feel for it very quickly: just try the contrib/tsearch module
from the PostgreSQL 7.2 distribution (better to get it from CVS). The next
version of OpenFTS will use it.
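
For a rough feel of why the number of distinct words per document matters,
here is a back-of-the-envelope estimate in Python. It is only a sketch under
simplifying assumptions that are mine, not the real cost model: one bit per
word, uniform independent hashing, and an arbitrarily chosen signature length
of 16384 bits. Under those assumptions the fraction of bits set after n
distinct words is 1 - (1 - 1/m)^n, and a query word that is absent from the
document falsely matches with that probability.

```python
def false_drop_probability(distinct_words, sig_bits, query_words=1):
    """Estimate the chance that an unrelated query matches a signature.

    Assumes one uniformly hashed bit per word (an illustrative model only).
    """
    # Expected fraction of signature bits set by the document's words.
    fill = 1.0 - (1.0 - 1.0 / sig_bits) ** distinct_words
    # Every query word must hit an already-set bit to cause a false drop.
    return fill ** query_words


SIG_BITS = 16384  # assumed signature length for the illustration

for n in (2000, 4000, 18000):
    p = false_drop_probability(n, SIG_BITS)
    print(f"{n:6d} distinct words -> false drop probability {p:.3f}")
```

The estimate grows with the number of distinct words and shrinks as the
query contains more words, which matches the qualitative behaviour described
above.
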

U-uph, I feel I didn't really answer your question :-)

> Andreas

Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Snowball-discuss mailing list