[Snowball-discuss] Snowball goes Zope
"Andreas Jung" <email@example.com>
Fri, 15 Feb 2002 13:22:21 -0500
Currently we store all text-index-related data inside Btrees
(usually the mapping from word to wordId). For globbing search we
also have a separate tree for keeping the digrams. Updating that
tree for globbing search is especially time-consuming.
Just let me have a look at some of the papers to get an idea
how the stuff works and if it might be suitable for us :-)
What about the license? Under what license is the GiST stuff?
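For readers unfamiliar with the digram tree mentioned above, here is a minimal sketch of the general idea, not the actual Zope text index code. The class name DigramIndex, the '$' boundary marker, and the restriction to the '*' and '?' wildcards are assumptions made for illustration; a plain dict stands in for the tree.

```python
from collections import defaultdict
from fnmatch import fnmatchcase

WILDCARDS = "*?"  # this sketch handles only these two glob characters


def digrams(text):
    """Return all adjacent character pairs of text, with '$' marking both ends."""
    padded = f"${text}$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def pattern_digrams(pattern):
    """Digrams that every word matching the glob pattern must contain."""
    return {d for d in digrams(pattern) if not any(c in WILDCARDS for c in d)}


class DigramIndex:
    """Hypothetical digram -> words mapping (a dict stands in for the tree)."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, word):
        # One entry is touched per digram, which is why updates are costly.
        for d in digrams(word):
            self.postings[d].add(word)

    def glob(self, pattern):
        needed = pattern_digrams(pattern)
        if needed:
            candidates = set.intersection(
                *(self.postings.get(d, set()) for d in needed)
            )
        else:
            # The pattern has no literal digram, so every word is a candidate.
            candidates = set().union(*self.postings.values())
        # Digrams only narrow the candidates; the real match is checked here.
        return sorted(w for w in candidates if fnmatchcase(w, pattern))
```

After adding "stem", "stemmer", "stemming" and "system", the pattern "stem*" first selects all four words, because "system" also contains the digrams $s, st, te and em, and the final fnmatchcase check then drops "system". Each add() writes one entry per digram of the word, which is consistent with the update cost described above.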
----- Original Message -----
From: "Oleg Bartunov" <firstname.lastname@example.org>
To: "Andreas Jung" <email@example.com>
Cc: <firstname.lastname@example.org>; <email@example.com>
Sent: Friday, February 15, 2002 13:14
Subject: Re: [Snowball-discuss] Snowball goes Zope
> On Fri, 15 Feb 2002, Andreas Jung wrote:
> > How large is "medium size" by your definition? Let's say I have 3 GB of text.
> > Can GiST handle that?
> GiST probably can, if you invent a data structure and define access methods
> based on GiST. GiST itself is just a way to define various data types
> and indexed access methods. For OpenFTS we invented something like
> signature files, but instead of files we use an RD-Tree for storing
> signatures, and the RD-Tree is implemented using GiST. That way we get
> indexed access methods. A document is represented by a fixed-length
> signature, and its distinct words are encoded in this signature.
> 3 GB of text says nothing to us. Our method is sensitive to the total
> number of distinct words and the average number of distinct words per
> document. The full theory behind our search engine is quite complex, but
> I'd say that 2-4 K distinct words per document is OK. If you
> have 18 K, that would work, but more slowly, because the probability
> of false drops increases. Cost analysis is a very complicated task.
> 18 K distinct words is a lot. All of the PostgreSQL documentation (>5 MB of text)
> contains about 17 K or so distinct words. After stemming this number
> is much lower! Eliminating stop words also decreases this number.
> But in any case our search will be faster than a traditional LIKE
> with a seq. scan. The challenge was to develop a full-text search engine
> integrated with the database and with fast updates. Inverted indices are
> not suitable for modern web sites with a high rate of updates.
> You may get a feeling for it very quickly: just try the contrib/tsearch module from
> the 7.2 distribution (better to get it from CVS). The next version of
> OpenFTS will use it.
> U-uph, I feel I didn't answer your question :-)
> > Andreas
> Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
> Sternberg Astronomical Institute, Moscow University (Russia)
> Internet: firstname.lastname@example.org, http://www.sai.msu.su/~megera/
> phone: +007(095)939-16-83, +007(095)939-23-83
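To make the signature idea in the quoted reply concrete, here is a minimal sketch of superimposed word signatures with a recheck step. It is not the OpenFTS or contrib/tsearch code: the 64-bit signature length, the CRC32 hash, and all function names are assumptions chosen for illustration, and a flat scan stands in for the RD-Tree.

```python
import zlib

SIG_BITS = 64  # hypothetical signature length; real systems tune this


def word_bit(word, bits=SIG_BITS):
    """Map a word to one bit position (CRC32 is an arbitrary choice here)."""
    return zlib.crc32(word.encode("utf-8")) % bits


def signature(words, bits=SIG_BITS):
    """OR together one bit per distinct word to get a fixed-length signature."""
    sig = 0
    for w in set(words):
        sig |= 1 << word_bit(w, bits)
    return sig


def may_contain(doc_sig, query_words, bits=SIG_BITS):
    """True if every query bit is set; this can be wrong (a false drop)."""
    q = signature(query_words, bits)
    return (doc_sig & q) == q


def search(docs, query_words, bits=SIG_BITS):
    """docs maps a document id to its list of words; returns matching ids."""
    hits = []
    for doc_id, words in docs.items():
        if may_contain(signature(words, bits), query_words, bits):
            # Recheck against the real words to discard false drops.
            if set(query_words) <= set(words):
                hits.append(doc_id)
    return hits
```

A signature test never misses a document that really contains the query words, but it can accept one that does not. The more distinct words a document has, the more bits are set in its signature, so more candidates survive the bit test and have to be rechecked. That matches the remark above that documents with many distinct words still work but are slower.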