[Snowball-discuss] Snowball goes Zope

Andreas Jung <andreas@zope.com>
Fri, 15 Feb 2002 13:22:21 -0500


Currently we store all text-index data inside BTrees
(usually the mapping from word to wordId). For globbing search we
also keep a separate tree holding the digrams. Updating that tree
is especially time-consuming.
Let me have a look at some of the papers to get an idea of how
this works and whether it might be suitable for us :-)
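
To make the digram approach concrete, here is a minimal sketch in Python.
It is not the actual Zope code: a plain dict and set stand in for the
BTrees, and the names DigramIndex, digrams, add and glob are made up for
illustration. It only handles the wildcards '*' and '?'.

```python
import re
from collections import defaultdict
from fnmatch import fnmatchcase


def digrams(text):
    """Return the set of all 2-character substrings of text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


class DigramIndex:
    """Toy digram index; dict and set stand in for the BTrees."""

    def __init__(self):
        self.words = {}                    # wordId -> word
        self.postings = defaultdict(set)   # digram -> set of wordIds

    def add(self, word_id, word):
        # '$' marks the word boundaries, so "$s" means "starts with s".
        # Each new word touches up to len(word) + 1 postings entries,
        # which is where the update cost comes from.
        self.words[word_id] = word
        for dg in digrams(f"${word}$"):
            self.postings[dg].add(word_id)

    def glob(self, pattern):
        """Match a pattern that uses only the wildcards '*' and '?'."""
        required = set()
        for piece in re.split(r"[*?]", f"${pattern}$"):
            required |= digrams(piece)
        if required:
            candidates = set.intersection(
                *(self.postings.get(dg, set()) for dg in required))
        else:
            candidates = set(self.words)
        # The digram filter can over-select, so verify each candidate.
        return sorted(w for w in (self.words[i] for i in candidates)
                      if fnmatchcase(w, pattern))


index = DigramIndex()
for word_id, word in enumerate(
        ["stemming", "stopping", "stop", "snowball", "string"]):
    index.add(word_id, word)
print(index.glob("sto*ing"))   # ['stopping']
```

The intersection only narrows the candidate set; the final fnmatchcase
check is what guarantees a correct result.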

What about the license? Under what license is the GiST stuff released?

- aj


----- Original Message -----
From: "Oleg Bartunov" <oleg@sai.msu.su>
To: "Andreas Jung" <andreas@zope.com>
Cc: <k2pts@cytanet.com.cy>; <snowball-discuss@lists.sourceforge.net>
Sent: Friday, February 15, 2002 13:14
Subject: Re: [Snowball-discuss] Snowball goes Zope


> On Fri, 15 Feb 2002, Andreas Jung wrote:
>
> > How large is "medium size" by your definition? Let's say I have 3 GB of text.
> > Can GiST handle that?
>
> GiST probably can, if you invent a data structure and define access methods
> based on GiST. GiST itself is just a way to define various data types
> and indexed access methods. For OpenFTS we invented something like
> signature files, but instead of files we use an RD-Tree for storing
> signatures, and the RD-Tree is implemented using GiST. That way we get
> indexed access methods. A document is represented by a fixed-length
> signature; its distinct words are encoded in this signature.
> 3 GB of text tells us nothing. Our method is sensitive to the total
> number of distinct words and the average number of distinct words per
> document.
> The full theory behind our search engine is quite complex, but
> I'd say that 2-4 K distinct words per document is OK. If you
> have 18 K, it would still work, but slower, because the
> probability of false drops increases. Cost analysis is a very complicated task.
> 18 K distinct words is a lot. The entire PostgreSQL documentation (>5 MB of text)
> contains about 17 K distinct words. After stemming this number
> is much lower! Eliminating stop words also decreases this number.
> But in any case, our search will be faster than a traditional LIKE
> query with a sequential scan. The challenge was to develop a full-text search
> engine integrated with the database and with fast updates. Inverted indices
> are not suitable for modern web sites with a high rate of updates.
> You can get a feel for it very quickly: just try the contrib/tsearch module
> from the PostgreSQL 7.2 distribution (better to get it from CVS). The next
> version of OpenFTS will use it.
>
> U-uph, I feel I didn't answer your question :-)
>
>
>
> >
> > Andreas
> >
> >
> >
>
> Regards,
> Oleg
> _____________________________________________________________
> Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
> Sternberg Astronomical Institute, Moscow University (Russia)
> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> phone: +007(095)939-16-83, +007(095)939-23-83
>
>
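
The signature idea in the quoted message can be sketched roughly as
follows. This is an illustration only and not OpenFTS or tsearch code:
there is no RD-Tree or GiST here, just a linear scan over the signatures.
The 64-bit signature length, the one-bit-per-word encoding, the MD5-based
hash and all function names are assumptions of mine. A "false drop" is a
document whose signature matches the query although the document does not
contain the query words.

```python
import hashlib

SIG_BITS = 64  # hypothetical signature length


def word_bit(word, sig_bits=SIG_BITS):
    """Map a word to one bit position with a deterministic hash."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % sig_bits


def signature(words, sig_bits=SIG_BITS):
    """Superimpose the bits of all distinct words into one integer."""
    sig = 0
    for word in set(words):
        sig |= 1 << word_bit(word, sig_bits)
    return sig


def may_contain(doc_sig, query_sig):
    """True if every bit set in the query is also set in the document."""
    return (doc_sig & query_sig) == query_sig


def search(docs, query_words):
    """Return (hits, false_drops) for a dict docId -> list of words."""
    query_sig = signature(query_words)
    hits, false_drops = [], []
    for doc_id, words in docs.items():
        if not may_contain(signature(words), query_sig):
            continue  # the signature rules this document out for certain
        if set(query_words) <= set(words):
            hits.append(doc_id)
        else:
            false_drops.append(doc_id)  # signature matched, words did not
    return hits, false_drops


def false_drop_estimate(n_distinct_words, sig_bits):
    """Chance that a given bit is set by accident, assuming uniform hashing.

    For a one-word query this approximates the false drop probability.
    """
    return 1.0 - (1.0 - 1.0 / sig_bits) ** n_distinct_words


docs = {1: ["snowball", "stemmer"], 2: ["zope", "index"],
        3: ["snowball", "zope"]}
print(search(docs, ["snowball"]))
```

A signature test can never miss a real match, it can only report extra
candidates. The more distinct words a document has relative to the
signature length, the more bits are set and the more often that happens,
which is the effect false_drop_estimate models under the stated assumptions.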


_______________________________________________
Snowball-discuss mailing list
Snowball-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/snowball-discuss