[Snowball-discuss] Snowball goes Zope

Fri, 15 Feb 2002 21:48:48 +0300 (GMT)

On Fri, 15 Feb 2002, Andreas Jung wrote:

> Currently we store all textindex related data inside Btrees
> (usually the mapping word to wordId). For globbing search we

So that is an inverted index, am I right?

> also have a separate tree for keeping the digrams. Especially
> updating the tree for globbing search is very time consuming.

I'd prefer trigrams, because the selectivity of digrams is very low:
there are only 27^2 = 729 of them in total. By the way, n-grams could be
stored very easily using GiST.
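
To make the selectivity point concrete: with a 27-symbol alphabet there
are 27^2 = 729 possible digrams but 27^3 = 19683 possible trigrams, so
each trigram matches far fewer words. Below is a minimal sketch in
Python of a trigram index used to narrow a globbing search. It is not
Zope, OpenFTS or GiST code. The names trigrams, build_index and
glob_search and the "$" boundary marker are assumptions made for
illustration, and only the "*" wildcard is used for pruning.

```python
import fnmatch
from collections import defaultdict

PAD = "$"  # hypothetical word-boundary marker, assumed absent from the words


def trigrams(text):
    """Return the set of 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(words):
    """Map each trigram to the set of words that contain it."""
    index = defaultdict(set)
    for word in words:
        for gram in trigrams(PAD + word + PAD):
            index[gram].add(word)
    return index


def glob_search(index, words, pattern):
    """Match a glob pattern, using trigrams to cut down the candidates."""
    required = set()
    # Patterns with "?" or "[" are not pruned here, only verified below.
    if not any(c in pattern for c in "?["):
        # The literal pieces between "*" must appear in every match.
        for piece in (PAD + pattern + PAD).split("*"):
            required |= trigrams(piece)
    if required:
        candidates = set.intersection(
            *(index.get(gram, set()) for gram in required))
    else:
        candidates = set(words)  # nothing to prune with: check every word
    # The trigram test only narrows the set; verify each candidate.
    return sorted(w for w in candidates if fnmatch.fnmatchcase(w, pattern))


words = ["stem", "stemming", "stemmer", "string", "zope", "snowball"]
index = build_index(words)
print(glob_search(index, words, "stem*"))
print(glob_search(index, words, "*ing"))
```

A pattern such as "s*g" has no literal piece of three characters, so
nothing can be pruned and every word is checked. The same happens much
more often with digrams, whose posting lists are long.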

> Just let me have a look at some of the papers to get an idea
> how the stuff works and if it might be suitable for us :-)
>
> What about the license ? Under what license is the GiST stuff ?

BSD, the same as PostgreSQL.

>
> - aj
>
>
> ----- Original Message -----
> From: "Oleg Bartunov" <oleg@sai.msu.su>
> To: "Andreas Jung" <andreas@zope.com>
> Cc: <k2pts@cytanet.com.cy>; <snowball-discuss@lists.sourceforge.net>
> Sent: Friday, February 15, 2002 13:14
> Subject: Re: [Snowball-discuss] Snowball goes Zope
>
>
> > On Fri, 15 Feb 2002, Andreas Jung wrote:
> >
> > > How large is "medium size" by your definition? Let's say I have
> > > 3 GB of text. Can GiST handle that?
> >
> > GiST probably can, if you invent a data structure and define access
> > methods based on GiST. GiST itself is just a way to define various data
> > types and indexed access methods. For OpenFTS we invented something like
> > signature files, but instead of files we use an RD-Tree for storing the
> > signatures, and the RD-Tree is implemented using GiST. That way we get
> > indexed access methods. A document is represented by a fixed-length
> > signature, and its distinct words are encoded in this signature.
> > "3 GB of text" tells us nothing. Our method is sensitive to the total
> > number of distinct words and the average number of distinct words per
> > document.
> > The full theory behind our search engine is quite complex, but
> > I'd say that 2-4K distinct words per document is OK. With 18K it
> > would still work, but slower, because the probability of a false drop
> > increases. Cost analysis is a very complicated task.
> > 18K distinct words is a lot. The whole PostgreSQL documentation (>5 MB
> > of text) contains about 17K or so distinct words. After stemming this
> > number is much lower! Eliminating stop words also decreases this number.
> > But in any case, our search will be faster than a traditional LIKE
> > search with a sequential scan. The challenge was to develop a full-text
> > search engine integrated with the database and with fast updates.
> > Inverted indices are in no way suitable for modern web sites with a high
> > rate of updates.
> > You can get a feel for it very quickly: just try the contrib/tsearch
> > module from the 7.2 distribution (better to get it from CVS). The next
> > version of OpenFTS will use it.
> >
> > U-uph, I feel I didn't really answer your question :-)
> >
> >
> >
> > >
> > > Andreas
> > >
> > >
> > >
> >
> > Regards,
> > Oleg
> > _____________________________________________________________
> > Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
> > Sternberg Astronomical Institute, Moscow University (Russia)
> > Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> > phone: +007(095)939-16-83, +007(095)939-23-83
> >
> >
>
>
> _______________________________________________
> Snowball-discuss mailing list
> Snowball-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/snowball-discuss
>
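
The signature idea from the quoted message can be sketched in a few
lines of Python. This is only an illustration of superimposed coding
under stated assumptions, not the OpenFTS or contrib/tsearch code: the
signature length SIG_BITS, the number of bits per word BITS_PER_WORD,
the SHA-256 hash and all function names are hypothetical, and a flat
scan over the documents stands in for the RD-Tree, which the sketch does
not attempt to reproduce.

```python
import hashlib

SIG_BITS = 64       # hypothetical signature length in bits
BITS_PER_WORD = 2   # hypothetical number of bits set per word


def word_mask(word):
    """Hash a word to a bit mask with at most BITS_PER_WORD bits set."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    mask = 0
    for i in range(BITS_PER_WORD):
        chunk = digest[4 * i:4 * i + 4]
        mask |= 1 << (int.from_bytes(chunk, "big") % SIG_BITS)
    return mask


def signature(words):
    """OR together the masks of a document's distinct words."""
    sig = 0
    for word in set(words):
        sig |= word_mask(word)
    return sig


def density(sig):
    """Fraction of bits set; a denser signature gives more false drops."""
    return bin(sig).count("1") / SIG_BITS


def search(docs, word):
    """Return (hits, false_drops) for a one-word query over {id: words}."""
    query = word_mask(word)
    hits, false_drops = [], []
    for doc_id, words in docs.items():
        # Signature test: every bit of the query must be set.
        if (signature(words) & query) == query:
            # Verify against the real words to weed out false drops.
            if word.lower() in {w.lower() for w in words}:
                hits.append(doc_id)
            else:
                false_drops.append(doc_id)
    return hits, false_drops


docs = {
    1: ["snowball", "stemmer", "zope"],
    2: ["postgresql", "gist", "index"],
}
hits, false_drops = search(docs, "stemmer")
print(hits, false_drops, density(signature(docs[1])))
```

A document that contains the query word always passes the signature
test, so there are no misses. A false drop is a document that passes the
test only because other words happened to set the same bits, which
becomes more likely as the number of distinct words per document grows
and the signature fills up.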

	Regards,
		Oleg