[Snowball-discuss] Snowball goes Zope
Fri, 15 Feb 2002 21:48:48 +0300 (GMT)
On Fri, 15 Feb 2002, Andreas Jung wrote:
> Currently we store all textindex related data inside Btrees
> (usually the mapping word to wordId). For globbing search we
An inverted index, am I right?
> also have a separate tree for keeping the digrams. Especially
> updating the tree for globbing search is very time consuming.
I'd prefer trigrams, because the selectivity of digrams is very low -
the total number is only 27^2. Btw, n-grams could be stored very easily using
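
To put numbers on the digram/trigram comparison, here is a minimal Python sketch. It assumes the 27 comes from 26 letters plus one word-boundary symbol; the helper name ngrams and the "_" padding character are made up for illustration and are not taken from Zope or OpenFTS.

```python
ALPHABET_SIZE = 27  # assumption: 26 letters plus one word-boundary symbol

def ngrams(word, n, pad="_"):
    """Return the set of character n-grams of `word`, padded at both ends."""
    padded = pad + word.lower() + pad
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

# Size of the key space: the more possible keys, the fewer words share one.
digram_space = ALPHABET_SIZE ** 2   # 729 possible digrams
trigram_space = ALPHABET_SIZE ** 3  # 19683 possible trigrams

print(digram_space, trigram_space)
print(sorted(ngrams("zope", 2)))
print(sorted(ngrams("zope", 3)))
```

With only 729 possible digrams, each digram is shared by a large part of the vocabulary, so a globbing lookup has to intersect long candidate lists; trigrams spread the same words over 27 times as many keys.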
> Just let me have a look at some of the papers to get an idea
> how the stuff works and if it might be suitable for us :-)
> What about the license ? Under what license is the GiST stuff ?
BSD, the same as PostgreSQL.
> - aj
> ----- Original Message -----
> From: "Oleg Bartunov" <email@example.com>
> To: "Andreas Jung" <firstname.lastname@example.org>
> Cc: <email@example.com>; <firstname.lastname@example.org>
> Sent: Friday, February 15, 2002 13:14
> Subject: Re: [Snowball-discuss] Snowball goes Zope
> > On Fri, 15 Feb 2002, Andreas Jung wrote:
> > > How do you define "medium size"? Let's say I have 3 GB of text.
> > > Can GiST handle that?
> > GiST probably can, if you invent a data structure and define access methods
> > based on GiST. GiST itself is just a way to define various data types
> > and indexed access methods. For OpenFTS we invented something like
> > signature files, but instead of files we use an RD-Tree for storing
> > signatures, and the RD-Tree is implemented using GiST. That way we get
> > indexed access methods. A document is represented by a fixed-length
> > signature, and its distinct words are encoded in this signature.
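
As a rough illustration of the signature idea (superimposed coding), and not the actual OpenFTS or tsearch code: each distinct word sets one bit in a fixed-length bit string, and a document can match a query only if every query bit is set in the document's signature. The signature length, the use of crc32 as the hash, and all names below are assumptions made for this sketch.

```python
import zlib

SIG_BITS = 64  # assumption: illustrative signature length

def word_bit(word):
    """Map a word to one bit position (crc32 is stable across runs)."""
    return zlib.crc32(word.encode("utf-8")) % SIG_BITS

def signature(words):
    """OR together the bits of all distinct words into one integer."""
    sig = 0
    for w in set(words):
        sig |= 1 << word_bit(w)
    return sig

def may_contain(doc_sig, query_words):
    """True if every query bit is set; False means a definite miss."""
    q = signature(query_words)
    return (doc_sig & q) == q

doc = "snowball stemmer for zope text index".split()
doc_sig = signature(doc)
print(may_contain(doc_sig, ["zope", "stemmer"]))  # True: words are in doc
```

A True answer only says the document may contain the words, since two different words can land on the same bit; candidates still have to be checked against the real text. A False answer is always correct.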
> > 3 GB of text tells us nothing. Our method is sensitive to the total
> > number of distinct words and the average number of distinct words per
> > document. The full theory behind our search engine is quite complex, but
> > I'd say that 2-4 K distinct words per document is OK; if you
> > have 18 K, it would still work but more slowly, because the probability
> > of a false drop increases. Cost analysis is a very complicated task.
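
To see why more distinct words per document means more false drops, here is a back-of-the-envelope sketch. It assumes each distinct word sets one uniformly random bit in an m-bit signature, so after n words the expected fraction of set bits is 1 - (1 - 1/m)^n, and an absent query word looks present with that probability. The signature length 16384 and the function name are illustrative assumptions, not OpenFTS parameters, and the real cost analysis for an RD-Tree is more involved than this.

```python
def false_drop_probability(n_words, sig_bits, query_words=1):
    """Estimate the chance that absent query words all hit set bits.

    Assumes one uniformly random bit per distinct word and independent
    query words; this is a simplification, not the OpenFTS cost model.
    """
    fill = 1.0 - (1.0 - 1.0 / sig_bits) ** n_words  # fraction of bits set
    return fill ** query_words

SIG_BITS = 16384  # assumption: hypothetical signature length

for n in (2000, 4000, 18000):
    print(n, round(false_drop_probability(n, SIG_BITS), 3))
```

Under these assumptions the estimate grows steadily with the number of distinct words, and drops quickly as the query contains more words, because every query word has to collide by accident.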
> > 18K distinct words is a lot. All the PostgreSQL documentation (>5 MB of text)
> > consists of about 17K or so distinct words. After stemming this number
> > is much lower! Eliminating stop words also decreases this number.
> > But in any case, our search will be faster than a traditional LIKE
> > with a seq. scan. The challenge was to develop a full-text search engine
> > integrated with the database and with fast updates. Inverted indices are
> > in no way suitable for modern web sites with a high rate of updates.
> > You can get a feel for it very quickly - just try the contrib/tsearch module
> > from the 7.2 distribution (better to get it from CVS). The next version of
> > OpenFTS will use it.
> > U-uph, I feel I didn't answer your question :-)
> > >
> > > Andreas
> > Regards,
> > Oleg
> > _____________________________________________________________
> > Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
> > Sternberg Astronomical Institute, Moscow University (Russia)
> > Internet: email@example.com, http://www.sai.msu.su/~megera/
> > phone: +007(095)939-16-83, +007(095)939-23-83
> Snowball-discuss mailing list
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: firstname.lastname@example.org, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83