[Xapian-discuss] Size of the index

Robert Young rob at roryoung.co.uk
Tue Nov 25 18:19:00 GMT 2008


You could use a SimpleStopper

http://xapian.org/docs/apidoc/html/classXapian_1_1SimpleStopper.html
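
To make the idea concrete, here is a minimal, self-contained sketch of what a
stopper does. WordListStopper and keep_terms are hypothetical stand-ins written
for this example, not Xapian classes: the stopper is a callable that says
whether a term is a stop word, and the indexing step skips the terms it
rejects. The trailing comment shows the corresponding Xapian calls, where the
argument to set_stopper() is a pointer to the stopper object. How much this
shrinks a real index depends on your Xapian version and settings, so treat it
as an illustration rather than a measurement.

```cpp
#include <cctype>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a stopper: a callable that answers
// "is this term a stop word?". This is not a Xapian class.
class WordListStopper {
public:
    void add(const std::string& word) { words_.insert(word); }

    bool operator()(const std::string& term) const {
        return words_.count(term) != 0;
    }

private:
    std::set<std::string> words_;
};

// Split text on whitespace, lowercase each token, and keep only the
// tokens that the stopper does not reject.
std::vector<std::string> keep_terms(const std::string& text,
                                    const WordListStopper& stop) {
    std::vector<std::string> kept;
    std::istringstream in(text);
    std::string token;
    while (in >> token) {
        for (char& c : token)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        if (!stop(token))
            kept.push_back(token);
    }
    return kept;
}

// With Xapian itself, the corresponding calls would be roughly:
//   Xapian::SimpleStopper stopper;
//   stopper.add("the");
//   stopper.add("and");
//   indexer.set_stopper(&stopper);  // stopper must outlive the indexing
```

Calling keep_terms("The cat and a dog", stop) with "the", "and" and "a" added
to the stopper keeps only "cat" and "dog".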

On Tue, Nov 25, 2008 at 3:55 PM, Justine Demeyer
<justine.demeyer at gmail.com> wrote:

> Thanks for your help, but I don't know how to use these stop words. I saw
> that I have to add indexer.set_stopper() to my file, but what do I have to
> put between the ()?
>
> Thanks
>
> 2008/11/25 Robert Young <rob at roryoung.co.uk>
>
> > Oops, xapian-discuss doesn't seem to set reply-to.
> >
> > Stop words are words that appear in such a high proportion of the
> > documents in your corpus that they can be safely ignored: words like
> > 'the', 'and', 'a', etc. Remove these and you can improve the precision of
> > your queries and the performance of both queries and indexing, and reduce
> > the size of your index, at the potential expense of recall.
> >
> > Cheers
> > Rob
> >
> > On Tue, Nov 25, 2008 at 2:23 PM, Justine Demeyer
> > <justine.demeyer at gmail.com> wrote:
> >
> > >
> > > Ok, thanks!!
> > >
> > > But what is the purpose of the stop words??
> > >
> > >
> > > 2008/11/25 Robert Young <rob at roryoung.co.uk>
> > >
> > >> As Henry alluded to earlier, you could potentially reduce the size of
> > >> your index by removing stop words.
> > >>
> > >> Cheers
> > >> Rob
> > >>
> > >>
> > >> On Tue, Nov 25, 2008 at 10:32 AM, Justine Demeyer <
> > >> justine.demeyer at gmail.com> wrote:
> > >>
> > >>> Here is the indexing code:
> > >>>
> > >>> void Index(char* ind, char* directory)
> > >>> {
> > >>>       try
> > >>>       {
> > >>>           timeval tim;
> > >>>           double t1, t2, dif;
> > >>>
> > >>>           string index(ind);
> > >>>
> > >>>           // Start time of the operation
> > >>>           gettimeofday(&tim, NULL);
> > >>>       t1=tim.tv_sec+(tim.tv_usec/1000000.0);
> > >>>
> > >>>       // Create or open the index
> > >>>       Xapian::WritableDatabase db(ind, Xapian::DB_CREATE_OR_OPEN);
> > >>>       Xapian::TermGenerator indexer;
> > >>>       Xapian::Stem stemmer("english");
> > >>>       indexer.set_stemmer(stemmer);
> > >>>
> > >>>
> > >>>       struct dirent *lecture;
> > >>>       DIR *rep;
> > >>>
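> > >>>       rep = opendir(directory);
> > >>>       if(!rep)
> > >>>       {
> > >>>               // opendir() returns NULL on failure
> > >>>               cout << "cannot open " << directory << endl;
> > >>>               return;
> > >>>       }
> > >>>       while((lecture = readdir(rep)))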
> > >>>       {
> > >>>
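> > >>>               char* name = lecture->d_name;
> > >>>               std::string name2(name);
> > >>>
> > >>>               // Skip the "." and ".." entries returned by readdir()
> > >>>               if(name2 == "." || name2 == "..")
> > >>>                   continue;
> > >>>
> > >>>               // Add a separator in case directory has no trailing '/'
> > >>>               string path= string(directory) + "/" + name2;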
> > >>>
> > >>>               ifstream fichier(path.c_str(), ios::in);
> > >>>
> > >>>               // this test fails if the file could not be opened
> > >>>               if(fichier)
> > >>>               {
> > >>>                   string ligne; // holds each line read
> > >>>                       string contenu;
> > >>>
> > >>>                   // this loop stops at end of file or on a read error
> > >>>                       while(std::getline(fichier, ligne))
> > >>>                       {
> > >>>                           contenu = contenu + ligne + "\n";
> > >>>                       }
> > >>>
> > >>>                   // Indexing
> > >>>                       Xapian::Document doc;
> > >>>                       doc.set_data(contenu);
> > >>>
> > >>>                       indexer.set_document(doc);
> > >>>                       indexer.index_text(contenu);
> > >>>
> > >>>                       db.add_document(doc);
> > >>>                       cout << "add " << path.c_str() << endl;
> > >>>
> > >>>               }
> > >>>
> > >>>
> > >>>       }
> > >>>       // Commit the pending changes
> > >>>       cout << "Optimizing" << endl;
> > >>>       db.flush();
> > >>>       closedir(rep);
> > >>>
> > >>>       // End time of the operation
> > >>>       gettimeofday(&tim, NULL);
> > >>>       t2=tim.tv_sec+(tim.tv_usec/1000000.0);
> > >>>
> > >>>       // Compute the duration of the operation
> > >>>       dif = t2 - t1;
> > >>>       Calculate(dif);
> > >>>
> > >>>
> > >>>   }
> > >>>       catch (const Xapian::Error &e)
> > >>>       {
> > >>>               cout << e.get_description() << endl;
> > >>>       }
> > >>> }
> > >>>
> > >>> Thanks for helping me
> > >>>
> > >>>
> > >>> 2008/11/25 Henry <henka at cityweb.co.za>
> > >>>
> > >>> > Quoting "Justine Demeyer" <justine.demeyer at gmail.com>:
> > >>> > > I have a question about the size of the Xapian index.
> > >>> > >
> > >>> > > I indexed a set of 200 000 documents with a total size of about
> > >>> > > 1 GB, and the index created is more than 3 GB! What can explain
> > >>> > > this difference?
> > >>> >
> > >>> > You'll find this with all indexing systems, to some degree.  The
> > >>> > size of your index is almost always larger than the raw text. How
> > >>> > much larger depends on how you've structured the index/terms,
> > >>> > whether you're removing stop words, etc., and also on whether
> > >>> > you've compacted the DB.
> > >>> >
> > >>> > If you post more detail about your index then that will help to
> > >>> > pinpoint why your index is so large.
> > >>> >
> > >>> > Cheers
> > >>> > Henry
> > >>> >
> > >>> >
> > >>> > _______________________________________________
> > >>> > Xapian-discuss mailing list
> > >>> > Xapian-discuss at lists.xapian.org
> > >>> > http://lists.xapian.org/mailman/listinfo/xapian-discuss
> > >>> >

