[Xapian-discuss] Size of the index
Justine Demeyer
justine.demeyer at gmail.com
Tue Nov 25 15:55:18 GMT 2008
Thanks for your help, but I don't know how to use these stop words. I saw that
I have to add indexer.set_stopper() to my file, but what do I have to put
between the parentheses?
Thanks
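
For reference, set_stopper() in Xapian takes a pointer to a Xapian::Stopper
object. A common choice is a Xapian::SimpleStopper filled with add(), passed
by address (for example indexer.set_stopper(&stopper)) and kept alive for as
long as the indexer uses it. Depending on the Xapian version and settings, the
TermGenerator may still index the unstemmed form of a stop word. The sketch
below does not call Xapian at all: WordStopper, tokenize and
count_indexed_terms are hypothetical names, used only to illustrate the idea
of a stopper as a predicate and its effect on the number of indexed terms.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a stopper (not part of Xapian): a set of words
// plus a call operator that answers "is this term a stop word?".
class WordStopper {
public:
    // Register one stop word.
    void add(const std::string& word) { words_.insert(word); }

    // Return true if `term` should be ignored.
    bool operator()(const std::string& term) const {
        return words_.count(term) != 0;
    }

private:
    std::set<std::string> words_;
};

// Split text on whitespace and lowercase each token (ASCII only).
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> terms;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        for (std::size_t i = 0; i < word.size(); ++i) {
            word[i] = static_cast<char>(
                std::tolower(static_cast<unsigned char>(word[i])));
        }
        terms.push_back(word);
    }
    return terms;
}

// Count the term occurrences that would be indexed. If `stopper` is
// non-null, the terms it flags are skipped.
std::size_t count_indexed_terms(const std::string& text,
                                const WordStopper* stopper) {
    std::size_t kept = 0;
    const std::vector<std::string> terms = tokenize(text);
    for (std::size_t i = 0; i < terms.size(); ++i) {
        if (stopper != NULL && (*stopper)(terms[i])) continue;
        ++kept;
    }
    return kept;
}
```

With 'the', 'and' and 'a' registered as stop words, the sentence "The cat and
the dog saw a bird" gives 8 indexed terms without a stopper and 4 with one.
Real text usually contains a large share of such words, which is why removing
them can shrink an index.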
2008/11/25 Robert Young <rob at roryoung.co.uk>
> Oops, xapian-discuss doesn't seem to set reply-to.
>
> Stop words are words that appear in such a high proportion of the documents
> in your corpus that they can be safely ignored: words like 'the', 'and', 'a',
> etc. Remove these and you can improve the precision of your queries, speed up
> both queries and indexing, and reduce the size of your index, at the
> potential expense of recall.
>
> Cheers
> Rob
>
> On Tue, Nov 25, 2008 at 2:23 PM, Justine Demeyer
> <justine.demeyer at gmail.com> wrote:
>
> >
> > Ok, thanks!!
> >
> > But what is the purpose of the stop words??
> >
> >
> > 2008/11/25 Robert Young <rob at roryoung.co.uk>
> >
> >> As Henry alluded to earlier, you could potentially reduce the size of your
> >> index by removing stop words.
> >>
> >> Cheers
> >> Rob
> >>
> >>
> >> On Tue, Nov 25, 2008 at 10:32 AM, Justine Demeyer <
> >> justine.demeyer at gmail.com> wrote:
> >>
> >>> Here is the indexing code:
> >>>
> >>> #include <dirent.h>
> >>> #include <sys/time.h>
> >>> #include <fstream>
> >>> #include <iostream>
> >>> #include <string>
> >>> #include <xapian.h>
> >>>
> >>> using namespace std;
> >>>
> >>> // Calculate() is defined elsewhere in the program (not shown here).
> >>>
> >>> void Index(char* ind, char* directory)
> >>> {
> >>>     try
> >>>     {
> >>>         timeval tim;
> >>>         double t1, t2, dif;
> >>>
> >>>         // Start time of the operation
> >>>         gettimeofday(&tim, NULL);
> >>>         t1 = tim.tv_sec + (tim.tv_usec / 1000000.0);
> >>>
> >>>         // Create or open the index
> >>>         Xapian::WritableDatabase db(ind, Xapian::DB_CREATE_OR_OPEN);
> >>>         Xapian::TermGenerator indexer;
> >>>         Xapian::Stem stemmer("english");
> >>>         indexer.set_stemmer(stemmer);
> >>>
> >>>         struct dirent *lecture;
> >>>         DIR *rep = opendir(directory);
> >>>         if (rep == NULL)
> >>>         {
> >>>             cout << "cannot open directory " << directory << endl;
> >>>             return;
> >>>         }
> >>>
> >>>         while ((lecture = readdir(rep)))
> >>>         {
> >>>             std::string name2(lecture->d_name);
> >>>
> >>>             // Skip the "." and ".." directory entries
> >>>             if (name2 == "." || name2 == "..")
> >>>                 continue;
> >>>
> >>>             // directory is assumed to end with '/'
> >>>             string path = directory + name2;
> >>>
> >>>             ifstream fichier(path.c_str(), ios::in);
> >>>
> >>>             if (fichier) // this test fails if the file could not be opened
> >>>             {
> >>>                 string ligne;   // holds each line read
> >>>                 string contenu; // holds the whole file
> >>>
> >>>                 // this loop stops as soon as a read error occurs
> >>>                 while (std::getline(fichier, ligne))
> >>>                 {
> >>>                     contenu += ligne;
> >>>                     contenu += '\n';
> >>>                 }
> >>>
> >>>                 // Indexing
> >>>                 Xapian::Document doc;
> >>>                 doc.set_data(contenu);
> >>>
> >>>                 indexer.set_document(doc);
> >>>                 indexer.index_text(contenu);
> >>>
> >>>                 db.add_document(doc);
> >>>                 cout << "add " << path.c_str() << endl;
> >>>             }
> >>>         }
> >>>
> >>>         // Update: commit pending changes
> >>>         cout << "Optimizing" << endl;
> >>>         db.flush();
> >>>         closedir(rep);
> >>>
> >>>         // End time of the operation
> >>>         gettimeofday(&tim, NULL);
> >>>         t2 = tim.tv_sec + (tim.tv_usec / 1000000.0);
> >>>
> >>>         // Compute the duration of the operation
> >>>         dif = t2 - t1;
> >>>         Calculate(dif);
> >>>     }
> >>>     catch (const Xapian::Error &e)
> >>>     {
> >>>         cout << e.get_description() << endl;
> >>>     }
> >>> }
> >>>
> >>> Thanks for helping me
> >>>
> >>>
> >>> 2008/11/25 Henry <henka at cityweb.co.za>
> >>>
> >>> > Quoting "Justine Demeyer" <justine.demeyer at gmail.com>:
> >>> > > I have a question about the size of the Xapian index.
> >>> > >
> >>> > > I indexed a set of 200,000 documents with a total size of about 1 GB,
> >>> > > and the index created is larger than 3 GB! What can explain this
> >>> > > difference?
> >>> >
> >>> > You'll find this with all indexing systems, to some degree. The index
> >>> > is almost always larger than the raw text; how much larger depends on
> >>> > how you've structured the index and its terms, whether you're removing
> >>> > stop words, and whether you've compacted the DB.
> >>> >
> >>> > If you post more detail about your index then that will help to
> >>> > pinpoint why your index is so large.
> >>> >
> >>> > Cheers
> >>> > Henry
> >>> >
> >>> >
> >>> > _______________________________________________
> >>> > Xapian-discuss mailing list
> >>> > Xapian-discuss at lists.xapian.org
> >>> > http://lists.xapian.org/mailman/listinfo/xapian-discuss
> >>> >