[Xapian-discuss] Size of the index

Justine Demeyer justine.demeyer at gmail.com
Tue Nov 25 18:54:41 GMT 2008


Oops, it was just a small mistake! It's working now!

Thank you for your help

2008/11/25 Justine Demeyer <justine.demeyer at gmail.com>

> More precisely, I get an error saying that I can't pass a SimpleStopper as
> an argument to set_stopper....
>
> 2008/11/25 Justine Demeyer <justine.demeyer at gmail.com>
>
> Yes, I tried it but it doesn't work.
>>
>> I tried an example with this:
>>
>> Xapian::WritableDatabase db(ind, Xapian::DB_CREATE_OR_OPEN);
>> Xapian::TermGenerator indexer;
>> Xapian::Stem stemmer("english");
>> Xapian::SimpleStopper stop;
>> stop.add("the");
>> indexer.set_stemmer(stemmer);
>> indexer.set_stopper(stop);
>>
>>
>> 2008/11/25 Robert Young <rob at roryoung.co.uk>
>>
>>> You could use a SimpleStopper
>>>
>>> http://xapian.org/docs/apidoc/html/classXapian_1_1SimpleStopper.html
>>>
>>> On Tue, Nov 25, 2008 at 3:55 PM, Justine Demeyer
>>> <justine.demeyer at gmail.com>wrote:
>>>
>>> > Thanks for your help, but I don't know how to use these stop words. I
>>> > saw that I have to add indexer.set_stopper() to my file, but what do I
>>> > have to put between the ()?
>>> >
>>> > Thanks
>>> >
>>> > 2008/11/25 Robert Young <rob at roryoung.co.uk>
>>> >
>>> > > Oops, xapian-discuss doesn't seem to set reply-to.
>>> > >
>>> > > Stop words are words that appear in such a high proportion of the
>>> > > documents in your corpus that they can be safely ignored: words like
>>> > > 'the', 'and', 'a', etc. Remove these and you can improve the precision
>>> > > of your queries and the performance of both queries and indexing, and
>>> > > reduce the size of your index, at the potential expense of recall.
>>> > >
>>> > > Cheers
>>> > > Rob
>>> > >
>>> > > On Tue, Nov 25, 2008 at 2:23 PM, Justine Demeyer
>>> > > <justine.demeyer at gmail.com>wrote:
>>> > >
>>> > > >
>>> > > > Ok, thanks!!
>>> > > >
>>> > > > But what is the purpose of the stop words??
>>> > > >
>>> > > >
>>> > > > 2008/11/25 Robert Young <rob at roryoung.co.uk>
>>> > > >
>>> > > >> As Henry alluded to earlier, you could potentially reduce the size
>>> > > >> of your index by removing stop words.
>>> > > >>
>>> > > >> Cheers
>>> > > >> Rob
>>> > > >>
>>> > > >>
>>> > > >> On Tue, Nov 25, 2008 at 10:32 AM, Justine Demeyer <
>>> > > >> justine.demeyer at gmail.com> wrote:
>>> > > >>
>>> > > >>> Here is the code of the index :
>>> > > >>>
>>> > > >>> void Index(char* ind, char* directory)
>>> > > >>> {
>>> > > >>>     try
>>> > > >>>     {
>>> > > >>>         timeval tim;
>>> > > >>>         double t1, t2, dif;
>>> > > >>>
>>> > > >>>         string index(ind);
>>> > > >>>
>>> > > >>>         // Start time of the operation
>>> > > >>>         gettimeofday(&tim, NULL);
>>> > > >>>         t1 = tim.tv_sec + (tim.tv_usec / 1000000.0);
>>> > > >>>
>>> > > >>>         // Create or open the index
>>> > > >>>         Xapian::WritableDatabase db(ind, Xapian::DB_CREATE_OR_OPEN);
>>> > > >>>         Xapian::TermGenerator indexer;
>>> > > >>>         Xapian::Stem stemmer("english");
>>> > > >>>         indexer.set_stemmer(stemmer);
>>> > > >>>
>>> > > >>>         struct dirent *lecture;
>>> > > >>>         DIR *rep;
>>> > > >>>
>>> > > >>>         rep = opendir(directory);
>>> > > >>>         if (rep == NULL) // readdir() must not be called on NULL
>>> > > >>>         {
>>> > > >>>             cout << "cannot open " << directory << endl;
>>> > > >>>             return;
>>> > > >>>         }
>>> > > >>>
>>> > > >>>         while ((lecture = readdir(rep)))
>>> > > >>>         {
>>> > > >>>             char* name = lecture->d_name;
>>> > > >>>             std::string name2(name);
>>> > > >>>
>>> > > >>>             // assumes directory ends with '/'
>>> > > >>>             string path = directory + name2;
>>> > > >>>
>>> > > >>>             ifstream fichier(path.c_str(), ios::in);
>>> > > >>>
>>> > > >>>             if (fichier) // fails if the file could not be opened
>>> > > >>>             {
>>> > > >>>                 string ligne;   // holds each line read
>>> > > >>>                 string contenu; // accumulates the whole file
>>> > > >>>
>>> > > >>>                 // this loop stops as soon as a read error occurs
>>> > > >>>                 // (or the end of the file is reached)
>>> > > >>>                 while (std::getline(fichier, ligne))
>>> > > >>>                 {
>>> > > >>>                     contenu = contenu + ligne + "\n";
>>> > > >>>                 }
>>> > > >>>
>>> > > >>>                 // Indexing
>>> > > >>>                 Xapian::Document doc;
>>> > > >>>                 doc.set_data(contenu);
>>> > > >>>
>>> > > >>>                 indexer.set_document(doc);
>>> > > >>>                 indexer.index_text(contenu);
>>> > > >>>
>>> > > >>>                 db.add_document(doc);
>>> > > >>>                 cout << "add " << path.c_str() << endl;
>>> > > >>>             }
>>> > > >>>         }
>>> > > >>>
>>> > > >>>         // Commit the changes
>>> > > >>>         cout << "Optimizing" << endl;
>>> > > >>>         db.flush();
>>> > > >>>         closedir(rep);
>>> > > >>>
>>> > > >>>         // End time of the operation
>>> > > >>>         gettimeofday(&tim, NULL);
>>> > > >>>         t2 = tim.tv_sec + (tim.tv_usec / 1000000.0);
>>> > > >>>
>>> > > >>>         // Compute the duration of the operation
>>> > > >>>         dif = t2 - t1;
>>> > > >>>         Calculate(dif);
>>> > > >>>     }
>>> > > >>>     catch (const Xapian::Error &e)
>>> > > >>>     {
>>> > > >>>         cout << e.get_description() << endl;
>>> > > >>>     }
>>> > > >>> }
>>> > > >>>
>>> > > >>> Thanks for helping me
>>> > > >>>
>>> > > >>>
>>> > > >>> 2008/11/25 Henry <henka at cityweb.co.za>
>>> > > >>>
>>> > > >>> > Quoting "Justine Demeyer" <justine.demeyer at gmail.com>:
>>> > > >>> > > I have a question about the size of the Xapian index.
>>> > > >>> > >
>>> > > >>> > > I indexed a set of 200,000 documents with a total size of about
>>> > > >>> > > 1 GB, and the index created is more than 3 GB! What can explain
>>> > > >>> > > this difference?
>>> > > >>> >
>>> > > >>> > You'll find this with all indexing systems, to some degree. The
>>> > > >>> > size of your index is almost always larger than the raw text,
>>> > > >>> > depending on how you've structured the index/terms, whether you're
>>> > > >>> > removing stop words, etc. It also depends on whether you've
>>> > > >>> > compacted the DB.
>>> > > >>> >
>>> > > >>> > If you post more detail about your index then that will help to
>>> > > >>> > pinpoint why your index is so large.
>>> > > >>> >
>>> > > >>> > Cheers
>>> > > >>> > Henry
>>> > > >>> >
>>> > > >>> >
>>> > > >>> > _______________________________________________
>>> > > >>> > Xapian-discuss mailing list
>>> > > >>> > Xapian-discuss at lists.xapian.org
>>> > > >>> > http://lists.xapian.org/mailman/listinfo/xapian-discuss
>>> > > >>> >
>>> > > >>>
>>> > > >>
>>> > > >>
>>> > > >
>>> > >
>>> >
>>>
>>
>>
>
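
Rob's explanation of stop words in the thread above can be illustrated without Xapian. The sketch below is a toy, not how Xapian works internally: it splits text on whitespace, lowercases each token, and counts how many tokens would still be indexed once a stop list is applied. The names tokenize and count_indexed_terms are hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Split text on whitespace and lowercase each token (deliberately naive).
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        for (char& c : word) {
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        }
        tokens.push_back(word);
    }
    return tokens;
}

// Count the tokens that remain once every stop word is dropped.
std::size_t count_indexed_terms(const std::string& text,
                                const std::set<std::string>& stop_words) {
    std::size_t kept = 0;
    for (const std::string& token : tokenize(text)) {
        if (stop_words.count(token) == 0) {
            ++kept;
        }
    }
    return kept;
}
```

For the text "The cat and the dog" with the stop list {the, and, a}, two of the five tokens remain. Very frequent words make up a large share of the tokens in ordinary text, which is why dropping them can reduce index size, at the cost of no longer being able to search for them.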

