[Xapian-discuss] Size of the index

Robert Young rob at roryoung.co.uk
Tue Nov 25 14:37:18 GMT 2008


Oops, xapian-discuss doesn't seem to set reply-to.

Stop words are words that appear in such a high proportion of the documents
in your corpus that they can be safely ignored: words like 'the', 'and', 'a',
etc. Remove these and you can improve the precision of your queries and the
performance of both queries and indexing, and reduce the size of your index,
at the potential expense of recall.
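
To make the idea concrete, here is a rough sketch in plain C++ (not Xapian
code) of filtering a stop list out of a token stream before indexing. The
helper name remove_stop_words and the whitespace-only tokenisation are
assumptions made for illustration; a real indexer would also lowercase,
strip punctuation and stem. In Xapian itself you would normally hand a
stopper such as Xapian::SimpleStopper to TermGenerator::set_stopper()
rather than filter by hand, and it is worth checking the documentation for
exactly which terms that skips.

```cpp
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of Xapian): split the text on whitespace
// and keep only the tokens that are not in the stop list.
std::vector<std::string> remove_stop_words(const std::string& text,
                                           const std::set<std::string>& stop_words)
{
    std::vector<std::string> kept;
    std::istringstream in(text);
    std::string token;
    while (in >> token) {
        // Tokens found in the stop list are dropped and never reach the index.
        if (stop_words.count(token) == 0) {
            kept.push_back(token);
        }
    }
    return kept;
}
```

With the stop list {"the", "and", "a", "of"}, the text "the size of the
index and the size of a corpus" goes in as 11 tokens and comes out as 4
(size, index, size, corpus). Fewer terms means fewer postings to store,
which is where the size saving comes from.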

Cheers
Rob

On Tue, Nov 25, 2008 at 2:23 PM, Justine Demeyer
<justine.demeyer at gmail.com> wrote:

>
> Ok, thanks!!
>
> But what is the purpose of the stop words??
>
>
> 2008/11/25 Robert Young <rob at roryoung.co.uk>
>
>> As Henry alluded to earlier, you could potentially reduce the size of your
>> index by removing stop words.
>>
>> Cheers
>> Rob
>>
>>
>> On Tue, Nov 25, 2008 at 10:32 AM, Justine Demeyer <
>> justine.demeyer at gmail.com> wrote:
>>
>>> Here is the indexing code:
>>>
>>> void Index(char* ind, char* directory)
>>> {
>>>     try
>>>     {
>>>         timeval tim;
>>>         double t1, t2, dif;
>>>
>>>         string index(ind);
>>>
>>>         // Start time of the operation
>>>         gettimeofday(&tim, NULL);
>>>         t1 = tim.tv_sec + (tim.tv_usec / 1000000.0);
>>>
>>>         // Create or open the index
>>>         Xapian::WritableDatabase db(index, Xapian::DB_CREATE_OR_OPEN);
>>>         Xapian::TermGenerator indexer;
>>>         Xapian::Stem stemmer("english");
>>>         indexer.set_stemmer(stemmer);
>>>
>>>         struct dirent *lecture;
>>>         DIR *rep;
>>>
>>>         rep = opendir(directory);
>>>         if (!rep) // nothing to index if the directory cannot be opened
>>>         {
>>>             cout << "cannot open " << directory << endl;
>>>             return;
>>>         }
>>>         while ((lecture = readdir(rep)))
>>>         {
>>>             char* name = lecture->d_name;
>>>             std::string name2(name);
>>>
>>>             // skip the "." and ".." directory entries
>>>             if (name2 == "." || name2 == "..") continue;
>>>
>>>             // directory must end with '/' for this concatenation to work
>>>             string path = directory + name2;
>>>
>>>             ifstream fichier(path.c_str(), ios::in);
>>>
>>>             if (fichier) // this test fails if the file is not open
>>>             {
>>>                 string ligne;   // holds each line as it is read
>>>                 string contenu; // accumulates the whole file
>>>
>>>                 // this loop stops as soon as a read error occurs
>>>                 while (std::getline(fichier, ligne))
>>>                 {
>>>                     contenu = contenu + ligne + "\n";
>>>                 }
>>>
>>>                 // Indexing: the full text is stored as the document
>>>                 // data and also indexed as terms
>>>                 Xapian::Document doc;
>>>                 doc.set_data(contenu);
>>>
>>>                 indexer.set_document(doc);
>>>                 indexer.index_text(contenu);
>>>
>>>                 db.add_document(doc);
>>>                 cout << "add " << path.c_str() << endl;
>>>             }
>>>         }
>>>
>>>         // Commit pending changes to disk
>>>         cout << "Optimizing" << endl;
>>>         db.flush();
>>>         closedir(rep);
>>>
>>>         // End time of the operation
>>>         gettimeofday(&tim, NULL);
>>>         t2 = tim.tv_sec + (tim.tv_usec / 1000000.0);
>>>
>>>         // Compute the duration of the operation
>>>         dif = t2 - t1;
>>>         Calculate(dif);
>>>     }
>>>     catch (const Xapian::Error &e)
>>>     {
>>>         cout << e.get_description() << endl;
>>>     }
>>> }
>>>
>>> Thanks for helping me
>>>
>>>
>>> 2008/11/25 Henry <henka at cityweb.co.za>
>>>
>>> > Quoting "Justine Demeyer" <justine.demeyer at gmail.com>:
>>> > > I have a question about the size of the Xapian index.
>>> > >
>>> > > I indexed a set of 200,000 documents with a total size of about 1 GB,
>>> > > and the index created is more than 3 GB! What can explain this
>>> > > difference?
>>> >
>>> > You'll find this with all indexing systems, to some degree.  The size
>>> > of your index is almost always larger than the raw text, depending on
>>> > how you've structured the index/terms, whether you're removing stop
>>> > words, etc, and also depends on whether you've compacted the DB.
>>> >
>>> > If you post more detail about your index then that will help to
>>> > pinpoint why your index is so large.
>>> >
>>> > Cheers
>>> > Henry
>>> >
>>> >
>>> > _______________________________________________
>>> > Xapian-discuss mailing list
>>> > Xapian-discuss at lists.xapian.org
>>> > http://lists.xapian.org/mailman/listinfo/xapian-discuss
>>> >
>>>
>>
>>
>

