[Xapian-discuss] Size of the index
Justine Demeyer
justine.demeyer at gmail.com
Tue Nov 25 10:32:29 GMT 2008
Here is the indexing code:
#include <dirent.h>
#include <sys/time.h>
#include <fstream>
#include <iostream>
#include <string>
#include <xapian.h>

using namespace std;

void Index(char* ind, char* directory)
{
    try
    {
        timeval tim;
        double t1, t2, dif;
        string index(ind);

        // Start time of the operation
        gettimeofday(&tim, NULL);
        t1 = tim.tv_sec + (tim.tv_usec / 1000000.0);

        // Create or open the index
        Xapian::WritableDatabase db(ind, Xapian::DB_CREATE_OR_OPEN);
        Xapian::TermGenerator indexer;
        Xapian::Stem stemmer("english");
        indexer.set_stemmer(stemmer);

        struct dirent *lecture;
        DIR *rep;
        rep = opendir(directory);
        if (rep == NULL)
        {
            cout << "Cannot open directory " << directory << endl;
            return;
        }
        while ((lecture = readdir(rep)))
        {
            char* name = lecture->d_name;
            std::string name2(name);
            // Skip the "." and ".." directory entries
            if (name2 == "." || name2 == "..")
                continue;
            // Note: directory must end with '/' for this to form a valid path
            string path = directory + name2;
            ifstream fichier(path.c_str(), ios::in);
            if (fichier) // this test fails if the file could not be opened
            {
                string ligne; // holds each line as it is read
                string contenu;
                // this loop stops at end of file or as soon as a read error occurs
                while (std::getline(fichier, ligne))
                {
                    contenu = contenu + ligne + "\n";
                }
                // Indexing
                Xapian::Document doc;
                doc.set_data(contenu);
                indexer.set_document(doc);
                indexer.index_text(contenu);
                db.add_document(doc);
                cout << "add " << path.c_str() << endl;
            }
        }
        // Commit the changes
        cout << "Optimizing" << endl;
        db.flush();
        closedir(rep);

        // End time of the operation
        gettimeofday(&tim, NULL);
        t2 = tim.tv_sec + (tim.tv_usec / 1000000.0);

        // Compute the duration of the operation
        dif = t2 - t1;
        Calculate(dif); // helper defined elsewhere (not shown in this message)
    }
    catch (const Xapian::Error &e)
    {
        cout << e.get_description() << endl;
    }
}
Thanks for your help.
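
A side note on the file handling in the code above: the loop builds each path as directory + name2, which only works if the directory argument already ends with a slash, and it reads each file line by line. A minimal sketch of two helpers that could tidy this up is given here. The names join_path and read_file are hypothetical (they are not part of Xapian or of the original code), and POSIX-style '/' separators are assumed.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: join a directory and a file name with exactly one '/'.
// Assumes POSIX-style path separators.
std::string join_path(const std::string& dir, const std::string& name)
{
    if (dir.empty())
        return name;
    if (dir[dir.size() - 1] == '/')
        return dir + name;
    return dir + "/" + name;
}

// Hypothetical helper: read a whole file into `out`.
// Returns false if the file could not be opened.
bool read_file(const std::string& path, std::string& out)
{
    std::ifstream in(path.c_str(), std::ios::in | std::ios::binary);
    if (!in)
        return false;
    out.assign(std::istreambuf_iterator<char>(in),
               std::istreambuf_iterator<char>());
    return true;
}
```

With these, join_path("docs", "a.txt") and join_path("docs/", "a.txt") both give "docs/a.txt". Unlike the getline loop, read_file keeps the bytes exactly as they are on disk, so no newline is appended after a last line that lacks one.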
2008/11/25 Henry <henka at cityweb.co.za>
> Quoting "Justine Demeyer" <justine.demeyer at gmail.com>:
> > I have a question about the size of the Xapian index.
> >
> > I indexed a set of 200 000 items with a total size of about 1 GB, and the
> > index created is more than 3 GB!! What can explain this
> > difference???
>
> You'll find this with all indexing systems, to some degree. The size
> of your index is almost always larger than the raw text, depending on
> how you've structured the index/terms, whether you're removing
> stopwords, etc. It also depends on whether you've compacted the DB.
>
> If you post more detail about your index then that will help to
> pinpoint why your index is so large.
>
> Cheers
> Henry
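
To illustrate Henry's point that an index is usually larger than the raw text, here is a toy model in plain C++. It is only a sketch under stated assumptions and says nothing about Xapian's actual on-disk format: ToyIndex, build_toy_index, approx_bytes and the 4-bytes-per-position figure are all made up for the illustration. It mimics the two things the posted code does for every file: keep a copy of the text (as doc.set_data(contenu) does) and record a position for every word occurrence (as indexer.index_text(contenu) does).

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Toy model only: this is NOT how Xapian lays out its data on disk.
struct ToyIndex
{
    // term -> list of word positions at which the term occurs
    std::map<std::string, std::vector<std::size_t> > postings;
    // size of the stored copy of the raw text (0 if nothing is stored)
    std::size_t stored_bytes;
    ToyIndex() : stored_bytes(0) {}
};

// Index whitespace-separated words, recording the position of every
// occurrence, and optionally keep a copy of the raw text.
ToyIndex build_toy_index(const std::string& text, bool store_text)
{
    ToyIndex idx;
    std::istringstream words(text);
    std::string w;
    std::size_t pos = 0;
    while (words >> w)
        idx.postings[w].push_back(pos++);
    if (store_text)
        idx.stored_bytes = text.size();
    return idx;
}

// Rough size estimate: stored text + each distinct term once
// + an assumed 4 bytes per recorded position.
std::size_t approx_bytes(const ToyIndex& idx)
{
    const std::size_t kBytesPerPosition = 4; // assumption for illustration
    std::size_t total = idx.stored_bytes;
    for (const auto& entry : idx.postings)
        total += entry.first.size() + entry.second.size() * kBytesPerPosition;
    return total;
}
```

For the 19-byte text "the cat and the dog", this model counts 32 bytes of terms and positions, or 51 bytes when the text is also stored. The point is only that stored text plus per-occurrence data can easily add up to a multiple of the input size.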