[Xapian-discuss] bigrams search speed and index documents

Ying Liu liux0395 at umn.edu
Wed Dec 16 19:55:32 GMT 2009

Hello all,

I wrote some code for finding bigrams in text. You can download the
package at:

In the package, I tried several methods of building the Xapian index
database. I am still looking for the best way to count bigram
frequencies and build the matrix, especially for huge texts and larger
window sizes. You are welcome to download and try the software; please
let me know if you have any questions or comments.


>> Is there another way to index this 3.3G file? It works well on smaller
>> files. I am testing some extreme cases. Thank you very much!
> If you are just doing this as a way to count frequencies, you could simply
> start a new document every N lines read.  The collection frequency of each term
> at the end will be the total number of times it appeared.
> Personally, I'd not use Xapian for that, but just use Perl hashes.  Your data
> is probably too large to process in one go, but you can make multiple runs
> over subsets of the bigrams.  A simple way would be to partition by the first
> byte, and run once for each possible first byte - something like this (totally
> untested) code:
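>     # Assumes FILE is already open on the bigram file, one bigram per line.
>     foreach my $first_byte (0 .. 255) {
>         my %frequency = ();
>         # Rewind and rescan, counting only the bigrams in this partition.
>         seek FILE, 0, 0 or die $!;
>         while (my $line = <FILE>) {
>             chomp($line);    # chomp strips only the newline; chop drops any last character
>             ++$frequency{$line} if ord($line) == $first_byte;
>         }
>         foreach my $bigram (sort keys %frequency) {
>             print "$frequency{$bigram}\t$bigram\n";
>         }
>     }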
> If this is going to get run a lot, you probably want to partition on a
> hashed version of $line to get a more even split so you can make fewer
> passes.
> Cheers,
>     Olly
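
Olly's closing suggestion, partitioning on a hash of the line rather
than on its first byte, might look roughly like the sketch below. It has
not been run against the real 3.3G file. The helper names partition_of
and count_in_passes, the choice of MD5 (from the core Digest::MD5
module) as the hash, and the four-partition sample run are all
assumptions made for illustration; none of them come from the thread.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Hypothetical helper: map a line to one of $num_parts buckets using the
# first 4 bytes of its MD5 digest read as an unsigned 32-bit integer.
sub partition_of {
    my ($line, $num_parts) = @_;
    return unpack("N", md5($line)) % $num_parts;
}

# Hypothetical helper: make $num_parts passes over a seekable filehandle,
# holding only one partition's counts in memory at a time. $callback is
# called as $callback->($bigram, $count) for every distinct line.
sub count_in_passes {
    my ($fh, $num_parts, $callback) = @_;
    foreach my $part (0 .. $num_parts - 1) {
        my %frequency;
        seek $fh, 0, 0 or die "seek failed: $!";
        while (my $line = <$fh>) {
            chomp $line;
            next if $line eq '';
            ++$frequency{$line} if partition_of($line, $num_parts) == $part;
        }
        $callback->($_, $frequency{$_}) foreach sort keys %frequency;
    }
}

# Sample run on a small in-memory file standing in for the bigram file.
my $data = "new york\nsan jose\nnew york\nnew york\nsan jose\nlos angeles\n";
open my $fh, '<', \$data or die "open failed: $!";
count_in_passes($fh, 4, sub { my ($bigram, $n) = @_; print "$n\t$bigram\n"; });
close $fh;
```

Because every occurrence of a given bigram hashes to the same partition,
each bigram is reported exactly once with its full count. The number of
partitions trades memory for passes: fewer partitions mean fewer scans
of the file but a larger hash per scan.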
