[Xapian-discuss] bigrams search speed and index documents
Ying Liu
liux0395 at umn.edu
Wed Dec 16 19:55:32 GMT 2009
Hello all,
I wrote some code for finding the bigrams in text. You can download the
package at:
http://www.tc.umn.edu/~liux0395/
In the package, I tried several methods to build the Xapian index
database. I am still looking for the best way to count the bigram
frequencies and build the co-occurrence matrix, especially for huge
texts and bigger window sizes. You are welcome to download and try the
software; please let me know if you have any questions or comments.
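For reference, the core of the counting is roughly this (an untested,
simplified sketch of my own, assuming whitespace tokenization and a
window size of two; the package handles bigger windows):

use strict;
use warnings;

my %bigram_count;
while (my $line = <STDIN>) {
    chomp $line;
    my @words = split ' ', lc $line;
    # Count each adjacent pair of words (window size 2).
    for my $i (0 .. $#words - 1) {
        ++$bigram_count{"$words[$i] $words[$i+1]"};
    }
}
print "$bigram_count{$_}\t$_\n" for sort keys %bigram_count;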
Thanks,
Ying
>
>
>> Is there another way to index this 3.3G file? It works well on smaller
>> files. I am testing some extreme cases. Thank you very much!
>>
>
> If you are just doing this as a way to count frequencies, you could simply
> start a new document every N lines read. The collection frequency of each
> term at the end will be the total number of times it appeared.
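To make that concrete, here is an untested sketch of my own using the
Search::Xapian Perl bindings; the database path and chunk size are
arbitrary choices, and I am assuming the stock bindings API:

use strict;
use warnings;
use Search::Xapian;

# Every chunk of N input lines becomes one Xapian document; each bigram
# line is added as a term, so at the end a bigram's collection frequency
# is its total count across the whole file.
my $db = Search::Xapian::WritableDatabase->new('bigrams.db',
             Search::Xapian::DB_CREATE_OR_OPEN);
my $chunk_size = 100_000;    # lines per document; tune as needed
my $doc = Search::Xapian::Document->new();
my $n = 0;
while (my $line = <STDIN>) {
    chomp $line;
    $doc->add_term($line);
    if (++$n % $chunk_size == 0) {
        $db->add_document($doc);
        $doc = Search::Xapian::Document->new();
    }
}
$db->add_document($doc) if $n % $chunk_size;
$db->flush();

# Later, the total count of a bigram is just:
#   my $count = $db->get_collection_freq('some bigram');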
>
> Personally, I'd not use Xapian for that, but just use Perl hashes. Your data
> is probably too large to process in one go, but you can make multiple runs
> over subsets of the bigrams. A simple way would be to partition by the first
> byte, and run once for each possible first byte - something like this (totally
> untested) code:
>
> foreach my $first_byte (0 .. 255) {
>     my %frequency = ();
>     # Rewind and rescan the whole file for this partition.
>     seek FILE, 0, 0 or die $!;
>     while (my $line = <FILE>) {
>         chomp($line);
>         # Only count lines whose first byte falls in this partition.
>         ++$frequency{$line} if ord($line) == $first_byte;
>     }
>     foreach my $bigram (sort keys %frequency) {
>         print "$frequency{$bigram}\t$bigram\n";
>     }
> }
>
> If this is going to get run a lot, you probably want to partition on a
> hashed version of $line to get a more even split so you can make fewer
> passes.
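For example, with the core Digest::MD5 module (again untested, and
assuming FILE is already open on the input as in the snippet above; the
pass count is an arbitrary choice of mine):

use Digest::MD5 qw(md5);

my $passes = 16;    # choose so that one pass's hash fits in memory
foreach my $pass (0 .. $passes - 1) {
    my %frequency = ();
    seek FILE, 0, 0 or die $!;
    while (my $line = <FILE>) {
        chomp $line;
        # Hashing the whole line spreads the bigrams evenly across
        # passes, unlike the skewed first-byte split above.
        next if ord(md5($line)) % $passes != $pass;
        ++$frequency{$line};
    }
    print "$frequency{$_}\t$_\n" for sort keys %frequency;
}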
>
> Cheers,
> Olly
>
>