[Xapian-discuss] bigrams search speed and index documents
Ying Liu
liux0395 at umn.edu
Wed Dec 16 19:55:32 GMT 2009
Hello all,
I wrote some code for finding the bigrams in text. You can download the
package at:
http://www.tc.umn.edu/~liux0395/
In the package, I tried several methods to build the Xapian index
database. I am still looking for the best way to count the bigram
frequencies and build the co-occurrence matrix, especially for huge
texts and bigger window sizes. You are welcome to download and try the
software; please let me know if you have any questions or comments.
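For reference, the core of the counting is roughly this (an untested,
simplified sketch of my own, assuming whitespace tokenization and a
window size of two; the package handles bigger windows):

use strict;
use warnings;

my %bigram_count;
while (my $line = <STDIN>) {
    chomp $line;
    my @words = split ' ', lc $line;
    # Count each adjacent pair of words (window size 2).
    for my $i (0 .. $#words - 1) {
        ++$bigram_count{"$words[$i] $words[$i+1]"};
    }
}
print "$bigram_count{$_}\t$_\n" for sort keys %bigram_count;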
Thanks,
Ying
>
>
>> Is there another way to index this 3.3G file? It works well on smaller
>> files. I am testing some extreme cases. Thank you very much!
>>
>
> If you are just doing this as a way to count frequencies, you could simply
> start a new document every N lines read. The collection frequency of each
> term at the end will be the total number of times it appeared.
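To make that concrete, here is an untested sketch of my own using the
Search::Xapian Perl bindings; the database path and chunk size are
arbitrary choices, and I am assuming the stock bindings API:

use strict;
use warnings;
use Search::Xapian;

# Every chunk of N input lines becomes one Xapian document; each bigram
# line is added as a term, so at the end a bigram's collection frequency
# is its total count across the whole file.
my $db = Search::Xapian::WritableDatabase->new('bigrams.db',
             Search::Xapian::DB_CREATE_OR_OPEN);
my $chunk_size = 100_000;    # lines per document; tune as needed
my $doc = Search::Xapian::Document->new();
my $n = 0;
while (my $line = <STDIN>) {
    chomp $line;
    $doc->add_term($line);
    if (++$n % $chunk_size == 0) {
        $db->add_document($doc);
        $doc = Search::Xapian::Document->new();
    }
}
$db->add_document($doc) if $n % $chunk_size;
$db->flush();

# Later, the total count of a bigram is just:
#   my $count = $db->get_collection_freq('some bigram');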
>
> Personally, I'd not use Xapian for that, but just use Perl hashes. Your data
> is probably too large to process in one go, but you can make multiple runs
> over subsets of the bigrams. A simple way would be to partition by the first
> byte, and run once for each possible first byte - something like this (totally
> untested) code:
>
> foreach my $first_byte (0 .. 255) {
>     my %frequency = ();
>     # Rewind and rescan the whole file for this partition.
>     seek FILE, 0, 0 or die $!;
>     while (my $line = <FILE>) {
>         chomp($line);
>         # Only count lines whose first byte falls in this partition.
>         ++$frequency{$line} if ord($line) == $first_byte;
>     }
>     foreach my $bigram (sort keys %frequency) {
>         print "$frequency{$bigram}\t$bigram\n";
>     }
> }
>
> If this is going to get run a lot, you probably want to partition on a
> hashed version of $line to get a more even split so you can make fewer
> passes.
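For example, with the core Digest::MD5 module (again untested, and
assuming FILE is already open on the input as in the snippet above; the
pass count is an arbitrary choice of mine):

use Digest::MD5 qw(md5);

my $passes = 16;    # choose so that one pass's hash fits in memory
foreach my $pass (0 .. $passes - 1) {
    my %frequency = ();
    seek FILE, 0, 0 or die $!;
    while (my $line = <FILE>) {
        chomp $line;
        # Hashing the whole line spreads the bigrams evenly across
        # passes, unlike the skewed first-byte split above.
        next if ord(md5($line)) % $passes != $pass;
        ++$frequency{$line};
    }
    print "$frequency{$_}\t$_\n" for sort keys %frequency;
}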
>
> Cheers,
> Olly
>
>