[Snowball-discuss] English stemming: -ise vs. -ize (revisited)

Sat Mar 20 16:21:38 GMT 2010

Giles,

I think the general points you make are very good. From the point of view of
computing power, there is no reason why the stemming algorithms should not
be greatly increased in complexity. But I think that to go beyond what these
stemmers do (turn some grammatical rules into an algorithm), a new approach
is better. To associate two words x and y in a search, 'color' and 'colour'
for example, one might look at how the two words are used in a large query
set taken from the logs of a big search engine. I imagine Google and MSN use
a host of techniques, of which query co-occurrence may be one, to draw
variant forms of words together.
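
As a rough illustration of the co-occurrence idea, here is a minimal Python
sketch. It assumes one particular reading of "patterns of use": two words are
likely variants if they keep company with the same other words across queries.
The function names and the toy query list are invented for the example; this
is not a description of what any search engine actually does.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(queries):
    """Map each word to a Counter of the other words it shares a query with."""
    vectors = defaultdict(Counter)
    for query in queries:
        words = query.lower().split()
        for i, word in enumerate(words):
            for j, other in enumerate(words):
                if i != j:
                    vectors[word][other] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two Counters treated as sparse vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


# Invented toy data, standing in for a real query log.
QUERIES = [
    "color printer ink",
    "colour printer ink",
    "color laser printer",
    "colour laser printer",
    "cheap flights paris",
]

vectors = context_vectors(QUERIES)
variant_score = cosine(vectors["color"], vectors["colour"])
unrelated_score = cosine(vectors["color"], vectors["flights"])
```

With this toy data 'color' and 'colour' share identical contexts and score
close to 1, while 'color' and 'flights' share none and score 0. A real query
log would need a threshold and some filtering of very common words.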

Google do have an 'open source' section incidentally, run by Chris DiBona:

http://en.wikipedia.org/wiki/Chris_DiBona

The reason 'apology' in its various forms goes so wrong is the 'logy'
ending: it should be made an exception, and I may add that before long (I
have a new short list of additions).
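
To show what an exception could look like in practice, here is a minimal
Python sketch of an exception table consulted before the stemmer proper. It is
an assumption about the general shape only, not the actual change to the
English stemmer: `toy_stem` is a stand-in that knows just the three mappings
quoted below, and the choice of 'apolog' as the common target is hypothetical.

```python
# Stand-in for a real stemmer: it reproduces only the three mappings
# quoted in this thread and leaves every other word unchanged.
QUOTED_STEMS = {
    "apologetic": "apologet",
    "apologetically": "apologet",
    "apologies": "apolog",
}


def toy_stem(word):
    return QUOTED_STEMS.get(word, word)


# Hypothetical exception table. Mapping these forms to 'apolog' is an
# assumption made for illustration, not a decision taken in the thread.
EXCEPTIONS = {
    "apologetic": "apolog",
    "apologetically": "apolog",
}


def stem_with_exceptions(word, stem=toy_stem, exceptions=EXCEPTIONS):
    """Return the exception entry for word if present, else stem(word)."""
    if word in exceptions:
        return exceptions[word]
    return stem(word)


stems = {w: stem_with_exceptions(w) for w in QUOTED_STEMS}
```

With the table in place all three quoted forms end up under one stem, and
words outside the table still pass through the ordinary stemmer untouched.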

Martin

At 09:39 AM 3/19/2010 -0000, Giles Kennedy wrote:

>I was looking at this excerpt from the sample vocab and its stemmed equivalent:
>apologetic      -> apologet
>apologetically  -> apologet
>apologies       -> apolog
>...
>
>
>...And I imagine that the likes of Google are doing all this, and more.
>Except that of course what they are doing is closed source ...
>
>