[Snowball-discuss] Multi language full text: Which stemming language should be used?

Richard R. Liu richard.liu at pueo-owl.ch
Tue May 1 13:06:54 BST 2012


@ Manoj M,

By "multi language" which of these do you mean:
A.  Each document in the corpus is in a single language, but not all documents are in the same language.
B.  Passages (e.g., paragraphs) in a single document may be in different languages.

At any rate, the whole point of stemming is conflation, i.e., treating different forms of the same word as a single term. How those forms are built is dictated by the language, so stemming is language-dependent. Assuming you have some way of determining the language of a word and you stem it with the correct stemmer, you will have to store not only the stem but also its language: just as two languages may have words that are spelled the same, two stems in two different languages may also be spelled the same.
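
To illustrate storing the language together with the stem, here is a minimal Python sketch. The toy stemmers, the `index` structure, and the document ids are all assumptions made up for this example; a real system would call the appropriate Snowball stemmer instead.

```python
from collections import defaultdict

# Hypothetical toy stemmers: stand-ins for real Snowball stemmers.
# Each one strips the first matching suffix from a short list.
def _strip(word, suffixes):
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

TOY_STEMMERS = {
    "en": lambda w: _strip(w.lower(), ("ing", "ed", "s")),
    "de": lambda w: _strip(w.lower(), ("en", "er", "e")),
}

# The index key is the pair (language, stem), not the stem alone.
index = defaultdict(set)

def add_document(doc_id, language, text):
    stem = TOY_STEMMERS[language]
    for word in text.split():
        index[(language, stem(word))].add(doc_id)

# English "gifts" (presents) and German "Gifte" (poisons) yield the
# same stem string, yet they remain separate index entries.
add_document("doc-en", "en", "gifts")
add_document("doc-de", "de", "Gifte")
```

Keyed on the stem alone, a search for the English word would also return the German document.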

With regard to determining the language, if you are dealing with A (above), you might have some metadata with this information; however, you will have to judge how dependable it is.  If you are dealing with B, there are some heuristics that operate on the sentence level.  You could apply the stop word lists of all the possible languages to the sentence, then select the language whose list matches the most (unique) stop words.  Or you could apply the stemmers of all the possible languages and select the language whose stemmer changes the most words.
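
The stop word heuristic could be sketched in Python roughly as follows. The three stop word lists are tiny illustrative assumptions (real lists are much longer), and `guess_language` is a hypothetical helper name.

```python
# Tiny illustrative stop word lists; real lists are much longer.
STOP_WORDS = {
    "en": {"the", "a", "of", "and", "is", "in", "to"},
    "de": {"der", "die", "das", "und", "ist", "in", "zu"},
    "fr": {"le", "la", "les", "et", "est", "de", "un"},
}

def guess_language(sentence):
    """Return the language whose stop word list matches the most
    unique words of the sentence, or None if no list matches."""
    # A set keeps each word once, so only unique stop words count.
    words = {w.strip(".,;:!?\"'()").lower() for w in sentence.split()}
    scores = {lang: len(words & stops) for lang, stops in STOP_WORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(guess_language("The cat is in the garden."))
print(guess_language("Die Katze ist in dem Garten und schläft."))
```

Note that words such as "in" appear in more than one list, and that ties are broken arbitrarily here; short sentences will often be ambiguous.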

On the query side, you also have the problem of determining the language -- presumably just one! -- of the query terms.  In this case, if you resort to one of the heuristics described above, it would probably have to be the one based on stemmers, since people rarely use stop words in queries.
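
A rough Python sketch of the stemmer-based heuristic for queries follows. It interprets "the most stemming" as "the stemmer changes the most terms", which is only one possible reading. The toy suffix-stripping stemmers are assumptions standing in for real Snowball stemmers, and `guess_query_language` is a hypothetical name.

```python
# Hypothetical toy stemmers: stand-ins for real Snowball stemmers.
def _strip(word, suffixes):
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

TOY_STEMMERS = {
    "en": lambda w: _strip(w, ("ing", "ed", "s")),
    "de": lambda w: _strip(w, ("ungen", "ung", "en", "er")),
}

def guess_query_language(terms):
    """Score each language by how many query terms its stemmer changes,
    and return the best-scoring language together with all scores."""
    lowered = [t.lower() for t in terms]
    scores = {
        lang: sum(1 for t in lowered if stem(t) != t)
        for lang, stem in TOY_STEMMERS.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

print(guess_query_language(["running", "shoes"]))
print(guess_query_language(["Wohnungen", "mieten"]))
```

With queries of only one or two terms the scores will often tie, so in practice this can only be a weak hint.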

Regards,
Richard
 
On May 1, 2012, at 13:00 , snowball-discuss-request at lists.tartarus.org wrote:

> Message: 1
> Date: Mon, 30 Apr 2012 17:47:38 +0530
> From: Manoj M <manojmarathayil at gmail.com>
> Subject: [Snowball-discuss] Multi language full text: Which stemming
> 	language should be used?
> To: snowball-discuss at lists.tartarus.org
> Message-ID:
> 	<CAHbYkLSELfdZ_VRQoZTRhGVWxUZ_KNF=-Xd3hb5z=0Cx+7c8ow at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
> 
> Which stemming language should I use if I want to support full text
> search in all languages? As far as I know, the index needs to be created
> using a specific stemming language to support search in that
> language, but this is not possible for me, as my search program may
> contain different languages.
> 
> Thanks in advance.
> 
> --
> Regards,
> Manoj Marathayil
> 
> 
> 
> ------------------------------
> 
> Message: 2
> Date: Mon, 30 Apr 2012 09:19:50 -0500
> From: Craig Rairdin <craigr at laridian.com>
> Subject: Re: [Snowball-discuss] Multi language full text: Which
> 	stemming language should be used?
> To: <Snowball-discuss at lists.tartarus.org>
> Message-ID: <CBC408E1.41575%craigr at laridian.com>
> Content-Type: text/plain;	charset="US-ASCII"
> 
> I would think you would index your original documents according to the
> language they are written in. You would end up with multiple indexes, one
> for each language.
> 
> Then, assuming the user does not tell you what language their search terms
> are in, stem the search terms in each of your supported languages and do
> lookups for each term in each language.
> 
> You cannot just do the stemming operation once in, say, English because
> stemming for each language is different.
> 
> Craig
> 
> _______________________________________________
> Snowball-discuss mailing list
> Snowball-discuss at lists.tartarus.org
> http://lists.tartarus.org/mailman/listinfo/snowball-discuss
> End of Snowball-discuss Digest, Vol 86, Issue 1
> ***********************************************
> 
