<html><head></head><body><div style="font-family: Verdana;font-size: 12.0px;"><div> </div>


<div>Thanks for the explanation, Olly. That actually makes a lot of sense.<br/>

<br/>

Unfortunately, in my specific use case I cannot split my tokens on hyphens (at least not without significant overhead). However, I just tried Elasticsearch's "minimal_german" stemmer, which seems to solve this too: as the name suggests, it only does minimal stemming, so for example "PA-Schiene" is stemmed to "PA-Schien", which is absolutely sufficient for me.<br/>

<br/>

But thanks anyway for helping me understand why the German stemmer behaves the way it does.<br/>
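<br/>
For the record, here is a minimal Python sketch of how I understand the rule, based on the published description of the Snowball German algorithm: endings are only removed if they lie inside a region called R1, which starts after the first non-vowel that follows a vowel and is pushed back so that at least 3 characters precede it. The names `r1_start` and `VOWELS` are my own, and the sketch skips the stemmer's preprocessing (such as replacing "ß" and marking "u"/"y" between vowels), so it is an illustration and not the real implementation.<br/>
<pre>
```python
# Sketch of how the Snowball German stemmer locates region R1.
# Endings are only removed when they lie inside R1, which is why a
# hyphenated prefix can change the result.
# Assumption: preprocessing steps of the real stemmer are omitted.

VOWELS = set("aeiouyäöü")


def r1_start(word):
    """Return the index where R1 begins (len(word) if R1 is empty)."""
    for i in range(1, len(word)):
        # R1 starts after the first non-vowel that follows a vowel.
        if word[i - 1] in VOWELS and word[i] not in VOWELS:
            # German adjustment: at least 3 characters must precede R1.
            return min(max(i + 1, 3), len(word))
    return len(word)


for word in ("schiene", "pa-schiene"):
    start = r1_start(word)
    print(word, "R1 =", repr(word[start:]))
# schiene R1 = 'e'
# pa-schiene R1 = 'schiene'
```
</pre>
With "schiene", only the final "e" lies in R1, so after it is removed the "en" of "schien" is out of reach. With "pa-schiene", the vowel in "pa" followed by the hyphen makes R1 start right after the hyphen, so both the "e" and then the "en" can be removed, which gives "pa-schi".<br/>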

<br/>

Best Regards<br/>

Simon

<div> 

<div name="quote" style="margin:10px 5px 5px 10px; padding: 10px 0 10px 10px; border-left:2px solid #C3D9E5; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">

<div style="margin:0 0 10px 0;"><b>Sent:</b> Wednesday, 25 September 2024 at 00:55<br/>

<b>From:</b> "Olly Betts" &lt;olly@survex.com&gt;<br/>

<b>To:</b> "Simon" &lt;harrypotter137@gmx.de&gt;<br/>

<b>Cc:</b> snowball-discuss@lists.tartarus.org<br/>

<b>Subject:</b> Re: [Snowball-discuss] Stemming with german2 on hyphenated compound words</div>


<div name="quoted-content">On Tue, Sep 24, 2024 at 10:23:23AM +0200, Simon wrote:<br/>

> As an example I have 2 words: "Export-Schnittstelle" and "Schnittstelle",<br/>

> for these words the stemmer creates "Export-Schnittstell" or<br/>

> "Schnittstell" respectively, which is great because with the right<br/>

> tokenization I can now search for "Schnittstelle" (which the stemmer within<br/>

> my search analyzer will transform to "Schnittstell") and it will match the<br/>

> second part from the word "Export-Schnittstelle" aka<br/>

> "Export-Schnittstell".<br/>

><br/>

> Now I would expect that this is how it works for all hyphenated compound<br/>

> words. But unfortunately that's not the case. So I now have 2 other words<br/>

> "PA-Schiene" and "Schiene". Here the stemmer creates two completely<br/>

> different words: "PA-Schi" and "Schien".<br/>

><br/>

> Can someone explain to me why this is and if there is a way to fix this?<br/>

<br/>

It's essentially because the algorithm won't remove an ending if the<br/>

stem that would leave is too short. In this case removing `-ene` would<br/>

leave `schi` (too short) vs `pa-schi` (OK). It removes `-e` instead<br/>

for the former case since `schien` is long enough.<br/>

<br/>

I think you probably want to split words at hyphens when tokenising,<br/>

which would avoid such problems.<br/>

<br/>

Cheers,<br/>

Olly</div>

</div>

</div>

</div></div></body></html>