Three Principles for Multilingual Indexing in Elasticsearch

Recently I’ve been working on how to build Elasticsearch indices for WordPress blogs in a way that works across multiple languages. Elasticsearch has a lot of built-in support for different languages, but there are a number of configuration options to wade through, and a few plugins that improve on the built-in support.

Below I’ll lay out the analyzers I am currently using. Some caveats before I start: I’ve done a lot of reading on multi-lingual search, but since I’m really only fluent in one language, there are many details about how fluent speakers of other languages use a search engine that I’m sure I don’t understand. This is almost certainly still a work in progress.

In total we have 30 analyzers configured, and we use the elasticsearch-langdetect plugin to detect 53 languages. WordPress users sometimes set their blog’s language to match their content, but very often they leave it at the default of English. So we rely heavily on the language detection plugin to determine which language analyzer to use.
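Concretely, the selection step is just a table lookup with a fallback to a language-agnostic default analyzer. A minimal Python sketch of the idea (the `ANALYZERS` table and `pick_analyzer` name are illustrative, not our actual code; our real table covers all 30 analyzers):

```python
# Map detected language codes to configured analyzer names.
# Hypothetical, abbreviated table for illustration only.
ANALYZERS = {
    "en": "en_analyzer",
    "fr": "fr_analyzer",
    "ja": "ja_analyzer",
    "zh-cn": "zh_analyzer",
    "zh-tw": "zh_analyzer",
}

def pick_analyzer(detected_lang, default="default"):
    """Fall back to the language-agnostic ICU analyzer when no
    language-specific analyzer exists for the detected language."""
    if detected_lang in ANALYZERS:
        return ANALYZERS[detected_lang]
    # e.g. a regional variant like "en-gb" falls back to "en"
    base = detected_lang.split("-")[0]
    return ANALYZERS.get(base, default)
```

Keeping the fallback in one place makes it easy to audit which detected languages actually get language-specific analysis.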

Update: In the comments, Michael pointed out that since this post was written the langdetect plugin has added a custom mapping that the mapping example below does not use. I’d highly recommend checking it out for any new implementations.

For configuring the analyzers there are three main principles I’ve pulled from a number of different sources.

1) Use very light or minimal stemming to avoid losing semantic information.

Stemming removes the endings of words to make searches more general; however, it can lose a lot of meaning in the process. For instance, the (quite popular) Snowball stemmer will do the following:

computation -> comput
computers -> comput
computing -> comput
computer -> comput
computes -> comput

international -> intern
internationals -> intern
intern -> intern
interns -> intern

A lot of information is lost in such a zealous transformation. There are some cases, though, where stemming is very helpful. In English, stemming off the plurals of words should rarely be a problem, since the plural still refers to the same concept. This article on SearchWorkings discusses the pitfalls of the Snowball stemmer further, and leads to Jacques Savoy’s excellent paper on stemming and stop words as applied to French, Italian, German, and Spanish. Savoy found that minimal stemming of plurals and feminine/masculine word forms performed well for these languages. The minimal_* and light_* stemmers included in Elasticsearch implement these recommendations, allowing us to take a limited stemming approach.

So when a minimal stemmer is available for a language we use it; otherwise we do no stemming at all.
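To make the contrast with Snowball concrete, here is a rough Python approximation of what a minimal English plural stemmer does; this is a loose sketch of the idea, not Lucene's exact minimal_english rules:

```python
def minimal_english_stem(word):
    """Strip only a trailing plural 's', leaving all other endings
    alone. A loose approximation of minimal plural stemming: it keeps
    'computers' -> 'computer' but never touches 'computation'."""
    if len(word) > 3 and word.endswith("s") and word[-2] not in ("s", "u"):
        return word[:-1]
    return word
```

Note how "interns" stems to "intern" while "international" is left intact, so the semantic distinction Snowball destroys is preserved.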

2) Use stop words for those languages that we have them for.

This reduces the size of the index and speeds up searches by not trying to match very frequent terms that provide very little information. Unfortunately, stop words will break certain searches. For instance, searching for “to be or not to be” will not return any results.

The cutoff_frequency parameter on the match query (new in 0.90) may provide a way to allow indexing stop words, but I am still unsure whether it has implications for other types of queries, or how I would choose a cutoff frequency given the wide range of documents and languages in a single index. The very high number of English documents compared to, say, Hebrew also means that Hebrew stop words may not be frequent enough to trigger the cutoff frequencies correctly when searching across all documents.
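For reference, here is roughly what such a query looks like, expressed as the Python dict you would pass to an ES client; the 0.01 threshold is an arbitrary illustrative value, not a recommendation:

```python
# Match query where terms above 1% document frequency (stop-word-like
# terms) are only used to score documents that also match a rarer term.
query = {
    "query": {
        "match": {
            "content": {
                "query": "to be or not to be",
                "cutoff_frequency": 0.01,  # illustrative threshold only
            }
        }
    }
}
```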

For the moment I’m sticking with the stop words approach. Weaning myself off of them will require a bit more experimentation and thought, but I am intrigued by finding an approach that would allow us to avoid the limitations of stop words and enable finding every blog post referencing Shakespeare’s most famous quote.

3) Try to retain term consistency across all analyzers.

We use the ICU Tokenizer for all cases where the language won’t do significantly better with a custom tokenizer. Japanese, Chinese, and Korean all require smarter tokenization, but using the ICU Tokenizer ensures we treat other languages in a consistent manner. Individual terms are then filtered using the ICU Folding and Normalization filters to ensure consistent terms.

Folding converts a character to an equivalent standard form. The most common conversion that ICU Folding provides is converting characters to lower case, as defined in this exhaustive definition of case folding. But folding goes far beyond lowercasing: in many languages there are symbols where multiple characters essentially mean the same thing (particularly from a search perspective). UTR30-4 defines the full set of foldings that ICU Folding performs.

Where Folding converts a single character to a standard form, Normalization converts a sequence of characters to a standard form. A good example of this, straight from Wikipedia, is “the code point U+006E (the Latin lowercase “n”) followed by U+0303 (the combining tilde “◌̃”) is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter “ñ” of the Spanish alphabet).” Another entertaining example of character normalization is that some Roman numerals (Ⅸ) can be expressed as a single Unicode character. But of course for search you’d rather have that converted to “IX”. The ICU Normalization sections have links to the many docs defining how normalization is handled.

By indexing using these ICU tools we can be fairly sure that searching across all documents, regardless of language, with just a default analyzer will give results for most queries.

The Details (there are always exceptions to rules)

  • Asian languages that do not use whitespace for word separation present a non-trivial problem when indexing content. ES comes with a built-in CJK analyzer that indexes every pair of symbols as a term, but there are plugins that are much smarter about how to tokenize the text.
    • For Japanese (ja) we are using the Kuromoji plugin built on top of the seemingly excellent library by Atilika. I don’t know any Japanese, so really I am probably just impressed by their level of documentation, slick website, and the fact that they have an online tokenizer for testing tokenization.
    • There are a couple of different versions of written Chinese (zh), and the language detection plugin distinguishes between zh-tw and zh-cn. For analysis we use the ES Smart Chinese Analyzer for all versions of the language. This is done out of necessity rather than any analysis on my part. The ES plugin wraps the Lucene analyzer which performs sentence and then word segmentation using a Hidden Markov Model.
    • Unfortunately there is currently no custom Korean analyzer for Elasticsearch that I have come across. For that reason we are only using the CJK Analyzer which takes each bi-gram of symbols as a term. However, while writing this post I came across a Lucene mailing list thread from a few days ago which says that a Korean analyzer is in the process of being ported into Lucene. So I have no doubt that will eventually end up in ES or as an ES plugin.
  • Elasticsearch doesn’t have any built in stop words for Hebrew (he) so we define a custom list pulled from an online list (Update: this site doesn’t exist anymore, our list of stopwords is located here). I had some co-workers cull the list a bit to remove a few of the terms that they deemed a bit redundant. I’ll probably end up doing this for some other languages as well if we stick with the stop words approach.
  • Testing 30 analyzers was pretty non-trivial. The ES Inquisitor plugin’s Analyzers tab was incredibly useful for interactively testing text tokenization and stemming against all the different language analyzers to see how they functioned differently.
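The bi-gram behavior of the CJK analyzer mentioned above is easy to picture; it slides a two-character window over each run of CJK text, roughly like this sketch (a simplification that ignores the analyzer's handling of script boundaries and non-CJK tokens):

```python
def cjk_bigrams(text):
    """Emit overlapping two-character terms from a run of CJK text,
    roughly what the CJK analyzer does in the absence of real
    word segmentation."""
    return [text[i:i + 2] for i in range(len(text) - 1)]
```

Every query term generated the same way will match somewhere in the indexed bigrams, which is why bigramming works at all without a dictionary, at the cost of some spurious matches across word boundaries.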

Finally we come to defining all of these analyzers. I hope this helps you in your multi-lingual endeavors.

Update [Feb 2014]: The PHP code we use for generating analyzers is now open sourced as a part of the wpes-lib project. See that code for the latest methods we are using.

Update [May 2014]: Based on the feedback in the comments and some issues we’ve come across running in production I’ve updated the mappings below. The changes we made are:

  • Perform ICU normalization before removing stopwords, and ICU folding after stopwords. Otherwise stopwords such as “même” in French will not be correctly removed.
  • Adjusted our Japanese language analysis based on a slightly adjusted use of GMO Media’s methodology. We were seeing a significantly lower click through rate on Japanese related posts than for other languages, and there was pretty good evidence that the morphological language analysis would help.
  • Added the Elision Token filter to French. “l’avion” => “avion”
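The elision filter's effect can be sketched in a few lines; the article list below is partial and for illustration only (Lucene's default French elision set is longer):

```python
# Partial list of French elided articles/particles, for illustration.
ELISIONS = ("qu'", "l'", "d'", "m'", "t'", "n'", "s'", "j'")

def strip_elision(token):
    """Drop a leading elided article so that "l'avion" indexes
    and matches as "avion"."""
    lower = token.lower()
    for prefix in ELISIONS:
        if lower.startswith(prefix):
            return token[len(prefix):]
    return token
```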

Potential improvements I haven’t gotten a chance to test yet because we need to run real performance tests to be sure they will actually be an improvement:

  • Duplicate tokens to handle different spellings (eg “recognize” vs “recognise”).
  • Morphological analysis of en and ru
  • Should we run spell checking or phonetic analysis?
  • Include all stopwords and rely on cutoff_frequency to avoid the performance problems this will introduce
  • Index bigrams with the shingle analyzer
  • Duplicate terms, stem them, then unique the terms to try and index both stemmed and non-stemmed terms
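That last idea roughly corresponds to ES's keyword_repeat and unique token filters around a stemmer. Logically the token stream transformation looks like this sketch (with a stand-in stemmer passed in, since the real one is a Lucene filter):

```python
def stem_and_keep_original(tokens, stem):
    """Emit each token plus its stemmed form, skipping the duplicate
    when stemming changed nothing -- the effect of a
    keyword_repeat -> stemmer -> unique filter chain."""
    out = []
    for token in tokens:
        out.append(token)
        stemmed = stem(token)
        if stemmed != token:
            out.append(stemmed)
    return out
```

The index gets both exact and stemmed terms, so exact-form queries are not penalized by stemming.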

Thanks to everyone in the comments who have helped make our multi-lingual indexing better.

{
  "filter": {
    "ar_stop_filter": {
      "type": "stop",
      "stopwords": ["_arabic_"]
    },
    "bg_stop_filter": {
      "type": "stop",
      "stopwords": ["_bulgarian_"]
    },
    "ca_stop_filter": {
      "type": "stop",
      "stopwords": ["_catalan_"]
    },
    "cs_stop_filter": {
      "type": "stop",
      "stopwords": ["_czech_"]
    },
    "da_stop_filter": {
      "type": "stop",
      "stopwords": ["_danish_"]
    },
    "de_stop_filter": {
      "type": "stop",
      "stopwords": ["_german_"]
    },
    "de_stem_filter": {
      "type": "stemmer",
      "name": "minimal_german"
    },
    "el_stop_filter": {
      "type": "stop",
      "stopwords": ["_greek_"]
    },
    "en_stop_filter": {
      "type": "stop",
      "stopwords": ["_english_"]
    },
    "en_stem_filter": {
      "type": "stemmer",
      "name": "minimal_english"
    },
    "es_stop_filter": {
      "type": "stop",
      "stopwords": ["_spanish_"]
    },
    "es_stem_filter": {
      "type": "stemmer",
      "name": "light_spanish"
    },
    "eu_stop_filter": {
      "type": "stop",
      "stopwords": ["_basque_"]
    },
    "fa_stop_filter": {
      "type": "stop",
      "stopwords": ["_persian_"]
    },
    "fi_stop_filter": {
      "type": "stop",
      "stopwords": ["_finnish_"]
    },
    "fi_stem_filter": {
      "type": "stemmer",
      "name": "light_finish"
    },
    "fr_stop_filter": {
      "type": "stop",
      "stopwords": ["_french_"]
    },
    "fr_stem_filter": {
      "type": "stemmer",
      "name": "minimal_french"
    },
    "he_stop_filter": {
      "type": "stop",
      "stopwords": [/*excluded for brevity*/]
    },
    "hi_stop_filter": {
      "type": "stop",
      "stopwords": ["_hindi_"]
    },
    "hu_stop_filter": {
      "type": "stop",
      "stopwords": ["_hungarian_"]
    },
    "hu_stem_filter": {
      "type": "stemmer",
      "name": "light_hungarian"
    },
    "hy_stop_filter": {
      "type": "stop",
      "stopwords": ["_armenian_"]
    },
    "id_stop_filter": {
      "type": "stop",
      "stopwords": ["_indonesian_"]
    },
    "it_stop_filter": {
      "type": "stop",
      "stopwords": ["_italian_"]
    },
    "it_stem_filter": {
      "type": "stemmer",
      "name": "light_italian"
    },
    "ja_pos_filter": {
      "type": "kuromoji_part_of_speech",
      "stoptags": ["\\u52a9\\u8a5e-\\u683c\\u52a9\\u8a5e-\\u4e00\\u822c", "\\u52a9\\u8a5e-\\u7d42\\u52a9\\u8a5e"]
    },
    "nl_stop_filter": {
      "type": "stop",
      "stopwords": ["_dutch_"]
    },
    "no_stop_filter": {
      "type": "stop",
      "stopwords": ["_norwegian_"]
    },
    "pt_stop_filter": {
      "type": "stop",
      "stopwords": ["_portuguese_"]
    },
    "pt_stem_filter": {
      "type": "stemmer",
      "name": "minimal_portuguese"
    },
    "ro_stop_filter": {
      "type": "stop",
      "stopwords": ["_romanian_"]
    },
    "ru_stop_filter": {
      "type": "stop",
      "stopwords": ["_russian_"]
    },
    "ru_stem_filter": {
      "type": "stemmer",
      "name": "light_russian"
    },
    "sv_stop_filter": {
      "type": "stop",
      "stopwords": ["_swedish_"]
    },
    "sv_stem_filter": {
      "type": "stemmer",
      "name": "light_swedish"
    },
    "tr_stop_filter": {
      "type": "stop",
      "stopwords": ["_turkish_"]
    }
  },
  "analyzer": {
    "ar_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "ar_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "bg_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "bg_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "ca_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "ca_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "cs_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "cs_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "da_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "da_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "de_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "de_stop_filter", "de_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "el_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "el_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "en_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "en_stop_filter", "en_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "es_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "es_stop_filter", "es_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "eu_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "eu_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "fa_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "fa_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "fi_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "fi_stop_filter", "fi_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "fr_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "elision", "fr_stop_filter", "fr_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "he_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "he_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "hi_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "hi_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "hu_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "hu_stop_filter", "hu_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "hy_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "hy_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "id_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "id_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "it_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "it_stop_filter", "it_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "ja_analyzer": {
      "type": "custom",
      "filter": ["kuromoji_baseform", "ja_pos_filter", "icu_normalizer", "icu_folding", "cjk_width"],
      "tokenizer": "kuromoji_tokenizer"
    },
    "ko_analyzer": {
      "type": "cjk",
      "filter": []
    },
    "nl_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "nl_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "no_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "no_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "pt_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "pt_stop_filter", "pt_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "ro_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "ro_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "ru_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "ru_stop_filter", "ru_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "sv_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "sv_stop_filter", "sv_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "tr_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "tr_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "zh_analyzer": {
      "type": "custom",
      "filter": ["smartcn_word", "icu_normalizer", "icu_folding"],
      "tokenizer": "smartcn_sentence"
    },
    "lowercase_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "icu_folding"],
      "tokenizer": "keyword"
    },
    "default": {
      "type": "custom",
      "filter": ["icu_normalizer", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    }
  },
  "tokenizer": {
    "kuromoji": {
      "type": "kuromoji_tokenizer",
      "mode": "search"
    }
  }
}

 


57 Comments

  1. Gregor

     /  May 27, 2013

    So for indexing you use the language detection plugin to determine the language of the document and use the corresponding analyzer.
    And for searching you always rely on the default analyzer without attempting to “guess” the language?

    • Greg

       /  May 27, 2013

      For indexing, yes we do language detection to select the analyzer.

      When querying, it depends. If we have a good guess at the user’s language (ie they are on de.search.wordpress.com or the site they are on has a particular language selected) then we can use the appropriate language. But when we don’t have a good guess, then we can fall back to the default analyzer which should work pretty well across most languages.

      Ideally we try and use the appropriate language analyzer, but there are definitely cases where I know we won’t be able to so having a fallback is important. The biggest concern with the fallback is how stemming will truncate terms. Hopefully using only minimal stemming will minimize how much impact this has.

      I haven’t done any deep analysis of what impact this has on search relevancy yet though.

      • Nate

         /  November 20, 2013

        Greg,

        So when you say you do “language detection”, are you doing this independently of Elasticsearch? Or is there a way to tie content.lang as set by the plugin to a particular analyzer automatically? I am very new to Elasticsearch and it would be helpful to know.


      • Greg

         /  November 20, 2013

        Hi Nate

        We run the elasticsearch-langdetect plugin on the same ES cluster and then when indexing first make a call to it to determine the language of the content of the doc. Then we make a separate call to index the document.

        I don’t believe there is a way to index the document and determine the language at the same time.

        It’s also possible to run the langdetect code independent of ES (potentially in your client), but for us using the ES plugin made it easier to deploy and it doesn’t add much load to the cluster.


      • Nate

         /  November 21, 2013

        Thanks for the prompt reply! Makes sense.


  2. Avi G

     /  October 24, 2013

    Amazing post! Helped me a lot. Thank you for all the information!

  3. Ale

     /  October 31, 2013

    Hei,
    It’s not clear to me how you decide which analyzer to use depending on the field’s content. Do you have a field for each language, or were you able to use different analyzers for the same field at indexing?

    Thanks !

    • Greg

       /  October 31, 2013

      You can specify the analyzer to use when indexing. In my case I have a field for each document called lang_analyzer which specifies which analyzer to use for the document.

      You configure which field is used for specifying the analysis in the _analyzer mapping field.

      For querying you either need to specify the analyzer or you just rely on the default. Using the ICU plugins for analysis ensures consistent tokenization across all languages so that the default should work pretty well.

  4. Daniel

     /  December 17, 2013

    Hi Greg! First of all, great post, thanks for it!
    How would you go about if your data was region names in various languages, for instance, I have more than 100k regions, and one document in ES contains the names Munich, München, Munique, etc. Same goes for the 100k+ regions. Having one document per language would make my index grow a lot.
    What I want to have is an auto complete where people can search regions, but I don’t really know the language they best know the region, so they can be seeing the site in English but searching the region in German. So to have an educated guess of the language is hard. Do you think a set up like the one you presented would be appropriate for data such as the one I stated? Or would you do something different?

    Thanks a lot,
    Daniel

    • Greg

       /  December 18, 2013

      If I understand the use case, I think you could just use the ICU tokenizer, folding, and normalization on a single field without any stemming or stop words (the “default” analyzer in the code above). If you are only indexing place names across multiple languages you shouldn’t need stemming/stop words anyways. ICU should give you results that work pretty well across European languages at least. You wouldn’t have any fancy tokenization of Korean, Japanese, or Chinese. I don’t know enough about place names in those languages to know how big a problem that would be.

      If all of the place names you have are already separated, then be sure to index them as an array of strings, and consider indexing them as both an analyzed and a non-analyzed field (see the multi-field type mapping example).

      That way you can retain the original text and sequence of words. Probably some other details to work out to get auto suggest working well also, but I haven’t yet played with the new suggest features.

      • Daniel

         /  January 8, 2014

        Thanks for the feedback Greg, I’m trying some stuff to see how it works, and what is faster, and your tips certainly helped.

        Thank you


  5. Michael

     /  January 22, 2014

    any idea how to plug in the polish (stempel) analyzer? have you tried it?

  6. Michael

     /  January 22, 2014

    also, how does one use the elasticsearch-langdetect plugin to automatically apply the right analyzer based on the detected language?

    • Greg

       /  January 23, 2014

      You can’t auto apply the analyzer to a field unfortunately. You need to make one request to analyze a block of text and get the language and then a separate request to index the data with the appropriate analyzer specified.

      • Greg

         /  January 27, 2014

        Oh cool! The langdetect plugin has been updated since I originally wrote this post, and I hadn’t noticed that change.

        Yes, I think that should work. I’ll need to use this method in the future. Thanks!


      • Michael

         /  January 27, 2014

        unfortunately the _langdetect method is wayyy inaccurate, especially for short phrases..


      • Greg

         /  January 27, 2014

        Ya, I have some custom client code wrapping my call to langdetect so that if there is less than 300 chars of actual text then we don’t bother running it and use some fallbacks.

        I hacked together a quick (probably not working) gist of how we call langdetect: https://gist.github.com/gibrown/8652399

        Might be good to submit an issue against the plugin with specific examples. Short text is generally a harder problem, but there may be some simple changes that will make things better.


  7. Excellent article. I thought readers might be interested in Rosette Search Essentials for Elasticsearch, from Basis Technologies, which we launched last night at hack/reduce in Cambridge, MA. It’s a plugin that does three neat things that improve multilingual search quality:

    – Intelligent CJKT tokenization/segmentation
    – Lemmatization: performs morphological analysis to find the “lemma” or dictionary form of the words in your documents, which is far superior to stemming.
    – Decompounding: languages like German contain compound words that don’t always make great index terms. We break these up into their constituents so you can index them too.

    Handles Arabic, Chinese, Czech, Danish, Dutch, English, French, German, Hebrew, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Swedish, Thai and Turkish.

    Check it out here: http://basistech.com/elasticsearch

    Read a bit more about the recall and precision benefits that lemmatization and decompounding can offer in this paper: http://www.basistech.com/search-in-european-languages-whitepaper/

    I’m the Director of Product Management at Basis. I would love feedback on the product and to hear from anyone who has gnarly multilingual search problems.

    • Greg

       /  February 9, 2014

      Hi Gregor, thanks for pointing this out and for working to make multi-lingual search better.

      I pretty strongly recommend against using a closed source solution such as yours for something so fundamental as search. My reasoning got lengthy, so I turned it into a full post.

      Happy to discuss more, either publicly or privately.

      Cheers.

  8. slushi

     /  February 14, 2014

    the link to hebrew stop words seems to be broken. any ideas on where a good list can be found?

  9. slushi

     /  February 14, 2014

    I tried out the above settings. I suspected that the above definition could cause issues when language-specific stop words contain “special” characters that would be folded into ASCII characters. I built a gist that demonstrates the problem in French.

    Did you guys decide this is acceptable? I think if the folding filter is moved to the end of the filter chain, this issue would disappear, but I don’t know what other effects that would have.

    • Greg

       /  February 14, 2014

      Wow, you’re totally right. No, it’s not really acceptable; definitely a bug. Thanks!

      I think the folding filter should be last in the list, or we should use custom stopword lists that have the characters already folded. Probably this:

      "filter": ["icu_normalizer", "fr_stop_filter", "fr_stem_filter", "icu_folding"]
      

      This bug probably doesn’t affect search quality too much. It only applies to a few words in each language. However, including stop words in the index definitely makes the index bigger and could significantly slow down searches.

      We’ll have to do some experimentation to figure out what the right filtering is. Will be interesting to see how much of a performance improvement we get from this change.

      FYI, character folding is definitely very worthwhile. We did some work with one of our VIPs on a French site, and without character folding there were definitely complaints about the search.

      Thanks again!

  10. Thanks for the nice article. One of the links is dead. The article on searchworkings has moved to: http://blog.trifork.com/2011/12/07/analysing-european-languages-with-lucene/

    regards Jettro

  11. I love this post – come back here from time to time, because you’re regularly updating it – thanks for that! Learned a lot here! We’ve used that information for improving search results on our multilingual site Pixabay.com (20 languages).

    To give back something – as a German based company, we could fine tune some things for search in German:

    Instead of plain “icu_folding”, one should use a customized filter and exclude a few special characters:

    "filter": {
      "de_icu_folding": { "type": "icu_folding", "unicodeSetFilter": "[^ßÄäÖöÜü]" },
      "de_stem_filter": { "type": "stemmer", "name": "minimal_german" }
    }

    Then, add a char filter to transform the excluded characters:

    "char_filter": {
      "de_char_filter": {
        "type": "mapping",
        "mappings": ["ß=>ss", "Ä=>ae", "ä=>ae", "Ö=>oe", "ö=>oe", "Ü=>ue", "ü=>ue", "ph=>f"]
      }
    }

    Put it all together in the analyzer:

    "de_analyzer": {
      "type": "custom",
      "tokenizer": "icu_tokenizer",
      "filter": ["de_stop_filter", "de_icu_folding", "de_stem_filter", "icu_normalizer"],
      "char_filter": ["de_char_filter"]
    }

    Advantage: For example, German has words like “blut” and “blüte”, meaning “blood” and “blossom”. Using standard icu_folding, both terms are treated exactly the same way; with the custom char filter, results work as expected. The character “ü” may be written as “ue” in German, which is what the transformation basically does.

    • Greg Ichneumon Brown

       /  May 23, 2014

      This is very helpful, thanks.

      I’ve been testing these changes out today, and I’m looking at adding this with a few slight changes into wpes-lib:
      – I just used the default icu_folding because as far as I could tell the char_filter will have changed these characters anyways
      – I also changed the order of the filters to put the normalizer first since one of the reasons for this filter is to combine multi-character sequences into one character before folding.

      I think both of these changes matter more when you are dealing with multi-lingual content in a single document. Any problems you see with this? For your examples it seems to still work well.

      I’m also curious if you have looked at all at using a decompounder in German.

      • If the char_filter is applied before icu_folding takes place, it should work. In which order does ES go through those filters?

        I think, icu normalizer first makes totally sense – I’ll change that in our own code right away.

        Didn’t know about the decompounder so far – but it sounds great! Going to test this soon!

        Thanks, Simon


      • Greg Ichneumon Brown

         /  May 23, 2014

        ES always applies char filters first (even before tokenization), so ya that should work well.

        I’d be really interested to hear how the decompounder works for you. It feels like too big a change for me to universally change without doing some thorough testing of its performance. I’d also like to test it for multiple languages and just don’t have the time to devote to it right now.

        Thanks again for the help, I’m going to commit these changes and make them live when we rebuild our index in a few weeks.

      • Not sure if that’s interesting for you, but we also use a word delimiter filter for all Latin-script languages (so not for ja, zh, ko): http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-word-delimiter-tokenfilter.html

        "filter": {
          "my_word_delimiter": {
            "type": "word_delimiter",
            "generate_word_parts": false,
            "catenate_words": true,
            "catenate_numbers": true,
            "split_on_case_change": false,
            "preserve_original": true
          }
        }

      • Greg Ichneumon Brown / May 23, 2014

        Good to hear that works well for you.

        I have used a word delimiter on some smaller indices, but I (vaguely) remember running into problems in a few cases. I think I decided I didn’t have enough data to figure out how to configure it properly.

        I still feel like my analyzers don’t do a good job with product names and other words where punctuation or case is used as part of the word.

        I’m surprised you don’t use the same filter for ja, zh, and ko. I often see a lot of Latin-script words mixed in with Asian-language content.

      • I guess it wouldn’t really hurt, but in our case, the delimiter also wouldn’t make a (relevant) difference for ja, ko, zh. We’re not dealing with full texts/sentences, but with a lot of keywords that are strictly separated into the different languages. There are a few latin names for cities, countries and the likes, but they would not be affected by the delimiter. So the delimiter would only cost a bit of performance with no real benefit …

      • I’ve looked at the German decompounder – in theory it really looks good and I’d like to use it. However, it’s not well maintained: the update frequency appears to be rather low, and there’s no working version for the current ES releases (1.1.x or 1.2).

      • Greg Ichneumon Brown / May 27, 2014

        Thanks for the update.

        jprante has been pretty responsive to Github issues I’ve submitted elsewhere in the past, so maybe either submit one or even better build it locally and submit a pull request. My guess is that very little has changed in 1.1.x that would affect this.

    • Do you think that the new built-in German analyzer (available in 1.3.X) is good enough, or do you still need the custom ICU folding?

      • Greg Ichneumon Brown / September 2, 2014

        Thanks for pointing that out. We’re still on 1.2 and I hadn’t noticed the new analyzers yet.

        My 2 cents after briefly examining the analyzers (caveat: I don’t know German).

        Some differences with our analyzers:
        – German uses light stemming rather than minimal stemming. The ES docs link to this paper. My choice of minimal stemming is based largely on its effects in Spanish and French; it’s quite possible light stemming would be better here.
        – There is normalization, but it is language specific. I worry that this choice will run into problems with foreign words mixed into the language, particularly words with accents. ICU seems like a very strong standard, and I believe it is well regarded within ES; they don’t bundle it mainly due to its size.
        – In English the analyzer uses the Porter stemmer. I disagree with this decision in most contexts; when I have used the Porter stemmer in real applications I’ve gotten pushback.
        – There appears to be no normalization in English. This means resume and resumé are two different terms.

        I think there are some interesting ideas in this set of analyzers. I like a lot that there are links to the papers that were used to justify the decisions. I also like that ES has tried to put together a basic set of analyzers for users even if I disagree with some of the details. Those details probably depend a lot on your application.
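
        To make the comparison concrete, the two approaches differ roughly like this as filter chains (a sketch only; the built-in `german` analyzer also includes stop-word and keyword handling not shown here):

        ```json
        "filter": {
          "light_german_stemmer": { "type": "stemmer", "name": "light_german" },
          "minimal_german_stemmer": { "type": "stemmer", "name": "minimal_german" }
        },
        "analyzer": {
          "builtin_style_de": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "german_normalization", "light_german_stemmer"]
          },
          "icu_style_de": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["icu_normalizer", "icu_folding", "minimal_german_stemmer"]
          }
        }
        ```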

      • These are good and helpful remarks. Thanks!

      • FYI these analyzers are just exposing the lucene per-language analyzers, which means they are specific to the needs of the language, typically undergone formal evaluation etc. They don’t depend on any external libraries like ICU.

        The German normalization, for example, incorporates context-dependent handling that is pretty common in these analyzers, usually can’t be expressed as a Unicode normalization form (i before e except after c, x only at the end of the word, etc.), and usually isn’t appropriate for other languages.

        And in english, resume is not always a noun…

      • Greg Ichneumon Brown / September 2, 2014

        Thanks for the background on the language analyzers (and presumably for building a lot of them :) ).

        Is there a list somewhere of how they are being evaluated? We’re not doing as much systematic evaluation as I’d like. We don’t have everything in place to make that worth the effort yet, but it’s something I want us to spend more time on.

        Just as some more background, part of the reason I like ICU so much is that it can mostly be applied across all languages. This helps in cases where the language is unknown or there are a mix of languages being searched across.

        “resumé” wasn’t necessarily the best example I could come up with, but looking at 170k en searches from a few hours of logs I see 7 relevant searches: 4 for “resume”, 1 for “iit resume”, 1 for “College resume”, 1 for “resume writer”, and none for “resumé”. I’m kinda doubtful “resume” is used as a verb very often in web search. :)

        That’s diving into the weeds a bit (and me being overly nit-picky), I’ve just seen cases where search is considered “broken” by users if the search engine doesn’t correct for these sorts of issues. One case that comes to mind is one of our VIP clients: http://olympic.ca/ and http://olympique.ca/

        Naturally there are a lot of names with accents that need to be handled well whether the user is searching in English or French, and English speaking users rarely type accents.

        I believe that using the Porter stemmer results in similar feelings from users that search is “broken”, but I only have anecdotal feedback as I haven’t tested click through rates.

        This is pretty specific to web search though where I think users have been trained by Google what to expect. Depending on the application though ICU certainly isn’t always the best way to go, and I understand not wanting to bundle with Lucene or ES.

        Cheers

      • In our case (Pixabay.com that is) we stick with the custom analyzer described above. Unfortunately, there’s no (simple) perfect way of handling these special characters:

        Using our own custom analyzer, there’s a difference between “bluten” (bleeding) and “blüten” (blossoms). The new analyzer folds the “ü” to “u”, so both terms become the same -> big problem for us! Same issue as with the previous built-in analyzer(s).

        However, there’s an issue with plural forms: e.g. “häuser” (houses) is the plural of “haus” (house). Using the built-in analyzer including stemmer, there’s no difference between “häuser” and “haus”, because the stemmed term is in both cases “haus”. That’s good. With our own analyzer, stemming won’t work in this case, because the stemmed form of “haeuser” is “haeus” – and not “haus”. Thus, plural and singular are treated as different terms. There are several German nouns that show this behavior, but as I said, in our case, the custom analyzer is probably the better choice.
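
        Differences like these are easy to check with the `_analyze` API by comparing the emitted tokens for each word (the index and analyzer names here are placeholders):

        ```
        GET /my_index/_analyze?analyzer=custom_german&text=blüten
        GET /my_index/_analyze?analyzer=custom_german&text=bluten
        GET /my_index/_analyze?analyzer=custom_german&text=häuser
        GET /my_index/_analyze?analyzer=custom_german&text=haus
        ```

        If “blüten” and “bluten” come back as the same token, the distinction has been lost; if “häuser” and “haus” come back different, plural and singular won’t match.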

  12. Florian / May 22, 2014

    very helpful article – thanks a lot greg!

    i have two questions:
    – what would be a good way to deal with a non-detected/defined language? i built a mapping along the lines of the gist Michael posted. each language needs to be defined… content.en, content.ja, etc. how would i deal with a language that had not been defined there?

    – is there a way to use the langdetect plugin to also add/populate a field in the mapping that would contain the language code – for example to use it as a filter?

    cheers
    _f

    • Greg Ichneumon Brown / May 23, 2014

      Both of your questions would probably be good feature requests for the langdetect plugin. We still make a separate call to ES to do language detection and then set our lang_analyzer field to indicate which analyzer to apply. There are three reasons we do this:
      – langdetect does not support every language
      – We do not have a custom analyzer for every language; some need to fall back on our default analyzer (e.g. Latvian).
      – We have other potential fallbacks we can use if the language detection fails, for example: user settings, lang detection on other content, or predicting based on other user behavior.

      • Florian / May 27, 2014

        using the detection separately (or inferring the language from ui settings etc.) works fine for me too. i would have to send the name of the analyzer to use with the query though, and i ran into a small problem:

        if i use this query i do not get the highlights. if i remove the analyzer parameter i do get highlights, but it then uses the default analyzer…

        am i doing something wrong with the parameter? do you have an example query somewhere that you could post that shows how you send the language/analyzer parameter with the query?

        thanks.

      • Hmm, sorry to hear you have to do hacks for Latvian. Actually ES has Latvian support, but somehow it’s missing from the documentation. I’ll fix that and make sure no others are missing.

      • Greg Ichneumon Brown / September 2, 2014

        That’d be awesome, thanks!

  13. Prashanth / September 16, 2014

    smartcn_word and smartcn_sentence are no longer available from the plugin. How do you modify your configuration to use the smartcn analyzer and smartcn_tokenizer? Thanks.

    • Greg Ichneumon Brown / September 17, 2014

      Hi Prashanth,

      We’ve only just come across this problem as we start upgrading to ES 1.3.2. For new indices we’re changing to using the smartcn_tokenizer with no additional filters. We have existing indices using smartcn_sentence and smartcn_word. Our plan is to hack the smartcn plugin to make these an alias for smartcn_tokenizer, but we haven’t completed that work yet.

      We’ll try submitting our changes as a pull request to the plugin, but not sure whether that’s something they’d like.

      • Prashanth / September 19, 2014

        Thanks Greg. Does your Chinese analyzer look like this now?

        "zh_analyzer": {
          "type": "smartcn",
          "filter": "smartcn_tokenizer",
          "tokenizer": "smartcn_tokenizer"
        }

      • Greg Ichneumon Brown / September 19, 2014

        Almost. You don’t need a filter. The tokenizer does all the work.
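
        In other words, a minimal version would be just (a sketch of what we’re moving to):

        ```json
        "zh_analyzer": {
          "type": "custom",
          "tokenizer": "smartcn_tokenizer"
        }
        ```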

      • Prashanth / September 19, 2014

        Thank you!

  14. Is there a reason for not using a stemmer on Bulgarian?

  1. Elasticsearch: Vyhledáváme hezky česky | IT mag - novinky z IT
