This essay started as a response to a comment on my multilingual indexing post. The comment is mostly an advertisement, but brings up some interesting points so I decided to publish it and turn my response into a full post.
For some context here’s the key part of the comment:
I thought readers might be interested in Rosette Search Essentials for Elasticsearch, from Basis Technologies, which we launched last night at hack/reduce in Cambridge, MA. It’s a plugin that does three neat things that improve multilingual search quality:
- Intelligent CJKT tokenization/segmentation
- Lemmatization: performs morphological analysis to find the “lemma” or dictionary form of the words in your documents, which is far superior to stemming.
- Decompounding: languages like German contain compound words that don’t always make great index terms. We break these up into their constituents so you can index them too.
There are many areas where Elasticsearch could benefit from better handling of multi-lingual text. And the NLP geek in me would love to see ES get some more modern Natural Language Processing techniques – such as these – applied within it.
Unfortunately I’m a lot less excited about this system because it is closed source, limiting its impact on the overall ES ecosystem. I love that Basis Technologies has whitepapers and it seems to be doing great evaluation of their system, but even the whitepapers require registering with them. This just seems silly.
Search engine technology has been around for a while, a big part of Elasticsearch’s success is due to it being open source. And particularly due to Lucene being an open source, collaborative effort over the past 15 years. I think ES has the potential to become a phenomenal NLP platform over the next five years, bringing many amazing NLP technologies coming out of academia to a massive number of developers. NLP researchers have done tremendous science in an open and collaborative manner. We should work to scale that technology in open and collaborative ways as well.
Building a platform on closed source solutions is not sustainable.
Humans express their dreams, opinions, and ideas in hundreds of languages. Bridging that gap between humans and computers – and ultimately between humans – is a noble endeavor that will subtly shape the next century. I’d like to see Elasticsearch be a force in democratizing the use of natural language processing and machine learning. These methods will impact how we understand the world, how we communicate with each other, and ultimately our democracy. We should not build that future on licensing that explicitly prevents citizens of some countries from participating.
I recognize that working for a successful open source company makes me luckily immune to certain the pressures of business, investors, and government contracts. But I have seen the unfortunate cycle of cool NLP technology getting trapped within a closed source company and eventually being completely shut down. I’ve recently been reading “The Theory That Would Not Die”. Where would we be now if the efficacy of Bayesian probability had not been locked inside classified government organizations for 40 years after World War II?
Basis Technologies has been around a long time, and has far more NLP talent and experience than I do, but the popularity of my very brief multi-lingual post tells me that there is also huge opportunity for the community to improve the multi-lingual capabilities of Elasticsearch. A company leading the way could build a strong business providing all the support that inevitably will be needed when dealing with multiple languages. I’d be happy to talk with anyone about how WordPress.com’s rather large set of multi-lingual data could help in such an endeavor. I bet other organizations that would be interested also.