Back to technology with this blog post and a review of Elasticsearch: The Definitive Guide by Clinton Gormley and Zachary Tong. The book is available for free online, where it is probably more up to date (here); that said, Elasticsearch seems to be quite stable now. I have a dead tree copy because I’m old-fashioned.
Elasticsearch is a full-text search engine based on the Apache Lucene project. I was first made aware of it when I was working at ScraperWiki, where we used it for a proof of concept system for analysing legislation from many countries (I wasn’t involved hands-on with this work). Recently, I used it to make a little auto-completion web form for company names using the Companies House dataset. Going from download to a working solution which was 1000 times faster than a naive SQL querying system took less than a day. The default configuration and system is that good!
You can treat Elasticsearch like a SQL database to a fair degree: what it refers to as indexes are what would be separate databases on a SQL server. Elasticsearch refers to document types instead of tables, and what would be rows in a SQL database are called “documents”. There are no joins as such in Elasticsearch, but there are a number of workarounds such as parent-child relationships, nested objects or plain old denormalisation. I suspect one needs to be a bit cautious of treating Elasticsearch as a funny looking SQL database.
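To make the denormalisation workaround concrete, here is a minimal sketch in plain Python. The company and officer records, and all of their field names, are made up for illustration. The point is only that each document carries its own copy of the related data, so no join is needed at query time.

```python
# Hypothetical relational-style data: two "tables" linked by company_id.
companies = [
    {"company_id": 1, "name": "Acme Widgets Ltd"},
    {"company_id": 2, "name": "Bolt Fasteners Ltd"},
]
officers = [
    {"company_id": 1, "officer": "A. Smith"},
    {"company_id": 1, "officer": "B. Jones"},
    {"company_id": 2, "officer": "C. Patel"},
]


def denormalise(companies, officers):
    """Fold each company's officers into the company document itself."""
    by_company = {}
    for row in officers:
        by_company.setdefault(row["company_id"], []).append(row["officer"])
    return [
        {**company, "officers": by_company.get(company["company_id"], [])}
        for company in companies
    ]


# Each entry is now a self-contained document ready to be indexed.
documents = denormalise(companies, officers)
```

The cost of this approach is duplication: if an officer's details change, every document containing them has to be re-indexed.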
The preferred way to interact with Elasticsearch is using the HTTP API. This means that once it is installed you can prod away at your Elasticsearch database using curl from the command line or the Sense plugin for Google Chrome. The book is liberally scattered with examples written as HTTP requests, and online these can be launched from the browser (given a bit of configuration). To my mind the only downside of this is that queries are written in JSON, which introduces a lot of extraneous brackets and quoting. For my experiments I moved quickly to using the Python interface, which seems well-supported and complete (as do other language bindings).
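As a rough illustration of what those JSON requests look like, the sketch below builds (but does not send) a search request using only the Python standard library. The index name `companies` and the field `name` are my assumptions for the example; `localhost:9200` is the default address of a local node.

```python
import json
import urllib.request


def build_search_request(index, field, text, host="http://localhost:9200"):
    """Build an HTTP request for a simple full-text `match` query."""
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{host}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "companies" and "name" are hypothetical index and field names.
request = build_search_request("companies", "name", "acme widgets")
print(request.full_url)
print(request.data.decode("utf-8"))
# With a node running, urllib.request.urlopen(request) would send it.
```

Even for this one-field query the body is three levels of nested brackets, which is the extraneous quoting I mean.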
Elasticsearch: The Definitive Guide is divided into 7 sections: Getting Started, Search in Depth, Dealing with Human Language, Aggregations, Geolocation, Modelling Your Data, and finishes with Administration, Monitoring and Deployment.
The Getting Started section of the book covers everything you need to get you going but no single topic in any depth. The subsequent sections are largely about filling in that detail. The query language is completely different to SQL and queries come back with results ranked by a relevance score. I suspect this is where I’ll find myself working a lot in future. Currently my queries give me a set of results which I filter in Python. I suspect I could write better queries which would return relevance scores that matched my application (and that I would trust). As it stands my queries always return *something* which may or may not be what I want.
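One way to stop a query always returning something is to ask Elasticsearch to drop low-scoring hits itself. The sketch below shows a request body using the `min_score` search parameter, next to the equivalent filtering done in Python on a response. The threshold, the field name and the sample response are all invented; real scores depend on the data and the query, so a sensible cut-off has to be found by experiment.

```python
def match_with_min_score(field, text, min_score):
    """Request body that only returns hits scoring at least `min_score`."""
    return {"min_score": min_score, "query": {"match": {field: text}}}


def filter_hits(response, min_score):
    """Client-side equivalent: keep hits whose `_score` meets the threshold."""
    return [
        hit["_source"]
        for hit in response["hits"]["hits"]
        if hit["_score"] >= min_score
    ]


# Invented response, in the shape the search API returns.
sample_response = {
    "hits": {
        "hits": [
            {"_score": 7.2, "_source": {"name": "Acme Widgets Ltd"}},
            {"_score": 0.4, "_source": {"name": "Widget World Ltd"}},
        ]
    }
}

body = match_with_min_score("name", "acme widgets", min_score=1.0)
good_hits = filter_hits(sample_response, min_score=1.0)
```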
I found the material regarding analyzers (which are applied to searchable fields and, symmetrically, search terms) very interesting and applicable to wider search problems where Elasticsearch is not necessarily the technology to be used. There is an overlap here with natural language processing in the sense that analyzers can include tokenizers, stemmers, and synonym lookups which are all part of the NLP domain. This is expanded on further in the “Dealing with human language” section.
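The symmetry between indexing and searching can be sketched with a toy analyzer in plain Python. This is not how Elasticsearch implements analysis: the stopword list and the suffix-stripping "stemmer" are deliberately crude stand-ins of my own. It only shows why the same analysis has to be applied to both the documents and the search terms.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "and"}  # tiny illustrative list
SUFFIXES = ("ing", "es", "s")                # crude stemmer, not a real algorithm


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, split into alphanumeric tokens, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(token) for token in tokens if token not in STOPWORDS]


def build_index(docs):
    """Inverted index: analysed term -> set of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every analysed query term."""
    terms = analyze(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))


docs = {1: "The Analysis of Searching", 2: "Boxes and foxes"}
index = build_index(docs)
matches = search(index, "Searches")  # finds document 1 via the shared stem
```

Because "Searching" and "Searches" both reduce to the same stem, the query matches even though the literal words differ. Skip the analysis on either side and the match is lost.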
The section on aggregations explains Elasticsearch’s “group by”-like functionality, and that on geolocation touches on spatial extension-like behaviour. Elasticsearch handles geohashes which are a relatively recent innovation in encoding spatial coordinates.
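For anyone who hasn't met geohashes, the sketch below is a small pure-Python version of the standard public encoding, not Elasticsearch's own code. It repeatedly halves the longitude and latitude ranges, interleaves the resulting bits, and writes them out five at a time in a base-32 alphabet. The function name and the default precision are my own choices.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=8):
    """Encode a latitude/longitude pair as a geohash of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits = []
    use_lon = True  # bits alternate, starting with longitude
    while len(bits) < precision * 5:
        value, rng = (lon, lon_range) if use_lon else (lat, lat_range)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits.append(1)
            rng[0] = mid  # keep the upper half
        else:
            bits.append(0)
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
    chars = []
    for i in range(0, len(bits), 5):
        number = 0
        for bit in bits[i:i + 5]:
            number = (number << 1) | bit
        chars.append(BASE32[number])
    return "".join(chars)


short_hash = geohash_encode(57.64911, 10.40744, precision=4)
long_hash = geohash_encode(57.64911, 10.40744, precision=8)
```

A useful property is that a longer hash for the same point starts with the shorter one, so truncating a geohash simply gives a larger enclosing cell.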
The book mentions very briefly the ELK stack, which is Elasticsearch, Logstash and Kibana (all available from the elastic website). This is used to analyse log files: Logstash funnels the log data into Elasticsearch, where it is visualised using Kibana. I tried out Kibana briefly; it’s an easy-to-use visualisation frontend.
Elasticsearch is a Big Data technology from the start which means it supports sharding, replication and distribution over nodes out of the box but it runs fine on a simple single node such as my laptop.
Elasticsearch: The Definitive Guide is a pretty big book, but the individual chapters are short and to the point. As I’d expect from O’Reilly, it is well-edited and readable. I found it great for working out what all the parts of Elasticsearch are, and I now know what exists when it comes to solving live problems. The book is pretty good at telling you which things you can do, and which things you should do.