Lucene & Elasticsearch

Lucene

Download the current source tarball from one of the mirrors listed at: http://lucene.apache.org/core/downloads.html

Extract the tarball and build:

$ tar xf lucene-5.1.0-src.tgz
$ cd lucene-5.1.0
$ CLASSPATH=/PATH/TO/ivy.jar ant jar

Finding the ivy jar is the tricky part. I installed it with Homebrew, which put it at /usr/local/Cellar/ivy/2.4.0/libexec/ivy-2.4.0.jar.

Set the CLASSPATH:

$ export LUCENE_VER=5.1.0
$ export CLASSPATH=\
build/queryparser/lucene-queryparser-${LUCENE_VER}-SNAPSHOT.jar\
:build/analysis/common/lucene-analyzers-common-${LUCENE_VER}-SNAPSHOT.jar\
:build/core/lucene-core-${LUCENE_VER}-SNAPSHOT.jar\
:build/demo/lucene-demo-${LUCENE_VER}-SNAPSHOT.jar

Create an index of the files in core/src, the Lucene source code:

$ java org.apache.lucene.demo.IndexFiles -docs core/src

Query the file index:

$ java org.apache.lucene.demo.SearchFiles

The index is stored in a directory named index:

$ ls -l index
total 4176
-rw-r--r--  1 clark  staff      299 Apr 26 13:00 _0.cfe
-rw-r--r--  1 clark  staff  2125248 Apr 26 13:00 _0.cfs
-rw-r--r--  1 clark  staff      301 Apr 26 13:00 _0.si
-rw-r--r--  1 clark  staff      130 Apr 26 13:00 segments_1
-rw-r--r--  1 clark  staff        0 Apr 26 13:00 write.lock

This is the compound index file format. An index consists of one or more segments. The _0.cfe, _0.cfs, and _0.si files belong to the first and only segment. If we were to re-open the index and add more documents to it, a second segment would be created, and the segments_1 file would be replaced by a segments_2 file. The contents of the segments file are binary, but they appear to contain the version number of the Lucene release that created the index.

Find the place where the IndexWriterConfig object gets created in the demo and make this call on it:

iwc.setUseCompoundFile(false);

Re-run ant jar, then delete the index directory and rebuild the index. Here is what the multifile index format looks like:

$ ls -l index
total 4216
-rw-r--r--  1 clark  staff    19505 Apr 26 13:35 _0.fdt
-rw-r--r--  1 clark  staff      101 Apr 26 13:35 _0.fdx
-rw-r--r--  1 clark  staff      336 Apr 26 13:35 _0.fnm
-rw-r--r--  1 clark  staff      695 Apr 26 13:35 _0.nvd
-rw-r--r--  1 clark  staff      102 Apr 26 13:35 _0.nvm
-rw-r--r--  1 clark  staff      394 Apr 26 13:35 _0.si
-rw-r--r--  1 clark  staff   317861 Apr 26 13:35 _0_Lucene50_0.doc
-rw-r--r--  1 clark  staff  1135071 Apr 26 13:35 _0_Lucene50_0.pos
-rw-r--r--  1 clark  staff   640875 Apr 26 13:35 _0_Lucene50_0.tim
-rw-r--r--  1 clark  staff    10640 Apr 26 13:35 _0_Lucene50_0.tip
-rw-r--r--  1 clark  staff      130 Apr 26 13:35 segments_1
-rw-r--r--  1 clark  staff        0 Apr 26 13:35 write.lock

The multifile index format gives us a bit more insight into how data in a segment is organized.

The Lucene documentation includes a description of each of these file formats.

Elasticsearch

Download and run an Elasticsearch node:

$ curl -L -O https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.4.tar.gz
$ tar xf elasticsearch-1.4.4.tar.gz
$ cd elasticsearch-1.4.4
$ ./bin/elasticsearch -d

By default the node creates or joins a cluster named 'elasticsearch'. It uses multicast to discover any other nodes in the cluster.
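The cluster name and the discovery mechanism can be changed in config/elasticsearch.yml. A sketch of the relevant settings (the host names are placeholders, not defaults):

```yaml
# config/elasticsearch.yml (example values)
cluster.name: elasticsearch

# To disable multicast and list seed hosts explicitly instead:
# discovery.zen.ping.multicast.enabled: false
# discovery.zen.ping.unicast.hosts: ["host1", "host2"]
```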

Query the health of the cluster and list the indices:

$ curl 'localhost:9200/_cat/health?v'
$ curl 'localhost:9200/_cat/indices?v'

The health of the cluster will be green, yellow, or red: green if every primary and replica shard is active, yellow if every primary shard is active but some replicas are not assigned, and red if some primary shards are not available.
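The rule above can be sketched as a small function. This is a simplification for illustration; the function and parameter names are mine, not part of the Elasticsearch API:

```python
def cluster_health(primaries_active, replicas_active):
    """Sketch of the cluster health rule described above.

    primaries_active: True if every primary shard is available.
    replicas_active:  True if every replica shard is also assigned.
    """
    if not primaries_active:
        return "red"      # some data is not available
    if not replicas_active:
        return "yellow"   # all data available, but not fully replicated
    return "green"        # all data available and replicated

print(cluster_health(True, True))   # green
print(cluster_health(True, False))  # yellow
```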

Each index is divided into one or more shards, and each shard must be small enough to fit on a single node. In addition, the index can have zero or more replicas. If the number of replicas is 1, each shard has a copy stored on a different node from the primary.
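Shard and replica counts are set when an index is created, by sending a settings body with the PUT request. A sketch of that body, built here with Python's json module (the counts are example values; it would be sent as the -d payload of curl -XPUT 'localhost:9200/books'):

```python
import json

# Settings body for index creation. number_of_shards is fixed at
# creation time; number_of_replicas can be changed later.
settings = {
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 1,
    }
}

print(json.dumps(settings))
```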

Create an index named 'books' and index a few book titles:

$ curl -XPUT 'localhost:9200/books?pretty'
$ curl -XPUT 'localhost:9200/books/external/1?pretty' -d '{"title": "Fear and Loathing in Las Vegas"}'
$ curl -XPUT 'localhost:9200/books/external/2?pretty' -d '{"title": "Confessions of an English Opium-Eater"}'

Get the fields in an index (its mapping):

$ curl 'localhost:9200/books/_mapping'

Search using GET or POST:

$ curl 'localhost:9200/books/_search?q=fear'

$ curl 'localhost:9200/books/_search' -d '{"query":{"match":{"title":"fear"}}}'

Only bring back selected fields from the documents:

$ curl http://localhost:9200/books/_search -d '{"fields": ["title"]}'

Searching with a POST request body gives access to a more complete query language. Some of the top-level keys:

  • query
  • size
  • from
  • sort

Partial list of possible subkeys of the query key:

  • match
  • match_all
  • term
  • terms
  • range
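Combining the keys above, a search body might look like the following (sketched with Python's json module; the title field and the query string come from the books example earlier):

```python
import json

# A search body using the top-level keys query, size, from, and sort,
# with a match query on the title field; _score sorts by relevance.
body = {
    "query": {"match": {"title": "fear"}},
    "size": 10,
    "from": 0,
    "sort": ["_score"],
}

print(json.dumps(body))
```

This is the JSON that would be sent as the -d payload of curl 'localhost:9200/books/_search'.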

Delete the books index:

$ curl -XDELETE 'http://localhost:9200/books/'

Unless otherwise stated, the content of this page is licensed under the Creative Commons Attribution-ShareAlike 3.0 License.