Lucene
Download the current source tarball from one of the mirrors at: http://lucene.apache.org/core/downloads.html
Extract the tarball and build:
$ tar xf lucene-5.1.0-src.tgz
$ cd lucene-5.1.0
$ CLASSPATH=/PATH/TO/ivy.jar ant jar
Finding the ivy jar is a problem. I used Homebrew to install it, and it was at /usr/local/Cellar/ivy/2.4.0/libexec/ivy-2.4.0.jar.
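If you use Homebrew, something along these lines installs ivy and locates the jar (the exact Cellar path depends on the version installed):
$ brew install ivy
$ find /usr/local/Cellar/ivy -name 'ivy-*.jar'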
Set the CLASSPATH:
$ export LUCENE_VER=5.1.0
$ export CLASSPATH=\
build/queryparser/lucene-queryparser-${LUCENE_VER}-SNAPSHOT.jar\
:build/analysis/common/lucene-analyzers-common-${LUCENE_VER}-SNAPSHOT.jar\
:build/core/lucene-core-${LUCENE_VER}-SNAPSHOT.jar\
:build/demo/lucene-demo-${LUCENE_VER}-SNAPSHOT.jar
Create an index of the Lucene source code in core/src:
$ java org.apache.lucene.demo.IndexFiles -docs core/src
Query the file index:
$ java org.apache.lucene.demo.SearchFiles
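SearchFiles reads queries from standard input by default; per the demo's usage message it also accepts flags such as -index, -field, and -query, so a single query can be run non-interactively:
$ java org.apache.lucene.demo.SearchFiles -index index -field contents -query "query parser"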
The index was stored in the directory index:
$ ls -l index
total 4176
-rw-r--r-- 1 clark staff 299 Apr 26 13:00 _0.cfe
-rw-r--r-- 1 clark staff 2125248 Apr 26 13:00 _0.cfs
-rw-r--r-- 1 clark staff 301 Apr 26 13:00 _0.si
-rw-r--r-- 1 clark staff 130 Apr 26 13:00 segments_1
-rw-r--r-- 1 clark staff 0 Apr 26 13:00 write.lock
This is the compound index file format. An index consists of one or more segments. The _0.cfe, _0.cfs, and _0.si files belong to the first and only segment. If we were to re-open the index and add more documents to it, a second segment would be created, and the segments_1 file would be replaced by a segments_2 file. The contents of the segments file are binary, but they appear to include the version of Lucene that created the index.
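Lucene's core jar also includes a CheckIndex tool that walks each segment and prints diagnostics about it; with the CLASSPATH above still set, it is a handy way to peek inside the index:
$ java org.apache.lucene.index.CheckIndex index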
Find the place where the IndexWriterConfig object gets created in the demo and make this call on it:
iwc.setUseCompoundFile(false);
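In the demo, the IndexWriterConfig is created in demo/src/java/org/apache/lucene/demo/IndexFiles.java. The surrounding code looks roughly like this sketch (exact lines may differ in your copy):
Analyzer analyzer = new StandardAnalyzer();
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setOpenMode(OpenMode.CREATE);
// Write each part of the index as its own file instead of bundling into .cfs/.cfe:
iwc.setUseCompoundFile(false);
IndexWriter writer = new IndexWriter(dir, iwc);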
Use ant to re-compile and then delete and rebuild the index. Here is what the multifile index format looks like:
$ ls -l index
total 4216
-rw-r--r-- 1 clark staff 19505 Apr 26 13:35 _0.fdt
-rw-r--r-- 1 clark staff 101 Apr 26 13:35 _0.fdx
-rw-r--r-- 1 clark staff 336 Apr 26 13:35 _0.fnm
-rw-r--r-- 1 clark staff 695 Apr 26 13:35 _0.nvd
-rw-r--r-- 1 clark staff 102 Apr 26 13:35 _0.nvm
-rw-r--r-- 1 clark staff 394 Apr 26 13:35 _0.si
-rw-r--r-- 1 clark staff 317861 Apr 26 13:35 _0_Lucene50_0.doc
-rw-r--r-- 1 clark staff 1135071 Apr 26 13:35 _0_Lucene50_0.pos
-rw-r--r-- 1 clark staff 640875 Apr 26 13:35 _0_Lucene50_0.tim
-rw-r--r-- 1 clark staff 10640 Apr 26 13:35 _0_Lucene50_0.tip
-rw-r--r-- 1 clark staff 130 Apr 26 13:35 segments_1
-rw-r--r-- 1 clark staff 0 Apr 26 13:35 write.lock
The multifile index format gives us a bit more insight into how data in a segment is organized: the .fdt and .fdx files hold the stored fields and an index into them, .fnm holds the field infos, .nvd and .nvm hold norms, .doc and .pos hold postings and positions, and .tim and .tip hold the term dictionary and its index. The Lucene documentation has a full description of the file formats.
Elasticsearch
Download and run an elasticsearch node:
$ curl -L -O https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.4.tar.gz
$ tar xf elasticsearch-1.4.4.tar.gz
$ cd elasticsearch-1.4.4
$ ./bin/elasticsearch -d
By default the node will create or join a cluster named 'elasticsearch'. It uses multicast to discover other nodes in the cluster.
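The cluster name, like most settings, can be changed in config/elasticsearch.yml or on the command line; a sketch, assuming the 1.x-style --key value flags:
$ ./bin/elasticsearch -d --cluster.name mycluster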
Query the health of the cluster and list the indices:
$ curl 'localhost:9200/_cat/health?v'
$ curl 'localhost:9200/_cat/indices?v'
The health of the cluster will be green, yellow, or red: green if every shard, primary and replica, is allocated; yellow if all primary shards are allocated but some replicas are not, so all data is available but not all of it is redundant; and red if some primary shards are unallocated, meaning some data is unavailable.
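The same information is also available as JSON from the cluster health API:
$ curl 'localhost:9200/_cluster/health?pretty'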
Each index is divided into one or more shards. A shard must be small enough to fit on a single node. In addition, the index can have zero or more replicas. If the number of replicas is 1, then each shard will have a copy stored on a different node than the original.
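Shard and replica counts are set when an index is created (the shard count cannot be changed afterwards); a sketch, using a hypothetical index named 'test':
$ curl -XPUT 'localhost:9200/test?pretty' -d '{"settings": {"number_of_shards": 3, "number_of_replicas": 1}}'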
Create an index named 'books' and index a few book titles:
$ curl -XPUT 'localhost:9200/books?pretty'
$ curl -XPUT 'localhost:9200/books/external/1?pretty' -d '{"title": "Fear and Loathing in Las Vegas"}'
$ curl -XPUT 'localhost:9200/books/external/2?pretty' -d '{"title": "Confessions of an English Opium-Eater"}'
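Fetch a document back by id to confirm it was stored:
$ curl 'localhost:9200/books/external/1?pretty'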
Get the fields in an index:
$ curl 'localhost:9200/books/_mapping'
Search using GET or POST:
$ curl 'localhost:9200/books/_search?q=fear'
$ curl 'localhost:9200/books/_search' -d '{"query":{"match":{"title":"fear"}}}'
Only bring back selected fields from the documents:
$ curl http://localhost:9200/books/_search -d '{"fields": ["title"]}'
Searching with POST requests gives access to a more complete query language. Some of the top-level keys (a combined example follows the lists below):
- query
- size
- from
- sort
Partial list of possible subkeys of the query key:
- match
- match_all
- term
- terms
- range
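A sketch combining several of these keys in a single request body:
$ curl 'localhost:9200/books/_search?pretty' -d '{"query": {"match": {"title": "fear"}}, "size": 10, "from": 0, "sort": [{"_score": {"order": "desc"}}]}'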
Delete the books index:
$ curl -XDELETE 'http://localhost:9200/books/'