
Elasticsearch for Java Developers: Elasticsearch from the command line

This article is part of our Academy Course titled Elasticsearch Tutorial for Java Developers.

In this course, we provide a series of tutorials so that you can develop your own Elasticsearch based applications. We cover a wide range of topics, from installation and operations, to Java API Integration and reporting. With our straightforward tutorials, you will be able to get your own projects up and running in minimum time. Check it out here!

1. Introduction

From the previous part of the tutorial we have gained a pretty good understanding of what Elasticsearch is, its basic concepts, and the power of the search capabilities it can bring to our applications. In this section we are jumping right into battle and applying our knowledge in practice: curl and/or http (the HTTPie tool) will be the only tools we use to communicate with Elasticsearch.

To sum up, we have already finalized our book catalog index and mapping types, so we are going to pick it up from there. In order to keep things as close to reality as possible, we are going to use an Elasticsearch cluster with three nodes (all running as Docker containers), while the catalog index is going to be configured with a replication factor of two.

As we are going to see, working with an Elasticsearch cluster has quite a few subtleties compared to a standalone instance, and it is better to be prepared to deal with them. Hopefully, you still remember from the previous part of the tutorial how to start Elasticsearch, as this is going to be the only prerequisite: having the cluster up and running. With that, let us get started!

2. Is My Cluster Healthy?

The first thing you need to know about your Elasticsearch cluster before doing anything with it is its health. There are a couple of ways to gather this information, but arguably the easiest and most convenient one is by using the Cluster APIs, particularly the cluster health endpoint.

$ http http://localhost:9200/_cluster/health

HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked

{
    "active_primary_shards": 0,
    "active_shards": 0,
    "active_shards_percent_as_number": 100.0,
    "cluster_name": "es-catalog",
    "delayed_unassigned_shards": 0,
    "initializing_shards": 0,
    "number_of_data_nodes": 3,
    "number_of_in_flight_fetch": 0,
    "number_of_nodes": 3,
    "number_of_pending_tasks": 0,
    "relocating_shards": 0,
    "status": "green",
    "task_max_waiting_in_queue_millis": 0,
    "timed_out": false,
    "unassigned_shards": 0
}

Among these details we are looking for the status indicator, which should be set to green, meaning that all shards are allocated and the cluster is in good operational shape.

3. All About Indices

Our Elasticsearch cluster is all green and ready to rock. The next logical step would be to create the catalog index, with the mapping types and settings we have outlined before. But before doing that, let us check if there are any indices already created, this time using the Indices APIs.

$ http http://localhost:9200/_stats

HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked

{
    "_all": {
        "primaries": {},
        "total": {}
    },
    "_shards": {
        "failed": 0,
        "successful": 0,
        "total": 0
    },
    "indices": {}
}

As expected, our cluster has nothing in it yet and we are good to go with creating the index for our book catalog. As we know, Elasticsearch speaks JSON, but manipulating a more or less complex JSON document from the command line is somewhat cumbersome. Instead, let us store the catalog settings and mappings in a catalog-index.json document.

{ 
  "settings": {
    "index" : {
      "number_of_shards" : 5, 
      "number_of_replicas" : 2 
    }
  },
  "mappings": {
    "books": {
      "_source" : {
        "enabled": true
      },
      "properties": {
        "title": { "type": "text" },
        "categories" : {
          "type": "nested",
          "properties" : {
            "name": { "type": "text" }
          }
        },
        "publisher": { "type": "keyword" },
        "description": { "type": "text" },
        "published_date": { "type": "date" },
        "isbn": { "type": "keyword" },
        "rating": { "type": "byte" }
      }
    },
    "authors": {
      "properties": {
        "first_name": { "type": "keyword" },
        "last_name": { "type": "keyword" }
      },
      "_parent": {
        "type": "books"
      }
    }
  }
}

And use this document as the input to the create index API.

$ http PUT http://localhost:9200/catalog < catalog-index.json

HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked

{
    "acknowledged": true,
    "shards_acknowledged": true
}

A few words should be said about the usage of the acknowledged response property across most of the Elasticsearch APIs, especially the ones which apply mutations. In general, this value simply indicates whether the operation completed before the timeout (“true”) or may take effect sometime soon (“false”). We are going to see more examples of its usage in a different context later on.
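
As a quick illustration (a hedged sketch, using a hypothetical throwaway index name catalog-test purely for demonstration), most mutating endpoints accept an explicit timeout parameter; if the operation cannot be completed within that window, Elasticsearch responds with "acknowledged": false instead of failing outright:

$ http PUT http://localhost:9200/catalog-test?timeout=30s < catalog-index.json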

That is it, we have brought our catalog index live. To verify that, we can ask Elasticsearch to return the catalog index settings.

$ http http://localhost:9200/catalog/_settings

HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked

{
    "catalog": {
        "settings": {
            "index": {
                "creation_date": "1487428863824",
                "number_of_replicas": "2",
                "number_of_shards": "5",
                "provided_name": "catalog",
                "uuid": "-b63dCesROC5UawbHz8IYw",
                "version": {
                    "created": "5020099"
                }
            }
        }
    }
}

Awesome, exactly what we ordered. You might wonder how Elasticsearch would react if we tried to update the index settings by increasing the number of shards (as we know, not all index settings can be updated once the index has been created).

$ echo '{"index":{"number_of_shards":6}}' | http PUT http://localhost:9200/catalog/_settings

HTTP/1.1 400 Bad Request
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked

{
    "error": {
        "reason": "can't change the number of shards for an index",
        "root_cause": [
            ...
        ],
        "type": "illegal_argument_exception"
    },
    "status": 400
}

The error response comes as no surprise (please notice that the response details have been reduced for illustrative purposes only). Along with settings, it is very easy to get the mapping types for a particular index, for example:

$ http http://localhost:9200/catalog/_mapping

HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked

{
    "catalog": {
        "mappings": {
            "authors": {
                ...
            },
            "books": {
                ...
            }
        }
    }
}

By and large, the index mappings for existing fields cannot be updated; however, there are some exceptions to the rule. One of the greatest features of the indices APIs is the ability to run the analysis process against a particular index mapping type and field without actually indexing any documents.

$ http http://localhost:9200/catalog/_analyze field=books.title text="Elasticsearch: The Definitive Guide. A Distributed Real-Time Search and Analytics Engine"

HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked

{
    "tokens": [
        {
            "end_offset": 13,
            "position": 0,
            "start_offset": 0,
            "token": "elasticsearch",
            "type": ""
        },
        {
            "end_offset": 18,
            "position": 1,
            "start_offset": 15,
            "token": "the",
            "type": ""
        },
        
        ...

        {
            "end_offset": 88,
            "position": 11,
            "start_offset": 82,
            "token": "engine",
            "type": ""
        }
    ]
}

It is exceptionally useful when you would like to validate your mapping type parameters before throwing a bunch of data into Elasticsearch for indexing.

And last but not least, there is one important detail about index states. Any particular index can be in the opened (fully operational) or closed (blocked for read/write operations; archived would be a good analogy) state. As with everything else, Elasticsearch provides APIs for that.

$ http POST http://localhost:9200/catalog/_open

HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked

{
    "acknowledged": true
}
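
The opposite operation works the same way; a minimal sketch of closing our index would be the call below (keep in mind that a closed index rejects read and write requests until it is opened again):

$ http POST http://localhost:9200/catalog/_close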

4. Documents, More Documents, …

An empty index without documents is not very useful, so let us switch gears from the indices APIs to another great set, the document APIs. We are going to start exploring them using the simplest single document operations, relying on the following book.json document:

{
  "title": "Elasticsearch: The Definitive Guide. A Distributed Real-Time Search and Analytics Engine",
  "categories": [
      { "name": "analytics" },
      { "name": "search" },
      { "name": "database store" }
  ],
  "publisher": "O'Reilly",
  "description": "Whether you need full-text search or real-time analytics of structured data—or both—the Elasticsearch distributed search engine is an ideal way to put your data to work. This practical guide not only shows you how to search, analyze, and explore data with Elasticsearch, but also helps you deal with the complexities of human language, geolocation, and relationships.", 
  "published_date": "2015-02-07",
  "isbn": "978-1449358549",
  "rating": 4
}

Before sending this JSON to Elasticsearch, it would be great to talk a little bit about document identification. Each document in Elasticsearch has a unique identifier, stored in the special _id field. You may provide one while uploading the document to Elasticsearch (like we do in the example below, using isbn as it is a great example of a natural identifier), or it will be generated and assigned by Elasticsearch.

$ http PUT http://localhost:9200/catalog/books/978-1449358549 < book.json

HTTP/1.1 201 Created
Location: /catalog/books/978-1449358549
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked

{
    "_id": "978-1449358549",
    "_index": "catalog",
    "_shards": {
        "failed": 0,
        "successful": 3,
        "total": 3
    },
    "_type": "books",
    "_version": 1,
    "created": true,
    "result": "created"
}
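
Alternatively, if you prefer Elasticsearch to generate the identifier for you, a minimal sketch is to POST the document to the type endpoint instead; the generated _id comes back in the response:

$ http POST http://localhost:9200/catalog/books < book.json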

Our first document just made its way into the catalog index, under the books type. But we also have the authors type, which is in a parent / child relationship with books. Let us complement the book with its authors from the authors.json document.

[
  {
    "first_name": "Clinton",
    "last_name": "Gormley",
    "_parent": "978-1449358549"
  },
  {
    "first_name": "Zachary",
    "last_name": "Tong",
    "_parent": "978-1449358549"
  }
]

The book has more than one author, so we could still use the single document API by indexing each author document one by one. However, let us not do that, but instead switch over to the bulk document API and transform our authors.json document a bit to be compatible with the bulk API format.

{ "index" : { "_index" : "catalog", "_type" : "authors", "_id": "1", "_parent": "978-1449358549" } }
{ "first_name": "Clinton", "last_name": "Gormley" }
{ "index" : { "_index" : "catalog", "_type" : "authors", "_id": "2", "_parent": "978-1449358549" } }
{ "first_name": "Zachary", "last_name": "Tong" }

Done deal: let us save this document as authors-bulk.json and feed it directly into the bulk document API endpoint.

$ http POST http://localhost:9200/_bulk < authors-bulk.json

HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked

{
    "errors": false,
    "items": [
        {
            "index": {
                "_id": "1",
                "_index": "catalog",
                "_shards": {
                    "failed": 0,
                    "successful": 3,
                    "total": 3
                },
                "_type": "authors",
                "_version": 5,
                "created": false,
                "result": "updated",
                "status": 200
            }
        },
        {
            "index": {
                "_id": "2",
                "_index": "catalog",
                "_shards": {
                    "failed": 0,
                    "successful": 3,
                    "total": 3
                },
                "_type": "authors",
                "_version": 2,
                "created": true,
                "result": "created",
                "status": 201
            }
        }
    ],
    "took": 105
}

And we have book and author documents as the first citizens of the catalog index! It is time to fetch those documents back.

$ http http://localhost:9200/catalog/books/978-1449358549

HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked

{
    "_id": "978-1449358549",
    "_index": "catalog",
    "_source": {
        "categories": [
            { "name": "analytics" },
            { "name": "search"},
            { "name": "database store" }
        ],
        "description": "...",
        "isbn": "978-1449358549",
        "published_date": "2015-02-07",
        "publisher": "O'Reilly",
        "rating": 4,
        "title": "Elasticsearch: The Definitive Guide. A Distributed Real-Time Search and Analytics Engine"
    },
    "_type": "books",
    "_version": 1,
    "found": true
}

Easy! However, to fetch documents from the authors collection, which are children of their respective documents in the books collection, we have to supply the parent identifier along with the document's own one, for example:

$ http http://localhost:9200/catalog/authors/1?parent=978-1449358549

HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked

{
    "_id": "1",
    "_index": "catalog",
    "_parent": "978-1449358549",
    "_routing": "978-1449358549",
    "_source": {
        "first_name": "Clinton",
        "last_name": "Gormley"
    },
    "_type": "authors",
    "_version": 1,
    "found": true
}

This is one of the specifics of working with parent / child relations in Elasticsearch. As already mentioned, you may model such relationships in a simpler way, but our goal is to learn how to deal with them if you choose to go this route in your applications.

The delete and update APIs are pretty straightforward, so we will just skim over them (minimal sketches follow below); please notice that the same rules regarding identifying child documents apply. You may be surprised, but deleting a parent document does not automatically delete its children, so keep that in mind. We are going to see how to work around that a bit later.
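
For completeness, here are minimal sketches of both operations (the new rating value is purely illustrative):

$ echo '{"doc": {"rating": 5}}' | http POST http://localhost:9200/catalog/books/978-1449358549/_update

$ http DELETE http://localhost:9200/catalog/authors/2?parent=978-1449358549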

To finish up, let us take a look at the term vectors API, which returns details and statistics about the terms in the fields of a document, for example (only a small part of the response is shown):

$ http http://localhost:9200/catalog/books/978-1449358549/_termvectors?fields=description

HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked

{
    "_id": "978-1449358549",
    "_index": "catalog",
    "_type": "books",
    "_version": 1,
    "found": true,
    "term_vectors": {
        "description": {
            "field_statistics": {
                "doc_count": 1,
                "sum_doc_freq": 46,
                "sum_ttf": 60
            },
            "terms": {
                "analyze": {
                    "term_freq": 1,
                    "tokens": [ ... ]
                },
                "and": {
                    "term_freq": 2,
                    "tokens": [ ... ]

                },
                "complexities": {
                    "term_freq": 1,
                    "tokens": [ ... ]

                },
                "data": {
                    "term_freq": 3,
                    "tokens": [ ... ]

                },
                ...
            }
        }
    },
    "took": 5
}

You may not find yourself using the term vectors API often; however, it is a terrific tool for troubleshooting why certain documents do not pop up in the search results.

5. What if My Mapping Types Are Suboptimal?

Over time you may discover that your mapping types are not optimal and could be made better. However, Elasticsearch supports only limited modifications to existing mapping types. Luckily, it provides a dedicated reindex API, for example:

$ echo '{"source": {"index": "catalog"}, "dest": {"index": "catalog-v2"}}' | http POST http://localhost:9200/_reindex

HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked

{
    "batches": 0,
    "created": 200,
    "deleted": 0,
    "failures": [],
    "noops": 0,
    "requests_per_second": -1.0,
    "retries": {
        "bulk": 0,
        "search": 0
    },
    "throttled_millis": 0,
    "throttled_until_millis": 0,
    "timed_out": false,
    "took": 265,
    "total": 200,
    "updated": 0,
    "version_conflicts": 0
}

The trick here is to create a new index with the updated mapping types, catalog-v2, then ask Elasticsearch to fetch all documents from the old index (catalog) and put them into the new one (catalog-v2), and finally swap the indices. Please notice that it works not only for local but for remote indices as well (see the sketch below).
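
For example, a hedged sketch of reindexing from a remote cluster might look like the snippet below (the host is hypothetical, and the remote address has to be whitelisted via the reindex.remote.whitelist setting in elasticsearch.yml):

$ echo '{"source": {"remote": {"host": "http://otherhost:9200"}, "index": "catalog"}, "dest": {"index": "catalog-v2"}}' | http POST http://localhost:9200/_reindex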

Although simple, this API is still considered experimental and may not be suitable in all cases, for example if your index is really massive or your Elasticsearch cluster is experiencing high load and should prioritize application requests.

6. The Search Time

We have learned how to create indices and mapping types, and how to index documents: all important, but not really exciting, topics. But search is definitely the heart and soul of Elasticsearch, so let us get to know it right away.

In order to demonstrate different search features, we will need a couple more documents; please upload them to your Elasticsearch cluster from books-and-authors-bulk.json using our friend, the bulk document API.

$ http POST http://localhost:9200/_bulk < books-and-authors-bulk.json

Having a few documents in our collections, we can start issuing search queries against them using the most accessible form of the search API, which accepts search criteria in the URI by means of a query string. For example, let us search for the term engine (keeping the phrase search engine in mind).

$ http POST http://localhost:9200/catalog/books/_search?q=engine

HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked

{
    "_shards": {
        "failed": 0,
        "successful": 5,
        "total": 5
    },
    "hits": {
        "hits": [
            {
                "_id": "978-1449358549",
                "_index": "catalog",
                "_score": 0.7503276,
                "_source": {
                    "categories": [
                        { "name": "analytics },
                        { "name": "search" },
                        { "name": "database store" }
                    ],
                    "description": " Whether you need full-text search or real-time ...",
                    "isbn": "978-1449358549",
                    "published_date": "2015-02-07",
                    "publisher": "O'Reilly",
                    "rating": 4,
                    "title": " Elasticsearch: The Definitive Guide. ..."
                },
                "_type": "books"
            }
        ],
        "max_score": 0.7503276,
        "total": 1
    },
    "timed_out": false,
    "took": 22
}

A good starting point indeed; this API is quite useful for quick and shallow searches, but its capabilities are very limited. The request body search API is a very different beast and reveals the full power of Elasticsearch. It is built on top of the JSON-based Query DSL, a concise and intuitive language for constructing arbitrarily complex search queries.

There are quite a few query types which the Query DSL can describe, each with its own syntax and parameters. However, there is a set of common parameters, like sort, from, size and stored_fields (the list is actually really long), which are agnostic to the query type and can be applied to any of them (a small sketch follows).
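
For instance, a request body sketch combining a few of these common parameters (pagination plus sorting by our published_date field, on top of the match all query introduced below) might look like this:

{
    "from": 0,
    "size": 10,
    "sort": [ { "published_date": { "order": "desc" } } ],
    "query": { "match_all": {} }
}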

For the next couple of sections we are going to switch over from http to curl as the latter is a bit more convenient when dealing with JSON payloads.

The first query type we are going to try out using the Query DSL is the match all query. To some extent, it is not really a query because it just matches all documents. As such, it may return a lot of results, so as a general rule, please always annotate your queries with a reasonable size limit. Here is an example:

$ curl -i http://localhost:9200/catalog/books/_search?pretty -d '
{
    "size": 10,
    "query": {
        "match_all" : {
        }
    }
}'

HTTP/1.1 200 OK
content-type: application/json; charset=UTF-8
content-length: 3112
{
  "took" : 13,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "catalog",
        "_type" : "books",
        "_id" : "978-1449358549",
        "_score" : 1.0,
        "_source" : {
          "title" : "Elasticsearch: The Definitive Guide ...",
          "categories" : [
            { "name" : "analytics" },
            { "name" : "search" },
            { "name" : "database store" }
          ],
          "publisher" : "O'Reilly",
          "description" : "Whether you need full-text ...",
          "published_date" : "2015-02-07",
          "isbn" : "978-1449358549",
          "rating" : 4
        }
      },
      ...
    ]
  }
}

The next one is a real query type and belongs to the class of full text queries, which search against full text document fields (probably the most widely used ones). In its basic form it matches against a single document field, for example the book’s description.

$ curl -i http://localhost:9200/catalog/books/_search?pretty -d '
{
    "query": {
        "match" : {
            "description" : "engine"
        }
    }
}'

HTTP/1.1 200 OK
content-type: application/json; charset=UTF-8
content-length: 1271
{
  "took" : 17,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.28004453,
    "hits" : [
      {
        "_index" : "catalog",
        "_type" : "books",
        "_id" : "978-1449358549",
        "_score" : 0.28004453,
        "_source" : {
          "title" : "Elasticsearch: The Definitive Guide. ...",
          "categories" : [
            { "name" : "analytics" },
            { "name" : "search" },
            { "name" : "database store" }
          ],
          "publisher" : "O'Reilly",
          "description" : "Whether you need full-text ...",
          "published_date" : "2015-02-07",
          "isbn" : "978-1449358549",
          "rating" : 4
        }
      }
    ]
  }
}

But the full text queries are very powerful and have quite a few other variations, including match_phrase, match_phrase_prefix, multi_match, common_terms, query_string and simple_query_string.

Moving on, we are entering the world of term level queries, which operate on exact terms and are usually used for field types like numbers, dates, and keywords. The book's publisher field is a good candidate to try them out.

$ curl -i http://localhost:9200/catalog/books/_search?pretty -d '
{
   "size": 10,
   "_source": [ "title" ],
   "query": {
        "term" : {
            "publisher" : "Manning"
        }
    }
}'  

HTTP/1.1 200 OK
content-type: application/json; charset=UTF-8
content-length: 675

{
  "took" : 21,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.18232156,
    "hits" : [
      {
        "_index" : "catalog",
        "_type" : "books",
        "_id" : "978-1617291623",
        "_score" : 0.18232156,
        "_source" : {
          "title" : "Elasticsearch in Action"
        }
      },
      {
        "_index" : "catalog",
        "_type" : "books",
        "_id" : "978-1617292774",
        "_score" : 0.18232156,
        "_source" : {
          "title" : "Relevant Search: With applications ..."
        }
      }
    ]
  }
}

Please notice how we have limited the document's _source to return the title field only. The other variations of term level queries include terms, range, exists, prefix, wildcard, regexp, fuzzy, type and ids (see the range sketch below).
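
As one more illustration, a minimal sketch of the range variation, matching only highly rated books, could look like this:

$ curl -i http://localhost:9200/catalog/books/_search?pretty -d '
{
   "size": 10,
   "_source": [ "title", "rating" ],
   "query": {
        "range" : {
            "rating" : { "gte" : 4 }
        }
    }
}'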

Joining queries are exceptionally interesting in the context of our book catalog index. These queries allow searching against nested objects or documents with a parent/child relationship. For example, let us find all the books in the analytics category.

$ curl -i http://localhost:9200/catalog/books/_search?pretty -d '
{
   "size": 10,
   "_source": [ "title", "categories" ],
   "query": {
        "nested": {
            "path": "categories",
            "query" : {
                "match": {
                    "categories.name" : "analytics"
                }
            }
       }
    }
}'

HTTP/1.1 200 OK
content-type: application/json; charset=UTF-8
content-length: 1177

{
  "took" : 45,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.3112576,
    "hits" : [
      {
        "_index" : "catalog",
        "_type" : "books",
        "_id" : "978-1617291623",
        "_score" : 1.3112576,
        "_source" : {
          "categories" : [
            { "name" : "analytics" },
            { "name" : "search" },
            { "name" : "database store" }
          ],
          "title" : "Elasticsearch in Action"
        }
      },
      {
        "_index" : "catalog",
        "_type" : "books",
        "_id" : "978-1449358549",
        "_score" : 1.0925692,
        "_source" : {
          "categories" : [
            { "name" : "analytics" },
            { "name" : "search" },
            { "name" : "database store" }
          ],
          "title" : "Elasticsearch: The Definitive Guide ..."
        }
      }
    ]
  }
} 

Similarly, we could have searched for all the books authored by Clinton Gormley, leveraging the parent/child relationship between the books and authors collections.

$ curl -i http://localhost:9200/catalog/books/_search?pretty -d '
{
   "size": 10,
   "_source": [ "title" ],
   "query": {
       "has_child" : {
            "type" : "authors",
            "inner_hits" : {
                "size": 5
            },
            "query" : {
                "term" : {
                    "last_name" : "Gormley"
                }
            }
        }
    }
}'

HTTP/1.1 200 OK
content-type: application/json; charset=UTF-8
content-length: 1084

{
  "took" : 38,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "catalog",
        "_type" : "books",
        "_id" : "978-1449358549",
        "_score" : 1.0,
        "_source" : {
          "title" : "Elasticsearch: The Definitive Guide ..."
        },
        "inner_hits" : {
          "authors" : {
            "hits" : {
              "total" : 1,
              "max_score" : 0.6931472,
              "hits" : [
                {
                  "_type" : "authors",
                  "_id" : "1",
                  "_score" : 0.6931472,
                  "_routing" : "978-1449358549",
                  "_parent" : "978-1449358549",
                  "_source" : {
                    "first_name" : "Clinton",
                    "last_name" : "Gormley"
                  }
                }
              ]
            }
          }
        }
      }
    ]
  }
}

Please pay attention to the presence of the inner_hits query parameter, which lets the search results include the inner documents that matched the joining criteria.

The other query types, like geo queries, specialized queries and span queries, work in a very similar way, so we will just skip over them and finish up by looking into compound queries. The examples we have seen so far included queries with only one search criterion, but the Query DSL has a way to construct compound queries as well. Let us take a look at an example of the bool query, which is a composition of some of the query types we have already seen.

$ curl -i http://localhost:9200/catalog/books/_search?pretty -d '
{
   "size": 10,
   "_source": [ "title", "publisher" ],
   "query": {
       "bool" : {
          "must" : [
              {
                  "range" : {
                      "rating" : { "gte" : 4 }
                  }
              },
              {
                  "has_child" : {
                      "type" : "authors",
                      "query" : {
                          "term" : {
                              "last_name" : "Gormley"
                          }
                      }
                  }
              },
              {
                  "nested": {
                      "path": "categories",
                      "query" : {
                          "match": {
                              "categories.name" : "search"
                          }
                      }
                  }
              }
          ]
       }
    }
}'

HTTP/1.1 200 OK
content-type: application/json; charset=UTF-8
content-length: 531

{
  "took" : 79,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 3.0925694,
    "hits" : [
      {
        "_index" : "catalog",
        "_type" : "books",
        "_id" : "978-1449358549",
        "_score" : 3.0925694,
        "_source" : {
          "publisher" : "O'Reilly",
          "title" : "Elasticsearch: The Definitive Guide.  ..."
        }
      }
    ]
  }
}

It would be fair to say that the Elasticsearch search API, powered by the Query DSL, is exceptionally flexible, easy to use and expressive. Moreover, it is worth mentioning that in addition to queries, the search API also supports the concept of filters, which offer yet another option to exclude documents from the search results.

7. Mutations by Query

Surprisingly (or not), queries can be used by Elasticsearch to perform mutations like updates or deletes over the documents in the index. For example, the following snippet will remove from our catalog all the books published by Manning which have a low rating.

$ curl -i http://localhost:9200/catalog/books/_delete_by_query?pretty -d '
{
   "query": {
      "bool": {
          "must": [
              { "range" : { "rating" : { "lt" : 3 } } }
          ],
          "filter": [
             { "term" :  { "publisher" : "Manning" } }
          ]
      }
   }
}'

HTTP/1.1 200 OK
content-type: application/json; charset=UTF-8
content-length: 296

{
  "took" : 12,
  "timed_out" : false,
  "total" : 0,
  "deleted" : 0,
  "batches" : 0,
  "version_conflicts" : 0,
  "noops" : 0,
  "retries" : {
    "bulk" : 0,
    "search" : 0
  },
  "throttled_millis" : 0,
  "requests_per_second" : -1.0,
  "throttled_until_millis" : 0,
  "failures" : [ ]
}

It uses the same Query DSL and, to illustrate how filtering can be used, includes a filter as part of the query. But instead of returning the matched documents, the update or delete modifications are applied to them.

The delete by query API can be used to overcome the limitations of the regular delete API and remove the child documents when their parent is deleted (a sketch follows).
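
For example, a hedged sketch of removing all the authors of a given book, using the parent_id query, might look like this:

$ curl -i http://localhost:9200/catalog/authors/_delete_by_query?pretty -d '
{
   "query": {
        "parent_id" : {
            "type" : "authors",
            "id" : "978-1449358549"
        }
    }
}'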

8. Know Your Queries Better

Sometimes you may find out that your search queries are returning documents in an order you don’t expect, ranking some documents higher than others. To help you out, Elasticsearch provides two very helpful APIs. One of these is the explain API, which computes a score explanation for a query (and for a specific document if needed).

The explanation can be obtained by specifying the explain parameter as part of the query:

$ curl -i http://localhost:9200/catalog/books/_search?pretty -d '
{
   "size": 10,
   "explain": true,
   "query": {
        "term" : {
            "publisher" : "Manning"
        }
    }
}'

Or by using the dedicated explain API endpoint with a specific document, for example:

$ curl -i http://localhost:9200/catalog/books/978-1617292774/_explain?pretty -d '
{
   "query": {
        "term" : {
            "publisher" : "Manning"
        }
    }
}'

The responses have been intentionally omitted, as tons of useful details are returned. Another very useful feature of Elasticsearch is the validate API, which allows validating a query without actually executing it, for example:

$ curl -i http://localhost:9200/catalog/books/_validate/query?pretty -d ' {
   "query": {
        "term" : {
            "publisher" : "Manning"
        }
    }                            
}'

HTTP/1.1 200 OK
content-type: application/json; charset=UTF-8
content-length: 98

{
  "valid" : true,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  }
}
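
If a query turns out to be invalid, adding the explain parameter to the validation request (a minimal sketch below) asks Elasticsearch to include the reason in the response:

$ curl -i "http://localhost:9200/catalog/books/_validate/query?explain=true&pretty" -d '
{
   "query": {
        "term" : {
            "publisher" : "Manning"
        }
    }
}'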

Both APIs are very useful for troubleshooting relevance or analyzing a potentially impactful search query without executing it on a live Elasticsearch cluster.

9. From Search to Insights

Often you may find yourself in a situation where search alone is just not enough and you need some kind of aggregation on top of the matches. A great example would be faceting (or, as Elasticsearch calls it, the terms aggregation), where the search results are grouped into buckets.

$ curl -i http://localhost:9200/catalog/books/_search?pretty -d '
{
   "query": {
        "match" : {
            "description" : "elasticsearch"
        }
    },
    "aggs" : {
        "publisher" : {
            "terms" : { "field" : "publisher" }
        }
    }
}'

HTTP/1.1 200 OK
content-type: application/json; charset=UTF-8
content-length: 3447

{
  "took" : 176,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 0.38828257,
    "hits" : [
      {
          ...
      }
    ]
  },
  "aggregations" : {
    "publisher" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Manning",
          "doc_count" : 2
        },
        {
          "key" : "O'Reilly",
          "doc_count" : 1
        }
      ]
    }
  }
}

In this example, along with the search query, we have asked Elasticsearch to count the documents by publisher. By and large, the search query can be omitted completely and only the aggregations sent in the request body, for example:

$ curl -i http://localhost:9200/catalog/books/_search?pretty -d '
{
  "aggs" : {
    "authors": {
      "children": {
        "type" : "authors"
      },
      "aggs": {
        "top-authors": {
          "terms": {
            "script" : {
              "inline": "doc['first_name'].value + ' ' + doc['last_name'].value",
              "lang": "painless"
            },
            "size": 10
          }
        }
      }
    }
  }
}'

HTTP/1.1 200 OK
content-type: application/json; charset=UTF-8
content-length: 1031
{
  "took": 381,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 1,
    "hits": [
      ...
    ]
  },
  "aggregations": {
    "authors": {
      "doc_count": 6,
      "top-authors": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": "Clinton Gormley",
            "doc_count": 1
          },
          {
            "key": "Doug Turnbull",
            "doc_count": 1
          },
          {
            "key": "Matthew Lee Hinman",
            "doc_count": 1
          },
          {
            "key": "Radu Gheorghe",
            "doc_count": 1
          },
          {
            "key": "Roy Russo",
            "doc_count": 1
          },
          {
            "key": "Zachary Tong",
            "doc_count": 1
          }
        ]
      }
    }
  }
}

In this slightly more complicated example, we have bucketed over the top authors, using Elasticsearch's scripting support to compose the terms out of each author's first and last name:

"script" : {                           
  "inline": "doc['first_name'].value + ' ' + doc['last_name'].value",  
  "lang": "painless"                 
}

The list of supported aggregations is really impressive and includes bucket aggregations (some of which we have tried out already), metrics aggregations, pipeline aggregations, and matrix aggregations (a small metrics sketch follows). Covering just one class of those would require its own tutorial, so please check them out to understand the purpose of each one in depth.
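
As a tiny taste of the metrics class, a minimal sketch computing the average book rating across the whole catalog (with size set to 0 to skip the hits themselves) might look like this:

$ curl -i http://localhost:9200/catalog/books/_search?pretty -d '
{
   "size": 0,
   "aggs" : {
       "avg_rating" : {
           "avg" : { "field" : "rating" }
       }
   }
}'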

10. Watch Your Cluster Breathing

Elasticsearch clusters are living “creatures” and should be watched and monitored closely in order to proactively spot any issues and react to them quickly. The cluster health endpoint we have seen before is the easiest way to get an overall, high-level status of the cluster.

$ http http://localhost:9200/_cluster/health

HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked

{
    "active_primary_shards": 5,
    "active_shards": 5,
    "active_shards_percent_as_number": 20.0,
    "cluster_name": "es-catalog",
    "delayed_unassigned_shards": 0,
    "initializing_shards": 0,
    "number_of_data_nodes": 1,
    "number_of_in_flight_fetch": 0,
    "number_of_nodes": 1,
    "number_of_pending_tasks": 0,
    "relocating_shards": 0,
    "status": "red",
    "task_max_waiting_in_queue_millis": 0,
    "timed_out": false,
    "unassigned_shards": 20
}

If your cluster goes red (like in the example above), there is certainly a problem to fix. To help you out, Elasticsearch has the cluster statistics API, cluster state API, cluster node level statistics API and cluster node indices statistics API (see the drill-down sketch below).
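
A quick way to drill down to the offending index is to ask the health endpoint for per-index (or even per-shard) details via the level parameter, for example:

$ http http://localhost:9200/_cluster/health?level=indices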

Somewhat apart stands another exceptionally important group of APIs, the cat APIs. They are different in the sense that the representation is not JSON but rather text based, with compact and aligned output suitable for terminals.
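
For example, listing all the indices along with their health, shard counts and sizes is a single call (the v parameter adds column headers):

$ http http://localhost:9200/_cat/indices?v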

11. Conclusions

In this section of the tutorial we went through many features of Elasticsearch by exploring its RESTful APIs using command line tools only. By and large, it is just a tiny part of what Elasticsearch offers through its APIs, and the official documentation is a great place to learn them all. Hopefully, at this point we are comfortable enough with Elasticsearch and know how to work with it.

12. What’s next

In the next part of the tutorial we are going to learn the several flavors of native APIs which Elasticsearch has to offer to Java/JVM developers. These APIs are the essential building blocks of any Java/JVM application which leverages Elasticsearch capabilities.

Andrey Redko

Andriy is a well-grounded software developer with more than 12 years of practical experience using Java/EE, C#/.NET, C++, Groovy, Ruby, functional programming (Scala), databases (MySQL, PostgreSQL, Oracle) and NoSQL solutions (MongoDB, Redis).