Elasticsearch – Searching and Querying

Searching with Request URI

The query DSL is the most common way to search, but simple queries can also be passed directly in the request URI. To do so, send a GET request to the _search API and specify the query in the query-string parameter named q.

For example, a query on the name field can check whether any document’s name contains the term lobster.

The matching documents are sorted by relevance, so the first document has a higher relevance score than the documents further down.

Searching for tags containing the word “meat” works the same way. Elasticsearch is clever enough to handle this even though the field is an array of strings. Boolean logic, such as AND and OR, can also be used within a search query.

To run such a query with curl, the query has to be URL-encoded, because a request containing raw spaces will throw an error.
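As a minimal sketch of the URL encoding described above, the snippet below builds request-URI search URLs with Python's standard library. The host `http://localhost:9200`, the index name `product`, and the example field values are assumptions made for illustration only.

```python
from urllib.parse import quote


def build_search_url(base_url: str, index: str, query: str) -> str:
    """Return a request-URI search URL with the q parameter percent-encoded."""
    # safe="" makes quote() encode every reserved character,
    # including ':' and the spaces around AND.
    return f"{base_url}/{index}/_search?q={quote(query, safe='')}"


# Hypothetical host and index name; adjust to your own cluster.
BASE_URL = "http://localhost:9200"
INDEX = "product"

simple_url = build_search_url(BASE_URL, INDEX, "name:lobster")
boolean_url = build_search_url(BASE_URL, INDEX, "tags:Meat AND name:Tuna")

print(simple_url)
print(boolean_url)
```

The resulting string can be passed to curl as is, since it no longer contains raw spaces.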

Query DSL

With the query DSL, the search query is specified in a JSON object in the request body instead of in the request URI. There are two main groups of queries in the DSL: leaf queries and compound queries. A leaf query searches for values within a particular field, while a compound query consists of multiple leaf queries or other compound queries. Compound queries are therefore recursive in nature: other compound queries can be nested inside a single compound query.
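To make the leaf and compound distinction concrete, here is a small sketch that assembles a nested query body as a Python dictionary. The field names (`name`, `tags`, `in_stock`) are hypothetical. The `bool` query is used as the compound query because it is a common one; the text above does not name a specific compound query.

```python
import json

# Leaf queries: each one looks for values in a single field.
# The field names are hypothetical.
match_name = {"match": {"name": "lobster"}}
term_tag = {"term": {"tags": "meat"}}
range_stock = {"range": {"in_stock": {"gte": 1}}}

# A compound query that combines two leaf queries.
inner_bool = {"bool": {"should": [term_tag, range_stock]}}

# Compound queries can be nested: the outer bool query contains
# one leaf query and one compound query.
body = {"query": {"bool": {"must": [match_name, inner_bool]}}}

print(json.dumps(body, indent=2))
```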

The basic structure of a DSL query is simple. The query goes inside a JSON object, under a key named query. The general pattern for defining a search is to specify the name of the query as a key and the query definition as its value. The simplest case is match_all, which matches every document.
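The structure described above can be sketched in a few lines. The helper `search_body` is a hypothetical name used only for this illustration. It wraps a query name and its definition under the `query` key.

```python
import json


def search_body(query_name: str, definition: dict) -> dict:
    """Wrap a query definition as {"query": {<query name>: <definition>}}."""
    return {"query": {query_name: definition}}


# match_all takes an empty definition and matches every document.
body = search_body("match_all", {})
serialized = json.dumps(body)

print(serialized)
```

The serialized string is what would be sent as the request body to the _search API.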

Understanding Query Results

The response to a match_all query has several parts. First, there is a “took” property, an integer representing the number of milliseconds the query took to execute. Next is “timed_out”, a boolean flag indicating whether or not the search request timed out. The “_shards” object contains the total number of shards that were searched, along with the number of shards that succeeded, failed, or were skipped.
Next is “hits”, which contains the search results. Within this object there is a “total” property containing the total number of documents that match the search criteria. The hits object also contains a “hits” property of its own, which is an array of the matched documents. By default the first 10 documents are returned, but this can be changed.
Each matching document has a “_score” property, a number indicating how well the document matched the search query. Within the hits object there is also a “max_score” property, which contains the highest score of any matched document. By default, matches are sorted by their relevance score.
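The sketch below walks through the properties just described using a made-up response. The document IDs, names, and numbers are invented for illustration and do not come from a real cluster. It assumes “total” is a plain integer, as described above; newer Elasticsearch versions return it as an object instead, so the access would need adjusting there.

```python
# A made-up response in the shape described above (not real output).
response = {
    "took": 3,
    "timed_out": False,
    "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
    "hits": {
        "total": 2,
        "max_score": 1.0,
        "hits": [
            {"_index": "product", "_id": "1", "_score": 1.0,
             "_source": {"name": "Lobster - Live"}},
            {"_index": "product", "_id": "2", "_score": 1.0,
             "_source": {"name": "Lobster - Cooked"}},
        ],
    },
}


def summarize(resp: dict) -> dict:
    """Pull the most useful fields out of a search response."""
    hits = resp["hits"]
    return {
        "took_ms": resp["took"],
        "timed_out": resp["timed_out"],
        "failed_shards": resp["_shards"]["failed"],
        "total": hits["total"],
        "max_score": hits["max_score"],
        # The inner "hits" array holds the matched documents themselves.
        "ids": [doc["_id"] for doc in hits["hits"]],
    }


summary = summarize(response)
print(summary)
```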

Understanding Relevance Scores

Elasticsearch originally scored relevance with the TF/IDF (Term Frequency/Inverse Document Frequency) algorithm. More recent versions use Okapi BM25 instead. Let’s look at how the older algorithm works first, and then see where the two differ.

Term Frequency (TF) is simply how many times the term appears in the field for a given document. The more times the term appears, the more relevant the document is, at least for that term.
Inverse Document Frequency (IDF) refers to how often the term appears within the index. The more often the term appears, the lower the score and relevance. The logic here is that if a term appears in many documents, it has a lower weight. In other words, words that appear many times are less significant.

Field-length norm refers to how long the field is. The longer the field, the less likely a given word within it is to be relevant. A term appearing in a short field therefore carries more weight than the same term appearing in a long field.
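The three factors can be combined into a toy scoring function. This is a simplified sketch loosely modeled on the classic Lucene-style formulas (square root of the frequency, a logarithmic IDF, and an inverse square root length norm). The exact formulas and the additional factors used by Elasticsearch differ by version, so treat the numbers as illustrative only.

```python
import math


def tf(freq: int) -> float:
    """Term frequency: more occurrences give a higher value."""
    return math.sqrt(freq)


def idf(num_docs: int, doc_freq: int) -> float:
    """Inverse document frequency: terms found in many documents weigh less."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_length_norm(num_terms: int) -> float:
    """Field-length norm: longer fields give a lower value."""
    return 1.0 / math.sqrt(num_terms)


def score(freq: int, num_docs: int, doc_freq: int, num_terms: int) -> float:
    """Toy relevance score: the product of the three factors."""
    return tf(freq) * idf(num_docs, doc_freq) * field_length_norm(num_terms)


# Same term frequency and field length, but a rare term versus a common term.
rare = score(freq=1, num_docs=1000, doc_freq=9, num_terms=4)
common = score(freq=1, num_docs=1000, doc_freq=499, num_terms=4)

# Same term, but a short field versus a long field.
short_field = score(freq=1, num_docs=1000, doc_freq=9, num_terms=4)
long_field = score(freq=1, num_docs=1000, doc_freq=9, num_terms=100)

print(rare > common, short_field > long_field)
```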

BM25 is better at handling stop words, which are very common words that appear many times in a document. BM25 uses nonlinear term frequency saturation: there is an upper limit on how much a term can boost the score based on how many times it occurs.
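The saturation effect is easiest to see with numbers. The sketch below compares a square root term frequency, used here as an assumed stand-in for the older behavior, with the term-frequency component of the textbook BM25 formula. The parameters k1 = 1.2 and b = 0.75 are the commonly used defaults, and the field lengths are made up.

```python
import math

K1 = 1.2   # controls how quickly term frequency saturates
B = 0.75   # controls how strongly field length is normalized


def classic_tf(freq: int) -> float:
    """Square-root term frequency: keeps growing without bound."""
    return math.sqrt(freq)


def bm25_tf(freq: int, field_len: float, avg_field_len: float,
            k1: float = K1, b: float = B) -> float:
    """Term-frequency component of textbook BM25: approaches k1 + 1."""
    length_norm = 1 - b + b * field_len / avg_field_len
    return freq * (k1 + 1) / (freq + k1 * length_norm)


# Field of average length, so the length normalization factor is 1.
for freq in (1, 2, 5, 10, 100, 1000):
    print(freq, round(classic_tf(freq), 2), round(bm25_tf(freq, 10, 10), 2))
```

With these settings the BM25 column never exceeds k1 + 1 = 2.2, while the square root column keeps growing. A word repeated hundreds of times therefore cannot dominate the score.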

Debugging Unexpected Search

Developers often need to find out why a document does or does not match a search query, and the _explain API helps with exactly that. Send a GET request with the index name, type, and document ID followed by _explain, along with the query in the request body. This is an extremely useful way of debugging.
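As a rough sketch, the request described above can be assembled like this. The index `product`, type `default`, ID `1`, and the `term` query are hypothetical values. The path follows the index/type/id layout described in the text; newer Elasticsearch versions without mapping types use a different path, so check the documentation for your version.

```python
import json


def explain_request(index: str, doc_type: str, doc_id: str, query: dict) -> dict:
    """Describe an _explain request for a single document and query."""
    return {
        "method": "GET",
        # index, type, and id, followed by the _explain endpoint
        "path": f"/{index}/{doc_type}/{doc_id}/_explain",
        "body": {"query": query},
    }


# Hypothetical index, type, id, and query.
request = explain_request("product", "default", "1",
                          {"term": {"name": "lobster"}})

print(request["method"], request["path"])
print(json.dumps(request["body"]))
```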

Full-text Query vs Term Query

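A query clause can be executed in two different contexts: a query context or a filter context. In a query context the clause contributes to the relevance score, while in a filter context it only decides whether a document is included.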
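One common place where both contexts appear side by side is the `bool` query. Clauses under `must` run in query context and affect the score, while clauses under `filter` run in filter context and only include or exclude documents. The helper name `scored_and_filtered` and the field names are assumptions made for this sketch.

```python
import json


def scored_and_filtered(text: str, tag: str) -> dict:
    """Build a bool query with one scored clause and one filter clause."""
    return {
        "query": {
            "bool": {
                # Query context: how well does the document match?
                "must": [{"match": {"name": text}}],
                # Filter context: does the document match, yes or no?
                "filter": [{"term": {"tags": tag}}],
            }
        }
    }


body = scored_and_filtered("lobster", "meat")
print(json.dumps(body, indent=2))
```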

Consider three queries: a term query for “lobster” in lowercase, a term query for “Lobster” with a capital first letter, and a full-text query for “Lobster”. Running them gives different results. The only difference between the first and second queries is the capital letter, yet the second query returns nothing, while the third, full-text query does return results. So why does the second query produce no output?

The reason is a major difference between term-level queries and full-text queries. Term-level queries search for exact values, because they look up the term in the inverted index rather than in the document itself. When we search for lobster in all lowercase, the term is looked up in the inverted index and a match is found, because the document was processed by the standard analyzer when it was indexed. Remember that the standard analyzer includes a lowercase filter, which is why the term is stored in all lowercase in the inverted index. Since the capitalized Lobster in the document was analyzed, the stored term is an exact match for the lowercase term we searched for.

The same is not true of the second query, where we search for the capitalized term Lobster. The term is stored in lowercase in the inverted index, and term-level queries look for exact matches, so the casing of the letters matters.

So what about the third query?

It matches because, unlike term-level queries, full-text queries are analyzed. The search text goes through analysis as well, so the capitalized Lobster is converted to lowercase before it is matched against the inverted index.

So please don’t use term-level queries for full-text searches.
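The behavior of the three queries can be reproduced with a toy inverted index. This is only a sketch: the `analyze` function is a crude stand-in for the standard analyzer (it just splits on whitespace and lowercases), and the documents are made up.

```python
from collections import defaultdict


def analyze(text: str) -> list:
    """Crude stand-in for the standard analyzer: split and lowercase."""
    return [token.lower() for token in text.split()]


def build_index(docs: dict) -> dict:
    """Map each analyzed term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def term_query(index: dict, value: str) -> list:
    """Term-level query: look up the value exactly as given, no analysis."""
    return sorted(index.get(value, set()))


def match_query(index: dict, text: str) -> list:
    """Full-text query: analyze the search text first, then look up each term."""
    result = set()
    for term in analyze(text):
        result |= index.get(term, set())
    return sorted(result)


# Made-up documents: ID mapped to the contents of the name field.
docs = {1: "Lobster Live", 2: "Tuna Steak"}
index = build_index(docs)

print(term_query(index, "lobster"))   # [1]: matches the stored lowercase term
print(term_query(index, "Lobster"))   # []: no stored term has a capital L
print(match_query(index, "Lobster"))  # [1]: lowercased before the lookup
```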