Elasticsearch – Analysis Process

The analysis process tokenizes and normalizes a block of text so that it becomes easier to search. We have full control over this process. The resulting terms are stored within the inverted index, which means that when we perform search queries we are actually searching the output of the analysis process and not the original documents. Text passes through three stages on its way to becoming full-text searchable: character filters, a tokenizer, and token filters.
Analyze API

We can send a POST request to an API named _analyze along with a request body that specifies the text to analyze and, optionally, the analysis components to use.
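For example, the following request (a minimal sketch; the sample text is our own) asks for the text to be analyzed with the standard tokenizer:

POST /_analyze
{
  "tokenizer": "standard",
  "text": "Hello, wonderful World!"
}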

We are using a standard tokenizer here and passing along the desired text. The response contains the resulting tokens, along with some additional information for each one: the start_offset and end_offset (the token's character positions within the original text), its type, and its position in the token stream.
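The response for the request above looks roughly like this:

{
  "tokens": [
    {
      "token": "Hello",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "wonderful",
      "start_offset": 7,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "World",
      "start_offset": 17,
      "end_offset": 22,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}

Notice that "Hello" and "World" keep their capitalization, because the standard tokenizer on its own does not lowercase anything.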

If we run the same request without specifying a tokenizer, the text will still be split into tokens, and this time the first token comes back in lowercase. That is because when we give the _analyze API no analyzer or tokenizer, it falls back to the standard analyzer, which combines the standard tokenizer with a lowercase token filter.
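The same request with no tokenizer specified:

POST /_analyze
{
  "text": "Hello, wonderful World!"
}

This time the tokens come back as "hello", "wonderful" and "world".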

Inverted Index
When performing full-text searches we are actually querying an inverted index and not the JSON documents that we defined when indexing them. A cluster will contain at least one inverted index; in fact, each text field has its own. An inverted index works at the field level: it stores the terms for a single field, so it never needs to deal with the values of other fields.
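As a simplified sketch, suppose we index these two short documents (values invented for illustration) into a text field that uses the standard analyzer:

Document 1: "The quick brown fox"
Document 2: "The brown dog"

After analysis, the inverted index for that field maps each term to the documents containing it:

Term    | Documents
--------|----------
brown   | 1, 2
dog     | 2
fox     | 1
quick   | 1
the     | 1, 2

When we search the field for "brown", Elasticsearch looks the term up in this index and immediately finds documents 1 and 2.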

Character Filters 
There are three types of character filters available.

  1. HTML Strip Character Filter: As the name suggests, this filter strips HTML tags from the text, and it also decodes HTML entities. An example use case: we are indexing comments from a blog, and those comments may contain HTML markup.
    In the first example after this list, we can see how the HTML tags are removed and the apostrophe HTML entity (&apos;) is converted into an actual apostrophe.
  2. Mapping Character Filter: The mapping filter replaces values based on a supplied list of values and their replacements. In the second example after this list, we can see how certain character sequences, here smileys, are replaced. We simply specify a map of values and their replacements, and the character filter takes care of performing the replacements.
  3. Pattern Replace Character Filter: This one does much the same thing, except that it performs the matching based on a regular expression. It also lets us access the matched values with capture groups and use those within the replacement value, as the third example after this list shows.
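The following requests illustrate all three filters through the _analyze API. They are minimal sketches based on the filters' documented options; the sample texts are invented for illustration.

POST /_analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<p>I&apos;m so happy!</p>"
}

The html_strip filter removes the tags and decodes the entity, leaving the text "I'm so happy!".

POST /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        ":) => _happy_",
        ":( => _sad_"
      ]
    }
  ],
  "text": "Good news today :)"
}

The mapping filter turns the text into "Good news today _happy_".

POST /_analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "(\\d+)-(?=\\d)",
      "replacement": "$1_"
    }
  ],
  "text": "Call 123-456-789 for details"
}

Here $1 refers to the first capture group, so the tokens include 123_456_789 as a single term instead of three separate numbers.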