Elasticsearch

Basics

  • Nodes store the data that we add to Elasticsearch
  • A cluster is a collection of nodes
  • Data is stored as documents, which are JSON objects
  • Documents are grouped together into indices

The purpose of sharding

  • mainly to be able to store more documents
  • to fit large indices onto nodes more easily
  • improved performance
    • parallelization of queries increases the throughput of an index

Configuring the number of shards

  • an index contains a single shard by default
  • increase the number of shards with the Split API
  • reduce the number of shards with the Shrink API
  • when changing the number of shards, a new index is created and the documents of the old index are copied over to it (see the sketch below)
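
As a sketch, splitting a hypothetical two-shard index named my_index into four shards could look like this (the source index must first be made read-only):

PUT /my_index/_settings
{
  "index.blocks.write": true
}

POST /my_index/_split/my_index_new
{
  "settings": {
    "index.number_of_shards": 4
  }
}

The Shrink API is used the same way in the opposite direction (POST /my_index/_shrink/target_index).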

Replication

  • replication is configured at the index level
  • replication works by creating copies of shards, referred to as replica shards
  • a shard that has been replicated is called a primary shard
  • a primary shard and its replica shards are referred to as a replication group
  • replica shards are a complete copy of a shard
  • a replica shard can serve search requests, exactly like its primary shard
  • the number of replicas can be configured at index creation (see the sketch below)
  • a node can store multiple shards, but a primary shard and its replica shards are never stored on the same node, so the data is NOT lost if the node fails
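
A minimal sketch of configuring replication at index creation (the index name and values are illustrative):

PUT /products
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 2
  }
}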

snapshots

  • Elasticsearch supports taking snapshots as backups
  • snapshots can be used to restore to a given point in time
  • snapshots can be taken at the index level, or for the entire cluster
  • use snapshots for backups, and replication for high availability and performance
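
A sketch of the APIs involved, assuming a filesystem repository (the "fs" type requires the location to be registered under path.repo in elasticsearch.yml; names and paths here are illustrative):

PUT /_snapshot/my_repo
{
  "type": "fs",
  "settings": {
    "location": "/mnt/backups"
  }
}

PUT /_snapshot/my_repo/snapshot_1?wait_for_completion=true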

increasing query throughput with replication

  • replica shards of a replication group can serve different search requests simultaneously
    • this increases the number of requests that can be handled at the same time
  • Elasticsearch intelligently routes requests to the best shard
  • CPUs have multiple cores and can run queries on different threads at the same time, so this parallelization improves performance even when multiple replica shards are stored on the same node

Master-eligible node

  • the node may be elected as the cluster’s master node
  • a master node is responsible for cluster-wide actions, such as creating and deleting indices
  • the role may be used to set up dedicated master nodes
    • useful for large clusters
    • a dedicated master node does not serve search or index requests; it focuses only on its cluster management tasks

Data node

  • enables a node to store data
  • storing data includes performing queries related to that data, such as search queries
  • for relatively small clusters, this role is almost always enabled
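
Node roles are configured in elasticsearch.yml. A sketch, assuming Elasticsearch 7.9 or later where the node.roles setting is available:

# a dedicated master-eligible node (serves no search or index requests)
node.roles: [ master ]

# a node that only stores data and serves queries against it
node.roles: [ data ]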

Managing Documents

  • Delete index

DELETE /index_name

  • create index

PUT /index_name
{
  "settings": {
    "number_of_shards": 2
  }
}

indexing documents

POST /products/_doc
{
  "name": "Coffee Maker",
  "price": 53,
  "in_stock": 10
}
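
To index a document with a specific ID rather than an auto-generated one, use PUT with the ID in the path (the document body here is illustrative):

PUT /products/_doc/101
{
  "name": "Toaster",
  "price": 49,
  "in_stock": 4
}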

Retrieving document by ID

GET /products/_doc/document_ID

Updating document

POST /products/_update/document_ID
{
  "doc": {
    "name": "new name",
    "new field": "this is a new field"
  }
}

Upserts

  • inserts the supplied document if it does not exist; otherwise runs the script against the existing document

POST /products/_update/101
{
  "script": {
    "source": "ctx._source.in_stock++"
  },
  "upsert": {
    "name": "name",
    "price": 399,
    "in_stock": 5
  }
}

Delete document

DELETE /products/_doc/101

Routing

  • routing is the process of resolving a shard for a document
  • the default routing strategy ensures that documents are distributed evenly
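
The default routing formula, in simplified form (ignoring the internal routing_num_shards factor):

shard_num = hash(_routing) % num_primary_shards

_routing defaults to the document’s ID. This is also why the number of shards of an existing index cannot simply be changed in place: the formula would resolve to different shards for existing documents.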

Optimistic concurrency control

  • prevents overwriting documents inadvertently due to concurrent operations
  • primary terms
    • a way to distinguish between old and new primary shards
    • essentially a counter for how many times the primary shard has changed
    • the primary term is appended to write operations
  • sequence numbers
    • appended to write operations together with the primary term
    • essentially a counter that is incremented for each write operation
    • the primary shard increases the sequence number
    • enables Elasticsearch to order write operations
  • sending write requests to Elasticsearch concurrently may overwrite changes made by other concurrent processes
  • to prevent this, we send the primary term and sequence number along with write requests
  • Elasticsearch will reject a write operation if it contains the wrong primary term or sequence number (see the sketch below)
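
A sketch of the workflow: retrieve the document and note the _primary_term and _seq_no fields in the response, then pass them as query parameters on the update (the values below are illustrative):

GET /products/_doc/101

POST /products/_update/101?if_primary_term=1&if_seq_no=71
{
  "doc": {
    "in_stock": 123
  }
}

If another process modified the document in the meantime, the sequence number no longer matches and Elasticsearch rejects the update with a version conflict error.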

Updating multiple documents

  • the query creates a snapshot to do optimistic concurrency control
  • search queries and bulk requests are sent to replication groups sequentially
    • Elasticsearch retries these queries up to ten times
    • if the queries still fail, the whole query is aborted
    • any changes already made to documents are NOT rolled back
  • the API returns information about failures
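
The notes above describe the Update By Query API. A minimal sketch that decrements a field on every document, reusing the products examples from earlier:

POST /products/_update_by_query
{
  "script": {
    "source": "ctx._source.in_stock--"
  },
  "query": {
    "match_all": {}
  }
}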

Mapping and Analysis

  • a field’s values are stored in one of several data structures
    • the data structure depends on the field’s data type
  • ensures efficient data access

Inverted indices

  • an inverted index is a mapping between terms and which documents contain them
  • outside the context of analyzers, we use the terminology ‘terms’
  • an inverted index is created for EACH text field
  • values for a text field are analyzed, and the resulting terms are stored within the field’s inverted index
  • terms are sorted alphabetically for performance reasons
  • inverted indices are created and maintained by Apache Lucene
  • inverted indices enable fast searches

Note: for array of strings

  • in Elasticsearch there is no dedicated array data type; any field can contain zero or more values by default, but all values in the array must be of the same data type
  • when adding a field dynamically, the first value in the array determines the field’s type
  • this means that for an array of strings, Elasticsearch creates an inverted index for the field just like for any other text field

keyword data type

  • keyword fields are analyzed with the keyword analyzer
  • the keyword analyzer is a no-op analyzer
    • it outputs the unmodified string as a single token
    • this token is then placed into the inverted index
  • used for exact matching of values
  • typically used for filtering, aggregations, and sorting
  • for full-text searches, use the text data type instead
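
A sketch of an explicit mapping using both data types (the index and field names are illustrative):

PUT /reviews
{
  "mappings": {
    "properties": {
      "content": { "type": "text" },
      "author_email": { "type": "keyword" }
    }
  }
}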

Arrays

  • there is no such thing as an array data type
  • any field may contain zero or more values
    • no configuration or mapping needed
    • simply supply an array when indexing a document
  • how are arrays stored internally in Elasticsearch?
    • e.g. for an array of strings
    • the strings are simply concatenated before being analyzed
    • the resulting tokens are stored within an inverted index, just as for an ordinary text field (see the sketch below)
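
No special mapping is needed; simply supply an array when indexing (a sketch reusing the products index from earlier):

POST /products/_doc
{
  "name": "Coffee Maker",
  "tags": ["electronics", "kitchen", "appliances"]
}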

Dates

  • specified in one of three ways
    • specially formatted strings
    • milliseconds since the epoch
    • seconds since the epoch
  • the epoch refers to the 1st of January 1970 (midnight, UTC)
  • custom formats are supported

How date fields are stored

  • stored internally as milliseconds since the epoch
  • any valid value that you supply at index time is converted to a long value internally
  • dates are converted to the UTC timezone
  • the same date conversion happens for search queries too
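
A sketch, assuming a created field already mapped as a date: the same instant can be supplied as a formatted string or as milliseconds since the epoch (seconds since the epoch require a custom epoch_second format in the mapping):

POST /products/_doc
{ "created": "2020-01-01T00:00:00Z" }

POST /products/_doc
{ "created": 1577836800000 }

Both values are stored internally as the same long value, 1577836800000.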

Missing fields

  • all fields in Elasticsearch are optional
  • you can leave out a field when indexing documents
  • unlike relational databases, where you need to explicitly allow NULL values
  • some integrity checks need to be done at the application level
    • e.g. having required fields
  • adding a field mapping does not make a field required
  • searches automatically handle missing fields
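
For example, the exists query matches only documents that have an indexed value for a field; a minimal sketch (the tags field is an assumption from the earlier examples):

GET /products/_search
{
  "query": {
    "exists": {
      "field": "tags"
    }
  }
}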

Stemming and stop words

Stemming

  • reduces words to their root form
    • e.g. loved -> love
    • drinking -> drink

stop words

  • words that are filtered out during the text analysis
    • common words such as ‘a’, ‘the’, ‘at’, ‘of’, ‘on’ etc…
  • they provide little to no value to the relevance scoring
  • fairly common to remove such words
    • less common in ElasticSearch today than in the past
    • the relevance algorithm has been improved significantly
  • not removed by default

Analyzer

Standard analyzer

  • splits text at word boundaries and removes punctuation
    • done by the standard tokenizer
  • lowercases letters with the lowercase token filter
  • contains the stop token filter (for removing stop words; disabled by default)

Simple analyzer

  • splits text into tokens whenever it encounters a character that is not a letter
  • lowercases letters as well

whitespace analyzer

  • splits text into tokens at whitespace
  • does not lowercase letters

keyword analyzer

  • a no-op analyzer that outputs the input text as a single token
  • used for exact matching, as described for the keyword data type above

pattern analyzer

  • a regular expression is used to match token separators
    • it should match whatever should split the text into tokens
  • this analyzer is very flexible
  • the default pattern matches all non-word characters (\W+)
  • lowercases letters by default
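
To inspect how any of these analyzers processes a piece of text, the Analyze API can be used; a minimal sketch:

POST /_analyze
{
  "analyzer": "standard",
  "text": "Drinking a COFFEE, obviously!"
}

The response lists the resulting tokens along with their positions and character offsets.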

Introduction to Searching

GET /products/_search
{
  "query": {
    "match_all": {}
  }
}

  • a search query first hits a coordinating node; that node broadcasts the query to the other nodes, which fetch their results, and the coordinating node combines them and returns the final result

term level queries

  • term level queries search for exact matches; they are case sensitive and search the inverted index, not the original document
  • term level queries are more suited for searching structured, static values, like enums (see the sketch below)
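
A minimal sketch of a term query (the tags field and its keyword mapping are assumptions based on the earlier examples):

GET /products/_search
{
  "query": {
    "term": {
      "tags.keyword": "kitchen"
    }
  }
}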