Elasticsearch

Basics

  • Nodes store the data that we add to Elasticsearch
  • A cluster is a collection of nodes
  • Data is stored as documents, which are JSON objects
  • Documents are grouped together into indices

The purpose of sharding

  • mainly to be able to store more documents
  • to fit large indices onto nodes more easily
  • improved performance
    • parallelization of queries increases the throughput of an index

Configuring the number of shards

  • an index contains a single shard by default
  • increase the number of shards with the Split API
  • reduce the number of shards with the Shrink API
  • when changing the number of shards, a new index is created and the documents of the old index are copied over to it (see the sketch below)
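
As a sketch, splitting a hypothetical two-shard index named my_index into four shards could look like this (the source index must first be made read-only):

PUT /my_index/_settings
{
  "index.blocks.write": true
}

POST /my_index/_split/my_index_new
{
  "settings": {
    "index.number_of_shards": 4
  }
}

The Shrink API is used the same way in the opposite direction (POST /my_index/_shrink/target_index).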

Replication

  • replication is configured at the index level
  • replication works by creating copies of shards, referred to as replica shards
  • a shard that has been replicated is called a primary shard
  • a primary shard and its replica shards are referred to as a replication group
  • replica shards are a complete copy of a shard
  • a replica shard can serve search requests, exactly like its primary shard
  • the number of replicas can be configured at index creation (see the sketch below)
  • a node can store multiple shards, but a primary shard and its replica shards are never stored on the same node, so the data is NOT lost if the node fails
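
A minimal sketch of configuring replication at index creation (the index name and values are illustrative):

PUT /products
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 2
  }
}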

snapshots

  • Elasticsearch supports taking snapshots as backups
  • snapshots can be used to restore to a given point in time
  • snapshots can be taken at the index level, or for the entire cluster
  • use snapshots for backups, and replication for high availability and performance
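
A sketch of the APIs involved, assuming a filesystem repository (the "fs" type requires the location to be registered under path.repo in elasticsearch.yml; names and paths here are illustrative):

PUT /_snapshot/my_repo
{
  "type": "fs",
  "settings": {
    "location": "/mnt/backups"
  }
}

PUT /_snapshot/my_repo/snapshot_1?wait_for_completion=true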

increasing query throughput with replication

  • replica shards of a replication group can serve different search requests simultaneously
    • this increases the number of requests that can be handled at the same time
  • Elasticsearch intelligently routes requests to the best shard
  • CPUs have multiple cores and can run queries on different threads at the same time, so this parallelization improves performance even when multiple replica shards are stored on the same node

Master-eligible node

  • the node may be elected as the cluster’s master node
  • a master node is responsible for cluster-wide actions, such as creating and deleting indices
  • the role may be used to set up dedicated master nodes
    • useful for large clusters
    • a dedicated master node does not serve search or index requests; it focuses only on its cluster management tasks

Data node

  • enables a node to store data
  • storing data includes performing queries related to that data, such as search queries
  • for relatively small clusters, this role is almost always enabled
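
Node roles are configured in elasticsearch.yml. A sketch, assuming Elasticsearch 7.9 or later where the node.roles setting is available:

# a dedicated master-eligible node (serves no search or index requests)
node.roles: [ master ]

# a node that only stores data and serves queries against it
node.roles: [ data ]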

Managing Documents

  • Delete index

DELETE /index_name

  • create index

PUT /index_name
{
  "settings": {
    "number_of_shards": 2
  }
}

indexing documents

POST /products/_doc
{
  "name": "Coffee Maker",
  "price": 53,
  "in_stock": 10
}
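
To index a document with a specific ID rather than an auto-generated one, use PUT with the ID in the path (the document body here is illustrative):

PUT /products/_doc/101
{
  "name": "Toaster",
  "price": 49,
  "in_stock": 4
}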

Retrieving document by ID

GET /products/_doc/document_ID

Updating document

POST /products/_update/document_ID
{
  "doc": {
    "name": "new name",
    "new field": "this is a new field"
  }
}

Upserts

  • inserts the supplied document if it does not exist; otherwise runs the script against the existing document

POST /products/_update/101
{
  "script": {
    "source": "ctx._source.in_stock++"
  },
  "upsert": {
    "name": "name",
    "price": 399,
    "in_stock": 5
  }
}

Delete document

DELETE /products/_doc/101

Routing

  • routing is the process of resolving a shard for a document
  • the default routing strategy ensures that documents are distributed evenly
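
The default routing formula, in simplified form (ignoring the internal routing_num_shards factor):

shard_num = hash(_routing) % num_primary_shards

_routing defaults to the document’s ID. This is also why the number of shards of an existing index cannot simply be changed in place: the formula would resolve to different shards for existing documents.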

Optimistic concurrency control

  • prevents overwriting documents inadvertently due to concurrent operations
  • primary terms
    • a way to distinguish between old and new primary shards
    • essentially a counter for how many times the primary shard has changed
    • the primary term is appended to write operations
  • sequence numbers
    • appended to write operations together with the primary term
    • essentially a counter that is incremented for each write operation
    • the primary shard increases the sequence number
    • enables Elasticsearch to order write operations
  • sending write requests to Elasticsearch concurrently may overwrite changes made by other concurrent processes
  • to prevent this, we send the primary term and sequence number along with write requests
  • Elasticsearch will reject a write operation if it contains the wrong primary term or sequence number (see the sketch below)
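
A sketch of the workflow: retrieve the document and note the _primary_term and _seq_no fields in the response, then pass them as query parameters on the update (the values below are illustrative):

GET /products/_doc/101

POST /products/_update/101?if_primary_term=1&if_seq_no=71
{
  "doc": {
    "in_stock": 123
  }
}

If another process modified the document in the meantime, the sequence number no longer matches and Elasticsearch rejects the update with a version conflict error.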

Updating multiple documents

  • the query creates a snapshot to do optimistic concurrency control
  • search queries and bulk requests are sent to replication groups sequentially
    • Elasticsearch retries these queries up to ten times
    • if the queries still fail, the whole query is aborted
    • any changes already made to documents are NOT rolled back
  • the API returns information about failures
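
The notes above describe the Update By Query API. A minimal sketch that decrements a field on every document, reusing the products examples from earlier:

POST /products/_update_by_query
{
  "script": {
    "source": "ctx._source.in_stock--"
  },
  "query": {
    "match_all": {}
  }
}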

Mapping and Analysis

  • a field’s values are stored in one of several data structures
    • the data structure depends on the field’s data type
  • ensures efficient data access

Inverted indices

  • an inverted index is a mapping between terms and which documents contain them
  • outside the context of analyzers, we use the terminology ‘terms’
  • an inverted index is created for EACH text field
  • values for a text field are analyzed, and the resulting terms are stored within the field’s inverted index
  • terms are sorted alphabetically for performance reasons
  • inverted indices are created and maintained by Apache Lucene
  • inverted indices enable fast searches

Note: for array of strings

  • in Elasticsearch there is no dedicated array data type; any field can contain zero or more values by default, but all values in the array must be of the same data type
  • when adding a field dynamically, the first value in the array determines the field’s type
  • this means that for an array of strings, Elasticsearch creates an inverted index for the field just like for any other text field

keyword data type

  • keyword fields are analyzed with the keyword analyzer
  • the keyword analyzer is a no-op analyzer
    • it outputs the unmodified string as a single token
    • this token is then placed into the inverted index
  • used for exact matching of values
  • typically used for filtering, aggregations, and sorting
  • for full-text searches, use the text data type instead
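
A sketch of an explicit mapping using both data types (the index and field names are illustrative):

PUT /reviews
{
  "mappings": {
    "properties": {
      "content": { "type": "text" },
      "author_email": { "type": "keyword" }
    }
  }
}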

Arrays

  • there is no such thing as an array data type
  • any field may contain zero or more values
    • no configuration or mapping needed
    • simply supply an array when indexing a document
  • how are arrays stored internally in Elasticsearch?
    • e.g. for an array of strings
    • the strings are simply concatenated before being analyzed
    • the resulting tokens are stored within an inverted index, just as for an ordinary text field (see the sketch below)
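
No special mapping is needed; simply supply an array when indexing (a sketch reusing the products index from earlier):

POST /products/_doc
{
  "name": "Coffee Maker",
  "tags": ["electronics", "kitchen", "appliances"]
}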

Dates

  • specified in one of three ways
    • specially formatted strings
    • milliseconds since the epoch
    • seconds since the epoch
  • the epoch refers to the 1st of January 1970 (midnight, UTC)
  • custom formats are supported

How date fields are stored

  • stored internally as milliseconds since the epoch
  • any valid value that you supply at index time is converted to a long value internally
  • dates are converted to the UTC timezone
  • the same date conversion happens for search queries too
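
A sketch, assuming a created field already mapped as a date: the same instant can be supplied as a formatted string or as milliseconds since the epoch (seconds since the epoch require a custom epoch_second format in the mapping):

POST /products/_doc
{ "created": "2020-01-01T00:00:00Z" }

POST /products/_doc
{ "created": 1577836800000 }

Both values are stored internally as the same long value, 1577836800000.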

Missing fields

  • all fields in Elasticsearch are optional
  • you can leave out a field when indexing documents
  • unlike relational databases, where you need to explicitly allow NULL values
  • some integrity checks need to be done at the application level
    • e.g. having required fields
  • adding a field mapping does not make a field required
  • searches automatically handle missing fields
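
For example, the exists query matches only documents that have an indexed value for a field; a minimal sketch (the tags field is an assumption from the earlier examples):

GET /products/_search
{
  "query": {
    "exists": {
      "field": "tags"
    }
  }
}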

Stemming and stop words

Stemming

  • reduces words to their root form
    • e.g. loved -> love
    • drinking -> drink

stop words

  • words that are filtered out during the text analysis
    • common words such as ‘a’, ‘the’, ‘at’, ‘of’, ‘on’ etc…
  • they provide little to no value to the relevance scoring
  • fairly common to remove such words
    • less common in ElasticSearch today than in the past
    • the relevance algorithm has been improved significantly
  • not removed by default

Analyzer

Standard analyzer

  • splits text at word boundaries and removes punctuation
    • done by the standard tokenizer
  • lowercases letters with the lowercase token filter
  • contains the stop token filter (for removing stop words; disabled by default)

Simple analyzer

  • splits text into tokens whenever it encounters a character that is not a letter
  • lowercases letters as well

whitespace analyzer

  • splits text into tokens at whitespace
  • does not lowercase letters

keyword analyzer

  • a no-op analyzer that outputs the input text as a single token
  • used for exact matching, as described for the keyword data type above

pattern analyzer

  • a regular expression is used to match token separators
    • it should match whatever should split the text into tokens
  • this analyzer is very flexible
  • the default pattern matches all non-word characters (\W+)
  • lowercases letters by default
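
To inspect how any of these analyzers processes a piece of text, the Analyze API can be used; a minimal sketch:

POST /_analyze
{
  "analyzer": "standard",
  "text": "Drinking a COFFEE, obviously!"
}

The response lists the resulting tokens along with their positions and character offsets.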

Introduction to Searching

GET /products/_search
{
  "query": {
    "match_all": {}
  }
}

  • a search query first hits a coordinating node; that node broadcasts the query to the other nodes, which fetch their results, and the coordinating node combines them and returns the final result

term level queries

  • term level queries search for exact matches; they are case sensitive and search the inverted index, not the original document
  • term level queries are more suited for searching structured, static values, like enums (see the sketch below)
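
A minimal sketch of a term query (the tags field and its keyword mapping are assumptions based on the earlier examples):

GET /products/_search
{
  "query": {
    "term": {
      "tags.keyword": "kitchen"
    }
  }
}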