Elasticsearch
Introduction
Elasticsearch is a highly scalable open source search engine with a REST API that is hard not to love. It is a distributed, RESTful search engine built for the cloud. Features include:
- Distributed and Highly Available Search Engine.
  - Each index is fully sharded with a configurable number of shards.
  - Each shard can have one or more replicas.
  - Read / Search operations performed on any of the replica shards.
- Multi Tenant.
  - Support for more than one index.
  - Index level configuration (number of shards, index storage, …).
- A varied set of APIs
  - HTTP RESTful API
  - Native Java API.
  - All APIs perform automatic node operation rerouting.
- Document oriented
  - No need for upfront schema definition.
  - Schema can be defined for customization of the indexing process.
- Reliable, Asynchronous Write Behind for long term persistency.
- (Near) Real Time Search.
- Built on top of Lucene
  - Each shard is a fully functional Lucene index
  - All the power of Lucene easily exposed through simple configuration / plugins.
- Per operation consistency
  - Single document level operations are atomic, consistent, isolated and durable.
- Open Source under the Apache License, version 2 (“ALv2”)
Elasticsearch is a highly available and distributed search engine. Each index is broken down into shards, and each shard can have one or more replicas. By default, an index is created with 5 shards and 1 replica per shard (5/1). Many other topologies can be used, including 1/10 (to improve search performance) or 20/1 (to improve indexing performance, with search executed in a map reduce fashion across shards).
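As a rough illustration of the shards/replicas notation above, the sketch below builds the settings body one might send when creating an index. The `number_of_shards` and `number_of_replicas` keys are standard index settings; the helper names are made up for this example, and nothing is sent over the network.

```python
import json


def index_settings(shards: int, replicas: int) -> dict:
    """Build the settings body for an index with the given topology.

    `shards` is the number of primary shards; `replicas` is the number of
    replica copies kept for each primary shard.
    """
    if shards < 1 or replicas < 0:
        raise ValueError("need at least 1 shard and a non-negative replica count")
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        }
    }


def total_shard_copies(shards: int, replicas: int) -> int:
    """Count the primaries plus all of their replicas."""
    return shards * (1 + replicas)


if __name__ == "__main__":
    # The 5/1 topology described above: 5 primaries, 1 replica each.
    print(json.dumps(index_settings(5, 1), indent=2))
    print(total_shard_copies(5, 1))
```

A body like this would typically be sent with an HTTP PUT to the index URL, for example http://localhost:9200/blog.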
Indexing
Elasticsearch achieves fast search responses because it searches an index rather than the text itself. This is like finding the pages in a book related to a keyword by scanning the index at the back, as opposed to reading every word of every page.
This type of index is called an inverted index, because it inverts a page-centric data structure (page->words) to a keyword-centric data structure (word->pages).
Elasticsearch uses Apache Lucene to create and manage this inverted index.
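The book analogy can be sketched in a few lines of Python. This is a toy version that assumes whitespace tokenization and lowercasing; Lucene's real analysis and storage are far more elaborate, and the function names here are invented for illustration.

```python
from collections import defaultdict


def build_inverted_index(pages: dict[int, str]) -> dict[str, set[int]]:
    """Invert a page -> words mapping into a word -> pages mapping."""
    index: dict[str, set[int]] = defaultdict(set)
    for page_number, text in pages.items():
        # Naive analysis: lowercase the text and split on whitespace.
        for word in text.lower().split():
            index[word].add(page_number)
    return dict(index)


def search(index: dict[str, set[int]], keyword: str) -> list[int]:
    """Look the keyword up in the index instead of scanning every page."""
    return sorted(index.get(keyword.lower(), set()))


if __name__ == "__main__":
    pages = {
        1: "search engines build an index",
        2: "an index maps words to pages",
        3: "lucene manages the index",
    }
    index = build_inverted_index(pages)
    print(search(index, "index"))   # [1, 2, 3]
    print(search(index, "lucene"))  # [3]
    print(search(index, "solr"))    # []
```

The lookup cost depends on the number of distinct words, not on the total amount of text, which is the point of the inversion.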
How Elasticsearch represents data
In Elasticsearch, a Document is the unit of search and index. An index consists of one or more Documents, and a Document consists of one or more Fields.
In database terminology, a Document corresponds to a table row, and a Field corresponds to a table column.
Schema
Unlike Solr, Elasticsearch is schema-free. Well, kinda. Whilst you are not required to specify a schema before indexing documents, you do need to add mapping declarations if you require anything but the most basic fields and operations.
This is no different from specifying a schema!
The schema declares:
- what fields there are
- which field should be used as the unique/primary key
- which fields are required
- how to index and search each field
To create a mapping, you will need the Put Mapping API, or you can add multiple mappings when you create an index.
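To make the idea of a mapping concrete, here is a minimal sketch that builds a create-index body containing one type mapping. It assumes the type-based layout of the older Elasticsearch versions this post describes (newer releases dropped mapping types and replaced the `string` field type), and the type and field names are hypothetical.

```python
import json


def build_mapping(doc_type: str, fields: dict[str, str]) -> dict:
    """Build a create-index body with a mapping for one document type.

    `fields` maps each field name to its data type, e.g. {"title": "string"}.
    """
    properties = {name: {"type": field_type} for name, field_type in fields.items()}
    return {"mappings": {doc_type: {"properties": properties}}}


if __name__ == "__main__":
    # Hypothetical blog-post fields, for illustration only.
    body = build_mapping(
        "post",
        {"title": "string", "published": "date", "views": "integer"},
    )
    print(json.dumps(body, indent=2))
```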
Query DSL
The Query DSL is Elasticsearch's way of making Lucene's query syntax accessible to users, allowing complex queries to be composed using a JSON syntax. Like Lucene, there are basic queries such as term or prefix queries and also compound queries like the bool query.
The main structure of a query is roughly:
curl -X POST "http://localhost:9200/blog/_search?pretty=true" -d '
{"from": 0,
"size": 10,
"query" : QUERY_JSON,
FILTER_JSON,
FACET_JSON,
SORT_JSON
}'
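As one possible way to fill in the query placeholder, the sketch below composes term and prefix queries into a bool query and wraps the result in the from/size/query envelope. The `term`, `prefix`, and `bool` query types are part of the Query DSL; the helper functions and the field names (`tags`, `title`, `published`) are assumptions made for this example.

```python
import json


def term(field: str, value: str) -> dict:
    """Basic query: match documents whose field contains the exact term."""
    return {"term": {field: value}}


def prefix(field: str, value: str) -> dict:
    """Basic query: match documents with a term that starts with value."""
    return {"prefix": {field: value}}


def bool_query(must=(), should=(), must_not=()) -> dict:
    """Compound query: combine other queries with boolean logic."""
    clauses = {"must": list(must), "should": list(should), "must_not": list(must_not)}
    # Leave out empty clause lists to keep the body small.
    return {"bool": {name: qs for name, qs in clauses.items() if qs}}


def search_body(query: dict, start: int = 0, size: int = 10, sort=None) -> str:
    """Wrap a query in the from/size/query/sort envelope."""
    body = {"from": start, "size": size, "query": query}
    if sort:
        body["sort"] = sort
    return json.dumps(body, indent=2)


if __name__ == "__main__":
    query = bool_query(
        must=[term("tags", "search")],
        must_not=[prefix("title", "draft")],
    )
    print(search_body(query, sort=[{"published": "desc"}]))
```

The printed JSON could be passed as the `-d` payload of a curl call against the `_search` endpoint.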
Summary
Elasticsearch is an open-source, broadly-distributable, readily-scalable, enterprise-grade search engine. Accessible through an extensive and elaborate API, Elasticsearch can power extremely fast searches that support your data discovery applications.
One of the options for indexing documents is the Bulk API. All we have is 10 JSON documents with very basic strings as properties. When bulk indexing, every document must be preceded by an "action_and_meta_data" line so that Elasticsearch knows what we want to do with it. Several actions are available (index, create, update, and delete), but we're just going to index our JSON documents. The name of our index is disney, and the type of our documents is characters. We could use create, but that fails if a document with the same ID already exists in that index and type. To avoid problems if you already have a disney index containing these documents, we will use index for every document, which overwrites any existing copy. If you're looking to index from an existing dataset, we will go over some of those methods in the next episode.
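The shape of a bulk request body can be sketched as follows. This only builds the newline-delimited text. It assumes sequential IDs and the older metadata layout that still includes `_type`. The sample documents are made up, since the 10 documents themselves are not shown here.

```python
import json


def bulk_index_body(index: str, doc_type: str, docs: list[dict]) -> str:
    """Build a newline-delimited bulk body that indexes every document.

    Each document contributes two lines: the action/metadata line and the
    document source. The body must end with a newline.
    """
    lines = []
    for doc_id, doc in enumerate(docs, start=1):
        # Assumption: IDs are simply 1, 2, 3, ... in list order.
        action = {"index": {"_index": index, "_type": doc_type, "_id": str(doc_id)}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up sample documents, for illustration only.
    characters = [
        {"name": "Mickey Mouse"},
        {"name": "Donald Duck"},
    ]
    print(bulk_index_body("disney", "characters", characters), end="")
```

Swapping the `"index"` key for `"create"` in the action line would give the stricter behaviour described above, where an existing document with the same ID causes that item to fail.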
index - database
type - table in a db
document - row in a db table
field of document - column
mapping - schema
What is Elasticsearch
· Open source search server based on Apache Lucene
· Written in Java
· Cross-platform
· Big focus on scalability: distributed from the ground up
· Designed to take data from any source, analyze it, and search through it
· Communication with the search server is done through an HTTP REST API
· Schema-less JSON documents
· Near real-time search
There is only a small latency, about one second, between the moment a document is indexed and the moment it becomes searchable.
When a change is made to an index, it propagates through the cluster in about one second, which is impressive for a large cluster. This is possible because of the distributed nature of Elasticsearch, which is also what makes it scalable.
Terminology
1. Elasticsearch Cluster
· A cluster is a collection of nodes (servers).
· It may consist of a single node or as many nodes as the scale requires.
· Together, the nodes in a cluster hold all of the data, which makes it extremely scalable.
· A cluster provides indexing and search capabilities across all nodes. You don't need to worry about which node has the documents you are searching for.
· A cluster has a unique name by which it can be identified. By default, the name is "elasticsearch".
2. Node
· A single server that is part of a cluster
· Stores searchable data
o Stores all data if there is only one node in the cluster, or part of the data if there are multiple nodes.
· Participates in a cluster's indexing and search capabilities
· Identified by a name (defaults to a random Marvel character)
· A node joins a cluster named "elasticsearch" by default
· Starting a single node on a network will by default create a new single-node cluster named "elasticsearch"
3. Index = Equivalent to Database
· A collection of documents (e.g. product, account, movie)
o Each of the above examples would be a type
· Corresponds to a database within a relational database system.
· Identified by a name, which must be lowercase
o Used when indexing, searching, updating and deleting documents within the index.
· You can define as many indices as you want within a cluster.
4. Type = corresponds to a table
· Represents a class/category of similar documents, e.g. "User"
· Consists of a name and mapping
· Simplified, you can think of a type as a table within a relational database
· An index can have one or more types defined, each with their own mapping
· Stored within a metadata field named _type because Lucene has no concept of document types.
o Searching for specific document types applies a filter on this field
5. Mapping
· Similar to a database schema for a table in a relational database
· Describes the fields that a document of a given type may have
o Includes the data type for each field, e.g. string, integer, date,...
o Also includes information on how each field should be indexed and searched.
· Dynamic mapping means that it is optional to define a mapping explicitly.
6. Document = corresponds to a row in a database table
· A basic unit of information that can be indexed
· Consists of fields, which are key/value pairs
o A value can be a string, date, object, etc.
· Corresponds to an object in an object-oriented programming language
o A document can be a single user, order, product, etc.
· Documents are expressed in JSON
· Any number of documents can be stored within an index.
7. Shards
· An index can be divided into multiple pieces called shards
o Useful if an index contains more data than the hardware of a node can store (e.g. 1 TB of data on a 500 GB disk)
· A shard is a fully functional and independent index
o Can be stored on any node in a cluster
· The number of shards can be specified when creating an index
· Allows scaling horizontally by content volume (index space)
· Allows operations to be distributed and parallelized across shards, which increases performance
8. Replicas
· A replica is a copy of a shard
· Provides high availability in case a shard or node fails
o A replica never resides on the same node as the original shard
· Allows scaling search volume, because search queries can be executed on all replicas in parallel
· By default, Elasticsearch creates 5 primary shards and 1 replica per shard for each index.
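The interplay of shards, replicas, and nodes can be pictured with a toy placement routine. This is not Elasticsearch's actual allocation algorithm, only a round-robin sketch that honours the rule that a replica never shares a node with its primary; the node names and the `allocate` helper are invented for illustration.

```python
def allocate(nodes: list[str], shards: int, replicas: int) -> dict[str, list[str]]:
    """Place primaries and replicas round-robin across nodes.

    Copies of the same shard always land on different nodes. Copies that
    cannot be placed under that rule (too few nodes) stay unassigned.
    """
    layout: dict[str, list[str]] = {node: [] for node in nodes}
    for shard in range(shards):
        # Copy 0 is the primary (P); later copies are replicas (R).
        for copy in range(min(1 + replicas, len(nodes))):
            node = nodes[(shard + copy) % len(nodes)]
            label = f"P{shard}" if copy == 0 else f"R{shard}"
            layout[node].append(label)
    return layout


if __name__ == "__main__":
    # The default 5 primaries with 1 replica each, spread over 3 nodes.
    for node, copies in allocate(["node-1", "node-2", "node-3"], 5, 1).items():
        print(node, copies)
    # With a single node there is nowhere to put the replicas.
    print(allocate(["node-1"], 5, 1))
```

With three nodes every shard has a copy on two different machines, so losing any one node still leaves a full set of data available for searching.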