Spring Boot: Integrating Elasticsearch

  • 1.elasticsearch install
  • 1.1 Download and unzip the Windows build
  • 1.2 Start
  • 1.3 Access
  • 1.4 Key terms
  • 2.elasticsearch ui
  • 2.1 elasticsearch-head
  • 2.2 elasticHD
  • 2.3 kibana
  • 3.Integrating Elasticsearch with Spring Boot
  • 3.1 Official documentation
  • 3.2 Dependency
  • 3.3 properties
  • 3.4 Code
  • 3.5 Errors
  • 4.Setting up a cluster
  • 4.1 Edit the configuration file
  • 4.2 Optionally change the passwords (this step is not really needed)
  • 4.3 Start the three nodes and check their status
  • 4.4 Start Kibana for monitoring
  • 5.elasticsearch api
  • 5.1 CRUD: Create
  • 5.2 CRUD: Update
  • 5.3 CRUD: Delete
  • 5.4 CRUD: Retrieve
  • 5.5 match query
  • 5.6 term query
  • 5.6 Sorting
  • 5.7 Paging
  • 6.Boolean queries in Elasticsearch
  • 6.1 must query
  • 6.2 should query
  • 6.3 must_not query
  • 6.4 filter query
  • 7.Filtering query results in Elasticsearch
  • 8.Highlighting in Elasticsearch
  • 9.Aggregations in Elasticsearch
  • 9.1 avg
  • 9.2 max
  • 9.3 min
  • 9.4 sum
  • 9.5 Grouping with range
  • 10.Elasticsearch Mapping & Dynamic Mapping
  • 10.1 mapping
  • 10.2 dynamic mapping
  • 1 dynamic mapping
  • 2 explicit mapping
  • 3 strict mapping
  • 4 Summary
  • 10.3 Object properties
  • 10.4 Controlling whether a field is indexed
  • 10.5 Searching for null values
  • 11.Elasticsearch settings
  • 12.Elasticsearch field data types
  • 13.cluster node
  • 13.Analyzer and tokenization
  • 13.1 Analyzers
  • 1. Standard analyzer
  • 2. Simple analyzer
  • 3. Whitespace analyzer
  • 4. Stop analyzer
  • 5. Keyword analyzer
  • 6. Pattern analyzer
  • 7. Language and multi-language analyzers: chinese
  • 8. Snowball analyzer
  • 13.2 Character filters
  • 1. HTML character filter
  • 2. Mapping character filter
  • 3. Pattern replace filter
  • 13.3 Tokenizers
  • 1. Standard tokenizer
  • 2. Keyword tokenizer
  • 3. Letter tokenizer
  • 4. Lowercase tokenizer
  • 5. Whitespace tokenizer
  • 6. Pattern tokenizer
  • 7. UAX URL email tokenizer
  • 8. Path hierarchy tokenizer
  • 13.4 Token filters
  • 1. Custom token filter
  • 2. Custom lowercase token filter
  • 3. Multiple token filters
  • 13.5 IK analyzer
  • 1. Download
  • 2. Introduction
  • 3. Test
  • 14.Forward index and inverted index
  • 15.Data modeling
  • 16.Secure internal communication in a cluster

1.elasticsearch install

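Item | Elasticsearch | Solr
Real-time indexing | No thread blocking; performs better than Solr | Subject to I/O blocking
Adding data dynamically | No impact on performance | Performance degrades
Distribution | Built in | Managed through ZooKeeper
Data formats | JSON only | XML, JSON, CSV, and others
Positioning | Better suited to newer real-time search applications | A solid solution for traditional applications

Official site: the Elasticsearch website. It is best to download a somewhat older release, because integrating the newest versions tends to produce errors.

  • https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.16.2-windows-x86_64.zip
  • 9200 is the port of the RESTful (HTTP) API that ES exposes to clients
  • 9300 is the port ES uses internally, for communication between nodes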

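1.1 Download and unzip the Windows build

1.2 Start

Run bin\elasticsearch.bat from the unzipped directory.

1.3 Access

Open http://localhost:9200 in a browser. A JSON response with the node name and version means the node is up.

1.4 Key terms

Name | Desc
index | Comparable to a database in MySQL
document | Comparable to a row in MySQL
field | Comparable to a column in MySQL
shards | Partitions that spread the data of an index across nodes
replicas | Backup copies of the shards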

1.Metadata in a document

{
	"_id": "1",
	"_index": "lsp",
	"_score": 1,
	"_source": {
		"age": 30,
		"desc": "皮肤黑、武器长、性格直",
		"from": "gu",
		"name": "顾老二",
		"tags": [
			"黑",
			"长",
			"直"
		]
	}
}

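What each part means:

  • _index: the name of the index the document belongs to
  • _type: the name of the type the document belongs to
  • _id: the unique identifier of the document
  • _source: the original JSON of the document
  • _version: the version of the document (can be used to resolve conflicts between concurrent writes)
  • _score: the relevance score (computed for each search hit)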

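2.index

Each index has its own Mapping, which defines the names and types of all document fields.
 * A shard is the physical unit of storage
 * The data of an index is spread across its shards
 * The Mapping defines the types of the document fields
 * The Setting defines how the data is distributed

3.type

  • 5.x and earlier: an index has one or more types
  • 6.x: an index has exactly one type
  • 7.x: types are deprecated. To ease upgrades and migration, documents written to any 7.x version get the type _doc by default
  • 8.x: types are removed entirely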

2.elasticsearch ui

2.1 elasticsearch-head

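Download

  • Install Node.js (follow a Node.js installation guide), then verify with npm -v and node -v
  • Install grunt with npm install -g grunt-cli, then verify with grunt --version
  • Download elasticsearch-head from https://github.com/mobz/elasticsearch-head. To install it, cd into the folder and run npm install

Start: npm run start, or grunt server. The interface looks rather dated.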

2.2 elasticHD

1.Download

Download page: https://github.com/360EntSecGroup-Skylar/ElasticHD/releases

2.Start. Either double-click the executable, or cd into the install directory and run ElasticHD -p 127.0.0.1:9800

2.3 kibana

Download page: https://www.elastic.co/start

Important: the Kibana version must match the Elasticsearch version installed above.

Hovering over the Windows button on the download page shows the download URL; just change the version number in it:

  • https://artifacts.elastic.co/downloads/kibana/kibana-7.15.2-linux-x86_64.tar.gz
  • https://artifacts.elastic.co/downloads/kibana/kibana-7.16.2-windows-x86_64.zip

1.Double-click kibana.bat to start it. By default it connects to Elasticsearch on port 9200

2.Open http://localhost:5601 and enter the password configured for Elasticsearch

For monitoring, see: Monitoring an ES cluster with Kibana.

3.Integrating Elasticsearch with Spring Boot

3.1 Official documentation

Link: Spring Data Elasticsearch - Reference Documentation.

3.2 Dependency

<dependency>
   <groupId>org.springframework.boot</groupId>
   <artifactId>spring-boot-starter-data-elasticsearch</artifactId>
</dependency>

ElasticsearchRestTemplate wraps RestHighLevelClient, as its source shows:

public class ElasticsearchRestTemplate extends AbstractElasticsearchTemplate {
    private static final Logger LOGGER = LoggerFactory.getLogger(ElasticsearchRestTemplate.class);
    private final RestHighLevelClient client;
    private final ElasticsearchExceptionTranslator exceptionTranslator;

    public ElasticsearchRestTemplate(RestHighLevelClient client) {
        Assert.notNull(client, "Client must not be null!");
        this.client = client;
        this.exceptionTranslator = new ElasticsearchExceptionTranslator();
        this.initialize(this.createElasticsearchConverter());
    }

    public ElasticsearchRestTemplate(RestHighLevelClient client, ElasticsearchConverter elasticsearchConverter) {
        Assert.notNull(client, "Client must not be null!");
        this.client = client;
        this.exceptionTranslator = new ElasticsearchExceptionTranslator();
        this.initialize(elasticsearchConverter);
    }
}

3.3 properties

spring.data.elasticsearch.client.reactive.endpoints=127.0.0.1:9200
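# Enable Spring Data Elasticsearch repositories; with them enabled, a missing index is created at startup
spring.data.elasticsearch.repositories.enabled=true

# Transport (node-to-node) address on port 9300; only older Spring Boot versions read this property
spring.data.elasticsearch.cluster-nodes=127.0.0.1:9300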

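3.4 Code

  • @Document marks the class as an Elasticsearch document
  • indexName is the Elasticsearch index
  • type is the Elasticsearch type

import java.util.List;
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import org.springframework.data.annotation.Id;
import org.springframework.data.domain.Page;
import org.springframework.data.domain.Pageable;
import org.springframework.data.elasticsearch.annotations.Document;
import org.springframework.data.elasticsearch.annotations.Field;
import org.springframework.data.elasticsearch.annotations.FieldType;
import org.springframework.data.elasticsearch.annotations.Query;
import org.springframework.data.elasticsearch.repository.ElasticsearchRepository;
import org.springframework.stereotype.Repository;

@Document(indexName = "product",type = "article")
@Data
public class Article {
    @Id
    private String id;
    private String title;
    @Field(type = FieldType.Nested, includeInParent = true)
    private List<Author> authors;
}

// @AllArgsConstructor is required by the new Author("...") calls in the tests below;
// @NoArgsConstructor keeps the class deserializable.
@Data
@AllArgsConstructor
@NoArgsConstructor
public class Author {
    private String name;
}

@Repository
public interface ArticleRepository extends ElasticsearchRepository<Article,String> {

    // The next two queries do the same thing: one is derived from the method name, the other uses a custom query
    Page<Article> findByAuthorsName(String name, Pageable pageable);
    
    @Query("{\"bool\": {\"must\": [{\"match\": {\"authors.name\": \"?0\"}}]}}")
    Page<Article> findByAuthorsNameUsingCustomQuery(String name, Pageable pageable);

    // Search the title field
    Page<Article> findByTitleIsContaining(String word,Pageable pageable);
    
    Page<Article> findByTitle(String title,Pageable pageable);
}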

    // Members of a @SpringBootTest test class. Static imports used below:
    //   import static java.util.Arrays.asList;
    //   import static org.elasticsearch.index.query.QueryBuilders.regexpQuery;
    @Autowired
    private ArticleRepository articleRepository;
    @Autowired
    private ElasticsearchRestTemplate elasticsearchRestTemplate;

    // Check whether the index exists. With spring.data.elasticsearch.repositories.enabled=true the index is created automatically
    private boolean checkIndexExists(Class<?> cls){
        boolean isExist = elasticsearchRestTemplate.indexOps(cls).exists();
        // Read the index name from the @Document annotation
        String indexName = cls.getAnnotation(Document.class).indexName();
        System.out.printf("index %s is %s\n", indexName, isExist ? "exist" : "not exist");
        return isExist;
    }
    @Test
    void test() {
        checkIndexExists(Article.class);
    }


    @Test
     void save(){
        Article article = new Article();
        article.setTitle("Spring Data Elasticsearch");
        article.setAuthors(asList(new Author("LaoAlex"),new Author("John")));
        articleRepository.save(article);

        article = new Article();
        article.setTitle("Spring Data Elasticsearch2");
        article.setAuthors(asList(new Author("LaoAlex"),new Author("King")));
        articleRepository.save(article);

        article = new Article();
        article.setTitle("Spring Data Elasticsearch3");
        article.setAuthors(asList(new Author("LaoAlex"),new Author("Bill")));
        articleRepository.save(article);
    }
   
    @Test
    void queryAuthorName() throws JsonProcessingException {
        Page<Article> articles = articleRepository.findByAuthorsName("LaoAlex", PageRequest.of(0,10));
        // Serialize the result to a JSON string
        ObjectWriter objectWriter = new ObjectMapper().writer().withDefaultPrettyPrinter();
        String json = objectWriter.writeValueAsString(articles);
        System.out.println(json);
    }

    // Use the custom query
    @Test
    void queryAuthorNameByCustom() throws JsonProcessingException {
        Page<Article> articles = articleRepository.findByAuthorsNameUsingCustomQuery("John",PageRequest.of(0,10));
        // Serialize the result to a JSON string
        ObjectWriter objectWriter = new ObjectMapper().writer().withDefaultPrettyPrinter();
        String json = objectWriter.writeValueAsString(articles);
        System.out.println(json);
    }

    // Keyword search through the template
    @Test
    void queryTitleContainByTemplate() throws JsonProcessingException {
        Query query = new NativeSearchQueryBuilder().withFilter(regexpQuery("title",".*elasticsearch2.*")).build();
        SearchHits<Article> articles = elasticsearchRestTemplate.search(query, Article.class, IndexCoordinates.of("product"));
        // Serialize the result to a JSON string
        ObjectWriter objectWriter = new ObjectMapper().writer().withDefaultPrettyPrinter();
        String json = objectWriter.writeValueAsString(articles);
        System.out.println(json);
    }


    @Test
    void update() throws JsonProcessingException {
        Page<Article> articles = articleRepository.findByTitle("Spring Data Elasticsearch",PageRequest.of(0,10));
        // Serialize the result to a JSON string
        ObjectWriter objectWriter = new ObjectMapper().writer().withDefaultPrettyPrinter();
        String json = objectWriter.writeValueAsString(articles);
        System.out.println(json);

        // Saving an entity whose id already exists overwrites the stored document
        Article article = articles.getContent().get(0);
        System.out.println(article);
        article.setAuthors(null);
        articleRepository.save(article);
    }


    @Test
    void delete(){
        Page<Article> articles = articleRepository.findByTitle("Spring Data Elasticsearch",PageRequest.of(0,10));
        Article article = articles.getContent().get(0);
        articleRepository.delete(article);
    }

3.5 Errors

1.no id property found for class

// This error means the entity class mapped to ES has no id property. There are two ways to fix it.
// 1. Name the primary-key field id
@Data
@Document(indexName = "spring.student")
public class Student {

    private int id;
    private String stuName;
    private String stuAddress;
    private String gender;
}


// 2. If the primary-key field is not named id, annotate it with @Id
import org.springframework.data.annotation.Id;
import org.springframework.data.elasticsearch.annotations.Document;

@Data
@Document(indexName = "spring.student")
public class Student {

    @Id
    private int stuId;
    private String stuName;
    private String stuAddress;
    private String gender;
}

4.Setting up a cluster

4.1 Edit the configuration file

# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
#       Before you set out to tweak and tune the configuration, make sure you
#       understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
# https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
# All three nodes use this same cluster name
cluster.name: my-application
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
# Name of this node within the cluster (the three nodes are node-1, node-2, node-3 in turn)
node.name: node-1
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
#path.data: /path/to/data
#
# Path to log files:
#
#path.logs: /path/to/logs
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
#bootstrap.memory_lock: true
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# By default Elasticsearch is only accessible on localhost. Set a different
# address here to expose this node on the network:
#
network.host: 127.0.0.1
#
# By default Elasticsearch listens for HTTP traffic on the first free port it
# finds starting at 9200. Set a specific HTTP port here:
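# Change the ports on each of the three nodes in turn: (9201 9301), (9202 9302), (9203 9303)
http.port: 9201
# transport.port replaces transport.tcp.port, which is deprecated in 7.x
transport.port: 9301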
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when this node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
#discovery.seed_hosts: ["host1", "host2"]
# discovery.seed_hosts replaces discovery.zen.ping.unicast.hosts, which is deprecated in 7.x
discovery.seed_hosts: ["127.0.0.1:9301","127.0.0.1:9302","127.0.0.1:9303"]
#
# Bootstrap the cluster using an initial set of master-eligible nodes:
# Same on all three nodes; node-1 is the initial master
cluster.initial_master_nodes: ["node-1"]
#
# For more information, consult the discovery and cluster formation module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true
#
# ---------------------------------- Security ----------------------------------
#
#                                 *** WARNING ***
#
# Elasticsearch security features are not enabled by default.
# These features are free, but require configuration changes to enable them.
# This means that users don’t have to provide credentials and can get full access
# to the cluster. Network connections are also not encrypted.
#
# To protect your data, we strongly encourage you to enable the Elasticsearch security features. 
# Refer to the following documentation for instructions.
#
# https://www.elastic.co/guide/en/elasticsearch/reference/7.16/configuring-stack-security.html

#allow origin
http.cors.enabled: true
http.cors.allow-origin: "*"
http.cors.allow-headers: Authorization

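4.2 Optionally change the passwords (this step is not really needed)

The default credentials are commonly said to be the following, but I could not log in with them:

  • username=elastic
  • password=changeme

cd into the bin directory and run elasticsearch-setup-passwords (or open cmd straight from the Explorer address bar): D:\Tools\es_cluster\elasticsearch-7.16.2\bin>elasticsearch-setup-passwords interactive

The tool prompts for a password for each built-in user, and every one of them must be set. The password used here:

  • elas123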

4.3 Start the three nodes and check their status

1.A node starting successfully does not mean the cluster has formed

2.Call the following API to check the cluster state

  • http://localhost:9203/_cat/health?v

3.Other endpoints for inspecting the cluster

  • http://ip:port/_cluster/health
  • http://ip:port/_cat/nodes
  • http://ip:port/_cat/shards

GET http://localhost:9203/_cluster/health

{
  "cluster_name": "my-application",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 3,
  "number_of_data_nodes": 3,
  "active_primary_shards": 8,
  "active_shards": 16,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100.0
}

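What the colors mean:

  • green - all primary and replica shards are allocated
  • yellow - all primary shards are allocated, but at least one replica shard is not
  • red - at least one primary shard is not allocated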
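
The color rule above can be written as a small decision function. This is only a sketch of the rule as stated in the list: ClusterHealthSketch and status are hypothetical names, not part of any Elasticsearch client, and the inputs are assumed to be the counts of unassigned primary and replica shards.

```java
// Hypothetical helper: it restates the color rule from the list above, nothing more.
final class ClusterHealthSketch {

    /** Derives the health color from the number of unassigned primary and replica shards. */
    static String status(int unassignedPrimaries, int unassignedReplicas) {
        if (unassignedPrimaries > 0) {
            return "red";     // at least one primary shard is not allocated
        }
        if (unassignedReplicas > 0) {
            return "yellow";  // all primaries are allocated, some replica is not
        }
        return "green";       // every primary and replica shard is allocated
    }

    public static void main(String[] args) {
        System.out.println(status(0, 0)); // nothing unassigned, as in the response above
        System.out.println(status(0, 3)); // replicas missing
        System.out.println(status(1, 0)); // a primary missing
    }
}
```

A single-node cluster with the default of one replica per primary typically reports yellow, because a replica is never placed on the same node as its primary.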

4.4 Start Kibana for monitoring

1.Edit the configuration file

# Kibana is served by a back end server. This setting specifies the port to use.
server.port: 5601

# Specifies the address to which the Kibana server will bind. IP addresses and host names are both valid values.
# The default is 'localhost', which usually means remote machines will not be able to connect.
# To allow connections from remote users, set this parameter to a non-loopback address.
server.host: "127.0.0.1"

# Enables you to specify a path to mount Kibana at if you are running behind a proxy.
# Use the `server.rewriteBasePath` setting to tell Kibana if it should remove the basePath
# from requests it receives, and to prevent a deprecation warning at startup.
# This setting cannot end in a slash.
#server.basePath: ""

# Specifies whether Kibana should rewrite requests that are prefixed with
# `server.basePath` or require that they are rewritten by your reverse proxy.
# This setting was effectively always `false` before Kibana 6.3 and will
# default to `true` starting in Kibana 7.0.
#server.rewriteBasePath: false

# Specifies the public URL at which Kibana is available for end users. If
# `server.basePath` is configured this URL should end with the same basePath.
#server.publicBaseUrl: ""

# The maximum payload size in bytes for incoming server requests.
#server.maxPayload: 1048576

# The Kibana server's name.  This is used for display purposes.
#server.name: "your-hostname"

# The URLs of the Elasticsearch instances to use for all your queries.
#elasticsearch.hosts: ["http://localhost:9200"]
elasticsearch.hosts: ["http://localhost:9201","http://localhost:9202","http://localhost:9203"]

2.Start Kibana

(The Kibana screenshot is missing here.)

5.elasticsearch api

Create sample data

PUT users/_doc/1
{
  "name":"张飞",
  "age":30,
  "from": "China",
  "desc": "皮肤黑、武器重、性格直",
  "tags": ["黑", "重", "直"]
}

PUT users/_doc/2
{
  "name":"赵云",
  "age":18,
  "from":"China",
  "desc":"帅气逼人,一身白袍",
  "tags":["帅", "白"]
}

PUT users/_doc/3
{
  "name":"关羽",
  "age":22,
  "from":"England",
  "desc":"大刀重,骑赤兔马,胡子长",
  "tags":["重", "马","长"]
}


PUT users/_doc/4
{
  "name":"刘备",
  "age":29,
  "from":"Child",
  "desc":"大耳贼,持双剑,懂谋略",
  "tags":["剑", "大"]
}

PUT users/_doc/5
{
  "name":"貂蝉",
  "age":25,
  "from":"England",
  "desc":"闭月羞花,沉鱼落雁",
  "tags":["闭月","羞花"]
}

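5.1 CRUD: Create

Note: a PUT to an explicit id creates the document if it does not exist and overwrites it if it does. There are two ways to create a document:

// POST without an id: Elasticsearch generates a random id for the new document
POST users/_doc
{
  "user": "Mike",
  "post_date": "2020-10-24T14:39:30",
  "message": "trying out kibana"
}
// Because the POST id is random, the PUT form below, with an explicit id, is usually preferable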


PUT users/_doc/1
{
  "user": "Jack",
  "post_date": "2020-10-24T14:39:30",
  "message": "trying out Elasticsearch"
}

PUT users/_doc/2
{
  "user": "Ludy",
  "post_date": "2020-10-24T14:39:30",
  "message": "trying out Elasticsearch"
}
// Fetch the document whose id is x
GET users/_doc/x

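5.2 CRUD: Update

// In 7.x the typeless form is POST <index>/_update/<id>; the typed form POST users/_doc/3/_update is deprecated
POST users/_update/3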
{
  "doc": {
    "post_date": "2020-10-24T14:39:30",
    "message": "trying out Elasticsearch"
  }
}

5.3 CRUD: Delete

DELETE users/_doc/4

5.4 CRUD: Retrieve

The queries below were run from ElasticHD. Demo:

  • Index spring.student
  • Index spring.test
  • The type defaults to _doc for both
  • GET /spring.student/_search returns every document (index names must be lowercase)
  • GET /spring.student/_search?q=id:1 searches a single index
  • GET /spring.student,spring.test/_search?q=id:1 searches several indices at once

5.5 match query

GET users/_doc/_search
{
  "query": {
    "match": {
      "post_date": "2020-10-24T14:39:30"
    }
  }
}

5.6 term query

GET users/_doc/_search
{
  "query": {
    "term": {
      "t1": "Beautiful girl!"
    }
  }
}

5.6 Sorting

Field types that can be sorted on:

  • Numbers
  • Dates

GET users/_doc/_search
{
  "query": {
    "match": {
      "post_date": "2020-10-24T14:39:30"
    }
  },
  "sort": [
    {
      "id": {
        "order": "desc"
      }
    }
  ]
}

5.7 Paging

GET users/_doc/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "age": {
        "order": "desc"
      }
    }
  ], 
  "from": 2,
  "size": 1
}
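
The from/size pair in the query above is an offset and a page length over the sorted hits. The sketch below mimics that in memory; PagingSketch and its methods are hypothetical names, and the list of ages is assumed to be the five sample documents from the start of this section, already sorted by age in descending order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: PagingSketch is a hypothetical name, not a client API.
final class PagingSketch {

    /** Converts a zero-based page number and a page size into the "from" offset. */
    static int fromOffset(int page, int size) {
        if (page < 0 || size <= 0) {
            throw new IllegalArgumentException("page must be >= 0 and size must be > 0");
        }
        return page * size;
    }

    /** Returns the hits at positions [from, from + size) of an already sorted list. */
    static <T> List<T> page(List<T> sortedHits, int from, int size) {
        int start = Math.max(0, Math.min(from, sortedHits.size()));
        int end = (int) Math.min((long) start + Math.max(0, size), sortedHits.size());
        return new ArrayList<>(sortedHits.subList(start, end));
    }

    public static void main(String[] args) {
        // Assumed ages of the sample users, sorted descending as in the query above
        List<Integer> agesDesc = Arrays.asList(30, 29, 25, 22, 18);
        System.out.println(page(agesDesc, 2, 1));  // "from": 2, "size": 1
        System.out.println(fromOffset(3, 10));     // offset of the fourth page of 10
    }
}
```

With "from": 2 and "size": 1 the query skips the two oldest users and returns only the third one.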

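6.Boolean queries in Elasticsearch

Keyword | Meaning
must | and
should | or
must_not | not
filter | Used together with must; filters the hits without affecting the score
range | Restricts a condition to a range
gt | Greater than, like > in a relational database
gte | Greater than or equal to, like >= in a relational database
lt | Less than, like < in a relational database
lte | Less than or equal to, like <= in a relational database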
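
How the four bool clauses combine can be sketched with plain predicates. This is an in-memory illustration, not how Elasticsearch executes queries: BoolQuerySketch, Person, and the three documents are hypothetical, scoring is ignored, and should is treated as required only when there is no must or filter clause (the default minimum_should_match behavior).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical in-memory model of bool query matching; relevance scores are ignored.
final class BoolQuerySketch {

    /** Minimal stand-in for a document, with only the fields the predicates need. */
    static final class Person {
        final String name;
        final String from;
        final int age;

        Person(String name, String from, int age) {
            this.name = name;
            this.from = from;
            this.age = age;
        }
    }

    /** Returns the names of the documents that satisfy the combined clauses. */
    static List<String> search(List<Person> docs,
                               List<Predicate<Person>> must,
                               List<Predicate<Person>> should,
                               List<Predicate<Person>> mustNot,
                               List<Predicate<Person>> filter) {
        List<String> hits = new ArrayList<>();
        for (Person doc : docs) {
            boolean mustOk = must.stream().allMatch(p -> p.test(doc));        // and
            boolean filterOk = filter.stream().allMatch(p -> p.test(doc));    // and, unscored
            boolean mustNotOk = mustNot.stream().noneMatch(p -> p.test(doc)); // not
            // should (or) is only mandatory when no must/filter clause is present
            boolean shouldRequired = must.isEmpty() && filter.isEmpty() && !should.isEmpty();
            boolean shouldOk = !shouldRequired || should.stream().anyMatch(p -> p.test(doc));
            if (mustOk && filterOk && mustNotOk && shouldOk) {
                hits.add(doc.name);
            }
        }
        return hits;
    }

    /** must: from == "gu"; filter: age > 25. */
    static List<String> demo() {
        List<Person> docs = Arrays.asList(
                new Person("a", "gu", 30),
                new Person("b", "gu", 22),
                new Person("c", "sheng", 25));
        Predicate<Person> fromGu = p -> p.from.equals("gu");
        Predicate<Person> olderThan25 = p -> p.age > 25;
        List<Predicate<Person>> none = Collections.emptyList();
        return search(docs, Collections.singletonList(fromGu), none, none,
                Collections.singletonList(olderThan25));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In the demo only document a is both from "gu" and older than 25, so it is the single hit.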

6.1 must query

// One condition
GET lqz/_doc/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "from": "gu"
          }
        }
      ]
    }
  }
}
// Two conditions
GET lqz/_doc/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "from": "gu"
          }
        },
        {
          "match": {
            "age": 30
          }
        }
      ]
    }
  }
}

6.2 should query

GET lqz/_doc/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "from": "gu"
          }
        },
        {
          "match": {
            "tags": "闭月"
          }
        }
      ]
    }
  }
}

6.3 must_not query

// A document is returned only if it matches none of the three conditions
GET lqz/_doc/_search
{
  "query": {
    "bool": {
      "must_not": [
        {
          "match": {
            "from": "gu"
          }
        },
        {
          "match": {
            "tags": "可爱"
          }
        },
        {
          "match": {
            "age": 18
          }
        }
      ]
    }
  }
}

6.4 filter query

// How to find the documents whose from is gu and whose age is greater than 25
GET lqz/_doc/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "from": "gu"
          }
        }
      ],
      "filter": {
        "range": {
          "age": {
            "gt": 25
          }
        }
      }
    }
  }
}

7.Filtering query results in Elasticsearch

// Of everything in the hits, return only the name and age properties
GET lqz/_doc/_search
{
  "query": {
    "match": {
      "name": "顾老二"
    }
  },
  "_source": ["name", "age"]
}

8.Highlighting in Elasticsearch

GET lqz/_doc/_search
{
  "query": {
    "match": {
      "name": "石头"
    }
  },
  "highlight": {
    "fields": {
      "name": {}
    }
  }
}

Custom highlighting with a b tag

GET lqz/chengyuan/_search
{
  "query": {
    "match": {
      "from": "gu"
    }
  },
  "highlight": {
    "pre_tags": "<b class='key' style='color:red'>",
    "post_tags": "</b>",
    "fields": {
      "from": {}
    }
  }
}

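9.Aggregations in Elasticsearch

Aggregation functions

  • avg
  • max
  • min
  • sum

Aggregations are declared under aggs. The my_xxx key is an alias you choose for the result, and the computed value is returned under that alias: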

"aggregations" : {
    "my_avg" : {
      "value" : 27.0
    }
  }
  "aggregations" : {
    "my_max" : {
      "value" : 30.0
    }
  }
  .......

9.1 avg

GET users/_doc/_search
{
  "query": {
    "match": {
      "from": "gu"
    }
  },
  "aggs": {
    "my_avg": {
      "avg": {
        "field": "age"
      }
    }
  },
  "_source": ["name", "age"]
}
GET lqz/_doc/_search
{
  "query": {
    "match": {
      "from": "gu"
    }
  },
  "aggs": {
    "my_avg": {
      "avg": {
        "field": "age"
      }
    }
  },
  "size": 0, 
  "_source": ["name", "age"]
}

9.2 max

GET lqz/_doc/_search
{
  "query": {
    "match": {
      "from": "gu"
    }
  },
  "aggs": {
    "my_max": {
      "max": {
        "field": "age"
      }
    }
  },
  "size": 0
}

9.3 min

GET lqz/_doc/_search
{
  "query": {
    "match": {
      "from": "gu"
    }
  },
  "aggs": {
    "my_min": {
      "min": {
        "field": "age"
      }
    }
  },
  "size": 0
}

9.4 sum

GET lqz/_doc/_search
{
  "query": {
    "match": {
      "from": "gu"
    }
  },
  "aggs": {
    "my_sum": {
      "sum": {
        "field": "age"
      }
    }
  },
  "size": 0
}

9.5 Grouping with range

GET lqz/_doc/_search
{
  "size": 0, 
  "query": {
    "match_all": {}
  },
  "aggs": {
    "age_group": {
      "range": {
        "field": "age",
        "ranges": [
          {
            "from": 15,
            "to": 20
          },
          {
            "from": 20,
            "to": 25
          },
          {
            "from": 25,
            "to": 30
          }
        ]
      }
    }
  }
}

Two requirements at once: group by range and compute the average within each group

GET lqz/_doc/_search
{
  "size": 0, 
  "query": {
    "match_all": {}
  },
  "aggs": {
    "age_group": {
      "range": {
        "field": "age",
        "ranges": [
          {
            "from": 15,
            "to": 20
          },
          {
            "from": 20,
            "to": 25
          },
          {
            "from": 25,
            "to": 30
          }
        ]
      },
      "aggs": {
        "my_avg": {
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}
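
What the range aggregation with a nested avg computes can be reproduced by hand. The sketch below is an in-memory illustration: RangeAggSketch is a hypothetical name, the ages are assumed sample values, and NaN stands in for the empty result of an empty bucket. Each range includes its from value and excludes its to value.

```java
// Hypothetical in-memory version of a range aggregation with an avg sub-aggregation.
final class RangeAggSketch {

    /** Number of values that fall into the half-open range [from, to). */
    static int bucketCount(int[] ages, int from, int to) {
        int count = 0;
        for (int age : ages) {
            if (age >= from && age < to) {
                count++;
            }
        }
        return count;
    }

    /** Average of the values in [from, to), or NaN when the bucket is empty. */
    static double bucketAvg(int[] ages, int from, int to) {
        long sum = 0;
        int count = 0;
        for (int age : ages) {
            if (age >= from && age < to) {
                sum += age;
                count++;
            }
        }
        return count == 0 ? Double.NaN : (double) sum / count;
    }

    public static void main(String[] args) {
        int[] ages = {30, 18, 22, 29, 25}; // assumed sample ages
        int[][] ranges = {{15, 20}, {20, 25}, {25, 30}};
        for (int[] r : ranges) {
            System.out.printf("%d-%d: count=%d avg=%.1f%n",
                    r[0], r[1], bucketCount(ages, r[0], r[1]), bucketAvg(ages, r[0], r[1]));
        }
    }
}
```

Because the upper bound is exclusive, an age of 25 lands in the 25-30 bucket rather than 20-25, and an age of 30 falls outside all three ranges.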

10.Elasticsearch Mapping & Dynamic Mapping

10.1 mapping

GET index_name/_mapping

// A first look: index a document, then inspect the mapping Elasticsearch generated
PUT users/_doc/1
{
  "user": "Jack",
  "post_date": "2020-10-24T14:39:30",
  "message": "trying out Elasticsearch"
}


GET users/_mapping
{
  "users" : {
    "mappings" : {
      "properties" : {
        "message" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "post_date" : {
          "type" : "date"
        },
        "user" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

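Each index has exactly one mapping type (this only holds from Elasticsearch 6.x onward). A mapping type consists of:

  • Meta-fields: they customize how the metadata associated with a document is handled, for example the document's _index, _type, _id, and _source fields.
  • Fields or properties: the list of fields or properties that belong to the document.

A field mapping accepts many parameters:

  • analyzer: the analyzer to use; only supported by text fields.
  • enabled: if set to false, the data is only stored and cannot be searched or aggregated (it stays in _source). Defaults to true.
  • index: whether an inverted index is built for the field. If set to false, no inverted index is built (which saves space) and the field cannot be searched, but it can still be aggregated and still appears in _source. Defaults to true.
  • norms: whether the field supports scoring. If the field is only used for filtering and aggregation and never needs to be scored, set it to false to save space. Defaults to true.
  • doc_values: if the field will never be sorted on, aggregated, or accessed from scripts, set it to false to save disk space. Defaults to true.
  • fielddata: set it to true to sort and aggregate on text fields. Defaults to false.
  • store: defaults to false, with the data kept in _source. By default field values are indexed so they can be searched, but they are not stored separately, so a field can be queried while its original value cannot be retrieved on its own. Storing a field makes sense in some cases: for a document with a title, a date, and a very large content field, you may want to retrieve only the title and the date without extracting them from a large _source.
  • boost: boosts the score contribution of the field.
  • coerce: whether automatic type conversion, such as string to number, is enabled. Enabled by default.
  • dynamic: controls automatic mapping updates; the values are true, false, and strict.
  • eager_global_ordinals
  • fields: multi-fields. They give one field several sub-field types so that it is indexed in several different ways.
  • copy_to
  • format
  • ignore_above
  • ignore_malformed
  • index_options
  • index_phrases
  • index_prefixes
  • meta
  • normalizer
  • null_value: the value to index in place of null.
  • position_increment_gap
  • properties
  • search_analyzer
  • similarity
  • term_vector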

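10.2 dynamic mapping

How Dynamic Mapping works

  • Mappings do not have to be defined by hand; ES infers the field types from the documents it receives.
  • The inference is sometimes wrong, for example for geolocation data.
  • A wrongly inferred type can break features such as Range queries.

The dynamic setting has three modes:

  • Dynamic mapping
  • Explicit (static) mapping
  • Strict mapping
1 dynamic mapping
// Create the index nolan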
PUT nolan
{
  "mappings": {
      "properties": {
        "name": {
          "type": "text"
        },
        "age": {
          "type": "long"
        }
      }
  }
}
// Inspect the index nolan
GET nolan
{
  "nolan" : {
    "aliases" : { },
    "mappings" : {
      "properties" : {
        "age" : {
          "type" : "long"
        },
        "name" : {
          "type" : "text"
        }
      }
    },
    "settings" : {
      "index" : {
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_content"
            }
          }
        },
        "number_of_shards" : "1",
        "provided_name" : "nolan",
        "creation_date" : "1650261517356",
        "number_of_replicas" : "1",
        "uuid" : "dXUnwua2TDCI2K9hcSL98A",
        "version" : {
          "created" : "7160299"
        }
      }
    }
  }
}
// Insert a document that has an unmapped field, sex
PUT nolan/_doc/1
{
  "name": "小黑",
  "age": 18,
  "sex": "不详"
}
// Inspect the index again
GET nolan
{
  "nolan" : {
    "aliases" : { },
    "mappings" : {
      "properties" : {
        "age" : {
          "type" : "long"
        },
        "name" : {
          "type" : "text"
        },
        "sex" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_content"
            }
          }
        },
        "number_of_shards" : "1",
        "provided_name" : "nolan",
        "creation_date" : "1650261517356",
        "number_of_replicas" : "1",
        "uuid" : "dXUnwua2TDCI2K9hcSL98A",
        "version" : {
          "created" : "7160299"
        }
      }
    }
  }
}
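The example above shows that:
Elasticsearch dynamically added a mapping for sex.
By default Elasticsearch allows new fields to be added, which corresponds to dynamic: true.
// In effect, the index was created like this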
PUT nolan
{
  "mappings": {
      "dynamic":true,
      "properties": {
        "name": {
          "type": "text"
        },
        "age": {
          "type": "long"
        }
      }
  }
}
2 explicit mapping

Set dynamic to false

// Create the index nolan1
PUT nolan1
{
  "mappings": {
    "dynamic": "false",
    "properties": {
      "name": {
        "type": "text"
      },
      "age": {
        "type": "long"
      }
    }
  }
}
// Insert documents
PUT nolan1/_doc/1
{
  "name": "小黑",
  "age":18
}
PUT nolan1/_doc/2
{
  "name": "小白",
  "age": 16,
  "sex": "不详"
}
// Inspect the mapping
GET nolan1
{
  "nolan1" : {
    "aliases" : { },
    "mappings" : {
      "dynamic" : "false",
      "properties" : {
        "age" : {
          "type" : "long"
        },
        "name" : {
          "type" : "text"
        }
      }
    },
    "settings" : {
      "index" : {
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_content"
            }
          }
        },
        "number_of_shards" : "1",
        "provided_name" : "nolan1",
        "creation_date" : "1650262553126",
        "number_of_replicas" : "1",
        "uuid" : "OuUWSsQ-SUaGr2ged3FRbQ",
        "version" : {
          "created" : "7160299"
        }
      }
    }
  }
}
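Elasticsearch did not create a mapping for the new sex field.
Because of dynamic: false, when Elasticsearch sees a new field it ignores it for indexing,
but the field is still stored in _source.
3 strict mapping

Set dynamic to strict

// Create the index nolan2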
PUT nolan2
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "name": {
        "type": "text"
      },
      "age": {
        "type": "long"
      }
    }
  }
}
// Insert documents; the second request is rejected with an error
PUT nolan2/_doc/1
{
  "name": "小黑",
  "age":18
}
PUT nolan2/_doc/2
{
  "name": "小白",
  "age": 16,
  "sex": "不详"
}
When a new field is encountered, an exception is thrown
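4 Summary

Name | Setting | Value
Dynamic mapping | dynamic: true | New fields are added to the mapping automatically (the default)
Explicit mapping | dynamic: false | New fields are ignored. No mapping is added for them on top of the existing one; they only show up in the _source of query results.
Strict mode | dynamic: strict | An exception is thrown when a new field is encountered

In practice explicit mapping is used most often. It is like an HTML img tag: you add an id or class attribute only when you need one, and likewise you add a field mapping only when you need it.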
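
The three dynamic modes can be contrasted with a small model of the mapping as a set of field names. This is a sketch under simplifying assumptions: DynamicMappingSketch and Dynamic are hypothetical names, field types are ignored, and the exception message is only illustrative.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model: a mapping is reduced to the set of mapped field names.
final class DynamicMappingSketch {

    enum Dynamic { TRUE, FALSE, STRICT }

    /** Indexes one document and returns the mapped field names afterwards. */
    static Set<String> index(Set<String> mappedFields, Set<String> docFields, Dynamic mode) {
        Set<String> result = new TreeSet<>(mappedFields);
        for (String field : docFields) {
            if (result.contains(field)) {
                continue; // already mapped, nothing to decide
            }
            switch (mode) {
                case TRUE:
                    result.add(field); // the mapping grows automatically
                    break;
                case FALSE:
                    break;             // kept in _source only, the mapping is unchanged
                case STRICT:
                    throw new IllegalArgumentException(
                            "unmapped field [" + field + "] is not allowed in strict mode");
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> mapped = new HashSet<>(Arrays.asList("name", "age"));
        Set<String> doc = new HashSet<>(Arrays.asList("name", "age", "sex"));
        System.out.println(index(mapped, doc, Dynamic.TRUE));
        System.out.println(index(mapped, doc, Dynamic.FALSE));
        try {
            index(mapped, doc, Dynamic.STRICT);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Only the TRUE mode changes the mapping; FALSE accepts the document without mapping the new field, and STRICT rejects the whole document.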

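10.3 Object properties

// Create the index nolan2 with an info object field
// (if nolan2 still exists from the strict-mapping example, run DELETE nolan2 first)
PUT nolan2
{
  "mappings": {
   "dynamic": false,
   "properties": {
     "name": {
       "type": "text"
     },
     "age": {
       "type": "text"
     },
     "info": {
       "properties": {
         "addr": {
           "type": "text"
         },
         "tel": {
           "type" : "text"
         }
       }
     }
   }
  }
}
// Insert documents with a nested object
PUT nolan2/_doc/1
{
  "name":"tom",
  "age":18,
  "info":{
    "addr":"北京",
    "tel":"10010"
  }
}

PUT nolan2/_doc/21
{
  "name":"jim",
  "age":21,
  "info":{
    "addr":"东莞",
    "tel":"10086"
  }
}
// Search on a property of the nested object, using the dotted path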
GET nolan2/_doc/_search
{
  "query": {
    "match": {
      "info.tel": "10086"
    }
  }
}

10.4 Controlling whether a field is indexed

Keyword: index

  • The age property below is not indexed
PUT nolan3
{
  "mappings": {
     "dynamic": false,
     "properties": {
       "name": {
         "type": "text",
         "index": true
       },
       "age": {
         "type": "long",
         "index": false
       }
     }
  }
}

10.5 Searching for null values

1.The keyword type supports the null_value setting

PUT users
{
  "mappings" : {
	  "properties" : {
	    "firstName" : {
	     "type" : "text"
	    },
	    "lastName" : {
	     "type" : "text"
	    },
	    "mobile" : {
	     "type" : "keyword",
	     "null_value": "NULL"
	    }
	   }
  }
}

2.ignore_above

// Create the index
PUT nolan
{
  "mappings": {
      "properties":{
        "t1":{
          "type":"keyword",
          "ignore_above": 5
        },
        "t2":{
          "type":"keyword",
          "ignore_above": 10 ①
        }
      }
  }
}

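// Index a document
PUT nolan/_doc/1
{
  "t1":"elk",         ②
  "t2":"elasticsearch"     ③
}

// Query ④
GET nolan/_search
{
  "query":{
    "term": {
      "t1": "elk"
    }
  }
}

// Query ⑤
GET nolan/_search
{
  "query": {
    "term": {
      "t2": "elasticsearch"
    }
  }
}
  • ① The field ignores any string longer than 10 characters.
  • ② The document is indexed successfully: t1 can be searched and returns a result.
  • ③ This field is not indexed, so a query on it returns no results.
  • ④ Returns a result.
  • ⑤ Returns no result, because the value of t2 is longer than the ignore_above setting.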
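
The rule can be sketched in a few lines of Python. This is a simplified model with a hypothetical helper name, not the Elasticsearch implementation:

```python
def keyword_terms(value, ignore_above):
    """Terms indexed for a keyword field: strings longer than the limit are skipped."""
    return [value] if len(value) <= ignore_above else []


t1_terms = keyword_terms("elk", 5)              # 3 characters, within the limit
t2_terms = keyword_terms("elasticsearch", 10)   # 13 characters, over the limit
```

Here t1_terms holds the term "elk", while t2_terms is empty, which matches the behaviour of the two term queries above.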

11.elasticsearch settings

Setting the number of primary and replica shards

PUT nolan
{
  "mappings": {
	"properties": {
	   "name": {
	     "type": "text"
	   }
    }
  }, 
  "settings": {
    "number_of_replicas": 1,
    "number_of_shards": 5
  }
}
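  • number_of_shards is the number of primary shards (the default was 5 per index before Elasticsearch 7.0 and is 1 from 7.0 onward).
  • number_of_replicas is the number of replica shards; by default each primary shard has one replica.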

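12.elasticsearch field data types

Simple types

  • Numeric
  • Boolean
  • Date
  • Text
  • Keyword
  • Binary
  • and so on

Complex types

  • Object
  • Arrays
  • Nested: a specialized object type for arrays of objects.
  • Join: defines parent/child relationships between documents in the same index.

Specialized types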

  • Geo-point
  • Geo-shape
  • Percolator

13.cluster node

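1.Master-eligible nodes and the master node

  • After startup, every node is a master-eligible node by default.
  • A master-eligible node can take part in the master election and become the master node.
  • When the first node starts, it elects itself as the master node.
  • Only the master node can modify the cluster state.

The cluster state holds the essential information about a cluster:

  • Information about all nodes
  • All indices together with their mappings and settings
  • Shard routing information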

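2.Data Node & Coordinating Node

  • Data Node: a node that can store data. It holds the shard data and plays a key role in scaling out data.
  • Coordinating Node: receives client requests, dispatches them to the appropriate nodes, and gathers the results into a single response.

3.Shards (Primary Shard & Replica Shard)

  • Primary shard: solves the problem of horizontal scaling. Primary shards let data be distributed across all nodes in the cluster. A shard is a running Lucene instance. The number of primary shards is set when the index is created and cannot be changed afterwards, except by reindexing.
  • Replica shard: solves the problem of high availability. A replica is a copy of a primary shard. The number of replicas can be adjusted dynamically, and adding replicas can also improve read throughput to some extent.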
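
Why can the primary shard count not change? A document's shard is derived from a hash of its routing value modulo the number of primary shards, so a different count would send existing documents to different shards. A minimal sketch, assuming zlib.crc32 as a stand-in for the hash function Elasticsearch actually uses and a hypothetical helper name:

```python
import zlib


def shard_for(doc_id: str, number_of_shards: int) -> int:
    """Pick a primary shard for a document id (simplified routing model)."""
    # zlib.crc32 is only a stand-in hash for illustration (assumption).
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards


ids = [str(i) for i in range(1, 21)]
with_5 = {i: shard_for(i, 5) for i in ids}
with_6 = {i: shard_for(i, 6) for i in ids}
# Documents whose shard would differ if the index had 6 primaries instead of 5.
moved = [i for i in ids if with_5[i] != with_6[i]]
```

Any id in moved would no longer be found on the shard where it was originally written, which is why changing the number of primaries requires a reindex.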

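13.Text analysis with analyzers

After data is sent to elasticsearch, it goes through a series of steps:

  • Character filtering: character filters transform the characters.
  • Tokenization: the text is split into one or more tokens.
  • Token filtering: token filters transform each token.
  • Token indexing: the tokens are finally stored in the Lucene inverted index.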
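
The four steps above can be sketched as a tiny Python pipeline. This is a conceptual illustration only: the regular expressions and function names are assumptions, and real analyzers and Lucene's index format are far more elaborate.

```python
import re
from collections import defaultdict


def analyze(text):
    """Run a toy analysis chain: char filter -> tokenizer -> token filter."""
    text = re.sub(r"<[^>]+>", " ", text)   # 1. character filter: strip HTML-like tags
    tokens = re.findall(r"\w+", text)      # 2. tokenizer: split into word tokens
    return [t.lower() for t in tokens]     # 3. token filter: lowercase each token


def build_inverted_index(docs):
    """4. Store tokens in an inverted index: term -> ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index


docs = {1: "<p>To be or not to be</p>", 2: "That is a question"}
index = build_inverted_index(docs)
```

Looking up a term such as index["question"] returns the set of document ids that contain it, which is what makes full-text search fast.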

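13.1 Analyzers

In elasticsearch, an analyzer consists of:

  • optional character filters
  • one tokenizer
  • zero or more token filters
1. Standard analyzer

The standard analyzer is elasticsearch's default analyzer. It combines modules that are sensible defaults for most European languages: the standard tokenizer, the lowercase token filter, and the stop token filter (disabled by default, which is why stop words such as "to" and "be" remain in the output).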

POST _analyze
{
  "analyzer": "standard",
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}
// Tokens returned:
{
  "tokens" : [
    {
      "token" : "to",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "be",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "or",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "not",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "to",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "be",
      "start_offset" : 16,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "that",
      "start_offset" : 21,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "is",
      "start_offset" : 26,
      "end_offset" : 28,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "a",
      "start_offset" : 29,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "question",
      "start_offset" : 31,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "莎",
      "start_offset" : 45,
      "end_offset" : 46,
      "type" : "<IDEOGRAPHIC>",
      "position" : 10
    },
    {
      "token" : "士",
      "start_offset" : 46,
      "end_offset" : 47,
      "type" : "<IDEOGRAPHIC>",
      "position" : 11
    },
    {
      "token" : "比",
      "start_offset" : 47,
      "end_offset" : 48,
      "type" : "<IDEOGRAPHIC>",
      "position" : 12
    },
    {
      "token" : "亚",
      "start_offset" : 48,
      "end_offset" : 49,
      "type" : "<IDEOGRAPHIC>",
      "position" : 13
    }
  ]
}
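2. Simple analyzer

The simple analyzer uses only the lowercase tokenizer: it splits the text at every non-letter character and lowercases each token. It works poorly for Asian languages, which do not separate words with whitespace, so it is mostly used for European languages.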

POST _analyze
{
  "analyzer": "simple",
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}
// Tokens returned:
{
  "tokens" : [
    {
      "token" : "to",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "be",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "or",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "not",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "to",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "be",
      "start_offset" : 16,
      "end_offset" : 18,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "that",
      "start_offset" : 21,
      "end_offset" : 25,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "is",
      "start_offset" : 26,
      "end_offset" : 28,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "a",
      "start_offset" : 29,
      "end_offset" : 30,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "question",
      "start_offset" : 31,
      "end_offset" : 39,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "莎士比亚",
      "start_offset" : 45,
      "end_offset" : 49,
      "type" : "word",
      "position" : 10
    }
  ]
}
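
The behaviour is simple enough to approximate in Python. The sketch below is an approximation under the assumption that "letter" means a Unicode letter; the function name is hypothetical and the sample text is shortened.

```python
import re


def simple_analyze(text):
    """Split on anything that is not a letter, then lowercase (toy simple analyzer)."""
    # [^\W\d_]+ matches runs of Unicode letters only (no digits, no underscore).
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]


tokens = simple_analyze("To be, or not to be: 莎士比亚")
```

Note that the Chinese name stays one token, just as in the response above, because no non-letter character separates its characters.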
3. Whitespace analyzer

The whitespace analyzer just splits the text into tokens on whitespace and does nothing else. Talk about lazy!

POST _analyze
{
  "analyzer": "whitespace",
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}
// Tokens returned:
{
  "tokens" : [
    {
      "token" : "To",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "be",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "or",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "not",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "to",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "be,",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "That",
      "start_offset" : 21,
      "end_offset" : 25,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "is",
      "start_offset" : 26,
      "end_offset" : 28,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "a",
      "start_offset" : 29,
      "end_offset" : 30,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "question",
      "start_offset" : 31,
      "end_offset" : 39,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "————",
      "start_offset" : 40,
      "end_offset" : 44,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "莎士比亚",
      "start_offset" : 45,
      "end_offset" : 49,
      "type" : "word",
      "position" : 11
    }
  ]
}
4. Stop analyzer

The stop analyzer behaves much like the simple analyzer, except that it also removes stop words from the token stream.

POST _analyze
{
  "analyzer": "stop",
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}
{
  "tokens" : [
    {
      "token" : "question",
      "start_offset" : 31,
      "end_offset" : 39,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "莎士比亚",
      "start_offset" : 45,
      "end_offset" : 49,
      "type" : "word",
      "position" : 10
    }
  ]
}
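
A rough Python equivalent: tokenize like the simple analyzer, then drop stop words. The stop word set below is only an illustrative subset chosen for this sentence (an assumption), not the full English list, and the function name is hypothetical.

```python
import re

# Illustrative subset of English stop words (assumption, not the full list).
STOP_WORDS = {"a", "be", "is", "not", "or", "that", "to"}


def stop_analyze(text, stop_words=STOP_WORDS):
    """Letter-based tokenization + lowercase, then remove stop words."""
    tokens = [t.lower() for t in re.findall(r"[^\W\d_]+", text)]
    return [t for t in tokens if t not in stop_words]


tokens = stop_analyze("To be or not to be, That is a question")
```

Only "question" survives, mirroring the response above, where every other English word in the sentence was a stop word.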
5. Keyword analyzer

The keyword analyzer treats the entire field as a single token. Unless there is a real need, we do not use the keyword analyzer in mappings.

POST _analyze
{
  "analyzer": "keyword",
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}
// Tokens returned:
{
  "tokens" : [
    {
      "token" : "To be or not to be,  That is a question ———— 莎士比亚",
      "start_offset" : 0,
      "end_offset" : 49,
      "type" : "word",
      "position" : 0
    }
  ]
}
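6. Pattern analyzer

The pattern analyzer lets you specify a pattern on which the text is split into tokens. Usually, though, a custom analyzer that combines the existing pattern tokenizer with the token filters you need is the better choice.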

PUT pattern_test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer":{
          "type":"pattern",
          "pattern":"\\W|_",
          "lowercase":true
        }
      }
    }
  }
}
POST pattern_test/_analyze
{
  "analyzer": "my_email_analyzer",
  "text": "John_Smith@foo-bar.com"
}
// Tokens returned:
{
  "tokens" : [
    {
      "token" : "john",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "smith",
      "start_offset" : 5,
      "end_offset" : 10,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "foo",
      "start_offset" : 11,
      "end_offset" : 14,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "bar",
      "start_offset" : 15,
      "end_offset" : 18,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "com",
      "start_offset" : 19,
      "end_offset" : 22,
      "type" : "word",
      "position" : 4
    }
  ]
}
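
The same split can be reproduced with Python's re module, which helps when experimenting with a pattern before putting it into index settings. This is a sketch with a hypothetical function name; Elasticsearch uses Java regular expressions, which behave the same for this simple pattern.

```python
import re


def pattern_analyze(text, pattern=r"\W|_", lowercase=True):
    """Split text on the pattern, drop empty pieces, optionally lowercase."""
    tokens = [t for t in re.split(pattern, text) if t]
    return [t.lower() for t in tokens] if lowercase else tokens


tokens = pattern_analyze("John_Smith@foo-bar.com")
```

The result is the five tokens john, smith, foo, bar, and com, the same as in the response above.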
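7. Language and multi-language analyzers: chinese

elasticsearch provides a good, simple, out-of-the-box collection of language analyzers for many widely used languages: Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, Chinese, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Kurdish, Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish, and Thai.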

POST _analyze
{
  "analyzer": "chinese",
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}
{
  "tokens" : [
    {
      "token" : "question",
      "start_offset" : 31,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "莎",
      "start_offset" : 45,
      "end_offset" : 46,
      "type" : "<IDEOGRAPHIC>",
      "position" : 10
    },
    {
      "token" : "士",
      "start_offset" : 46,
      "end_offset" : 47,
      "type" : "<IDEOGRAPHIC>",
      "position" : 11
    },
    {
      "token" : "比",
      "start_offset" : 47,
      "end_offset" : 48,
      "type" : "<IDEOGRAPHIC>",
      "position" : 12
    },
    {
      "token" : "亚",
      "start_offset" : 48,
      "end_offset" : 49,
      "type" : "<IDEOGRAPHIC>",
      "position" : 13
    }
  ]
}

Other languages work as well:

POST _analyze
{
  "analyzer": "french",
  "text":"Je suis ton père"
}
POST _analyze
{
  "analyzer": "german",
  "text":"Ich bin dein vater"
}
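8. Snowball analyzer

In addition to the standard tokenizer and token filter (the same as in the standard analyzer), the snowball analyzer uses the lowercase token filter and the stop token filter. On top of that, it stems the text with the Snowball stemmer.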

POST _analyze
{
  "analyzer": "snowball",
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}
// Tokens returned:
{
  "tokens" : [
    {
      "token" : "question",
      "start_offset" : 31,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "莎",
      "start_offset" : 45,
      "end_offset" : 46,
      "type" : "<IDEOGRAPHIC>",
      "position" : 10
    },
    {
      "token" : "士",
      "start_offset" : 46,
      "end_offset" : 47,
      "type" : "<IDEOGRAPHIC>",
      "position" : 11
    },
    {
      "token" : "比",
      "start_offset" : 47,
      "end_offset" : 48,
      "type" : "<IDEOGRAPHIC>",
      "position" : 12
    },
    {
      "token" : "亚",
      "start_offset" : 48,
      "end_offset" : 49,
      "type" : "<IDEOGRAPHIC>",
      "position" : 13
    }
  ]
}

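13.2 Character filters

Name                                Value
HTML character filter               HTML Strip Char Filter
Mapping character filter            Mapping Char Filter
Pattern replace character filter    Pattern Replace Char Filter

1. HTML character filter

The HTML character filter (HTML Strip Char Filter) strips HTML elements from the text.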

POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text":"<p>I&apos;m so <b>happy</b>!</p>"
}
// Result:
{
  "tokens" : [
    {
      "token" : """

I'm so happy!

""",
      "start_offset" : 0,
      "end_offset" : 32,
      "type" : "word",
      "position" : 0
    }
  ]
}
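2. Mapping character filter

The mapping character filter (Mapping Char Filter) takes a map of keys and values. Whenever it encounters a string equal to a key, it replaces it with the value associated with that key.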

PUT pattern_test4
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer":{
          "tokenizer":"keyword",
          "char_filter":["my_char_filter"]
        }
      },
      "char_filter":{
          "my_char_filter":{
            "type":"mapping",
            "mappings":["刘备 => 666","关羽 => 888"]
          }
        }
    }
  }
}
POST pattern_test4/_analyze
{
  "analyzer": "my_analyzer",
  "text": "刘备爱惜关羽,可是后来关羽大意失荆州"
}
// Result:
{
  "tokens" : [
    {
      "token" : "666爱惜888,可是后来888大意失荆州",
      "start_offset" : 0,
      "end_offset" : 19,
      "type" : "word",
      "position" : 0
    }
  ]
}
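
Conceptually this is a string replacement table. The Python sketch below applies the replacements one key at a time, which is a simplification (the function name is hypothetical, and the real filter matches keys in a single pass over the text):

```python
def mapping_char_filter(text, mappings):
    """Replace every occurrence of each key with its mapped value (simplified)."""
    for key, value in mappings.items():
        text = text.replace(key, value)
    return text


mappings = {"刘备": "666", "关羽": "888"}
filtered = mapping_char_filter("刘备爱惜关羽,可是后来关羽大意失荆州", mappings)
```

Because the keyword tokenizer follows the character filter, the whole filtered string is emitted as a single token, as in the response above.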
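3. Pattern replace character filter

The pattern replace character filter (Pattern Replace Char Filter) uses a regular expression to match and replace characters in the string. Be careful with sloppy regular expressions, though: a poorly written pattern can hurt performance badly!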

PUT pattern_test5
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      }
    }
  }
}
POST pattern_test5/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My credit card is 123-456-789"
}
// Result:
{
  "tokens" : [
    {
      "token" : "My",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "credit",
      "start_offset" : 3,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "card",
      "start_offset" : 10,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "is",
      "start_offset" : 15,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "123_456_789",
      "start_offset" : 18,
      "end_offset" : 29,
      "type" : "<NUM>",
      "position" : 4
    }
  ]
}
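
The pattern and replacement can be tried out in Python before they go into the index settings. Python writes the group reference as \1 where the filter configuration uses $1; apart from that this small sketch (with a hypothetical function name) uses the same regular expression:

```python
import re


def pattern_replace(text):
    """Replace a hyphen that sits between digit groups with an underscore."""
    # (\d+)-   : a run of digits followed by a hyphen
    # (?=\d)   : lookahead, the hyphen must be followed by another digit
    return re.sub(r"(\d+)-(?=\d)", r"\1_", text)


result = pattern_replace("My credit card is 123-456-789")
```

With the hyphens replaced, the standard tokenizer no longer breaks the number apart and emits 123_456_789 as one token.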

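13.3 Tokenizers

Just as elasticsearch has built-in analyzers, it also ships with built-in tokenizers. As the name suggests, a tokenizer breaks a text string into small pieces, and these pieces are called tokens.

1. Standard tokenizer

The standard tokenizer is a grammar-based tokenizer that works well for most European languages. It also handles the segmentation of Unicode text, with a default maximum token length of 255 characters, and it removes punctuation such as commas and periods.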

POST _analyze
{
  "tokenizer": "standard",
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}
{
  "tokens" : [
    {
      "token" : "To",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "be",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "or",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "not",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "to",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "be",
      "start_offset" : 16,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "That",
      "start_offset" : 21,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "is",
      "start_offset" : 26,
      "end_offset" : 28,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "a",
      "start_offset" : 29,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "question",
      "start_offset" : 31,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "莎",
      "start_offset" : 45,
      "end_offset" : 46,
      "type" : "<IDEOGRAPHIC>",
      "position" : 10
    },
    {
      "token" : "士",
      "start_offset" : 46,
      "end_offset" : 47,
      "type" : "<IDEOGRAPHIC>",
      "position" : 11
    },
    {
      "token" : "比",
      "start_offset" : 47,
      "end_offset" : 48,
      "type" : "<IDEOGRAPHIC>",
      "position" : 12
    },
    {
      "token" : "亚",
      "start_offset" : 48,
      "end_offset" : 49,
      "type" : "<IDEOGRAPHIC>",
      "position" : 13
    }
  ]
}
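2. Keyword tokenizer

The keyword tokenizer is a simple tokenizer that passes the whole text to the token filters as a single token. It is a good choice when you only want to apply token filters without splitting the text.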

POST _analyze
{
  "tokenizer": "keyword",
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}
{
  "tokens" : [
    {
      "token" : "To be or not to be,  That is a question ———— 莎士比亚",
      "start_offset" : 0,
      "end_offset" : 49,
      "type" : "word",
      "position" : 0
    }
  ]
}
3. Letter tokenizer

The letter tokenizer splits the text into tokens at every non-letter character.

POST _analyze
{
  "tokenizer": "letter",
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}
{
  "tokens" : [
    {
      "token" : "To",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "be",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "or",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "not",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "to",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "be",
      "start_offset" : 16,
      "end_offset" : 18,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "That",
      "start_offset" : 21,
      "end_offset" : 25,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "is",
      "start_offset" : 26,
      "end_offset" : 28,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "a",
      "start_offset" : 29,
      "end_offset" : 30,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "question",
      "start_offset" : 31,
      "end_offset" : 39,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "莎士比亚",
      "start_offset" : 45,
      "end_offset" : 49,
      "type" : "word",
      "position" : 10
    }
  ]
}
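4. Lowercase tokenizer

The lowercase tokenizer combines the behaviour of the regular letter tokenizer and the lowercase token filter (which, as you would expect, converts every token to lowercase). The main reason for implementing this as a single tokenizer is that doing both operations in one pass performs better.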

POST _analyze
{
  "tokenizer": "lowercase",
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}
{
  "tokens" : [
    {
      "token" : "to",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "be",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "or",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "not",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "to",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "be",
      "start_offset" : 16,
      "end_offset" : 18,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "that",
      "start_offset" : 21,
      "end_offset" : 25,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "is",
      "start_offset" : 26,
      "end_offset" : 28,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "a",
      "start_offset" : 29,
      "end_offset" : 30,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "question",
      "start_offset" : 31,
      "end_offset" : 39,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "莎士比亚",
      "start_offset" : 45,
      "end_offset" : 49,
      "type" : "word",
      "position" : 10
    }
  ]
}
5. Whitespace tokenizer

The whitespace tokenizer separates tokens on whitespace, which includes spaces, tabs, and line breaks. Note, however, that the whitespace tokenizer does not remove any punctuation.

POST _analyze
{
  "tokenizer": "whitespace",
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}
{
  "tokens" : [
    {
      "token" : "To",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "be",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "or",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "not",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "to",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "be,",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "That",
      "start_offset" : 21,
      "end_offset" : 25,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "is",
      "start_offset" : 26,
      "end_offset" : 28,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "a",
      "start_offset" : 29,
      "end_offset" : 30,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "question",
      "start_offset" : 31,
      "end_offset" : 39,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "————",
      "start_offset" : 40,
      "end_offset" : 44,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "莎士比亚",
      "start_offset" : 45,
      "end_offset" : 49,
      "type" : "word",
      "position" : 11
    }
  ]
}
6. Pattern tokenizer

The pattern tokenizer lets you specify an arbitrary pattern on which the text is split into tokens.

PUT pattern_test2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer":{
          "tokenizer":"my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer":{
          "type":"pattern",
          "pattern":","
        }
      }
    }
  }
}
POST pattern_test2/_analyze
{
  "tokenizer": "my_tokenizer",
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}
{
  "tokens" : [
    {
      "token" : "To be or not to be",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "  That is a question ———— 莎士比亚",
      "start_offset" : 19,
      "end_offset" : 49,
      "type" : "word",
      "position" : 1
    }
  ]
}
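7. UAX URL email tokenizer

The UAX URL email tokenizer (uax_url_email tokenizer) works like the standard tokenizer, but keeps URLs and email addresses as single tokens.
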
POST _analyze
{
  "tokenizer": "uax_url_email",
  "text":"作者:张开来源:未知原文:邮箱:xxxxxxx@xx.com版权声明:本文为博主原创文章,转载请附上博文链接!"
}
{
  "tokens" : [
    {
      "token" : "作",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "者",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "张",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "开",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "来",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "源",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    },
    {
      "token" : "未",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "<IDEOGRAPHIC>",
      "position" : 6
    },
    {
      "token" : "知",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "<IDEOGRAPHIC>",
      "position" : 7
    },
    {
      "token" : "原",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "<IDEOGRAPHIC>",
      "position" : 8
    },
    {
      "token" : "文",
      "start_offset" : 11,
      "end_offset" : 12,
      "type" : "<IDEOGRAPHIC>",
      "position" : 9
    },
    {
      "token" : "",
      "start_offset" : 13,
      "end_offset" : 64,
      "type" : "<URL>",
      "position" : 10
    },
    {
      "token" : "邮",
      "start_offset" : 64,
      "end_offset" : 65,
      "type" : "<IDEOGRAPHIC>",
      "position" : 11
    },
    {
      "token" : "箱",
      "start_offset" : 65,
      "end_offset" : 66,
      "type" : "<IDEOGRAPHIC>",
      "position" : 12
    },
    {
      "token" : "xxxxxxx@xx.com",
      "start_offset" : 67,
      "end_offset" : 81,
      "type" : "<EMAIL>",
      "position" : 13
    },
    {
      "token" : "版",
      "start_offset" : 81,
      "end_offset" : 82,
      "type" : "<IDEOGRAPHIC>",
      "position" : 14
    },
    {
      "token" : "权",
      "start_offset" : 82,
      "end_offset" : 83,
      "type" : "<IDEOGRAPHIC>",
      "position" : 15
    },
    {
      "token" : "声",
      "start_offset" : 83,
      "end_offset" : 84,
      "type" : "<IDEOGRAPHIC>",
      "position" : 16
    },
    {
      "token" : "明",
      "start_offset" : 84,
      "end_offset" : 85,
      "type" : "<IDEOGRAPHIC>",
      "position" : 17
    },
    {
      "token" : "本",
      "start_offset" : 86,
      "end_offset" : 87,
      "type" : "<IDEOGRAPHIC>",
      "position" : 18
    },
    {
      "token" : "文",
      "start_offset" : 87,
      "end_offset" : 88,
      "type" : "<IDEOGRAPHIC>",
      "position" : 19
    },
    {
      "token" : "为",
      "start_offset" : 88,
      "end_offset" : 89,
      "type" : "<IDEOGRAPHIC>",
      "position" : 20
    },
    {
      "token" : "博",
      "start_offset" : 89,
      "end_offset" : 90,
      "type" : "<IDEOGRAPHIC>",
      "position" : 21
    },
    {
      "token" : "主",
      "start_offset" : 90,
      "end_offset" : 91,
      "type" : "<IDEOGRAPHIC>",
      "position" : 22
    },
    {
      "token" : "原",
      "start_offset" : 91,
      "end_offset" : 92,
      "type" : "<IDEOGRAPHIC>",
      "position" : 23
    },
    {
      "token" : "创",
      "start_offset" : 92,
      "end_offset" : 93,
      "type" : "<IDEOGRAPHIC>",
      "position" : 24
    },
    {
      "token" : "文",
      "start_offset" : 93,
      "end_offset" : 94,
      "type" : "<IDEOGRAPHIC>",
      "position" : 25
    },
    {
      "token" : "章",
      "start_offset" : 94,
      "end_offset" : 95,
      "type" : "<IDEOGRAPHIC>",
      "position" : 26
    },
    {
      "token" : "转",
      "start_offset" : 96,
      "end_offset" : 97,
      "type" : "<IDEOGRAPHIC>",
      "position" : 27
    },
    {
      "token" : "载",
      "start_offset" : 97,
      "end_offset" : 98,
      "type" : "<IDEOGRAPHIC>",
      "position" : 28
    },
    {
      "token" : "请",
      "start_offset" : 98,
      "end_offset" : 99,
      "type" : "<IDEOGRAPHIC>",
      "position" : 29
    },
    {
      "token" : "附",
      "start_offset" : 99,
      "end_offset" : 100,
      "type" : "<IDEOGRAPHIC>",
      "position" : 30
    },
    {
      "token" : "上",
      "start_offset" : 100,
      "end_offset" : 101,
      "type" : "<IDEOGRAPHIC>",
      "position" : 31
    },
    {
      "token" : "博",
      "start_offset" : 101,
      "end_offset" : 102,
      "type" : "<IDEOGRAPHIC>",
      "position" : 32
    },
    {
      "token" : "文",
      "start_offset" : 102,
      "end_offset" : 103,
      "type" : "<IDEOGRAPHIC>",
      "position" : 33
    },
    {
      "token" : "链",
      "start_offset" : 103,
      "end_offset" : 104,
      "type" : "<IDEOGRAPHIC>",
      "position" : 34
    },
    {
      "token" : "接",
      "start_offset" : 104,
      "end_offset" : 105,
      "type" : "<IDEOGRAPHIC>",
      "position" : 35
    }
  ]
}
8. Path hierarchy tokenizer

The path hierarchy tokenizer indexes file system paths in a particular way, so that a search returns the files that share the same path.

POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text":"/usr/local/python/python2.7"
}
{
  "tokens" : [
    {
      "token" : "/usr",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/usr/local",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/usr/local/python",
      "start_offset" : 0,
      "end_offset" : 17,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/usr/local/python/python2.7",
      "start_offset" : 0,
      "end_offset" : 27,
      "type" : "word",
      "position" : 0
    }
  ]
}
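
The token list above is simply every prefix of the path that ends at a delimiter boundary. A short Python sketch of that idea (the function name is hypothetical, and only the default "/" delimiter behaviour is modelled):

```python
def path_hierarchy(path, delimiter="/"):
    """Emit each ancestor path, from the shortest prefix to the full path."""
    parts = path.split(delimiter)
    tokens = []
    for i in range(1, len(parts) + 1):
        token = delimiter.join(parts[:i])
        if token:  # skip the empty prefix before a leading delimiter
            tokens.append(token)
    return tokens


tokens = path_hierarchy("/usr/local/python/python2.7")
```

Because /usr/local is one of the indexed tokens, a term query for /usr/local matches every file stored beneath that directory.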

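13.4 Token filters

1. Custom token filter

The custom length filter below keeps only tokens that are between 2 and 8 characters long.
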
PUT pattern_test3
{
  "settings": {
    "analysis": {
      "filter": {
        "my_test_length":{
          "type":"length",
          "max":8,
          "min":2
        }
      }
    }
  }
}
POST pattern_test3/_analyze
{
  "tokenizer": "standard",
  "filter": ["my_test_length"],
  "text":"a Small word and a longerword"
}
// Result:
{
  "tokens" : [
    {
      "token" : "Small",
      "start_offset" : 2,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "word",
      "start_offset" : 8,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "and",
      "start_offset" : 13,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}
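
In Python terms, the length filter is a plain list filter. A minimal sketch with a hypothetical function name, using whitespace splitting in place of the standard tokenizer (an assumption that is fine for this sample sentence):

```python
def length_filter(tokens, min_len=2, max_len=8):
    """Keep only tokens whose length lies within [min_len, max_len]."""
    return [t for t in tokens if min_len <= len(t) <= max_len]


tokens = length_filter("a Small word and a longerword".split())
```

Both occurrences of "a" are too short and "longerword" (10 characters) is too long, so only Small, word, and and remain.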
2. Custom lowercase token filter
PUT lowercase_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_lowercase_example": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        },
        "greek_lowercase_example": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["greek_lowercase"]
        }
      },
      "filter": {
        "greek_lowercase": {
          "type": "lowercase",
          "language": "greek"
        }
      }
    }
  }
}
POST lowercase_example/_analyze
{
  "tokenizer": "standard",
  "filter": ["greek_lowercase"],
  "text":"Ένα φίλτρο διακριτικού τύπου πεζά s ομαλοποιεί το κείμενο διακριτικού σε χαμηλότερη θήκη"
}
{
  "tokens" : [
    {
      "token" : "ενα",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "φιλτρο",
      "start_offset" : 4,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "διακριτικου",
      "start_offset" : 11,
      "end_offset" : 22,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "τυπου",
      "start_offset" : 23,
      "end_offset" : 28,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "πεζα",
      "start_offset" : 29,
      "end_offset" : 33,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "s",
      "start_offset" : 34,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "ομαλοποιει",
      "start_offset" : 36,
      "end_offset" : 46,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "το",
      "start_offset" : 47,
      "end_offset" : 49,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "κειμενο",
      "start_offset" : 50,
      "end_offset" : 57,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "διακριτικου",
      "start_offset" : 58,
      "end_offset" : 69,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "σε",
      "start_offset" : 70,
      "end_offset" : 72,
      "type" : "<ALPHANUM>",
      "position" : 10
    },
    {
      "token" : "χαμηλοτερη",
      "start_offset" : 73,
      "end_offset" : 83,
      "type" : "<ALPHANUM>",
      "position" : 11
    },
    {
      "token" : "θηκη",
      "start_offset" : 84,
      "end_offset" : 88,
      "type" : "<ALPHANUM>",
      "position" : 12
    }
  ]
}
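3. Multiple token filters

Filters are applied in the order listed. The built-in length filter is used here with its default limits, so it drops nothing; only the lowercase filter changes the output.
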
POST _analyze
{
  "tokenizer": "standard",
  "filter": ["length","lowercase"],
  "text":"a Small word and a longerword"
}
{
  "tokens" : [
    {
      "token" : "a",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "small",
      "start_offset" : 2,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "word",
      "start_offset" : 8,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "and",
      "start_offset" : 13,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "a",
      "start_offset" : 17,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "longerword",
      "start_offset" : 19,
      "end_offset" : 29,
      "type" : "<ALPHANUM>",
      "position" : 5
    }
  ]
}
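
In the request above, the length filter runs with its default bounds, so it drops nothing and only lowercase changes the output. The sketch below is a plain-Java simulation of a filter chain applied in order; it is not Elasticsearch code. The whitespace tokenizer and the bounds 2 and 8 are assumptions chosen so that the length step visibly removes tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Plain-Java simulation of a token filter chain; not Elasticsearch code.
class TokenFilterChainSketch {

    // Stand-in tokenizer: splits on whitespace (assumption, far simpler than "standard").
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Keeps only tokens whose length lies within [min, max].
    static UnaryOperator<List<String>> length(int min, int max) {
        return tokens -> {
            List<String> kept = new ArrayList<>();
            for (String t : tokens) {
                if (t.length() >= min && t.length() <= max) kept.add(t);
            }
            return kept;
        };
    }

    // Lowercases every token.
    static UnaryOperator<List<String>> lowercase() {
        return tokens -> {
            List<String> out = new ArrayList<>();
            for (String t : tokens) out.add(t.toLowerCase(Locale.ROOT));
            return out;
        };
    }

    // Applies the filters in the order they are listed.
    static List<String> analyze(String text, List<UnaryOperator<List<String>>> filters) {
        List<String> tokens = tokenize(text);
        for (UnaryOperator<List<String>> f : filters) tokens = f.apply(tokens);
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [small, word, and]: "a" is too short, "longerword" is too long.
        System.out.println(analyze("a Small word and a longerword",
                List.of(length(2, 8), lowercase())));
    }
}
```

Each filter receives the output of the previous one, so the order in the "filter" array matters whenever an earlier filter changes what a later one sees.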

13.5 IK analyzer

1. Download

1. Open GitHub, search for elasticsearch-analysis-ik, and select medcl/elasticsearch-analysis-ik:

https://github.com/medcl/elasticsearch-analysis-ik

2. The IK version must match the Elasticsearch version.

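3. In the Elasticsearch installation directory, find plugins, create an ik subdirectory, and extract the IK archive into it.

4. Restart Elasticsearch and Kibana.

2. Overview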

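Name                 Function
IKAnalyzer.cfg.xml   Configures custom dictionaries.
main.dic             IK's built-in Chinese dictionary, roughly 270,000 entries; any word listed here is kept together as one token.
surname.dic          Chinese surnames.
suffix.dic           Special (suffix) nouns, such as 乡, 江, 所, 省.
preposition.dic      Chinese prepositions and other function words, such as 不, 也, 了, 仍.
stopword.dic         English stop words, such as a, an, and, the.
quantifier.dic       Unit and measure words, such as 厘米, 件, 倍, 像素.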

3. Testing

Tokenization

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "上海自来水来自海上"
}
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "今天是个好日子"
}

Query

GET ik1/_search
{
  "query": {
    "match_phrase": {
      "content": "今天"
    }
  }
}
GET ik1/_search
{
  "query": {
    "match_phrase_prefix": {
      "content": {
        "query": "今天好日子",
        "slop": 2
      }
    }
  }
}

14. Forward index and inverted index
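
A forward index maps each document ID to the terms the document contains; an inverted index maps each term to the IDs of the documents that contain it, which is the structure a full-text lookup uses. The following is a minimal plain-Java sketch of the two structures, not how Elasticsearch or Lucene stores them. The document IDs, the sample texts, and the whitespace tokenization are all made up for illustration.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy forward and inverted indexes built from a few in-memory documents.
class InvertedIndexSketch {

    // Forward index: document ID -> terms in that document (lowercased, split on whitespace).
    static Map<Integer, List<String>> forwardIndex(Map<Integer, String> docs) {
        Map<Integer, List<String>> forward = new TreeMap<>();
        docs.forEach((id, text) ->
                forward.put(id, List.of(text.toLowerCase(Locale.ROOT).split("\\s+"))));
        return forward;
    }

    // Inverted index: term -> sorted IDs of the documents containing that term.
    static Map<String, SortedSet<Integer>> invertedIndex(Map<Integer, String> docs) {
        Map<String, SortedSet<Integer>> inverted = new TreeMap<>();
        forwardIndex(docs).forEach((id, terms) -> {
            for (String term : terms) {
                inverted.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        });
        return inverted;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = Map.of(
                1, "Elasticsearch is a search engine",
                2, "Solr is a search platform");
        System.out.println(forwardIndex(docs).get(2));          // terms of document 2
        System.out.println(invertedIndex(docs).get("search"));  // [1, 2]
        System.out.println(invertedIndex(docs).get("solr"));    // [2]
    }
}
```

Looking up a term in the inverted index is a single map access, whereas answering the same question from the forward index would require scanning every document.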

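15. Data modeling

16. Secure internal cluster communication

Encrypting data

  • Prevents packet capture from leaking sensitive information
  • Verifies identity, keeping impostor nodes out
  • What is protected: data and cluster state

Creating certificates for nodes: TLS requires X.509 certificates issued by a trusted Certificate Authority (CA). The verification modes are:

  • Certificate: a node must present a certificate signed by the same CA in order to join.
  • Full Verification: a node must present a certificate signed by the same CA, and its host name or IP address is verified as well.
  • No Verification: any node can join; intended for diagnostics in development environments.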
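
The three modes can be read as a simple decision rule. The sketch below is a toy Java model of that rule as described in the list above; it is not Elasticsearch code, and the class, enum, and parameter names are hypothetical.

```java
// Toy model of the three verification modes described above; not Elasticsearch code.
class VerificationModeSketch {

    enum Mode { NONE, CERTIFICATE, FULL }

    // signedBySameCa: the joining node's certificate was issued by the cluster's CA.
    // hostMatches: the host name or IP address in the certificate matches the node.
    static boolean mayJoin(Mode mode, boolean signedBySameCa, boolean hostMatches) {
        switch (mode) {
            case NONE:
                return true;                          // nothing is checked
            case CERTIFICATE:
                return signedBySameCa;                // CA signature only
            case FULL:
                return signedBySameCa && hostMatches; // CA signature plus host name/IP
            default:
                throw new IllegalArgumentException("unknown mode: " + mode);
        }
    }

    public static void main(String[] args) {
        System.out.println(mayJoin(Mode.CERTIFICATE, true, false)); // true
        System.out.println(mayJoin(Mode.FULL, true, false));        // false
        System.out.println(mayJoin(Mode.NONE, false, false));       // true
    }
}
```

The startup commands in this section use the certificate mode, so only the CA signature decides whether a node is admitted.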
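# Generate certificates

# Create a certificate authority for your Elasticsearch cluster, for example with the elasticsearch-certutil ca command:
bin/elasticsearch-certutil ca

# Generate a certificate and private key for each node in the cluster, for example with the elasticsearch-certutil cert command:
bin/elasticsearch-certutil cert --ca elastic-stack-ca.p12

# Copy the generated elastic-certificates.p12 into the config/certs directory
mkdir -p config/certs
cp elastic-certificates.p12 config/certs/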

bin/elasticsearch -E node.name=node0 -E cluster.name=es -E path.data=node0_data -E http.port=9200 -E xpack.security.enabled=true -E xpack.security.transport.ssl.enabled=true -E xpack.security.transport.ssl.verification_mode=certificate -E xpack.security.transport.ssl.keystore.path=certs/elastic-certificates.p12 -E xpack.security.transport.ssl.truststore.path=certs/elastic-certificates.p12

bin/elasticsearch -E node.name=node1 -E cluster.name=es -E path.data=node1_data -E http.port=9201 -E xpack.security.enabled=true -E xpack.security.transport.ssl.enabled=true -E xpack.security.transport.ssl.verification_mode=certificate -E xpack.security.transport.ssl.keystore.path=certs/elastic-certificates.p12 -E xpack.security.transport.ssl.truststore.path=certs/elastic-certificates.p12

# A node that does not provide a certificate cannot join
bin/elasticsearch -E node.name=node2 -E cluster.name=es -E path.data=node2_data -E http.port=9202 -E xpack.security.enabled=true -E xpack.security.transport.ssl.enabled=true -E xpack.security.transport.ssl.verification_mode=certificate

elasticsearch.yml configuration

#xpack.security.transport.ssl.enabled: true
#xpack.security.transport.ssl.verification_mode: certificate

#xpack.security.transport.ssl.keystore.path: certs/elastic-certificates.p12
#xpack.security.transport.ssl.truststore.path: certs/elastic-certificates.p12