
Reposted

blueice 2024-07-30 13:57:36


Elasticsearch certification review notes

https://www.elastic.co/guide/cn/elasticsearch/guide/current/getting-started.html

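##Inverted index

Elasticsearch uses a structure called an inverted index, which is designed for fast full-text search. An inverted index consists of a list of all the unique terms that appear in any document and, for each term, a list of the documents in which it appears.

For example, suppose we have two documents, each with a content field containing the following: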

1. The quick brown fox jumped over the lazy dog
2. Quick brown foxes leap over lazy dogs in summer

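To create an inverted index, we first split the content field of each document into separate words (which we call terms or tokens), create a sorted list of all the unique terms, and then list which documents each term appears in. The result looks like this: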

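Term      Doc_1  Doc_2
-------------------------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |
-------------------------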
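The construction described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Lucene stores its index: the function name `build_inverted_index` is made up for this example, and the tokenizer is a plain whitespace split with no normalization, so `Quick` and `quick` remain separate terms.

```python
from collections import defaultdict

# The two example documents from the text (content field only)
docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

def build_inverted_index(documents):
    """Map each distinct term to the sorted list of document IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.split():  # naive whitespace tokenizer, no normalization
            postings[term].add(doc_id)
    # Sort the terms so the result is a sorted list of unique terms
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

index = build_inverted_index(docs)
for term, ids in index.items():
    print(f"{term:<8}{ids}")
```

A search for a term is then a dictionary lookup, for example `index["brown"]`, instead of a scan over every document.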

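Note: documents in Elasticsearch are structured JSON documents made of fields and values. In fact, every indexed field in a JSON document has its own inverted index. That inverted index holds much more than the list of documents in which a term appears: it also stores the number of documents containing each term, the number of times a term appears in a given document, the order of terms in each document, the length of each document, the average length of all documents, and so on. These statistics let Elasticsearch determine which terms are more important than others and which documents are more important than others.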

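Addendum: every field has its own inverted index, and each of these is made up of many segments (a segment is a Lucene concept; each segment is itself an inverted index). There are many segments because a segment is immutable: once written, it is never modified. Updates are therefore handled by adding segments. When we index or update documents, Elasticsearch creates a new segment instead of rewriting the whole index, and a search visits every segment and merges the results.

This is why newly indexed documents become searchable quickly. How quickly? In Elasticsearch, the process of writing and opening a new segment is called a refresh. By default each shard refreshes automatically once per second, so an indexed document normally becomes visible within about one second; to see it sooner, call the refresh API explicitly.

A refresh does not flush to disk (no fsync); it only writes to the filesystem cache. To avoid losing data on power failure, Elasticsearch relies on the translog, which keeps a durable record of every operation that has not yet been flushed to disk and is replayed for recovery when Elasticsearch restarts. The translog is also used to provide real-time CRUD: when a document is accessed by ID, Elasticsearch checks the translog first and only then the index.

How durable is the translog? By default it is fsynced to disk every 5 seconds and also after every write request completes (e.g. index, delete, update, bulk), so the client does not receive a 200 OK until the translog has reached disk. With these defaults, acknowledged writes should not be lost. If you do lose data, check whether the index's translog settings (for example index.translog.durability or index.translog.sync_interval) have been changed: GET /YOUR_INDEX_NAME/_settings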
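The interplay of immutable segments, refresh and the translog can be illustrated with a toy in-memory model. This is a simplified sketch under stated assumptions, not Elasticsearch's implementation: the class `ToyShard` and its methods are invented for this example, and it leaves out deletes, updates, segment merging and the flush that eventually truncates the translog.

```python
class ToyShard:
    """Toy model: immutable segments, an in-memory buffer, and a translog."""

    def __init__(self):
        self.segments = []   # each segment maps term -> frozenset of doc IDs
        self.buffer = {}     # doc_id -> text, indexed but not yet searchable
        self.translog = []   # operations recorded for recovery

    def index(self, doc_id, text):
        self.translog.append(("index", doc_id, text))  # record the operation first
        self.buffer[doc_id] = text

    def refresh(self):
        """Turn the buffer into a new segment; existing segments are never modified."""
        if not self.buffer:
            return
        postings = {}
        for doc_id, text in self.buffer.items():
            for term in text.lower().split():
                postings.setdefault(term, set()).add(doc_id)
        self.segments.append({t: frozenset(ids) for t, ids in postings.items()})
        self.buffer = {}

    def search(self, term):
        """Visit every segment and merge the matches."""
        hits = set()
        for segment in self.segments:
            hits |= segment.get(term.lower(), frozenset())
        return sorted(hits)

shard = ToyShard()
shard.index(1, "The quick brown fox")
before = shard.search("quick")   # the document is not searchable before a refresh
shard.refresh()
shard.index(2, "Quick brown foxes")
shard.refresh()
after = shard.search("quick")
print(before, after, len(shard.segments))
```

Each refresh adds one segment and leaves the earlier ones untouched, which is why a search has to consult all of them.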

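##Analyzers

An analyzer is made up of three parts: character filters, a tokenizer, and token filters.

Character filters:

First, the string is passed through each character filter in turn. Their job is to tidy up the string before tokenization. A character filter can be used to strip out HTML or to convert & into and.

Tokenizer:

Next, the string is split into individual terms by the tokenizer. A simple tokenizer might split the text into terms whenever it encounters whitespace or punctuation.

Token filters:

Finally, each term is passed through each token filter in turn. This step can change terms (for example, lowercasing Quick), remove terms (for example, stopwords such as a, and, the), or add terms (for example, synonyms such as jump and leap).

Elasticsearch provides character filters, tokenizers, and token filters out of the box. They can be combined into custom analyzers for different purposes.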
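The three stages can be mimicked with ordinary Python functions. The sketch below is only an analogy for how the stages are chained: none of these functions are Elasticsearch APIs, the regular expressions are rough stand-ins for the real html_strip character filter and standard tokenizer, and the stopword list is an assumption chosen for the example.

```python
import re

def strip_html(text):
    """Character filter: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)

def ampersand_to_and(text):
    """Character filter: replace & with the word 'and'."""
    return text.replace("&", " and ")

def simple_tokenizer(text):
    """Tokenizer: split on whitespace and punctuation."""
    return re.findall(r"\w+", text)

def lowercase(tokens):
    """Token filter: change terms by lowercasing them."""
    return [t.lower() for t in tokens]

def remove_stopwords(tokens, stopwords=frozenset({"a", "the"})):
    """Token filter: remove terms that appear in the stopword list."""
    return [t for t in tokens if t not in stopwords]

def analyze(text, char_filters, tokenizer, token_filters):
    """Run character filters, then the tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

tokens = analyze(
    "<p>The QUICK & brown fox</p>",
    char_filters=[strip_html, ampersand_to_and],
    tokenizer=simple_tokenizer,
    token_filters=[lowercase, remove_stopwords],
)
print(tokens)  # ['quick', 'and', 'brown', 'fox']
```

The order matters: the stopword filter runs after lowercasing, so `The` is removed even though the stopword list only contains lowercase entries.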

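##Index management
Create an index (several mapping types in one index, as shown here, are accepted only by Elasticsearch 5.x and earlier; 6.x allows a single type per index, and later versions drop mapping types):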
PUT /my_index
{
    "settings": { ... any settings ... },
    "mappings": {
        "type_one": { ... any mappings ... },
        "type_two": { ... any mappings ... },
        ...
    }
}
PUT /my_index_test
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  },
  "mappings": {
    "man": {
      "properties": {
        "word_count": {
          "type": "integer"
        },
        "author": {
          "type": "keyword"
        },
        "title": {
          "type": "text"
        },
        "publish_date": {
          "type": "date",
          "format": "yyyy-MM-dd HH:mm:ss || yyyy-MM-dd || epoch_millis"
        }
      }
    },
    "woman": {
      "properties": {
        "word_count": {
          "type": "integer"
        },
        "author": {
          "type": "keyword"
        },
        "title": {
          "type": "text"
        },
        "publish_date": {
          "type": "date",
          "format": "yyyy-MM-dd HH:mm:ss || yyyy-MM-dd || epoch_millis"
        }
      }
    }
  }
}
 
View index information:
GET /_cat/indices/my_index_test?v
 
Update index settings:
PUT /my_temp_index/_settings
{
    "number_of_replicas": 1
}
 
Delete an index:
DELETE /my_index
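Outside a console such as Kibana Dev Tools, the same requests are plain HTTP calls. The following sketch builds them with Python's standard library without sending anything. The helper name `es_request` and the address `http://localhost:9200` are assumptions for illustration, and the create request includes only the settings part of the body shown above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: a local test node

def es_request(method, path, body=None):
    """Build (but do not send) an HTTP request for the Elasticsearch REST API."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if body is not None else {}
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)

create = es_request("PUT", "/my_index_test",
                    {"settings": {"number_of_shards": 3, "number_of_replicas": 1}})
info = es_request("GET", "/_cat/indices/my_index_test?v")
update = es_request("PUT", "/my_temp_index/_settings", {"number_of_replicas": 1})
delete = es_request("DELETE", "/my_index")

for req in (create, info, update, delete):
    print(req.get_method(), req.full_url)

# To actually send one of them (requires a running cluster):
# with urllib.request.urlopen(create) as resp:
#     print(resp.read().decode("utf-8"))
```

Keeping request construction separate from sending makes it possible to inspect the method, URL and body before anything touches a cluster, which is useful for a destructive call such as DELETE.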
 
 
##Configuring analyzers
Configure an analyzer: this creates a new analyzer called es_std that uses the predefined Spanish stopword list
PUT /spanish_docs
{
    "settings": {
        "analysis": {
            "analyzer": {
                "es_std": {
                    "type":      "standard",
                    "stopwords": "_spanish_"
                }
            }
        }
    }
}
 
 
 
 
Test the analyzer:
GET /spanish_docs/_analyze
{
  "analyzer": "es_std",
  "text": "El veloz zorro marrón"
}
 
GET /_analyze
{
  "analyzer": "standard",
  "text": "Text to analyze"
}
 
GET /website/_analyze
{
  "field": "title",
  "text": "Black-cats"
}
 
Custom analyzers:
PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": { ... custom character filters ... },
            "tokenizer":   { ...    custom tokenizers     ... },
            "filter":      { ...   custom token filters   ... },
            "analyzer":    { ...    custom analyzers      ... }
        }
    }
}
 
 
 
 
PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": {
                "&_to_and": {
                    "type":       "mapping",
                    "mappings": [ "&=> and "]
            }},
            "filter": {
                "my_stopwords": {
                    "type":       "stop",
                    "stopwords": [ "the", "a" ]
            }},
            "analyzer": {
                "my_analyzer": {
                    "type":         "custom",
                    "char_filter":  [ "html_strip", "&_to_and" ],
                    "tokenizer":    "standard",
                    "filter":       [ "lowercase", "my_stopwords" ]
            }}
}}}
Test the analyzer:
GET /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The quick & brown fox"
}
 
##View the mapping
GET /website/_mapping/blog
 
 
##Update the mapping
PUT /website/_mapping/blog
{
  "properties" : {
    "tag" : {
      "type" :    "keyword"
    }
  }
}
 
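##Parent-child document mapping
#_parent applies to Elasticsearch 5.x and earlier; from 6.0 the join field type replaces it
#Minimal version: declares only the parent-child relationship, with no field mappings (the request further down defines them)
PUT /company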
{
  "mappings": {
    "branch": {},
    "employee": {
      "_parent": {
        "type": "branch"
      }
    }
  }
}
 
#Full version with field mappings; _parent sets branch as the parent of the employee type
PUT /company
{
  "mappings": {
    "branch": {
      "properties": {
        "city": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "country": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "name": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    },
    "employee": {
      "_parent": {
        "type": "branch"
      },
      "properties": {
        "dob": {
          "type": "date"
        },
        "hobby": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "name": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}
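The mapping above repeats the same text-plus-keyword field definition five times. When such a body is generated by a program, the repetition can be factored out, as in the sketch below. The helper name `text_with_keyword` is made up for this example, and the resulting dictionary is only meant to match the request body shown above.

```python
import json

def text_with_keyword(ignore_above=256):
    """A text field with a keyword sub-field, as repeated in the mapping above."""
    return {
        "type": "text",
        "fields": {"keyword": {"type": "keyword", "ignore_above": ignore_above}},
    }

company_mappings = {
    "mappings": {
        "branch": {
            "properties": {
                name: text_with_keyword() for name in ("city", "country", "name")
            }
        },
        "employee": {
            "_parent": {"type": "branch"},  # branch is the parent type of employee
            "properties": {
                "dob": {"type": "date"},
                "hobby": text_with_keyword(),
                "name": text_with_keyword(),
            },
        },
    }
}

print(json.dumps(company_mappings, indent=2))
```

Calling the helper once per field returns a fresh dictionary each time, so changing one field later does not affect the others.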
