Elasticsearch: Queries Without Tokenization and a Look at the Available Analyzers

Reposted

网线小游侠 2024-05-28 17:23:08


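An earlier chapter covered installing the analyzer plugin, but I had never looked closely at how analyzers are actually used. I spent some time on it today and am writing the results down for future reference.

English tokenization

With no analyzer specified, the _analyze API falls back to the standard analyzer, which for English text effectively splits on whitespace and punctuation. The request looks like this: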

POST http://10.140.188.135:9200/_analyze
{
    "text": "hello word"
}

Response:
{
    "tokens": [
        {
            "token": "hello",
            "start_offset": 0,
            "end_offset": 5,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "word",
            "start_offset": 6,
            "end_offset": 10,
            "type": "<ALPHANUM>",
            "position": 1
        }
    ]
}

The analyzer does not check whether a token is a real English word; the text is split the same way regardless:

POST http://10.140.188.135:9200/_analyze
{
    "text": "nihao word"
}

Response:
{
    "tokens": [
        {
            "token": "nihao",
            "start_offset": 0,
            "end_offset": 5,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "word",
            "start_offset": 6,
            "end_offset": 10,
            "type": "<ALPHANUM>",
            "position": 1
        }
    ]
}
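The requests above can also be issued from a script. The following is a minimal sketch using only the Python standard library. The helper names build_analyze_request, token_strings and analyze are made up for this note, and the optional analyzer argument is simply left out when the default analyzer is wanted.

```python
import json
import urllib.request


def build_analyze_request(base_url, text, analyzer=None):
    """Build (but do not send) a POST request for the _analyze API."""
    body = {"text": text}
    if analyzer is not None:
        # Omitting "analyzer" lets Elasticsearch use its default analyzer.
        body["analyzer"] = analyzer
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_analyze",
        data=json.dumps(body, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_strings(response_body):
    """Extract just the token text from an _analyze response (str or bytes)."""
    return [item["token"] for item in json.loads(response_body)["tokens"]]


def analyze(base_url, text, analyzer=None, timeout=5):
    """Send the request and return the token strings (needs a reachable cluster)."""
    request = build_analyze_request(base_url, text, analyzer)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return token_strings(response.read())
```

Calling analyze("http://10.140.188.135:9200", "hello word") against the node used in this post should return the same tokens as the first response above, provided the node is reachable.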

Chinese tokenization

With the default analyzer, Chinese text is tokenized as follows:

POST http://10.140.188.135:9200/_analyze
{
    "text": "生活如此美丽"
}

Response:
{
    "tokens": [
        {
            "token": "生",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "活",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "如",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "此",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "美",
            "start_offset": 4,
            "end_offset": 5,
            "type": "<IDEOGRAPHIC>",
            "position": 4
        },
        {
            "token": "丽",
            "start_offset": 5,
            "end_offset": 6,
            "type": "<IDEOGRAPHIC>",
            "position": 5
        }
    ]
}

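The default analyzer treats every Chinese character as a separate token, which is clearly not what we want. Instead, we can specify the analyzer explicitly.

Specifying an analyzer

We installed the IK analyzer earlier. It offers two modes, ik_max_word and ik_smart. ik_max_word performs the finest-grained segmentation, while ik_smart performs the coarsest-grained segmentation. The examples below, including the custom dictionary section, make the difference concrete.

Tokenizing with ik_max_word: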

POST http://10.140.188.135:9200/_analyze
{  
    "analyzer": "ik_max_word",
    "text": "生活如此美好"  
}

Response:
{
    "tokens": [
        {
            "token": "生活",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "如此",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "美好",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 2
        }
    ]
}

Tokenizing with ik_smart:

POST http://10.140.188.135:9200/_analyze
{  
    "analyzer": "ik_smart",
    "text": "生活如此美好"  
}

Response:
{
    "tokens": [
        {
            "token": "生活",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "如此",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "美好",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 2
        }
    ]
}

For this sentence the two modes produce the same result. A different input shows the difference:

POST http://10.140.188.135:9200/_analyze
{  
    "analyzer": "ik_max_word",
    "text": "中华人民共和国"  
}

Response:
{
    "tokens": [
        {
            "token": "中华人民共和国",
            "start_offset": 0,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "中华人民",
            "start_offset": 0,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "中华",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "华人",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "人民共和国",
            "start_offset": 2,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "人民",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 5
        },
        {
            "token": "共和国",
            "start_offset": 4,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 6
        },
        {
            "token": "共和",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 7
        },
        {
            "token": "国",
            "start_offset": 6,
            "end_offset": 7,
            "type": "CN_CHAR",
            "position": 8
        }
    ]
}


POST http://10.140.188.135:9200/_analyze
{  
    "analyzer": "ik_smart",
    "text": "中华人民共和国"  
}

Response:
{
    "tokens": [
        {
            "token": "中华人民共和国",
            "start_offset": 0,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 0
        }
    ]
}

As the responses show, ik_max_word segments the text more finely and emits overlapping tokens, while ik_smart produces coarser tokens.

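Custom dictionary

Domain-specific terms have to be configured by hand, and the IK analyzer supports custom dictionaries for this. The steps are as follows.

In the config folder under the IK analyzer's installation directory there are several files with the .dic extension and a file named IKAnalyzer.cfg.xml. IKAnalyzer.cfg.xml is the configuration file, and the .dic files are the dictionaries.

Create a new file named mytest.dic and enter the terms that should be recognized, one term per line: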

生活如
此美好
活如
此美
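If the dictionary is generated rather than typed by hand, a small script keeps the format consistent. This is a minimal sketch, assuming (as far as I know, IK expects this) that the file should be plain UTF-8 text with one term per line. The function name write_ik_dictionary is made up for this note.

```python
from pathlib import Path


def write_ik_dictionary(path, words):
    """Write terms to a .dic file, one per line, UTF-8, skipping blanks and duplicates."""
    seen = set()
    unique_words = []
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            unique_words.append(word)
    # Explicit encoding so the result does not depend on the platform default.
    Path(path).write_text("\n".join(unique_words) + "\n", encoding="utf-8")
    return unique_words
```

For example, write_ik_dictionary("mytest.dic", ["生活如", "此美好", "活如", "此美"]) writes the four lines shown above. The file then still has to be placed in the IK config folder.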

Edit IKAnalyzer.cfg.xml:

<!-- Users can configure their own extension dictionaries here -->
	<entry key="ext_dict">mytest.dic</entry>
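Before restarting, it can be worth checking that the entry really contains what we expect. The sketch below reads the ext_dict entry from the configuration text with the standard library XML parser. The function name configured_ext_dicts is made up for this note, and the semicolon handling assumes that several dictionaries are listed in one entry separated by ";".

```python
import xml.etree.ElementTree as ET


def configured_ext_dicts(cfg_xml):
    """Return the dictionary paths listed in the ext_dict entry (empty list if none)."""
    root = ET.fromstring(cfg_xml)
    for entry in root.iter("entry"):
        if entry.get("key") == "ext_dict":
            # Assumption: multiple dictionaries are separated by semicolons.
            return [part.strip() for part in (entry.text or "").split(";") if part.strip()]
    return []
```

Passing the contents of IKAnalyzer.cfg.xml to configured_ext_dicts should return ["mytest.dic"] after the change above.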

After restarting the Elasticsearch service, the same queries return the following:

POST http://10.140.188.135:9200/_analyze
{  
    "analyzer": "ik_max_word",
    "text": "生活如此美好"  
}

Response:
{
    "tokens": [
        {
            "token": "生活如",
            "start_offset": 0,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "生活",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "活如",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "如此",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "此美好",
            "start_offset": 3,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "此美",
            "start_offset": 3,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 5
        },
        {
            "token": "美好",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 6
        }
    ]
}

POST http://10.140.188.135:9200/_analyze
{  
    "analyzer": "ik_smart",
    "text": "生活如此美好"  
}

Response:
{
    "tokens": [
        {
            "token": "生活如",
            "start_offset": 0,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "此美好",
            "start_offset": 3,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 1
        }
    ]
}

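The text is now tokenized the way we configured it. Which mode to use depends on your own requirements.

(The original files in the config directory are the default dictionaries. If "中华人民共和国" is deleted from them, the result above, in which "中华人民共和国" appears as a single token, no longer occurs.)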
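To build some intuition for why editing a dictionary changes the output, here is a toy model that emits every dictionary word found at every position, longest first. It is not IK's actual implementation, only a sketch that happens to reproduce the ik_max_word responses shown in this post. BASE_WORDS is a hypothetical three-word stand-in for IK's default dictionary, and the function name all_dictionary_matches is made up for this note.

```python
def all_dictionary_matches(text, dictionary):
    """Toy model: list every substring of text that is a dictionary word."""
    tokens = []
    for start in range(len(text)):
        # Try the longest candidate first, as in the responses above.
        for end in range(len(text), start, -1):
            word = text[start:end]
            if word in dictionary:
                tokens.append({"token": word, "start_offset": start, "end_offset": end})
    return tokens


# Hypothetical stand-in for the default dictionary (the real one is far larger).
BASE_WORDS = {"生活", "如此", "美好"}
# The entries added in mytest.dic.
CUSTOM_WORDS = {"生活如", "此美好", "活如", "此美"}


def token_texts(text, dictionary):
    return [t["token"] for t in all_dictionary_matches(text, dictionary)]
```

With only BASE_WORDS, token_texts("生活如此美好", BASE_WORDS) gives ["生活", "如此", "美好"]. With BASE_WORDS | CUSTOM_WORDS it gives ["生活如", "生活", "活如", "如此", "此美好", "此美", "美好"], the same tokens in the same order as the ik_max_word response after the restart. Removing a word from the dictionary removes the corresponding token in the same way.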
