数据库
ElasticSearch

ES - 分词

安装IK分词器插件
测试分词器
自定义词库
Elasticsearch-Rest-Client

一个tokenizer（分词器）接收一个字符流，将之分割为独立的tokens（词元，通常是独立的单词），然后输出tokens流。

例如：whitespace tokenizer遇到空白字符时分割文本。它会将文本"Quick brown fox!"分割为[Quick,brown,fox!]

ElasticSearch提供了很多内置的分词器（标准分词器），可以用来构建custom analyzers（自定义分词器）。

官方文档：https://www.elastic.co/guide/en/elasticsearch/reference/7.17/analysis.html

POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 Brown-Foxes bone."
}

1
2
3
4
5

执行结果：

{
  "tokens" : [
    {
      "token" : "the",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "2",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<NUM>",
      "position" : 1
    },
    {
      "token" : "brown",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "foxes",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "bone",
      "start_offset" : 18,
      "end_offset" : 22,
      "type" : "<ALPHANUM>",
      "position" : 4
    }
  ]
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39

对于中文，我们需要安装额外的分词器

# 安装IK分词器插件

下载地址：https://github.com/medcl/elasticsearch-analysis-ik/releases

需要版本号对应

创建目录elasticsearch/plugins/ik，并在目录下解压，之后重启
在命令行执行elasticsearch-plugin list命令，确认ik插件安装成功

# 测试分词器

使用默认分词器

GET _analyze
{
   "text":"我是中国人"
}

1
2
3
4

使用分词器

GET _analyze
{
   "analyzer": "ik_smart", 
   "text":"我是中国人"
}

1
2
3
4
5

另外一个分词器

GET _analyze
{
   "analyzer": "ik_max_word", 
   "text":"我是中国人"
}

1
2
3
4
5

# 自定义词库

修改/elasticsearch/plugins/ik/config中的IKAnalyzer.cfg.xml

根据自己需求修改，如果是配置自己的扩展字典（本地），则在ik/config目录下新建文件

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer 扩展配置</comment>
	<!--用户可以在这里配置自己的扩展字典 -->
	<entry key="ext_dict"></entry>
	 <!--用户可以在这里配置自己的扩展停止词字典-->
	<entry key="ext_stopwords"></entry>
	<!--用户可以在这里配置远程扩展字典 -->
	<entry key="remote_ext_dict">http://192.168.56.10/es/fenci.txt</entry> 
	<!--用户可以在这里配置远程扩展停止词字典-->
	<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

1
2
3
4
5
6
7
8
9
10
11
12
13

修改完成后，需要重启elasticsearch容器，否则修改不生效。

更新完成后，es只会对于新增的数据用更新分词。

历史数据是不会重新分词的。如果想要历史数据重新分词，需要执行：

POST my_index/_update_by_query?conflicts=proceed

# Elasticsearch-Rest-Client

java-rest-high：https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/java-rest-high.html

安装：https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/java-rest-high-getting-started-initialization.html

使用api：https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/java-rest-high-document-index.html

在 GitHub 上编辑此页

最后更新: 2022/08/07, 3:08:00

← ES - 索引管理 MongoDB - 知识体系→