Elasticsearch is the most powerful full-text search tool out there (bar none), and Chinese/English tokenization is practically a must-have feature. Below is a quick walkthrough of installing the analyzers (detailed guides abound online; this post only sketches the overall approach and steps):
1. Download the Chinese/pinyin analyzers
IK Chinese analyzer: https://github.com/medcl/elasticsearch-analysis-ik
Pinyin analyzer: https://github.com/medcl/elasticsearch-analysis-pinyin
(Remarkably, both are by the same author, who also maintains mmseg and Simplified/Traditional Chinese conversion libraries; quietly watching those repos too.)
2. Install
- Under the repo's Releases, find the zip that matches your ES version, or grab the source and build it yourself with mvn package; you can also build from the latest master.
- Go to the plugins directory under your Elasticsearch install; mkdir pinyin; cd pinyin
- cp the zip you just built into the pinyin directory and unzip it (see the shell sketch after this list)
- After deploying, remember to restart the ES node
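Putting those steps together, a minimal shell sketch; the install path and plugin version here are assumptions for illustration, so adjust them to your environment:

cd /usr/share/elasticsearch/plugins        # assumed install path
mkdir pinyin && cd pinyin
cp ~/elasticsearch-analysis-pinyin/target/releases/elasticsearch-analysis-pinyin-5.6.4.zip .   # assumed version
unzip elasticsearch-analysis-pinyin-5.6.4.zip && rm elasticsearch-analysis-pinyin-5.6.4.zip
systemctl restart elasticsearch            # restart so the plugin is picked up

Alternatively, newer ES releases ship an official tool that handles all of this: run bin/elasticsearch-plugin install followed by the release zip URL.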
3. Configure
**Settings configuration** (analysis settings have to be supplied at index creation, or the index must be closed before they can be updated):
PUT my_index
{
  "settings" : {
    "index" : {
      "number_of_shards" : 3,
      "number_of_replicas" : 1,
      "analysis" : {
        "analyzer" : {
          "default" : {
            "tokenizer" : "ik_max_word"
          },
          "pinyin_analyzer" : {
            "tokenizer" : "my_pinyin"
          }
        },
        "tokenizer" : {
          "my_pinyin" : {
            "type" : "pinyin",
            "keep_separate_first_letter" : false,
            "lowercase" : true,
            "limit_first_letter_length" : 16,
            "keep_original" : true,
            "keep_full_pinyin" : true
          }
        }
      }
    }
  }
}
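If my_index already exists, the analysis settings cannot be changed while the index is open (and number_of_shards can never be changed after creation), so close it first. A minimal sketch of the close/update/open cycle:

POST my_index/_close
PUT my_index/_settings
{
  "analysis" : {
    "analyzer" : {
      "pinyin_analyzer" : { "tokenizer" : "my_pinyin" }
    },
    "tokenizer" : {
      "my_pinyin" : { "type" : "pinyin" }
    }
  }
}
POST my_index/_open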
**Mapping configuration** (the type name in the body must match the one in the URL):
PUT my_index/index_type/_mapping
{
  "index_type" : {
    "_all" : {
      "analyzer" : "ik_max_word"
    },
    "properties" : {
      "name" : {
        "type" : "text",
        "analyzer" : "ik_max_word",
        "include_in_all" : true,
        "fields" : {
          "pinyin" : {
            "type" : "text",
            "term_vector" : "with_positions_offsets",
            "analyzer" : "pinyin_analyzer",
            "boost" : 10.0
          }
        }
      }
    }
  }
}
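With this mapping in place, a single query can hit both the IK-analyzed name field and the boosted pinyin subfield. A sketch using multi_match, assuming only the fields defined above:

GET my_index/index_type/_search
{
  "query" : {
    "multi_match" : {
      "query" : "liu de hua",
      "fields" : ["name", "name.pinyin"]
    }
  }
}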
4. Test
Use the _analyze API to verify that the analyzers run correctly:
GET my_index/_analyze
{
  "text" : ["刘德华"],
  "analyzer" : "pinyin_analyzer"
}
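The IK analyzer can be checked the same way by swapping the analyzer name:

GET my_index/_analyze
{
  "text" : ["刘德华"],
  "analyzer" : "ik_max_word"
}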
Index some Chinese data into the index:
curl -XPOST 'http://localhost:9200/my_index/index_type' -H 'Content-Type: application/json' -d '
{
  "name" : "刘德华"
}'
Chinese tokenization test (via query string):
curl 'http://localhost:9200/my_index/index_type/_search?q=name:刘'
curl 'http://localhost:9200/my_index/index_type/_search?q=name:刘德'
Pinyin test (via query string):
curl 'http://localhost:9200/my_index/index_type/_search?q=name.pinyin:liu'
curl 'http://localhost:9200/my_index/index_type/_search?q=name.pinyin:ldh'
curl 'http://localhost:9200/my_index/index_type/_search?q=name.pinyin:de+hua'
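Depending on your shell and terminal encoding, raw Chinese characters in the query string may be mangled before they reach ES; letting curl do the URL encoding is more robust, for example:

curl -G 'http://localhost:9200/my_index/index_type/_search' --data-urlencode 'q=name:刘德'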