Original post by: Zheng Jiawei (郑佳伟)
I've been working on search and recommendation lately, so I'm writing up some related material to share.
A key tool in search is Elasticsearch (ES for short). ES builds an inverted index: you send it the terms you want to search for and it returns the matching articles. As for why we don't just query the database directly: full-text lookups in a database are too slow. For the details, see 《为什么需要 Elasticsearch》 (Why you need Elasticsearch): https://zhuanlan.zhihu.com/p/73585202.
In a typical application the data lives in a database, so before you can search it in ES you have to import it from the database into ES. This article introduces a tool for that job, Monstache. It covers how to install and use Monstache, and shares a Docker image and a docker-compose file you can use directly.
Steps to get searchable content into ES
- Sync the data in MongoDB to Elasticsearch (ES)
- Rebuild the index in ES
Prerequisites
Create a MongoDB replica set: see the companion article docker 部署 MongoDB 副本集 (deploying a MongoDB replica set with Docker).
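A replica set is required because Monstache tails the MongoDB oplog (or change streams), which only exist on replica sets; a standalone mongod will not work. For a quick local test, a single-member replica set is enough. A minimal sketch (the image tag and the 27011 port mapping are my assumptions, adjust to your setup):
docker run -d --name mongo-rs -p 27011:27017 mongo:4.2 --replSet rs0
docker exec -it mongo-rs mongo --eval 'rs.initiate()' # initialize the single-member replica set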
Set up ES and install the analyzer plugin
Method: pull an image that already bundles ES 7.5 and the hanlp analyzer.
Note: the ES version and the hanlp plugin version must match.
tomczhen/elasticsearch-with-hanlp:7.5.0 # image name
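For reference, one way to start the container (a minimal sketch: single-node discovery and the standard 9200/9300 ports are my assumptions, adjust to your environment):
docker run -d --name es-hanlp -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" tomczhen/elasticsearch-with-hanlp:7.5.0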
Syncing data from MongoDB to Elasticsearch
Quick method
The full process is fairly involved, so for readers who don't have the patience for the whole article, here is the docker-compose file up front.
For installing docker-compose, see lianglin: docker-compose的安装; for how to use it, see 纯洁的微笑: Docker(四):Docker 三剑客之 Docker Compose.
version: '3.7'
services:
  monstache:
    # image source
    image: zhengjiawei001/monstache
    container_name: zjw_monstache
    restart: always
    command: bash -c "source /etc/profile && cd /usr/local/monstache && monstache --mongo-url mongodb://10.30.89.124:27011 --elasticsearch-url http://10.30.89.124:9200 -f config.toml"
networks:
  default:
    external:
      name: serving-database_default
Note 1: any parameter that goes in config.toml can also be supplied on the command line, e.g. --mongo-url mongodb://10.30.89.124:27011 sets the MongoDB address and --elasticsearch-url http://10.30.89.124:9200 sets the ES address.
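With the file saved as docker-compose.yml, bring the service up and follow its logs (standard docker-compose usage):
docker-compose up -d
docker-compose logs -f monstache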
The full set of parameters (monstache's help output):
Usage of monstache:
  -change-stream-namespace value
      A list of change stream namespaces
  -cluster-name string
      Name of the monstache process cluster
  -config-database-name string
      The MongoDB database name that monstache uses to store metadata
  -debug
      True to enable verbose debug information
  -delete-index-pattern string
      An Elasticsearch index-pattern to restrict the scope of stateless deletes
  -delete-strategy value
      Strategy to use for deletes. 0=stateless, 1=stateful, 2=ignore
  -direct-read-bounded
      True to limit direct reads to the docs present at query start time
  -direct-read-concur int
      Max number of direct-read-namespaces to read concurrently. By default all given are read concurrently
  -direct-read-dynamic-exclude-regex string
      A regex to use for excluding namespaces when using dynamic direct reads
  -direct-read-dynamic-include-regex string
      A regex to use for including namespaces when using dynamic direct reads
  -direct-read-namespace value
      A list of direct read namespaces
  -direct-read-no-timeout
      True to set the no cursor timeout flag for direct reads
  -direct-read-split-max int
      Max number of times to split a collection for direct reads
  -direct-read-stateful
      True to mark direct read namespaces as complete and not sync them in future runs
  -disable-change-events
      True to disable listening for changes. You must provide direct-reads in this case
  -disable-delete-protection
      True to disable delete protection and allow multiple deletes in Elasticsearch per event in MongoDB
  -disable-file-pipeline-put
      True to disable auto-creation of the ingest plugin pipeline
  -dropped-collections
      True to delete indexes from dropped collections (default true)
  -dropped-databases
      True to delete indexes from dropped databases (default true)
  -elasticsearch-client-timeout int
      Number of seconds before a request to Elasticsearch is timed out
  -elasticsearch-max-bytes int
      Number of bytes to hold before flushing to Elasticsearch
  -elasticsearch-max-conns int
      Elasticsearch max connections
  -elasticsearch-max-docs int
      Number of docs to hold before flushing to Elasticsearch
  -elasticsearch-max-seconds int
      Number of seconds before flushing to Elasticsearch
  -elasticsearch-password string
      The elasticsearch password for basic auth
  -elasticsearch-pem-file string
      Path to a PEM file for secure connections to elasticsearch
  -elasticsearch-retry
      True to retry failed request to Elasticsearch
  -elasticsearch-url value
      A list of Elasticsearch URLs
  -elasticsearch-user string
      The elasticsearch user name for basic auth
  -elasticsearch-validate-pem-file
      Set to boolean false to not validate the Elasticsearch PEM file (default true)
  -elasticsearch-version string
      Specify elasticsearch version directly instead of getting it from the server
  -enable-easy-json
      True to enable easy-json serialization
  -enable-http-server
      True to enable an internal http server
  -enable-oplog
      True to enable direct tailing of the oplog
  -enable-patches
      True to include a json-patch field on updates
  -env-delimiter string
      A delimiter to use when splitting environment variable values (default ",")
  -exit-after-direct-reads
      True to exit the program after reading directly from the configured namespaces
  -f string
      Location of configuration file
  -fail-fast
      True to exit if a single _bulk request fails
  -file-downloaders int
      GridFs download go routines
  -file-highlighting
      True to enable the ability to highlight search matches for a file query
  -file-namespace value
      A list of file namespaces
  -graylog-addr string
      Send logs to a Graylog server at this address
  -gzip
      True to enable gzip for requests to Elasticsearch
  -http-server-addr string
      The address the internal http server listens on
  -index-as-update
      True to index documents as updates instead of overwrites
  -index-files
      True to index gridfs files into elasticsearch. Requires the elasticsearch mapper-attachments (deprecated) or ingest-attachment plugin
  -index-oplog-time
      True to add date/time information from the oplog to each document when indexing
  -index-stats
      True to index stats in elasticsearch
  -mapper-plugin-path string
      The path to a .so file to load as a document mapper plugin
  -max-file-size int
      GridFs file content exceeding this limit in bytes will not be indexed in Elasticsearch
  -merge-patch-attribute string
      Attribute to store json-patch values under
  -mongo-config-url string
      MongoDB config server connection URL
  -mongo-oplog-collection-name string
      Override the collection name which contains the mongodb oplog
  -mongo-oplog-database-name string
      Override the database name which contains the mongodb oplog
  -mongo-url string
      MongoDB server or router server connection URL
  -namespace-drop-exclude-regex string
      A regex which is matched against a drop operation's namespace (<database>.<collection>). Only drop operations which do not match are synched to elasticsearch
  -namespace-drop-regex string
      A regex which is matched against a drop operation's namespace (<database>.<collection>). Only drop operations which match are synched to elasticsearch
  -namespace-exclude-regex string
      A regex which is matched against an operation's namespace (<database>.<collection>). Only operations which do not match are synched to elasticsearch
  -namespace-regex string
      A regex which is matched against an operation's namespace (<database>.<collection>). Only operations which match are synched to elasticsearch
  -oplog-date-field-format string
      Format to use for the oplog date
  -oplog-date-field-name string
      Field name to use for the oplog date
  -oplog-ts-field-name string
      Field name to use for the oplog timestamp
  -patch-namespace value
      A list of patch namespaces
  -pipe-allow-disk
      True to allow MongoDB to use the disk for pipeline options with lots of results
  -post-processors int
      Number of post-processing go routines
  -pprof
      True to enable pprof endpoints
  -print-config
      Print the configuration and then exit
  -prune-invalid-json
      True to omit values which do not serialize to JSON such as +Inf and -Inf and thus cause errors
  -relate-buffer int
      Number of relates to queue before skipping and reporting an error
  -relate-threads int
      Number of threads dedicated to processing relationships
  -replay
      True to replay all events from the oplog and index them in elasticsearch
  -resume
      True to capture the last timestamp of this run and resume on a subsequent run
  -resume-from-earliest-timestamp
      Automatically select an earliest timestamp to resume syncing from
  -resume-from-timestamp int
      Timestamp to resume syncing from
  -resume-name string
      Name under which to load/store the resume state. Defaults to 'default'
  -resume-strategy value
      Strategy to use for resuming. 0=timestamp, 1=token
  -resume-write-unsafe
      True to speed up writes of the last timestamp synched for resuming at the cost of error checking
  -routing-namespace value
      A list of namespaces that override routing information
  -stats
      True to print out statistics
  -stats-duration string
      The duration after which stats are logged
  -stats-index-format string
      time.Time supported format to use for the stats index names
  -time-machine-direct-reads
      True to index the results of direct reads into any time machine indexes
  -time-machine-index-prefix string
      A prefix to prepend to time machine indexes
  -time-machine-index-suffix string
      A suffix to append to time machine indexes
  -time-machine-namespace value
      A list of time machine namespaces
  -tpl
      True to interpret the config file as a template
  -v
      True to print the version number
  -verbose
      True to output verbose messages
  -worker string
      The name of this worker in a multi-worker configuration
  -workers value
      A list of worker names
Another very important parameter is direct-read-namespaces, which specifies the collections to sync. Because this image was packaged for my own needs, it syncs the zhuanlan and articles collections of the zhihu_new database by default.
Note 2: the reference article passes the ES address as --elasticsearch-urls, but the version used here takes --elasticsearch-url (the config-file key is still elasticsearch-urls); this is probably a version difference.
Detailed method
If you want to set everything up yourself, follow these steps:
I. Set up the Monstache environment
1. Install Go and configure the environment variables
(1) Download and unpack the Go tarball
wget https://dl.google.com/go/go1.14.4.linux-amd64.tar.gz
tar -C /usr/local -xzf go1.14.4.linux-amd64.tar.gz
(2) Configure the environment variables
Open the profile with vim /etc/profile and append the lines below. GOPROXY points at the Alibaba Cloud Go module proxy.
export GOROOT=/usr/local/go
export GOPATH=/home/go/
export PATH=$PATH:$GOROOT/bin:$GOPATH/bin
export GOPROXY=https://mirrors.aliyun.com/goproxy/
(3) Apply the environment variables
source /etc/profile
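A quick check that the toolchain is now on the PATH:
go version
# expected output: go version go1.14.4 linux/amd64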
2. Install Monstache
(1) Enter the install path
cd /usr/local/
(2) Clone the repository from GitHub
git clone https://github.com/rwynn/monstache.git
Note: if cloning from GitHub is slow, a Gitee (码云) mirror is usually much faster.
(3) Enter the monstache directory
cd monstache
(4) Switch to the right branch
This article uses the rel5 branch as an example.
git checkout rel5
(5) Install Monstache
go install
(6) Check the Monstache version
monstache -v
# expected output
# 5.5.5
II. Sync the data in MongoDB to Elasticsearch
Method: real-time sync with Monstache
(1) Enter the Monstache install directory and open the config file
cd /usr/local/monstache/
vim config.toml
(2) Edit the config file along the lines of the example below.
Note:
- mongo-url, elasticsearch-urls, and direct-read-namespaces will likely need changing
- if the contents of direct-read-namespaces change, update the [[mapping]] sections to match
# connection settings
# connect to MongoDB using the following URL
mongo-url = "mongodb://10.30.89.124:27011" # update as needed
# connect to the Elasticsearch REST API at the following node URLs
elasticsearch-urls = ["http://10.30.89.124:9200"] # update as needed
# frequently required settings
# if you need to seed an index from a collection and not just listen and sync changes events
# you can copy entire collections or views from MongoDB to Elasticsearch
direct-read-namespaces = ["zhihu_new.zhuanlan","zhihu_new.articles"] # update as needed
# if you want to use MongoDB change streams instead of legacy oplog tailing use change-stream-namespaces
# change streams require at least MongoDB API 3.6+
# if you have MongoDB 4+ you can listen for changes to an entire database or entire deployment
# in this case you usually don't need regexes in your config to filter collections unless you target the deployment.
# to listen to an entire db use only the database name. For a deployment use an empty string.
#change-stream-namespaces = ["mydb.col"]
# additional settings
# if you don't want to listen for changes to all collections in MongoDB but only a few
# e.g. only listen for inserts, updates, deletes, and drops from mydb.mycollection
# this setting does not initiate a copy, it is only a filter on the change event listener
#namespace-regex = '^mydb\.col$'
# compress requests to Elasticsearch
#gzip = true
# generate indexing statistics
#stats = true
# index statistics into Elasticsearch
#index-stats = true
# use the following PEM file for connections to MongoDB
#mongo-pem-file = "/path/to/mongoCert.pem"
# disable PEM validation
#mongo-validate-pem-file = false
# use the following user name for Elasticsearch basic auth
elasticsearch-user = "elastic"
# use the following password for Elasticsearch basic auth
elasticsearch-password = "<your_es_password>"
# use 4 go routines concurrently pushing documents to Elasticsearch
elasticsearch-max-conns = 4
# use the following PEM file to connections to Elasticsearch
#elasticsearch-pem-file = "/path/to/elasticCert.pem"
# validate connections to Elasticsearch
#elasticsearch-validate-pem-file = true
# propagate dropped collections in MongoDB as index deletes in Elasticsearch
dropped-collections = true
# propagate dropped databases in MongoDB as index deletes in Elasticsearch
dropped-databases = true
# do not start processing at the beginning of the MongoDB oplog
# if you set the replay to true you may see version conflict messages
# in the log if you had synced previously. This just means that you are replaying old docs which are already
# in Elasticsearch with a newer version. Elasticsearch is preventing the old docs from overwriting new ones.
#replay = false
# resume processing from a timestamp saved in a previous run
resume = true
# do not validate that progress timestamps have been saved
#resume-write-unsafe = false
# override the name under which resume state is saved
#resume-name = "default"
# use a custom resume strategy (tokens) instead of the default strategy (timestamps)
# tokens work with MongoDB API 3.6+ while timestamps work only with MongoDB API 4.0+
resume-strategy = 0
# exclude documents whose namespace matches the following pattern
#namespace-exclude-regex = '^mydb\.ignorecollection$'
# turn on indexing of GridFS file content
#index-files = true
# turn on search result highlighting of GridFS content
#file-highlighting = true
# index GridFS files inserted into the following collections
#file-namespaces = ["users.fs.files"]
# print detailed information including request traces
verbose = true
# enable clustering mode
cluster-name = 'es-cn-mp91kzb8m00******'
# do not exit after full-sync, rather continue tailing the oplog
#exit-after-direct-reads = false
[[mapping]] # update the mappings to match direct-read-namespaces
namespace = "zhihu_new.articles"
index = "articles"
type = "collection"
[[mapping]] # update the mappings to match direct-read-namespaces
namespace = "zhihu_new.zhuanlan"
index = "zhuanlan"
type = "collection"
Details of each setting can be found in the parameter list above.
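Before kicking off a long sync, it can be worth asking Monstache to echo back the configuration it actually parsed, using the -print-config flag from the parameter list above:
monstache -f config.toml -print-config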
(3) Run Monstache to start syncing MongoDB to ES
monstache -f config.toml
# Note: -f explicitly specifies the configuration file; with verbose on, all debug logs are printed (including request traces to ES).
# Time reference for the migration: about 0.303 GB took roughly 13 minutes.
(4) Prebuilt image (optional)
To skip the setup steps above, you can use the prepackaged image directly.
Image name: zhengjiawei001/monstache
docker run -it --network=serving-database_default zhengjiawei001/monstache /bin/bash
# the network must match mongo's bridge network (serving-database_default in this article's example)
Inside the container:
source /etc/profile # apply the environment variables
cd /usr/local/monstache/ # enter the monstache directory
vim config.toml # set the ES address, the mongo address, and the databases to migrate
monstache -f config.toml # run Monstache
Rebuilding the index in ES
Tool: Kibana
With ES's default analyzer, Chinese text is indexed character by character; here we switch to word-level indexing with the hanlp analyzer.
(1) Recreate the index (adding the analyzer)
GET zhuanlan/_mapping # view the existing mapping and copy it
PUT zhuanlan_new # after each "type" : "text" that needs word segmentation, specify the analyzer: "analyzer" : "hanlp"
{
"mappings" : {
"properties" : {
"accept_submission" : {
"type" : "boolean"
},
"articles_count" : {
"type" : "long"
},
"column_type" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
},
"analyzer" : "hanlp"
},
"comment_permission" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
},
"analyzer" : "hanlp"
},
"created" : {
"type" : "long"
},
"description" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
},
"analyzer" : "hanlp"
},
"followers" : {
"type" : "long"
},
"id" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"image_url" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"intro" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"title" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
},
"analyzer" : "hanlp"
},
"type" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
},
"analyzer" : "hanlp"
},
"updated" : {
"type" : "long"
},
"url" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"url_token" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
},
"analyzer" : "hanlp"
}
}
}
}
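Before copying any data, it is worth checking that the hanlp analyzer is actually resolvable on the new index. A quick sanity check in Kibana (the sample text is arbitrary); if the plugin is installed correctly, the returned tokens are words rather than single characters:
GET zhuanlan_new/_analyze
{
  "analyzer" : "hanlp",
  "text" : "搜索推荐系统"
}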
(2) Copy the old index into the new one (the example below uses the articles index; zhuanlan is handled the same way)
POST /_reindex?wait_for_completion=false
{
"source": {
"index": "articles",
"size":5000
},
"dest": {
"index": "articles_new",
"op_type": "create"
},
"conflicts": "proceed"
}
Parameter notes:
* If the reindex will take a long time, add wait_for_completion=false so that a taskId is returned immediately; you can then use the taskId to check on the task's progress.
* op_type controls how write conflicts are handled. With op_type set to create, a document that already exists in the new index raises a version conflict error; otherwise it is created. With op_type set to index (the _reindex default), every document is indexed, overwriting any existing copy.
* conflicts: when a version conflict occurs, _reindex aborts by default (so the copy may be left incomplete). Setting conflicts to proceed lets the copy from the old index to the new one continue past conflicts.
* size: the batch size; by default each batch is 1000 documents.
Output:
{
"task" : "bEECFrEzTv-zaWKADdtRWw:29118"
}
# 29118 is the task id
Note: if the source index is large (millions of documents), the _reindex can take a long time to finish. There is no need to sit and wait for the result; if you want to check progress along the way, query the _tasks API.
GET /_tasks/bEECFrEzTv-zaWKADdtRWw:29118
When the task has finished, completed is true.
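The response has roughly the following shape (the values here are illustrative, not real output):
{
  "completed" : true,
  "task" : {
    "node" : "bEECFrEzTv-zaWKADdtRWw",
    "id" : 29118,
    "status" : {
      "total" : 120000,
      "created" : 120000,
      "batches" : 24
    }
  }
}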
(3) Delete the old index
DELETE /zhuanlan
(4) Point an alias at the new index
POST /_aliases
{
"actions":[
{
"add":{
"index":"zhuanlan_new",
"alias":"zhuanlan"
}
}
]
}
(5) Test it: search with a word
GET _search
{
"query":{
"term":{
"content":{
"value":"运营"
}
}
}
}
The search returns the matching documents.
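Note that term does not analyze the query string, so the search above only matches if hanlp indexed 运营 as a single token. For ordinary full-text search, a match query, which does analyze its input, is usually the better choice; a sketch against the description field from the mapping above:
GET zhuanlan/_search
{
  "query" : {
    "match" : {
      "description" : "内容运营"
    }
  }
}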