When Solr or Elasticsearch comes up in an interview, the interviewer almost always asks about the framework underneath them, Lucene, and the more detail you can give, the better. So let's dissect how Lucene works, and then use it to build a Baidu-style search site.
Why is Lucene so fast
Inverted index
Straight to the point: the inverted index is the core of Lucene. It finds records by attribute value: each entry in this kind of index holds an attribute value together with the addresses of all the records that have that value. Because the records are located from the attribute values, rather than the attribute values from the records, it is called an inverted index. Unlike a traditional forward index (which tells you, for a given document, which words occur at which positions), an inverted index tells you, for a given word, which documents contain it.
Term-document matrix
The reason Lucene is so fast is that it builds this inverted index before any search happens, which is far more efficient than the LIKE-style fuzzy matching of a traditional database.
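To make that concrete, here is a toy sketch in plain Java (not Lucene code; the three documents and the whitespace "analysis" are made up for the example) of what an inverted index fundamentally is: a map from each term to the IDs of the documents that contain it.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ToyInvertedIndex {
    public static void main(String[] args) {
        String[] docs = { "the quick brown fox", "the lazy dog", "quick dog" };
        Map<String, List<Integer>> index = new HashMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            // naive "analysis": split on whitespace
            for (String term : docs[docId].split(" ")) {
                index.computeIfAbsent(term, k -> new ArrayList<>()).add(docId);
            }
        }
        // answer "which documents contain this word?" with one map lookup
        System.out.println(index.get("quick")); // [0, 2]
        System.out.println(index.get("dog"));   // [1, 2]
    }
}

A query is now a single map lookup instead of a scan over every document. That is the trade Lucene makes: more work at index time for far less work at query time.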
Two concepts: document and field
document
The sources you supply are individual records: they might be text files, strings, or rows from a database table. Once indexed, each record is stored in the index files as a Document, and search results are likewise returned as a list of Documents.
field
A Document can contain several information fields. An article, for example, might carry "title", "body" and "last modified" fields, and these are stored in the Document as Field objects. A Field has two options: stored and indexed. The stored option controls whether the Field's value is kept in the index so it can be read back; the indexed option controls whether the Field is made searchable. This may sound like stating the obvious, but choosing the right combination of the two matters in practice.
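As a quick illustration, here is how the common combinations look with Lucene 4.x field types (a sketch; the field names and values are invented for the example, and the field classes live in org.apache.lucene.document):

Document doc = new Document();
// tokenized and indexed but not stored: searchable, cannot be read back from the index
doc.add(new TextField("body", "full text goes here", Field.Store.NO));
// indexed as one untokenized term and stored: good for exact-match keys such as URLs
doc.add(new StringField("url", "http://example.com", Field.Store.YES));
// stored only, never indexed: can be read back but never matched by a query
doc.add(new StoredField("rawSize", 1024L));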
How Lucene works: the service Lucene provides really has two parts, one "in" and one "out". "In" means writing: taking the source you supply (essentially strings) and writing it into the index, or removing it from the index. "Out" means reading: providing full-text search so users can locate sources by keyword.
The write flow
1. The source string first goes through the analyzer: it is tokenized into individual terms, and stopwords are removed (optional).
2. The information needed from the source is added to the fields of a Document; fields that need indexing are indexed, and fields that need storing are stored.
3. The index is written to storage, which can be memory or disk.
The read flow
1. The user supplies a search keyword, which goes through the analyzer.
2. The processed keyword is looked up in the index to find the matching Documents.
3. The user extracts whatever fields they need from the Documents found.
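Putting the two flows together, here is a minimal self-contained sketch against an in-memory index (Lucene 4.x API, the same version the project below uses):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class InOutDemo {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        // "in": analyze and write one document
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_4_9, new StandardAnalyzer(Version.LUCENE_4_9)));
        Document doc = new Document();
        doc.add(new TextField("content", "Lucene builds an inverted index", Store.YES));
        writer.addDocument(doc);
        writer.close();
        // "out": analyze the keyword the same way, then search the index
        IndexReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        Query query = new QueryParser(Version.LUCENE_4_9, "content",
                new StandardAnalyzer(Version.LUCENE_4_9)).parse("inverted");
        for (ScoreDoc sd : searcher.search(query, 10).scoreDocs) {
            System.out.println(searcher.doc(sd.doc).get("content"));
        }
        reader.close();
    }
}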
Lucene's scoring formula
Lucene's scoring formula determines the order of the search results. However, the formula is quite involved:
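For reference, the classic TF-IDF similarity (documented in Lucene's TFIDFSimilarity class) computes roughly:

score(q,d) = coord(q,d) · queryNorm(q) · Σ_{t in q} [ tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) ]

where coord rewards documents that match more of the query's terms, and norm folds in index-time boosts and field length.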
We only need to remember a couple of the terms in it:
TF (term frequency): how often the term occurs within a single document.
IDF (inverse document frequency): a measure of how rare the term is across the whole collection, roughly the log of (total number of documents / number of documents that contain the term); the rarer the term, the higher its weight.
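A quick worked example, assuming Lucene's DefaultSimilarity (where tf = √frequency and idf = 1 + ln(numDocs / (docFreq + 1))): a term occurring 4 times in a document gives tf = √4 = 2, and if 9 of 1000 indexed documents contain the term, idf = 1 + ln(1000 / 10) ≈ 5.6. Since idf is squared in the formula above, rare terms dominate the score.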
Reference: https://www.cnblogs.com/forfuture1978/archive/2010/03/07/1680007.html
A search-engine site built on Lucene:
Crawling the data:
I crawled a military news site here:
wget -o /tmp/wget.log -P /root/data --no-parent --no-verbose -m -D www.tiexue.net -N --convert-links --random-wait --no-check-certificate -A html,HTML http://www.tiexue.net/
I put the downloaded site on the E: drive.
The following six classes all go in the same package:
LuceneController:
package com.michael.lucene;
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.servlet.ModelAndView;
import org.wltea.analyzer.lucene.IKAnalyzer;
@Controller
public class LuceneController {

    CreateIndex createIndex = new CreateIndex();

    @RequestMapping(value = "/index")
    public ModelAndView index(String searchWord, int num) {
        ModelAndView mav = new ModelAndView();
        mav.setViewName("index");
        if (null == searchWord) {
            return mav;
        }
        try {
            Directory directory = FSDirectory.open(new File(CreateIndex.indexDir));
            IndexReader indexReader = DirectoryReader.open(directory);
            IndexSearcher indexSearcher = new IndexSearcher(indexReader);
            // parse the keyword against all three fields at once
            // (alternative: a single-field QueryParser on "content" only)
            MultiFieldQueryParser mqp = new MultiFieldQueryParser(Version.LUCENE_4_9,
                    new String[] { "title", "content", "url" },
                    new StandardAnalyzer(Version.LUCENE_4_9));
            Query query = mqp.parse(searchWord);
            TopDocs search = indexSearcher.search(query, 10);
            int count = search.totalHits;
            ScoreDoc[] scoreDocs = search.scoreDocs;
            System.out.println(search.totalHits);
            PageUtils<HtmlBean> page = new PageUtils<HtmlBean>(num, 10, count);
            List<HtmlBean> ls = new ArrayList<>();
            for (ScoreDoc scoreDoc : scoreDocs) {
                Document document = indexReader.document(scoreDoc.doc);
                // wrap matched terms in red so they stand out on the page
                SimpleHTMLFormatter sf = new SimpleHTMLFormatter("<font color=\"red\">", "</font>");
                // score fragments against "content": the field we actually highlight below
                QueryScorer qs = new QueryScorer(query, "content");
                Highlighter highlighter = new Highlighter(sf, qs);
                String title = document.get("title");
                String content = highlighter.getBestFragment(new IKAnalyzer(), "content", document.get("content"));
                String url = document.get("url");
                HtmlBean htmlBean = new HtmlBean();
                htmlBean.setTitle(title);
                htmlBean.setContent(content);
                htmlBean.setUrl(url);
                ls.add(htmlBean);
            }
            page.setList(ls);
            mav.addObject("page", page);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return mav;
    }

    @RequestMapping("/createIndex")
    public String create() {
        File file = new File(CreateIndex.indexDir);
        if (file.exists()) {
            // File.delete() cannot remove a non-empty directory; clear the old index files instead
            for (File old : file.listFiles()) {
                old.delete();
            }
        }
        file.mkdirs();
        createIndex.createHtmlIndex();
        return "create";
    }
}
CreateIndex:
package com.michael.lucene;
import java.io.File;
import java.util.Collection;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.filefilter.TrueFileFilter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
public class CreateIndex {

    public static final String indexDir = "E:\\Lucene\\index";
    public static final String dataDir = "E:\\Lucene\\data";
    public static final String htmlDataDir = "E:\\www.tiexue.net";

    public void createHtmlIndex() {
        try {
            Directory directory = FSDirectory.open(new File(indexDir));
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_9);
            IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_4_9, analyzer);
            indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
            IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
            File file = new File(htmlDataDir);
            // listFiles (not listFilesAndDirs) so only real files reach the HTML parser
            Collection<File> files = FileUtils.listFiles(file, TrueFileFilter.INSTANCE, TrueFileFilter.INSTANCE);
            for (File f : files) {
                HtmlBean htmlBean = HtmlBeanUtil.parseHtml(f);
                if (null == htmlBean) {
                    continue;
                }
                Document document = new Document();
                // TextField is tokenized, so title and content can match analyzed queries;
                // the url stays a StringField (a single untokenized term)
                document.add(new TextField("title", htmlBean.getTitle(), Store.YES));
                document.add(new TextField("content", htmlBean.getContent(), Store.YES));
                document.add(new StringField("url", htmlBean.getUrl(), Store.YES));
                indexWriter.addDocument(document);
            }
            indexWriter.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public void createIndex() {
        try {
            Directory directory = FSDirectory.open(new File(indexDir));
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_9);
            IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_4_9, analyzer);
            indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
            IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
            File file = new File(dataDir);
            File[] files = file.listFiles();
            for (File f : files) {
                Document document = new Document();
                document.add(new StringField("fileName", f.getName(), Store.YES));
                document.add(new TextField("content", FileUtils.readFileToString(f), Store.YES));
                document.add(new LongField("lastModify", f.lastModified(), Store.YES));
                indexWriter.addDocument(document);
            }
            indexWriter.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
SearchIndex (a test class):
package com.michael.lucene;
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;

public class SearchIndex {

    @Test
    public void search() {
        try {
            Directory directory = FSDirectory.open(new File(CreateIndex.indexDir));
            IndexReader indexReader = DirectoryReader.open(directory);
            IndexSearcher indexSearcher = new IndexSearcher(indexReader);
            QueryParser queryParser = new QueryParser(Version.LUCENE_4_9, "content", new StandardAnalyzer(Version.LUCENE_4_9));
            Query query = queryParser.parse("军事");
            TopDocs search = indexSearcher.search(query, 10);
            ScoreDoc[] scoreDocs = search.scoreDocs;
            for (ScoreDoc scoreDoc : scoreDocs) {
                int docId = scoreDoc.doc;
                Document document = indexReader.document(docId);
                System.out.println(document.get("title"));
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
HtmlBean (the entity):
package com.michael.lucene;

public class HtmlBean {

    private String title;
    private String content;
    private String url;

    public String getTitle() {
        return title;
    }
    public void setTitle(String title) {
        this.title = title;
    }
    public String getContent() {
        return content;
    }
    public void setContent(String content) {
        this.content = content;
    }
    public String getUrl() {
        return url;
    }
    public void setUrl(String url) {
        this.url = url;
    }
}
HtmlBeanUtil (a utility class):
package com.michael.lucene;
import java.io.File;
import org.junit.Test;
import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Source;

public class HtmlBeanUtil {

    public static final String htmlDataDir = "E:\\www.tiexue.net\\index.html";

    public static HtmlBean parseHtml(File file) {
        HtmlBean htmlBean = new HtmlBean();
        try {
            Source source = new Source(file);
            Element title = source.getFirstElement(HTMLElementName.TITLE);
            if (null == title || null == title.getTextExtractor()) {
                return null;
            }
            String content = source.getTextExtractor().toString();
            String path = file.getAbsolutePath();
            htmlBean.setTitle(title.getTextExtractor().toString());
            htmlBean.setContent(content);
            // strip the drive prefix "E:\" and turn backslashes into slashes
            // so the local path maps back to the page's original URL
            htmlBean.setUrl("http://" + path.substring(3).replace('\\', '/'));
            System.out.println("parsed title: " + htmlBean.getTitle());
            System.out.println("parsed content: " + htmlBean.getContent());
            System.out.println("parsed url: " + htmlBean.getUrl());
        } catch (Exception e) {
            e.printStackTrace();
            return null;
        }
        return htmlBean;
    }

    @Test
    public void search() {
        try {
            Source source = new Source(new File(htmlDataDir));
            Element title = source.getFirstElement(HTMLElementName.TITLE);
            String content = source.getTextExtractor().toString();
            String path = new File(htmlDataDir).getAbsolutePath();
            System.out.println(title);
            System.out.println(content);
            System.out.println(path);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
PageUtils:
package com.michael.lucene;
import java.util.List;

public class PageUtils<T> {

    private int currentPage;    // current page number
    private int pageSize = 10;  // records per page
    private int totalRecord;    // total number of records
    private int totalPage;      // total number of pages
    private int firstPage;      // first page
    private int lastPage;       // last page
    private int prePage;        // previous page
    private int nextPage;       // next page
    private int position;       // offset of the first record on the current page

    /**
     * The records on the current page.
     */
    private List<?> list;

    public PageUtils(int totalRecord) {
        this.totalRecord = totalRecord;
    }

    public PageUtils(int currentPage, int totalRecord) {
        this.totalRecord = totalRecord;
        this.currentPage = currentPage;
    }

    public PageUtils(int currentPage, int pageSize, int totalRecord) {
        this.totalPage = (int) Math.ceil(totalRecord * 1.0 / pageSize);
        this.currentPage = currentPage;
        this.pageSize = pageSize;
        this.totalRecord = totalRecord;
    }

    public int getCurrentPage() {
        if (this.currentPage < 1)
            this.currentPage = 1;
        if (this.currentPage > this.getTotalPage())
            this.currentPage = this.getTotalPage();
        return currentPage;
    }

    public void setCurrentPage(int currentPage) {
        this.currentPage = currentPage;
    }

    public int getPageSize() {
        return pageSize;
    }

    public void setPageSize(int pageSize) {
        this.pageSize = pageSize;
    }

    public int getTotalRecord() {
        return totalRecord;
    }

    public void setTotalRecord(int totalRecord) {
        this.totalRecord = totalRecord;
    }

    public int getTotalPage() {
        if (this.getTotalRecord() % pageSize == 0)
            return this.getTotalRecord() / pageSize;
        return this.getTotalRecord() / pageSize + 1;
    }

    public void setTotalPage(int totalPage) {
        this.totalPage = totalPage;
    }

    public int getFirstPage() {
        return 1;
    }

    public void setFirstPage(int firstPage) {
        this.firstPage = firstPage;
    }

    public int getLastPage() {
        return this.getTotalPage();
    }

    public void setLastPage(int lastPage) {
        this.lastPage = lastPage;
    }

    public int getPrePage() {
        if (this.getCurrentPage() - 1 <= 0)
            return 1;
        return this.getCurrentPage() - 1;
    }

    public void setPrePage(int prePage) {
        this.prePage = prePage;
    }

    public int getNextPage() {
        if (this.getCurrentPage() + 1 >= this.getTotalPage())
            return this.getTotalPage();
        return this.getCurrentPage() + 1;
    }

    public void setNextPage(int nextPage) {
        this.nextPage = nextPage;
    }

    public int getPosition() {
        return (this.getCurrentPage() - 1) * pageSize + 1;
    }

    public void setPosition(int position) {
        this.position = position;
    }

    public List<?> getList() {
        return list;
    }

    public void setList(List<?> list) {
        this.list = list;
    }
}
The pages:
create.jsp:
<%@ page language="java" contentType="text/html; charset=UTF-8"
pageEncoding="UTF-8"%>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Create Index</title>
</head>
<body>
<a href="createIndex">Create index</a>
</body>
</html>
index.jsp:
<%@ page language="java" contentType="text/html; charset=UTF-8"
pageEncoding="UTF-8"%>
<%@ taglib uri="http://java.sun.com/jsp/jstl/core" prefix="c"%>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Knock-off Baidu</title>
</head>
<body>
<form action="index" method="post">
<input name="searchWord" maxlength="50" value="军事">
<input type="submit" value="百度一下">
<input name="num" type="hidden" value="1"> <%-- num is the page number, starting at 1 --%>
</form>
<br>
<br>
Baidu found about ${page.totalRecord} results for you
<br>
<br>
<c:forEach items="${page.list}" var="hb">
<a href="${hb.url}" target="_blank">${hb.title}</a>
<p>
${hb.content}
</p>
${hb.url}
<br>
<br>
</c:forEach>
</body>
</html>
Run it!
First, visit http://localhost:8080/Lucene/createIndex
This takes the pages we saved earlier under E:\www.tiexue.net, builds the index, and writes it into the folder we chose, E:\Lucene\index.
You can see the generated index files.
Then open http://localhost:8080/Lucene/index?num=1
and click the "百度一下" (search) button:
And with that, a knock-off Baidu built on Lucene is done.
Digging deeper:
Earlier we described Lucene as one "in" and one "out". Let's look at the write side first.
Inside the CreateIndex class:
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_9);
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_4_9, analyzer);
indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
We first create an Analyzer (the tokenizer), plug it into the config object, and plug the config into the IndexWriter.
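If you want to see what the analyzer actually does to a string, a small standalone sketch like this (demo code, not part of the project) prints the tokens it produces:

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_9);
        // tokenization plus lowercasing and stopword removal
        TokenStream ts = analyzer.tokenStream("content", new StringReader("The Quick Brown Fox"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString()); // prints: quick, brown, fox ("The" is a stopword)
        }
        ts.end();
        ts.close();
    }
}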
Then, using plain Java file I/O, we read out the files in the specified directory and iterate over them:
for (File f : files) {
    HtmlBean htmlBean = HtmlBeanUtil.parseHtml(f);
    if (null == htmlBean) {
        continue;
    }
    Document document = new Document();
    document.add(new TextField("title", htmlBean.getTitle(), Store.YES));
    document.add(new TextField("content", htmlBean.getContent(), Store.YES));
    document.add(new StringField("url", htmlBean.getUrl(), Store.YES));
    indexWriter.addDocument(document);
}
This is where field and document come in: we put our title, content and url fields into a Document, hand the Document to the IndexWriter, and the IndexWriter runs each field through the analyzer and files the resulting terms away, forming the inverted index.
That is how the index gets built.
Now let's look at the "out" side:
Directory directory = FSDirectory.open(new File(CreateIndex.indexDir));
IndexReader indexReader = DirectoryReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
First we open the index data and wrap it in an IndexSearcher, and then:
MultiFieldQueryParser mqp = new MultiFieldQueryParser(Version.LUCENE_4_9,
        new String[] { "title", "content", "url" }, new StandardAnalyzer(Version.LUCENE_4_9));
Query query = mqp.parse(searchWord);
we use a MultiFieldQueryParser to parse the incoming searchWord, declaring "title", "content" and "url" as the fields to search.
TopDocs search = indexSearcher.search(query, 10);
And just like that we pull out the top ten hits. What follows is simply packaging the results and rendering them on the page.
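One caveat: the controller always asks the searcher for the top 10 hits, so every page renders the same ten results. A simple fix, sketched here under the assumption that the page size stays 10, is to fetch enough hits to cover the requested page and slice out that page:

// fetch enough hits for page `num`, then keep only that page's slice
TopDocs search = indexSearcher.search(query, num * 10);
ScoreDoc[] scoreDocs = search.scoreDocs;
int from = (num - 1) * 10;
int to = Math.min(num * 10, scoreDocs.length);
for (int i = from; i < to; i++) {
    Document document = indexReader.document(scoreDocs[i].doc);
    // ... build the HtmlBean exactly as before
}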
Summary:
Solr and Elasticsearch essentially implement their search features on top of these same "in" and "out" Lucene interfaces, much like our little search project; Solr uses ZooKeeper for its distributed coordination, while Elasticsearch ships its own distribution layer.
Hope this article helps~