Questioning how Lucene in Action and other books explain mergeFactor
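I am about to start working on search, and since my company uses Lucene, I have been studying it on my own first. I read Lucene in Action and a book on Lucene 2.0 + Heritrix that I bought today. Both describe mergeFactor the same way: "each time mergeFactor documents are added to the index, a new segment is created on disk......". minMergeDocs, on the other hand, gets only a passing mention as the setting that limits the number of documents held in memory. That struck me as odd: described this way the two values conflict, because they would do the same job. So I ran a few experiments. I have 81 documents, and I set mergeFactor to 5, minMergeDocs to 8, and maxMergeDocs to 45. According to the books, a segment should be created for every 5 documents. What actually happens?
[code]
package org.apache.lucene.demo;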

/**
 * Copyright 2004 The Apache Software Foundation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Date;

class IndexFiles {
  public static void main(String[] args) throws IOException {
    String usage = "java " + IndexFiles.class + " <root_directory>";
    if (args.length == 0) {
      System.err.println("Usage: " + usage);
      System.exit(1);
    }

    Date start = new Date();
    try {
      // args[0] is the root directory of the .txt files; the index goes to "index".
      File INDEX_DIR = new File(args[0]);
      if (INDEX_DIR.exists()) {
        // File.delete() only removes an empty directory, so a populated root stays.
        INDEX_DIR.delete();
      }
      IndexWriter writer = new IndexWriter("index",
          new StandardAnalyzer(), true);
      // keep each segment as separate files instead of one compound file
      writer.setUseCompoundFile(false);
      writer.mergeFactor = 5;
      writer.maxMergeDocs = 40;
      writer.minMergeDocs = 8;
      indexDocs(writer, INDEX_DIR);
      // optimize() is left commented out so the segments are not merged into one
      // writer.optimize();
      writer.close();

      Date end = new Date();
      System.out.print(end.getTime() - start.getTime());
      System.out.println(" total milliseconds");
    } catch (IOException e) {
      System.out.println(" caught a " + e.getClass()
          + "\n with message: " + e.getMessage());
    }
  }

  public static void indexDocs(IndexWriter writer, File file)
      throws IOException {
    // do not try to index files that cannot be read
    if (file.canRead()) {
      if (file.isDirectory()) {
        String[] files = file.list();
        // an IO error could occur
        if (files != null) {
          for (int i = 0; i < files.length; i++) {
            indexDocs(writer, new File(file, files[i]));
          }
        }
      } else {
        try {
          if (file.getName().endsWith(".txt")) {
            System.out.println("adding " + file);
            writer.addDocument(FileDocument.Document(file));
          }
        }
        // at least on windows, some temporary files raise this
        // exception with an "access denied" message
        // checking if the file can be read doesn't help
        catch (FileNotFoundException fnfe) {
          ;
        }
      }
    }
  }
}
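[/code]
Run it in a debugger with a breakpoint on writer.addDocument(FileDocument.Document(file)); and you will find that no segment is created when the 5th document is added. The first segment appears only when the 8th document is added. Then run a second experiment with the two values swapped (mergeFactor 8, minMergeDocs 5): this time the first segment is created when the 5th document is added.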
So my conclusion is that mergeFactor only controls how segments are merged; it does not control how many documents go into a newly created segment. That is what minMergeDocs controls.
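To make the distinction concrete, here is a minimal sketch of a toy model. It is not Lucene's code; it only assumes what the experiments suggest: documents are buffered in memory until minMergeDocs of them have accumulated and are then written out as one segment, and once mergeFactor such segments exist they are merged into a bigger one. The class and method names are made up for illustration, and maxMergeDocs is ignored here.

```java
class FlushVersusMergeSketch {

    /**
     * Toy model (an assumption, not Lucene's implementation).
     * Returns {document number at the first flush, document number at the first merge};
     * -1 means the event never happened.
     */
    static int[] firstEvents(int docCount, int mergeFactor, int minMergeDocs) {
        int buffered = 0;       // documents currently held in memory
        int smallSegments = 0;  // flushed segments that have not been merged yet
        int firstFlush = -1;
        int firstMerge = -1;
        for (int doc = 1; doc <= docCount; doc++) {
            buffered++;
            if (buffered < minMergeDocs) {
                continue;       // keep buffering, nothing is written yet
            }
            // minMergeDocs documents are buffered: write them out as one segment
            buffered = 0;
            smallSegments++;
            if (firstFlush < 0) {
                firstFlush = doc;
            }
            // mergeFactor segments of this size exist: merge them into one
            if (smallSegments == mergeFactor) {
                smallSegments = 0;
                if (firstMerge < 0) {
                    firstMerge = doc;
                }
            }
        }
        return new int[] {firstFlush, firstMerge};
    }

    public static void main(String[] args) {
        int[] a = firstEvents(81, 5, 8);
        int[] b = firstEvents(81, 8, 5);
        System.out.println("mergeFactor=5, minMergeDocs=8: first segment at doc "
                + a[0] + ", first merge at doc " + a[1]);
        System.out.println("mergeFactor=8, minMergeDocs=5: first segment at doc "
                + b[0] + ", first merge at doc " + b[1]);
    }
}
```

In this model the first segment shows up at document 8 in the first configuration and at document 5 in the swapped one, which is what the debugger showed, while the first merge happens at document 40 in both cases.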
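I am also attaching an algorithm I wrote to compute the number of segments that get produced. I wrote it in a hurry, so it may contain mistakes. One branch is unverified: when maxMergeDocs < minMergeDocs, my experiment produced just a single segment, and I don't know why.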
[code]
package com.sina.easy.util;

public class CountSegmentNum {
    private int docNum = 0;
    private int mergefactor = 10;
    private int maxMergeDocs = Integer.MAX_VALUE;
    private int minMergeDocs = 10;
    private int segmentNum = 0;

    public CountSegmentNum(int docNum, int mergefactor, int maxMergeDocs,
            int minMergeDocs) {
        this.docNum = docNum;
        this.mergefactor = mergefactor;
        this.maxMergeDocs = maxMergeDocs;
        this.minMergeDocs = minMergeDocs;
    }

    public void countNum() {
        // i counts how many merge levels tempmerfactormulti covers (mergefactor^i)
        int i = 1;
        int tempmerfactormulti = mergefactor;
        while (true) {
            if (docNum == 0) {
                return;
            }
            // fewer than minMergeDocs documents left: they form one last segment
            if (docNum < minMergeDocs) {
                segmentNum++;
                return;
            }
            if (maxMergeDocs >= docNum) {
                // x = number of minMergeDocs-sized segments written so far
                int x = docNum / minMergeDocs;
                int z = x % mergefactor;
                if (x >= mergefactor) {
                    segmentNum++;
                }
                segmentNum += z;
                docNum = docNum % minMergeDocs;
            } else {
                if (maxMergeDocs < minMergeDocs) {
                    // this branch has not been verified in detail,
                    // but nobody should use such settings in practice
                    segmentNum = 1;
                    return;
                }
                if (maxMergeDocs < tempmerfactormulti * minMergeDocs) {
                    // the next merge level would exceed maxMergeDocs,
                    // so count segments of the largest allowed size
                    int nowmerfactor = tempmerfactormulti;
                    for (; i >= 1; i--) {
                        nowmerfactor = tempmerfactormulti / mergefactor;
                        segmentNum += docNum / (nowmerfactor * minMergeDocs);
                        docNum = docNum % (nowmerfactor * minMergeDocs);
                    }
                } else {
                    // go up one merge level
                    tempmerfactormulti = tempmerfactormulti * mergefactor;
                    i++;
                }
            }
        }
    }

    public int getSegmentNum() {
        return segmentNum;
    }

    public static void main(String[] args) {
        CountSegmentNum csn = new CountSegmentNum(81, 5, 60, 4);
        csn.countNum();
        System.out.println(csn.getSegmentNum());
    }
}
[/code]
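One way to cross-check a closed-form counter like the one above is to simulate the process step by step. The sketch below is a toy model under stated assumptions, not Lucene's implementation. It assumes a segment is written for every minMergeDocs documents, the last mergeFactor segments are merged whenever they have equal size and the result would not exceed maxMergeDocs, and any leftover buffered documents become one final segment on close. It does not cover the maxMergeDocs < minMergeDocs case. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SegmentCountSimulation {

    /** Returns the sizes (in documents) of the segments left after adding docCount documents. */
    static List<Integer> simulate(int docCount, int mergeFactor,
                                  int maxMergeDocs, int minMergeDocs) {
        if (mergeFactor < 2 || minMergeDocs < 1) {
            throw new IllegalArgumentException("need mergeFactor >= 2 and minMergeDocs >= 1");
        }
        List<Integer> segments = new ArrayList<>();
        int buffered = 0;
        for (int doc = 0; doc < docCount; doc++) {
            buffered++;
            if (buffered == minMergeDocs) {
                segments.add(buffered);   // assumed flush: write the buffer as one segment
                buffered = 0;
                mergeTail(segments, mergeFactor, maxMergeDocs);
            }
        }
        if (buffered > 0) {
            segments.add(buffered);       // assumed: leftovers are written on close
        }
        return segments;
    }

    /** Merges the last mergeFactor segments while they are equal in size and fit maxMergeDocs. */
    private static void mergeTail(List<Integer> segments, int mergeFactor, int maxMergeDocs) {
        while (segments.size() >= mergeFactor) {
            int n = segments.size();
            int size = segments.get(n - 1);
            for (int i = n - mergeFactor; i < n; i++) {
                if (segments.get(i) != size) {
                    return;               // tail is not mergeFactor equal-sized segments
                }
            }
            long merged = (long) size * mergeFactor;
            if (merged > maxMergeDocs) {
                return;                   // merged segment would be too large
            }
            for (int i = 0; i < mergeFactor; i++) {
                segments.remove(segments.size() - 1);
            }
            segments.add((int) merged);
        }
    }

    public static void main(String[] args) {
        List<Integer> sizes = simulate(81, 5, 60, 4);
        System.out.println(sizes + " -> " + sizes.size() + " segments");
    }
}
```

For the inputs used in main above (81 documents, mergeFactor 5, maxMergeDocs 60, minMergeDocs 4), this model ends with four segments of 20 documents and one of 1 document, for 5 segments in total.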