A Comparison of Mainstream Full-Text Indexing Tools (Lucene, Sphinx, Solr, Elasticsearch)

momomoniqwer

2012-06-15


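While researching full-text search under Rails 3 a few days ago, I found two good candidates:

1. Lucene (Solr and Elasticsearch are both built on it)

2. Sphinx

Both have a very good reputation, so today I dug a little deeper. I am recording the worthwhile articles I came across here:

1. http://stackoverflow.com/questions/737275/comparison-of-full-text-search-engine-lucene-sphinx-postgresql-mysql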

------------

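Answer 1. Result relevance ranking is the default. You can set up your own sorting should you wish, and give specific fields higher weightings.

Results are ranked by relevance by default. You can also define your own sort order and configure weights for individual fields.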
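To make "higher weightings for specific fields" concrete, here is a minimal sketch in Python. It is not how Sphinx computes relevance; the field names, the weights, and the simple term-count score are all assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical weights: a hit in the title counts three times as much as a hit in the body.
FIELD_WEIGHTS: Dict[str, float] = {"title": 3.0, "body": 1.0}


def score(doc: dict, query: str, weights: Dict[str, float] = FIELD_WEIGHTS) -> float:
    """Sum the weighted number of query-term occurrences over the weighted fields."""
    terms = query.lower().split()
    total = 0.0
    for field, weight in weights.items():
        words = str(doc.get(field, "")).lower().split()
        total += weight * sum(words.count(term) for term in terms)
    return total


def ranked_search(docs: List[dict], query: str) -> List[Tuple[float, dict]]:
    """Return (score, document) pairs for matching documents, best score first."""
    scored = [(score(doc, query), doc) for doc in docs]
    hits = [pair for pair in scored if pair[0] > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return hits
```

With these weights, a document that mentions the query term once in its title outranks one that mentions it twice in its body.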

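Indexing speed is super-fast, because it talks directly to the database. Any slowness will come from complex SQL queries and un-indexed foreign keys and other such problems. I've never noticed any slowness in searching either.

Because it talks directly to the database, index building is extremely fast; any slowness comes from overly complex SQL or from columns that lack an index. Searching has never felt slow either.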

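The search service daemon (searchd) is pretty low on memory usage - and you can set limits on how much memory the indexer process uses too.

The search daemon uses very little memory, and you can also cap how much memory the indexer process uses.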

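Scalability is where my knowledge is more sketchy - but it's easy enough to copy index files to multiple machines and run several searchd daemons. The general impression I get from others though is that it's pretty damn good under high load, so scaling it out across multiple machines isn't something that needs to be dealt with.

Scalability: I do not know much about it. Still, it is easy to copy an index to several servers and run multiple search daemons. From what others report, a single machine holds up well under heavy, highly concurrent load, so there is usually no need to distribute it.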

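There's no support for 'did-you-mean', etc - although these can be done with other tools easily enough. Sphinx does stem words though using dictionaries, so 'driving' and 'drive' (for example) would be considered the same in searches.

It does not support query correction ("did you mean ...?"), although other tools can supply it easily enough. Sphinx stems words using dictionaries, so 'driving' and 'drive' return the same search results.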
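The effect of stemming can be sketched with a toy example. The suffix-stripping rule below is an assumption made for illustration only; it is much cruder than the dictionary-based stemming the answer attributes to Sphinx, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set

# Toy suffix list, checked in order; real stemmers use far more careful rules.
SUFFIXES = ("ing", "ed", "es", "s", "e")


def stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each stem to the set of document ids that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[stem(word)].add(doc_id)
    return index


def stemmed_search(index: Dict[str, Set[int]], term: str) -> Set[int]:
    """Look up the stem of the query term, so word variants match each other."""
    return index.get(stem(term), set())
```

Because the index and the query are stemmed the same way, a search for 'drive' also finds documents that only contain 'driving'.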

------------

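Answer 2.

I don't know Sphinx, but as for Lucene vs a database full-text search, I think that Lucene performance is unmatched. You should be able to do almost any search in less than 10 ms, no matter how many records you have to search, provided that you have set up your Lucene index correctly.

I have not used Sphinx, but compared with a database's full-text search, Lucene's performance is unmatched. Almost any search completes in under 10 ms regardless of data volume, provided the index has been set up correctly.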

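Here comes the biggest hurdle though: personally, I think integrating Lucene in your project is not easy. Sure, it is not too hard to set it up so you can do some basic search, but if you want to get the most out of it, with optimal performance, then you definitely need a good book about Lucene.

The biggest hurdle: in my view, integrating Lucene into a project is not easy. A prototype with basic search is not hard to build, but getting optimal performance out of it requires a good Lucene book at hand.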

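As for CPU & RAM requirements, performing a search in Lucene doesn't task your CPU too much, though indexing your data is, although you don't do that too often (maybe once or twice a day), so that isn't much of a hurdle.

Searching needs little CPU and memory. Indexing needs more, but since you probably rebuild the index only once or twice a day, this is unlikely to be a problem.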

http://stackoverflow.com/a/2288211/445908

An answer from the author of Elasticsearch:

As the creator of ElasticSearch, maybe I can give you some reasoning on why I went ahead and created it in the first place.

As the author of ElasticSearch, perhaps I can explain why I started this project.

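Using pure Lucene is challenging. There are many things that you need to take care for if you want it to really perform well, and also, it's a library, so no distributed support, it's just an embedded Java library that you need to maintain.

Using Lucene directly is challenging. To make it perform well you have to watch many details, and it is only a library (a jar), with no distributed support.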

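In terms of Lucene usability, way back when (almost 6 years now), I created Compass. Its aim was to simplify using Lucene and make everyday Lucene simpler. What I came across time and time again is the requirement to be able to have Compass distributed. I started to work on it from within Compass, by integrating with data grid solutions like GigaSpaces, Coherence and Terracotta, but it's not enough.

On usability: almost six years ago I created Compass to make using and maintaining Lucene simpler. I repeatedly ran into the requirement that Compass be distributed, so I started working on that from within Compass by integrating data grid solutions (GigaSpaces, Coherence, Terracotta), but that was not enough.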

At its core, a distributed Lucene solution needs to be sharded. Also, with the advancement of HTTP and JSON as ubiquitous APIs, it means that a solution that many different systems with different languages can easily be used.

At its core, a distributed Lucene solution has to shard the index horizontally (see: shard). It should also expose its API over HTTP and JSON, so that it can be called easily from any language.
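The idea of sharding can be sketched as follows. This is a toy in-memory model, not ElasticSearch's actual routing or storage; the class name, the CRC32-modulo routing rule, and the whitespace tokenization are assumptions for illustration.

```python
import zlib
from typing import Dict, List


def shard_for(doc_id: str, num_shards: int) -> int:
    """Route a document to a shard with a stable hash of its id."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


class ShardedIndex:
    """Toy index split across several shards (plain dicts stand in for servers)."""

    def __init__(self, num_shards: int) -> None:
        self.shards: List[Dict[str, str]] = [dict() for _ in range(num_shards)]

    def add(self, doc_id: str, text: str) -> None:
        # Each document lives on exactly one shard.
        self.shards[shard_for(doc_id, len(self.shards))][doc_id] = text

    def search(self, term: str) -> List[str]:
        # Scatter the query to every shard, then gather and merge the hits.
        hits: List[str] = []
        for shard in self.shards:
            hits.extend(
                doc_id
                for doc_id, text in shard.items()
                if term.lower() in text.lower().split()
            )
        return sorted(hits)
```

Writes touch a single shard, while a search fans out to all shards and merges the partial results, which is the general shape a sharded search system takes.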

This is why I went ahead and created ElasticSearch. It has a very advanced distributed model, speaks natively JSON, and exposes many advanced search features, all seamlessly expressed through JSON DSL.

That is why I created ElasticSearch. It has an advanced distributed model, speaks JSON natively, and offers many search features, all accessed through a JSON DSL.
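As a rough illustration of what "expressed through JSON DSL" means, the sketch below builds a query body in the shape of the `match` query from Elasticsearch's current Query DSL; the exact DSL available when this answer was written may have differed. The field name `title` and the helper name are assumptions, and nothing is sent over the network here.

```python
import json
from typing import Any, Dict


def build_match_query(field: str, text: str, size: int = 10) -> Dict[str, Any]:
    """Build a full-text `match` query body as a plain Python dict."""
    return {
        "query": {"match": {field: text}},  # analyze `text` and match it against `field`
        "size": size,                       # maximum number of hits to return
    }


# Serialize to the JSON string a client would put in the HTTP request body.
body = json.dumps(build_match_query("title", "full text search"))
print(body)
```

A body like this would typically be sent over HTTP to an index's `_search` endpoint, which is why any language with an HTTP client and a JSON encoder can act as a client.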

Solr is also a solution for exposing an indexing/search server over HTTP, but I would argue that ElasticSearch provides a much superior distributed model and ease of use (though currently lacking on some of the search features, but not for long, and in any case, the plan is to get all Compass features into ElasticSearch). Of course, I am biased, since I created ElasticSearch, so you might need to check for yourself.

Solr also exposes an indexing/search server over HTTP, but I would argue that ElasticSearch offers a better distributed model and is easier to use (it still lacks some search features, but the plan is to bring every Compass feature into ElasticSearch before long). As its author I am of course biased, so you had better try it yourself.

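As for Sphinx, I have not used it, so I can't comment. What I can refer you is to this thread at Sphinx forum which I think proves the superior distributed model of ElasticSearch.

I have not used Sphinx, so I will not comment on it. Judging from this thread on the Sphinx forum, though, ElasticSearch provides the better distributed model.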

Of course, ElasticSearch has many more features than just being distributed. It is actually built with cloud in mind. You can check the feature list on the site.

Beyond the distributed model, ElasticSearch offers many other features, since it was built with the cloud in mind. You can check the feature list on its site.

http://stackoverflow.com/q/1284083/445908

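I've been using Solr successfully for almost 2 years now, and have never used Sphinx, so I'm obviously biased. However, I'll try to keep it objective by quoting the docs or other people. I'll also take patches to my answer :-)

I have used Solr successfully for the past two years and have never used Sphinx, so my view is certainly not fully objective. I will quote the docs and other people instead.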

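Similarities:

Both Solr and Sphinx satisfy all of your requirements. They're fast and designed to index and search large bodies of data efficiently.

Both meet your requirements. Both are fast and designed for efficient indexing and searching of large data sets.

Both have a long list of high-traffic sites using them (Solr, Sphinx)

Each is used by a long list of high-traffic websites.

Both offer commercial support. (Solr, Sphinx)

Commercial support is available for both.

Both offer client API bindings for several platforms/languages (Sphinx, Solr)

Both provide client APIs for a range of languages.

Both can be distributed to increase speed and capacity (Sphinx, Solr)

Both support distributed deployment.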

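Here are some differences:

Solr, being an Apache project, is obviously Apache2-licensed. Sphinx is GPLv2. This means that if you ever need to embed or extend (not just "use") Sphinx in a commercial application, you'll have to buy a commercial license (rationale)

Solr is an Apache project under the Apache 2 license. Sphinx is GPLv2, which means that embedding or extending it in a commercial product (as opposed to merely using it) requires buying a commercial license.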

Solr is easily embeddable in Java applications.

Solr is easy to integrate into Java projects.

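Solr is built on top of Lucene, which is a proven technology over 8 years old with a huge user base (this is only a small part). Whenever Lucene gets a new feature or speedup, Solr gets it too. Many of the devs committing to Solr are also Lucene committers.

Solr is built on Lucene, which is more than eight years old and has a huge user base. Whatever features Lucene gains, Solr gains too, and many Solr developers also contribute to Lucene.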

Sphinx integrates more tightly with RDBMSs, especially MySQL.

Solr can be integrated with Hadoop to build distributed applications

Solr can be integrated with Nutch to quickly build a fully-fledged web search engine with crawler.

Sphinx is tied closely to RDBMSs, MySQL in particular. Solr can be combined with Hadoop to form a distributed system, or with Nutch to form a full-featured web search engine with a crawler.

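Solr can index proprietary formats like Microsoft Word, PDF, etc. Sphinx can't.

Solr can index Word and PDF files. Sphinx cannot.

Solr comes with a spell-checker out of the box.

Solr also ships with a spell checker.

Solr comes with facet support out of the box. Faceting in Sphinx takes more work.

Solr supports facets by default, whereas Sphinx needs some extra work to get them.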
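For readers unfamiliar with the term, faceting means counting how the matched documents break down by the values of some field. The sketch below shows only that idea in plain Python; it is not Solr's implementation, and the field names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def facet_counts(hits: Iterable[dict], field: str) -> Dict[str, int]:
    """Count how many matched documents carry each value of `field`."""
    return dict(Counter(doc[field] for doc in hits if field in doc))


# Hypothetical search hits, each with a `language` field to facet on.
hits = [
    {"id": 1, "language": "java"},
    {"id": 2, "language": "ruby"},
    {"id": 3, "language": "java"},
    {"id": 4},  # no language field, so it is left out of the counts
]
print(facet_counts(hits, "language"))
```

A search UI would show these counts next to the results, for example as "java (2), ruby (1)", so the user can narrow the search with one click.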

Sphinx doesn't allow partial index updates for field data.

Sphinx does not support partial index updates for field data.

In Sphinx, all document ids must be unique unsigned non-zero integer numbers. Solr doesn't even require an unique key for many operations, and unique keys can be either integers or strings.

In Sphinx every document id must be a unique, unsigned, non-zero integer (presumably described in C-language terms). Many Solr operations do not even need a unique key, and a unique key may be either an integer or a string.

Solr supports field collapsing (currently as an additional patch only) to avoid duplicating similar results. Sphinx doesn't seem to provide any feature like this.

Solr supports field collapsing to keep similar results from being repeated. Sphinx has no such feature.
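Field collapsing can be pictured as keeping only the best hit for each value of a chosen field. The sketch below is a simplified stand-in, not Solr's algorithm: it assumes the hits are already sorted best-first, and the field name `site` is hypothetical.

```python
from typing import Iterable, List


def collapse(hits: Iterable[dict], field: str) -> List[dict]:
    """Keep the first (best-ranked) hit per value of `field`.

    Hits without the field are never collapsed.
    """
    seen = set()
    collapsed: List[dict] = []
    for hit in hits:  # assumed to be ordered best-first
        key = hit.get(field)
        if key is not None:
            if key in seen:
                continue  # a better hit with the same value was already kept
            seen.add(key)
        collapsed.append(hit)
    return collapsed
```

For example, collapsing on `site` turns five hits from the same website into one entry, leaving room on the first page for results from other sites.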

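While Sphinx is designed to only retrieve document ids, in Solr you can directly get whole documents with pretty much any kind of data, making it more independent of any external data store and it saves the extra roundtrip.

Sphinx returns only document ids, whereas Solr can return whole documents, which makes it less dependent on an external data store and saves an extra round trip.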
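The "extra roundtrip" looks roughly like the sketch below: the search engine hands back ranked ids, and the application then fetches the rows from its own database. SQLite stands in for the external data store here, and the `articles` table, its columns, and the function name are assumptions for illustration.

```python
import sqlite3
from typing import List, Tuple


def fetch_documents(conn: sqlite3.Connection, ids: List[int]) -> List[Tuple[int, str]]:
    """Load rows for the given ids, preserving the ranking order of `ids`."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, title FROM articles WHERE id IN ({placeholders})", ids
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # SQL does not keep the IN-list order, so restore the search ranking here.
    return [by_id[i] for i in ids if i in by_id]
```

Here `ids` plays the role of the id list an id-only search engine returns; an engine that stores and returns whole documents would make this second query unnecessary.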

Solr, except when used embedded, runs in a Java web container such as Tomcat or Jetty, which require additional specific configuration and tuning (or you can use the included Jetty and just launch it with java -jar start.jar). Sphinx has no additional configuration.

Solr runs in a Java web container such as Tomcat or Jetty, which has to be configured and tuned separately. Sphinx needs no such additional configuration.

http://www.wikivs.com/wiki/Lucene_vs_Sphinx

One important point: Sphinx does not support live index updates, or supports them only in a very limited way.

There is also a slide deck worth reading:

http://www.slideshare.net/billkarwin/practical-full-text-search-with-my-sql
