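A New Generation of Clustering Search Engines
Today's search engines, including Baidu, Google, Soso, and Yahoo, all offer general-purpose search. Imagine how useful it would be if the results were also grouped into categories automatically. A search for "Ajax", for example, would be sorted into categories like those in the figure below: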

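An open-source project already does this: Carrot2. It is very simple to use, but because clustering Chinese text differs considerably from clustering English text, most of the effort goes into the Chinese side of the clustering algorithm. The official Carrot2 site: http://project.carrot2.org/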
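One reason Chinese needs extra work is tokenization: English words are separated by spaces, while Chinese text is not, so a whitespace tokenizer hands the clustering step a single giant "word". The following is a minimal sketch of one common workaround, overlapping character bigrams. It is not taken from Carrot2; the class and method names (`BigramTokenizerSketch`, `whitespaceTokens`, `hanBigrams`) are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not part of Carrot2.
public class BigramTokenizerSketch {

    /** Splits on whitespace: fine for English, useless for unsegmented Chinese. */
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /**
     * Emits overlapping two-character tokens from each run of Han characters.
     * Non-Han characters end the current run; a run of length one emits nothing.
     */
    static List<String> hanBigrams(String text) {
        List<String> tokens = new ArrayList<>();
        int prev = -1; // previous Han code point in the current run, or -1
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                if (prev != -1) {
                    tokens.add(new String(Character.toChars(prev))
                            + new String(Character.toChars(cp)));
                }
                prev = cp;
            } else {
                prev = -1;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(whitespaceTokens("data mining tutorial")); // [data, mining, tutorial]
        System.out.println(whitespaceTokens("数据挖掘教程"));           // [数据挖掘教程]
        System.out.println(hanBigrams("数据挖掘教程"));                 // [数据, 据挖, 挖掘, 掘教, 教程]
    }
}
```

Bigrams over-generate (据挖 and 掘教 are not real words), which is why a dictionary-based or statistical word segmenter is usually preferred when cluster labels have to be readable.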
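Search engines are gradually segmenting the market. There are now a number of vertical search engines, "human flesh search" (which is really mostly a study of how people relate to one another), Google's local-life search, and so on. Search products are indeed slowly becoming more user-friendly in their design.
The source code of a document clustering example that ships with Carrot2 is as follows: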
// Requires java.util.ArrayList, java.util.HashMap and java.util.List,
// plus the Carrot2 classes used below. initLocalController() is not
// shown in this excerpt.
try {
    /*
     * Initialize local controller. Normally you'd run this only once
     * for an entire application (controller is thread safe).
     */
    final LocalController controller = initLocalController();

    /*
     * Once we have a controller we can run queries. Change the query
     * to something that is relevant to the data in your index.
     */

    // Data for clustering. Each entry is a document title followed by
    // a URL; this example feeds the URL in as the document body.
    String[][] documents = new String[][] {
        { "Data Mining - Wikipedia", "http://en.wikipedia.org/wiki/Data_mining" },
        { "KD Nuggets", "http://www.kdnuggets.com/" },
        { "The Data Mine", "http://www.the-data-mine.com/" },
        { "DMG", "http://www.dmg.org/" },
        { "Data Mining", "http://www.gr-fx.com/graf-fx.htm" },
        { "Data Mining Benchmarking Association (DMBA)", "http://www.dmbenchmarking.com/" },
        { "Data Mining", "http://www.computerworld.com/databasetopics/businessintelligence/datamining" },
        { "National Center for Data Mining (NCDM) - University of Illinois at Chicago", "http://www.ncdm.uic.edu/" },
    };

    // Although the query will not be used to fetch any data, if the data
    // that you're submitting for clustering is a response to some
    // search engine-like query, please provide it, as the clustering
    // algorithm may use it to improve the clustering quality.
    final String query = "data mining";

    // The documents are provided for clustering in the
    // PARAM_SOURCE_RAW_DOCUMENTS parameter, which should point to
    // a List of RawDocuments.
    List documentList = new ArrayList(documents.length);
    for (int i = 0; i < documents.length; i++)
    {
        documentList.add(new RawDocumentSnippet(
            Integer.valueOf(i), // unique id of the document, can be a plain sequence id
            documents[i][0],    // document title
            documents[i][1],    // document body
            "dummy://" + i,     // URL (not required for clustering)
            0.0f)               // document score, can be 0.0
        );
    }

    final HashMap params = new HashMap();
    params.put(
        ArrayInputComponent.PARAM_SOURCE_RAW_DOCUMENTS,
        documentList);

    final ProcessingResult pResult = controller.query("direct-feed-lingo", query, params);
    final ArrayOutputComponent.Result result = (ArrayOutputComponent.Result) pResult.getQueryResult();

    /*
     * Once we have the buffered snippets and clusters, we can display
     * them somehow. We'll reuse the simple text-dumping method
     * available in {@link Example}.
     */
    Example.displayResults(result);
} catch (Exception e) {
    // There shouldn't be any, but just in case.
    System.err.println("An exception occurred: " + e.toString());
    e.printStackTrace();
}
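To make the idea of grouping search results under labels concrete without needing Carrot2 on the classpath, here is a deliberately naive plain-Java sketch. It is not the Lingo algorithm used by the "direct-feed-lingo" process above; it simply groups titles under any non-query word that occurs in at least two of them. The class name `KeywordGroupingSketch`, its methods, and the sample "Ajax" titles are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; this is NOT how Carrot2's Lingo algorithm works.
public class KeywordGroupingSketch {

    /**
     * Groups titles under every word that appears in at least two titles,
     * ignoring the words of the query itself.
     * Returns a map from label to the indexes of the matching titles.
     */
    static Map<String, List<Integer>> group(String query, List<String> titles) {
        Set<String> queryWords = words(query);
        Map<String, List<Integer>> byWord = new LinkedHashMap<>();
        for (int i = 0; i < titles.size(); i++) {
            for (String w : words(titles.get(i))) {
                if (!queryWords.contains(w)) {
                    byWord.computeIfAbsent(w, k -> new ArrayList<>()).add(i);
                }
            }
        }
        // A word seen in only one title cannot act as a group label.
        byWord.values().removeIf(ids -> ids.size() < 2);
        return byWord;
    }

    /** Lower-cases the text and returns its distinct words, split on non-letters. */
    static Set<String> words(String text) {
        Set<String> result = new LinkedHashSet<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!w.isEmpty()) {
                result.add(w);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
            "Ajax tutorial for beginners",
            "Ajax Amsterdam football club",
            "Ajax football results",
            "jQuery Ajax tutorial");
        // prints {tutorial=[0, 3], football=[1, 2]}
        System.out.println(group("Ajax", titles));
    }
}
```

A real clustering engine also has to merge near-duplicate groups, rank them, and pick readable labels. The word-splitting step here assumes space-separated text, so it would not work on Chinese without a segmentation step first.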