Nutch 2.1: Distributed Deployment on HBase

Official documentation: http://wiki.apache.org/nutch/Nutch2Tutorial?action=show&redirect=GORA_HBase

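The deployment guides available online for Nutch 2.0 and later are quite incomplete. After two days of struggle I finally got Nutch 2.1 deployed on HBase, so I'm sharing the result here.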

Prepare two machines:

cr5 (master): 192.168.8.185, cr8 (slave): 192.168.8.188

The two machines must be able to reach each other over ssh (search online for the details).
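
Since the ssh setup is left to a search engine, here is one common way to do it. This is a rough sketch, not necessarily the procedure used for this deployment. It assumes OpenSSH (ssh-keygen, ssh-copy-id) is installed and that the same user account exists on both machines; the function name setup_passwordless_ssh is made up for illustration.

```shell
#!/usr/bin/env bash
# setup_passwordless_ssh HOST...: create an RSA key pair if none exists, then
# install the public key on each HOST. Hypothetical helper, not part of Hadoop.
setup_passwordless_ssh() {
  local key="$HOME/.ssh/id_rsa" host
  mkdir -p "$HOME/.ssh" && chmod 700 "$HOME/.ssh" || return 1
  # Generate a key with an empty passphrase only when no key is present yet.
  [ -f "$key" ] || ssh-keygen -t rsa -N '' -f "$key" || return 1
  for host in "$@"; do
    ssh-copy-id -i "$key.pub" "$host" || return 1   # asks for the password once
    ssh -o BatchMode=yes "$host" true || return 1   # must now work without a prompt
  done
}

# Run on cr5 and again on cr8, so each machine can reach itself and the other:
# setup_passwordless_ssh cr5 cr8
```

The call is left commented out because it contacts the remote hosts; the host names cr5 and cr8 only resolve once the /etc/hosts entries are in place.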

Edit the /etc/hostname file on each machine so that it contains

cr5
or
cr8

Edit the /etc/hosts file on both machines and add:

192.168.8.185   cr5
192.168.8.188   cr8
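
The two host entries can also be added with a small idempotent helper, sketched below. The function name add_host_entry is hypothetical, and the demo writes to a scratch file instead of /etc/hosts so it can be tried without root.

```shell
#!/usr/bin/env bash
# add_host_entry FILE IP NAME: append "IP   NAME" to FILE unless NAME is already mapped.
# (add_host_entry is a made-up helper for illustration.)
add_host_entry() {
  local file="$1" ip="$2" name="$3"
  # Match NAME as a whole whitespace-delimited field on a non-comment line.
  if ! grep -Eq "^[^#]*[[:space:]]${name}([[:space:]]|\$)" "$file" 2>/dev/null; then
    printf '%s   %s\n' "$ip" "$name" >> "$file"
  fi
}

# Demo on a scratch file; set HOSTS_FILE=/etc/hosts (as root) on cr5 and cr8 for real use.
HOSTS_FILE="${HOSTS_FILE:-$(mktemp)}"
add_host_entry "$HOSTS_FILE" 192.168.8.185 cr5
add_host_entry "$HOSTS_FILE" 192.168.8.188 cr8
add_host_entry "$HOSTS_FILE" 192.168.8.188 cr8   # repeated call adds nothing
cat "$HOSTS_FILE"
```

Because existing names are skipped, the script can be rerun safely after a partial setup.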

I plan to run the following processes on cr5:

Hadoop: NameNode, SecondaryNameNode, JobTracker

HBase: HMaster

and the following on cr8:

Hadoop: DataNode, TaskTracker

HBase: HQuorumPeer, HRegionServer

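Next we deploy Hadoop and HBase.

Many Hadoop and HBase releases are available from the official sites, but Nutch 2.1 does not support all of them.

The official documentation says:

• Install and configure HBase. You can get it here (N.B. Gora 0.2 uses HBase 0.90.4, however the setup is known to work with more recent versions of the HBase 0.90.x branch)

To be safe, stick with the recommended HBase 0.90.x releases.

I chose hadoop-1.0.4 and hbase-0.90.6.

With other versions, running Nutch fails with the following exception: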

Exception in thread "main" java.lang.NoSuchMethodError:
org.apache.hadoop.hbase.HColumnDescriptor.setMaxVersions(I)V

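I believe Gora is the cause, since Gora has not been updated for a long time. The (I)V descriptor in the message means the calling code expects setMaxVersions(int) to return void, so the error points to code compiled against one HBase API running with a different HBase jar on the classpath.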

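I. Configure Hadoop

1. Download the matching Hadoop release with wget: hadoop-<version>.tar.gz

2. Extract Hadoop: tar zxvf hadoop-<version>.tar.gz

3. cd into conf and edit the configuration files

a. hadoop-env.sh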

     
export JAVA_HOME=/opt/jdk1.6.0_21

b. core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://cr5:9000/</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/kfs/ww/data/hadoop_tmp</value>
        <description>Hadoop base directory</description>
    </property>
</configuration>

c. hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
        <description>Number of replicas</description>
    </property>
</configuration>

d. mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>cr5:9001</value>
        <description>JobTracker host:port</description>
    </property>
</configuration>

e. masters

cr5

f. slaves

cr8

Once the configuration is done, copy the Hadoop directory from cr5 to cr8.

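In hadoop/bin on cr5, format the NameNode:

./hadoop namenode -format

Only the NameNode is formatted. Hadoop 1.x has no corresponding format command for DataNodes; they initialize their storage directories on first start.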

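Then start Hadoop:

./start-all.sh

To check that startup succeeded, look through the *.log files under hadoop/logs and make sure they contain no exceptions.

Then open

http://localhost:50030 (JobTracker)

http://localhost:50070 (NameNode)

on cr5 to view the cluster status.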
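
Besides the logs and web pages, the JDK's jps tool lists the running Java processes, which gives a quick check against the process layout planned earlier. The helper below is a sketch; the function name missing_daemons is made up, and it only inspects the text that jps prints.

```shell
#!/usr/bin/env bash
# missing_daemons NAME...: read `jps` output on stdin and print every expected
# daemon that is absent. Hypothetical helper for illustration.
missing_daemons() {
  local listing name
  listing="$(cat)"
  for name in "$@"; do
    # jps prints "<pid> <MainClass>"; compare the class name as a whole field
    # so that NameNode does not match SecondaryNameNode.
    printf '%s\n' "$listing" \
      | awk -v n="$name" '$2 == n { found = 1 } END { exit !found }' \
      || printf '%s\n' "$name"
  done
}

# On cr5:  jps | missing_daemons NameNode SecondaryNameNode JobTracker
# On cr8:  jps | missing_daemons DataNode TaskTracker
```

Empty output means every expected daemon is up; any name printed is a process to look for in hadoop/logs.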

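II. Configure HBase

1. Download the matching HBase release with wget: hbase-<version>.tar.gz

2. Extract HBase: tar zxvf hbase-<version>.tar.gz

3. cd into conf and edit the configuration files

a. hbase-site.xml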

    
<configuration>
        <property>
                <name>hbase.rootdir</name>
                <value>hdfs://cr5:9000/hbase</value>
        </property>

        <property>
                <name>hbase.cluster.distributed</name>
                <value>true</value>
        </property>

        <property>
                <name>hbase.zookeeper.quorum</name>
                <value>cr8</value>
        </property>

        <property>
                <name>hbase.zookeeper.property.dataDir</name>
                <value>/home/kfs/ww/data/zookeeper_data</value>
        </property>

        <property>
                <name>hbase.zookeeper.property.clientPort</name>
                <value>2181</value>
        </property>

        <property>
                <name>hbase.tmp.dir</name>
                <value>/home/kfs/ww/data/hbase_tmp</value>
        </property>
</configuration>

Note: hdfs://cr5:9000/hbase here must match the Hadoop configuration (fs.default.name in core-site.xml).

b. hbase-env.sh

     
export JAVA_HOME=/opt/jdk1.6.0_21
export HBASE_CLASSPATH=~/ww/hbase-0.90.6/conf
export HBASE_MANAGES_ZK=true

c. regionservers

cr8

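The HBase configuration is now complete.

A few follow-up tasks remain, though:

1. Delete the hadoop-core-<version>.jar that ships with HBase, then copy hadoop-core-<version>.jar and commons-collections-3.2.1.jar from Hadoop into HBase's lib directory.

Otherwise the HBase HMaster will not start!

2. Turn off the firewall.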
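
The jar swap in step 1 can be scripted roughly as follows. This is a sketch under an assumed directory layout (hadoop-core-*.jar in the Hadoop root, commons-collections-*.jar in Hadoop's lib); check where the jars actually sit in your download. The function name sync_hadoop_jars is made up for illustration.

```shell
#!/usr/bin/env bash
# sync_hadoop_jars HADOOP_HOME HBASE_HOME
# Replace HBase's bundled hadoop-core jar with the one from the deployed Hadoop
# and copy commons-collections next to it. Hypothetical helper; the jar
# locations below are assumptions about the tarball layout.
sync_hadoop_jars() {
  local hadoop_home="$1" hbase_home="$2"
  rm -f "$hbase_home"/lib/hadoop-core-*.jar
  cp "$hadoop_home"/hadoop-core-*.jar "$hbase_home"/lib/ || return 1
  cp "$hadoop_home"/lib/commons-collections-*.jar "$hbase_home"/lib/ || return 1
}

# Example (paths are placeholders):
# sync_hadoop_jars ~/ww/hadoop-1.0.4 ~/ww/hbase-0.90.6
```

Presumably this has to be done on every machine that runs an HBase process, or done once before the HBase directory is copied to the other node.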

Start HBase by running ./start-hbase.sh in hbase/bin.

To verify that it started, check the logs for exceptions,

or open http://localhost:60010 for details.

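III. Configure Nutch

I won't go over importing the project into Eclipse here; the main point is the configuration.

1. Download the matching Nutch release tarball (.tar.gz) with wget

2. Extract Nutch with tar zxvf

3. cd into conf and edit the configuration files

a. gora.properties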

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

b. nutch-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
	<property>
		<name>http.agent.name</name>
		<value>test-nutch</value>
	</property>

	<property>
		<name>http.robots.agents</name>
		<value>test-nutch,*</value>
	</property>

	<property>
		<name>http.agent.name.check</name>
		<value>true</value>
	</property>

	<!-- property> <name>plugin.includes</name> <value>.*</value> <description>Enable 
		all plugins during unit testing.</description> </property -->

	<property>
		<name>distributed.search.test.port</name>
		<value>60000</value>
		<description>TCP port used during junit testing.</description>
	</property>

	<property>
		<name>http.accept.language</name>
		<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
		<description>Value of the "Accept-Language" request header field.
			This allows selecting a non-English language as the default one to
			retrieve. It is a useful setting for search engines built for a
			certain national group.
		</description>
	</property>

	<property>
		<name>parser.character.encoding.default</name>
		<value>utf-8</value>
		<description>The character encoding to fall back to when no other
			information is available.
		</description>
	</property>

	<property>
		<name>storage.data.store.class</name>
		<value>org.apache.gora.hbase.store.HBaseStore</value>
		<description>The Gora DataStore class for storing and retrieving data.
			Currently the following stores are available: ….
		</description>
	</property>
	
	<property>
		<name>hadoop.tmp.dir</name>
		<value>C:/data/hadoop_tmp</value>
		<description>Hadoop base directory</description>
	</property>

</configuration>

c. nutch-site.xml (plugin.folders property)

<property>
  <name>plugin.folders</name>
  <value>./src/plugin</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>

d. hbase-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

	<property>
		<name>hbase.master</name>
		<value>cr5:60000</value>
	</property>

	<property>
		<name>hbase.zookeeper.quorum</name>
		<value>cr8</value>
	</property>

	<property>
		<name>hbase.zookeeper.property.clientPort</name>
		<value>2181</value>
	</property>

</configuration>

e. ivy.xml

<dependency org="org.apache.gora" name="gora-hbase" rev="0.2.1" conf="*->default" />

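f. Create a urls folder, create seed.txt inside it, and write the links to be crawled into seed.txt

g. Add the regular expressions that select which URLs to crawl to regex-urlfilter.txt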
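
Steps f and g might look like the sketch below. The seed URL and the filter rule are placeholders, not values from this deployment, and url_accepted is a made-up helper that only approximates how Nutch applies a rule.

```shell
#!/usr/bin/env bash
# Create the seed folder and file; the URL is a placeholder, one URL per line.
SEED_DIR="${SEED_DIR:-urls}"
mkdir -p "$SEED_DIR"
printf '%s\n' 'http://nutch.apache.org/' > "$SEED_DIR/seed.txt"

# Example rule for conf/regex-urlfilter.txt: a leading + accepts, a leading - rejects.
RULE='+^http://([a-z0-9]*\.)*nutch\.apache\.org/'

# url_accepted URL: test URL against RULE with grep -E. Nutch evaluates rules as
# Java regexes; for a simple pattern like this one the two dialects agree.
url_accepted() {
  printf '%s\n' "$1" | grep -Eq -- "${RULE#+}"
}

url_accepted 'http://nutch.apache.org/' && echo "seed URL passes the filter"
```

Rules in regex-urlfilter.txt are tried from top to bottom and the first match decides, so a site-specific rule like this one belongs above any catch-all line.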

Configuration is complete. One follow-up task:

The hbase-<version>.jar used by Nutch must be the same version as the deployed HBase.
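
A quick way to compare the two jar versions is to read them off the file names, as in this sketch. Both function names are hypothetical, and the example paths in the final comment are assumptions about where the jars live.

```shell
#!/usr/bin/env bash
# hbase_jar_version DIR: print the version embedded in hbase-<version>.jar in DIR.
hbase_jar_version() {
  local jar
  for jar in "$1"/hbase-[0-9]*.jar; do
    [ -e "$jar" ] || continue
    case "$jar" in *-tests.jar) continue ;; esac   # skip the companion tests jar
    jar="${jar##*/hbase-}"                         # drop the path and "hbase-" prefix
    printf '%s\n' "${jar%.jar}"                    # drop the ".jar" suffix
    return 0
  done
  return 1
}

# same_hbase_version DIR_A DIR_B: succeed only if both directories hold the same version.
same_hbase_version() {
  local a b
  a="$(hbase_jar_version "$1")" || return 2
  b="$(hbase_jar_version "$2")" || return 2
  [ "$a" = "$b" ]
}

# Example (placeholder paths):
# same_hbase_version ~/ww/hbase-0.90.6 ~/nutch/lib || echo "HBase jar versions differ"
```

A mismatch reported here is worth fixing before the first crawl, since it is a likely source of the NoSuchMethodError quoted earlier.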

Running Nutch

Set the Arguments in the Eclipse run configuration:

1. Program arguments

urls -depth 3 -topN 5
Here urls is the seed URL folder created during the Nutch configuration.

2. VM arguments

-Xms256m -Xmx512m -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log

All done!

IV. Handling exceptions at runtime

1. point org.apache.nutch.net.URLNormalizer not found. See http://youkimra.iteye.com/blog/1039903

2. ERROR org.apache.hadoop.mapred.TaskTracker: Can not start task tracker because java.io.IOException: Failed to set permissions of path: \tmp\hadoop-admin\mapred\local\ttprivate to 0700

See: http://download.csdn.net/detail/java2000_wl/4326323

3. Some Nutch plugin classes are missing their dependency jars; add the missing jars when you run into this.

When reposting, please credit the source: http://wangwei3.iteye.com/blog/1818599
