Hadoop Learning: Single-Node Cluster Setup

1. Download Hadoop from the official site
    Download hadoop-2.7.3.tar.gz from http://hadoop.apache.org/releases.html into your Hadoop working directory, e.g. ~/SoftWare/BigData/Hadoop
2. cd into the working directory, e.g. ~/SoftWare/BigData/Hadoop/, and extract the archive
   $ tar -zxvf hadoop-2.7.3.tar.gz
3. Check whether a JDK is installed; if not, install JDK 7 or later and set $JAVA_HOME, $PATH, and $CLASSPATH
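   For example, the three variables can be set in ~/.bashrc; a minimal sketch, assuming the JDK is unpacked at the illustrative path used later in this post:
      # illustrative JDK location; adjust to the actual install path
      export JAVA_HOME=/home/username/SoftWare/Java/jdk1.8.0_65
      export PATH=$JAVA_HOME/bin:$PATH
      export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
   Then reload the shell configuration and verify:
      $ source ~/.bashrc
      $ java -version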
4. Check whether ssh and rsync are installed; if not, install them:
          $ sudo apt-get install ssh
          $ sudo apt-get install rsync
5. Edit the JAVA_HOME setting in ${HADOOP_HOME}/etc/hadoop/hadoop-env.sh, using the actual absolute path,
  e.g.: export JAVA_HOME=/home/username/SoftWare/Java/jdk1.8.0_65
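  The steps below also refer to ${HADOOP_HOME}. A minimal sketch of defining it in ~/.bashrc, assuming the extraction path from step 2:
     # illustrative path to the extracted Hadoop distribution
     export HADOOP_HOME=~/SoftWare/BigData/Hadoop/hadoop-2.7.3
     export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH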
6. Run a MapReduce job in Standalone Operation (from the extracted hadoop-2.7.3 directory)
  $ mkdir input
  $ cp etc/hadoop/*.xml input
  $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'
  $ cat output/*
  You should see:
  1       dfsadmin

This confirms that Hadoop runs successfully in Standalone Operation.
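Note that the example job refuses to start if the output directory already exists; to re-run it, delete the directory first:
  $ rm -r output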

7. Pseudo-Distributed Operation
  1) Edit ${HADOOP_HOME}/etc/hadoop/core-site.xml:
     <configuration>
      <property>
          <!-- Address of the HDFS master (NameNode) -->
          <name>fs.defaultFS</name>
          <value>hdfs://localhost:9000</value>
      </property>
      <property>
          <!-- Base directory for files Hadoop generates at runtime -->
          <name>hadoop.tmp.dir</name>
          <value>/home/username/SoftWare/BigData/Hadoop/tmp</value>
      </property>
    </configuration>
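  It is worth making sure the hadoop.tmp.dir path exists and is writable before starting HDFS; a minimal sketch, using the path configured above:
     $ mkdir -p /home/username/SoftWare/BigData/Hadoop/tmp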
  2) Edit ${HADOOP_HOME}/etc/hadoop/hdfs-site.xml:
     <configuration>
       <property>
        <!-- Number of HDFS block replicas -->
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>
  3) Check whether you can ssh to localhost without a password:
     $ ssh localhost
  4) If ssh to localhost works without a password, skip this step; otherwise run:
     $ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
     $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
     $ chmod 0600 ~/.ssh/authorized_keys
  5) Start and run HDFS
     Before the first run, format the filesystem: ${HADOOP_HOME}/bin/hdfs namenode -format
     Start HDFS: ${HADOOP_HOME}/sbin/start-dfs.sh
     Use jps to check whether HDFS started successfully:
     $ jps
    
      63842 SecondaryNameNode
      63381 NameNode
      8470 Jps
      63565 DataNode
     Seeing the NameNode, DataNode, and SecondaryNameNode processes means HDFS started successfully.
     The NameNode web UI is now available at http://localhost:50070/, and the SecondaryNameNode at http://localhost:50090
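     To shut HDFS down later, use the matching stop script:
     $ ${HADOOP_HOME}/sbin/stop-dfs.sh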
    6) Upload and test a file
         Create a local file words.txt with the following content:
          Hello World!
          Hello China!
          Hello Jim
          Hello Tom
          The People's Republic Of China!
         Upload words.txt to the HDFS root directory: ${HADOOP_HOME}/bin/hadoop fs -put words.txt  /
         The uploaded file is now visible at http://localhost:50070/explorer.html#/
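         The upload can also be verified from the command line:
         $ ${HADOOP_HOME}/bin/hdfs dfs -ls /
         $ ${HADOOP_HOME}/bin/hdfs dfs -cat /words.txt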
    7) Run an example, counting the words in the file just uploaded:
       $ ${HADOOP_HOME}/bin/hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /words.txt output
    View the result: $ ${HADOOP_HOME}/bin/hdfs dfs -cat output/*
        
China!  2
Hello   4
Jim     1
Of      1
People's        1
Republic        1
The     1
Tom     1
World!  1


   8) In Pseudo-Distributed Operation, MapReduce can also run on YARN. To run it on YARN,
     edit ${HADOOP_HOME}/etc/hadoop/mapred-site.xml:
     <configuration>
        <property>
           <!-- Tell the MapReduce framework to run on YARN -->
           <name>mapreduce.framework.name</name>
           <value>yarn</value>
       </property>
     </configuration>
     Edit ${HADOOP_HOME}/etc/hadoop/yarn-site.xml:

     <configuration>
      <!-- Site specific YARN configuration properties -->
      <property>
       <!-- Reducers fetch map output via mapreduce_shuffle -->
       <name>yarn.nodemanager.aux-services</name>
       <value>mapreduce_shuffle</value>
      </property>
      <property>
       <!-- Total physical memory, in MB, that YARN may use on this node. The default is 8192 (MB);
            if the node has less than 8 GB of RAM, lower this value, since YARN does not
            auto-detect the node's physical memory. -->
       <name>yarn.nodemanager.resource.memory-mb</name>
       <value>3072</value>
      </property>
      <property>
       <!-- Minimum memory, in MB, allocated per container request -->
       <name>yarn.scheduler.minimum-allocation-mb</name>
       <value>2048</value>
      </property>
      <property>
       <!-- Maximum physical memory, in MB, a single task may request; the default is 8192 (MB) -->
       <name>yarn.scheduler.maximum-allocation-mb</name>
       <value>2048</value>
      </property>
      <property>
       <!-- Disk health check: when disk utilization exceeds this percentage, the NodeManager
            marks the disk as bad and MapReduce jobs cannot run -->
       <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
       <value>99</value>
      </property>
     </configuration>
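     The memory values above must fit the node's actual RAM; on Linux, a quick way to check is:
     $ free -m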

    9) Start YARN: ${HADOOP_HOME}/sbin/start-yarn.sh

    10) Use jps to check whether YARN started successfully:
        $ jps
       
13761 SecondaryNameNode
13410 NameNode
13923 ResourceManager
16744 Jps
14057 NodeManager
13567 DataNode

Seeing the ResourceManager and NodeManager processes means YARN started successfully.
Cluster information is available at http://localhost:8088/cluster
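The registered NodeManager can also be listed from the command line:
$ ${HADOOP_HOME}/bin/yarn node -list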

11) Run a MapReduce job on YARN
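    The job reads from the HDFS directory /input, which must exist and contain the input; a minimal sketch of preparing it, reusing words.txt from step 6 (the Map input records=5 counter below matches its five lines):
    $ ${HADOOP_HOME}/bin/hadoop fs -mkdir /input
    $ ${HADOOP_HOME}/bin/hadoop fs -put words.txt /input
    Then submit the job: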
    $ ${HADOOP_HOME}/bin/hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /input /output_wordcount
17/05/13 10:38:05 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/05/13 10:38:06 INFO input.FileInputFormat: Total input paths to process : 1
17/05/13 10:38:06 INFO mapreduce.JobSubmitter: number of splits:1
17/05/13 10:38:07 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1494642975142_0001
17/05/13 10:38:07 INFO impl.YarnClientImpl: Submitted application application_1494642975142_0001
17/05/13 10:38:07 INFO mapreduce.Job: The url to track the job: http://tizen-HP-Compaq-Pro-6380-MT:8088/proxy/application_1494642975142_0001/
17/05/13 10:38:07 INFO mapreduce.Job: Running job: job_1494642975142_0001
17/05/13 10:38:13 INFO mapreduce.Job: Job job_1494642975142_0001 running in uber mode : false
17/05/13 10:38:13 INFO mapreduce.Job:  map 0% reduce 0%
17/05/13 10:38:18 INFO mapreduce.Job:  map 100% reduce 0%
17/05/13 10:38:23 INFO mapreduce.Job:  map 100% reduce 100%
17/05/13 10:38:24 INFO mapreduce.Job: Job job_1494642975142_0001 completed successfully
17/05/13 10:38:24 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=113
                FILE: Number of bytes written=237983
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=180
                HDFS: Number of bytes written=71
                HDFS: Number of read operations=6
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=1
                Launched reduce tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=2214
                Total time spent by all reduces in occupied slots (ms)=2302
                Total time spent by all map tasks (ms)=2214
                Total time spent by all reduce tasks (ms)=2302
                Total vcore-milliseconds taken by all map tasks=2214
                Total vcore-milliseconds taken by all reduce tasks=2302
                Total megabyte-milliseconds taken by all map tasks=2267136
                Total megabyte-milliseconds taken by all reduce tasks=2357248
        Map-Reduce Framework
                Map input records=5
                Map output records=13
                Map output bytes=130
                Map output materialized bytes=113
                Input split bytes=102
                Combine input records=13
                Combine output records=9
                Reduce input groups=9
                Reduce shuffle bytes=113
                Reduce input records=9
                Reduce output records=9
                Spilled Records=18
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=88
                CPU time spent (ms)=1410
                Physical memory (bytes) snapshot=445538304
                Virtual memory (bytes) snapshot=3855974400
                Total committed heap usage (bytes)=290979840
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=78
        File Output Format Counters
                Bytes Written=71


      This indicates the job ran successfully. The result can be viewed with ${HADOOP_HOME}/bin/hadoop fs -cat /output_wordcount/*:
     
China!  2
Hello   4
Jim     1
Of      1
People's        1
Republic        1
The     1
Tom     1
World!  1


Note: when running a MapReduce job on YARN, the following warnings appeared:
2017-05-13 10:38:07,465 WARN org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: maximum-am-resource-percent is insufficient to start a single application in queue, it is likely set too low. skipping enforcement to allow at least one application to start
2017-05-13 10:38:07,465 WARN org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: maximum-am-resource-percent is insufficient to start a single application in queue for user, it is likely set too low. skipping enforcement to allow at least one application to start

The following had to be added to ${HADOOP_HOME}/etc/hadoop/yarn-site.xml:
 <property>
  <!-- Disk health check: when disk utilization exceeds this percentage, the NodeManager
       marks the disk as bad and MapReduce jobs cannot run -->
  <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
  <value>99</value>
 </property>
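After changing yarn-site.xml, restart YARN so the new setting takes effect:
 $ ${HADOOP_HOME}/sbin/stop-yarn.sh
 $ ${HADOOP_HOME}/sbin/start-yarn.sh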
