Java implementations of the classification algorithms in Spark MLlib

HeavyIndustry

2017-01-13

I. Overview

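Spark is a very popular data analysis framework, and its machine learning package MLlib is one of its many highlights. I suspect many people, like me, want to get started with Spark quickly. Below is concise code that implements classification with MLlib. The comments in the code briefly describe the structure of the training set and the test input, as well as the meaning of each algorithm's parameters. The classification models covered are naive Bayes, SVM, decision tree, and random forest.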

II. Implementation

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import java.util.LinkedList;
import java.util.List;
 
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;
 
import org.apache.spark.mllib.classification.NaiveBayes;
import org.apache.spark.mllib.classification.NaiveBayesModel;
 
import org.apache.spark.mllib.classification.SVMModel;
import org.apache.spark.mllib.classification.SVMWithSGD;
 
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.mllib.tree.DecisionTree;
import org.apache.spark.mllib.tree.model.DecisionTreeModel;
 
import org.apache.spark.mllib.tree.RandomForest;
import org.apache.spark.mllib.tree.model.RandomForestModel;
 
public class test {
    public static void main(String[] arg){
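        // Create the Spark context
        SparkConf conf = new SparkConf();
        conf.set("spark.testing.memory","2147480000");  // Spark runtime setting: roughly 2 GB of memory (value in bytes)
        // First argument: local mode, where [*] uses as many CPU cores as possible;
        // second: the Spark application name, which can be anything; third: the configuration object
        JavaSparkContext sc = new JavaSparkContext("local[*]", "Spark", conf);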
         
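        // Build the training set
        // The data structure is LabeledPoint: 1.0 is the class label, Vectors.dense(2.0, 3.0, 3.0) is the feature vector
        LabeledPoint pos = new LabeledPoint(1.0, Vectors.dense(2.0, 3.0, 3.0));
        // When features are sparse, build the vector with sparse(size, indices, values).
        // Indices must be strictly increasing, so this encodes the vector (0.0, 1.0, 1.0).
        LabeledPoint neg = new LabeledPoint(0.0, Vectors.sparse(3, new int[] {1, 2}, new double[] {1.0, 1.0}));
        List<LabeledPoint> l = new LinkedList<>(); // hold the training samples in a List
        l.add(neg);
        l.add(pos);
        JavaRDD<LabeledPoint> training = sc.parallelize(l); // convert to an RDD; the element type is LabeledPoint, not List
        final NaiveBayesModel nb_model = NaiveBayes.train(training.rdd()); // train the naive Bayes model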
         
        // Build the test input
        double[] d = {1, 1, 2};
        Vector v = Vectors.dense(d); // the input to predict is a single Vector, or an RDD of Vectors

        // Naive Bayes
        System.out.println(nb_model.predict(v)); // predicted class label
        System.out.println(nb_model.predictProbabilities(v)); // per-class probabilities
 
       
        // Support vector machine
        int numIterations = 100; // number of iterations
        final SVMModel svm_model = SVMWithSGD.train(training.rdd(), numIterations); // train the model
        System.out.println(svm_model.predict(v));
 
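        // Decision tree
        Integer numClasses = 2; // number of classes
        // An empty map means every feature is treated as continuous
        Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<>();
        // For classification, impurity can be measured with "entropy" or "gini";
        // for regression, "variance" is used: the larger the variance, the more the data differ
        String impurity = "gini";
        Integer maxDepth = 5; // maximum tree depth
        Integer maxBins = 32; // maximum number of bins used when splitting features
        final DecisionTreeModel tree_model = DecisionTree.trainClassifier(training, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins); // train the model
        System.out.println("Decision tree prediction:");
        System.out.println(tree_model.predict(v));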
       
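        // Random forest
        Integer numTrees = 3; // Use more in practice.
        String featureSubsetStrategy = "auto"; // Let the algorithm choose.
        Integer seed = 12345;
        // Train a RandomForest classifier; trainRegressor is for regression and expects the "variance" impurity.
        // The parameters largely match the decision tree, plus numTrees, featureSubsetStrategy, and seed.
        final RandomForestModel forest_model = RandomForest.trainClassifier(training, numClasses,
          categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins, seed);
        System.out.println("Random forest prediction:");
        System.out.println(forest_model.predict(v));

        sc.stop(); // release the Spark context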
    }
  }
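
The impurity parameter used for the tree models controls how "mixed" the class labels at a node are considered to be. As a rough illustration only, the following sketch computes the textbook Gini and entropy values from per-class sample counts. It is not MLlib's internal code; the class name ImpuritySketch and its methods are hypothetical names chosen for this example.

```java
import java.util.Arrays;

public class ImpuritySketch {
    // Gini impurity: 1 - sum(p_i^2), where p_i is the fraction of samples in class i.
    public static double gini(int[] classCounts) {
        double total = Arrays.stream(classCounts).sum();
        if (total == 0) {
            return 0.0; // an empty node is treated as pure
        }
        double sumSquares = 0.0;
        for (int count : classCounts) {
            double p = count / total;
            sumSquares += p * p;
        }
        return 1.0 - sumSquares;
    }

    // Entropy in bits: -sum(p_i * log2(p_i)), skipping classes with no samples.
    public static double entropy(int[] classCounts) {
        double total = Arrays.stream(classCounts).sum();
        if (total == 0) {
            return 0.0;
        }
        double h = 0.0;
        for (int count : classCounts) {
            if (count == 0) {
                continue; // 0 * log(0) is taken as 0
            }
            double p = count / total;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    public static void main(String[] args) {
        int[] mixed = {1, 1}; // one sample per class, like the two-point training set
        int[] pure = {2, 0};  // both samples in the same class
        System.out.println("gini(mixed)    = " + gini(mixed));
        System.out.println("entropy(mixed) = " + entropy(mixed));
        System.out.println("gini(pure)     = " + gini(pure));
        System.out.println("entropy(pure)  = " + entropy(pure));
    }
}
```

Both measures are zero for a node that contains a single class and largest when the classes are evenly mixed, which is why a split that lowers them is preferred.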

III. Notes

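1. When analyzing data with Spark, the data generally has to be converted to an RDD. (Reading external files through the interfaces Spark provides usually produces an RDD directly, and MapReduce-style processing can likewise produce a training set that matches the interface.)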

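2. Training samples are always labeled points (LabeledPoint). The sample collection starts out as a List, but once converted to an RDD its type is JavaRDD<LabeledPoint>, not an RDD of List. Model training only accepts this type: the tree-based trainers take the JavaRDD<LabeledPoint> directly, while NaiveBayes.train and SVMWithSGD.train take the underlying RDD obtained with training.rdd().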

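3. For classification, predict returns the class label. The naive Bayes model can also return the probability of belonging to each class (the Python API does not have this interface).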
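
To make the per-class probabilities in note 3 more concrete, here is a minimal plain-Java sketch of textbook multinomial naive Bayes with additive smoothing. It is an illustration under stated assumptions, not MLlib's implementation: the class name NaiveBayesSketch and the method posterior are hypothetical, the smoothing value 1.0 is an assumed default, and the negative training sample is assumed to be the dense vector (0, 1, 1).

```java
public class NaiveBayesSketch {
    // Returns P(class | x) for every class under multinomial naive Bayes
    // with additive (Laplace) smoothing lambda.
    public static double[] posterior(double[][] features, int[] labels,
                                     int numClasses, double lambda, double[] x) {
        int numFeatures = x.length;
        double[] sampleCount = new double[numClasses];
        double[][] featureSum = new double[numClasses][numFeatures];
        for (int i = 0; i < features.length; i++) {
            sampleCount[labels[i]]++;
            for (int j = 0; j < numFeatures; j++) {
                featureSum[labels[i]][j] += features[i][j];
            }
        }

        // Unnormalized log posterior: log prior plus weighted log likelihoods.
        double[] logScore = new double[numClasses];
        double maxLog = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < numClasses; c++) {
            double classTotal = 0.0;
            for (int j = 0; j < numFeatures; j++) {
                classTotal += featureSum[c][j];
            }
            logScore[c] = Math.log((sampleCount[c] + lambda)
                    / (features.length + numClasses * lambda));
            for (int j = 0; j < numFeatures; j++) {
                double theta = (featureSum[c][j] + lambda)
                        / (classTotal + numFeatures * lambda);
                logScore[c] += x[j] * Math.log(theta);
            }
            maxLog = Math.max(maxLog, logScore[c]);
        }

        // Normalize in a numerically stable way (subtract the max before exp).
        double[] probs = new double[numClasses];
        double norm = 0.0;
        for (int c = 0; c < numClasses; c++) {
            probs[c] = Math.exp(logScore[c] - maxLog);
            norm += probs[c];
        }
        for (int c = 0; c < numClasses; c++) {
            probs[c] /= norm;
        }
        return probs;
    }

    public static void main(String[] args) {
        double[][] features = {{0.0, 1.0, 1.0}, {2.0, 3.0, 3.0}}; // neg, pos
        int[] labels = {0, 1};
        double[] x = {1.0, 1.0, 2.0}; // the test vector used in the listing
        double[] probs = posterior(features, labels, 2, 1.0, x);
        System.out.println("P(class 0 | x) = " + probs[0]);
        System.out.println("P(class 1 | x) = " + probs[1]);
    }
}
```

Under these assumptions, class 1 receives the larger posterior for the test vector (1, 1, 2), and the returned probabilities sum to 1. The label with the highest probability is what a predict-style call would report.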
