Bayesian Classification Algorithm

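Bayesian algorithms are mainly used for classification, that is, predicting which category a piece of data belongs to.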
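
As a minimal sketch of the idea, the snippet below applies Bayes' rule to a single word. The function name `posterior_spam` and all of the probabilities are made-up illustrative values, not statistics taken from the dataset in this post.

```python
def posterior_spam(p_word_given_spam, p_word_given_ham, p_spam):
    """Return P(spam | word) using Bayes' rule."""
    p_ham = 1.0 - p_spam
    # Total probability of seeing the word in any message.
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / p_word


if __name__ == "__main__":
    # Hypothetical numbers: the word appears in 40% of spam and 2% of ham,
    # and 20% of all messages are spam.
    print(round(posterior_spam(0.40, 0.02, 0.20), 3))  # 0.833
```

Even though only 20% of the messages are assumed to be spam, seeing a word that is much more common in spam pushes the posterior to about 83%. A naive Bayes classifier combines this kind of evidence over every word in the message.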

The following is a spam classification example.

Data

type,text
ham,00 00 00 are 0089 0089 having a good week. Just checking in
ham,K..give back my thanks.
ham,Am also doing in cbe only. But have to pay.
spam,"complimentary 4 STAR Ibiza Holiday or £10,000 cash needs your URGENT collection. 09066364349 NOW from Landline not to lose out! Box434SK38WP150PPM18+"
spam,okmail: Dear Dave this is your final notice to collect your 4* Tenerife Holiday or #5000 CASH award! Call 09061743806 from landline. TCs SAE Box326 CW25WX 150ppm
ham,Aiya we discuss later lar... Pick u up at 4 is it?
ham,Are you this much buzy
ham,Please ask mummy to call father
spam,Marvel Mobile Play the official Ultimate Spider-man game (£4.50) on ur mobile right now. Text SPIDER to 83338 for the game & we ll send u a FREE 8Ball wallpaper
ham,"fyi I'm at usf now, swing by the room whenever"
ham,"Sure thing big man. i have hockey elections at 6, shouldn't go on longer than an hour though"
ham,I anything lor...
ham,"By march ending, i should be ready. But will call you for sure. The problem is that my capital never complete. How far with you. How's work and the ladies"
ham,"Hmm well, night night "
ham,K I'll be sure to get up before noon and see what's what
ham,Ha ha cool cool chikku chikku:-):-DB-)
ham,Darren was saying dat if u meeting da ge den we dun meet 4 dinner. Cos later u leave xy will feel awkward. Den u meet him 4 lunch lor.
ham,He dint tell anything. He is angry on me that why you told to abi.
ham,Up to u... u wan come then come lor... But i din c any stripes skirt...
spam,"U can WIN £100 of Music Gift Vouchers every week starting NOW Txt the word DRAW to 87066 TsCs www.ldew.com SkillGame,1Winaweek, age16.150ppermessSubscription"
ham,2mro i am not coming to gym machan. Goodnight.
ham,ARR birthday today:) i wish him to get more oscar.
ham,Reading gud habit.. Nan bari hudgi yorge pataistha ertini kano:-)
ham,"I sent my scores to sophas and i had to do secondary application for a few schools. I think if you are thinking of applying, do a research on cost also. Contact joke ogunrinde, her school is one me the less expensive ones"
ham,"Could you not read me, my Love ? I answered you"
ham,So what did the bank say about the money?
ham,Well if I'm that desperate I'll just call armand again
ham,"Fuuuuck I need to stop sleepin, sup"
ham,So how's the weather over there?
ham,Ok thanx...
ham,Ok.ok ok..then..whats ur todays plan
ham,1Apple/Day=No Doctor. 1Tulsi Leaf/Day=No Cancer. 1Lemon/Day=No Fat. 1Cup Milk/day=No Bone Problms 3 Litres Watr/Day=No Diseases Snd ths 2 Whom U Care..:-)
ham,"Sorry, I'll call later"
ham,Will do. Was exhausted on train this morning. Too much wine and pie. You sleep well too
spam,U have won a nokia 6230 plus a free digital camera. This is what u get when u win our FREE auction. To take part send NOKIA to 83383 now. POBOX114/14TCR/W1 16
ham,Ron say fri leh. N he said ding tai feng cant make reservations. But he said wait lor.
ham,"Call me when you/carlos is/are here, my phone's vibrate is acting up and I might not hear texts"
ham,Oh k :)why you got job then whats up?
spam,"SPJanuary Male Sale! Hot Gay chat now cheaper, call 08709222922. National rate from 1.5p/min cheap to 7.8p/min peak! To stop texts call 08712460324 (10p/min)"
ham,Yeah you should. I think you can use your gt atm now to register. Not sure but if there's anyway i can help let me know. But when you do be sure you are ready.
ham,Nationwide auto centre (or something like that) on Newport road. I liked them there
ham,He is there. You call and meet him
ham,Yeah sure I'll leave in a min
spam,URGENT! Your Mobile number has been awarded with a £2000 prize GUARANTEED. Call 09061790121 from land line. Claim 3030. Valid 12hrs only 150ppm
ham,"Mah b, I'll pick it up tomorrow"
ham,Then she dun believe wat?
ham,I've sent u my part..

Python implementation

import csv

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# Map the string labels to integers.
LABEL_IDS = {"ham": 0, "spam": 1}

if __name__ == '__main__':
    corpus = []
    labels = []
    corpus_test = []
    labels_test = []
    # Read the file; csv.reader keeps quoted messages that contain commas intact.
    with open("../../sms_spam.txt", newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        # Skip the header row (type,text).
        next(reader)
        # The header counts as row 1, so the first data row is row 2.
        for count, row in enumerate(reader, start=2):
            # Ignore blank or malformed rows and unknown labels.
            if len(row) < 2 or row[0] not in LABEL_IDS:
                continue
            # Target value: 0 for ham, 1 for spam.
            label = LABEL_IDS[row[0]]
            # Feature: the message text.
            sentence = row[1]
            if count > 5550:
                # Rows after 5550 form the held-out test set.
                corpus_test.append(sentence)
                labels_test.append(label)
            else:
                # Everything else is training data, kept separate from the test set.
                corpus.append(sentence)
                labels.append(label)
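    # Build the training features.
    # CountVectorizer tokenizes each document and turns the corpus into a sparse
    # matrix of word counts (one row per message, one column per word).
    vectorizer = CountVectorizer()
    fea_train = vectorizer.fit_transform(corpus)
    # Every word seen in the corpus, in sorted order, becomes one feature dimension.
    # get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2.
    print(vectorizer.get_feature_names_out())
    # Per-message counts of each word along those dimensions.
    print(fea_train.toarray())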

    # Build the test features.
    # Reuse the vocabulary learned from the training set; words that appear only
    # in the test set are not counted.
    fea_test = vectorizer.transform(corpus_test)
    print(fea_test.toarray())
    
    
    # Create the naive Bayes classifier and fit it on the training data.
    # alpha=1 is Laplace (add-one) smoothing: 1 is added to every word count.
    clf = MultinomialNB(alpha=1)
    clf.fit(fea_train, labels)
    
    # Run the model on the test data to get the predictions.
    pred = clf.predict(fea_test)
    for p in pred:
        if p == 0:
            print("ham")
        else:
            print("spam")
    # Print each prediction next to the true label.
    for i in range(len(pred)):
        print(pred[i], "\t", labels_test[i])
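
To make the two steps above less of a black box, here is a rough from-scratch sketch of multinomial naive Bayes with add-one smoothing. It is not how scikit-learn is implemented: `train_nb`, `predict_nb`, and the four toy messages are hypothetical, and tokenization is a plain whitespace split, which differs from CountVectorizer's default tokenizer.

```python
import math
from collections import Counter


def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial naive Bayes model with additive (Laplace) smoothing."""
    # The vocabulary is fixed by the training documents only.
    vocab = sorted({w for d in docs for w in d.lower().split()})
    log_prior = {}
    log_likelihood = {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d.lower().split())
        total = sum(counts.values())
        # Smoothed estimate: (count + alpha) / (total + alpha * |vocab|),
        # so a word unseen in this class never gets probability zero.
        log_likelihood[c] = {
            w: math.log((counts[w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
    return vocab, log_prior, log_likelihood


def predict_nb(model, doc):
    """Return the class with the highest log posterior score."""
    _, log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for w in doc.lower().split():
            # Words outside the training vocabulary are ignored.
            if w in log_likelihood[c]:
                score += log_likelihood[c][w]
        scores[c] = score
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Toy data, invented for illustration: 1 = spam, 0 = ham.
    docs = ["win free cash now", "free prize call now",
            "call me later", "see you at lunch"]
    labels = [1, 1, 0, 0]
    model = train_nb(docs, labels)
    print(predict_nb(model, "free cash prize"))  # 1
    print(predict_nb(model, "see you later"))    # 0
```

Working in log space avoids numeric underflow when many small probabilities are multiplied together, which is why the scores are sums of logarithms.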

Spark implementation

package com.sunbin

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.log4j.{ Level, Logger }
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.classification.NaiveBayes

object Naive_bayes {
  def main(args: Array[String]): Unit = {
    // 1. Create the Spark context
    val conf=new SparkConf().setMaster("local[2]").setAppName("bayes")
    val sc=new SparkContext(conf)
    Logger.getRootLogger.setLevel(Level.WARN)
    val data_path1 = "sms_spam.txt"
    val lines= sc.textFile(data_path1, 2)
    
    val tf = new HashingTF(numFeatures = 100000)
    
    // Build the dataset: drop the header row and lines without a comma
    val parsedData = lines.filter(line => line.contains(",") && !line.startsWith("type,")).map(line => {
      // Split on the first comma only, so commas inside the message are kept
      val parts = line.split(",", 2)
      // Convert the message text into a term-frequency vector
      val features = tf.transform(parts(1).split(" "))
      if (parts(0) == "ham") {
        LabeledPoint(0, features)
      } else {
        LabeledPoint(1, features)
      }
    })
    parsedData.cache()
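    // Split the data into a training set (90%) and a test set (10%)
    val splits = parsedData.randomSplit(Array(0.9, 0.1), seed = 1L)
    val test = splits(1)
    val train = splits(0)

    // Train the model; lambda = 1.0 is the additive (Laplace) smoothing parameter
    val model = NaiveBayes.train(train, lambda = 1.0)
    // Predict on the test set and pair each prediction with its true label
    val predictionAndLabel = test.map(p => {
      val prediction = model.predict(p.features)
      println(s"$prediction\t${p.label}")
      (prediction, p.label)
    })
    // count() is an action; it forces the map above (and its println calls) to run
    predictionAndLabel.count()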
  }
  
}
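
Unlike CountVectorizer, HashingTF does not build a vocabulary. It hashes each token straight into one of `numFeatures` buckets. The Python sketch below illustrates that hashing trick under simplifying assumptions: `hashing_tf` is a hypothetical helper, and `zlib.crc32` stands in for the hash function Spark actually uses, so the indices will not match Spark's.

```python
import zlib


def hashing_tf(tokens, num_features=100000):
    """Map tokens to a sparse {index: count} vector with the hashing trick."""
    vector = {}
    for token in tokens:
        # crc32 is only a stand-in; Spark's HashingTF uses a different hash.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] = vector.get(index, 0.0) + 1.0
    return vector


if __name__ == "__main__":
    # Repeated tokens land in the same bucket, so their counts add up.
    print(hashing_tf("free free prize".split(" ")))
```

The trade-off is that two different words can collide in the same bucket. A large `num_features` value, such as the 100000 used above, keeps collisions rare for a small corpus.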
