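Bayesian Classification Algorithm
The Bayesian (Naive Bayes) algorithm is mainly used for classification, that is, predicting which category a piece of data belongs to.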
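As a rough illustration of the idea, a multinomial Naive Bayes classifier scores each class by its prior probability multiplied by the smoothed likelihood of every word in the message, then picks the class with the highest score. The sketch below is a hand-rolled toy version with add-one (Laplace) smoothing. The function names `train_nb` and `predict_nb` and the four toy messages are made up for illustration and do not come from the original code.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Collect the counts needed by a multinomial Naive Bayes model."""
    class_docs = Counter(labels)        # number of documents per class
    word_counts = defaultdict(Counter)  # word counts per class
    for text, label in zip(docs, labels):
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_docs, word_counts, vocab, alpha


def predict_nb(model, text):
    """Return the class with the highest log posterior for the given text."""
    class_docs, word_counts, vocab, alpha = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        # Log prior: fraction of training documents in this class
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training are ignored
            # Smoothed likelihood: (count + alpha) / (total + alpha * |V|)
            count = word_counts[label][word]
            score += math.log((count + alpha) / (total_words + alpha * len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Toy training data (hypothetical, for illustration only)
docs = ["win free cash now", "free prize call now",
        "see you at lunch", "call me later please"]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(docs, labels)
print(predict_nb(model, "free cash prize"))
print(predict_nb(model, "see you later"))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the `alpha` term keeps a word that never appeared in one class from forcing that class's probability to zero.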
Below is a spam classification example built on SMS messages labeled ham (legitimate) or spam.
Data
type,text
ham,00 00 00 are 0089 0089 having a good week. Just checking in
ham,K..give back my thanks.
ham,Am also doing in cbe only. But have to pay.
spam,"complimentary 4 STAR Ibiza Holiday or £10,000 cash needs your URGENT collection. 09066364349 NOW from Landline not to lose out! Box434SK38WP150PPM18+"
spam,okmail: Dear Dave this is your final notice to collect your 4* Tenerife Holiday or #5000 CASH award! Call 09061743806 from landline. TCs SAE Box326 CW25WX 150ppm
ham,Aiya we discuss later lar... Pick u up at 4 is it?
ham,Are you this much buzy
ham,Please ask mummy to call father
spam,Marvel Mobile Play the official Ultimate Spider-man game (£4.50) on ur mobile right now. Text SPIDER to 83338 for the game & we ll send u a FREE 8Ball wallpaper
ham,"fyi I'm at usf now, swing by the room whenever"
ham,"Sure thing big man. i have hockey elections at 6, shouldn't go on longer than an hour though"
ham,I anything lor...
ham,"By march ending, i should be ready. But will call you for sure. The problem is that my capital never complete. How far with you. How's work and the ladies"
ham,"Hmm well, night night "
ham,K I'll be sure to get up before noon and see what's what
ham,Ha ha cool cool chikku chikku:-):-DB-)
ham,Darren was saying dat if u meeting da ge den we dun meet 4 dinner. Cos later u leave xy will feel awkward. Den u meet him 4 lunch lor.
ham,He dint tell anything. He is angry on me that why you told to abi.
ham,Up to u... u wan come then come lor... But i din c any stripes skirt...
spam,"U can WIN £100 of Music Gift Vouchers every week starting NOW Txt the word DRAW to 87066 TsCs www.ldew.com SkillGame,1Winaweek, age16.150ppermessSubscription"
ham,2mro i am not coming to gym machan. Goodnight.
ham,ARR birthday today:) i wish him to get more oscar.
ham,Reading gud habit.. Nan bari hudgi yorge pataistha ertini kano:-)
ham,"I sent my scores to sophas and i had to do secondary application for a few schools. I think if you are thinking of applying, do a research on cost also. Contact joke ogunrinde, her school is one me the less expensive ones"
ham,"Could you not read me, my Love ? I answered you"
ham,So what did the bank say about the money?
ham,Well if I'm that desperate I'll just call armand again
ham,"Fuuuuck I need to stop sleepin, sup"
ham,So how's the weather over there?
ham,Ok thanx...
ham,Ok.ok ok..then..whats ur todays plan
ham,1Apple/Day=No Doctor. 1Tulsi Leaf/Day=No Cancer. 1Lemon/Day=No Fat. 1Cup Milk/day=No Bone Problms 3 Litres Watr/Day=No Diseases Snd ths 2 Whom U Care..:-)
ham,"Sorry, I'll call later"
ham,Will do. Was exhausted on train this morning. Too much wine and pie. You sleep well too
spam,U have won a nokia 6230 plus a free digital camera. This is what u get when u win our FREE auction. To take part send NOKIA to 83383 now. POBOX114/14TCR/W1 16
ham,Ron say fri leh. N he said ding tai feng cant make reservations. But he said wait lor.
ham,"Call me when you/carlos is/are here, my phone's vibrate is acting up and I might not hear texts"
ham,Oh k :)why you got job then whats up?
spam,"SPJanuary Male Sale! Hot Gay chat now cheaper, call 08709222922. National rate from 1.5p/min cheap to 7.8p/min peak! To stop texts call 08712460324 (10p/min)"
ham,Yeah you should. I think you can use your gt atm now to register. Not sure but if there's anyway i can help let me know. But when you do be sure you are ready.
ham,Nationwide auto centre (or something like that) on Newport road. I liked them there
ham,He is there. You call and meet him
ham,Yeah sure I'll leave in a min
spam,URGENT! Your Mobile number has been awarded with a £2000 prize GUARANTEED. Call 09061790121 from land line. Claim 3030. Valid 12hrs only 150ppm
ham,"Mah b, I'll pick it up tomorrow"
ham,Then she dun believe wat?
ham,I've sent u my part..
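Several rows in the data wrap the message in double quotes because the text itself contains commas, so a plain `split(",")` would cut those messages short. One way to handle this, sketched here under the assumption that the file is ordinary CSV, is Python's standard `csv` module. The `sample` string holds the header and three rows copied from the data, and `parse_rows` is a hypothetical helper name.

```python
import csv
import io

# The header and three rows copied from the sample data
sample = '''type,text
ham,K..give back my thanks.
ham,"fyi I'm at usf now, swing by the room whenever"
spam,okmail: Dear Dave this is your final notice to collect your 4* Tenerife Holiday or #5000 CASH award! Call 09061743806 from landline. TCs SAE Box326 CW25WX 150ppm
'''


def parse_rows(stream):
    """Yield (label, text) pairs, with 0 for ham and 1 for spam."""
    label_ids = {"ham": 0, "spam": 1}
    reader = csv.reader(stream)
    next(reader)  # skip the "type,text" header
    for row in reader:
        # Keep only well-formed rows that start with a known label
        if len(row) == 2 and row[0] in label_ids:
            yield label_ids[row[0]], row[1]


rows = list(parse_rows(io.StringIO(sample)))
for label, text in rows:
    print(label, text)
```

The quoted row comes back as one field with its comma intact, which is what the feature extraction step needs.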
Python implementation
# Naive Bayes spam classifier using scikit-learn
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

if __name__ == '__main__':
    corpus = []
    labels = []
    corpus_test = []
    labels_test = []
    # Map the string labels to integer class ids
    label_ids = {"ham": 0, "spam": 1}
    # Read the file; each line has the form "type,text"
    with open("../../sms_spam.txt", "rb") as f:
        # count is the 1-based line number
        for count, raw in enumerate(f, start=1):
            # Skip the header line
            if count == 1:
                continue
            # Lines read in binary mode are bytes in Python 3 (str in Python 2), so decode to str
            line = raw.decode().strip()
            # Split on the first comma only, because the message text may contain commas
            label, _, sentence = line.partition(",")
            # Skip lines that do not start with a known label
            if label not in label_ids:
                continue
            if count > 5550:
                # Lines after 5550 are held out as the test set
                corpus_test.append(sentence)
                labels_test.append(label_ids[label])
            else:
                # All earlier lines form the training set
                corpus.append(sentence)
                labels.append(label_ids[label])
    # Build the training features
    # CountVectorizer splits each document into word tokens and converts the
    # corpus into a sparse matrix of word counts
    vectorizer = CountVectorizer()
    fea_train = vectorizer.fit_transform(corpus)
    # The feature dimensions are all words seen in the training corpus, in sorted order
    # (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
    print(vectorizer.get_feature_names_out())
    # Word counts per document for each feature dimension
    print(fea_train.toarray())
    # Build the test features
    # Reuse the fitted vectorizer; words that appear only in the test set are not counted
    fea_test = vectorizer.transform(corpus_test)
    print(fea_test.toarray())

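    # Create the Naive Bayes model and fit it on the training data
    # alpha=1 is Laplace smoothing: 1 is added to every word count
    clf = MultinomialNB(alpha=1)
    clf.fit(fea_train, labels)

    # Run the model on the test data to get the predicted labels
    pred = clf.predict(fea_test)
    for p in pred:
        if p == 0:
            print("ham")
        else:
            print("spam")
    # Print each prediction next to the true label
    for i in range(len(pred)):
        print(pred[i], "\t", labels_test[i])
Spark implementation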
package com.sunbin
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.log4j.{ Level, Logger }
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.classification.NaiveBayes
object Naive_bayes {
  def main(args: Array[String]): Unit = {
    // 1. Create the Spark context
    val conf = new SparkConf().setMaster("local[2]").setAppName("bayes")
    val sc = new SparkContext(conf)
    Logger.getRootLogger.setLevel(Level.WARN)
    // Load the data file as an RDD of lines, using 2 partitions
    val data_path1 = "sms_spam.txt"
    val lines = sc.textFile(data_path1, 2)

    // Hash each word into one of 100000 term-frequency feature slots
    val tf = new HashingTF(numFeatures = 100000)

    // Build the dataset
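    // Keep only lines that start with a known label; this also drops the "type,text" header
    val labeledLines = lines.filter(line => line.startsWith("ham,") || line.startsWith("spam,"))
    val parsedData = labeledLines.map(line => {
      // Split on the first comma only, because the message text may contain commas
      val parts = line.split(",", 2)
      // Convert the words of the message into a term-frequency vector
      val features = tf.transform(parts(1).split(" ").toSeq)
      // ham -> 0.0, spam -> 1.0
      if (parts(0) == "ham") {
        LabeledPoint(0.0, features)
      } else {
        LabeledPoint(1.0, features)
      }
    })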
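    parsedData.cache()
    // Split the data into a training set (90%) and a test set (10%)
    val splits = parsedData.randomSplit(Array(0.9, 0.1), seed = 1L)
    val train = splits(0)
    val test = splits(1)
    // Train the model; lambda = 1.0 is additive (Laplace) smoothing
    val model = NaiveBayes.train(train, lambda = 1.0)
    // Predict on the test set, pairing each prediction with the true label
    val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
    // Collect the pairs to the driver before printing so the output appears in the driver console
    predictionAndLabel.collect().foreach { case (prediction, label) =>
      println(s"$prediction $label")
    }
  }
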
}
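The Spark version relies on `HashingTF` to turn words into a fixed-length vector without building a vocabulary first: each term is hashed to an index in the range `[0, numFeatures)` and the count at that index is incremented. A rough Python sketch of that idea follows. It uses `zlib.crc32` as a stand-in hash, which is an assumption made for illustration and not the hash function Spark uses, so the indices will differ from Spark's. `hashing_tf` is a hypothetical helper name.

```python
import zlib
from collections import Counter


def hashing_tf(tokens, num_features=100000):
    """Map tokens to a sparse {index: count} term-frequency vector."""
    counts = Counter()
    for token in tokens:
        # Stand-in hash (not Spark's); crc32 gives the same value on every run
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(counts)


vector = hashing_tf("call me later call".split(" "))
print(vector)
```

Two different words can land on the same index. A large feature count such as the 100000 used in the Spark code keeps such collisions rare for a vocabulary of this size.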