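Bayesian Classification Algorithm
Bayesian algorithms are mainly used for classification, that is, predicting which class a data item belongs to.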
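For reference, a naive Bayes classifier picks the class c with the highest posterior probability given the words w_1, ..., w_n of a message. The "naive" step assumes the words are conditionally independent given the class, which lets the likelihood factor into a product. The symbols here are notation chosen for illustration and do not come from the original text:

```latex
P(c \mid w_1, \dots, w_n)
  = \frac{P(c)\, P(w_1, \dots, w_n \mid c)}{P(w_1, \dots, w_n)}
  \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c),
\qquad
\hat{c} = \arg\max_{c \in \{\text{ham},\, \text{spam}\}} P(c) \prod_{i=1}^{n} P(w_i \mid c)
```

The denominator is the same for every class, so it can be dropped when comparing classes.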
The following example classifies messages as spam or ham (not spam).
Data
type,text
ham,00 00 00 are 0089 0089 having a good week. Just checking in
ham,K..give back my thanks.
ham,Am also doing in cbe only. But have to pay.
spam,"complimentary 4 STAR Ibiza Holiday or £10,000 cash needs your URGENT collection. 09066364349 NOW from Landline not to lose out! Box434SK38WP150PPM18+"
spam,okmail: Dear Dave this is your final notice to collect your 4* Tenerife Holiday or #5000 CASH award! Call 09061743806 from landline. TCs SAE Box326 CW25WX 150ppm
ham,Aiya we discuss later lar... Pick u up at 4 is it?
ham,Are you this much buzy
ham,Please ask mummy to call father
spam,Marvel Mobile Play the official Ultimate Spider-man game (£4.50) on ur mobile right now. Text SPIDER to 83338 for the game & we ll send u a FREE 8Ball wallpaper
ham,"fyi I'm at usf now, swing by the room whenever"
ham,"Sure thing big man. i have hockey elections at 6, shouldn't go on longer than an hour though"
ham,I anything lor...
ham,"By march ending, i should be ready. But will call you for sure. The problem is that my capital never complete. How far with you. How's work and the ladies"
ham,"Hmm well, night night "
ham,K I'll be sure to get up before noon and see what's what
ham,Ha ha cool cool chikku chikku:-):-DB-)
ham,Darren was saying dat if u meeting da ge den we dun meet 4 dinner. Cos later u leave xy will feel awkward. Den u meet him 4 lunch lor.
ham,He dint tell anything. He is angry on me that why you told to abi.
ham,Up to u... u wan come then come lor... But i din c any stripes skirt...
spam,"U can WIN £100 of Music Gift Vouchers every week starting NOW Txt the word DRAW to 87066 TsCs www.ldew.com SkillGame,1Winaweek, age16.150ppermessSubscription"
ham,2mro i am not coming to gym machan. Goodnight.
ham,ARR birthday today:) i wish him to get more oscar.
ham,Reading gud habit.. Nan bari hudgi yorge pataistha ertini kano:-)
ham,"I sent my scores to sophas and i had to do secondary application for a few schools. I think if you are thinking of applying, do a research on cost also. Contact joke ogunrinde, her school is one me the less expensive ones"
ham,"Could you not read me, my Love ? I answered you"
ham,So what did the bank say about the money?
ham,Well if I'm that desperate I'll just call armand again
ham,"Fuuuuck I need to stop sleepin, sup"
ham,So how's the weather over there?
ham,Ok thanx...
ham,Ok.ok ok..then..whats ur todays plan
ham,1Apple/Day=No Doctor. 1Tulsi Leaf/Day=No Cancer. 1Lemon/Day=No Fat. 1Cup Milk/day=No Bone Problms 3 Litres Watr/Day=No Diseases Snd ths 2 Whom U Care..:-)
ham,"Sorry, I'll call later"
ham,Will do. Was exhausted on train this morning. Too much wine and pie. You sleep well too
spam,U have won a nokia 6230 plus a free digital camera. This is what u get when u win our FREE auction. To take part send NOKIA to 83383 now. POBOX114/14TCR/W1 16
ham,Ron say fri leh. N he said ding tai feng cant make reservations. But he said wait lor.
ham,"Call me when you/carlos is/are here, my phone's vibrate is acting up and I might not hear texts"
ham,Oh k :)why you got job then whats up?
spam,"SPJanuary Male Sale! Hot Gay chat now cheaper, call 08709222922. National rate from 1.5p/min cheap to 7.8p/min peak! To stop texts call 08712460324 (10p/min)"
ham,Yeah you should. I think you can use your gt atm now to register. Not sure but if there's anyway i can help let me know. But when you do be sure you are ready.
ham,Nationwide auto centre (or something like that) on Newport road. I liked them there
ham,He is there. You call and meet him
ham,Yeah sure I'll leave in a min
spam,URGENT! Your Mobile number has been awarded with a £2000 prize GUARANTEED. Call 09061790121 from land line. Claim 3030. Valid 12hrs only 150ppm
ham,"Mah b, I'll pick it up tomorrow"
ham,Then she dun believe wat?
ham,I've sent u my part..
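Several messages in the sample contain commas inside double-quoted fields, so splitting a line on every comma would cut those messages short. As a minimal sketch, assuming the file is standard CSV with a `type,text` header, Python's built-in `csv` module handles the quoting. The `SAMPLE` string and the `parse_rows` helper are names made up for this illustration; the three data rows are copied from the sample above.

```python
import csv
import io

# Header plus three rows copied from the sample data.
SAMPLE = '''type,text
ham,K..give back my thanks.
ham,"fyi I'm at usf now, swing by the room whenever"
spam,U have won a nokia 6230 plus a free digital camera. This is what u get when u win our FREE auction. To take part send NOKIA to 83383 now. POBOX114/14TCR/W1 16
'''


def parse_rows(text):
    """Return (label, message) pairs, with label 0 for ham and 1 for spam."""
    label_ids = {"ham": 0, "spam": 1}
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    rows = []
    for record in reader:
        # Ignore blank lines and rows with an unexpected shape or label.
        if len(record) != 2 or record[0] not in label_ids:
            continue
        rows.append((label_ids[record[0]], record[1]))
    return rows


rows = parse_rows(SAMPLE)
for label, message in rows:
    print(label, message)
```

For a real file, the same reader can be fed an open file handle instead of the `io.StringIO` wrapper.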
Python implementation
# codecs: module for encoding-aware file I/O
import codecs

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

if __name__ == '__main__':
    corpus = []
    labels = []
    corpus_test = []
    labels_test = []
    # Read the file. Decoding as UTF-8 yields str lines directly
    # (in Python 3, reading in binary mode would return bytes instead).
    with codecs.open("../../sms_spam.txt", "r", encoding="utf-8") as f:
        count = 0
        for line in f:
            count = count + 1
            # Skip the header row
            if count == 1:
                continue
            # Split on the first comma only, so commas inside the message are kept
            parts = line.rstrip("\r\n").split(",", 1)
            if len(parts) != 2:
                continue
            # label is the target column, sentence is the feature text
            label, sentence = parts
            # Map the label string to 0 (ham) or 1 (spam)
            if label == "ham":
                target = 0
            elif label == "spam":
                target = 1
            else:
                continue
            # Rows after line 5550 form the test set and are kept out of the
            # training set, so the evaluation is not inflated
            if count > 5550:
                corpus_test.append(sentence)
                labels_test.append(target)
            else:
                corpus.append(sentence)
                labels.append(target)

    # Build the training features.
    # CountVectorizer tokenizes each document and converts the text into a
    # sparse vector of word counts.
    vectorizer = CountVectorizer()
    fea_train = vectorizer.fit_transform(corpus)
    # The feature dimensions are all tokens seen in training, in sorted order
    # (get_feature_names_out needs scikit-learn 1.0+; older releases use get_feature_names)
    print(vectorizer.get_feature_names_out())
    # Word counts per row for each feature dimension
    print(fea_train.toarray())

    # Build the test features with the vocabulary learned from the training set;
    # words that appear only in the test set are ignored
    fea_test = vectorizer.transform(corpus_test)
    print(fea_test.toarray())

    # Create the naive Bayes model and fit it on the training data.
    # alpha=1 is Laplace smoothing: 1 is added to every word count
    clf = MultinomialNB(alpha=1)
    clf.fit(fea_train, labels)

    # Predict labels for the test data
    pred = clf.predict(fea_test)
    for p in pred:
        if p == 0:
            print("ham")
        else:
            print("spam")
    # Print each prediction next to the true label
    for i in range(len(pred)):
        print(pred[i], "\t", labels_test[i])

Spark implementation
package com.sunbin

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint

object Naive_bayes {
  def main(args: Array[String]): Unit = {
    // 1. Create the Spark context
    val conf = new SparkConf().setMaster("local[2]").setAppName("bayes")
    val sc = new SparkContext(conf)
    Logger.getRootLogger.setLevel(Level.WARN)

    val data_path1 = "sms_spam.txt"
    val lines = sc.textFile(data_path1, 2)
    val tf = new HashingTF(numFeatures = 100000)

    // Build the data set. Split on the first comma only so that commas inside
    // the message are kept, and drop the header and any row without a known label.
    val parsedData = lines
      .map(line => line.split(",", 2))
      .filter(parts => parts.length == 2 && (parts(0) == "ham" || parts(0) == "spam"))
      .map(parts => {
        // Hash the words of the message into a term-frequency vector
        val features = tf.transform(parts(1).split(" ").toSeq)
        // ham -> 0.0, spam -> 1.0
        val label = if (parts(0) == "ham") 0.0 else 1.0
        LabeledPoint(label, features)
      })
    parsedData.cache()

    // Split into training (90%) and test (10%) sets
    val splits = parsedData.randomSplit(Array(0.9, 0.1), seed = 1L)
    val train = splits(0)
    val test = splits(1)

    // Train the model; lambda = 1.0 is the additive (Laplace) smoothing parameter
    val model = NaiveBayes.train(train, lambda = 1.0)

    // Predict on the test data and print each prediction next to its true label
    val predictionAndLabel = test.map(p => {
      val prediction = model.predict(p.features)
      println(s"$prediction\t${p.label}")
      (prediction, p.label)
    })
    // count() is an action; it forces the lazy map above to run
    predictionAndLabel.count()

    sc.stop()
  }
}
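Both programs set the smoothing parameter to 1 (`alpha=1` in scikit-learn, `lambda=1.0` in Spark MLlib). As a rough sketch of what Laplace (add-one) smoothing does, the snippet below estimates P(word | class) from raw counts in plain Python. The toy documents and the helper name `smoothed_word_probs` are invented for illustration; this is not how either library is implemented internally.

```python
from collections import Counter


def smoothed_word_probs(docs, vocabulary, alpha=1.0):
    """Estimate P(word | class) from the documents of one class.

    Every count is increased by alpha, so a word never seen in this
    class still gets a small non-zero probability.
    """
    counts = Counter(word for doc in docs for word in doc.split())
    total = sum(counts[word] for word in vocabulary)
    denom = total + alpha * len(vocabulary)
    return {word: (counts[word] + alpha) / denom for word in vocabulary}


# Hypothetical toy corpus (not taken from sms_spam.txt).
spam_docs = ["free prize call now", "free cash prize"]
ham_docs = ["call me later", "see you later"]
vocabulary = sorted({w for d in spam_docs + ham_docs for w in d.split()})

spam_probs = smoothed_word_probs(spam_docs, vocabulary)
ham_probs = smoothed_word_probs(ham_docs, vocabulary)

# "later" never occurs in the spam documents but keeps a non-zero probability.
print(spam_probs["later"])
print(ham_probs["later"])
```

Without smoothing (`alpha=0`), a single unseen word would drive the whole product of word probabilities for that class to zero, no matter how strongly the other words point to it.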