NLP Experiment - word2vec Used Only for Dimensionality Reduction
Pre-process
Dataset: http://www.sogou.com/labs/res... (Sogou Labs)
jieba (结巴) word segmentation: https://pypi.python.org/pypi/...
import codecs
import jieba

result = codecs.open(result_file, 'w', 'utf-8')
src_file = open("./datasets/" + filename, 'r')
for line in src_file:
    seg_list = jieba.cut(line, cut_all=False)  # precise mode, not full mode
    result.write(' '.join(seg_list) + ' ')

To remove stopwords, you can either read in a stopword dictionary or use jieba.posseg.cut and drop the words whose POS tag is x. Loading a custom dictionary is different: the custom dictionary determines the segmentation result itself, so it must go through jieba's built-in function.
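A minimal sketch of the POS-based option, using jieba's built-in loader for the custom dictionary (the dictionary path and helper name here are hypothetical):

import jieba
import jieba.posseg as pseg

# a custom dictionary changes the segmentation itself, so it must be
# loaded through jieba's built-in function (hypothetical path):
# jieba.load_userdict('./dict/user_dict.txt')

def cut_without_stopwords(line):
    # keep only words whose POS flag is not 'x' (punctuation / non-words)
    return [w.word for w in pseg.cut(line) if w.flag != 'x']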
word2vec tutorial: https://rare-technologies.com...
import os
import re

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for root, dirs, files in os.walk(self.dirname):
            for filename in files:
                file_path = root + '/' + filename
                if os.path.splitext(file_path)[-1] != '.txt':
                    continue
                src_file = open(file_path, 'r')
                for line in src_file:
                    if len(line) <= 1:
                        continue
                    # if the text came from html, cut the tags
                    line = re.sub(re.compile('<.*?>'), ' ', line)
                    yield line.split()  # Word2Vec expects a list of tokens per sentence

If you skip the extension check, you can hit files that fail to decode as UTF-8, such as .DS_Store on macOS.
import gensim

sentences = MySentences(data_path)
# size is the embedding dimensionality
model = gensim.models.Word2Vec(sentences, size=5, min_count=0)
model.save('./model/word2vec')
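To sanity-check the saved model, it can be reloaded and queried; a small sketch (the query word is only an example and must appear in the corpus):

import gensim

model = gensim.models.Word2Vec.load('./model/word2vec')
print(model['中国'])  # a 5-dimensional vector, per size=5 above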
Training
Weight the word2vec vectors of each article's words to get one vector per article; the articles together form a matrix, which is classified with an SVM.
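A sketch of that pipeline, assuming a plain (unweighted) average as the per-article weighting, the pre-4.0 gensim API used above, and that docs (token lists) and labels are already available:

import numpy as np
from sklearn.svm import SVC

def article_vector(model, tokens):
    # average the word2vec vectors of the in-vocabulary tokens
    vecs = [model[w] for w in tokens if w in model.wv.vocab]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.vstack([article_vector(model, doc) for doc in docs])  # one row per article
clf = SVC(kernel='linear')  # kernel choice is an assumption
clf.fit(X, labels)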
Comparison
Found a concise, substantive survey-like article that can be used directly as a reference: https://zhuanlan.zhihu.com/p/...
Pitfalls
When testing with a small text (e.g. Word2Vec('f.txt', min_count=5)) without lowering min_count, you get RuntimeError: you must first build vocabulary before training the model, because the count threshold prunes every word and leaves the vocabulary empty.
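A minimal reproduction under the same pre-4.0 gensim API (the toy corpus is made up):

import gensim

tiny = [['hello', 'world'], ['hello', 'nlp']]  # every word occurs fewer than 5 times

# gensim.models.Word2Vec(tiny)  # default min_count=5 empties the vocab -> RuntimeError
model = gensim.models.Word2Vec(tiny, size=5, min_count=1)  # works: vocabulary is non-empty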
Operations like model.save('./model/word2vec') need the target directory to exist already; it is best to create all output directories before training.
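For example, a sketch that creates the output directory up front (exist_ok needs Python 3.2+; on Python 2, guard with os.path.exists instead):

import os

os.makedirs('./model', exist_ok=True)  # create the directory before model.save()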