Machine Learning NLP: Count Vectorization with scikit-learn

This is a demonstration of how to use scikit-learn to perform count vectorization on real text data.


An Overview of Count Vectorization


Today, we're going to look at one of the most basic ways we can numerically represent text data: one-hot encoding, also known as count vectorization. The idea is very simple.

We create a vector whose dimensionality equals the size of the vocabulary. If the text contains a given vocabulary word, we place a 1 in the dimension corresponding to that word, and every time we encounter that word again we increment the count. Words we never find stay at 0.
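To make the idea concrete, here is a minimal sketch of that counting process in plain Python (the whitespace tokenization is deliberately naive; scikit-learn handles this far more robustly):

# A minimal sketch of count vectorization in plain Python.
# The whitespace tokenization here is deliberately naive.
text = "the cat sat on the mat"
tokens = text.split()

# Build a vocabulary: one dimension per unique word.
vocab = sorted(set(tokens))

# Start with a zero vector and increment the count for each occurrence.
vector = [0] * len(vocab)
for token in tokens:
    vector[vocab.index(token)] += 1

print(vocab)   # ['cat', 'mat', 'on', 'sat', 'the']
print(vector)  # [1, 1, 1, 1, 2]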

On real text data these vectors get very large, but we get an accurate picture of which words the text contains. Unfortunately, this provides no semantic or relational information, but that's fine, because capturing such information is not the point of this technique.
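One way to see the lack of semantic information: any two distinct words get orthogonal one-hot vectors, so 'good' is exactly as unrelated to 'great' as it is to 'terrible'. A tiny illustrative sketch:

# One-hot vectors for distinct words are orthogonal, so their dot
# product (and any similarity measure based on it) is always zero.
vocab = ['good', 'great', 'terrible']

def one_hot(word):
    return [1 if w == word else 0 for w in vocab]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

print(dot(one_hot('good'), one_hot('great')))     # 0
print(dot(one_hot('good'), one_hot('terrible')))  # 0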

Today, we'll be using the scikit-learn package.

A Basic Example

Here is a basic Python example of using count vectorization to get vectors:

from sklearn.feature_extraction.text import CountVectorizer

# To create a Count Vectorizer, we simply need to instantiate one.
# There are special parameters we can set here when making the vectorizer, but
# for the most basic example, it is not needed.
vectorizer = CountVectorizer()

# For our text, we are going to take some text from our previous blog post
# about count vectorization
sample_text = ["One of the most basic ways we can numerically represent words "
               "is through the one-hot encoding method (also sometimes called "
               "count vectorizing)."]

# To actually create the vectorizer, we simply need to call fit on the text
# data that we wish to fit
vectorizer.fit(sample_text)

# Now, we can inspect how our vectorizer vectorized the text
# This will print out a list of words used, and their index in the vectors
print('Vocabulary: ')
print(vectorizer.vocabulary_)

# If we would like to actually create a vector, we can do so by passing the
# text into the vectorizer to get back counts
vector = vectorizer.transform(sample_text)

# Our final vector:
print('Full vector: ')
print(vector.toarray())

# Or if we wanted to get the vector for one word:
print('Hot vector: ')
print(vectorizer.transform(['hot']).toarray())

# Or if we wanted to get multiple vectors at once to build matrices
print('Hot and one: ')
print(vectorizer.transform(['hot', 'one']).toarray())

# We could also do the whole thing at once with the fit_transform method:
print('One swoop:')
new_text = ['Today is the day that I do the thing today, today']
new_vectorizer = CountVectorizer()
print(new_vectorizer.fit_transform(new_text).toarray())

Output:

Vocabulary:
{'one': 12, 'of': 11, 'the': 15, 'most': 9, 'basic': 1, 'ways': 18, 'we': 19,
 'can': 3, 'numerically': 10, 'represent': 13, 'words': 20, 'is': 7,
 'through': 16, 'hot': 6, 'encoding': 5, 'method': 8, 'also': 0,
 'sometimes': 14, 'called': 2, 'count': 4, 'vectorizing': 17}
Full vector:
[[1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 1 1 1]]
Hot vector:
[[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
Hot and one:
[[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]]
One swoop:
[[1 1 1 1 2 1 3]]
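As the comments above hint, CountVectorizer accepts optional parameters that change what ends up in the vocabulary. Here is a brief sketch of two commonly used ones, stop_words and ngram_range (the exact vocabulary printed will depend on your scikit-learn version's built-in stop word list):

from sklearn.feature_extraction.text import CountVectorizer

sample_text = ["One of the most basic ways we can numerically represent words "
               "is through the one-hot encoding method (also sometimes called "
               "count vectorizing)."]

# stop_words='english' drops common filler words such as 'the' and 'of';
# ngram_range=(1, 2) adds two-word phrases as extra vocabulary entries.
vectorizer = CountVectorizer(stop_words='english', ngram_range=(1, 2))
vectorizer.fit(sample_text)
print(vectorizer.vocabulary_)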

Using It on Real Data

So let's use it on some real data! We'll look at the 20 Newsgroups dataset that ships with scikit-learn.

The Python code is as follows:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# Create our vectorizer
vectorizer = CountVectorizer()

# Let's fetch all the possible text data
newsgroups_data = fetch_20newsgroups()

# Why not inspect a sample of the text data?
print('Sample 0: ')
print(newsgroups_data.data[0])
print()

# Create the vectorizer
vectorizer.fit(newsgroups_data.data)

# Let's look at the vocabulary:
print('Vocabulary: ')
print(vectorizer.vocabulary_)
print()

# Converting our first sample into a vector
v0 = vectorizer.transform([newsgroups_data.data[0]]).toarray()[0]
print('Sample 0 (vectorized): ')
print(v0)
print()

# It's too big to even see...
# What's the length?
print('Sample 0 (vectorized) length: ')
print(len(v0))
print()

# How many words does it have?
print('Sample 0 (vectorized) sum: ')
print(np.sum(v0))
print()

# What if we wanted to go back to the source?
print('To the source:')
print(vectorizer.inverse_transform(v0))
print()

# So all this data has a lot of extra garbage... Why not strip it away?
newsgroups_data = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))

# Why not inspect a sample of the text data?
print('Sample 0: ')
print(newsgroups_data.data[0])
print()

# Create the vectorizer
vectorizer.fit(newsgroups_data.data)

# Let's look at the vocabulary:
print('Vocabulary: ')
print(vectorizer.vocabulary_)
print()

# Converting our first sample into a vector
v0 = vectorizer.transform([newsgroups_data.data[0]]).toarray()[0]
print('Sample 0 (vectorized): ')
print(v0)
print()

# It's too big to even see...
# What's the length?
print('Sample 0 (vectorized) length: ')
print(len(v0))
print()

# How many words does it have?
print('Sample 0 (vectorized) sum: ')
print(np.sum(v0))
print()

# What if we wanted to go back to the source?
print('To the source:')
print(vectorizer.inverse_transform(v0))
print()

Output:

Sample 0:
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
---- brought to you by your neighborhood Lerxst ----

Vocabulary:
{'from': 56979, 'lerxst': 75358, 'wam': 123162, 'umd': 118280, 'edu': 50527,
 'where': 124031, 'my': 85354, 'thing': 114688, 'subject': 111322,
 'what': 123984, 'car': 37780, 'is': 68532, 'this': 114731, 'nntp': 87620,
 'posting': 95162, 'host': 64095, 'rac3': 98949, 'organization': 90379,
 'university': 118983, 'of': 89362, 'maryland': 79666,
 'college': 40998, ... } (Abbreviated...)

Sample 0 (vectorized):
[0 0 0 ... 0 0 0]

Sample 0 (vectorized) length:
130107

Sample 0 (vectorized) sum:
122

To the source:
[array(['15', '60s', '70s', 'addition', 'all', 'anyone', 'be', 'body',
       'bricklin', 'brought', 'bumper', 'by', 'called', 'can', 'car',
       'college', 'could', 'day', 'door', 'doors', 'early', 'edu',
       'engine', 'enlighten', 'from', 'front', 'funky', 'have', 'history',
       'host', 'if', 'il', 'in', 'info', 'is', 'it', 'know', 'late',
       'lerxst', 'lines', 'looked', 'looking', 'made', 'mail', 'maryland',
       'me', 'model', 'my', 'name', 'neighborhood', 'nntp', 'of', 'on',
       'or', 'organization', 'other', 'out', 'park', 'please', 'posting',
       'production', 'rac3', 'really', 'rest', 'saw', 'separate', 'small',
       'specs', 'sports', 'subject', 'tellme', 'thanks', 'the', 'there',
       'thing', 'this', 'to', 'umd', 'university', 'wam', 'was', 'were',
       'what', 'whatever', 'where', 'wondering', 'years', 'you', 'your'],
      dtype='<U180')]

Sample 0:
I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Vocabulary:
{'was': 95844, 'wondering': 97181, 'if': 48754, 'anyone': 18915, 'out': 68847,
 'there': 88638, 'could': 30074, 'enlighten': 37335, 'me': 60560, 'on': 68080,
 'this': 88767, 'car': 25775, 'saw': 80623, 'the': 88532, 'other': 68781,
 'day': 31990, 'it': 51326, 'door': 34809, 'sports': 84538, 'looked': 57390,
 'to': 89360, 'be': 21987, 'from': 41715, 'late': 55746, '60s': 9843,
 'early': 35974, '70s': 11174, 'called': 25492, 'bricklin': 24160, 'doors': 34810,
 'were': 96247, 'really': 76471, ... } (Abbreviated...)

Sample 0 (vectorized):
[0 0 0 ... 0 0 0]

Sample 0 (vectorized) length:
101631

Sample 0 (vectorized) sum:
85

To the source:
[array(['60s', '70s', 'addition', 'all', 'anyone', 'be', 'body',
       'bricklin', 'bumper', 'called', 'can', 'car', 'could', 'day',
       'door', 'doors', 'early', 'engine', 'enlighten', 'from', 'front',
       'funky', 'have', 'history', 'if', 'in', 'info', 'is', 'it', 'know',
       'late', 'looked', 'looking', 'made', 'mail', 'me', 'model', 'name',
       'of', 'on', 'or', 'other', 'out', 'please', 'production', 'really',
       'rest', 'saw', 'separate', 'small', 'specs', 'sports', 'tellme',
       'the', 'there', 'this', 'to', 'was', 'were', 'whatever', 'where',
       'wondering', 'years', 'you'], dtype='<U81')]
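One thing worth noticing in the output above: even with headers, footers, and quotes stripped, the vocabulary still has 101,631 dimensions. If that is too large for your use case, CountVectorizer's min_df and max_df parameters can prune rare and overly common words. A hedged sketch (the resulting vocabulary size will vary with the dataset version):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

newsgroups_data = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))

# min_df=5 drops words that appear in fewer than 5 documents;
# max_df=0.5 drops words that appear in more than half of them.
pruned_vectorizer = CountVectorizer(min_df=5, max_df=0.5)
pruned_vectorizer.fit(newsgroups_data.data)

print(len(pruned_vectorizer.vocabulary_))  # much smaller than 101631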

Next Steps

We now know how to vectorize these things based on word counts, but what can we actually do with this information?

For one thing, we could run some analyses: we could look at word frequencies, we could remove stop words, and we could try clustering. Now that we have numerical representations of this text data, there is a lot we can do that we couldn't do before!
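For instance, here is a rough sketch of the clustering idea using scikit-learn's KMeans on our count vectors (the parameter choices here, such as 20 clusters to mirror the 20 categories, are illustrative rather than tuned):

from sklearn.cluster import KMeans
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

newsgroups_data = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))

# Vectorize with English stop words removed and rare words pruned.
vectorizer = CountVectorizer(stop_words='english', min_df=5)
vectors = vectorizer.fit_transform(newsgroups_data.data)

# 20 clusters to mirror the 20 newsgroup categories (an arbitrary choice).
kmeans = KMeans(n_clusters=20, random_state=0)
labels = kmeans.fit_predict(vectors)

print(labels[:10])  # cluster assignments for the first ten posts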

Let's get more concrete. We've been working with text data from the 20 Newsgroups dataset.

The 20 Newsgroups dataset is split into 20 different categories. Why not use our vectorization to try to classify this data?

The Python code is as follows:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Create our vectorizer
vectorizer = CountVectorizer()

# All data
newsgroups_train = fetch_20newsgroups(subset='train',
                                      remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test',
                                     remove=('headers', 'footers', 'quotes'))

# Get the training vectors
vectors = vectorizer.fit_transform(newsgroups_train.data)

# Build the classifier
clf = MultinomialNB(alpha=.01)

# Train the classifier
clf.fit(vectors, newsgroups_train.target)

# Get the test vectors
vectors_test = vectorizer.transform(newsgroups_test.data)

# Predict and score the vectors
pred = clf.predict(vectors_test)
acc_score = metrics.accuracy_score(newsgroups_test.target, pred)
f1_score = metrics.f1_score(newsgroups_test.target, pred, average='macro')

print('Total accuracy classification score: {}'.format(acc_score))
print('Total F1 classification score: {}'.format(f1_score))

Output:

Total accuracy classification score: 0.6460435475305364
Total F1 classification score: 0.6203806145034193
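As a usage note, once the script above has run, classifying a brand-new document is just a transform followed by a predict. A small continuation sketch that reuses vectorizer, clf, and newsgroups_train from above (the sample sentence is made up):

# Continuation of the classification script above: vectorizer, clf, and
# newsgroups_train are assumed to already exist. The sample text is made up.
new_post = ["The engine in my old sports car needs new spark plugs."]

new_vector = vectorizer.transform(new_post)
predicted = clf.predict(new_vector)

# Map the numeric label back to a human-readable category name.
print(newsgroups_train.target_names[predicted[0]])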
