大数据的数据清洗利器是什么呢？

你情我愿

2018-05-29

在这篇文章中，我们将引见一种新的关头字搜刮和互换的算法：Flashtext 算法。Flashtext 算法是一个高效的字符搜刮和互换算法。该算法的工夫复杂度不依托于搜刮或互换的字符的数量。比如，关于一个文档有 N 个字符，和一个有 M 个词的关头词库，那么工夫复杂度就是 O(N) 。这个算法比我们通俗的正则婚配法快很多，因为正则婚配的工夫复杂度是 O(M * N)。这个算法和 Aho Corasick 算法也有一点不合，因为它不婚配子字符串。

Flashtext 算法被设计为只婚配完全的单词。比如，我们输入一个单词 {Apple}，那么这个算法就不会去婚配 “I like Pineapple” 中的 apple。这个算法也被设计为起首婚配最长字符串。在举个例子，比如我们有多么一个数据集 {Machine， Learning，Machine Learning}，一个文档 “I like Machine Learning”，那么我们的算法只会去婚配 “Machine Learning” ，因为这是最长婚配。

这个算法我们曾在 Github 下面完成了，用的是 Python 措辞。

1. 引见

在信息检索范围，关头字搜刮和替代都是很罕有的结果。我们常常想要在一个特定的文本中搜刮特定的关头词，或许在文本中替代一个特定的关头词。

例如：

关头字搜刮：假定我们有一个软件工程师的简历 (D)，我们具有一个 20k 的编程身手词库 corpus = {Java, python, javascript, machien learning, ...}。我们想要从这个词库 corpus 中哪些词在这个简历中呈现了，即 corpus ∩ D。
关头字互换：另外一个常常利用的用例是当我们有一个同义词的语料库（不合的拼写暗示的是同一个单词），比如 corpus = {javascript: [‘javascript’, ‘javascripting’, ‘java script’], ...} ，在软件工程师的简历中我们需要应用规范化的词，一切我们需要用一个规范化的词来互换不合简历中的同义词。

为了去措置上述这些结果，正则表达式是最常常利用的一个技能。当然正则表达式可以很好的措置这个结果，然则当我们的数据量增大年夜大年夜时，这个速度就会异常慢了。假定我们的文档达到百万级别时，那么这个运转工夫将达到几天的量级。比以下面的图1，正则表达式在一个 10k 的词库中查找 15k 个关头词的工夫差不多是 0.165 秒。然则关于 Flashtext 而言只需要 0.002 秒。是以，在这个结果上 Flashtext 的速度大年夜大年夜约比正则表达式快 82 倍。

跟着我们需要措置的字符愈来愈多，正则表达式的措置速度的确都是线性增加的。可是，Flashtext 的确是一个常量。在本文中，我们将侧重评论辩论正则表达式与 Flashtext 之间的机能差别。我们还将具体的刻画 Flashtext 算法及其任务事理，和一些基准测试。

1.1 用于关头字搜刮的正则表达式

正则表达式是一种异常活络和有效的情势婚配编制。比如我们在文本中搜刮一个婚配 “\d{4}”，它暗示任何 4 位数字婚配，如 2017。我们应用 Python 代码可以完成多么一个功用，以下：

import re
compiled_regex = re.compile(r'\b2017\b|\b\d{4}\b')
compiled_regex.findall('In 2017 2311 is my birthday.')
# output
['2017', '2311']

这里 ‘\b’ 用来暗示单词边界，它会去婚配出格字符，如 'space'，'period'，'new line' 等。

1.2 用于关头字互换的正则表达式

我们也能够应用正则表达式来制造一个规范化术语的互换脚本，比如我们可以编写一个 Python 脚本来用 “javascript” 互换 “java script”。以下：

import re
re.sub(r"\bjava script\b", 'javascript', 'java script is awesome.')
# output
javascript is awesome.

2. Flashtext

Flashtext 是一种基于 Trie 字典数据布局和 Aho Corasick 的算法。它的任务编制是，起首它将一切相干的关头字作为输入。应用这些关头字建立一个 trie 字典，以下图3所示：

大数据的数据清洗利器是什么呢？

start 和 eot 是两个出格的字符，用来定义词的边界，这和我们下面提到的正则表达式是一样的。这个 trie 字典就是我们前面要用来搜刮和互换的数据布局。

2.1 应用 Flashtext 遏制搜刮

关于输入字符串（文档），我们对字符遏制一一遍历。当我们在文档中的字符序列 <\b>word<\b> 婚配到字典中的 <start>word<eot> 时（start 和 eot 辨别是字符序列的末尾标签和中断标签），我们觉得这是一个完全婚配了。我们将婚配到的字符序列所对应的规范关头字遏制输入，具体以下：

2.2 应用 Flashtext 遏制互换

关于输入字符串（文档），我们对字符遏制一一遍历它。我们先创建一个空的字符串，当我们字符序列中的 <\b>word<\b> 没法在 Trie 字典中找到婚配时，那么我们就简单的原始字符复制到前去字符串中。然则，当我们可以从 Trie 字典中找到婚配时，那么我们将将婚配到的字符的规范字符复制到前去字符串中。是以，前去字符串是输入字符串的一个正本，独一的不合是互换了婚配到的字符序列，具体以下：

2.3 Flashtext 算法

Flashtext 算法那首要分为三局部，我们接上去将对每局部遏制伶仃分解：

构建 Trie 字典；
关头字搜刮；
关头字互换；

2.3.1 构建 Trie 字典

为了构建 trie 字典，我们必须设立一个空的节点指向我们的空字典。这个节点被用作一切单词的终点。我们在字典中拔出一个单词。这个单词中的下一个字符在本字典中作为关头字，并且这个指针需要再次指向一个空字典。这个过程赓续反复，直到我们达到单词中的最后一个字符。当我们达到单词的末尾时，我们拔出一个出格的字符（eot）来暗示词尾。

输入

关头词 w = c1c2c3...cn，个中 ci 暗示一个输入字符。规范词我们用 s 来暗示。

代码：用于 Flashtext 的初始化并向字典添加关头字

class FlashText(object):
 def __init__(self, case_sensitive=False):
 self._keyword = '_keyword_' # end of term (eot) and key to store standardized name
 sef._white_space_chars = set(['.', '\t', '\n', '\a', ' ', ','])
 self.keyword_trie_dict = dict()
 sefl.case_sensitive = case_sensitive
 def add_keyword(self, keyword, clean_name = None):
 if not clean_name and keyword:
 clean_name = keyword
 if keyword and clean_name:
 # if both keyword and clean_name are not empty.
 if not self.case_sensitive:
 # if not case_sensitive then lowercase the keyword
 keyword = keyword.lower()
 current_dict = self.keyword_trie_dict
 for letter in keyword:
 current_dict = current_dict.setdefault(letter, {})
 current_dict[self._keyword] = clean_name

输入

上述法度典型将会创建一个字典，如图3所示。

2.3.2 关头字搜刮

一旦我们将一切的词都添加到 trie 字典中，我们便可以在输入字符串中找到关头字。

输入

字符串 x = a1a2...an，个中 ai 是输入字符串 x 中的第 i 个字符。

代码：Python 代码用来获得字典中的输入字符串中的关头字。

def extract_keywords(self, sentence):
 keywords_extracted = []
 if not self.case_sensitive:
 # if not case_sensitive then lowercase the sentence
 sentence = sentence.lower()
 current_dict = self.keyword_trie_dict
 sequence_and_pos = 0
 idx = 0
 sentence_len = len(sentence)
 while idx &lt; sentence_len:
 char = sentence[idx]
 # when we reach a character that might denote word end
 if char not in self.non_word_boundaries:
 # if eot is present in current_dict
 if self._keyword in current_dict or char in current_dict:
 # update longest sequence found
 sequence_found = None
 longest_sequence_found = None
 is_longer_seq_found = False
 if self._keyword in current_dict:
 sequence_found = current_dict[self._keyword]
 longest_sequence_found = current_dict[self._keyword]
 sequence_end_pos = idx
 # re look for longest_sequence from this position
 if char in current_dict:
 current_dict_continued = current_dict[char]
 idy = idx + 1
 while idy &lt; sentence_len:
 inner_char = sentence[idy]
 if inner_char not in self.non_word_boundaries and self._keyword in current_dict_continued:
 # update longest sequence found
 longest_sequence_found = current_dict_continued[self._keyword]
 sequence_end_ops = idy
 is_longer_seq_found = True
 if inner_char in current_dict_continued:
 current_dict_continued = current_dict_continued[inner_char]
 else:
 break
 idy += 1
 else:
 # end of sentence reached
 if self._keyword in current_dict_continued:
 # update longest sequence found
 longest_sequence_found = current_dict_continued[self._keyword]
 sequence_end_pos = idy
 is_longer_seq_found = True
 if is_longer_seq_found:
 idx = sequence_end_pos
 current_dict = self.keyword_trie_dict
 if longest_sequence_found:
 keywords_extracted.append(longest_sequence_found)
 else:
 # we reset current_dict
 current_dict = self.keyword_trie_dict
 elif char in current_dict:
 # char is present in current dictionary position
 current_dict = current_dict[char]
 else:
 # we reset current_dict
 current_dict = self.keyword_trie_dict
 # skip to end of word
 idy = idx + 1
 while idy &lt; sentence_len:
 char = sentence[idy]
 if char not in self.non_word_boundaries:
 break
 idy += 1
 idx = idy
 # if we are end of sentence and have a sequence discovered
 if idx + 1 &gt;= sentence_len:
 if self._keyword in current_dict:
 sequence_found = current_dict[self._keyword]
 keywords_extracted.append(sequence_found)
 idx += 1
 return keywords_extracted

输入

前去一个列表，输入字符串 x 中找到的一切规范化以后的字，如图 4 所示。

2.3.3 关头字互换

我们应用不异的字典用规范化的字来互换输入字符串中的关头字。

输入

输入字符串 x = a1a2...an，个中 ai 暗示第 i 个字符。

代码：Python 代码用于从输入字符串顶用规范词互换。

def replace_keywords(self, sentence):
 new_sentence = ''
 orig_sentence = sentence
 if not self.case_sensitive:
 sentence = sentence.lower()
 current_word = ''
 current_dict = self.keyword_trie_dict
 current_white_space = ''
 sequence_end_pos = 0
 idx = 0
 sentence_len = len(sentence)
 while idx &lt; sentence_len:
 char = sentence[idx]
 current_word += orig_sentence[idx]
 # when we reach whitespace
 if char not in self.non_word_boundaries:
 current_white_space = char
 # if end is present in current_dict
 if self._keyword in current_dict or char in current_dict:
 # update longest sequence found
 sequence_found = None
 longest_sequence_found = None
 is_longer_seq_found = False
 if self._keyword in current_dict:
 sequence_found = curretn_dcit[self._keyword]
 longest_sequence_found = current_dict[self._keyword]
 sequence_end_pos = idx
 # re look for longest_sequence from this position
 if char in current_dict:
 current_dict_continued = current_dict[char]
 current_word_continued = current_word
 idy = idx + 1
 while idy &lt; sentence_len:
 inner_char = sentence[idy]
 current_word_continued += orig_sentence[idy]
 if inner_char not in self.non_word_boundaries and self._keyword in current_dict_continuted:
 # update longest sequence found
 current_white_space = inner_char
 longest_sequence_found = current_dict_continued[self._keyword]
 sequence_end_pos = idy
 is_longer_seq_found = True
 if inner_char in current_dict_continued:
 current_dict_continued = curretn_dict_continued[inner_char]
 else:
 break
 idy += 1
 else:
 # end of sentence reached.
 if self._keyword in current_dict_continued:
 # update longest sequence found
 current_white_space = ''
 longest_sequence_found = current_dict_continued[self._keyword]
 sequence_end_pos = idy
 is_longer_seq_found = True
 if is_longer_seq_found:
 idx = sequence_end_pos
 current_word = current_word_continued
 current_dict = self.keyword_trie_dict
 if longest_sequence_found:
 new_sentence += longest_sequence_found + curretn_white_space
 current_word = ''
 current_white_space = ''
 else:
 new_sentence += current_word
 current_word = ''
 current_white_space = ''
 else:
 # we reset current_dict
 current_dict = self.keyword_trie_dict
 new_sentence += current_word
 current_word = ''
 current_white_space = ''
 elif char in current_dict:
 # we can continue from this char
 current_dict = current_dict[char]
 else:
 # we reset current_dict
 current_dict = self.keyword_trie_dict
 # skip to end of word
 idy = idx + 1
 while idy &lt; sentence_len:
 char = sentence[idy]
 current_word += orig_sentence[idy]
 if char not in self.non_word_boundaries:
 break
 idy += 1
 idx = idy
 new_sentence += current_word
 current_word = ''
 current_white_space = ''
 # if we are end of sentence and have a sequence disv=convered
 if idx + 1 &gt;= sentence_len:
 if self._keyword in current_dict:
 sequence_found = current_dict[self._keyword]
 new_sentence += sequence_found
 idx += 1
 return new_sentence

输入

在字符串 x 中找到需要互换的词，然后用规范词遏制互换输入，如图 5 所示。

3. Flashtext 和正则表达式的基准测试

正如在图 1 和图 2 中所揭示的，Flashtext 比正则表达式的措置速度要快很多。此刻我们来做一些基准测试来加倍诠释这个结果。

3.1 关头字搜刮

我们应用 Python 代码来完成这个关头字搜刮的基准测试。起首，我们会随机创建一个 10K 的语料库。然后，我们将从单词列表当选择 1K 的词用来创建一个新的文档。

我们将从语料库当选择 k 个词，个中 k ∈ {0, 1000, 2000, .. , 20000}。我们将应用正则表达式和 Flashtext 辨别对这个文档中的关头词遏制搜刮，然后对比分解。具体 Python 代码以下：

from flashtext.keyword import KeywordProcessor
import random
import re
import string
import time
def get_word_of_length(str_length):
 # generate a random word of given length
 return ''.join(random.choice(string.ascii_lowercase) for _ in range(str_length))
# generate a list of 100K words of randomly chosen size
all_words = [get_word_of_length(random.choice([3, 4, 5, 6, 7, 8])) for i in range(100000)]
print('Count | FlashText | Regex ')
print('------------------------------')
for keywords_length in [0, 1000, 5000, 10000, 15000]:
 # chose 1000 terms and create a string to search in.
 all_words_chosen = random.sample(all_words, 1000)
 story = ' '.join(all_words_chosen)
 # get unique keywrods from the list of words generated.
 unique_keywords_sublist = list(set(random.sample(all_words, keywords_length)))
 # compile Regex
 compiled_re = re.compile('|'.join([r'\b' + keyword + r'\b' for keyword in unique_keywords_sublist]))
 # add keywords to Flashtext
 keyword_processor = KeywordProcessor()
 keyword_processor.add_keywords_from_list(unique_keywords_sublist)
 # time the modules
 start = time.time()
 _ = keyword_processor.extract_keywords(story)
 mid = time.time()
 _ = compiled_re.findall(story)
 end = time.time()
 # print output
 print(str(keywords_length).ljust(6), '|',
 "{0:.5f}".format(mid - start).ljust(9), '|',
 "{0:.5f}".format(end - mid).ljust(9), '|')
 # output: Data for Figure 1

3.2 关头词互换

下面的代码是用来做关头词互换的 Python 代码。

from flashtext.keyword import KeywordProcessor
import random
import string
import re
import time
def get_word_of_length(str_length):
 # generate a random word of given length
 return ''.join(random.choice(string.ascii_lowercase) for _ in range(str_length))
# generate a list of 100K words of randomly chosen size
all_words = [get_word_of_length(random.choice([3, 4, 5, 6, 7, 8])) for i in range(100000)]
print('Count | FlashText | Regex ')
print('-------------------------------')
for keywords_length in range(1, 20002, 1000):
 # chose 5000 terms and create a string to search in.
 all_words_chosen = random.sample(all_words, 5000)
 story = ' '.join(all_words_chosen)
 # get unique keywords from the list of words generated.
 unique_keywords_sublist = list(set(random.sample(all_words, keywords_length)))
 # compile regex
 # source: https://stackoverflow.com/questions/6116978/python-replace-multiple-strings
 rep = dict([(key, '_keyword_') for key in unique_keywords_sublist])
 compiled_re = re.compile("|".join(rep.keys()))
 # add keywords to flashtext
 keyword_processor = KeywordProcessor()
 for keyword in unique_keywords_sublist:
 keyword_processor.add_keyword(keyword, '_keyword_')
 # time the modules
 start = time.time()
 _ = keyword_processor.replace_keywords(story)
 mid = time.time()
 _ = compiled_re.sub(lambda m: rep[re.escape(m.group(0))], story)
 end = time.time()
 # print output
 print(str(keywords_length).ljust(6), '|',
 "{0:.5f}".format(mid - start).ljust(9), '|',
 "{0:.5f}".format(end - mid).ljust(9), '|',)
# output: Data for Figure 2

3.3 结论

正如我们鄙人面看到的对比结果，Flashtext 在关头字搜刮和互换下面比正则表达式都快很多。出格是在措置大年夜大年夜范围数据的时辰，这个优势会加倍的显示被表示出来。

4. Flashtext 应用文档

4.1 装配

pip install flashtext

4.2 应用例子

4.2.1 关头字提取

&gt;&gt;&gt; from flashtext import KeywordProcessor
&gt;&gt;&gt; keyword_processor = KeywordProcessor()
&gt;&gt;&gt; # keyword_processor.add_keyword(&lt;unclean name&gt;, &lt;standardised name&gt;)
&gt;&gt;&gt; keyword_processor.add_keyword('Big Apple', 'New York')
&gt;&gt;&gt; keyword_processor.add_keyword('Bay Area')
&gt;&gt;&gt; keywords_found = keyword_processor.extract_keywords('I love Big Apple and Bay Area.')
&gt;&gt;&gt; keywords_found
&gt;&gt;&gt; # ['New York', 'Bay Area']

4.2.2 关头字互换

&gt;&gt;&gt; keyword_processor.add_keyword('New Delhi', 'NCR region')
&gt;&gt;&gt; new_sentence = keyword_processor.replace_keywords('I love Big Apple and new delhi.')
&gt;&gt;&gt; new_sentence
&gt;&gt;&gt; # 'I love New York and NCR region.'

4.2.3 辨别大年夜大年夜小写字母

&gt;&gt;&gt; from flashtext import KeywordProcessor
&gt;&gt;&gt; keyword_processor = KeywordProcessor(case_sensitive=True)
&gt;&gt;&gt; keyword_processor.add_keyword('Big Apple', 'New York')
&gt;&gt;&gt; keyword_processor.add_keyword('Bay Area')
&gt;&gt;&gt; keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.')
&gt;&gt;&gt; keywords_found
&gt;&gt;&gt; # ['Bay Area']

4.2.4 关头字不清楚

&gt;&gt;&gt; from flashtext import KeywordProcessor
&gt;&gt;&gt; keyword_processor = KeywordProcessor()
&gt;&gt;&gt; keyword_processor.add_keyword('Big Apple')
&gt;&gt;&gt; keyword_processor.add_keyword('Bay Area')
&gt;&gt;&gt; keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.')
&gt;&gt;&gt; keywords_found
&gt;&gt;&gt; # ['Big Apple', 'Bay Area']

4.2.5 同时添加多个关头词

&gt;&gt;&gt; from flashtext import KeywordProcessor
&gt;&gt;&gt; keyword_processor = KeywordProcessor()
&gt;&gt;&gt; keyword_dict = {
&gt;&gt;&gt; "java": ["java_2e", "java programing"],
&gt;&gt;&gt; "product management": ["PM", "product manager"]
&gt;&gt;&gt; }
&gt;&gt;&gt; # {'clean_name': ['list of unclean names']}
&gt;&gt;&gt; keyword_processor.add_keywords_from_dict(keyword_dict)
&gt;&gt;&gt; # Or add keywords from a list:
&gt;&gt;&gt; keyword_processor.add_keywords_from_list(["java", "python"])
&gt;&gt;&gt; keyword_processor.extract_keywords('I am a product manager for a java_2e platform')
&gt;&gt;&gt; # output ['product management', 'java']

4.2.6 删除关头字

&gt;&gt;&gt; from flashtext import KeywordProcessor
&gt;&gt;&gt; keyword_processor = KeywordProcessor()
&gt;&gt;&gt; keyword_dict = {
&gt;&gt;&gt; "java": ["java_2e", "java programing"],
&gt;&gt;&gt; "product management": ["PM", "product manager"]
&gt;&gt;&gt; }
&gt;&gt;&gt; keyword_processor.add_keywords_from_dict(keyword_dict)
&gt;&gt;&gt; print(keyword_processor.extract_keywords('I am a product manager for a java_2e platform'))
&gt;&gt;&gt; # output ['product management', 'java']
&gt;&gt;&gt; keyword_processor.remove_keyword('java_2e')
&gt;&gt;&gt; # you can also remove keywords from a list/ dictionary
&gt;&gt;&gt; keyword_processor.remove_keywords_from_dict({"product management": ["PM"]})
&gt;&gt;&gt; keyword_processor.remove_keywords_from_list(["java programing"])
&gt;&gt;&gt; keyword_processor.extract_keywords('I am a product manager for a java_2e platform')
&gt;&gt;&gt; # output ['product management']

有时辰我们会将一些出格符号作为字符边界，比如空格，\ 等等。为了从头设定字边界，我们需要添加一些符号通知算法，这是单词字符的一局部。

&gt;&gt;&gt; from flashtext import KeywordProcessor
&gt;&gt;&gt; keyword_processor = KeywordProcessor()
&gt;&gt;&gt; keyword_processor.add_keyword('Big Apple')
&gt;&gt;&gt; print(keyword_processor.extract_keywords('I love Big Apple/Bay Area.'))
&gt;&gt;&gt; # ['Big Apple']
&gt;&gt;&gt; keyword_processor.add_non_word_boundary('/')
&gt;&gt;&gt; print(keyword_processor.extract_keywords('I love Big Apple/Bay Area.'))
&gt;&gt;&gt; # []

4.3 API 文档

具体的 API 文档，你可以点击这里遏制查抄。

参考：

Stack overflow

Flashtext: github

Flashtext: paper

Flashtext: medium

Flashtext: API doc

正则表达式数据清洗 javascript 算法

安科网

大数据的数据清洗利器是什么呢？

你情我愿

你情我愿

相关推荐

利用正则表达式对网络日志数据进行数据清洗

shell模糊匹配与正则详解

正则表达式中两个反斜杠的匹配规则详解

正则表达式解决input框固定输入值得格式(金额,特殊字符)

浅析golang 正则表达式

Oracle数据库正则表达式使用场景代码实例

Shell—正则表达式（grep命令、sed工具）

【教程】图文解读正则表达式的使用技巧

如何掌握正则表达式这一开发利器，看这篇就够了

基于xpath选择器、PyQuery、正则表达式的格式清理工具详解

3个助你玩转正则表达式的利器

如何使用Grep命令查找多个字符串

C# 正则表达式

正则表达式常用通配符

正则表达式在NLP中应用

正则表达式匹配样例

正则表达式常用的字符类

用正则表达式验证表格的格式

SHELL正则表达式

02-re模块使用

你情我愿