大数据的数据清洗利器是什么呢?

在这篇文章中,我们将引见一种新的关头字搜刮和互换的算法:Flashtext 算法。Flashtext 算法是一个高效的字符搜刮和互换算法。该算法的工夫复杂度不依托于搜刮或互换的字符的数量。比如,关于一个文档有 N 个字符,和一个有 M 个词的关头词库,那么工夫复杂度就是 O(N) 。这个算法比我们通俗的正则婚配法快很多,因为正则婚配的工夫复杂度是 O(M * N)。这个算法和 Aho Corasick 算法也有一点不合,因为它不婚配子字符串。

Flashtext 算法被设计为只婚配完全的单词。比如,我们输入一个单词 {Apple},那么这个算法就不会去婚配 “I like Pineapple” 中的 apple。这个算法也被设计为起首婚配最长字符串。在举个例子,比如我们有多么一个数据集 {Machine, Learning,Machine Learning},一个文档 “I like Machine Learning”,那么我们的算法只会去婚配 “Machine Learning” ,因为这是最长婚配。

这个算法我们曾在 Github 下面完成了,用的是 Python 措辞。

1. 引见

在信息检索范围,关头字搜刮和替代都是很罕有的结果。我们常常想要在一个特定的文本中搜刮特定的关头词,或许在文本中替代一个特定的关头词。

例如:

  1. 关头字搜刮:假定我们有一个软件工程师的简历 (D),我们具有一个 20k 的编程身手词库 corpus = {Java, python, javascript, machien learning, ...}。我们想要从这个词库 corpus 中哪些词在这个简历中呈现了,即 corpus ∩ D。

  2. 关头字互换:另外一个常常利用的用例是当我们有一个同义词的语料库(不合的拼写暗示的是同一个单词),比如 corpus = {javascript: [‘javascript’, ‘javascripting’, ‘java script’], ...} ,在软件工程师的简历中我们需要应用规范化的词,一切我们需要用一个规范化的词来互换不合简历中的同义词。

为了去措置上述这些结果,正则表达式是最常常利用的一个技能。当然正则表达式可以很好的措置这个结果,然则当我们的数据量增大年夜大年夜时,这个速度就会异常慢了。假定我们的文档达到百万级别时,那么这个运转工夫将达到几天的量级。比以下面的图1,正则表达式在一个 10k 的词库中查找 15k 个关头词的工夫差不多是 0.165 秒。然则关于 Flashtext 而言只需要 0.002 秒。是以,在这个结果上 Flashtext 的速度大年夜大年夜约比正则表达式快 82 倍。

跟着我们需要措置的字符愈来愈多,正则表达式的措置速度的确都是线性增加的。可是,Flashtext 的确是一个常量。在本文中,我们将侧重评论辩论正则表达式与 Flashtext 之间的机能差别。我们还将具体的刻画 Flashtext 算法及其任务事理,和一些基准测试。

1.1 用于关头字搜刮的正则表达式

正则表达式是一种异常活络和有效的情势婚配编制。比如我们在文本中搜刮一个婚配 “\d{4}”,它暗示任何 4 位数字婚配,如 2017。我们应用 Python 代码可以完成多么一个功用,以下:

import re
compiled_regex = re.compile(r'\b2017\b|\b\d{4}\b')
compiled_regex.findall('In 2017 2311 is my birthday.')
# output
['2017', '2311']

这里 ‘\b’ 用来暗示单词边界,它会去婚配出格字符,如 'space','period','new line' 等。

1.2 用于关头字互换的正则表达式

我们也能够应用正则表达式来制造一个规范化术语的互换脚本,比如我们可以编写一个 Python 脚本来用 “javascript” 互换 “java script”。以下:

import re
re.sub(r"\bjava script\b", 'javascript', 'java script is awesome.')
# output
javascript is awesome.

2. Flashtext

Flashtext 是一种基于 Trie 字典数据布局和 Aho Corasick 的算法。它的任务编制是,起首它将一切相干的关头字作为输入。应用这些关头字建立一个 trie 字典,以下图3所示:

大数据的数据清洗利器是什么呢?

start 和 eot 是两个出格的字符,用来定义词的边界,这和我们下面提到的正则表达式是一样的。这个 trie 字典就是我们前面要用来搜刮和互换的数据布局。

2.1 应用 Flashtext 遏制搜刮

关于输入字符串(文档),我们对字符遏制一一遍历。当我们在文档中的字符序列 <\b>word<\b> 婚配到字典中的 <start>word<eot> 时(start 和 eot 辨别是字符序列的末尾标签和中断标签),我们觉得这是一个完全婚配了。我们将婚配到的字符序列所对应的规范关头字遏制输入,具体以下:

2.2 应用 Flashtext 遏制互换

关于输入字符串(文档),我们对字符遏制一一遍历它。我们先创建一个空的字符串,当我们字符序列中的 <\b>word<\b> 没法在 Trie 字典中找到婚配时,那么我们就简单的原始字符复制到前去字符串中。然则,当我们可以从 Trie 字典中找到婚配时,那么我们将将婚配到的字符的规范字符复制到前去字符串中。是以,前去字符串是输入字符串的一个正本,独一的不合是互换了婚配到的字符序列,具体以下:

2.3 Flashtext 算法

Flashtext 算法那首要分为三局部,我们接上去将对每局部遏制伶仃分解:

  1. 构建 Trie 字典;

  2. 关头字搜刮;

  3. 关头字互换;

2.3.1 构建 Trie 字典

为了构建 trie 字典,我们必须设立一个空的节点指向我们的空字典。这个节点被用作一切单词的终点。我们在字典中拔出一个单词。这个单词中的下一个字符在本字典中作为关头字,并且这个指针需要再次指向一个空字典。这个过程赓续反复,直到我们达到单词中的最后一个字符。当我们达到单词的末尾时,我们拔出一个出格的字符(eot)来暗示词尾。

输入

关头词 w = c1c2c3...cn,个中 ci 暗示一个输入字符。规范词我们用 s 来暗示。

代码:用于 Flashtext 的初始化并向字典添加关头字

class FlashText(object):
 def __init__(self, case_sensitive=False):
 self._keyword = '_keyword_' # end of term (eot) and key to store standardized name
 sef._white_space_chars = set(['.', '\t', '\n', '\a', ' ', ','])
 self.keyword_trie_dict = dict()
 sefl.case_sensitive = case_sensitive
 def add_keyword(self, keyword, clean_name = None):
 if not clean_name and keyword:
 clean_name = keyword
 if keyword and clean_name:
 # if both keyword and clean_name are not empty.
 if not self.case_sensitive:
 # if not case_sensitive then lowercase the keyword
 keyword = keyword.lower()
 current_dict = self.keyword_trie_dict
 for letter in keyword:
 current_dict = current_dict.setdefault(letter, {})
 current_dict[self._keyword] = clean_name

输入

上述法度典型将会创建一个字典,如图3所示。

2.3.2 关头字搜刮

一旦我们将一切的词都添加到 trie 字典中,我们便可以在输入字符串中找到关头字。

输入

字符串 x = a1a2...an,个中 ai 是输入字符串 x 中的第 i 个字符。

代码:Python 代码用来获得字典中的输入字符串中的关头字。

def extract_keywords(self, sentence):
 keywords_extracted = []
 if not self.case_sensitive:
 # if not case_sensitive then lowercase the sentence
 sentence = sentence.lower()
 current_dict = self.keyword_trie_dict
 sequence_and_pos = 0
 idx = 0
 sentence_len = len(sentence)
 while idx < sentence_len:
 char = sentence[idx]
 # when we reach a character that might denote word end
 if char not in self.non_word_boundaries:
 # if eot is present in current_dict
 if self._keyword in current_dict or char in current_dict:
 # update longest sequence found
 sequence_found = None
 longest_sequence_found = None
 is_longer_seq_found = False
 if self._keyword in current_dict:
 sequence_found = current_dict[self._keyword]
 longest_sequence_found = current_dict[self._keyword]
 sequence_end_pos = idx
 # re look for longest_sequence from this position
 if char in current_dict:
 current_dict_continued = current_dict[char]
 idy = idx + 1
 while idy < sentence_len:
 inner_char = sentence[idy]
 if inner_char not in self.non_word_boundaries and self._keyword in current_dict_continued:
 # update longest sequence found
 longest_sequence_found = current_dict_continued[self._keyword]
 sequence_end_ops = idy
 is_longer_seq_found = True
 if inner_char in current_dict_continued:
 current_dict_continued = current_dict_continued[inner_char]
 else:
 break
 idy += 1
 else:
 # end of sentence reached
 if self._keyword in current_dict_continued:
 # update longest sequence found
 longest_sequence_found = current_dict_continued[self._keyword]
 sequence_end_pos = idy
 is_longer_seq_found = True
 if is_longer_seq_found:
 idx = sequence_end_pos
 current_dict = self.keyword_trie_dict
 if longest_sequence_found:
 keywords_extracted.append(longest_sequence_found)
 else:
 # we reset current_dict
 current_dict = self.keyword_trie_dict
 elif char in current_dict:
 # char is present in current dictionary position
 current_dict = current_dict[char]
 else:
 # we reset current_dict
 current_dict = self.keyword_trie_dict
 # skip to end of word
 idy = idx + 1
 while idy < sentence_len:
 char = sentence[idy]
 if char not in self.non_word_boundaries:
 break
 idy += 1
 idx = idy
 # if we are end of sentence and have a sequence discovered
 if idx + 1 >= sentence_len:
 if self._keyword in current_dict:
 sequence_found = current_dict[self._keyword]
 keywords_extracted.append(sequence_found)
 idx += 1
 return keywords_extracted

输入

前去一个列表,输入字符串 x 中找到的一切规范化以后的字,如图 4 所示。

2.3.3 关头字互换

我们应用不异的字典用规范化的字来互换输入字符串中的关头字。

输入

输入字符串 x = a1a2...an,个中 ai 暗示第 i 个字符。

代码:Python 代码用于从输入字符串顶用规范词互换。

def replace_keywords(self, sentence):
 new_sentence = ''
 orig_sentence = sentence
 if not self.case_sensitive:
 sentence = sentence.lower()
 current_word = ''
 current_dict = self.keyword_trie_dict
 current_white_space = ''
 sequence_end_pos = 0
 idx = 0
 sentence_len = len(sentence)
 while idx < sentence_len:
 char = sentence[idx]
 current_word += orig_sentence[idx]
 # when we reach whitespace
 if char not in self.non_word_boundaries:
 current_white_space = char
 # if end is present in current_dict
 if self._keyword in current_dict or char in current_dict:
 # update longest sequence found
 sequence_found = None
 longest_sequence_found = None
 is_longer_seq_found = False
 if self._keyword in current_dict:
 sequence_found = curretn_dcit[self._keyword]
 longest_sequence_found = current_dict[self._keyword]
 sequence_end_pos = idx
 # re look for longest_sequence from this position
 if char in current_dict:
 current_dict_continued = current_dict[char]
 current_word_continued = current_word
 idy = idx + 1
 while idy < sentence_len:
 inner_char = sentence[idy]
 current_word_continued += orig_sentence[idy]
 if inner_char not in self.non_word_boundaries and self._keyword in current_dict_continuted:
 # update longest sequence found
 current_white_space = inner_char
 longest_sequence_found = current_dict_continued[self._keyword]
 sequence_end_pos = idy
 is_longer_seq_found = True
 if inner_char in current_dict_continued:
 current_dict_continued = curretn_dict_continued[inner_char]
 else:
 break
 idy += 1
 else:
 # end of sentence reached.
 if self._keyword in current_dict_continued:
 # update longest sequence found
 current_white_space = ''
 longest_sequence_found = current_dict_continued[self._keyword]
 sequence_end_pos = idy
 is_longer_seq_found = True
 if is_longer_seq_found:
 idx = sequence_end_pos
 current_word = current_word_continued
 current_dict = self.keyword_trie_dict
 if longest_sequence_found:
 new_sentence += longest_sequence_found + curretn_white_space
 current_word = ''
 current_white_space = ''
 else:
 new_sentence += current_word
 current_word = ''
 current_white_space = ''
 else:
 # we reset current_dict
 current_dict = self.keyword_trie_dict
 new_sentence += current_word
 current_word = ''
 current_white_space = ''
 elif char in current_dict:
 # we can continue from this char
 current_dict = current_dict[char]
 else:
 # we reset current_dict
 current_dict = self.keyword_trie_dict
 # skip to end of word
 idy = idx + 1
 while idy < sentence_len:
 char = sentence[idy]
 current_word += orig_sentence[idy]
 if char not in self.non_word_boundaries:
 break
 idy += 1
 idx = idy
 new_sentence += current_word
 current_word = ''
 current_white_space = ''
 # if we are end of sentence and have a sequence disv=convered
 if idx + 1 >= sentence_len:
 if self._keyword in current_dict:
 sequence_found = current_dict[self._keyword]
 new_sentence += sequence_found
 idx += 1
 return new_sentence

输入

在字符串 x 中找到需要互换的词,然后用规范词遏制互换输入,如图 5 所示。

3. Flashtext 和正则表达式的基准测试

正如在图 1 和图 2 中所揭示的,Flashtext 比正则表达式的措置速度要快很多。此刻我们来做一些基准测试来加倍诠释这个结果。

3.1 关头字搜刮

我们应用 Python 代码来完成这个关头字搜刮的基准测试。起首,我们会随机创建一个 10K 的语料库。然后,我们将从单词列表当选择 1K 的词用来创建一个新的文档。

我们将从语料库当选择 k 个词,个中 k ∈ {0, 1000, 2000, .. , 20000}。我们将应用正则表达式和 Flashtext 辨别对这个文档中的关头词遏制搜刮,然后对比分解。具体 Python 代码以下:

from flashtext.keyword import KeywordProcessor
import random
import re
import string
import time
def get_word_of_length(str_length):
 # generate a random word of given length
 return ''.join(random.choice(string.ascii_lowercase) for _ in range(str_length))
# generate a list of 100K words of randomly chosen size
all_words = [get_word_of_length(random.choice([3, 4, 5, 6, 7, 8])) for i in range(100000)]
print('Count | FlashText | Regex ')
print('------------------------------')
for keywords_length in [0, 1000, 5000, 10000, 15000]:
 # chose 1000 terms and create a string to search in.
 all_words_chosen = random.sample(all_words, 1000)
 story = ' '.join(all_words_chosen)
 # get unique keywrods from the list of words generated.
 unique_keywords_sublist = list(set(random.sample(all_words, keywords_length)))
 # compile Regex
 compiled_re = re.compile('|'.join([r'\b' + keyword + r'\b' for keyword in unique_keywords_sublist]))
 # add keywords to Flashtext
 keyword_processor = KeywordProcessor()
 keyword_processor.add_keywords_from_list(unique_keywords_sublist)
 # time the modules
 start = time.time()
 _ = keyword_processor.extract_keywords(story)
 mid = time.time()
 _ = compiled_re.findall(story)
 end = time.time()
 # print output
 print(str(keywords_length).ljust(6), '|',
 "{0:.5f}".format(mid - start).ljust(9), '|',
 "{0:.5f}".format(end - mid).ljust(9), '|')
 # output: Data for Figure 1

3.2 关头词互换

下面的代码是用来做关头词互换的 Python 代码。

from flashtext.keyword import KeywordProcessor
import random
import string
import re
import time
def get_word_of_length(str_length):
 # generate a random word of given length
 return ''.join(random.choice(string.ascii_lowercase) for _ in range(str_length))
# generate a list of 100K words of randomly chosen size
all_words = [get_word_of_length(random.choice([3, 4, 5, 6, 7, 8])) for i in range(100000)]
print('Count | FlashText | Regex ')
print('-------------------------------')
for keywords_length in range(1, 20002, 1000):
 # chose 5000 terms and create a string to search in.
 all_words_chosen = random.sample(all_words, 5000)
 story = ' '.join(all_words_chosen)
 # get unique keywords from the list of words generated.
 unique_keywords_sublist = list(set(random.sample(all_words, keywords_length)))
 # compile regex
 # source: https://stackoverflow.com/questions/6116978/python-replace-multiple-strings
 rep = dict([(key, '_keyword_') for key in unique_keywords_sublist])
 compiled_re = re.compile("|".join(rep.keys()))
 # add keywords to flashtext
 keyword_processor = KeywordProcessor()
 for keyword in unique_keywords_sublist:
 keyword_processor.add_keyword(keyword, '_keyword_')
 # time the modules
 start = time.time()
 _ = keyword_processor.replace_keywords(story)
 mid = time.time()
 _ = compiled_re.sub(lambda m: rep[re.escape(m.group(0))], story)
 end = time.time()
 # print output
 print(str(keywords_length).ljust(6), '|',
 "{0:.5f}".format(mid - start).ljust(9), '|',
 "{0:.5f}".format(end - mid).ljust(9), '|',)
# output: Data for Figure 2

3.3 结论

正如我们鄙人面看到的对比结果,Flashtext 在关头字搜刮和互换下面比正则表达式都快很多。出格是在措置大年夜大年夜范围数据的时辰,这个优势会加倍的显示被表示出来。

4. Flashtext 应用文档

4.1 装配

pip install flashtext

4.2 应用例子

4.2.1 关头字提取

>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> # keyword_processor.add_keyword(<unclean name>, <standardised name>)
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love Big Apple and Bay Area.')
>>> keywords_found
>>> # ['New York', 'Bay Area']

4.2.2 关头字互换

>>> keyword_processor.add_keyword('New Delhi', 'NCR region')
>>> new_sentence = keyword_processor.replace_keywords('I love Big Apple and new delhi.')
>>> new_sentence
>>> # 'I love New York and NCR region.'

4.2.3 辨别大年夜大年夜小写字母

>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor(case_sensitive=True)
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.')
>>> keywords_found
>>> # ['Bay Area']

4.2.4 关头字不清楚

>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('Big Apple')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.')
>>> keywords_found
>>> # ['Big Apple', 'Bay Area']

4.2.5 同时添加多个关头词

>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_dict = {
>>> "java": ["java_2e", "java programing"],
>>> "product management": ["PM", "product manager"]
>>> }
>>> # {'clean_name': ['list of unclean names']}
>>> keyword_processor.add_keywords_from_dict(keyword_dict)
>>> # Or add keywords from a list:
>>> keyword_processor.add_keywords_from_list(["java", "python"])
>>> keyword_processor.extract_keywords('I am a product manager for a java_2e platform')
>>> # output ['product management', 'java']

4.2.6 删除关头字

>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_dict = {
>>> "java": ["java_2e", "java programing"],
>>> "product management": ["PM", "product manager"]
>>> }
>>> keyword_processor.add_keywords_from_dict(keyword_dict)
>>> print(keyword_processor.extract_keywords('I am a product manager for a java_2e platform'))
>>> # output ['product management', 'java']
>>> keyword_processor.remove_keyword('java_2e')
>>> # you can also remove keywords from a list/ dictionary
>>> keyword_processor.remove_keywords_from_dict({"product management": ["PM"]})
>>> keyword_processor.remove_keywords_from_list(["java programing"])
>>> keyword_processor.extract_keywords('I am a product manager for a java_2e platform')
>>> # output ['product management']

有时辰我们会将一些出格符号作为字符边界,比如 空格,\ 等等。为了从头设定字边界,我们需要添加一些符号通知算法,这是单词字符的一局部。

>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('Big Apple')
>>> print(keyword_processor.extract_keywords('I love Big Apple/Bay Area.'))
>>> # ['Big Apple']
>>> keyword_processor.add_non_word_boundary('/')
>>> print(keyword_processor.extract_keywords('I love Big Apple/Bay Area.'))
>>> # []

4.3 API 文档

具体的 API 文档,你可以点击这里遏制查抄。


参考:

Stack overflow

Flashtext: github

Flashtext: paper

Flashtext: medium

Flashtext: API doc

相关推荐