NLP刚入门或还未入门,搜资料时经常碰到的概念就是n-gram,特别是bigram,更加常见。了解它,会省不少事~
维基百科的定义:
n元语法(英语:n-gram)指文本中连续出现的n个语词。n元语法模型是基于(n-1)阶马尔可夫链的一种概率语言模型,通过n个语词出现的概率来推断语句的结构。
当n分别为1、2、3时,又分别称为一元语法(unigram)、二元语法(bigram)与三元语法(trigram)
所以概念本身非常简单,就是把文本连续出现的n个词都找出来。
举例:
文本:我是一个好人
先做分词:我 是 一个 好人
unigram:
我
是
一个
好人
bigram:
我 是
是 一个
一个 好人
trigram:
我 是 一个
是 一个 好人
你可能会问,最后面词语的个数不够n个呢?这样的情况,就需要由你确定是在左边补齐还是在右边补齐了。
nltk的实现挺好的,可以参考它的代码,在此摘录一下
# 此方法用来做补齐
def pad_sequence(sequence, n, pad_left=False, pad_right=False,
left_pad_symbol=None, right_pad_symbol=None):
"""
Returns a padded sequence of items before ngram extraction.
>>> list(pad_sequence([1,2,3,4,5], 2, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>'))
['<s>', 1, 2, 3, 4, 5, '</s>']
>>> list(pad_sequence([1,2,3,4,5], 2, pad_left=True, left_pad_symbol='<s>'))
['<s>', 1, 2, 3, 4, 5]
>>> list(pad_sequence([1,2,3,4,5], 2, pad_right=True, right_pad_symbol='</s>'))
[1, 2, 3, 4, 5, '</s>']
:param sequence: the source data to be padded
:type sequence: sequence or iter
:param n: the degree of the ngrams
:type n: int
:param pad_left: whether the ngrams should be left-padded
:type pad_left: bool
:param pad_right: whether the ngrams should be right-padded
:type pad_right: bool
:param left_pad_symbol: the symbol to use for left padding (default is None)
:type left_pad_symbol: any
:param right_pad_symbol: the symbol to use for right padding (default is None)
:type right_pad_symbol: any
:rtype: sequence or iter
"""
sequence = iter(sequence)
if pad_left:
sequence = chain((left_pad_symbol,) * (n-1), sequence)
if pad_right:
sequence = chain(sequence, (right_pad_symbol,) * (n-1))
return sequence
def ngrams(sequence, n, pad_left=False, pad_right=False,
left_pad_symbol=None, right_pad_symbol=None):
"""
Return the ngrams generated from a sequence of items, as an iterator.
For example:
>>> from nltk.util import ngrams
>>> list(ngrams([1,2,3,4,5], 3))
[(1, 2, 3), (2, 3, 4), (3, 4, 5)]
Wrap with list for a list version of this function. Set pad_left
or pad_right to true in order to get additional ngrams:
>>> list(ngrams([1,2,3,4,5], 2, pad_right=True))
[(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]
>>> list(ngrams([1,2,3,4,5], 2, pad_right=True, right_pad_symbol='</s>'))
[(1, 2), (2, 3), (3, 4), (4, 5), (5, '</s>')]
>>> list(ngrams([1,2,3,4,5], 2, pad_left=True, left_pad_symbol='<s>'))
[('<s>', 1), (1, 2), (2, 3), (3, 4), (4, 5)]
>>> list(ngrams([1,2,3,4,5], 2, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>'))
[('<s>', 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, '</s>')]
:param sequence: the source data to be converted into ngrams
:type sequence: sequence or iter
:param n: the degree of the ngrams
:type n: int
:param pad_left: whether the ngrams should be left-padded
:type pad_left: bool
:param pad_right: whether the ngrams should be right-padded
:type pad_right: bool
:param left_pad_symbol: the symbol to use for left padding (default is None)
:type left_pad_symbol: any
:param right_pad_symbol: the symbol to use for right padding (default is None)
:type right_pad_symbol: any
:rtype: sequence or iter
"""
sequence = pad_sequence(sequence, n, pad_left, pad_right,
left_pad_symbol, right_pad_symbol)
history = []
while n > 1:
history.append(next(sequence))
n -= 1
for item in sequence:
history.append(item)
yield tuple(history)
del history[0]
def bigrams(sequence, **kwargs):
"""
Return the bigrams generated from a sequence of items, as an iterator.
For example:
>>> from nltk.util import bigrams
>>> list(bigrams([1,2,3,4,5]))
[(1, 2), (2, 3), (3, 4), (4, 5)]
Use bigrams for a list version of this function.
:param sequence: the source data to be converted into bigrams
:type sequence: sequence or iter
:rtype: iter(tuple)
"""
for item in ngrams(sequence, 2, **kwargs):
yield item
def trigrams(sequence, **kwargs):
"""
Return the trigrams generated from a sequence of items, as an iterator.
For example:
>>> from nltk.util import trigrams
>>> list(trigrams([1,2,3,4,5]))
[(1, 2, 3), (2, 3, 4), (3, 4, 5)]
Use trigrams for a list version of this function.
:param sequence: the source data to be converted into trigrams
:type sequence: sequence or iter
:rtype: iter(tuple)
"""
for item in ngrams(sequence, 3, **kwargs):
yield item