# Noun-phrase (NP) chunking with a regular-expression chunk grammar.
import nltk
from nltk.chunk import RegexpParser

# Chunk grammar for noun phrases:
#   JJ = adjective; NN = noun, singular or mass; CC = coordinating conjunction.
# An NP is zero or more adjectives followed by one or more nouns, optionally
# joined by a conjunction.
pattern = """ NP: {<JJ>*<NN>+} {<JJ>*<NN><CC>*<NN>+} """

# Build the chunk parser from the grammar.
chunker = RegexpParser(pattern)

# NOTE(review): `text` is assumed to be a raw input string defined earlier
# in the file — confirm against the surrounding code.
# Split the text into sentences.
tokenized_sentence = nltk.sent_tokenize(text)
# Split each sentence into word tokens.
tokenized_words = [nltk.word_tokenize(sentence) for sentence in tokenized_sentence]
# Part-of-speech tag each tokenized sentence.
tagged_words = [nltk.pos_tag(word) for word in tokenized_words]
# Parse each tagged sentence into an NP chunk tree.
word_tree = [chunker.parse(word) for word in tagged_words]
# If you do not provide the POS tag of the word, the lemmatizer will treat
# the word as a noun, so you may not get the result you expect.
# NOTE(review): `lemmatizer` is not defined in this chunk — presumably a
# WordNetLemmatizer instance created earlier in the file; confirm.
lemmatizer.lemmatize('spoken')  # Out: 'spoken' (lemmatized as a noun)
# FreqDist cheat sheet: frequency statistics over a token list `l`.
# NOTE(review): `l` is assumed to be a list of word tokens built earlier in
# the file — confirm. The original snippet mixed `fd` and an undefined
# `fd_w_humor`; unified to `fd` here.
fd = FreqDist(l)
# The 2 most frequent samples as (sample, count) pairs.
fd.most_common(2)
# Return a list of all samples that occur exactly once (hapax legomena).
fd.hapaxes()
# Find the word occurring the maximum number of times.
fd.max()
# Freq = number of occurrences / total number of words.
fd.freq('the')
# Check how many times the word 'pen' appeared.
# (dict-style .get returns None when absent; fd['pen'] would return 0.)
fd.get('pen')
# Conditional Frequency Distribution: built from (condition, sample) pairs —
# here (genre, word) over the whole Brown corpus.
# Use the tabulate method to check the distribution of modal words in
# different genres.
cfd = ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))
# NOTE(review): `genres` and `modals` are not defined in this chunk; the
# usual setup is e.g.
#   genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
#   modals = ['can', 'could', 'may', 'might', 'must', 'will']
# — confirm against the rest of the file.
cfd.tabulate(conditions=genres, samples=modals)
# Build (gender, last letter) pairs from the names corpus — the usual input
# for plotting a ConditionalFreqDist of name endings by gender.
# NOTE(review): the original comment said "plot the distribution", but no
# plotting call is visible here — presumably ConditionalFreqDist(l_names).plot()
# follows elsewhere in the file; confirm.
l_names = ([('male', name[-1]) for name in names.words('male.txt')] +
           [('female', name[-1]) for name in names.words('female.txt')])