Text Sentiment Analysis

## 1) What is Sentiment Analysis?

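Sentiment analysis, also known as orientation analysis, opinion extraction, opinion mining, sentiment mining, or subjectivity analysis, is the process of analyzing, processing, summarizing, and reasoning over subjective text that carries emotional coloring. A typical example is determining, from review text, users' sentiment toward attributes of a digital camera such as zoom, price, size, weight, flash, and ease of use.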

Why does sentiment analysis matter? A few practical applications illustrate it directly:
• Movie: is this review positive or negative?
• Products: what do people think about the new iPhone?
• Public sentiment: how is consumer confidence? Is despair increasing?
• Politics: what do people think about this candidate or issue?
• Prediction: predict election outcomes or market trends from sentiment
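The main goal of sentiment analysis is to identify people's views of, or attitudes toward, things or persons (attitudes: enduring, affectively colored beliefs, dispositions towards objects or persons). The elements involved are:
**Holder (source)** of attitude: the person holding the opinion
**Target (aspect)** of attitude: the object being evaluated
**Type of attitude**: the opinion expressed. From a set of types: like, love, hate, value, desire, etc.
Or (more commonly) simple weighted polarity: *positive, negative, neutral*, together with strength

Text containing the attitude: the evaluated text, usually a sentence or an entire document

Finer-grained analyses also cover evaluated attributes, sentiment (polarity) words, and evaluation collocations.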
The sentiment analysis tasks we typically face fall into the following categories:
Simplest task: Is the attitude of this text positive or negative?
More complex: Rank the attitude of this text from 1 to 5
Advanced: Detect the target, source, or complex attitude types

The following sections use the simplest task as the running example.

## 2) A Baseline Algorithm

This section uses sentiment analysis of movie reviews to present a simple, practical sentiment analysis system. For details, see: Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. EMNLP-2002, 79-86.
Bo Pang and Lillian Lee. 2004. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. ACL, 271-278.
The task is "Polarity detection: Is an IMDB movie review positive or negative?", and the dataset is "*Polarity Data 2.0*: http://www.cs.cornell.edu/people/pabo/movie-review-data".
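The authors treat sentiment analysis as a classification task and split it into the following subtasks:
Tokenization: extract the body text, filter out dates, phone numbers, and the like, keep strings that begin with a capital letter, keep emoticons, and split the text into tokens.
Feature Extraction: intuitively, one would expect adjectives to determine the sentiment of a text directly, but the experiments of Pang and Lee show that using all words (unigrams) as features gives better sentiment classification.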

Negation needs special handling here. The sentences "I didn't like this movie" and "I really like this movie" differ by a single unigram yet have opposite meanings. To handle this, Das and Chen (2001) proposed "Add NOT_ to every word between negation and following punctuation". Under this rule, "didn't like this movie , but I" becomes "didn't NOT_like NOT_this NOT_movie , but I".
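The NOT_ rule can be sketched in a few lines of Python. This is only an illustration: the negation word list and the punctuation set are assumptions chosen for the example, not the exact lists used by Das and Chen, and the input is assumed to be already tokenized on whitespace.

```python
import re

# Assumed negation cues: a few common words plus any token ending in n't.
NEGATION = re.compile(r"^(?:not|no|never|cannot)$|n['’]t$", re.IGNORECASE)
# Assumed clause-ending punctuation tokens.
PUNCT = re.compile(r"^[.,:;!?]+$")

def mark_negation(tokens):
    """Prefix NOT_ to every token between a negation cue and the next punctuation."""
    out, negated = [], False
    for tok in tokens:
        if PUNCT.match(tok):
            negated = False              # punctuation closes the negated span
            out.append(tok)
        elif negated:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
            if NEGATION.search(tok):
                negated = True           # start marking from the next token
    return out

print(" ".join(mark_negation("didn't like this movie , but I".split())))
```

After this step "like" and "NOT_like" are distinct unigram features, so a classifier can weight them differently.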
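In addition, when extracting features, the intuition is that "Word occurrence may matter more than word frequency": the most relevant sentiment words often appear only once in a passage, so a term-frequency model helps little and may even hurt. The multi-variate Bernoulli event space is therefore used in place of the multinomial event space, and the experiments confirm the intuition. The paper thus settles on binary features, that is, whether a word occurs or not, instead of conventional frequency features. Using log(freq(w)) is another way to dampen the effect of frequency that is worth trying.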
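The three weighting options mentioned above can be compared with a small helper. The function name `features` and the `scheme` parameter are hypothetical. The log variant uses 1 + log(freq), an assumption made here so that a word occurring once keeps a non-zero weight (plain log(freq) would give it weight 0).

```python
import math
from collections import Counter

def features(tokens, scheme="binary"):
    """Map a token list to a {word: weight} feature dictionary."""
    counts = Counter(tokens)
    if scheme == "binary":               # occurrence only: present or absent
        return {w: 1 for w in counts}
    if scheme == "log":                  # dampened frequency (1 + log is an assumption)
        return {w: 1 + math.log(c) for w, c in counts.items()}
    return dict(counts)                  # raw term frequency

tokens = "great great great plot".split()
print(features(tokens, "binary"))
print(features(tokens, "count"))
print(features(tokens, "log"))
```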
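Classification using different classifiers: for example Naïve Bayes, MaxEnt, and SVM. Taking the Naïve Bayes classifier as an example, training proceeds as follows: estimate a prior for each class and a (typically smoothed) likelihood for each word given the class from the labeled reviews.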

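Prediction proceeds as follows: assign a review to the class that maximizes the class prior multiplied by the likelihoods of the words the review contains.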
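As a rough illustration of training and prediction, the sketch below implements a multinomial Naïve Bayes with binarized counts (each word counted at most once per document) and add-alpha smoothing. It is not the paper's code: the function names, the `alpha` parameter, and the four toy reviews are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, alpha=1.0):
    """docs: list of (tokens, label). Returns log priors, log likelihoods, vocabulary."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        for w in set(tokens):            # binarize: one count per word per document
            word_counts[label][w] += 1
            vocab.add(w)
    n_docs = sum(class_docs.values())
    log_prior = {c: math.log(n / n_docs) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab}
    return log_prior, log_like, vocab

def predict_nb(tokens, log_prior, log_like, vocab):
    """Return the class with the highest log posterior; unknown words are skipped."""
    scores = {c: log_prior[c] + sum(log_like[c][w] for w in set(tokens) if w in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)

# Hypothetical toy training set.
train = [("great plot great cast".split(), "pos"),
         ("brilliant and moving".split(), "pos"),
         ("boring plot".split(), "neg"),
         ("terrible boring cast".split(), "neg")]
model = train_nb(train)
print(predict_nb("brilliant cast".split(), *model))
print(predict_nb("boring".split(), *model))
```

Working in log space avoids numeric underflow when a review contains many words.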

The experiments show that MaxEnt and SVM achieve better results than Naïve Bayes.
Finally, a review of individual cases suggests what makes sentiment classification of movie reviews hard:
Subtlety of expression: "If you are reading this because it is your darling fragrance, please wear it at home exclusively, and tape the windows shut." and "She runs the gamut of emotions from A to B".
Thwarted expectations: the review first describes the initial expectations (with generous praise) and then expresses the final disappointment, for example "This film should be brilliant. It sounds like a great plot, the actors are **first grade**, and the supporting cast is good as well, and Stallone is attempting to deliver a good performance. However, it can't hold up." and "Well as usual Keanu Reeves is nothing special, but surprisingly, the very talented Laurence Fishbourne is **not so good** either, I was surprised."

## 3) Sentiment Lexicons

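Sentiment analysis models rely heavily on sentiment lexicons for extracting features or rules. The following open sentiment lexicon resources are popular and mature:
GI (The General Inquirer): this lexicon gives very comprehensive information for each entry, such as part of speech, antonyms, and positive/negative polarity. It is organized as follows: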

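For details, see: Philip J. Stone, Dexter C Dunphy, Marshall S. Smith, Daniel M. Ogilvie. 1966. The General Inquirer: A Computer Approach to Content Analysis. MIT Press.
LIWC (Linguistic Inquiry and Word Count): this lexicon uses a large number of regular expressions to describe the patterns of sentiment words in different categories. Its category system is largely the same as that of GI (The General Inquirer). It is organized as follows: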

For details, see: Pennebaker, J.W., Booth, R.J., & Francis, M.E. (2007). Linguistic Inquiry and Word Count: LIWC 2007. Austin, TX.
MPQA Subjectivity Cues Lexicon: contains 2718 positive words and 4912 negative words, organized as shown in the figure below:

For details, see: Theresa Wilson, Janyce Wiebe, and Paul Hoffmann (2005). Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. Proc. of HLT-EMNLP-2005.
Riloff and Wiebe (2003). Learning extraction patterns for subjective expressions. EMNLP-2003.
Bing Liu Opinion Lexicon: contains 2006 positive words and 4783 negative words. Notably, the lexicon includes not only standard words but also misspellings, morphological variants, slang, and social-media markup. For details, see: Minqing Hu and Bing Liu. Mining and Summarizing Customer Reviews. ACM SIGKDD-2004.
SentiWordNet: assigns sentiment classes to WordNet entries and annotates each entry with weights for the positive and negative classes. It is organized as follows:

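For details, see: Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SENTIWORDNET 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining. LREC-2010.
The resources above are all usable sentiment lexicons, but how should one choose among them? One approach is to compare how the same entry is classified in different lexicons and measure the degree of disagreement between the resources, as follows: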

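For entries on which the lexicons disagree, there are at least two things we can do. First, review these entries and correct them with a small amount of manual effort. Second, collect them as entries whose polarity is ambiguous.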
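One simple way to measure disagreement between two lexicons is to look only at the entries they share and count how many carry different labels. The sketch below assumes each lexicon has been loaded into a plain dictionary from word to "pos"/"neg"; the two tiny lexicons are invented for illustration and do not reflect the real resources.

```python
def disagreement(lex_a, lex_b):
    """Return (conflicting words, disagreement rate) over entries present in both lexicons."""
    shared = lex_a.keys() & lex_b.keys()
    conflicts = sorted(w for w in shared if lex_a[w] != lex_b[w])
    rate = len(conflicts) / len(shared) if shared else 0.0
    return conflicts, rate

# Hypothetical miniature lexicons.
lex_a = {"good": "pos", "bad": "neg", "proud": "pos", "cheap": "neg"}
lex_b = {"good": "pos", "bad": "neg", "proud": "neg", "slow": "neg"}

conflicts, rate = disagreement(lex_a, lex_b)
print(conflicts, round(rate, 3))
```

The conflict list is the set of entries to review by hand or to treat as polarity-ambiguous.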
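Given a word, how do we determine how likely it is to appear in text of a given sentiment class? Taking IMDB reviews with different ratings as an example, the simplest method is to count how often the word occurs in the text for each rating (number of stars). The figure below shows the distribution of Count("bad"):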

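More commonly, the likelihood is used, where f(w, c) is the number of occurrences of word w in the text of class c:

$$P(w \mid c) = \frac{f(w, c)}{\sum_{w' \in c} f(w', c)}$$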

To make the probabilities comparable across words and classes, the scaled likelihood is usually used instead:

$$\frac{P(w \mid c)}{P(w)}$$

The figure below lists the scaled likelihood of several entries in each class, from which the polarity of each entry can be judged.
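To make the computation concrete, the sketch below takes the likelihood of a word in a class to be its count in that class divided by all token counts in that class, and the scaled likelihood to be that value divided by the word's overall probability P(w). The per-rating counts are invented for illustration, and the function names are hypothetical.

```python
def likelihood(counts, word, c):
    """P(w|c): share of class c's tokens that are `word`."""
    total = sum(counts[c].values())
    return counts[c].get(word, 0) / total

def scaled_likelihood(counts, word, c):
    """P(w|c) / P(w), with P(w) estimated over all classes together."""
    all_tokens = sum(sum(wc.values()) for wc in counts.values())
    word_total = sum(wc.get(word, 0) for wc in counts.values())
    if word_total == 0:
        return 0.0                       # unseen word: no evidence either way
    return likelihood(counts, word, c) / (word_total / all_tokens)

# Hypothetical token counts for 1-star and 10-star reviews.
counts = {1: {"bad": 30, "good": 10, "movie": 60},
          10: {"bad": 5, "good": 35, "movie": 60}}

for stars in counts:
    print(stars, round(scaled_likelihood(counts, "bad", stars), 3))
```

A value above 1 means the word is over-represented in that class relative to its overall frequency; a value below 1 means it is under-represented.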
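A further question often arises: are negation words (such as not, n't, no, never) more likely to appear in text with negative sentiment? Christopher Potts (2011), among others, answered this experimentally: "More negation in negative sentiment", as the figure below shows: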

## 4) Learning Sentiment Lexicons

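While it is fortunate that so many public sentiment lexicons are available, we also want to understand how such lexicons are built, that is, to know not only what works but why. On the one hand, a new sentiment analysis problem or task will often require building or extending a lexicon to fit the actual need. On the other hand, mature lexicon construction methods can be applied to other domains: knowledge has no boundaries, and many methods carry over.
A common way to build a sentiment lexicon is semi-supervised bootstrapping, which has two main steps: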
Use a small amount of information (seed):
• A few labeled examples
• A few hand-built patterns

To bootstrap a lexicon

Next, several related papers are used to explain how sentiment lexicons are built:
**1. Hatzivassiloglou & McKeown:** see Vasileios Hatzivassiloglou and Kathleen R. McKeown. 1997. Predicting the Semantic Orientation of Adjectives. ACL, 174-181. The method rests on a linguistic observation: "Adjectives conjoined by 'and' have same polarity; adjectives conjoined by 'but' do not", as in these examples:
fair and legitimate, corrupt and brutal
*fair and brutal, *corrupt and legitimate
fair **but** brutal

Hatzivassiloglou & McKeown (1997) proposed a bootstrapping-based learning method with four main steps:
Step 1: Label seed set of 1336 adjectives (all >20 in 21 million word WSJ corpus)

The initial seed set contains 657 positive words (such as adequate central clever famous intelligent remarkable reputed sensitive slender thriving…) and 679 negative words (such as contagious drunken ignorant lanky listless primitive strident troublesome unresolved unsuspecting…).
Step 2: Expand seed set to conjoined adjectives, as shown in the figure below:

Step 3: Supervised classifier assigns "polarity similarity" to each word pair, resulting in graph, as shown in the figure below:

Step 4: Clustering for partitioning the graph into two

Finally, a new sentiment lexicon is output, as follows (entries in bold were mined automatically):
Positive: bold decisive disturbing generous good honest important large mature patient peaceful positive proud sound stimulating straightforward strange talented vigorous witty…
Negative: ambiguous cautious cynical evasive harmful hypocritical inefficient insecure irrational irresponsible minor outspoken pleasant reckless risky selfish tedious unsupported vulnerable wasteful…
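The intuition behind the conjunction method can be illustrated with a much simpler stand-in for Steps 3 and 4: instead of a supervised pair classifier followed by clustering, the sketch below just propagates seed labels breadth-first across a graph in which "and" links keep the polarity and "but" links flip it. The seeds, the links, and the first-label-wins rule are all assumptions made for the example.

```python
from collections import deque

def propagate(seeds, links):
    """seeds: {adjective: +1 or -1}; links: (adj1, conj, adj2) with conj in {"and", "but"}."""
    graph = {}
    for a, conj, b in links:
        sign = 1 if conj == "and" else -1    # "and" keeps polarity, "but" flips it
        graph.setdefault(a, []).append((b, sign))
        graph.setdefault(b, []).append((a, sign))
    polarity = dict(seeds)
    queue = deque(seeds)
    while queue:
        word = queue.popleft()
        for other, sign in graph.get(word, []):
            if other not in polarity:        # first label wins; later conflicts are ignored
                polarity[other] = polarity[word] * sign
                queue.append(other)
    return polarity

# Hypothetical seeds and conjunction pairs extracted from a corpus.
seeds = {"fair": 1, "corrupt": -1}
links = [("fair", "and", "legitimate"), ("corrupt", "and", "brutal"),
         ("fair", "but", "brutal"), ("legitimate", "and", "honest")]
print(propagate(seeds, links))
```

Real corpus links are noisy, which is why the published method scores pairs with a classifier and clusters the graph rather than trusting every single conjunction.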

**2. Turney Algorithm:** see Turney (2002): Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. The steps are:
Step 1: Extract a phrasal lexicon from reviews. The phrases extracted by rules are shown in the figure below:

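Step 2: Learn polarity of each phrase. How should the polarity of a phrase be assessed? Intuitively, "Positive phrases co-occur more with 'excellent'; negative phrases co-occur more with 'poor'". The problem then becomes how to measure co-occurrence between terms, for which pointwise mutual information (PMI) is used. PMI is often used to measure how strongly two specific events are associated:

$$\mathrm{PMI}(X, Y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$$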

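The PMI of two words is:

$$\mathrm{PMI}(word_1, word_2) = \log_2 \frac{P(word_1, word_2)}{P(word_1)\,P(word_2)}$$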

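A common way to compute PMI(word1, word2) is to issue "word1", "word2", and "word1 NEAR word2" as queries and derive P(word) and P(word1, word2) from the search engine hit counts:
P(word) = hits(word)/N
P(word1, word2) = hits(word1 NEAR word2)/N^2

Substituting these into the PMI formula, the factors of N cancel:

$$\mathrm{PMI}(word_1, word_2) = \log_2 \frac{hits(word_1\ \mathrm{NEAR}\ word_2)}{hits(word_1)\,hits(word_2)}$$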

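The polarity of a phrase is then computed as follows ("excellent" and "poor" can be replaced by other words of known polarity):

$$\mathrm{Polarity}(phrase) = \mathrm{PMI}(phrase, \text{excellent}) - \mathrm{PMI}(phrase, \text{poor})$$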

On a dataset of 410 reviews from Epinions, of which 170 (41%) are negative and 240 (59%) positive, the Turney algorithm reaches 74% accuracy (the baseline of labeling everything positive is 59%).
Step 3: Rate a review by the average polarity of its phrases
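Steps 2 and 3 can be sketched from hit counts alone, using PMI(a, b) = log2(hits(a NEAR b) / (hits(a) · hits(b))) and polarity = PMI(phrase, "excellent") − PMI(phrase, "poor"). The hit counts below are invented, the `hits` dictionary layout and function names are hypothetical, and zero counts (which would need smoothing) are not handled.

```python
import math

def pmi_from_hits(hits_near, hits_a, hits_b):
    """PMI from hit counts; the corpus-size factors cancel out."""
    return math.log2(hits_near / (hits_a * hits_b))

def phrase_polarity(hits, phrase, pos="excellent", neg="poor"):
    """Polarity = PMI(phrase, pos) - PMI(phrase, neg)."""
    return (pmi_from_hits(hits[(phrase, pos)], hits[phrase], hits[pos])
            - pmi_from_hits(hits[(phrase, neg)], hits[phrase], hits[neg]))

def rate_review(hits, phrases):
    """Label a review by the average polarity of its extracted phrases."""
    avg = sum(phrase_polarity(hits, p) for p in phrases) / len(phrases)
    return ("positive" if avg > 0 else "negative"), avg

# Hypothetical hit counts; tuple keys stand for "phrase NEAR word" queries.
hits = {"excellent": 1000, "poor": 1000,
        "low fees": 200, ("low fees", "excellent"): 40, ("low fees", "poor"): 10,
        "virtual monopoly": 100, ("virtual monopoly", "excellent"): 2,
        ("virtual monopoly", "poor"): 4}

print(rate_review(hits, ["low fees", "virtual monopoly"]))
```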

**3. Using WordNet to learn polarity:** see S.M. Kim and E. Hovy. 2004. Determining the sentiment of opinions. COLING 2004; M. Hu and B. Liu. Mining and summarizing customer reviews. In Proceedings of KDD, 2004. The steps are:
Create positive (“good”) and negative seed-words (“terrible”)
Find Synonyms and Antonyms

Positive Set: Add synonyms of positive words (“well”) and antonyms of negative words
Negative Set: Add synonyms of negative words (“awful”) and antonyms of positive words (”evil”)
Repeat, following chains of synonyms
Filter
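The expansion loop above can be sketched without WordNet itself by standing in two small dictionaries for synonym and antonym lookups. Everything here is an assumption for illustration: the toy thesaurus, the fixed number of rounds, and the filter, which simply drops any word that ends up in both sets (the source only says "Filter").

```python
def expand(pos_seeds, neg_seeds, synonyms, antonyms, rounds=2):
    """Grow positive/negative sets by following synonym and antonym links."""
    pos, neg = set(pos_seeds), set(neg_seeds)
    for _ in range(rounds):
        new_pos = ({s for w in pos for s in synonyms.get(w, ())}
                   | {a for w in neg for a in antonyms.get(w, ())})
        new_neg = ({s for w in neg for s in synonyms.get(w, ())}
                   | {a for w in pos for a in antonyms.get(w, ())})
        pos |= new_pos
        neg |= new_neg
    conflict = pos & neg                 # assumed filter: drop words found in both sets
    return pos - conflict, neg - conflict

# Hypothetical stand-ins for WordNet synonym/antonym lookups.
synonyms = {"good": ["well", "fine"], "terrible": ["awful"], "fine": ["nice"]}
antonyms = {"good": ["evil"], "terrible": ["wonderful"]}

pos, neg = expand({"good"}, {"terrible"}, synonyms, antonyms)
print(sorted(pos), sorted(neg))
```

With a real WordNet interface the two dictionaries would be replaced by lookups over synsets and lemma antonyms, and word-sense ambiguity makes the filter step considerably more important.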

All of these methods adapt well across domains and are robust. The basic idea can be summarized as "Use seeds and semi-supervised learning to induce lexicons", that is:
Start with a seed set of words (‘good’, ‘poor’)
Find other words that have similar polarity:
• Using "and" and "but"
• Using words that occur nearby in the same document
• Using WordNet synonyms and antonyms
Use seeds and semi-supervised learning to induce lexicons

Last edited: 2017.12.08 01:30:31
