NLP That Even a Monkey Can Understand (NLU)

import glob
import tensorflow as tf
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences, to_categorical
from keras import Sequential
import pandas as pd
from keras.layers import Embedding, Bidirectional, LSTM, Dense
import re

files = glob.glob('./nlu_data/SMSSpamCollection.csv')
data_pd = pd.concat([pd.read_csv(f, header=None, names=['label', 'text'], sep='\t') for f in files], ignore_index=True)

print(data_pd.info())
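
SMSSpamCollection is a tab-separated file with one message per line, the ham/spam label first. Before tokenizing, it is worth glancing at the class balance, since the dataset is heavily skewed toward ham (a quick sketch):

print(data_pd['label'].value_counts())  # ham far outnumbers spam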

text_tok = Tokenizer(lower=False, split=' ', oov_token='<OOV>')
label_tok = Tokenizer(lower=False, split=' ', oov_token='<OOV>')

text_tok.fit_on_texts(data_pd['text'])
label_tok.fit_on_texts(data_pd['label'])

text_config = text_tok.get_config()
label_config = label_tok.get_config()

print(text_config.get('document_count'))
print(label_config)

# the config stores index_word as a serialized string; eval turns it back into a dict
text_vocab = eval(text_config['index_word'])
label_vocab = eval(label_config['index_word'])
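
get_config() serializes index_word as a string, which is why eval is used to turn it back into a dict. If you prefer to avoid eval, the Tokenizer config encodes these fields as JSON, so json.loads also works as a sketch (it returns string keys instead of int keys, which is harmless here because the vocabularies are only used via len()):

import json
text_vocab = json.loads(text_config['index_word'])
label_vocab = json.loads(label_config['index_word'])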

x_tok = text_tok.texts_to_sequences(data_pd['text'])
y_tok = label_tok.texts_to_sequences(data_pd['label'])

print('text', data_pd['text'][0], x_tok[0])
print('label', data_pd['label'][0], y_tok[0])

max_len = 172
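
The hard-coded 172 is presumably the length of the longest tokenized message in this dataset. As a sketch, the value can be derived from the sequences instead of guessed:

max_len = max(len(seq) for seq in x_tok)  # 172 if that is indeed the longest message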

x_pad = pad_sequences(x_tok, padding='post', maxlen=max_len)
y_pad = y_tok  # each label is a single token, so no padding is needed

num_classes = len(label_vocab) + 1
Y = to_categorical(y_pad, num_classes)
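
With the <OOV>, ham, and spam entries plus the reserved padding index 0, num_classes works out to 4. Each entry of y_tok is a one-element list, but to_categorical drops a trailing dimension of size 1, so Y should come out as a flat one-hot matrix (a quick check):

print(num_classes)  # 4
print(Y.shape)      # expected: (5572, 4)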

vocab_size = len(text_vocab) + 1
embedding_dim = 64
rnn_units = 100
BATCH_SIZE = 90
dropout = 0.2

The only subtle point is that two BiLSTM layers are stacked so that the time dimension gets reduced: the first layer returns the full sequence (return_sequences=True), while the second returns only its final output vector. Without this reduction, the output would keep a (batch, timesteps, units) shape and would not match the expected (batch, num_classes) targets.

model = Sequential([
    # mask_zero=True makes downstream layers skip the padded positions;
    # batch_input_shape pins the batch size, so all splits must be multiples of BATCH_SIZE
    Embedding(vocab_size, embedding_dim, mask_zero=True, batch_input_shape=[BATCH_SIZE, None]),
    # first BiLSTM keeps the full sequence for the next layer
    Bidirectional(LSTM(units=rnn_units, return_sequences=True, dropout=dropout, kernel_initializer=tf.keras.initializers.he_normal())),
    # second BiLSTM collapses the time dimension into a single vector
    Bidirectional(LSTM(round(num_classes / 2))),
    Dense(num_classes, activation='softmax')
])
model.summary()  # summary() prints itself and returns None, so no print() wrapper is needed
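
A quick way to confirm the shape argument above is to push a dummy batch through the untrained model: it should map (BATCH_SIZE, max_len) integer inputs to (BATCH_SIZE, num_classes) probabilities. A sketch (ones rather than zeros, so mask_zero does not mask every timestep):

dummy = tf.ones([BATCH_SIZE, max_len], dtype=tf.int32)
print(model(dummy).shape)  # expected: (90, 4)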

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

X = x_pad
# 5572 samples cover 61 full batches of 90: 49 for training + 12 for testing (the last 82 samples are unused)
X_train = X[0: 4410]
Y_train = Y[0: 4410]

print(Y_train.shape)

X_test = X[4410: 5490]
Y_test = Y[4410: 5490]
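
This split simply cuts the file in order. Because batch_input_shape pins the batch size at 90, both splits must stay multiples of 90; a shuffled, stratified alternative under the same constraint (a sketch assuming scikit-learn is available):

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, train_size=4410, test_size=1080,
    stratify=data_pd['label'], random_state=42)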

model.fit(X_train, Y_train, batch_size=BATCH_SIZE, epochs=15)

model.evaluate(X_test, Y_test, batch_size=BATCH_SIZE)

y_pred = model.predict(X_test, batch_size=BATCH_SIZE)

# 3s 43ms/step - loss: 0.2169 - accuracy: 0.9333

# convert the predicted one-hot vectors back to class indices
y_pred = tf.argmax(y_pred, -1)
y_pnp = y_pred.numpy()

# convert the ground-truth one-hot vectors back to class indices
y_ground_true = tf.argmax(Y_test, -1)
y_ground_true_pnp = y_ground_true.numpy()


for i in range(20):
    x = 'sentence=> ' + text_tok.sequences_to_texts([X_test[i]])[0]
    x = re.sub(r'<OOV>\s*', '', x)  # strip <OOV> placeholders from the detokenized text
    ground_true = 'ground_true=> ' + label_tok.sequences_to_texts([[y_ground_true_pnp[i]]])[0]
    prediction = 'prediction=> ' + label_tok.sequences_to_texts([[y_pnp[i]]])[0]
    print(x)
    print(ground_true)
    print(prediction)
    print('\n')
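
Since ham heavily outnumbers spam, overall accuracy can look good even while spam messages slip through, so a per-class precision/recall breakdown is more informative. A sketch assuming scikit-learn, with the class indices looked up from the tokenizer rather than hard-coded:

from sklearn.metrics import classification_report
class_ids = [label_tok.word_index['ham'], label_tok.word_index['spam']]
print(classification_report(y_ground_true_pnp, y_pnp,
                            labels=class_ids, target_names=['ham', 'spam']))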

Accuracy on the test set is 97.87%.

12/12 [==============================] - 3s 53ms/step - loss: 0.1081 - accuracy: 0.9787

Output:

sentence=> For your chance to WIN a FREE Bluetooth Headset then simply reply back with ADP 
ground_true=> spam
prediction=> spam


sentence=> You also didnt get na hi hi hi hi hi 
ground_true=> ham
prediction=> ham


sentence=> Ya but it cant display internal subs so i gotta extract them 
ground_true=> ham
prediction=> ham


sentence=> If i said anything wrong sorry de 
ground_true=> ham
prediction=> ham


sentence=> Sad story of a Man Last week was my b'day My Wife did'nt wish me My Parents forgot n so did my Kids I went to work Even my Colleagues did not wish 
ground_true=> ham
prediction=> ham


sentence=> How stupid to say that i challenge god You dont think at all on what i write instead you respond immed 
ground_true=> ham
prediction=> ham


sentence=> Yeah I should be able to I'll text you when I'm ready to meet up 
ground_true=> ham
prediction=> ham


sentence=> V skint too but fancied few bevies waz gona go meet othrs in spoon but jst bin watchng planet earth sofa is v comfey If i dont make it hav gd night 
ground_true=> ham
prediction=> ham


sentence=> says that he's quitting at least5times a day so i wudn't take much notice of that Nah she didn't mind Are you gonna see him again Do you want to come to taunton tonight U can tell me all about 
ground_true=> ham
prediction=> ham


sentence=> When you get free call me 
ground_true=> ham
prediction=> ham


sentence=> How have your little darlings been so far this week Need a coffee run tomo Can't believe it's that time of week already … 
ground_true=> ham
prediction=> ham


sentence=> Ok i msg u b4 i leave my house 
ground_true=> ham
prediction=> ham


sentence=> Still at west coast Haiz Ü'll take forever to come back 
ground_true=> ham
prediction=> ham


sentence=> MMM Fuck Merry Christmas to me 
ground_true=> ham
prediction=> ham


sentence=> alright Thanks for the advice Enjoy your night out I'ma try to get some sleep 
ground_true=> ham
prediction=> ham


sentence=> Update your face book status frequently 
ground_true=> ham
prediction=> ham


sentence=> Just now saw your message it k da 
ground_true=> ham
prediction=> ham


sentence=> Was it something u ate 
ground_true=> ham
prediction=> ham


sentence=> So what did the bank say about the money 
ground_true=> ham
prediction=> ham


sentence=> Aiyar dun disturb u liao Thk u have lots 2 do aft ur cupboard come 
ground_true=> ham
prediction=> ham

Link to the code

