The Road to Deep Learning (1): Time Series Prediction with an LSTM Network

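Introduction

Problem: given a set of one-dimensional data, such as the sales volume of a product or the price of a stock, use a deep learning model to predict it, for example using the previous 50 values to predict the next one.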

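First, here is the dataset:

Figure: the first 10 rows of the data

Next, by processing the data and then building and training a model, we obtain the prediction model we want.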

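Reading and processing the data

Reading the data: load_data(filename, time_step)

Use pandas to read the CSV file. Pay attention to the path: write filename with '/' rather than '\', because a backslash inside an ordinary Python string starts an escape sequence. The time_step variable sets how many historical values are used as the basis for predicting the next one. Following the problem statement, the previous 50 values are used, so time_step is 50.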

import time
import keras
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense, Activation, Dropout

def load_data(filename, time_step):
    '''
    Load a single-column CSV file and cut it into sliding windows.

    filename: str
        path of the CSV file; use '/' as the path separator
    time_step: int
        number of previous samples used to predict the next sample
    '''
    df = pd.read_csv(filename, header=None)
    data = df.values
    data = data.astype('float32')  # confirm the type as 'float32'
    data = data.reshape(data.shape[0], )
    # plt.title('original data')
    # plt.plot(data)
    # plt.savefig('original data.png')
    # plt.show()
    # build the windows: each row holds time_step previous samples followed by the sample to predict
    result = []
    for index in range(len(data) - time_step):
        result.append(data[index:index + time_step + 1])

    # result has shape (len(data) - time_step, time_step + 1); the last column is the sample to predict
    return np.array(result)

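Here the list variable result puts the 50 historical values and the one value to predict in a single row, so result ends up as a list of len(data) - time_step rows with 51 values each. It is converted to a NumPy array of shape (len(data) - time_step, 51) before being returned, which makes it easier to work with.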
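To make the shape concrete, here is a minimal sketch of the same windowing on a toy series. The helper name make_windows and the toy values are made up for illustration and are not part of the original code.

```python
import numpy as np

def make_windows(series, time_step):
    """Return rows of time_step history values followed by one target value."""
    series = np.asarray(series, dtype='float32')
    rows = [series[i:i + time_step + 1] for i in range(len(series) - time_step)]
    return np.array(rows)

toy = np.arange(10)             # 0, 1, ..., 9
windows = make_windows(toy, 3)  # 3 history values + 1 target per row
print(windows.shape)            # (7, 4)
print(windows[0])               # [0. 1. 2. 3.]
print(windows[-1])              # [6. 7. 8. 9.]
```

With 10 values and time_step = 3 there are 10 - 3 = 7 rows, and the last column of each row is the value the first three columns are meant to predict.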

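Normalizing the data and splitting it into training and test sets

First normalize the data with MinMaxScaler from sklearn.preprocessing, then split it into a training set and a test set at a ratio of 7:3. Note that the scaler is fitted on the full windowed array before the split, so each of the 51 columns gets its own minimum and maximum.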

data = load_data('sp500.csv', 50)

# normalize the data and split it into train and test set
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(data)
# use the first 70% of the rows for training and the rest for testing
train_count = int(0.7 * len(dataset))
x_train_set, x_test_set = dataset[:train_count, :-1], dataset[train_count:, :-1]
y_train_set, y_test_set = dataset[:train_count, -1], dataset[train_count:, -1]

# reshape the data to meet the input requirement of LSTM: (samples, time_steps, features)
x_train_set = x_train_set.reshape(x_train_set.shape[0], 1, x_train_set.shape[1])
x_test_set = x_test_set.reshape(x_test_set.shape[0], 1, x_test_set.shape[1])
y_train_set = y_train_set.reshape(y_train_set.shape[0], 1)
y_test_set = y_test_set.reshape(y_test_set.shape[0], 1)

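Note that without the reshape, y_train_set would have shape (M,), a one-dimensional array, rather than an (M, 1) column. Arrays of shape (M,) are one of the most common sources of bugs in machine learning code.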
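The difference is easy to see on a small array. The sketch below uses made-up values and only NumPy; it shows two typical ways a (M,) array goes wrong next to a 2-D one.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0], dtype='float32')
print(y.shape)                 # (3,)

y_col = y.reshape(y.shape[0], 1)
print(y_col.shape)             # (3, 1)

# Mixing the two shapes silently broadcasts to a matrix instead of
# subtracting element by element.
diff = y - y_col
print(diff.shape)              # (3, 3)

# Stacking a 2-D block with the 1-D array fails; with the column it works.
pad = np.zeros((3, 2), dtype='float32')
try:
    np.hstack((pad, y))
    stacked_1d_ok = True
except ValueError:
    stacked_1d_ok = False
stacked = np.hstack((pad, y_col))
print(stacked_1d_ok, stacked.shape)   # False (3, 3)
```

The broadcasting case is the more dangerous of the two, because it raises no error and only shows up as a wrong shape or a wrong loss value later on.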

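Building the model

Here we build a four-layer network; the number of neurons in each layer is taken from the layer parameter. When constructing an LSTM, its parameters, especially the various sizes, confuse many beginners. As I understand it, two basic parameters need to be set for an LSTM: units and input_shape.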

Figure: an LSTM cell (left) and its unfolded form (right)

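units: the dimensionality of the layer's output (its hidden state). For the first hidden layer, this is the number of values it passes to the second hidden layer as input at each time step.
input_shape: according to the official documentation, the LSTM input is a 3D tensor of the form (samples, time_steps, features). features is the dimension of each sample at one time step. If time_steps = t, this amounts to unfolding the cell into t steps with inputs x₀ to x_{t-1}. samples can be omitted, so input_shape is given as (time_steps, features).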
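One way to check your understanding of units and features is to count the trainable weights of an LSTM layer by hand. The sketch below assumes a standard LSTM with bias and four gates; the helper name lstm_param_count and the layer sizes (50 features, 50 and 100 units) are illustrative assumptions, not values taken from this post.

```python
def lstm_param_count(features, units):
    """Trainable weights of a standard LSTM layer with bias.

    Each of the 4 gates has an input kernel (features x units),
    a recurrent kernel (units x units) and a bias (units).
    """
    return 4 * (units * (features + units) + units)

# First LSTM layer: 50 input features, 50 units (assumed sizes).
first = lstm_param_count(features=50, units=50)
# Second LSTM layer: its input features are the first layer's units.
second = lstm_param_count(features=50, units=100)

print(first)    # 20200
print(second)   # 60400
```

time_steps does not appear in the formula: the same weights are reused at every step of the unfolded cell, so changing time_steps changes the amount of computation but not the number of parameters. The counts should agree with what model.summary() reports for such layers.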

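In this example, after processing, X has shape (m, 50), where m is the number of samples and 50 is treated as the number of features (strictly speaking, these 50 values are time steps). With the reshape used above, therefore, time_steps = 1 and features = 50.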
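The two ways of reading the 50 values correspond to two different reshapes. The sketch below uses random toy data and only NumPy; the second layout is an alternative that this post does not use, shown only for comparison.

```python
import numpy as np

m, window = 4, 50
x = np.random.rand(m, window).astype('float32')

# Layout used in this post: one time step carrying 50 features.
x_as_features = x.reshape(m, 1, window)
# Alternative layout: 50 time steps carrying one feature each.
x_as_steps = x.reshape(m, window, 1)

print(x_as_features.shape)   # (4, 1, 50)
print(x_as_steps.shape)      # (4, 50, 1)
```

With the first layout, input_shape is (1, 50) and the LSTM sees each window as a single step. With the second, input_shape would be (50, 1) and the cell would be unfolded over 50 steps, which is closer to how a window of history is usually fed to a recurrent layer.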

def build_model(layer):
    '''
    layer: list of int
        [input features, units of the 1st LSTM, units of the 2nd LSTM, output units]
    '''
    model = Sequential()
    # set the first hidden layer and set the input dimension
    model.add(LSTM(
        input_shape=(1, layer[0]), units=layer[1], return_sequences=True
    ))
    model.add(Dropout(0.2))

    # add the second layer
    model.add(LSTM(
        units=layer[2], return_sequences=False
    ))
    model.add(Dropout(0.2))

    # add the output layer with a Dense
    model.add(Dense(units=layer[3], activation='linear'))
    model.compile(loss='mse', optimizer='adam')

    return model

I am a beginner and still unclear on many of these concepts, so corrections from more experienced readers are very welcome.

Training and prediction

Once the model is built, train it on the training set and test it on the test set.

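# build the model; layer[0] must equal the 50 input features and layer[3] the single output.
# The hidden sizes 50 and 100 are assumed values for illustration; the original post does not state them.
model = build_model([50, 50, 100, 1])

# train the model, holding out 20% of the training set for validation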
model.fit(x_train_set, y_train_set, batch_size=128, epochs=20, validation_split=0.2)

# do the prediction
y_predicted = model.predict(x_test_set)

Here validation_split is set to carve a validation set out of the training set, which gives an early warning of overfitting. For more on validation_split, see https://www.jianshu.com/p/0c7af5fbcf72
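According to the Keras documentation, validation_split takes the held-out samples from the end of the arrays passed to fit, before any shuffling. The sketch below imitates that behaviour with NumPy only; the helper name split_like_validation_split and the toy arrays are made up, and it is an approximation of what Keras does rather than its actual implementation.

```python
import numpy as np

def split_like_validation_split(x, y, validation_split):
    """Hold out the last fraction of the samples, without shuffling."""
    split_at = int(len(x) * (1.0 - validation_split))
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])

x = np.arange(10).reshape(10, 1)
y = np.arange(10)
(train_x, train_y), (val_x, val_y) = split_like_validation_split(x, y, 0.2)
print(len(train_x), len(val_x))   # 8 2
print(val_y)                      # [8 9]
```

For a time series this means the validation set consists of the most recent windows of the training range, which is usually what you want when checking for overfitting.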

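Plotting

The last step is to plot the predictions. To compare them with the original data, first convert the predictions back to the units of the original data by calling inverse_transform on the scaler defined earlier.

The problem I ran into was that inverse_transform complained that the size of y_test_set did not match the data the scaler was fitted on, which makes sense: scaler.fit_transform was applied to data with 51 columns. y_test_set therefore has to be padded, using hstack to stack an array of zeros with it. Since MinMaxScaler scales each column independently, only the last column matters and the zeros are just placeholders.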

# plot the predicted curve and the original curve
# fill some zeros to get a (len, 51) array
temp = np.zeros((len(y_test_set), 50))
origin_temp = np.hstack((temp, y_test_set))
predict_temp = np.hstack((temp, y_predicted))

# transform the data back to the original scale
origin_test = scaler.inverse_transform(origin_temp)
predict_test = scaler.inverse_transform(predict_temp)

plot_curve(origin_test[:, -1], predict_test[:, -1])

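If y_test_set had not been reshaped earlier into an array with one column, this step would fail with a dimension mismatch error, because before the reshape it is a one-dimensional array of shape (M,) and cannot be stacked with the 2-D zero array.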
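The zero padding works because min-max scaling treats every column separately, so the last column can also be inverted on its own. The sketch below reproduces per-column min-max scaling with plain NumPy on random toy data; all names are made up for illustration. scikit-learn's MinMaxScaler stores comparable per-column values in data_min_ and data_max_, so the same shortcut should carry over, but that is not tested here.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.uniform(10.0, 20.0, size=(6, 4))   # toy stand-in for the (N, 51) array

# Per-column min-max scaling to [0, 1], the same idea as MinMaxScaler.
col_min = data.min(axis=0)
col_max = data.max(axis=0)
scaled = (data - col_min) / (col_max - col_min)

def inverse_all(scaled_rows):
    """Undo the scaling for a full-width array."""
    return scaled_rows * (col_max - col_min) + col_min

y_scaled = scaled[:, -1:]                      # shape (6, 1), like y_test_set

# Approach of this post: pad with zeros, invert everything, keep the last column.
padded = np.hstack((np.zeros((len(y_scaled), data.shape[1] - 1)), y_scaled))
y_via_padding = inverse_all(padded)[:, -1]

# Shortcut: invert the last column directly with its own min and max.
y_direct = y_scaled[:, 0] * (col_max[-1] - col_min[-1]) + col_min[-1]

print(np.allclose(y_via_padding, y_direct))    # True
print(np.allclose(y_direct, data[:, -1]))      # True
```

Both routes give the same values; the padding route just does extra work on columns whose result is thrown away.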

The plot_curve function is shown below; in a script it must be defined before the call above:

def plot_curve(true_data, predicted_data):
    '''
    true_data: array of float32
        the true test data
    predicted_data: array of float32
        the data predicted by the model
    '''
    plt.plot(true_data, label='True data')
    plt.plot(predicted_data, label='Predicted data')
    plt.legend()
    plt.savefig('result.png')
    plt.show()

The result is shown below:

Figure: prediction result
