To illustrate some of the concepts in this chapter, we use the bike sharing dataset. It records the number of bikes rented per hour in a bike sharing system, along with related information such as date, time, weather, season, and holidays. Download the archive, strip the header line from hour.csv, and put the resulting file into HDFS:

[hadoop@master spark]$ wget http://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip
[hadoop@master spark]$ unzip Bike-Sharing-Dataset.zip
[hadoop@master spark]$ sed 1d hour.csv > hour_noheader.csv
[hadoop@master spark]$ hdfs dfs -put hour_noheader.csv ML/
[hadoop@master spark]$ cat Readme.txt
Dataset characteristics
=========================================
Both hour.csv and day.csv have the following fields, except hr which is not available in day.csv

- instant: record index
- dteday : date
- season : season (1:springer, 2:summer, 3:fall, 4:winter)
- yr : year (0: 2011, 1:2012)
- mnth : month (1 to 12)
- hr : hour (0 to 23)
- holiday : weather day is holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
- weekday : day of the week
- workingday : if day is neither weekend nor holiday is 1, otherwise is 0.
- weathersit :
    - 1: Clear, Few clouds, Partly cloudy, Partly cloudy
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    - 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
- temp : Normalized temperature in Celsius. The values are divided to 41 (max)
- atemp: Normalized feeling temperature in Celsius. The values are divided to 50 (max)
- hum: Normalized humidity. The values are divided to 100 (max)
- windspeed: Normalized wind speed. The values are divided to 67 (max)
- casual: count of casual users
- registered: count of registered users
- cnt: count of total rental bikes including both casual and registered
=========================================
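The readme notes that temp, atemp, hum, and windspeed are normalized by their maxima (41, 50, 100, and 67). A minimal sketch of recovering the raw values, using the divisors quoted in the readme (not part of the original session):

# Undo the normalisation described in Readme.txt; the divisors come from the readme above.
def denormalize(temp, atemp, hum, windspeed):
    return temp * 41.0, atemp * 50.0, hum * 100.0, windspeed * 67.0

print denormalize(0.24, 0.2879, 0.81, 0.0)   # first record -> (9.84, 14.395, 81.0, 0.0)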
Loading and inspecting the dataset
export PYSPARK_DRIVER_PYTHON=/usr/local/program/python2.7/bin/python
export PYSPARK_PYTHON=/usr/local/program/python2.7/bin/python

[hadoop@master ~]$ pyspark --master yarn --driver-memory 4G
>>> raw_data = sc.textFile("ML/hour_noheader.csv")
>>> num_data = raw_data.count()
>>> records = raw_data.map(lambda x: x.split(","))
>>> first = records.first()
>>> print first
[u'1', u'2011-01-01', u'1', u'0', u'1', u'0', u'0', u'6', u'0', u'1', u'0.24', u'0.2879', u'0.81', u'0', u'3', u'13', u'16']
>>> print num_data
17379
>>> records.cache()
PythonRDD[4] at RDD at PythonRDD.scala:48
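Before building features it is worth a quick look at the target variable. A minimal sketch (not part of the original session) that computes summary statistics for the rental count cnt, assuming the records RDD defined above:

# Summary statistics for the target variable (cnt, the last column of each record).
counts = records.map(lambda r: float(r[-1]))
print counts.stats()   # a StatCounter with count, mean, stdev, min and max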
To represent each categorical feature in binary (1-of-k) form, we map each feature value to the index of the non-zero entry in a binary vector. The get_mapping function below computes this mapping for a given column:
def get_mapping(rdd, idx):
    return rdd.map(lambda fields: fields[idx]).distinct().zipWithIndex().collectAsMap()

>>> print "Mapping of first categorical feature column: %s" % get_mapping(records, 2)
Mapping of first categorical feature column: {u'1': 0, u'3': 1, u'2': 2, u'4': 3}

# apply the mapping to each of the 8 categorical variables (columns 2 to 9)
mappings = [get_mapping(records, i) for i in range(2,10)]
>>> print mappings
[{u'1': 0, u'3': 1, u'2': 2, u'4': 3}, {u'1': 0, u'0': 1}, {u'11': 0, u'10': 6, u'12': 7, u'1': 1, u'3': 2, u'2': 8, u'5': 3, u'4': 9, u'7': 4, u'6': 10, u'9': 5, u'8': 11}, {u'20': 2, u'21': 14, u'22': 4, u'23': 15, u'1': 6, u'0': 18, u'3': 7, u'2': 19, u'5': 8, u'4': 20, u'7': 9, u'6': 21, u'9': 10, u'8': 22, u'11': 0, u'10': 12, u'13': 1, u'12': 13, u'15': 11, u'14': 23, u'17': 3, u'16': 17, u'19': 5, u'18': 16}, {u'1': 0, u'0': 1}, {u'1': 0, u'0': 3, u'3': 1, u'2': 4, u'5': 2, u'4': 5, u'6': 6}, {u'1': 0, u'0': 1}, {u'1': 0, u'3': 1, u'2': 2, u'4': 3}]

# once the mapping for each variable is known, compute the total length of the binary vector
# (len here is the built-in len function)
cat_len = sum(map(len, mappings))
num_len = len(records.first()[10:14])   # the four numerical columns: temp, atemp, hum, windspeed
total_len = num_len + cat_len
>>> print "Feature vector length for categorical features: %d" % cat_len
Feature vector length for categorical features: 57
>>> print "Feature vector length for numerical features: %d" % num_len
Feature vector length for numerical features: 4
>>> print "Total feature vector length: %d" % total_len
Total feature vector length: 61
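To make the 1-of-k encoding concrete, here is a minimal sketch (not part of the original session) of how a single season value is turned into a binary vector, using the mapping printed above:

import numpy as np

# The season mapping computed above: value -> position of the 1 in the binary vector.
season_mapping = {u'1': 0, u'3': 1, u'2': 2, u'4': 3}
vec = np.zeros(len(season_mapping))
vec[season_mapping[u'3']] = 1.0   # season '3' (fall)
print vec                         # [ 0.  1.  0.  0.]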
Creating feature vectors for the linear model
Using the mapping function above, we convert all categorical features into binary-encoded features. To make it convenient to extract the feature vector and label from each record, we define two helper functions, extract_features and extract_label:

from pyspark.mllib.regression import LabeledPoint
import numpy as np

def extract_features(record):
    cat_vec = np.zeros(cat_len)
    i = 0
    step = 0
    # note: record[2:9] covers the first 7 categorical columns, so the last
    # block (weathersit, column 9) of cat_vec stays all zeros
    for field in record[2:9]:
        # m is the mapping dict for the current categorical column
        m = mappings[i]
        idx = m[field]
        cat_vec[idx + step] = 1
        i = i + 1
        step = step + len(m)
    num_vec = np.array([float(field) for field in record[10:14]])
    return np.concatenate((cat_vec, num_vec))

def extract_label(record):
    # negative indices count from the end, so record[-1] is the last field (cnt)
    return float(record[-1])

data = records.map(lambda r: LabeledPoint(extract_label(r), extract_features(r)))
first_point = data.first()
>>> print "Raw data: " + str(first[2:])
Raw data: [u'1', u'0', u'1', u'0', u'0', u'6', u'0', u'1', u'0.24', u'0.2879', u'0.81', u'0', u'3', u'13', u'16']
>>> print "Label: " + str(first_point.label)
Label: 16.0
>>> print "Linear Model feature vector:\n" + str(first_point.features)
Linear Model feature vector:
[1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.24,0.2879,0.81,0.0]
>>> print "Linear Model feature vector length: " + str(len(first_point.features))
Linear Model feature vector length: 61
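The LabeledPoint RDD built here can be fed directly into MLlib's linear regression. A minimal training sketch (not part of the original session; the iteration count and step size are illustrative only):

from pyspark.mllib.regression import LinearRegressionWithSGD

# Train a linear regression model with SGD on the binary-encoded features.
linear_model = LinearRegressionWithSGD.train(data, iterations=10, step=0.1, intercept=False)
true_vs_predicted = data.map(lambda p: (p.label, linear_model.predict(p.features)))
print true_vs_predicted.take(5)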
Creating feature vectors for the decision tree
A decision tree model can consume the raw feature values directly (there is no need to encode the categorical variables as binary vectors). So we only need a simple extraction function that converts every value to a float and wraps the result in a numpy array:

def extract_features_dt(record):
    return np.array(map(float, record[2:14]))

data_dt = records.map(lambda r: LabeledPoint(extract_label(r), extract_features_dt(r)))
first_point_dt = data_dt.first()
>>> print "Decision Tree feature vector: " + str(first_point_dt.features)
Decision Tree feature vector: [1.0,0.0,1.0,0.0,0.0,6.0,0.0,1.0,0.24,0.2879,0.81,0.0]
>>> print "Decision Tree feature vector length: " + str(len(first_point_dt.features))
Decision Tree feature vector length: 12
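These raw-valued vectors can be passed to MLlib's decision tree trainer. A minimal sketch (not part of the original session; an empty categoricalFeaturesInfo treats every feature as numerical, and maxDepth is illustrative):

from pyspark.mllib.tree import DecisionTree

# Train a regression tree on the raw-valued features and compare predictions with the labels.
dt_model = DecisionTree.trainRegressor(data_dt, categoricalFeaturesInfo={}, maxDepth=5)
preds = dt_model.predict(data_dt.map(lambda p: p.features))
actual = data_dt.map(lambda p: p.label)
print actual.zip(preds).take(5)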