Kaggle Categorical Feature Encoding Challenge

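The Kaggle Categorical Feature Encoding Challenge is an interesting competition designed to help participants learn basic encoding techniques for feature engineering. Its data contains many of the common kinds of non-numeric features, so I put together a short summary.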

Categorical Feature Encoding Challenge

1. Data overview

First, import the packages we need.

import string
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from pandas.api.types import CategoricalDtype 
from sklearn.linear_model import LogisticRegression

1.1 Loading the dataset

train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
train_data.head(2)

[Table: output of train_data.head(2), the first two rows of the training set]

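This is a binary classification dataset. Its features fall into four groups:

Columns starting with bin: binary features (e.g. 0/1, true/false, left/right)
Columns starting with nom: nominal features (e.g. color, shape)
Columns starting with ord: ordinal features (e.g. rank, temperature)
day and month: cyclical features

Each group gets its own encoding. Note that this dataset has no missing values.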
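
The grouping above can be sketched in code. This is only an illustration on a made-up toy frame that follows the competition's column naming; the helper name group_columns is hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy frame using the competition's column naming scheme (values are made up)
toy = pd.DataFrame({
    "bin_0": [0, 1], "bin_3": ["T", "F"],
    "nom_0": ["Green", "Red"], "ord_1": ["Novice", "Expert"],
    "day": [1, 7], "month": [12, 3],
})

def group_columns(df):
    """Split column names into the four feature families by prefix."""
    groups = {"binary": [], "nominal": [], "ordinal": [], "cyclical": []}
    for col in df.columns:
        if col.startswith("bin_"):
            groups["binary"].append(col)
        elif col.startswith("nom_"):
            groups["nominal"].append(col)
        elif col.startswith("ord_"):
            groups["ordinal"].append(col)
        elif col in ("day", "month"):
            groups["cyclical"].append(col)
    return groups

groups = group_columns(toy)
n_missing = int(toy.isna().sum().sum())  # total number of missing cells
print(groups)
print("missing values:", n_missing)
```

Running the same two checks on the real data frame is a quick way to confirm the column families and the absence of missing values before encoding anything.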

1.2 Binary features

The table above shows that bin_0, bin_1 and bin_2 are already 0/1, so we leave them unchanged. For bin_3 and bin_4 we use LabelEncoder to turn the letters into 0/1.

train_length = len(train_data)  # number of training rows
# Drop the target column (assumed to be the last column of train.csv), then stack
# train and test so that both go through the same feature engineering
data = pd.concat([train_data.iloc[:, :-1], test_data], axis=0)
data.bin_3 = LabelEncoder().fit_transform(data.bin_3)
data.bin_4 = LabelEncoder().fit_transform(data.bin_4)
bin_0_4 = data.loc[:, ['bin_0', 'bin_1', 'bin_2', 'bin_3', 'bin_4']]  # pull out the bin columns first
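
To see what LabelEncoder does to a two-valued column, here is a minimal sketch on made-up values. It assumes the column holds the letters T and F; the point is only that the classes are sorted before being numbered.

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
codes = enc.fit_transform(["T", "F", "F", "T"])

classes = enc.classes_.tolist()  # classes are sorted, so 'F' comes before 'T'
codes = codes.tolist()           # each value replaced by its index in classes
print(classes)  # ['F', 'T']
print(codes)    # [1, 0, 0, 1]
```

Because the numbering follows sort order, which letter becomes 1 is an accident of spelling. For a linear model on a binary column this only flips the sign of the coefficient.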

With bin_0 through bin_4 prepared, we run a logistic regression baseline on them. To keep things simple, accuracy is the metric here.

# 10-fold cross-validation; StratifiedKFold keeps the class ratio in each test fold the same as in the training data
tmp_data = bin_0_4.iloc[:train_length, :]
model = LogisticRegression()
kfold = StratifiedKFold(n_splits=10,random_state=42, shuffle=True)
metric = cross_val_score(model, tmp_data, train_data.target, cv=kfold, scoring="accuracy").mean()
print(f'Mean accuracy: {metric}')

[out]:
Mean accuracy: 0.6941200000000001

Using the bin features alone gives an accuracy of about 69%.
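
The effect of stratification can be checked on made-up labels. This sketch assumes a 70/30 class split, which is not the competition's actual ratio, and shows that every test fold keeps the same share of positives.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 70 + [1] * 30)  # made-up labels, 30% positives
X = np.zeros((len(y), 1))          # feature values do not matter for splitting

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
# Share of positives in each of the 10 test folds
fold_rates = [float(y[test_idx].mean()) for _, test_idx in skf.split(X, y)]
print(fold_rates)
```

With a plain KFold the positive rate would vary from fold to fold, which makes per-fold scores noisier on imbalanced targets.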

1.3 Nominal features, part 1

First, look at the distribution of the nominal features.

data.loc[:, ['nom_0', 'nom_1', 'nom_2', 'nom_3', 'nom_4', 'nom_5', 'nom_6', 'nom_7', 'nom_8', 'nom_9']].describe()

	nom_0	nom_1	nom_2	nom_3	nom_4	nom_5	nom_6	nom_7	nom_8	nom_9
count	500000	500000	500000	500000	500000	500000	500000	500000	500000	500000
unique	3	6	6	6	4	222	522	1220	2219	12068
top	Green	Trapezoid	Lion	Russia	Oboe	f7821e391	2ed5a94b0	fe27cc23d	c389000ab	21578b358
freq	212496	168431	168960	168480	153692	4623	1991	874	489	113

Look at the unique row. nom_0, nom_1, nom_2, nom_3 and nom_4 have few distinct values, so they are low-cardinality features, and we handle them with one-hot encoding. The remaining high-cardinality features are left for later.

# One-hot encoding is usually done with the sklearn API; for simplicity I use pandas get_dummies here (same result)
nom_0_4 = pd.get_dummies(data.loc[:, ['nom_0', 'nom_1', 'nom_2', 'nom_3', 'nom_4']])
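
As a rough check of the "same result" claim, the sketch below compares pd.get_dummies with sklearn's OneHotEncoder on a made-up column. The handle_unknown="ignore" setting is my choice for illustration, not something the original code uses.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({"nom_0": ["Green", "Red", "Blue", "Green"]})

# pandas: one indicator column per category, named <column>_<category>
dummies = pd.get_dummies(toy)

# sklearn: same indicators; categories unseen at fit time become all-zero rows
ohe = OneHotEncoder(handle_unknown="ignore")
encoded = ohe.fit_transform(toy).toarray()

print(dummies.columns.tolist())
print(encoded)
```

The main practical difference is that OneHotEncoder remembers the categories it was fitted on, so it can be applied to new data separately, whereas get_dummies needs train and test stacked together as done in this post.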

Now combine these with the binary bin features and run another baseline.

# Concatenate bin_0_4 and nom_0_4
tmp_data = pd.concat([bin_0_4, nom_0_4], axis=1).iloc[:train_length, :]
model = LogisticRegression()
kfold = StratifiedKFold(n_splits=10,random_state=42, shuffle=True)
metric = cross_val_score(model, tmp_data, train_data.target, cv=kfold, scoring="accuracy").mean()
print(f'Mean accuracy: {metric}')

[out]:
Mean accuracy: 0.6955533333333335

The result improves slightly. Next, we set the high-cardinality nominal features aside and handle the ordinal ord features.

1.4 Ordinal features, part 1

As before, look at the distribution of the ordinal features. ord_0 is already numeric, so we use it as is.

data.loc[:, ['ord_1', 'ord_2', 'ord_3', 'ord_4', 'ord_5']].describe()

	ord_1	ord_2	ord_3	ord_4	ord_5
count	500000	500000	500000	500000	500000
unique	5	6	15	26	192
top	Novice	Freezing	g	S	od
freq	210877	166065	60708	31773	8454

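ord_1, ord_2, ord_3 and ord_4 have few distinct values. ord_5 has many, so we skip it for now. Ordinal data could simply be one-hot encoded as well, but here we use something similar to LabelEncoder, except that the mapping from labels to numbers preserves their order.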

# This is effectively a LabelEncoder that is guaranteed to keep the order, e.g. from cold to hot: Freezing=0, Cold=1, Warm=2...
ord_1 = CategoricalDtype(categories=['Novice', 'Contributor','Expert', 
                                     'Master', 'Grandmaster'], ordered=True)
ord_2 = CategoricalDtype(categories=['Freezing', 'Cold', 'Warm', 'Hot',
                                     'Boiling Hot', 'Lava Hot'], ordered=True)
ord_3 = CategoricalDtype(categories=['a', 'b', 'c', 'd', 'e', 'f', 'g',
                                     'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o'], ordered=True)
ord_4 = CategoricalDtype(categories=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I',
                                     'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R',
                                     'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'], ordered=True)

ord_0_4 = data.loc[:, ['ord_0','ord_1', 'ord_2', 'ord_3','ord_4']].copy()  # copy so the assignments below do not touch data
ord_0_4.ord_1 = ord_0_4.ord_1.astype(ord_1)
ord_0_4.ord_2 = ord_0_4.ord_2.astype(ord_2)
ord_0_4.ord_3 = ord_0_4.ord_3.astype(ord_3)
ord_0_4.ord_4 = ord_0_4.ord_4.astype(ord_4)

# Convert the ordered categories to integer codes starting at 0
ord_0_4.ord_1 = ord_0_4.ord_1.cat.codes
ord_0_4.ord_2 = ord_0_4.ord_2.cat.codes
ord_0_4.ord_3 = ord_0_4.ord_3.cat.codes
ord_0_4.ord_4 = ord_0_4.ord_4.cat.codes
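
The difference between an ordered categorical and a plain LabelEncoder shows up on a small made-up series. This sketch uses four of the temperature labels; the variable names are illustrative only.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype
from sklearn.preprocessing import LabelEncoder

temps = pd.Series(["Cold", "Freezing", "Hot", "Warm"])
order = CategoricalDtype(categories=["Freezing", "Cold", "Warm", "Hot"], ordered=True)

# Codes follow the declared order: Freezing=0, Cold=1, Warm=2, Hot=3
ordered_codes = temps.astype(order).cat.codes.tolist()
# LabelEncoder numbers labels alphabetically: Cold=0, Freezing=1, Hot=2, Warm=3
alphabetical_codes = LabelEncoder().fit_transform(temps).tolist()

print(ordered_codes)       # [1, 0, 3, 2]
print(alphabetical_codes)  # [0, 1, 2, 3]
```

Only the first encoding lets a linear model treat "hotter" as "larger". Note that a label missing from the categories list would be encoded as -1, so the list has to cover every value in the column.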

Add the encoded ord_0, ord_1, ord_2, ord_3 and ord_4 and look at the baseline again.

tmp_data = pd.concat([bin_0_4, nom_0_4, ord_0_4], axis=1).iloc[:train_length, :]
model = LogisticRegression()
kfold = StratifiedKFold(n_splits=10,random_state=42, shuffle=True)
metric = cross_val_score(model, tmp_data, train_data.target, cv=kfold, scoring="accuracy").mean()
print(f'Mean accuracy: {metric}')

[out]:
Mean accuracy: 0.7273566666666667

Adding ord_0_4 gives a clear improvement.

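1.5 Cyclical features

The only columns left are day and month. A common approach for cyclical features like these is to apply sine and cosine transforms, which map each value to a point in two dimensions.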

# Convert day and month each into a sin and a cos column, giving four columns in total
def date_cyc_enc(df, col, max_vals):
    df[col + '_sin'] = np.sin(2 * np.pi * df[col]/max_vals)
    df[col + '_cos'] = np.cos(2 * np.pi * df[col]/max_vals)
    return df.loc[:, [col + '_sin', col + '_cos']]
df_day = date_cyc_enc(data, 'day',  7)
df_month = date_cyc_enc(data, 'month',  12)
day_month = pd.concat([df_day, df_month], axis=1)
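
Why two columns instead of one? A small sketch, using a hypothetical helper cyc that applies the same formula as date_cyc_enc to a single value: on the circle, December and January end up close together, while the raw values 12 and 1 are as far apart as possible.

```python
import numpy as np

def cyc(value, period):
    """Map a cyclical value to a point (sin, cos) on the unit circle."""
    angle = 2 * np.pi * value / period
    return np.array([np.sin(angle), np.cos(angle)])

dec, jan, jun = cyc(12, 12), cyc(1, 12), cyc(6, 12)

d_dec_jan = float(np.linalg.norm(dec - jan))  # neighbours across the year boundary
d_dec_jun = float(np.linalg.norm(dec - jun))  # opposite sides of the circle
print(d_dec_jan, d_dec_jun)
```

Both sin and cos are needed because either one alone maps two different months to the same value.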

Finally, test the baseline with day_month added.

tmp_data = pd.concat([bin_0_4, nom_0_4, ord_0_4, day_month], axis=1).iloc[:train_length, :]
model = LogisticRegression()
kfold = StratifiedKFold(n_splits=10,random_state=42, shuffle=True)
metric = cross_val_score(model, tmp_data, train_data.target, cv=kfold, scoring="accuracy").mean()
print(f'Mean accuracy: {metric}')

[out]:
Mean accuracy: 0.73155

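1.6 Submitting the results

1. You may have noticed that ord_5 and the high-cardinality nominal features nom_5, nom_6, nom_7, nom_8 and nom_9 have not been encoded. How these are encoded largely determines your score; I will cover them next time.
2. Next, I tune hyperparameters with xgboost and hyperopt, then submit the final result.
3. Keep in mind that in a competition, your machine's compute power is also an important factor.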

# I use the GPU build of XGBoost here, which is faster
from xgboost import XGBClassifier
from hyperopt import fmin, anneal,tpe, hp, space_eval, rand, Trials, partial, STATUS_OK

# Load the data
tmp_data = pd.concat([bin_0_4, nom_0_4, ord_0_4, day_month], axis=1).iloc[:train_length, :]
X, Y = tmp_data, train_data.target

def XGB(argsDict):
    # Rescale the raw samples from the search space into actual hyperparameter values
    max_depth = argsDict["max_depth"] + 1
    n_estimators = argsDict['n_estimators'] * 10+50
    learning_rate = argsDict["learning_rate"] * 0.02 + 0.05
    subsample = argsDict["subsample"] * 0.1 + 0.7
    min_child_weight = argsDict["min_child_weight"]+1
    reg_alpha = argsDict["reg_alpha"]
    reg_lambda = argsDict["reg_lambda"]
    colsample_bytree = argsDict["colsample_bytree"]

    gbm = XGBClassifier(tree_method='gpu_hist',  # GPU histogram tree method
                        max_bin=255,
                        objective="binary:logistic",
                        max_depth=max_depth,  # maximum tree depth
                        n_estimators=n_estimators,  # number of trees
                        learning_rate=learning_rate,  # learning rate
                        subsample=subsample,  # fraction of rows sampled per tree
                        min_child_weight=min_child_weight,  # minimum sum of instance weight in a child
                        max_delta_step=10,  # cap on each leaf's weight update (not early stopping)
                        reg_alpha=reg_alpha,
                        reg_lambda=reg_lambda,
                        colsample_bytree=colsample_bytree,
                       )
    kfold = StratifiedKFold(n_splits=5, random_state=42, shuffle=True)
    metric = cross_val_score(gbm, X, Y, cv=kfold, scoring="roc_auc_ovo_weighted").mean()

    print(f"XGB cross-validated AUC: {metric}")
    return -metric  # fmin minimizes, so return the negated score

# Search space
space = {
        "max_depth": hp.randint("max_depth", 15),  # integer in [0, 15)
        "n_estimators": hp.randint("n_estimators", 5),  # integer in [0, 5), i.e. 50 to 90 trees after rescaling
        "learning_rate": hp.uniform("learning_rate", 0.001, 2),  # uniform on [0.001, 2]
        "min_child_weight": hp.randint("min_child_weight", 5),
        "subsample": hp.randint("subsample", 4),
        "reg_alpha": hp.choice("reg_alpha", [1e-5, 1e-4, 1e-3, 1e-2, 0.1, 1]),
        "reg_lambda": hp.choice("reg_lambda", [1e-5, 1e-4, 1e-3, 1e-2, 0.1, 1, 10, 100]),
        "colsample_bytree": hp.choice("colsample_bytree", [0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0]),
        }

max_evals = 10
algo = partial(tpe.suggest, n_startup_jobs=1)  # use TPE (tpe.suggest) as the optimization algorithm
best = fmin(XGB, space, algo=algo, max_evals=max_evals)  # max_evals is the number of models to try; larger values make a good optimum more likely

[out]:
XGB cross-validated AUC: 0.7286977297104635
XGB cross-validated AUC: 0.7278963951265259
XGB cross-validated AUC: 0.7305032975469229
XGB cross-validated AUC: 0.7315826599442659
XGB cross-validated AUC: 0.7317137817745843
XGB cross-validated AUC: 0.7283892825120473
XGB cross-validated AUC: 0.7371492660965547
XGB cross-validated AUC: 0.7368506514127253
XGB cross-validated AUC: 0.7364754330553307
XGB cross-validated AUC: 0.6997757005147975
100%|██████████| 10/10 [04:19<00:00, 13.58s/it, best loss: -0.7371492660965547]
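
Because the objective rescales the raw samples, the ranges actually searched differ from the numbers in the search space. The sketch below works them out in plain Python, with no hyperopt needed. It assumes that hp.randint(label, upper) yields integers from 0 to upper - 1 and mirrors the arithmetic in XGB().

```python
# Raw integer values each hp.randint parameter can take: 0 .. upper - 1
raw = {
    "max_depth": range(15),
    "n_estimators": range(5),
    "min_child_weight": range(5),
    "subsample": range(4),
}

# The same rescaling applied at the top of XGB()
rescale = {
    "max_depth": lambda v: v + 1,
    "n_estimators": lambda v: v * 10 + 50,
    "min_child_weight": lambda v: v + 1,
    "subsample": lambda v: round(v * 0.1 + 0.7, 2),  # rounded for display only
}

effective = {name: [rescale[name](v) for v in values] for name, values in raw.items()}

# learning_rate is continuous: uniform(0.001, 2) * 0.02 + 0.05
lr_low, lr_high = 0.001 * 0.02 + 0.05, 2 * 0.02 + 0.05

for name, values in effective.items():
    print(name, values)
print("learning_rate range:", lr_low, lr_high)
```

So the search covers only 50 to 90 trees and a learning rate between roughly 0.05 and 0.09, which is worth knowing before concluding that the best trial is near a true optimum.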

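The best cross-validated AUC is 0.7371492660965547, which gives us the best parameters found so far. Next, we train on the full training set and predict on the test set.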

def RECOVERXGB(argsDict):  # map fmin's raw result back to the actual hyperparameter values
    from copy import deepcopy
    best = deepcopy(argsDict)
    best["max_depth"] = best["max_depth"] + 1
    best['n_estimators'] = best['n_estimators'] * 10 + 50
    best["learning_rate"] = best["learning_rate"] * 0.02 + 0.05
    best["subsample"] = best["subsample"] * 0.1 + 0.7
    best["min_child_weight"] = best["min_child_weight"] + 1
    best["colsample_bytree"] = [0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0][best["colsample_bytree"]]
    best["reg_alpha"] = [1e-5, 1e-4, 1e-3, 1e-2, 0.1, 1][best["reg_alpha"]]
    best["reg_lambda"] = [1e-5, 1e-4, 1e-3, 1e-2, 0.1, 1, 10, 100][best["reg_lambda"]]
    return best

# With the final parameters in hand, train the model
def TRAINXGB(X, Y, argsDict):
    max_depth = argsDict["max_depth"]
    n_estimators = argsDict['n_estimators']
    learning_rate = argsDict["learning_rate"]
    subsample = argsDict["subsample"]
    min_child_weight = argsDict["min_child_weight"]
    reg_alpha = argsDict["reg_alpha"]
    reg_lambda = argsDict["reg_lambda"]
    colsample_bytree = argsDict["colsample_bytree"]
    gbm = XGBClassifier(tree_method='gpu_hist',
                        max_bin=800,
                        objective="binary:logistic",
                        n_jobs=16,
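                        max_depth=max_depth,  # maximum tree depth
                        n_estimators=n_estimators,  # number of trees
                        learning_rate=learning_rate,  # learning rate
                        subsample=subsample,  # fraction of rows sampled per tree
                        min_child_weight=min_child_weight,  # minimum sum of instance weight in a child
                        max_delta_step=10,  # cap on each leaf's weight update (not early stopping)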
                        reg_alpha=reg_alpha,
                        reg_lambda=reg_lambda,
                        colsample_bytree=colsample_bytree,
                       )
    gbm.fit(X, Y)
    return gbm

best_params = RECOVERXGB(best)
model = TRAINXGB(X, Y, best_params)
test_data = pd.concat([bin_0_4, nom_0_4, ord_0_4, day_month], axis=1).iloc[train_length:, :]
submission = pd.read_csv('/content/drive/My Drive/kaggle/Categorical_Feature_Encoding_Challenge/sample_submission.csv')
# predict_proba returns one column per class; column 1 is the probability of target == 1
submission.target = model.predict_proba(test_data)[:, 1]
submission.to_csv('submission.csv', index=False)
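
A note on predict_proba: for a binary classifier it returns one column per class, so a submission column needs a single column selected from it. The sketch below shows this on made-up data, using LogisticRegression as a stand-in for the XGBoost model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up one-feature data with two classes
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X_toy, y_toy)
proba = clf.predict_proba(X_toy)  # shape (n_samples, 2), columns ordered as clf.classes_
positive = proba[:, 1]            # probability of class 1, one value per row

print(clf.classes_.tolist(), proba.shape, positive.shape)
```

The column order always follows classes_, so checking it once avoids submitting the probability of the wrong class.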

Last edited: 2020.01.20 18:04:33
