Tianchi Newcomer Competition: O2O Coupon Usage Prediction

Competition Overview

………

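The competition provides participants with rich data from O2O (online-to-offline) scenarios. The goal is to analyze the data and build a model that accurately predicts whether a user will use a given coupon within the specified time.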

Data

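The competition provides real online and offline consumption records of users from January 1, 2016 to June 30, 2016. The task is to predict whether a user redeems a coupon within 15 days of receiving it in July 2016. For evaluation, the AUC of the redemption predictions is computed separately for each coupon (coupon_id), and the average of these per-coupon AUC values is the final score.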
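
The averaging step can be sketched in plain Python. This is only an illustration of the metric as described above, not the official scoring script: the helper names `pairwise_auc` and `mean_coupon_auc` are hypothetical, and skipping coupons that contain only one class is an assumption, since AUC is undefined for them and the text above does not say how such coupons are handled.

```python
from collections import defaultdict


def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined when only one class is present
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_coupon_auc(coupon_ids, labels, scores):
    """Average the AUC over coupons; single-class coupons are skipped (an assumption)."""
    groups = defaultdict(lambda: ([], []))
    for cid, y, s in zip(coupon_ids, labels, scores):
        groups[cid][0].append(y)
        groups[cid][1].append(s)
    aucs = [pairwise_auc(ys, ss) for ys, ss in groups.values()]
    aucs = [a for a in aucs if a is not None]
    return sum(aucs) / len(aucs) if aucs else None


# Toy data: coupon A scores 1.0, coupon B scores 0.75, coupon C has no positives.
coupon_ids = ["A", "A", "A", "B", "B", "B", "B", "C"]
labels = [1, 0, 0, 1, 1, 0, 0, 0]
scores = [0.9, 0.2, 0.4, 0.3, 0.8, 0.5, 0.1, 0.7]
print(mean_coupon_auc(coupon_ids, labels, scores))
```

Because the score is averaged per coupon, a model only has to rank users correctly within each coupon; the absolute scale of its scores across different coupons does not matter.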

ROC Curve

First, a quick review of the confusion matrix:

| | Predicted Positive (1) | Predicted Negative (0) |
| --- | --- | --- |
| Actual Positive (1) | TP | FN |
| Actual Negative (0) | FP | TN |

From it we get two more evaluation metrics, the true positive rate (TPRate) and the false positive rate (FPRate):
$$\begin{aligned} TPRate&=\frac{TP}{TP+FN}\\ FPRate&=\frac{FP}{FP+TN} \end{aligned}$$

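TPRate is the fraction of samples whose true class is $1$ that are predicted as class $1$.

FPRate is the fraction of samples whose true class is $0$ that are predicted as class $1$.

The ROC curve plots FPRate on the horizontal axis against TPRate on the vertical axis, with one point for each decision threshold.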
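
To make the two rates concrete, here is a small sketch that counts TP, FN, FP and TN at a given threshold and returns one point of the ROC curve. The function name `rates_at_threshold` and the toy data are made up for illustration, and the sketch assumes both classes are present (otherwise a denominator would be zero).

```python
def rates_at_threshold(labels, scores, threshold):
    """Return (TPRate, FPRate) when scores >= threshold are predicted as class 1."""
    tp = fn = fp = tn = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1 and predicted == 0:
            fn += 1
        elif y == 0 and predicted == 1:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), fp / (fp + tn)


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
for threshold in (0.8, 0.5, 0.2):
    tpr, fpr = rates_at_threshold(labels, scores, threshold)
    print(f"threshold={threshold}: TPRate={tpr:.3f}, FPRate={fpr:.3f}")
```

Lowering the threshold moves the point up and to the right: more positives are caught, at the cost of more false alarms.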

AUC

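AUC is the area under the ROC curve. It lies between 0 and 1, and random guessing scores about 0.5. The larger the AUC, the more reliably the model ranks positive samples above negative ones.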
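
As a rough illustration of "area under the curve", the sketch below builds the ROC points by lowering the threshold through every distinct score and then sums the trapezoids between neighboring points. The names `roc_points` and `trapezoid_auc` are hypothetical, labels are assumed to be 0/1 integers, and both classes are assumed to be present.

```python
def roc_points(labels, scores):
    """ROC points (FPRate, TPRate), one per distinct score used as the threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_auc(points):
    """Area under a piecewise-linear curve given as points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
points = roc_points(labels, scores)
print(points)
print(trapezoid_auc(points))
```

For this toy data, 8 of the 9 (positive, negative) pairs are ranked correctly, and the area comes out to the same value, 8/9.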

Field Table

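| Field | Description |
| --- | --- |
| User_id | User ID |
| Merchant_id | Merchant ID |
| Coupon_id | Coupon ID. null means a purchase made without a coupon, in which case Discount_rate and Date_received are meaningless |
| Discount_rate | Discount. $x \in [0,1]$ is a discount rate; x:y means y off a purchase of at least x. Amounts are in yuan |
| Distance | The distance from the place the user is usually active to the merchant's nearest store is x*500 meters (for a chain, the nearest store is used), $x \in [0,10]$. null means no information, 0 means less than 500 meters, 10 means more than 5 km |
| Date_received | Date the coupon was received |
| Date | Purchase date. If Date = null & Coupon_id != null, the coupon was received but not used (a negative sample). If Date != null & Coupon_id = null, this is an ordinary purchase date. If Date != null & Coupon_id != null, this is the date the coupon was used (a positive sample) |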
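
The three cases in the Date row can be written out as a small decision function. This is only a sketch of the rule in the table: `record_type` is a hypothetical name, `None` stands in for null, and the case where both fields are null is not described by the table, so it is returned as "unspecified".

```python
def record_type(coupon_id, date):
    """Classify one offline record from its Coupon_id and Date fields (None means null)."""
    if coupon_id is not None and date is None:
        return "negative"    # coupon received but never used
    if coupon_id is None and date is not None:
        return "ordinary"    # purchase without a coupon
    if coupon_id is not None and date is not None:
        return "positive"    # coupon used on this date
    return "unspecified"     # both null: not covered by the field table


print(record_type("1234", None))
print(record_type(None, "20160520"))
print(record_type("1234", "20160520"))
```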

Imports

import os, sys, pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import date
from sklearn.linear_model import SGDClassifier, LogisticRegression

Loading the Data

dfoff = pd.read_csv('data/ccf_offline_stage1_train.csv')
dftest = pd.read_csv('data/ccf_offline_stage1_test_revised.csv')

Feature Engineering

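# For "spend x, save y" (x:y) coupons, add features: discount type, threshold, reduction, and discount rate
def getDiscountType(row):
    if pd.isnull(row):
        return np.nan
    elif ':' in str(row):     # str() guards against non-string values, as in the functions below
        return 1
    else:
        return 0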
    
def convertRate(row):
    if pd.isnull(row):
        return 1.0
    elif ':' in str(row):
        rows = row.split(':')
        return 1.0 - float(rows[1]) / float(rows[0])
    else:
        return float(row)

def getDiscountMan(row):
    if ':' in str(row):
        rows = row.split(':')
        return int(rows[0])
    else:
        return 0
    
def getDiscountJian(row):
    if ':' in str(row):
        rows = row.split(':')
        return int(rows[1])
    else:
        return 0
#-----#    
def processData(df):     # Bundle the functions above and run them with apply()
    df['discount_rate'] = df['Discount_rate'].apply(convertRate)
    df['discount_man'] = df['Discount_rate'].apply(getDiscountMan)
    df['discount_jian'] = df['Discount_rate'].apply(getDiscountJian)
    df['discount_type'] = df['Discount_rate'].apply(getDiscountType)
    df['distance'] = df['Distance'].fillna(-1).astype(int)
    return df

dfoff = processData(dfoff)
dftest = processData(dftest)
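
To see what these features look like for concrete values, here is a stdlib-only sketch that parses a single Discount_rate value into all four features at once. It mirrors the logic of the functions above but is not the code used in this post: `DiscountFeatures` and `parse_discount` are hypothetical names, and `None` stands in for a missing value.

```python
from typing import NamedTuple, Optional


class DiscountFeatures(NamedTuple):
    rate: float                   # effective discount rate
    man: int                      # spending threshold x in "x:y", else 0
    jian: int                     # reduction y in "x:y", else 0
    discount_type: Optional[int]  # 1 for "x:y", 0 for a plain rate, None if missing


def parse_discount(raw):
    """Parse one Discount_rate value; None is treated as 'no coupon'."""
    if raw is None:
        return DiscountFeatures(1.0, 0, 0, None)
    text = str(raw)
    if ":" in text:
        man, jian = (int(part) for part in text.split(":"))
        return DiscountFeatures(1.0 - jian / man, man, jian, 1)
    return DiscountFeatures(float(text), 0, 0, 0)


for value in ("150:20", "0.95", None):
    print(value, parse_discount(value))
```

A "150:20" coupon becomes a rate of 1 - 20/150, roughly 0.867, so "spend x, save y" coupons and plain discount rates end up on the same scale.
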
# Process the coupon receipt date and add features: day of the week, and whether it is a weekend
def getWeekday(row):
    if row == 'nan':
        return np.nan
    else:
        return date(int(row[:4]), int(row[4:6]), int(row[6:8])).weekday() + 1
    
dfoff['weekday'] = dfoff['Date_received'].astype(str).apply(getWeekday)
dftest['weekday'] = dftest['Date_received'].astype(str).apply(getWeekday)

dfoff['weekday_type'] = dfoff['weekday'].apply(lambda x : 1 if x in [6, 7] else 0)
dftest['weekday_type'] = dftest['weekday'].apply(lambda x : 1 if x in [6, 7] else 0)
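
As a quick sanity check on the weekday numbering, the sketch below computes the same 1 (Monday) to 7 (Sunday) value with `datetime.strptime` and `isoweekday()`, which is equivalent to `weekday() + 1`. The helper name `weekday_from_yyyymmdd` is hypothetical; slicing to the first 8 characters is there because a float date converted with `astype(str)` looks like '20160528.0'.

```python
from datetime import datetime


def weekday_from_yyyymmdd(text):
    """Return 1 (Monday) to 7 (Sunday) for a string that starts with 'YYYYMMDD'."""
    return datetime.strptime(text[:8], "%Y%m%d").isoweekday()


def is_weekend(weekday):
    """Same rule as weekday_type: Saturday (6) and Sunday (7) count as weekend."""
    return 1 if weekday in (6, 7) else 0


for text in ("20160101", "20160528.0", "20160703"):
    day = weekday_from_yyyymmdd(text)
    print(text, day, is_weekend(day))
```
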
# One-hot encode the day-of-week feature
weekdaycols = ['weekday_' + str(i) for i in range(1, 8)]

tmpdf = pd.get_dummies(dfoff['weekday'].replace('nan', np.nan))
tmpdf.columns = weekdaycols
dfoff[weekdaycols] = tmpdf

tmpdf = pd.get_dummies(dftest['weekday'].replace('nan', np.nan))
tmpdf.columns = weekdaycols
dftest[weekdaycols] = tmpdf
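# Build the label.
# Date = null & Coupon_id != null: coupon received but not used (negative sample).
# Date != null & Coupon_id = null: ordinary purchase (label -1, dropped later).
# Date != null & Coupon_id != null: coupon used; it counts as a positive sample only if used within 15 days of receipt.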
def label(row):
    if pd.isnull(row['Date_received']):
        return -1
    if pd.notnull(row['Date']):
        td = pd.to_datetime(row['Date'], format = '%Y%m%d') - pd.to_datetime(row['Date_received'], format = '%Y%m%d')
        if td <= pd.Timedelta(15, 'D'):
            return 1
    return 0

dfoff['label'] = dfoff.apply(label, axis = 1)
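
The boundary of the 15-day window is easy to get wrong, so here is a stdlib-only sketch of the same rule applied to single records. `coupon_label` is a hypothetical name, dates are assumed to be 'YYYYMMDD' strings, and `None` stands in for null. As in the function above, a coupon used exactly 15 days after receipt still counts as positive.

```python
from datetime import datetime, timedelta


def coupon_label(date_received, date_used, window_days=15):
    """Return -1 (no coupon), 1 (used within the window) or 0 (not used in time)."""
    if date_received is None:
        return -1
    if date_used is not None:
        gap = datetime.strptime(date_used, "%Y%m%d") - datetime.strptime(date_received, "%Y%m%d")
        if gap <= timedelta(days=window_days):
            return 1
    return 0


print(coupon_label("20160601", "20160616"))  # used on day 15
print(coupon_label("20160601", "20160617"))  # used on day 16
print(coupon_label("20160601", None))        # never used
print(coupon_label(None, "20160601"))        # ordinary purchase
```
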
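# Split into training and validation sets. Data with a time dimension like this should be split by whole contiguous periods, not shuffled at random.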
df = dfoff[dfoff['label'] != -1].copy()
train = df[(df['Date_received'] < 20160516)].copy()
valid = df[(df['Date_received'] >= 20160516) & (df['Date_received'] <= 20160615)].copy()
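
The same split can be sketched on plain dictionaries to show which rows go where. `split_by_receive_date` is a hypothetical helper; the cutoff dates are the ones used above. Rows received after June 15 land in neither set, presumably because the data ends on June 30 and their full 15-day window cannot be observed.

```python
def split_by_receive_date(rows, cutoff=20160516, last_valid_day=20160615):
    """Earlier receipts train the model; the following month validates it."""
    train = [r for r in rows if r["Date_received"] < cutoff]
    valid = [r for r in rows if cutoff <= r["Date_received"] <= last_valid_day]
    return train, valid


rows = [
    {"Date_received": 20160301, "label": 0},
    {"Date_received": 20160515, "label": 1},
    {"Date_received": 20160516, "label": 0},
    {"Date_received": 20160615, "label": 1},
    {"Date_received": 20160620, "label": 0},  # falls in neither set
]
train_rows, valid_rows = split_by_receive_date(rows)
print(len(train_rows), len(valid_rows))
```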

Training the Model

# 'weekday_type' is listed once; a repeated name would select the same column twice
original_feature = ['discount_rate', 'discount_man', 'discount_jian', 'distance', 'weekday', 'weekday_type'] + weekdaycols

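# loss='log_loss' fits a logistic regression; scikit-learn versions before 1.1 call this loss 'log'
model = SGDClassifier(loss = 'log_loss', penalty = 'elasticnet', fit_intercept = True, max_iter = 100, shuffle = True, alpha = 0.01, l1_ratio = 0.01, n_jobs = 1, class_weight = None)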

model.fit(train[original_feature], train['label'])
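
With a logistic loss, SGDClassifier fits a logistic regression by stochastic gradient descent: it visits the samples one at a time in shuffled order and nudges the weights against the gradient of the loss. The sketch below shows that idea in plain Python. It is a simplification and not scikit-learn's implementation: it uses a fixed learning rate and a plain L2 penalty instead of elastic net, and the names `sgd_logistic`, `lr`, and `seed` are made up for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def sgd_logistic(X, y, epochs=200, lr=0.1, alpha=0.01, seed=0):
    """Logistic regression trained one sample at a time, with an L2 penalty on the weights."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit samples in a new random order each epoch
        for i in order:
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            err = p - y[i]  # gradient of the log loss with respect to the linear score
            w = [wj - lr * (err * xj + alpha * wj) for wj, xj in zip(w, X[i])]
            b -= lr * err   # the intercept is not penalized
    return w, b


# Toy one-feature data: small values are class 0, large values are class 1.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
w, b = sgd_logistic(X, y)
probs = [sigmoid(w[0] * x[0] + b) for x in X]
print(w, b)
print(probs)
```

The predicted probability rises with the feature value, which is the kind of score the AUC metric needs for ranking.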

As for the SGDClassifier function itself: to be written up tomorrow.