作业描述

采集了台湾环境监测所的数据。
要求:根据前9小时的数据,用线性回归来预测第10个小时的PM2.5的数值。

任务要求

输入:9个小时的数据,共18项特征(AMB_TEMP, CH4, CO, NHMC, NO, NO2, NOx, O3, PM10, PM2.5, RAINFALL, RH, SO2, THC, WD_HR, WIND_DIREC, WIND_SPEED, WS_HR)

输出:第10小时的PM2.5数值

模型:线性回归

模型搭建

Load 'train.csv’

本次作业使用了某个检测站一年的观测数据,数据中每个小时有18个观测指标,将其作为特征。将数据分为train.csv和test.csv,train.csv是该检测站每个月前20天的所有数据,test.csv是从该检测站剩余数据中取样出的部分数据。

  • train.csv:每个月前20天的完整数据
  • test.csv:从剩下的数据中取样连续的10小时为一组,前9小时所有观测数据当做feature,第10小时的PM2.5当做answer。一共取出240组不重复的test data,请根据feature预测这240组的PM2.5.

所有的数据含有18个观测数据(特征):AMB_TEMP, CH4, CO, NHMC, NO, NO2, NOx, O3, PM10, PM2.5, RAINFALL, RH, SO2, THC, WD_HR, WIND_DIREC, WIND_SPEED, WS_HR。

import sys
import pandas as pd
import numpy as np
from google.colab import drive 
!gdown --id '1wNKAxQ29G15kgpBy_asjTcZRRgmsCZRm' --output data.zip
!unzip data.zip
# data = pd.read_csv('gdrive/My Drive/hw1-regression/train.csv', header = None, encoding = 'big5')
data = pd.read_csv('./train.csv', encoding = 'big5')
Downloading...
From: https://drive.google.com/uc?id=1wNKAxQ29G15kgpBy_asjTcZRRgmsCZRm
To: /content/data.zip
100% 177k/177k [00:00<00:00, 53.8MB/s]
Archive:  data.zip
replace test.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: test.csv                
replace train.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: train.csv               

Preprocessing

训练集中前三列不是数值部分,因此要从第四列开始取值,且需要将RAINFALL行的内容‘NR’替换为数值(以0代替)。

# 无用数据去除
data = data.iloc[:, 3:] # 去掉0-3列
data[data == 'NR'] = 0 # 数据中值为“NR”的数据全部赋值为0,去除特殊值。
raw_data = data.to_numpy()
data.head()
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
0 14 14 14 13 12 12 12 12 15 17 20 22 22 22 22 22 21 19 17 16 15 15 15 15
1 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8
2 0.51 0.41 0.39 0.37 0.35 0.3 0.37 0.47 0.78 0.74 0.59 0.52 0.41 0.4 0.37 0.37 0.47 0.69 0.56 0.45 0.38 0.35 0.36 0.32
3 0.2 0.15 0.13 0.12 0.11 0.06 0.1 0.13 0.26 0.23 0.2 0.18 0.12 0.11 0.1 0.13 0.14 0.23 0.18 0.12 0.1 0.09 0.1 0.08
4 0.9 0.6 0.5 1.7 1.8 1.5 1.9 2.2 6.6 7.9 4.2 2.9 3.4 3 2.5 2.2 2.5 2.3 2.1 1.9 1.5 1.6 1.8 1.5

提取特征(1)

在这里插入图片描述

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-XT5kWTxZ-1605095864678)(https://drive.google.com/uc?id=1ZroBarcnlsr85gibeqEF-MtY13xJTG47)]

將原始 4320 * 18 的資料依照每個月分重組成 12 個 18 (features) * 480 (hours) 的資料。

month_data = {}
for month in range(12):
    sample = np.empty([18,480])
    for day in range(20):
        sample[:, day * 24 :(day+1) * 24] = raw_data[18 * (20 * month + day) : 18 * (20 * month + day+1), :]
    month_data[month] = sample

特征提取(2)

由于最终目的是根据前9小时的数据预测第10小时的PM2.5,所以要把10小时的数据分为一组,其中前9小时的18*9个特征作为输入x,第10个小时的第10个特征作为输出y。上述转换后的18*480数组中,480列依次为同一个月1号到20号各小时的数据,也就是480个小时的数据。以1月份的数据为例,1-10列是1-10小时的数据,作为第1组数据;2-11列是2-10小时的数据,作为第2组数据;以步长为1的方式取值到第480列为止。如下图所示。
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-yCMcEMSK-1605095864679)(https://drive.google.com/uc?id=1wKoPuaRHoX682LMiBgIoOP4PDyNKsJLK)]

这样,一个月就有471组数据,12个月就有12*471组数据,每组数据中,前9小时的数据(18*9的数组)转换成18*9维行向量作为输出;将第10小时的数据(18维列向量)中的第10行的数据(即第10个特征PM2.5的值)作为输出。12个月的12*471组数据的输出放入总的输入x中(12*471行,18*9列的数组),同理输出y为12*471维列向量。

这一步的工作内容为获取x和y这两个数据,就相当于得到训练的数据与目标的结果。
在这里插入图片描述

reshape(1, -1):转换成一个行向量

x = np.empty([12 * 471, 18 * 9], dtype=float)
y = np.empty([12 * 471, 1], dtype=float)
for month in range(12):
    for day in range(20):
        for hour in range(24):
            if day == 19 and hour > 14:
                continue
            x[month * 471 + day * 24 + hour, :] = month_data[month][:,day * 24 + hour : day *24 + hour + 9].reshape(1,-1)
            y[month * 471 + day * 24 + hour, 0] = month_data[month][9, day * 24 + hour + 9]

Normalize 归一化

mean_x = np.mean(x, axis=0)代码解释:
axis=0:对x数据集中的每一列求平均值。

mean_x = np.mean(x, axis=0)
std_x = np.std(x, axis=0)
for i in range(len(x)):
    for j in range(len(x[0])):
        if std_x[j] != 0:
            x[i][j] = (x[i][j] - mean_x[j] / std_x[j])

train 训练

在这里插入图片描述
在这里插入图片描述
在这里插入图片描述
Adagrad讲解

和上图不同处: 下面Loss的代码用到的是 Root Mean Square Error

因为存在常数项b,所以维度(dim)需要多加一列;eps项是极小值,避免adagrad的分母为0.

每一个维度(dim)会对应到各自的gradient和权重w,通过一次次的迭代(iter_time)学习。最终,将训练得到的模型(权重w)存储为.npy格式的文件。

模型:线性函数

输入:一次18*9个数据

参数:18*9个权重+一个偏置

损失函数:均方根误差(Root Mean Square Error)

优化算法:梯度下降、Adagrad

np.concatenate((a,b),axis=1) : 进行行拼接,拼接结果是[a,b]


x.transpose():将x转置。

模型:线性回归
y = [ y 1 y 2 ⋮ y m ] = [ b + w 1 x 1 , 1 + w 2 x 1 , 2 + … + w n x 1 , n b + w 1 x 2 , 1 + w 2 x 2 , 2 + … + w n x 2 , n ⋮ b + w 1 x m , 1 + w 2 x m , 2 + … + w n x m , n ] \mathbf{y}=\left[\begin{array}{c} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \end{array}\right]=\left[\begin{array}{c} b+w_{1} x_{1,1}+w_{2} x_{1,2}+\ldots+w_{n} x_{1, n} \\ b+w_{1} x_{2,1}+w_{2} x_{2,2}+\ldots+w_{n} x_{2, n} \\ \vdots \\ b+w_{1} x_{m, 1}+w_{2} x_{m, 2}+\ldots+w_{n} x_{m, n} \end{array}\right] y=y1y2ym=b+w1x1,1+w2x1,2++wnx1,nb+w1x2,1+w2x2,2++wnx2,nb+w1xm,1+w2xm,2++wnxm,n
= [ 1 x 1 , 1 x 1 , 2 ⋯ x 1 , n 1 x 2 , 1 x 2 , 2 ⋯ x 2 , n ⋮ ⋮ ⋮ ⋱ ⋮ 1 x m , 1 x m , 2 ⋯ x m , n ] [ b w 1 w 2 ⋮ w n ] = [ x 1 x 2 ⋮ x m ] w = X w =\left[\begin{array}{ccccc} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1, n} \\ 1 & x_{2,1} & x_{2,2} & \cdots & x_{2, n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{m, 1} & x_{m, 2} & \cdots & x_{m, n} \end{array}\right]\left[\begin{array}{c} b \\ w_{1} \\ w_{2} \\ \vdots \\ w_{n} \end{array}\right]=\left[\begin{array}{c} \mathbf{x}_{1} \\ \mathbf{x}_{2} \\ \vdots \\ \mathbf{x}_{m} \end{array}\right] \mathbf{w}=\mathbf{X} \mathbf{w} =111x1,1x2,1xm,1x1,2x2,2xm,2x1,nx2,nxm,nbw1w2wn=x1x2xmw=Xw
损失函数:均方根误差
L ( w ) = 1 m ∑ i = 1 m ( y i − y ^ i ) 2 = 1 m ∑ i = 1 m ( x i w − y ^ i ) 2 , m = 12 ∗ 471 L(\mathbf{w})=\sqrt{\frac{1}{m} \sum_{i=1}^{m}\left(y_{i}-\hat{y}_{i}\right)^{2}}=\sqrt{\frac{1}{m} \sum_{i=1}^{m}\left(\mathbf{x}_{i} \mathbf{w}-\hat{y}_{i}\right)^{2}}, \quad m=12 * 471 L(w)=m1i=1m(yiy^i)2 =m1i=1m(xiwy^i)2 ,m=12471
优化方法:Adagrad
w t + 1 = w t − η ∑ i = 0 t ( ∂ L ( w i ) ∂ w i ) 2 ∂ L ( w t ) ∂ w t \mathbf{w}^{t+1}=\mathbf{w}^{t}-\frac{\eta}{\sqrt{\sum_{i=0}^{t}\left(\frac{\partial L\left(\mathbf{w}^{i}\right)}{\partial \mathbf{w}^{i}}\right)^{2}}} \frac{\partial L\left(\mathbf{w}^{t}\right)}{\partial \mathbf{w}^{t}} wt+1=wti=0t(wiL(wi))2 ηwtL(wt)

∂ L ( w t ) ∂ w t = [ ∂ L ( w t ) ∂ b ( w t ) ∂ U 1 t 1 ⋮ ∂ L ( w t ) ∂ w n t ] = 2 [ 1 ( x 1 w − y ^ 1 ) + 1 ( x 2 w − y ^ 2 ) + ⋯ + 1 ( x m w − y ^ m ) x 1 , 1 ( x 1 w − y ^ 1 ) + x 2 , 1 ( x 2 w − y ^ 2 ) + ⋯ + x m , 1 ( x m w − y ^ m ) ⋮ x 1 , n ( x 1 w − y ^ 1 ) + x 2 , n ( x 2 w − y ^ 2 ) + ⋯ + x m , n ( x m w − y ^ m ) ] \frac{\partial L\left(\mathbf{w}^{t}\right)}{\partial \mathbf{w}^{t}}=\left[\begin{array}{c} \frac{\partial L\left(\mathbf{w}^{t}\right)}{\partial b\left(\mathbf{w}^{t}\right)} \\ \frac{\partial U_{1}^{t}}{1} \\ \vdots \\ \frac{\partial L\left(\mathbf{w}^{t}\right)}{\partial w_{n}^{t}} \end{array}\right]=2\left[\begin{array}{c} 1\left(\mathbf{x}_{1} \mathbf{w}-\hat{y}_{1}\right)+1\left(\mathbf{x}_{2} \mathbf{w}-\hat{y}_{2}\right)+\cdots+1\left(\mathbf{x}_{m} \mathbf{w}-\hat{y}_{m}\right) \\ x_{1,1}\left(\mathbf{x}_{1} \mathbf{w}-\hat{y}_{1}\right)+x_{2,1}\left(\mathbf{x}_{2} \mathbf{w}-\hat{y}_{2}\right)+\cdots+x_{m, 1}\left(\mathbf{x}_{m} \mathbf{w}-\hat{y}_{m}\right) \\ \vdots \\ x_{1, n}\left(\mathbf{x}_{1} \mathbf{w}-\hat{y}_{1}\right)+x_{2, n}\left(\mathbf{x}_{2} \mathbf{w}-\hat{y}_{2}\right)+\cdots+x_{m, n}\left(\mathbf{x}_{m} \mathbf{w}-\hat{y}_{m}\right) \end{array}\right] wtL(wt)=b(wt)L(wt)1U1twntL(wt)=21(x1wy^1)+1(x2wy^2)++1(xmwy^m)x1,1(x1wy^1)+x2,1(x2wy^2)++xm,1(xmwy^m)x1,n(x1wy^1)+x2,n(x2wy^2)++xm,n(xmwy^m)
= 2 [ 1 1 ⋯ 1 x 1 , 1 x 2 , 1 ⋯ x m , 1 x 1 , 2 x 2 , 2 ⋯ x m , 2 ⋮ ⋮ ⋱ ⋮ x 1 , n x 2 , n ⋯ x m , n ] ( [ x 1 w x 2 w ⋮ x m w ] − [ y ^ 1 y ^ 2 ⋮ y ^ m ] ) = 2 X T ( X w − y ^ ) =2\left[\begin{array}{cccc} 1 & 1 & \cdots & 1 \\ x_{1,1} & x_{2,1} & \cdots & x_{m, 1} \\ x_{1,2} & x_{2,2} & \cdots & x_{m, 2} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1, n} & x_{2, n} & \cdots & x_{m, n} \end{array}\right]\left(\left[\begin{array}{c} \mathbf{x}_{1} \mathbf{w} \\ \mathbf{x}_{2} \mathbf{w} \\ \vdots \\ \mathbf{x}_{m} \mathbf{w} \end{array}\right]-\left[\begin{array}{c} \hat{y}_{1} \\ \hat{y}_{2} \\ \vdots \\ \hat{y}_{m} \end{array}\right]\right)=2 \mathbf{X}^{T}(\mathbf{X} \mathbf{w}-\hat{\mathbf{y}}) =21x1,1x1,2x1,n1x2,1x2,2x2,n1xm,1xm,2xm,nx1wx2wxmwy^1y^2y^m=2XT(Xwy^)

dim = 18 * 9 + 1 #参数:18*9个权重 + 1个偏置bias
w = np.zeros([dim, 1]) #参数列向量初始化为0
#x新增一列用来与偏置bias相乘,将模型y=wx+b转化成y=[b,w]转置·[1,x]
x = np.concatenate((np.ones([12 * 471, 1]), x), axis = 1).astype(float)
learning_rate = 2 
iter_time = 6000
adagrad = np.zeros([dim, 1])
eps = 0.0000000001
for t in range(iter_time):
    loss = np.sqrt(np.sum(np.power(np.dot(x, w) - y, 2))/471/12)#rmse
    if(t%1000==0):
        print(str(t) + ":" + str(loss))
    gradient = 2 * np.dot(x.transpose(), np.dot(x, w) - y) #dim*1
    adagrad += gradient ** 2
    w = w - learning_rate * gradient / np.sqrt(adagrad + eps) #eps是一个很小的正数,避免分母为0
np.save('weight.npy', w)
#print(w)

0:27.071214829194115
1000:7.009213113051888
2000:6.502608081214534
3000:6.274166821906605
4000:6.142124124714215
5000:6.055390909221868

Testing

在这里插入图片描述

测试集数据与训练集类似,不过只给了9个小时的数据,第10小时的数据需要用训练的模型预测。总共有240个id的数据,也就是240组18*9的数据,需要预测出240个值。

# 读入测试数据test.csv
testdata = pd.read_csv('./test.csv', header = None, encoding = 'big5')
# 丢弃前两列,需要的是从第3列开始的数据
test_data = testdata.iloc[:, 2:]
# 把降雨为NR字符变成数字0
test_data[test_data == 'NR'] = 0
# 将dataframe变成numpy数组
test_data = test_data.to_numpy()
# 将test数据也变成 240 个维度为 18 * 9 + 1 的数据。
test_x = np.empty([240, 18*9], dtype = float)
for i in range(240):
    test_x[i, :] = test_data[18 * i: 18* (i + 1), :].reshape(1, -1)
for i in range(len(test_x)):
    for j in range(len(test_x[0])):
        if std_x[j] != 0:
            test_x[i][j] = (test_x[i][j] - mean_x[j]) / std_x[j]
test_x = np.concatenate((np.ones([240, 1]), test_x), axis = 1).astype(float)
test_x

/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py:3093: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._where(-key, value, inplace=True)





array([[ 1.        , -0.24447681, -0.24545919, ..., -0.67065391,
        -1.04594393,  0.07797893],
       [ 1.        , -1.35825331, -1.51789368, ...,  0.17279117,
        -0.10906991, -0.48454426],
       [ 1.        ,  1.5057434 ,  1.34508393, ..., -1.32666675,
        -1.04594393, -0.57829812],
       ...,
       [ 1.        ,  0.3919669 ,  0.54981237, ...,  0.26650729,
        -0.20275731,  1.20302531],
       [ 1.        , -1.8355861 , -1.8360023 , ..., -1.04551839,
        -1.13963133, -1.14082131],
       [ 1.        , -1.35825331, -1.35883937, ...,  2.98427476,
         3.26367657,  1.76554849]])

Prediction

說明圖同上

在这里插入图片描述

载入模型即可对test数据进行预测,得到预测值ans_y。

w = np.load('weight.npy')
ans_y = np.dot(test_x, w)

把预测值保存为CSV文件

import csv
with open('submit.csv', mode='w', newline='') as submit_file:
    csv_writer = csv.writer(submit_file)
    header = ['id', 'value']
    csv_writer.writerow(header)
    for i in range(240):
        row = ['id_' + str(i), ans_y[i][0]]
        csv_writer.writerow(row)
        

完整代码

# 导入相关库
import sys
import pandas as pd
import numpy as np

# 读入数据
data = pd.read_csv('./train.csv', encoding = 'big5')

# 数据预处理
data = data.iloc[:, 3:]
data[data == 'NR'] = 0
raw_data = data.to_numpy()

# 按月分割数据
month_data = {}
for month in range(12):
    sample = np.empty([18, 480])
    for day in range(20):
        sample[:, day * 24 : (day + 1) * 24] = raw_data[18 * (20 * month + day) : 18 * (20 * month + day + 1), :]
    month_data[month] = sample

# 分割x和y
x = np.empty([12 * 471, 18 * 9], dtype = float)
y = np.empty([12 * 471, 1], dtype = float)
for month in range(12):
    for day in range(20):
        for hour in range(24):
            if day == 19 and hour > 14:
                continue
            x[month * 471 + day * 24 + hour, :] = month_data[month][:,day * 24 + hour : day * 24 + hour + 9].reshape(1, -1) #vector dim:18*9 (9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9)
            y[month * 471 + day * 24 + hour, 0] = month_data[month][9, day * 24 + hour + 9] #value


# 对x标准化
mean_x = np.mean(x, axis = 0) #18 * 9 
std_x = np.std(x, axis = 0) #18 * 9 
for i in range(len(x)): #12 * 471
    for j in range(len(x[0])): #18 * 9 
        if std_x[j] != 0:
            x[i][j] = (x[i][j] - mean_x[j]) / std_x[j]

# 训练模型并保存权重
dim = 18 * 9 + 1
w = np.zeros([dim, 1])
x2 = np.concatenate((np.ones([12 * 471, 1]), x), axis = 1).astype(float)
learning_rate = 2
iter_time = 10000
adagrad = np.zeros([dim, 1])
eps = 1e-7
for t in range(iter_time):
    loss = np.sqrt(np.sum(np.power(np.dot(x2, w) - y, 2))/471/12)#rmse
    if(t%1000==0):
        print(str(t) + ":" + str(loss))
    gradient = 2 * np.dot(x2.transpose(), np.dot(x2, w) - y) #dim*1
    adagrad += gradient ** 2
    w = w - learning_rate * gradient / (np.sqrt(adagrad) + eps)
  
np.save('weight.npy', w)

# 导入测试数据test.csv
testdata = pd.read_csv('./test.csv', header = None, encoding = 'big5')
test_data = testdata.iloc[:, 2:]
test_data[test_data == 'NR'] = 0
test_data = test_data.to_numpy()
test_x = np.empty([240, 18*9], dtype = float)
for i in range(240):
    test_x[i, :] = test_data[18 * i: 18* (i + 1), :].reshape(1, -1)
for i in range(len(test_x)):
    for j in range(len(test_x[0])):
        if std_x[j] != 0:
            test_x[i][j] = (test_x[i][j] - mean_x[j]) / std_x[j]
test_x = np.concatenate((np.ones([240, 1]), test_x), axis = 1).astype(float)

# 对test的x进行预测,得到预测值ans_y
w = np.load('weight.npy')
ans_y = np.dot(test_x, w)
# 加一个预处理<0的都变成0
for i in range(240):
    if(ans_y[i][0]<0):
        ans_y[i][0]=0
    else:
        ans_y[i][0]=np.round(ans_y[i][0])

# 保存为csv文件,并提交到kaggle:https://www.kaggle.com/c/ml2020spring-hw1/submissions
import csv
with open('submit1.csv', mode='w', newline='') as submit_file:
    csv_writer = csv.writer(submit_file)
    header = ['id', 'value']
    
    csv_writer.writerow(header)
    for i in range(240):
        row = ['id_' + str(i), ans_y[i][0]]
        csv_writer.writerow(row)
   

0:27.071214829194115
1000:5.833520263644297
2000:5.725894874614578
3000:5.696965973062232
4000:5.687776442345435
5000:5.684605135867572
6000:5.683380671750098
7000:5.682811312617999
8000:5.682473834864111
9000:5.682226133038848


/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:63: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py:3093: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._where(-key, value, inplace=True)

版本2

加入x的平方项,private score的分数下降了很多,不过public score的分数上升了。

# 训练集
for month in range(12):
    for day in range(20):
        for hour in range(24):
            if day == 19 and hour > 14:
                continue
            x1 = month_data[month][:, day * 24 + hour: day * 24 + hour + 9].reshape(1,-1)  # vector dim:18*9 (9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9)
            x[month * 471 + day * 24 + hour, :18 * 9] = x1
            # 在这里加入了x的二次项
            x[month * 471 + day * 24 + hour, 18 * 9: 18 * 9 * 2] = np.power(x1, 2)
            y[month * 471 + day * 24 + hour, 0] = month_data[month][9, day * 24 + hour + 9]  # value

# 测试集
testdata = pd.read_csv('./test.csv', header = None, encoding = 'big5')
test_data = testdata.iloc[:, 2:]
test_data[test_data == 'NR'] = 0
test_data = test_data.to_numpy()
test_x1 = np.empty([240, 18*9], dtype = float)
test_x = np.empty([240, 18*9*2], dtype = float)
for i in range(240):
    test_x1 = test_data[18 * i: 18 * (i + 1), :].reshape(1, -1).astype(float)
    # 同样在这里加入test x的二次项
    test_x[i, : 18 * 9] = test_x1
    test_x[i, 18 * 9:] = np.power(test_x1 , 2)
for i in range(len(test_x)):
    for j in range(len(test_x[0])):
        if std_x[j] != 0:
            test_x[i][j] = (test_x[i][j] - mean_x[j]) / std_x[j]
test_x = np.concatenate((np.ones([240, 1]), test_x), axis = 1).astype(float)

完整代码

import sys
import pandas as pd
import numpy as np

# 读入数据
data = pd.read_csv('./train.csv', encoding='big5')

# 数据预处理
data = data.iloc[:, 3:]
data[data == 'NR'] = 0
raw_data = data.to_numpy()

# 按月分割数据
month_data = {}
for month in range(12):
    sample = np.empty([18, 480])
    for day in range(20):
        sample[:, day * 24: (day + 1) * 24] = raw_data[18 * (20 * month + day): 18 * (20 * month + day + 1), :]
    month_data[month] = sample

# 分割x和y
x = np.empty([12 * 471, 18 * 9 * 2], dtype=float)
y = np.empty([12 * 471, 1], dtype=float)
for month in range(12):
    for day in range(20):
        for hour in range(24):
            if day == 19 and hour > 14:
                continue
            x1 = month_data[month][:, day * 24 + hour: day * 24 + hour + 9].reshape(1,-1)  # vector dim:18*9 (9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9)
            x[month * 471 + day * 24 + hour, :18 * 9] = x1
            # 在这里加入了x的二次项
            x[month * 471 + day * 24 + hour, 18 * 9: 18 * 9 * 2] = np.power(x1, 2)
            y[month * 471 + day * 24 + hour, 0] = month_data[month][9, day * 24 + hour + 9]  # value

# 对x标准化
mean_x = np.mean(x, axis=0)  # 18 * 9 * 2
std_x = np.std(x, axis=0)  # 18 * 9 * 2
for i in range(len(x)):  # 12 * 471
    for j in range(len(x[0])):  # 18 * 9 * 2
        if std_x[j] != 0:
            x[i][j] = (x[i][j] - mean_x[j]) / std_x[j]


# 随机打散X和Y
def _shuffle(X, Y):
    randomize = np.arange(len(X))
    np.random.shuffle(randomize)
    return (X[randomize], Y[randomize])

# 训练模型并保存权重
dim = 18 * 9 * 2 + 1
w = np.ones([dim, 1])
learning_rate = 2
iter_time = 5000
adagrad = np.zeros([dim, 1])
eps = 1e-7

for t in range(iter_time):
    x, y = _shuffle(x, y)
    x2 = np.concatenate((np.ones([len(x), 1]), x), axis=1).astype(float)
    gradient = 2 * np.dot(x2.transpose(), np.dot(x2, w) - y)  # dim*1
    adagrad += gradient ** 2
    w = w - learning_rate * gradient / (np.sqrt(adagrad) + eps)

    loss = np.sqrt(np.sum(np.power(np.dot(x2, w) - y, 2)) / len(x))  # rmse
    if (t % 100 == 0):
        print(str(t) + ":" + str(loss))

np.save('weight.npy', w)

# 导入测试数据test.csv
testdata = pd.read_csv('./test.csv', header = None, encoding = 'big5')
test_data = testdata.iloc[:, 2:]
test_data[test_data == 'NR'] = 0
test_data = test_data.to_numpy()
test_x1 = np.empty([240, 18*9], dtype = float)
test_x = np.empty([240, 18*9*2], dtype = float)
for i in range(240):
    test_x1 = test_data[18 * i: 18 * (i + 1), :].reshape(1, -1).astype(float)
    # 同样在这里加入test x的二次项
    test_x[i, : 18 * 9] = test_x1
    test_x[i, 18 * 9:] = np.power(test_x1 , 2)
for i in range(len(test_x)):
    for j in range(len(test_x[0])):
        if std_x[j] != 0:
            test_x[i][j] = (test_x[i][j] - mean_x[j]) / std_x[j]
test_x = np.concatenate((np.ones([240, 1]), test_x), axis = 1).astype(float)

# 对test的x进行预测,得到预测值ans_y
w = np.load('weight.npy')
ans_y = np.dot(test_x, w)
# 加一个预处理<0的都变成0
for i in range(240):
    if(ans_y[i][0]<0):
        ans_y[i][0]=0
    else:
        ans_y[i][0]=np.round(ans_y[i][0])

# 保存为csv文件,并提交到kaggle:https://www.kaggle.com/c/ml2020spring-hw1/submissions
import csv
with open('submit.csv', mode='w', newline='') as submit_file:
    csv_writer = csv.writer(submit_file)
    header = ['id', 'value']
    print(header)
    csv_writer.writerow(header)
    for i in range(240):
        row = ['id_' + str(i), ans_y[i][0]]
        csv_writer.writerow(row)

Logo

汇聚全球AI编程工具,助力开发者即刻编程。

更多推荐