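2023 is almost over. As the year winds down, heavy snow has fallen across the country and temperatures have dropped sharply, yet none of this has dampened people's enthusiasm for travel. I will soon be joining the crowds myself, heading home.
I happened to have some free time today, so I decided to do something fun with temperature as the starting point. We crawled the monthly mean temperatures of cities in every province of China over recent years, and the goal is to analyze this data and eventually build a model that can predict future values.
Without further ado, let's look at the dataset first. Its columns are year-month (年月), province (省份), province code (省份代码), city (城市), city code (城市代码), and mean temperature (平均气温):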
年月 | 省份 | 省份代码 | 城市 | 城市代码 | 平均气温 |
202201 | 黑龙江省 | 230000 | 七台河市 | 230900 | -17.17509676 |
202202 | 黑龙江省 | 230000 | 七台河市 | 230900 | -12.18510315 |
202203 | 黑龙江省 | 230000 | 七台河市 | 230900 | -1.651742341 |
202204 | 黑龙江省 | 230000 | 七台河市 | 230900 | 8.299090227 |
202205 | 黑龙江省 | 230000 | 七台河市 | 230900 | 13.81210092 |
202206 | 黑龙江省 | 230000 | 七台河市 | 230900 | 18.89282461 |
202201 | 海南省 | 460000 | 万宁市 | 469006 | 20.87951741 |
202202 | 海南省 | 460000 | 万宁市 | 469006 | 18.60191442 |
202203 | 海南省 | 460000 | 万宁市 | 469006 | 24.08055245 |
202204 | 海南省 | 460000 | 万宁市 | 469006 | 24.65870836 |
202205 | 海南省 | 460000 | 万宁市 | 469006 | 26.1541906 |
202206 | 海南省 | 460000 | 万宁市 | 469006 | 28.84376812 |
202201 | 海南省 | 460000 | 三亚市 | 460200 | 22.08995545 |
202202 | 海南省 | 460000 | 三亚市 | 460200 | 20.97310047 |
202203 | 海南省 | 460000 | 三亚市 | 460200 | 24.48222669 |
202204 | 海南省 | 460000 | 三亚市 | 460200 | 25.00777376 |
202205 | 海南省 | 460000 | 三亚市 | 460200 | 26.56800089 |
202206 | 海南省 | 460000 | 三亚市 | 460200 | 28.24468505 |
202201 | 福建省 | 350000 | 三明市 | 350400 | 10.43969564 |
202202 | 福建省 | 350000 | 三明市 | 350400 | 8.192309117 |
202203 | 福建省 | 350000 | 三明市 | 350400 | 17.12428889 |
202204 | 福建省 | 350000 | 三明市 | 350400 | 18.50666753 |
202205 | 福建省 | 350000 | 三明市 | 350400 | 20.29015878 |
202206 | 福建省 | 350000 | 三明市 | 350400 | 24.68056597 |
202201 | 海南省 | 460000 | 三沙市 | 460300 | 23.43025149 |
202202 | 海南省 | 460000 | 三沙市 | 460300 | 22.41091619 |
202203 | 海南省 | 460000 | 三沙市 | 460300 | 25.66283196 |
202204 | 海南省 | 460000 | 三沙市 | 460300 | 26.44142246 |
202205 | 海南省 | 460000 | 三沙市 | 460300 | 27.60009033 |
202206 | 海南省 | 460000 | 三沙市 | 460300 | 29.16017647 |
202201 | 河南省 | 410000 | 三门峡市 | 411200 | 0.155539375 |
202202 | 河南省 | 410000 | 三门峡市 | 411200 | 1.576565044 |
202203 | 河南省 | 410000 | 三门峡市 | 411200 | 10.70505929 |
202204 | 河南省 | 410000 | 三门峡市 | 411200 | 15.71527554 |
202205 | 河南省 | 410000 | 三门峡市 | 411200 | 18.77065246 |
202206 | 河南省 | 410000 | 三门峡市 | 411200 | 25.79582475 |
202201 | 上海市 | 310000 | 上海市 | 310000 | 6.567684365 |
202202 | 上海市 | 310000 | 上海市 | 310000 | 5.918965574 |
202203 | 上海市 | 310000 | 上海市 | 310000 | 13.05226396 |
202204 | 上海市 | 310000 | 上海市 | 310000 | 16.88218889 |
202205 | 上海市 | 310000 | 上海市 | 310000 | 20.1943752 |
202206 | 上海市 | 310000 | 上海市 | 310000 | 26.13170191 |
202201 | 江西省 | 360000 | 上饶市 | 361100 | 7.35258247 |
202202 | 江西省 | 360000 | 上饶市 | 361100 | 5.574442995 |
202203 | 江西省 | 360000 | 上饶市 | 361100 | 15.07619137 |
202204 | 江西省 | 360000 | 上饶市 | 361100 | 17.78171764 |
202205 | 江西省 | 360000 | 上饶市 | 361100 | 20.04967596 |
202206 | 江西省 | 360000 | 上饶市 | 361100 | 25.01682304 |
202201 | 海南省 | 460000 | 东方市 | 469007 | 21.38617906 |
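As you can see, the crawled data has already been cleaned up, and only the fields we actually need are kept.
A simple way to load the data:
import pandas as pd

# Load the cleaned monthly temperature table
df = pd.read_excel("temperature.xlsx")
print(df.head(10))
The output is: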
年月 省份 省份代码 城市 城市代码 平均气温
0 202201 黑龙江省 230000 七台河市 230900 -17.175097
1 202202 黑龙江省 230000 七台河市 230900 -12.185103
2 202203 黑龙江省 230000 七台河市 230900 -1.651742
3 202204 黑龙江省 230000 七台河市 230900 8.299090
4 202205 黑龙江省 230000 七台河市 230900 13.812101
5 202206 黑龙江省 230000 七台河市 230900 18.892825
6 202201 海南省 460000 万宁市 469006 20.879517
7 202202 海南省 460000 万宁市 469006 18.601914
8 202203 海南省 460000 万宁市 469006 24.080552
9 202204 海南省 460000 万宁市 469006 24.658708
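Next we need to parse the raw data and build the datasets we need. Here I chose to build lookup dictionaries at the different levels (province, city, month) so the data is easy to store and retrieve.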
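import json

# Assumed: datas is the row list of the DataFrame loaded above, one row per
# [year-month, province, province code, city, city code, mean temperature]
datas = df.values.tolist()

pc_code_map = {}  # province code -> list of city codes
data_dict = {}  # city code -> {year-month: mean temperature}
city_name_map = {}  # city code -> city name
pro_name_map = {}  # province code -> province name
for ym, pro, pcode, city, ccode, temp in datas:
    # Use string keys so the in-memory dicts match what the JSON files store
    ym, pcode, ccode = str(ym), str(pcode), str(ccode)
    city_name_map[ccode] = city
    pro_name_map[pcode] = pro
    data_dict.setdefault(ccode, {})[ym] = float(temp)
    city_codes = pc_code_map.setdefault(pcode, [])
    if ccode not in city_codes:
        city_codes.append(ccode)


def save_json(obj, path):
    # ensure_ascii=False keeps the Chinese names readable in the file
    with open(path, "w", encoding="utf-8") as f:
        json.dump(obj, f, ensure_ascii=False, indent=2)


save_json(pc_code_map, "pc_code_map.json")  # province code -> city codes
save_json(pro_name_map, "pro_name_map.json")  # province code -> province name
save_json(city_name_map, "city_name_map.json")  # city code -> city name
save_json(data_dict, "data_dict.json")  # city code -> monthly data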
Once this has run, we have all the dictionaries we need.
The province code to name map looks like this:
{
"230000": "黑龙江省",
"460000": "海南省",
"350000": "福建省",
"410000": "河南省",
"310000": "上海市",
"360000": "江西省",
"440000": "广东省",
"370000": "山东省",
"640000": "宁夏回族自治区",
"620000": "甘肃省",
"140000": "山西省",
"530000": "云南省",
"210000": "辽宁省",
"330000": "浙江省",
"150000": "内蒙古自治区",
"650000": "新疆维吾尔自治区",
"510000": "四川省",
"340000": "安徽省",
"420000": "湖北省",
"130000": "河北省",
"520000": "贵州省",
"110000": "北京市",
"450000": "广西壮族自治区",
"320000": "江苏省",
"710000": "台湾省",
"220000": "吉林省",
"610000": "陕西省",
"120000": "天津市",
"430000": "湖南省",
"540000": "西藏自治区",
"630000": "青海省",
"500000": "重庆市",
"810000": "香港特别行政区",
"820000": "澳门特别行政区"
}
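The city code to name map is fairly large, so it is shown as a screenshot:
The most important one, the province-to-city code map that records which cities belong to which province, looks like this:
Finally, the data map looks like this:
Taking the first city as an example, here is a sample entry: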
"230900": {
"202201": -17.17509676,
"202202": -12.18510315,
"202203": -1.651742341,
"202204": 8.299090227,
"202205": 13.81210092,
"202206": 18.89282461,
"202207": 22.5161290322581,
"202208": 21.2258064516129,
"202209": 15.4,
"202210": 5.46666666666667,
"202211": -6.89285714285714,
"202212": -13.3225806451613,
"202101": -18.22501341,
"202102": -12.28888705,
"202103": -0.8183623,
"202104": 7.751819452,
"202105": 14.60449136,
"202106": 19.57243154,
"202107": 25.11467703,
"202108": 20.8148189,
"202109": 15.17717288,
"202110": 6.577239561,
"202111": -2.633364113,
"202112": -14.26601362,
"199601": -15.6399061175411,
"199602": -10.8490324459857,
"199603": -2.73487703742385,
"199604": 6.77624656346054,
"199605": 14.7480329803293,
"199606": 18.741110237741,
"199607": 21.8227186207128,
"199608": 19.8129428169199,
"199609": 13.9293662062715,
"199610": 4.87227430922752,
"199611": -5.75766094158093,
"199612": -15.2558164472958,
"199701": -17.0151278701834,
"199702": -10.1700525389631,
"199703": -4.00013012709894,
"199704": 7.57908673146959,
"199705": 13.1037022544812,
"199706": 19.5144368612577,
"199707": 23.8898612839882,
"199708": 20.9913130130252,
"199709": 13.2378954326523,
"199710": 3.90252111873771,
"199711": -2.57367660135431,
"199712": -11.8481315965942,
"199801": -18.4068766579621,
"199802": -8.22564867970505,
"199803": -0.717881091689514,
"199804": 10.4591723442796,
"199805": 16.2436263210651,
"199806": 19.87804687861,
"199807": 22.4432237963072,
"199808": 19.3297502453057,
"199809": 15.5532539887023,
"199810": 8.98750636938723,
"199811": -8.31419612615707,
"199812": -13.065483602281,
"199901": -14.3036750834318,
"199902": -11.6973310164259,
"199903": -7.68328891718841,
"199904": 6.64887418344983,
"199905": 12.4825940009938,
"199906": 18.6097842396113,
"199907": 24.318817637778,
"199908": 20.892966640922,
"199909": 15.2460178359725,
"199910": 5.37214900923093,
"199911": -3.94559904567139,
"199912": -13.1201753452356,
"200001": -19.1030736305323,
"200002": -13.5096400881974,
"200003": -3.87596338961345,
"200004": 5.42818873393441,
"200005": 14.4595480970601,
"200006": 21.1697590583826,
"200007": 23.3939194708654,
"200008": 22.4560168898168,
"200009": 15.6835960079016,
"200010": 5.33414761875174,
"200011": -6.97706680859187,
"200012": -17.8210475121602,
"200101": -20.1388059253368,
"200102": -14.787474221591,
"200103": -5.32133407793727,
"200104": 7.82743154160904,
"200105": 15.2847985253454,
"200106": 19.9560671099202,
"200107": 23.0852386581602,
"200108": 21.2225749829277,
"200109": 14.5846078492648,
"200110": 7.95760390739298,
"200111": -2.42007397024145,
"200112": -13.4813095639178,
"200201": -13.5939068387526,
"200202": -7.34173525981689,
"200203": -0.233700028014121,
"200204": 7.5959529414626,
"200205": 15.737571692425,
"200206": 17.7341122175261,
"200207": 21.3937887420561,
"200208": 18.9262384055316,
"200209": 15.1341121459354,
"200210": 3.86838359758175,
"200211": -9.91009024156768,
"200212": -14.9391234738333,
"200301": -15.8558837609752,
"200302": -11.1172298454566,
"200303": -0.884583456864082,
"200304": 8.91731797636618,
"200305": 15.3175305752245,
"200306": 20.466183886004,
"200307": 20.6435510291096,
"200308": 20.1389982491099,
"200309": 15.7656182772486,
"200310": 6.80093328934769,
"200311": -5.35017192739297,
"200312": -12.4382605970563,
"200401": -16.161431016735,
"200402": -10.6591540364631,
"200403": -3.24075662433218,
"200404": 6.0265178749232,
"200405": 14.0593823358904,
"200406": 20.735180352674,
"200407": 21.4848367883347,
"200408": 20.3676144741795,
"200409": 16.2031984840744,
"200410": 7.86029215529169,
"200411": -1.50282217862262,
"200412": -15.6096590048404,
"200501": -15.5840144289644,
"200502": -14.9848092760602,
"200503": -4.39028028412153,
"200504": 6.17825423405859,
"200505": 11.9487828150888,
"200506": 21.3162182720097,
"200507": 21.5767493114552,
"200508": 21.252397376787,
"200509": 15.6453687547535,
"200510": 7.07448287423985,
"200511": -2.81564207585917,
"200512": -16.3414678602839,
"200601": -17.1414624049409,
"200602": -12.6045068798972,
"200603": -4.45483560996889,
"200604": 3.85045858043847,
"200605": 15.47886914675,
"200606": 17.9060071662806,
"200607": 22.2829811213763,
"200608": 22.226202191106,
"200609": 15.4316028552512,
"200610": 6.60720356709298,
"200611": -4.98519541059605,
"200612": -12.3402560494168,
"200701": -10.9233167248898,
"200702": -9.20223256995648,
"200703": -4.76507837858786,
"200704": 5.80471054000558,
"200705": 13.0112096980299,
"200706": 21.3219405329277,
"200707": 21.8972189813208,
"200708": 21.9011801370854,
"200709": 15.8642576070777,
"200710": 6.84349337906247,
"200711": -4.25115765802184,
"200712": -10.8667010111846,
"200801": -16.86872285839,
"200802": -10.4305610512003,
"200803": 1.29352780411865,
"200804": 10.0219535910654,
"200805": 11.8595057186186,
"200806": 20.5839131697934,
"200807": 22.8916266610283,
"200808": 20.740608936226,
"200809": 15.5914179874009,
"200810": 7.62394885056038,
"200811": -5.24272307162044,
"200812": -11.8187382120454,
"200901": -16.0879528274568,
"200902": -12.8669374095161,
"200903": -5.1243527484677,
"200904": 7.62148689264423,
"200905": 16.9129819658406,
"200906": 17.1350830370896,
"200907": 20.5713803492934,
"200908": 20.8143828218523,
"200909": 14.479100648962,
"200910": 6.74166935650382,
"200911": -7.2839048133726,
"200912": -17.4216048535113,
"201001": -16.5181374821615,
"201002": -15.3885911570271,
"201003": -7.43819498355497,
"201004": 3.40319074928322,
"201005": 14.8994169980266,
"201006": 23.5336468263204,
"201007": 22.1642059968242,
"201008": 21.5261399863279,
"201009": 15.3014670035077,
"201010": 6.19597334215438,
"201011": -3.88975373294606,
"201012": -16.293905392013,
"201101": -18.6383442034344,
"201102": -10.8242112910774,
"201103": -3.8526875769709,
"201104": 5.43990886038269,
"201105": 12.9026193582865,
"201106": 18.6249206829726,
"201107": 23.1166580814646,
"201108": 21.7815000953305,
"201109": 14.3171715345182,
"201110": 8.61500022260052,
"201111": -3.90527000147059,
"201112": -14.6971294559892,
"201201": -19.2170327644329,
"201202": -13.4485062366396,
"201203": -4.3542751386122,
"201204": 6.37062930705704,
"201205": 15.1209917984733,
"201206": 20.0819674887397,
"201207": 22.3468330059722,
"201208": 21.3503007529542,
"201209": 15.9626167333807,
"201210": 5.77970646814959,
"201211": -4.65404081306043,
"201212": -18.0524462383532,
"201301": -19.1056120310673,
"201302": -14.5711372636033,
"201303": -5.98225236920353,
"201304": 3.55438380466907,
"201305": 15.9106076632177,
"201306": 20.8662264689511,
"201307": 22.5672709350665,
"201308": 21.2276419823929,
"201309": 14.8433802685303,
"201310": 6.71203289603078,
"201311": -1.96737054471101,
"201312": -13.6724142159894,
"201401": -17.5567513764841,
"201402": -14.1885294124841,
"201403": -2.02247094749546,
"201404": 9.14324976172626,
"201405": 13.6206002379863,
"201406": 20.9092332365779,
"201407": 22.482804206026,
"201408": 20.894762804907,
"201409": 14.7171206805914,
"201410": 5.90892831276899,
"201411": -2.13542997775412,
"201412": -16.0693780796525,
"201501": -14.1750510539983,
"201502": -10.166207283683,
"201503": -0.572699643662685,
"201504": 7.61793225234899,
"201505": 13.05827363592,
"201506": 20.0776463284231,
"201507": 21.8858386062309,
"201508": 21.4283733605604,
"201509": 15.3202416961285,
"201510": 6.31968932349534,
"201511": -5.40949708189554,
"201512": -12.3320910323936,
"201601": -17.0315551683299,
"201602": -12.5566674623577,
"201603": -0.975391675817315,
"201604": 6.60310537941138,
"201605": 14.2842873464881,
"201606": 18.1668820050165,
"201607": 22.8188637364763,
"201608": 21.5656483689152,
"201609": 15.7290895231049,
"201610": 4.21123821157741,
"201611": -9.20460399980297,
"201612": -12.3938321119691,
"201701": -14.3028140784659,
"201702": -9.47902238990518,
"201703": -1.47084530158176,
"201704": 7.76431746206703,
"201705": 15.2819791571205,
"201706": 18.2186029473995,
"201707": 23.2990204977831,
"201708": 20.8530229816558,
"201709": 14.9739486281703,
"201710": 5.59989002810805,
"201711": -6.06464857624191,
"201712": -15.9644998322323,
"201801": -17.3453689520356,
"201802": -14.982387981793,
"201803": -3.30781445828037,
"201804": 7.83733928654644,
"201805": 14.9377637771225,
"201806": 19.5411789153982,
"201807": 23.980412305595,
"201808": 19.6609498090851,
"201809": 14.6910227631623,
"201810": 7.88228395425143,
"201811": -3.2221266175684,
"201812": -12.0686651386512,
"201901": -12.3326921416886,
"201902": -8.66135822504557,
"201903": -0.131484663069162,
"201904": 7.71111855586347,
"201905": 15.185935796864,
"201906": 18.0235058578736,
"201907": 22.5231228868017,
"201908": 20.1451252374758,
"201909": 15.9761182329917,
"201910": 7.96678551380932,
"201911": -5.27560214569764,
"201912": -14.7560434953234,
"202001": -13.8535281381279,
"202002": -10.4773745109542,
"202003": -0.898360809957969,
"202004": 6.12593835509329,
"202005": 14.5690819463134,
"202006": 17.7175124289082,
"202007": 22.8796166491393,
"202008": 20.8529949000312,
"202009": 15.7087979425659,
"202010": 7.02369481992306,
"202011": -3.37084803211951,
"202012": -14.2457753700818
}
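That is close to 27 years of data (January 1996 through December 2022 for this city), which is plenty to work with.
Next, we want to use the Pearson correlation coefficient to measure how strongly the temperature trends of different cities in the same province are correlated, and present the result as a heat map. The core code is as follows: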
# Heat map: one correlation matrix per province
for one_pcode, one_code_list in pc_code_map.items():
    print(one_pcode, one_code_list)
    data_factors = [city_name_map[one] for one in one_code_list]
    print(data_factors)
    matrix = []
    for one_ccode in one_code_list:
        one_dict = data_dict[one_ccode]
        # Sort by year-month so every city's series is in chronological order
        one_sorted = sorted(one_dict.items(), key=lambda e: e[0])
        one_value_list = [one[1] for one in one_sorted]
        matrix.append(one_value_list)
    title = "Different City Temperature Relation Analysis HeatMap in The Same Province"
    # relationAnalysis is a plotting helper defined elsewhere (not shown in this post)
    relationAnalysis(
        matrix,
        data_factors,
        title,
        savepath="heatmap/" + pro_name_map[one_pcode] + ".png",
    )
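The `relationAnalysis` helper is not shown in this post, so here is a minimal sketch of what such a helper could look like. It assumes every row of `matrix` is one city's series and that all rows have the same length. The names `pearson_matrix` and `relation_analysis_sketch` are hypothetical, and the plotting details (colormap, figure size) are my own choices, not the original implementation.

```python
import numpy as np


def pearson_matrix(matrix):
    """Pairwise Pearson correlation between rows (one row per city)."""
    data = np.asarray(matrix, dtype=float)
    return np.corrcoef(data)  # np.corrcoef treats each row as one variable


def relation_analysis_sketch(matrix, labels, title, savepath):
    """Hypothetical stand-in for relationAnalysis: draw and save a heat map."""
    # Imported here so pearson_matrix stays usable without matplotlib
    import matplotlib.pyplot as plt

    corr = pearson_matrix(matrix)
    fig, ax = plt.subplots(figsize=(8, 8))
    im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
    ax.set_xticks(range(len(labels)))
    ax.set_xticklabels(labels, rotation=90)
    ax.set_yticks(range(len(labels)))
    ax.set_yticklabels(labels)
    ax.set_title(title)
    fig.colorbar(im, ax=ax)
    fig.savefig(savepath, bbox_inches="tight")
    plt.close(fig)
    return corr


# January-June 2022 values for 万宁市 and 三亚市, copied from the table above
wanning = [20.87951741, 18.60191442, 24.08055245, 24.65870836, 26.1541906, 28.84376812]
sanya = [22.08995545, 20.97310047, 24.48222669, 25.00777376, 26.56800089, 28.24468505]
corr = pearson_matrix([wanning, sanya])
print(corr.round(3))
```

If the labels are Chinese city names, matplotlib needs a font with CJK glyphs configured, otherwise the tick labels render as empty boxes.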
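There are many provinces, so only a few examples are shown here:
【福建省 (Fujian)】
【广东省 (Guangdong)】
【河南省 (Henan)】
【黑龙江省 (Heilongjiang)】
The examples are ordered from south to north. It is very clear that the temperature trends of different cities in the same province are strongly positively correlated. This is not hard to understand, since cities in one province share the same seasonal cycle.
Next, we can visualize the data for individual cities. The core code is as follows: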
import matplotlib.pyplot as plt

for one_city, one_dict in data_dict.items():
    # Sort by year-month so the curve runs in chronological order
    one_sorted = sorted(one_dict.items(), key=lambda e: e[0])
    one_value_list = [one[1] for one in one_sorted]
    one_tick_list = [one[0] for one in one_sorted]
    print("one_city: ", one_city, ", one_num: ", len(one_value_list))
    plt.figure(figsize=(12, 6))
    plt.plot(one_value_list)
    plt.xticks(list(range(len(one_value_list))), one_tick_list, rotation=60)
    plt.title(str(one_city) + " Temperature Curve")
    plt.savefig("picture/" + str(one_city) + ".png")
    plt.close()
There are many cities, so only a few examples are shown here:
【15900】
【44200】
【360400】
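That wraps up the preliminary visual analysis. Next, we want to model each city's monthly mean temperature and predict the mean temperature for future periods. The model we chose is a random forest.
Random Forest (RF) is a widely used machine learning model that performs well on both classification and regression problems. It is built on the idea of ensemble learning, specifically the Bagging method. The key points of how the model is built are:
1. Ensemble learning: a random forest is a special case of ensemble learning, whose main idea is to combine several "weak learners" into one "strong learner". In a random forest the weak learners are decision trees. Combining the predictions of many trees (averaging for regression, majority voting for classification) improves the stability and accuracy of the model.
2. Bootstrap sampling: each decision tree is trained on a bootstrap sample, that is, a set of samples drawn at random with replacement from the original dataset. Every tree therefore sees a different training set that still carries part of the information in the original data, which reduces the risk of overfitting.
3. Feature selection: random forests also choose features at random while building each tree. At every node, the best split is chosen from a random subset of the features rather than from all of them. For classification the subset size is commonly the square root of the total number of features; note that scikit-learn's RandomForestRegressor, which we use below, considers all features at each split by default. This randomness makes the trees more diverse and further improves generalization.
4. Building the decision trees: using the sampled training sets and the feature selection scheme above, many decision trees are built. Each tree is grown as far as possible until a stopping condition is met (for example, the number of samples in a leaf falls below a preset threshold, or a preset maximum depth is reached).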
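To make the four points above concrete, here is a small hand-rolled sketch of bagging with decision trees. It is only an illustration of the idea, not how sklearn implements its forest internally. The helper names `fit_bagged_trees` and `predict_bagged` are hypothetical, and the data is synthetic: a seasonal signal plus one uninformative feature.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_bagged_trees(X, y, n_trees=25, max_features="sqrt", seed=0):
    """Fit n_trees fully grown trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n_samples = len(X)
    trees = []
    for _ in range(n_trees):
        # Bootstrap: draw n_samples row indices with replacement
        idx = rng.integers(0, n_samples, size=n_samples)
        tree = DecisionTreeRegressor(
            max_features=max_features,  # random feature subset tried at each split
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        trees.append(tree.fit(X[idx], y[idx]))
    return trees


def predict_bagged(trees, X):
    """A regression forest averages the predictions of its trees."""
    return np.mean([tree.predict(X) for tree in trees], axis=0)


# Synthetic data: a 12-month seasonal signal plus an uninformative second feature
rng = np.random.default_rng(42)
X = rng.uniform(0, 12, size=(300, 2))
y = 10 * np.sin(2 * np.pi * X[:, 0] / 12) + rng.normal(0, 1, size=300)

trees = fit_bagged_trees(X[:240], y[:240])
y_hat = predict_bagged(trees, X[240:])
mse = float(np.mean((y[240:] - y_hat) ** 2))
print("held-out MSE:", round(mse, 3), "target variance:", round(float(np.var(y)), 3))
```

Each tree on its own is grown to full depth and overfits its bootstrap sample; it is the averaging across trees that smooths the prediction.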
With sklearn, setting up and initializing the model takes only a few lines:
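from sklearn.ensemble import RandomForestRegressor

# splitData is a helper not shown in this post; ratio=0.10 presumably holds out 10% for testing
X_train, y_train, X_test, y_test = splitData(X, y, ratio=0.10)
model = RandomForestRegressor(n_estimators=100)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)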
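The post does not show how `X` and `y` are built or what `splitData` does, so the following is only one plausible setup. It assumes a sliding window where the previous 12 monthly values are the features and the next month is the target, and a chronological split that keeps the last 10% of samples for testing. The names `make_windows` and `split_data_sketch` are hypothetical; the return order mirrors the `X_train, y_train, X_test, y_test` unpacking used above.

```python
import numpy as np


def make_windows(series, window=12):
    """Turn a 1-D series into (previous `window` values -> next value) samples."""
    values = np.asarray(series, dtype=float)
    X = np.array([values[i:i + window] for i in range(len(values) - window)])
    y = values[window:]
    return X, y


def split_data_sketch(X, y, ratio=0.10):
    """Chronological split: the last `ratio` share of samples becomes the test set."""
    n_test = max(1, int(round(len(X) * ratio)))
    return X[:-n_test], y[:-n_test], X[-n_test:], y[-n_test:]


# 27 years of synthetic monthly values standing in for one city's series
months = np.arange(27 * 12)
series = 5 + 18 * np.sin(2 * np.pi * (months - 3) / 12)

X, y = make_windows(series, window=12)
X_train, y_train, X_test, y_test = split_data_sketch(X, y, ratio=0.10)
print(X.shape, X_train.shape, X_test.shape)
```

Splitting by time rather than at random keeps future months out of the training set, which matters for a forecasting task.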
After predicting, we evaluate how well the predicted values fit the true values, mainly by computing MSE and R2. The code is as follows:
X_train, y_train, X_test, y_test = splitData(X, y, ratio=0.10)
model = RandomForestRegressor(n_estimators=100)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# evaluation is a helper not shown in this post; index 2 is read as MSE and the last item as R2
eva = evaluation(y_test, y_pred)
info = "MSE: " + str(eva[2]) + ", R2: " + str(eva[-1])
plt.figure(figsize=(12, 6))
plt.plot(y_test, label="True Data Curve")
plt.plot(y_pred, label="Predict Data Curve")
plt.title(info)
plt.legend(loc="upper right", ncol=1)
plt.show()
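The `evaluation` helper is not shown either. Below is a minimal sketch of how the two reported metrics can be computed with NumPy. The name `evaluation_sketch` and the tuple layout `(MAE, RMSE, MSE, R2)` are my assumptions, chosen only so that index 2 is MSE and the last item is R2, matching how `eva` is indexed above.

```python
import numpy as np


def evaluation_sketch(y_true, y_pred):
    """Return (MAE, RMSE, MSE, R2); the tuple layout is an assumption."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = float(np.mean(np.abs(err)))
    mse = float(np.mean(err ** 2))
    rmse = float(np.sqrt(mse))
    ss_res = float(np.sum(err ** 2))  # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, mse, r2


eva = evaluation_sketch([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
print("MSE:", eva[2], "R2:", eva[-1])
```

MSE is in squared temperature units, while R2 is unitless: 1 means a perfect fit, and 0 means the model does no better than always predicting the mean.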
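The result is shown below:
Finally, we also predicted the monthly mean temperature for the coming year, as shown below:
The predicted trend looks quite reasonable.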
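The post does not say how the forecast for the coming year was produced. One common approach, sketched here under the assumption that the model was trained on 12-month sliding windows, is recursive forecasting: predict one month, append it to the history, and repeat. The function name `forecast_recursive` is hypothetical, and the series is synthetic rather than the real city data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def forecast_recursive(model, history, steps=12, window=12):
    """Predict `steps` future values, feeding each prediction back as input."""
    buf = list(history[-window:])
    preds = []
    for _ in range(steps):
        x = np.asarray(buf[-window:], dtype=float).reshape(1, -1)
        nxt = float(model.predict(x)[0])
        preds.append(nxt)
        buf.append(nxt)  # the prediction becomes part of the next input window
    return preds


# Synthetic seasonal series standing in for one city's monthly means
months = np.arange(240)
rng = np.random.default_rng(0)
series = 5 + 18 * np.sin(2 * np.pi * (months - 3) / 12) + rng.normal(0, 1, months.size)

window = 12
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

future = forecast_recursive(model, series, steps=12, window=window)
print([round(v, 2) for v in future])
```

Because every step builds on earlier predictions, errors can accumulate over the horizon. A random forest also cannot predict values outside the range of targets it saw during training, so it will not extrapolate a long-term warming or cooling trend.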
If you're interested, give it a try yourself!