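2023 is almost over. As the year winds down, heavy snow has fallen across the country and temperatures have plunged, yet none of that has dampened people's enthusiasm for travel. I will soon join the crowds myself, heading for the place I call home.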

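I happened to have some free time today, so I decided to do something fun with temperature as the starting point. We scraped the monthly average temperatures of cities in every province of China over the past years, and the goal is to analyze this data and ultimately build a model that can predict future values.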

Without further ado, let's look at the dataset first. A sample is shown below:

| 年月 (year-month) | 省份 (province) | 省份代码 (province code) | 城市 (city) | 城市代码 (city code) | 平均气温 (average temperature) |
|---|---|---|---|---|---|
| 202201 | 黑龙江省 | 230000 | 七台河市 | 230900 | -17.17509676 |
| 202202 | 黑龙江省 | 230000 | 七台河市 | 230900 | -12.18510315 |
| 202203 | 黑龙江省 | 230000 | 七台河市 | 230900 | -1.651742341 |
| 202204 | 黑龙江省 | 230000 | 七台河市 | 230900 | 8.299090227 |
| 202205 | 黑龙江省 | 230000 | 七台河市 | 230900 | 13.81210092 |
| 202206 | 黑龙江省 | 230000 | 七台河市 | 230900 | 18.89282461 |
| 202201 | 海南省 | 460000 | 万宁市 | 469006 | 20.87951741 |
| 202202 | 海南省 | 460000 | 万宁市 | 469006 | 18.60191442 |
| 202203 | 海南省 | 460000 | 万宁市 | 469006 | 24.08055245 |
| 202204 | 海南省 | 460000 | 万宁市 | 469006 | 24.65870836 |
| 202205 | 海南省 | 460000 | 万宁市 | 469006 | 26.1541906 |
| 202206 | 海南省 | 460000 | 万宁市 | 469006 | 28.84376812 |
| 202201 | 海南省 | 460000 | 三亚市 | 460200 | 22.08995545 |
| 202202 | 海南省 | 460000 | 三亚市 | 460200 | 20.97310047 |
| 202203 | 海南省 | 460000 | 三亚市 | 460200 | 24.48222669 |
| 202204 | 海南省 | 460000 | 三亚市 | 460200 | 25.00777376 |
| 202205 | 海南省 | 460000 | 三亚市 | 460200 | 26.56800089 |
| 202206 | 海南省 | 460000 | 三亚市 | 460200 | 28.24468505 |
| 202201 | 福建省 | 350000 | 三明市 | 350400 | 10.43969564 |
| 202202 | 福建省 | 350000 | 三明市 | 350400 | 8.192309117 |
| 202203 | 福建省 | 350000 | 三明市 | 350400 | 17.12428889 |
| 202204 | 福建省 | 350000 | 三明市 | 350400 | 18.50666753 |
| 202205 | 福建省 | 350000 | 三明市 | 350400 | 20.29015878 |
| 202206 | 福建省 | 350000 | 三明市 | 350400 | 24.68056597 |
| 202201 | 海南省 | 460000 | 三沙市 | 460300 | 23.43025149 |
| 202202 | 海南省 | 460000 | 三沙市 | 460300 | 22.41091619 |
| 202203 | 海南省 | 460000 | 三沙市 | 460300 | 25.66283196 |
| 202204 | 海南省 | 460000 | 三沙市 | 460300 | 26.44142246 |
| 202205 | 海南省 | 460000 | 三沙市 | 460300 | 27.60009033 |
| 202206 | 海南省 | 460000 | 三沙市 | 460300 | 29.16017647 |
| 202201 | 河南省 | 410000 | 三门峡市 | 411200 | 0.155539375 |
| 202202 | 河南省 | 410000 | 三门峡市 | 411200 | 1.576565044 |
| 202203 | 河南省 | 410000 | 三门峡市 | 411200 | 10.70505929 |
| 202204 | 河南省 | 410000 | 三门峡市 | 411200 | 15.71527554 |
| 202205 | 河南省 | 410000 | 三门峡市 | 411200 | 18.77065246 |
| 202206 | 河南省 | 410000 | 三门峡市 | 411200 | 25.79582475 |
| 202201 | 上海市 | 310000 | 上海市 | 310000 | 6.567684365 |
| 202202 | 上海市 | 310000 | 上海市 | 310000 | 5.918965574 |
| 202203 | 上海市 | 310000 | 上海市 | 310000 | 13.05226396 |
| 202204 | 上海市 | 310000 | 上海市 | 310000 | 16.88218889 |
| 202205 | 上海市 | 310000 | 上海市 | 310000 | 20.1943752 |
| 202206 | 上海市 | 310000 | 上海市 | 310000 | 26.13170191 |
| 202201 | 江西省 | 360000 | 上饶市 | 361100 | 7.35258247 |
| 202202 | 江西省 | 360000 | 上饶市 | 361100 | 5.574442995 |
| 202203 | 江西省 | 360000 | 上饶市 | 361100 | 15.07619137 |
| 202204 | 江西省 | 360000 | 上饶市 | 361100 | 17.78171764 |
| 202205 | 江西省 | 360000 | 上饶市 | 361100 | 20.04967596 |
| 202206 | 江西省 | 360000 | 上饶市 | 361100 | 25.01682304 |
| 202201 | 海南省 | 460000 | 东方市 | 469007 | 21.38617906 |

As you can see, the scraped data has already been cleaned up, and only the fields we actually need are kept.

A simple way to load the data:

import pandas as pd
# Load the cleaned monthly temperature table and preview the first 10 rows
df = pd.read_excel("temperature.xlsx")
print(df.head(10))

The output looks like this:

年月    省份    省份代码    城市    城市代码       平均气温
0  202201  黑龙江省  230000  七台河市  230900 -17.175097
1  202202  黑龙江省  230000  七台河市  230900 -12.185103
2  202203  黑龙江省  230000  七台河市  230900  -1.651742
3  202204  黑龙江省  230000  七台河市  230900   8.299090
4  202205  黑龙江省  230000  七台河市  230900  13.812101
5  202206  黑龙江省  230000  七台河市  230900  18.892825
6  202201   海南省  460000   万宁市  469006  20.879517
7  202202   海南省  460000   万宁市  469006  18.601914
8  202203   海南省  460000   万宁市  469006  24.080552
9  202204   海南省  460000   万宁市  469006  24.658708

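Next, we need to parse the raw data and build the dataset we need. Here I chose to build mapping dictionaries for the different levels (province, city, month) so that later lookups and calculations are convenient.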

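import json

# Assumption: `datas` is the list of rows of the DataFrame loaded above,
# each row being [年月, 省份, 省份代码, 城市, 城市代码, 平均气温].
datas = df.values.tolist()

pc_code_map = {}    # province code -> list of city codes
data_dict = {}      # city code -> {year-month: average temperature}
city_name_map = {}  # city code -> city name
pro_name_map = {}   # province code -> province name
for one_list in datas:
    ym, pro, pcode, city, ccode, temp = one_list
    # Use string keys throughout so in-memory lookups match the JSON files
    ym, pcode, ccode = str(ym), str(pcode), str(ccode)
    city_name_map[ccode] = city
    pro_name_map[pcode] = pro
    data_dict.setdefault(ccode, {})[ym] = float(temp)
    city_codes = pc_code_map.setdefault(pcode, [])
    if ccode not in city_codes:
        city_codes.append(ccode)

# Write each mapping to its own JSON file; ensure_ascii=False keeps the
# Chinese names readable instead of \uXXXX escapes.
outputs = {
    "pc_code_map.json": pc_code_map,      # province code -> city codes
    "pro_name_map.json": pro_name_map,    # province code -> province name
    "city_name_map.json": city_name_map,  # city code -> city name
    "data_dict.json": data_dict,          # city code -> monthly data
}
for filename, mapping in outputs.items():
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(mapping, f, ensure_ascii=False)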

Once this has run, we have the data dictionaries we need.

The province code-to-name mapping looks like this:

{
	"230000": "黑龙江省",
	"460000": "海南省",
	"350000": "福建省",
	"410000": "河南省",
	"310000": "上海市",
	"360000": "江西省",
	"440000": "广东省",
	"370000": "山东省",
	"640000": "宁夏回族自治区",
	"620000": "甘肃省",
	"140000": "山西省",
	"530000": "云南省",
	"210000": "辽宁省",
	"330000": "浙江省",
	"150000": "内蒙古自治区",
	"650000": "新疆维吾尔自治区",
	"510000": "四川省",
	"340000": "安徽省",
	"420000": "湖北省",
	"130000": "河北省",
	"520000": "贵州省",
	"110000": "北京市",
	"450000": "广西壮族自治区",
	"320000": "江苏省",
	"710000": "台湾省",
	"220000": "吉林省",
	"610000": "陕西省",
	"120000": "天津市",
	"430000": "湖南省",
	"540000": "西藏自治区",
	"630000": "青海省",
	"500000": "重庆市",
	"810000": "香港特别行政区",
	"820000": "澳门特别行政区"
}

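The city code-to-name mapping is fairly large, so only a screenshot is shown here:

[Screenshot: city code-to-name mapping]

The most important table is the province-to-city code mapping, which records which cities belong to which province:

[Screenshot: province-to-city code mapping]

Finally, here is our data mapping table:

[Screenshot: city code-to-data mapping]

Taking the first city as an example, its data looks like this: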

"230900": {
    "202201": -17.17509676,
    "202202": -12.18510315,
    "202203": -1.651742341,
    "202204": 8.299090227,
    "202205": 13.81210092,
    "202206": 18.89282461,
    "202207": 22.5161290322581,
    "202208": 21.2258064516129,
    "202209": 15.4,
    "202210": 5.46666666666667,
    "202211": -6.89285714285714,
    "202212": -13.3225806451613,
    "202101": -18.22501341,
    "202102": -12.28888705,
    "202103": -0.8183623,
    "202104": 7.751819452,
    "202105": 14.60449136,
    "202106": 19.57243154,
    "202107": 25.11467703,
    "202108": 20.8148189,
    "202109": 15.17717288,
    "202110": 6.577239561,
    "202111": -2.633364113,
    "202112": -14.26601362,
    "199601": -15.6399061175411,
    "199602": -10.8490324459857,
    "199603": -2.73487703742385,
    "199604": 6.77624656346054,
    "199605": 14.7480329803293,
    "199606": 18.741110237741,
    "199607": 21.8227186207128,
    "199608": 19.8129428169199,
    "199609": 13.9293662062715,
    "199610": 4.87227430922752,
    "199611": -5.75766094158093,
    "199612": -15.2558164472958,
    "199701": -17.0151278701834,
    "199702": -10.1700525389631,
    "199703": -4.00013012709894,
    "199704": 7.57908673146959,
    "199705": 13.1037022544812,
    "199706": 19.5144368612577,
    "199707": 23.8898612839882,
    "199708": 20.9913130130252,
    "199709": 13.2378954326523,
    "199710": 3.90252111873771,
    "199711": -2.57367660135431,
    "199712": -11.8481315965942,
    "199801": -18.4068766579621,
    "199802": -8.22564867970505,
    "199803": -0.717881091689514,
    "199804": 10.4591723442796,
    "199805": 16.2436263210651,
    "199806": 19.87804687861,
    "199807": 22.4432237963072,
    "199808": 19.3297502453057,
    "199809": 15.5532539887023,
    "199810": 8.98750636938723,
    "199811": -8.31419612615707,
    "199812": -13.065483602281,
    "199901": -14.3036750834318,
    "199902": -11.6973310164259,
    "199903": -7.68328891718841,
    "199904": 6.64887418344983,
    "199905": 12.4825940009938,
    "199906": 18.6097842396113,
    "199907": 24.318817637778,
    "199908": 20.892966640922,
    "199909": 15.2460178359725,
    "199910": 5.37214900923093,
    "199911": -3.94559904567139,
    "199912": -13.1201753452356,
    "200001": -19.1030736305323,
    "200002": -13.5096400881974,
    "200003": -3.87596338961345,
    "200004": 5.42818873393441,
    "200005": 14.4595480970601,
    "200006": 21.1697590583826,
    "200007": 23.3939194708654,
    "200008": 22.4560168898168,
    "200009": 15.6835960079016,
    "200010": 5.33414761875174,
    "200011": -6.97706680859187,
    "200012": -17.8210475121602,
    "200101": -20.1388059253368,
    "200102": -14.787474221591,
    "200103": -5.32133407793727,
    "200104": 7.82743154160904,
    "200105": 15.2847985253454,
    "200106": 19.9560671099202,
    "200107": 23.0852386581602,
    "200108": 21.2225749829277,
    "200109": 14.5846078492648,
    "200110": 7.95760390739298,
    "200111": -2.42007397024145,
    "200112": -13.4813095639178,
    "200201": -13.5939068387526,
    "200202": -7.34173525981689,
    "200203": -0.233700028014121,
    "200204": 7.5959529414626,
    "200205": 15.737571692425,
    "200206": 17.7341122175261,
    "200207": 21.3937887420561,
    "200208": 18.9262384055316,
    "200209": 15.1341121459354,
    "200210": 3.86838359758175,
    "200211": -9.91009024156768,
    "200212": -14.9391234738333,
    "200301": -15.8558837609752,
    "200302": -11.1172298454566,
    "200303": -0.884583456864082,
    "200304": 8.91731797636618,
    "200305": 15.3175305752245,
    "200306": 20.466183886004,
    "200307": 20.6435510291096,
    "200308": 20.1389982491099,
    "200309": 15.7656182772486,
    "200310": 6.80093328934769,
    "200311": -5.35017192739297,
    "200312": -12.4382605970563,
    "200401": -16.161431016735,
    "200402": -10.6591540364631,
    "200403": -3.24075662433218,
    "200404": 6.0265178749232,
    "200405": 14.0593823358904,
    "200406": 20.735180352674,
    "200407": 21.4848367883347,
    "200408": 20.3676144741795,
    "200409": 16.2031984840744,
    "200410": 7.86029215529169,
    "200411": -1.50282217862262,
    "200412": -15.6096590048404,
    "200501": -15.5840144289644,
    "200502": -14.9848092760602,
    "200503": -4.39028028412153,
    "200504": 6.17825423405859,
    "200505": 11.9487828150888,
    "200506": 21.3162182720097,
    "200507": 21.5767493114552,
    "200508": 21.252397376787,
    "200509": 15.6453687547535,
    "200510": 7.07448287423985,
    "200511": -2.81564207585917,
    "200512": -16.3414678602839,
    "200601": -17.1414624049409,
    "200602": -12.6045068798972,
    "200603": -4.45483560996889,
    "200604": 3.85045858043847,
    "200605": 15.47886914675,
    "200606": 17.9060071662806,
    "200607": 22.2829811213763,
    "200608": 22.226202191106,
    "200609": 15.4316028552512,
    "200610": 6.60720356709298,
    "200611": -4.98519541059605,
    "200612": -12.3402560494168,
    "200701": -10.9233167248898,
    "200702": -9.20223256995648,
    "200703": -4.76507837858786,
    "200704": 5.80471054000558,
    "200705": 13.0112096980299,
    "200706": 21.3219405329277,
    "200707": 21.8972189813208,
    "200708": 21.9011801370854,
    "200709": 15.8642576070777,
    "200710": 6.84349337906247,
    "200711": -4.25115765802184,
    "200712": -10.8667010111846,
    "200801": -16.86872285839,
    "200802": -10.4305610512003,
    "200803": 1.29352780411865,
    "200804": 10.0219535910654,
    "200805": 11.8595057186186,
    "200806": 20.5839131697934,
    "200807": 22.8916266610283,
    "200808": 20.740608936226,
    "200809": 15.5914179874009,
    "200810": 7.62394885056038,
    "200811": -5.24272307162044,
    "200812": -11.8187382120454,
    "200901": -16.0879528274568,
    "200902": -12.8669374095161,
    "200903": -5.1243527484677,
    "200904": 7.62148689264423,
    "200905": 16.9129819658406,
    "200906": 17.1350830370896,
    "200907": 20.5713803492934,
    "200908": 20.8143828218523,
    "200909": 14.479100648962,
    "200910": 6.74166935650382,
    "200911": -7.2839048133726,
    "200912": -17.4216048535113,
    "201001": -16.5181374821615,
    "201002": -15.3885911570271,
    "201003": -7.43819498355497,
    "201004": 3.40319074928322,
    "201005": 14.8994169980266,
    "201006": 23.5336468263204,
    "201007": 22.1642059968242,
    "201008": 21.5261399863279,
    "201009": 15.3014670035077,
    "201010": 6.19597334215438,
    "201011": -3.88975373294606,
    "201012": -16.293905392013,
    "201101": -18.6383442034344,
    "201102": -10.8242112910774,
    "201103": -3.8526875769709,
    "201104": 5.43990886038269,
    "201105": 12.9026193582865,
    "201106": 18.6249206829726,
    "201107": 23.1166580814646,
    "201108": 21.7815000953305,
    "201109": 14.3171715345182,
    "201110": 8.61500022260052,
    "201111": -3.90527000147059,
    "201112": -14.6971294559892,
    "201201": -19.2170327644329,
    "201202": -13.4485062366396,
    "201203": -4.3542751386122,
    "201204": 6.37062930705704,
    "201205": 15.1209917984733,
    "201206": 20.0819674887397,
    "201207": 22.3468330059722,
    "201208": 21.3503007529542,
    "201209": 15.9626167333807,
    "201210": 5.77970646814959,
    "201211": -4.65404081306043,
    "201212": -18.0524462383532,
    "201301": -19.1056120310673,
    "201302": -14.5711372636033,
    "201303": -5.98225236920353,
    "201304": 3.55438380466907,
    "201305": 15.9106076632177,
    "201306": 20.8662264689511,
    "201307": 22.5672709350665,
    "201308": 21.2276419823929,
    "201309": 14.8433802685303,
    "201310": 6.71203289603078,
    "201311": -1.96737054471101,
    "201312": -13.6724142159894,
    "201401": -17.5567513764841,
    "201402": -14.1885294124841,
    "201403": -2.02247094749546,
    "201404": 9.14324976172626,
    "201405": 13.6206002379863,
    "201406": 20.9092332365779,
    "201407": 22.482804206026,
    "201408": 20.894762804907,
    "201409": 14.7171206805914,
    "201410": 5.90892831276899,
    "201411": -2.13542997775412,
    "201412": -16.0693780796525,
    "201501": -14.1750510539983,
    "201502": -10.166207283683,
    "201503": -0.572699643662685,
    "201504": 7.61793225234899,
    "201505": 13.05827363592,
    "201506": 20.0776463284231,
    "201507": 21.8858386062309,
    "201508": 21.4283733605604,
    "201509": 15.3202416961285,
    "201510": 6.31968932349534,
    "201511": -5.40949708189554,
    "201512": -12.3320910323936,
    "201601": -17.0315551683299,
    "201602": -12.5566674623577,
    "201603": -0.975391675817315,
    "201604": 6.60310537941138,
    "201605": 14.2842873464881,
    "201606": 18.1668820050165,
    "201607": 22.8188637364763,
    "201608": 21.5656483689152,
    "201609": 15.7290895231049,
    "201610": 4.21123821157741,
    "201611": -9.20460399980297,
    "201612": -12.3938321119691,
    "201701": -14.3028140784659,
    "201702": -9.47902238990518,
    "201703": -1.47084530158176,
    "201704": 7.76431746206703,
    "201705": 15.2819791571205,
    "201706": 18.2186029473995,
    "201707": 23.2990204977831,
    "201708": 20.8530229816558,
    "201709": 14.9739486281703,
    "201710": 5.59989002810805,
    "201711": -6.06464857624191,
    "201712": -15.9644998322323,
    "201801": -17.3453689520356,
    "201802": -14.982387981793,
    "201803": -3.30781445828037,
    "201804": 7.83733928654644,
    "201805": 14.9377637771225,
    "201806": 19.5411789153982,
    "201807": 23.980412305595,
    "201808": 19.6609498090851,
    "201809": 14.6910227631623,
    "201810": 7.88228395425143,
    "201811": -3.2221266175684,
    "201812": -12.0686651386512,
    "201901": -12.3326921416886,
    "201902": -8.66135822504557,
    "201903": -0.131484663069162,
    "201904": 7.71111855586347,
    "201905": 15.185935796864,
    "201906": 18.0235058578736,
    "201907": 22.5231228868017,
    "201908": 20.1451252374758,
    "201909": 15.9761182329917,
    "201910": 7.96678551380932,
    "201911": -5.27560214569764,
    "201912": -14.7560434953234,
    "202001": -13.8535281381279,
    "202002": -10.4773745109542,
    "202003": -0.898360809957969,
    "202004": 6.12593835509329,
    "202005": 14.5690819463134,
    "202006": 17.7175124289082,
    "202007": 22.8796166491393,
    "202008": 20.8529949000312,
    "202009": 15.7087979425659,
    "202010": 7.02369481992306,
    "202011": -3.37084803211951,
    "202012": -14.2457753700818
}

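With 27 years of data (1996 to 2022), we have a reasonably sufficient sample.

Next, we want to use the Pearson correlation coefficient to measure how strongly the temperature trends of different cities within the same province are correlated, and present the result as a heatmap. The core code is as follows: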

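# Heatmap: one correlation heatmap per province
# Note: relationAnalysis is the author's own plotting helper (not shown here),
# and the "heatmap/" output directory is assumed to exist already.
for one_pcode in pc_code_map:
    one_code_list = pc_code_map[one_pcode]
    print(one_pcode, one_code_list)
    # City names serve as the axis labels of the heatmap
    data_factors = [city_name_map[one] for one in one_code_list]
    print(data_factors)
    matrix = []
    for one_ccode in one_code_list:
        one_dict = data_dict[one_ccode]
        # Sort by year-month key so every series is in chronological order
        one_sorted = sorted(one_dict.items(), key=lambda e: e[0])
        one_value_list = [one[1] for one in one_sorted]
        matrix.append(one_value_list)
    title = "Different City Temperature Relation Analysis HeatMap in The Same Province"
    relationAnalysis(
        matrix,
        data_factors,
        title,
        savepath="heatmap/" + pro_name_map[one_pcode] + ".png",
    )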
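
The relationAnalysis helper is not shown in the post. As a rough idea of what the correlation step involves, here is a minimal sketch in plain Python. It assumes each row of the matrix is one city's chronologically sorted series and that all rows have the same length. The names pearson and correlation_matrix are hypothetical, and the plotting part is left out.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(matrix):
    """Pairwise Pearson coefficients between the rows of `matrix`."""
    return [[pearson(a, b) for b in matrix] for a in matrix]


# First six months of 2022 for 七台河市 and 三亚市, taken from the sample table
qitaihe = [-17.17509676, -12.18510315, -1.651742341,
           8.299090227, 13.81210092, 18.89282461]
sanya = [22.08995545, 20.97310047, 24.48222669,
         25.00777376, 26.56800089, 28.24468505]
corr = correlation_matrix([qitaihe, sanya])
print(round(corr[0][1], 3))
```

The resulting square matrix is what a heatmap would display: the diagonal is 1 (each city against itself), and the off-diagonal cells show how closely two cities move together.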

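Since there are many provinces, only a few examples are shown here:
[Fujian Province (福建省)]

[Figure: correlation heatmap for cities in Fujian Province]

[Guangdong Province (广东省)]

[Figure: correlation heatmap for cities in Guangdong Province]

[Henan Province (河南省)]

[Figure: correlation heatmap for cities in Henan Province]

[Heilongjiang Province (黑龙江省)]

[Figure: correlation heatmap for cities in Heilongjiang Province]

I have ordered these examples roughly from south to north. It is very clear that the temperature trends of different cities within the same province are strongly positively correlated, which is not surprising.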

Next, we can visualize the data for each individual city. The core code is as follows:

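import matplotlib.pyplot as plt

# Assumes the "picture/" output directory already exists
for one_city in data_dict:
    one_dict = data_dict[one_city]
    # Sort by year-month so the curve runs in chronological order
    one_sorted = sorted(one_dict.items(), key=lambda e: e[0])
    one_value_list = [one[1] for one in one_sorted]
    one_tick_list = [one[0] for one in one_sorted]
    print("one_city: ", one_city, ", one_num: ", len(one_value_list))
    # One new figure per city; it is closed again after saving
    plt.figure(figsize=(12, 6))
    plt.plot(one_value_list)
    # Label only every 12th point (one tick per year) to keep the axis readable
    tick_positions = list(range(0, len(one_value_list), 12))
    plt.xticks(tick_positions, [one_tick_list[i] for i in tick_positions], rotation=60)
    plt.title(str(one_city) + " Temperature Curve")
    plt.savefig("picture/" + str(one_city) + ".png")
    plt.close()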

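Since there are many cities, only a few examples are shown here:

[Figure: temperature curve for a sample city]

[15900]

[Figure: temperature curve for city 15900]

[44200]

[Figure: temperature curve for city 44200]

[360400]

[Figure: temperature curve for city 360400]

That wraps up the preliminary visual analysis. Next, we want to build a model on each city's monthly average temperatures to predict the averages for future periods. The model chosen here is a random forest.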

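Random forest (RF) is a widely used machine learning model that performs well on both classification and regression problems. It is built on the idea of ensemble learning, specifically the bagging method. The key points of how the model is constructed are:

1. Ensemble learning: a random forest is a special case of ensemble learning, whose main idea is to combine multiple "weak learners" into a "strong learner". In a random forest the weak learners are decision trees. By combining the predictions of many trees (averaging for regression, majority vote for classification), a random forest improves stability and accuracy.
2. Bootstrap sampling: when building each decision tree, a random forest draws samples from the original dataset at random with replacement and uses them as that tree's training set. Each tree therefore trains on a different set that still carries part of the information in the original data. Averaging over trees trained on different samples reduces variance and lowers the risk of overfitting.
3. Feature selection: random forests also randomize the choice of features while building each tree. At every node, the best split is chosen from a random subset of the features rather than from all of them. For classification the subset size is commonly set to the square root of the total number of features; for regression, scikit-learn's RandomForestRegressor considers all features by default. This makes the trees more diverse and further improves generalization.
4. Building the decision trees: using the sampled training sets and the feature selection method above, multiple decision trees are built. Each tree is grown as far as possible until a stopping condition is met (for example, the number of samples in a leaf falls below a preset threshold, or a preset maximum depth is reached).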
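
The bootstrap sampling step in point 2 can be illustrated with a few lines of plain Python. This is only a sketch of the sampling idea, not how scikit-learn implements it internally; the function name bootstrap_sample and the fixed seed are assumptions made for the example.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from `data` uniformly at random, with replacement."""
    return [rng.choice(data) for _ in range(len(data))]


rng = random.Random(42)       # fixed seed so the run is repeatable
data = list(range(1000))      # stand-in for the indices of 1000 training samples

# Each tree in the forest would get its own bootstrap sample
samples = [bootstrap_sample(data, rng) for _ in range(3)]
for i, sample in enumerate(samples):
    unique_fraction = len(set(sample)) / len(data)
    print(f"tree {i}: {len(sample)} draws, {unique_fraction:.1%} distinct samples")
```

Because the draws are with replacement, each sample contains on average about 63% of the distinct original rows (1 - 1/e for large n), so every tree sees a slightly different view of the data.
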
With sklearn, setting up and training the model takes only a few lines:

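from sklearn.ensemble import RandomForestRegressor

# splitData is the author's own helper (not shown); ratio=0.10 presumably holds out 10% of the samples for testing
X_train, y_train, X_test, y_test = splitData(X, y, ratio=0.10)
model = RandomForestRegressor(n_estimators=100)  # a forest of 100 trees
model.fit(X_train, y_train)
y_pred = model.predict(X_test)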
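
The post does not show how X and y are built or what splitData does. One common setup for this kind of series, given here purely as an assumption, is to use the previous 12 monthly values as features for predicting the next month, and to hold out the last part of the samples in time order. The function make_windows, the window length of 12, and this version of splitData are all hypothetical.

```python
def make_windows(series, window=12):
    """Turn a 1-D series into (features, target) pairs with a sliding window."""
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])   # the previous `window` values
        y.append(series[i + window])     # the value that follows them
    return X, y


def splitData(X, y, ratio=0.10):
    """Chronological split: the last `ratio` share of samples is the test set."""
    n_test = max(1, int(len(X) * ratio))
    split = len(X) - n_test
    return X[:split], y[:split], X[split:], y[split:]


# Toy series standing in for one city's chronologically sorted monthly values
series = [float(v) for v in range(36)]
X, y = make_windows(series, window=12)
X_train, y_train, X_test, y_test = splitData(X, y, ratio=0.10)
print(len(X_train), len(X_test))
```

Splitting by time rather than shuffling keeps future months out of the training set, which matters for a forecasting task.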

After predicting, we evaluate how well the predictions fit the true values, mainly by computing MSE and R². The code is as follows:

import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor

X_train, y_train, X_test, y_test = splitData(X, y, ratio=0.10)
model = RandomForestRegressor(n_estimators=100)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# evaluation is the author's own helper (not shown); index 2 holds the MSE and the last element holds R2
eva = evaluation(y_test, y_pred)
info = "MSE: " + str(eva[2]) + ", R2: " + str(eva[-1])
# Plot the true and predicted curves together, with the metrics in the title
plt.figure(figsize=(12, 6))
plt.plot(y_test, label="True Data Curve")
plt.plot(y_pred, label="Predict Data Curve")
plt.title(info)
plt.legend(loc="upper right", ncol=1)
plt.show()
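
The evaluation helper is not shown either. Since the code reads the MSE from index 2 and R² from the last position, one layout consistent with that indexing is (MAE, RMSE, MSE, R²). The sketch below follows that guess in plain Python; the tuple layout is an assumption, not the author's actual function.

```python
import math


def evaluation(y_true, y_pred):
    """Return (MAE, RMSE, MSE, R2) for two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R2 = 1 - SS_res / SS_tot; undefined when the true values are constant
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return mae, rmse, mse, r2


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
eva = evaluation(y_true, y_pred)
print("MSE:", eva[2], ", R2:", eva[-1])
```

A lower MSE and an R² closer to 1 both indicate that the predicted curve tracks the true curve more closely.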

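The result looks like this:

[Figure: true vs. predicted monthly average temperature on the test set]

Finally, we also predicted the monthly average temperatures for the coming year:

[Figure: predicted monthly average temperature for the next year]

Judging by the trend, the forecast looks quite reasonable.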
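
The post does not say how the 12 future months were produced. A common approach with a one-step-ahead model is recursive forecasting: predict one month, append the prediction to the history, and repeat. The sketch below assumes a model that takes the last 12 values as input. The names recursive_forecast and seasonal_naive (a stand-in predictor) are hypothetical; a fitted model could be plugged in with something like `lambda w: model.predict([w])[0]`.

```python
def recursive_forecast(history, predict_one, steps=12, window=12):
    """Forecast `steps` values by feeding each prediction back in as input."""
    if len(history) < window:
        raise ValueError("history must contain at least `window` values")
    values = list(history)
    forecasts = []
    for _ in range(steps):
        next_value = predict_one(values[-window:])  # one-step-ahead prediction
        forecasts.append(next_value)
        values.append(next_value)                   # prediction becomes history
    return forecasts


def seasonal_naive(window_values):
    """Stand-in model: repeat the value from 12 months earlier."""
    return window_values[0]


# 七台河市 (230900), January to December 2022, from the data shown earlier
last_year = [-17.17509676, -12.18510315, -1.651742341, 8.299090227,
             13.81210092, 18.89282461, 22.5161290322581, 21.2258064516129,
             15.4, 5.46666666666667, -6.89285714285714, -13.3225806451613]
forecast_2023 = recursive_forecast(last_year, seasonal_naive)
print([round(v, 2) for v in forecast_2023])
```

Errors accumulate in recursive forecasts because later predictions depend on earlier ones, so the further out the horizon, the less reliable the values tend to be.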

If you are interested, give it a try yourself!