从本文你将学到如何利用pandas模块下的concat函数进行数据拼接。
pd.concat
背景
现在我有136price202006temp, 136price202007temp, 136price202008temp, 136price202009temp, 136price2020010temp, 136price2020011temp六份数据,我想要将其上下拼接起来形成一张总表,各表名和表头如下表所示
表名 | 表头 |
136price202006temp | newdiskid, plate, newdiskname, lon, lat, ras202005, p0, label |
136price202007temp | newdiskid, plate, newdiskname, lon, lat, price202006, p0, label |
136price202008temp | newdiskid, plate, newdiskname, lon, lat, price202007, p0, label |
136price202009temp | newdiskid, plate, newdiskname, lon, lat, price202008, p0, label |
136price2020010temp | newdiskid, plate, newdiskname, lon, lat, price202009, p0, label |
136price2020011temp | newdiskid, plate, newdiskname, lon, lat, price2020010, p0, label |
从表中可以看到每个表一共有8个字段,其中7个相同,1个不同,具体要求如下
- 关键是把p0拼接,表示各个月份算出来的价格,希望把下面7个字段对齐;
表名 | 字段 |
df | newdiskid, plate, newdiskname, lon, lat, p0,label |
- 每个表增加表示时间的字段pricedate
- 增加表示计算人的字段computman;
- 将所有label为1或者2的改成0,label为0的改成1;
完整代码
# -*- coding: utf-8 -*-
"""
Project_name:join
Description:
Created on Tue Nov 10 14:44:44 2020
@author: 帅帅de三叔
"""
import pandas as pd #导入数据分析模块
import pymysql #导入连接MySQL模块
data202006 = pd.read_excel("136price202006temp.xlsx") #读取数据
data202006["pricedate"]= ["2020-06-01"]*len(data202006) #增加一个pricedate的字段
data202007 = pd.read_excel("136price202007temp.xlsx")
data202007["pricedate"]= ["2020-07-01"]*len(data202007)
data202008 = pd.read_excel("136price202008temp.xlsx")
data202008["pricedate"]= ["2020-08-01"]*len(data202008)
data202009 = pd.read_excel("136price202009temp.xlsx")
data202009["pricedate"]= ["2020-09-01"]*len(data202009)
data2020010 = pd.read_excel("136price2020010temp.xlsx")
data2020010["pricedate"]= ["2020-10-01"]*len(data2020010)
data2020011 = pd.read_excel("136price2020011temp.xlsx")
data2020011["pricedate"]= ["2020-11-01"]*len(data2020011)
df = pd.concat([data202006, data202007, data202008, data202009, data2020010, data2020011],join = "inner", axis = 0, ignore_index=True)[["newdiskid", "newdiskname", "pricedate","p0", "label"]]
df["computman"] = ["zbw"]*len(df) #新增computman字段
df["remark"] = [""]*len(df) #新增remark字段
df["label"] = [2 if i == 1 or i == 2 else 0 for i in df["label"]] #推导式语句,如果为1或2则统一标2,其余标0
df["label"] = [0 if i == 2 else 1 for i in df["label"]] #接上一句,如果为2赋值为0,反之为1
df.columns = ["newdiskid", "newdiskname", "pricedate", "price", "casetype", "computman", "remark"] #重新给表头
df = df[["newdiskid", "newdiskname","pricedate", "price", "computman", "casetype", "remark"]] #调整一下表头顺序
df.to_excel("136各楼盘各月份基价.xlsx") #保存本地文件夹下
结果预览
代码解读
主要用到panda.concat()函数,其基本调用方式如下
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True)
参数解释
- objs
需要连接的对象,如series,dataframe或者是panel构成的序列list ,常用是两个数据框 [df1, df2],我这里给了6个数据框;
df = pd.concat([data202006, data202007, data202008, data202009, data2020010, data2020011]) #拼接6个数据框
- axis
表示拼接的方向,axis = 0, 表示在按行拼接,即上下方向; axis = 1, 表示按列方向拼接,即左右方向;
df = pd.concat([data202006, data202007, data202008, data202009, data2020010, data2020011], axis = 0) #按行拼接
- join
表示连接方式,outer, 表示并集,即全部需要; inner,表示交集,只取重合的部分;
df = pd.concat([data202006, data202007, data202008, data202009, data2020010, data2020011],join = "inner", axis = 0)#取交集
注: 取交集后ras202005,price202006,price202007, price202008, price202009, price2020010, price2020011这几个字段就会被干掉;
- join_axes
传入需要保留的index
- ignore_index
忽略需要连接的数据框本身的index,当原本的index没有特别意义的时候可以使用,尤其这些数据框的index不完全一致的时候特别有用;
df = pd.concat([data202006, data202007, data202008, data202009, data2020010, data2020011],join = "inner", axis = 0, ignore_index=True) #选择忽略
- key
keys参数可以用来标注合并后的表数据来源,比如我这里可以给6个key值
df = pd.concat([data202006, data202007, data202008, data202009, data2020010, data2020011],join = "inner", axis = 0, ignore_index=True, keys=['06', ',07', '08','09', ',010', '011']) #标注数据来源
- levels
序列列表,用于构建MultiIndex的特定级别(唯一值),否则,它们将从index推断。
- copy
bool值,表示要不要复制数据,默认为True,如果为False,则不要复制数据。
后面2个参数用的比较少
参考文献
1,https://jingyan.baidu.com/article/91f5db1b79205a1c7f05e3ae.html
2,https://www.jianshu.com/p/421f040dfe2f