本文介绍使用pandas对清洗完的数据进行整理以便后续的统计和分析工作,主要包括数据表的合并、排序、数值分列、数据分组及标记等工作。
一、数据表合并
内连接:
import pandas as pd |
id int64
date datetime64[ns]
city object
category object
age int64
price float64
gender object
pay object
m-point int64
dtype: object
[[1001 Timestamp(‘2013-01-02 00:00:00’) ‘Beijing ‘ ‘100-A’ 23 1200.0
‘male’ ‘Y’ 10]
[1002 Timestamp(‘2013-01-03 00:00:00’) ‘SH’ ‘100-B’ 44 nan ‘female’ ‘N’
12]
[1003 Timestamp(‘2013-01-04 00:00:00’) ‘ guangzhou ‘ ‘110-A’ 54 2133.0
‘male’ ‘Y’ 20]
[1004 Timestamp(‘2013-01-05 00:00:00’) ‘Shenzhen’ ‘110-C’ 32 5433.0
‘female’ ‘Y’ 40]
[1005 Timestamp(‘2013-01-06 00:00:00’) ‘shanghai’ ‘210-A’ 34 nan
‘male’ ‘N’ 40]
[1006 Timestamp(‘2013-01-07 00:00:00’) ‘BEIJING ‘ ‘130-F’ 32 4432.0
‘female’ ‘Y’ 40]]
左连接:df_left=pd.merge(df,df1,how='left')
右连接df_right=pd.merge(df,df1,how='right')
外连接:df_outer=pd.merge(df,df1,how='outer')
二、设置索引列
import pandas as pd |
三、排序(按索引、按数值)
import pandas as pd |
id date city category age price gender pay m-point
0 1001 2013-01-02 Beijing 100-A 23 1200.0 male Y 10
3 1004 2013-01-05 Shenzhen 110-C 32 5433.0 female Y 40
5 1006 2013-01-07 BEIJING 130-F 32 4432.0 female Y 40
4 1005 2013-01-06 shanghai 210-A 34 NaN male N 40
1 1002 2013-01-03 SH 100-B 44 NaN female N 12
2 1003 2013-01-04 guangzhou 110-A 54 2133.0 male Y 20
id date city category age price gender pay m-point
0 1001 2013-01-02 Beijing 100-A 23 1200.0 male Y 10
1 1002 2013-01-03 SH 100-B 44 NaN female N 12
2 1003 2013-01-04 guangzhou 110-A 54 2133.0 male Y 20
3 1004 2013-01-05 Shenzhen 110-C 32 5433.0 female Y 40
4 1005 2013-01-06 shanghai 210-A 34 NaN male N 40
5 1006 2013-01-07 BEIJING 130-F 32 4432.0 female Y 40
四、数据分组
import pandas as pd |
id date city category age price gender pay m-point priceLevel
0 1001 2013-01-02 Beijing 100-A 23 1200.0 male Y 10 low
1 1002 2013-01-03 SH 100-B 44 NaN female N 12 low
2 1003 2013-01-04 guangzhou 110-A 54 2133.0 male Y 20 low
3 1004 2013-01-05 Shenzhen 110-C 32 5433.0 female Y 40 high
4 1005 2013-01-06 shanghai 210-A 34 NaN male N 40 low
5 1006 2013-01-07 BEIJING 130-F 32 4432.0 female Y 40 high
id date city category age price gender pay m-point priceLevel highFemale
0 1001 2013-01-02 Beijing 100-A 23 1200.0 male Y 10 low NaN
1 1002 2013-01-03 SH 100-B 44 NaN female N 12 low NaN
2 1003 2013-01-04 guangzhou 110-A 54 2133.0 male Y 20 low NaN
3 1004 2013-01-05 Shenzhen 110-C 32 5433.0 female Y 40 high 1.0
4 1005 2013-01-06 shanghai 210-A 34 NaN male N 40 low NaN
5 1006 2013-01-07 BEIJING 130-F 32 4432.0 female Y 40 high 1.0
五、数据分列
import pandas as pd |
id date city category age price gender pay m-point
0 1001 2013-01-02 Beijing 100-A 23 1200.0 male Y 10
1 1002 2013-01-03 SH 100-B 44 NaN female N 12
2 1003 2013-01-04 guangzhou 110-A 54 2133.0 male Y 20
3 1004 2013-01-05 Shenzhen 110-C 32 5433.0 female Y 40
4 1005 2013-01-06 shanghai 210-A 34 NaN male N 40
5 1006 2013-01-07 BEIJING 130-F 32 4432.0 female Y 40
id date city category age price gender pay m-point category_id size
0 1001 2013-01-02 Beijing 100-A 23 1200.0 male Y 10 100 A
1 1002 2013-01-03 SH 100-B 44 NaN female N 12 100 B
2 1003 2013-01-04 guangzhou 110-A 54 2133.0 male Y 20 110 A
3 1004 2013-01-05 Shenzhen 110-C 32 5433.0 female Y 40 110 C
4 1005 2013-01-06 shanghai 210-A 34 NaN male N 40 210 A
5 1006 2013-01-07 BEIJING 130-F 32 4432.0 female Y 40 130 F