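Python can inspect data tables with pandas. When a dataset is too large to open in everyday tools, the pandas module can still give an overview of it: the table's dimensions, its memory footprint, the data type of each column, and whether it contains missing values or duplicate rows. This overview prepares the ground for later cleaning and preprocessing.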
1. View the data dimensions
import pandas as pd
If reading the file fails with "ValueError: Excel file format cannot be determined, you must specify an engine manually.", specify the engine explicitly when reading the file:
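As a minimal sketch of that fix, the helper below passes the engine argument to pd.read_excel. The helper name load_table and the default of openpyxl are assumptions made for illustration: openpyxl reads .xlsx workbooks and has to be installed separately, while legacy .xls files need xlrd instead.

```python
import pandas as pd


def load_table(path, engine="openpyxl"):
    """Read an Excel sheet into a DataFrame, naming the parser explicitly.

    load_table is a hypothetical helper; engine="openpyxl" assumes an .xlsx file.
    """
    return pd.read_excel(path, engine=engine)
```

Naming the engine removes the guesswork pandas otherwise does from the file's extension and contents.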
import pandas as pd
(6, 6)
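The (6, 6) result is what the shape attribute of a DataFrame reports: a (rows, columns) tuple. The sketch below assumes a small in-memory table rebuilt from the values printed later in this article, standing in for the original Excel file.

```python
import numpy as np
import pandas as pd

# Assumption: sample table rebuilt from the output printed in this article.
df = pd.DataFrame({
    "id": [1001, 1002, 1003, 1004, 1005, 1006],
    "date": pd.date_range("2024-01-02", periods=6, freq="D"),
    "city": ["东莞", "深圳", "广州", "北京", "上海", "南京"],
    "category": ["100-A", "100-B", "110-A", "110-C", "210-A", "130-F"],
    "age": [23, 44, 54, 32, 34, 32],
    "price": [1200.0, np.nan, 2133.0, 5433.0, np.nan, 4432.0],
})

n_rows, n_cols = df.shape  # shape is an attribute, not a method
print(df.shape)            # (6, 6)
print(df.size)             # total number of cells: 36
```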
2. View the table summary
import pandas as pd
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   id        6 non-null      int64
 1   date      6 non-null      datetime64[ns]
 2   city      6 non-null      object
 3   category  6 non-null      object
 4   age       6 non-null      int64
 5   price     4 non-null      float64
dtypes: datetime64[ns](1), float64(1), int64(2), object(2)
memory usage: 416.0+ bytes
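A summary in this format comes from DataFrame.info(). The sketch below, again on an assumed in-memory copy of the sample table, also captures the report as a string and asks for a deep memory measurement. The trailing "+" in "416.0+ bytes" marks a lower bound, because the contents of object columns are not counted unless memory_usage="deep" is requested.

```python
import io

import numpy as np
import pandas as pd

# Assumption: sample table rebuilt from the output printed in this article.
df = pd.DataFrame({
    "id": [1001, 1002, 1003, 1004, 1005, 1006],
    "date": pd.date_range("2024-01-02", periods=6, freq="D"),
    "city": ["东莞", "深圳", "广州", "北京", "上海", "南京"],
    "category": ["100-A", "100-B", "110-A", "110-C", "210-A", "130-F"],
    "age": [23, 44, 54, 32, 34, 32],
    "price": [1200.0, np.nan, 2133.0, 5433.0, np.nan, 4432.0],
})

df.info()  # prints the summary to stdout and returns None

# Capture the same report as text, this time with string contents measured too.
buf = io.StringIO()
df.info(buf=buf, memory_usage="deep")
summary = buf.getvalue()

non_null = df.count()                                # non-null values per column
total_bytes = int(df.memory_usage(deep=True).sum())  # index plus all columns
```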
3. View the data types
import pandas as pd
id                   int64
date        datetime64[ns]
city                object
category            object
age                  int64
price              float64
dtype: object
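The per-column listing above is what the dtypes attribute returns. As a sketch on the assumed sample table, the code below reads dtypes and then uses select_dtypes to pick out the numeric columns, which is often the next step before computing statistics.

```python
import numpy as np
import pandas as pd

# Assumption: sample table rebuilt from the output printed in this article.
df = pd.DataFrame({
    "id": [1001, 1002, 1003, 1004, 1005, 1006],
    "date": pd.date_range("2024-01-02", periods=6, freq="D"),
    "city": ["东莞", "深圳", "广州", "北京", "上海", "南京"],
    "category": ["100-A", "100-B", "110-A", "110-C", "210-A", "130-F"],
    "age": [23, 44, 54, 32, 34, 32],
    "price": [1200.0, np.nan, 2133.0, 5433.0, np.nan, 4432.0],
})

print(df.dtypes)            # one dtype per column
price_dtype = df["price"].dtype  # dtype of a single column

# Names of the columns holding integers or floats.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)         # ['id', 'age', 'price']
```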
4. Check for missing values
import pandas as pd
      id   date   city  category    age  price
0  False  False  False     False  False  False
1  False  False  False     False  False   True
2  False  False  False     False  False  False
3  False  False  False     False  False  False
4  False  False  False     False  False   True
5  False  False  False     False  False  False
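A Boolean table like this is what DataFrame.isnull() produces: True marks a missing cell. On a large table the full mask is hard to read, so the sketch below also counts the missing values per column and pulls out the affected rows. It assumes a cut-down copy of the sample table with only the id and price columns.

```python
import numpy as np
import pandas as pd

# Assumption: two columns of the sample table; the others are omitted for brevity.
df = pd.DataFrame({
    "id": [1001, 1002, 1003, 1004, 1005, 1006],
    "price": [1200.0, np.nan, 2133.0, 5433.0, np.nan, 4432.0],
})

mask = df.isnull()                     # same shape as df, True where a value is missing
null_counts = mask.sum()               # missing values per column
rows_with_null = df[mask.any(axis=1)]  # rows with at least one missing value

print(null_counts)
print(rows_with_null)
```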
5. View unique values
import pandas as pd
['东莞' '深圳' '广州' '北京' '上海' '南京']
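The array above lists the distinct values of the city column, which is what Series.unique() returns, in order of first appearance. The sketch below assumes a two-column copy of the sample table and adds the age column, since it contains a repeated value and shows the deduplication more clearly.

```python
import pandas as pd

# Assumption: two columns of the sample table; the others are omitted for brevity.
df = pd.DataFrame({
    "city": ["东莞", "深圳", "广州", "北京", "上海", "南京"],
    "age": [23, 44, 54, 32, 34, 32],
})

cities = df["city"].unique()   # distinct values, in order of first appearance
ages = df["age"].unique()      # 32 occurs twice in the column but once here
n_ages = df["age"].nunique()   # number of distinct values

print(cities)
print(ages)
print(n_ages)                  # 5
```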
6. View the table's values
import pandas as pd
[[1001 Timestamp('2024-01-02 00:00:00') '东莞' '100-A' 23 1200.0]
 [1002 Timestamp('2024-01-03 00:00:00') '深圳' '100-B' 44 nan]
 [1003 Timestamp('2024-01-04 00:00:00') '广州' '110-A' 54 2133.0]
 [1004 Timestamp('2024-01-05 00:00:00') '北京' '110-C' 32 5433.0]
 [1005 Timestamp('2024-01-06 00:00:00') '上海' '210-A' 34 nan]
 [1006 Timestamp('2024-01-07 00:00:00') '南京' '130-F' 32 4432.0]]
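This nested array is the table's contents as a NumPy array, as returned by the values attribute. The pandas documentation recommends the to_numpy() method for the same purpose, so the sketch below uses it on the assumed sample table. Because the columns mix integers, timestamps, strings and floats, the full array falls back to the generic object dtype, while a numeric-only selection keeps a numeric dtype.

```python
import numpy as np
import pandas as pd

# Assumption: sample table rebuilt from the output printed in this article.
df = pd.DataFrame({
    "id": [1001, 1002, 1003, 1004, 1005, 1006],
    "date": pd.date_range("2024-01-02", periods=6, freq="D"),
    "city": ["东莞", "深圳", "广州", "北京", "上海", "南京"],
    "category": ["100-A", "100-B", "110-A", "110-C", "210-A", "130-F"],
    "age": [23, 44, 54, 32, 34, 32],
    "price": [1200.0, np.nan, 2133.0, 5433.0, np.nan, 4432.0],
})

arr = df.to_numpy()                          # whole table, mixed types
numeric = df[["age", "price"]].to_numpy()    # numeric columns only

print(arr.shape)      # (6, 6)
print(arr.dtype)      # object
print(numeric.dtype)  # float64
```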
7. View column names
import pandas as pd
Index(['id', 'date', 'city', 'category', 'age', 'price'], dtype='object')
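The Index object above is the columns attribute of the DataFrame. As a sketch, the code below assumes an empty frame that carries only the sample table's column names, converts them to a plain list, and checks which expected columns are absent. The name discount is hypothetical and included only to show a failed lookup.

```python
import pandas as pd

# Assumption: an empty frame with the same column names as the sample table.
df = pd.DataFrame(columns=["id", "date", "city", "category", "age", "price"])

column_names = df.columns.tolist()   # plain Python list of names
has_price = "price" in df.columns    # membership test on the Index

# "discount" is a hypothetical column used to demonstrate a missing name.
expected = ["id", "price", "discount"]
missing = [name for name in expected if name not in df.columns]

print(column_names)
print(has_price)   # True
print(missing)     # ['discount']
```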
8. View the first rows of data
import pandas as pd
     id        date  city  category  age   price
0  1001  2024-01-02  东莞  100-A      23  1200.0
1  1002  2024-01-03  深圳  100-B      44     NaN
2  1003  2024-01-04  广州  110-A      54  2133.0
3  1004  2024-01-05  北京  110-C      32  5433.0
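DataFrame.head(n) returns the first n rows and defaults to 5 when n is omitted. The four rows shown above match head(4). Asking for more rows than the table holds, such as head(10) on this six-row table, simply returns every row. The sketch assumes a three-column copy of the sample table.

```python
import numpy as np
import pandas as pd

# Assumption: three columns of the sample table; the others are omitted for brevity.
df = pd.DataFrame({
    "id": [1001, 1002, 1003, 1004, 1005, 1006],
    "city": ["东莞", "深圳", "广州", "北京", "上海", "南京"],
    "price": [1200.0, np.nan, 2133.0, 5433.0, np.nan, 4432.0],
})

first_four = df.head(4)    # rows with ids 1001 to 1004
default_head = df.head()   # n defaults to 5
up_to_ten = df.head(10)    # only 6 rows exist, so all 6 come back

print(first_four)
print(len(default_head), len(up_to_ten))  # 5 6
```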
9. View the last rows of data
import pandas as pd
     id        date  city  category  age   price
2  1003  2024-01-04  广州  110-A      54  2133.0
3  1004  2024-01-05  北京  110-C      32  5433.0
4  1005  2024-01-06  上海  210-A      34     NaN
5  1006  2024-01-07  南京  130-F      32  4432.0
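DataFrame.tail(n) mirrors head(n) from the other end of the table, again defaulting to 5 rows. The four rows shown above match tail(4), and the rows keep their original index labels. The sketch assumes a three-column copy of the sample table.

```python
import numpy as np
import pandas as pd

# Assumption: three columns of the sample table; the others are omitted for brevity.
df = pd.DataFrame({
    "id": [1001, 1002, 1003, 1004, 1005, 1006],
    "city": ["东莞", "深圳", "广州", "北京", "上海", "南京"],
    "price": [1200.0, np.nan, 2133.0, 5433.0, np.nan, 4432.0],
})

last_four = df.tail(4)     # rows with ids 1003 to 1006
default_tail = df.tail()   # n defaults to 5

print(last_four)
print(list(last_four.index))  # [2, 3, 4, 5]
```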