Pandas: 如何读写文件

2个月前 210次点击 来自 其他

收录专题: Python边学边译

标签: Python

原文链接: Pandas: How to Read and Write Files

Pandas 是一个功能强大且灵活的Python软件包,可让您使用带标签的数据和时间序列数据。它还提供统计方法,启用绘图等功能。Pandas的一项重要功能是它具有读写Excel,CSV和许多其他类型文件的能力。诸如read_csv()方法之类的功能使您可以有效地处理文件。您可以使用它们将Pandas对象中的数据和标签保存到文件中,并在以后将它们作为Pandas的Series或DataFrame实例加载。

在本教程中,您将学习:

  • Pandas IO工具API初体验
  • 如何在文件中读写数据
  • 如何使用各种文件格式
  • 如何与大数据高效地一起工作

安装Pandas与准备资料

本教程中的代码是基于CPython 3.7.4和Pandas 0.25.1编写。

首先,使用pip安装它:

$ pip install pandas

在本教程中,您将使用与20个国家/地区有关的数据。这是您将使用的数据和源的概述:

  • 国家Country 由国家名称表示
  • 人口Population 以百万表示
  • 面积Area 以数千平方公里表示
  • 国内生产总值Gross domestic product 以百万美元表示
  • 大陆Continent 是非洲,亚洲,大洋洲,欧洲,北美或南美
  • 独立日Independence day 是纪念一个国家独立的日子

您可以使用嵌套字典在Python中组织这些数据:

data = {
    'CHN': {'COUNTRY': 'China', 'POP': 1_398.72, 'AREA': 9_596.96,
            'GDP': 12_234.78, 'CONT': 'Asia'},
    'IND': {'COUNTRY': 'India', 'POP': 1_351.16, 'AREA': 3_287.26,
            'GDP': 2_575.67, 'CONT': 'Asia', 'IND_DAY': '1947-08-15'},
    'USA': {'COUNTRY': 'US', 'POP': 329.74, 'AREA': 9_833.52,
            'GDP': 19_485.39, 'CONT': 'N.America',
            'IND_DAY': '1776-07-04'},
    'IDN': {'COUNTRY': 'Indonesia', 'POP': 268.07, 'AREA': 1_910.93,
            'GDP': 1_015.54, 'CONT': 'Asia', 'IND_DAY': '1945-08-17'},
    'BRA': {'COUNTRY': 'Brazil', 'POP': 210.32, 'AREA': 8_515.77,
            'GDP': 2_055.51, 'CONT': 'S.America', 'IND_DAY': '1822-09-07'},
    'PAK': {'COUNTRY': 'Pakistan', 'POP': 205.71, 'AREA': 881.91,
            'GDP': 302.14, 'CONT': 'Asia', 'IND_DAY': '1947-08-14'},
    'NGA': {'COUNTRY': 'Nigeria', 'POP': 200.96, 'AREA': 923.77,
            'GDP': 375.77, 'CONT': 'Africa', 'IND_DAY': '1960-10-01'},
    'BGD': {'COUNTRY': 'Bangladesh', 'POP': 167.09, 'AREA': 147.57,
            'GDP': 245.63, 'CONT': 'Asia', 'IND_DAY': '1971-03-26'},
    'RUS': {'COUNTRY': 'Russia', 'POP': 146.79, 'AREA': 17_098.25,
            'GDP': 1_530.75, 'IND_DAY': '1992-06-12'},
    'MEX': {'COUNTRY': 'Mexico', 'POP': 126.58, 'AREA': 1_964.38,
            'GDP': 1_158.23, 'CONT': 'N.America', 'IND_DAY': '1810-09-16'},
    'JPN': {'COUNTRY': 'Japan', 'POP': 126.22, 'AREA': 377.97,
            'GDP': 4_872.42, 'CONT': 'Asia'},
    'DEU': {'COUNTRY': 'Germany', 'POP': 83.02, 'AREA': 357.11,
            'GDP': 3_693.20, 'CONT': 'Europe'},
    'FRA': {'COUNTRY': 'France', 'POP': 67.02, 'AREA': 640.68,
            'GDP': 2_582.49, 'CONT': 'Europe', 'IND_DAY': '1789-07-14'},
    'GBR': {'COUNTRY': 'UK', 'POP': 66.44, 'AREA': 242.50,
            'GDP': 2_631.23, 'CONT': 'Europe'},
    'ITA': {'COUNTRY': 'Italy', 'POP': 60.36, 'AREA': 301.34,
            'GDP': 1_943.84, 'CONT': 'Europe'},
    'ARG': {'COUNTRY': 'Argentina', 'POP': 44.94, 'AREA': 2_780.40,
            'GDP': 637.49, 'CONT': 'S.America', 'IND_DAY': '1816-07-09'},
    'DZA': {'COUNTRY': 'Algeria', 'POP': 43.38, 'AREA': 2_381.74,
            'GDP': 167.56, 'CONT': 'Africa', 'IND_DAY': '1962-07-05'},
    'CAN': {'COUNTRY': 'Canada', 'POP': 37.59, 'AREA': 9_984.67,
            'GDP': 1_647.12, 'CONT': 'N.America', 'IND_DAY': '1867-07-01'},
    'AUS': {'COUNTRY': 'Australia', 'POP': 25.47, 'AREA': 7_692.02,
            'GDP': 1_408.68, 'CONT': 'Oceania'},
    'KAZ': {'COUNTRY': 'Kazakhstan', 'POP': 18.53, 'AREA': 2_724.90,
            'GDP': 159.41, 'CONT': 'Asia', 'IND_DAY': '1991-12-16'}
}

columns = ('COUNTRY', 'POP', 'AREA', 'GDP', 'CONT', 'IND_DAY')

接下来我们使用 data 创建一个 Pandas DataFrame实例。

import pandas as pd
df = pd.DataFrame(data=data)

data以国家/地区代码对应于列的方式进行组织。您可以DataFrame使用属性.T反转a的行和列:

>>> df.T
        COUNTRY      POP     AREA      GDP       CONT     IND_DAY
CHN       China  1398.72  9596.96  12234.8       Asia         NaN
IND       India  1351.16  3287.26  2575.67       Asia  1947-08-15
USA          US   329.74  9833.52  19485.4  N.America  1776-07-04
IDN   Indonesia   268.07  1910.93  1015.54       Asia  1945-08-17
BRA      Brazil   210.32  8515.77  2055.51  S.America  1822-09-07
PAK    Pakistan   205.71   881.91   302.14       Asia  1947-08-14
NGA     Nigeria   200.96   923.77   375.77     Africa  1960-10-01
BGD  Bangladesh   167.09   147.57   245.63       Asia  1971-03-26
RUS      Russia   146.79  17098.2  1530.75        NaN  1992-06-12
MEX      Mexico   126.58  1964.38  1158.23  N.America  1810-09-16
JPN       Japan   126.22   377.97  4872.42       Asia         NaN
DEU     Germany    83.02   357.11   3693.2     Europe         NaN
FRA      France    67.02   640.68  2582.49     Europe  1789-07-14
GBR          UK    66.44    242.5  2631.23     Europe         NaN
ITA       Italy    60.36   301.34  1943.84     Europe         NaN
ARG   Argentina    44.94   2780.4   637.49  S.America  1816-07-09
DZA     Algeria    43.38  2381.74   167.56     Africa  1962-07-05
CAN      Canada    37.59  9984.67  1647.12  N.America  1867-07-01
AUS   Australia    25.47  7692.02  1408.68    Oceania         NaN
KAZ  Kazakhstan    18.53   2724.9   159.41       Asia  1991-12-16

现在,您已经用有关每个国家/地区的数据填充了DataFrame对象。

低于3.6的Python版本无法保证字典中键的顺序。为了确保维护旧版本的Python和Pandas的列顺序,您可以指定index=columns:

>>> df = pd.DataFrame(data=data, index=columns).T

read_csv()和to_csv()

CSV文件的每一行代表一个表格行。默认情况下,同一行中的值用逗号分隔,但是您可以将分隔符更改为分号,制表符,空格或其他字符。

写CSV文件

您可以使用 .to_csv()DataFrame保存为CSV文件 :

>>> df.to_csv('data.csv')

打开本地工作目录中的 data.csv ,它如以下所示:

,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
CHN,China,1398.72,9596.96,12234.78,Asia,
IND,India,1351.16,3287.26,2575.67,Asia,1947-08-15
USA,US,329.74,9833.52,19485.39,N.America,1776-07-04
IDN,Indonesia,268.07,1910.93,1015.54,Asia,1945-08-17
BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
PAK,Pakistan,205.71,881.91,302.14,Asia,1947-08-14
NGA,Nigeria,200.96,923.77,375.77,Africa,1960-10-01
BGD,Bangladesh,167.09,147.57,245.63,Asia,1971-03-26
RUS,Russia,146.79,17098.25,1530.75,,1992-06-12
MEX,Mexico,126.58,1964.38,1158.23,N.America,1810-09-16
JPN,Japan,126.22,377.97,4872.42,Asia,
DEU,Germany,83.02,357.11,3693.2,Europe,
FRA,France,67.02,640.68,2582.49,Europe,1789-07-14
GBR,UK,66.44,242.5,2631.23,Europe,
ITA,Italy,60.36,301.34,1943.84,Europe,
ARG,Argentina,44.94,2780.4,637.49,S.America,1816-07-09
DZA,Algeria,43.38,2381.74,167.56,Africa,1962-07-05
CAN,Canada,37.59,9984.67,1647.12,N.America,1867-07-01
AUS,Australia,25.47,7692.02,1408.68,Oceania,
KAZ,Kazakhstan,18.53,2724.9,159.41,Asia,1991-12-16

该文本文件包含用逗号分隔的数据。第一列为行标签。如果您不想保留它们,则可以将参数传递index=False.to_csv()

读CSV文件

将数据保存到CSV文件后,您可能会不时加载和使用它。您可以使用Pandas read_csv() 函数来做到这一点:

>>> df = pd.read_csv('data.csv', index_col=0)
>>> df
 COUNTRY      POP      AREA       GDP       CONT     IND_DAY
CHN       China  1398.72   9596.96  12234.78       Asia         NaN
IND       India  1351.16   3287.26   2575.67       Asia  1947-08-15
USA          US   329.74   9833.52  19485.39  N.America  1776-07-04
IDN   Indonesia   268.07   1910.93   1015.54       Asia  1945-08-17
BRA      Brazil   210.32   8515.77   2055.51  S.America  1822-09-07
PAK    Pakistan   205.71    881.91    302.14       Asia  1947-08-14
NGA     Nigeria   200.96    923.77    375.77     Africa  1960-10-01
BGD  Bangladesh   167.09    147.57    245.63       Asia  1971-03-26
RUS      Russia   146.79  17098.25   1530.75        NaN  1992-06-12
MEX      Mexico   126.58   1964.38   1158.23  N.America  1810-09-16
JPN       Japan   126.22    377.97   4872.42       Asia         NaN
DEU     Germany    83.02    357.11   3693.20     Europe         NaN
FRA      France    67.02    640.68   2582.49     Europe  1789-07-14
GBR          UK    66.44    242.50   2631.23     Europe         NaN
ITA       Italy    60.36    301.34   1943.84     Europe         NaN
ARG   Argentina    44.94   2780.40    637.49  S.America  1816-07-09
DZA     Algeria    43.38   2381.74    167.56     Africa  1962-07-05
CAN      Canada    37.59   9984.67   1647.12  N.America  1867-07-01
AUS   Australia    25.47   7692.02   1408.68    Oceania         NaN
KAZ  Kazakhstan    18.53   2724.90    159.41       Asia  1991-12-16

参数index_col指定CSV文件中包含行标签的列。您为该参数分配一个从零开始的列索引。当CSV文件包含行标签的时间值,您应该确定index_col的值,以避免将它们作为数据加载。

使用 Pandas 读写 Excel

Microsoft Excel 可能是使用最广泛的电子表格软件。旧版本使用二进制.xls文件,而Excel 2007引入了新的基于XML的.xlsx文件。安装以下包即可读写Excel:

使用pip或者Conda安装它们:

$ pip install xlwt openpyxl xlsxwriter xlrd
$ conda install xlwt openpyxl xlsxwriter xlrd

或者,请根据您的需求安装某个包。

写Excel文件

如下所示:

>>> df.to_excel('data.xlsx')

该参数'data.xlsx'表示目标文件,也可以表示其路径。上面的语句应该data.xlsx在当前工作目录中创建文件。该文件应如下所示:

mmst-pandas-rw-files-excel

文件的第一列包含行的标签,而其他列存储数据。

读Excel文件

您可以使用read_excel()命令从Excel文件加载数据:

>>> df = pd.read_excel('data.xlsx', index_col=0)
>>> df
 COUNTRY      POP      AREA       GDP       CONT     IND_DAY
CHN       China  1398.72   9596.96  12234.78       Asia         NaN
IND       India  1351.16   3287.26   2575.67       Asia  1947-08-15
USA          US   329.74   9833.52  19485.39  N.America  1776-07-04
IDN   Indonesia   268.07   1910.93   1015.54       Asia  1945-08-17
BRA      Brazil   210.32   8515.77   2055.51  S.America  1822-09-07
PAK    Pakistan   205.71    881.91    302.14       Asia  1947-08-14
NGA     Nigeria   200.96    923.77    375.77     Africa  1960-10-01
BGD  Bangladesh   167.09    147.57    245.63       Asia  1971-03-26
RUS      Russia   146.79  17098.25   1530.75        NaN  1992-06-12
MEX      Mexico   126.58   1964.38   1158.23  N.America  1810-09-16
JPN       Japan   126.22    377.97   4872.42       Asia         NaN
DEU     Germany    83.02    357.11   3693.20     Europe         NaN
FRA      France    67.02    640.68   2582.49     Europe  1789-07-14
GBR          UK    66.44    242.50   2631.23     Europe         NaN
ITA       Italy    60.36    301.34   1943.84     Europe         NaN
ARG   Argentina    44.94   2780.40    637.49  S.America  1816-07-09
DZA     Algeria    43.38   2381.74    167.56     Africa  1962-07-05
CAN      Canada    37.59   9984.67   1647.12  N.America  1867-07-01
AUS   Australia    25.47   7692.02   1408.68    Oceania         NaN
KAZ  Kazakhstan    18.53   2724.90    159.41       Asia  1991-12-16

了解Pandas IO API

Pandas IO Tools 是一种API,可让您将SeriesDataFrame对象的内容保存到剪贴板,对象或各种类型的文件中。它还可以从剪贴板,对象或文件中加载数据。

写文件

SeriesDataFrame对象具有允许将数据和标签写入剪贴板或文件的方法。它们以 **.to_<file-type>()**格式命名, 其中 <file-type> 是目标文件的类型。

您已经了解了.to_csv().to_excel(), 但是还有其他一些,包括:

  • .to_json()
  • .to_html()
  • .to_sql()
  • .to_pickle()

您还可以写入更多文件类型,在此就不在详谈。

这些方法的参数指定保存数据和标签的目标文件路径。在某些情况下这是必需的,而在其他情况下则是可选的。如果此选项可用,而您选择忽略它,则方法将返回带有DataFrame实例内容的对象(如字符串或可迭代对象)。

可选参数compression决定如何使用数据和标签压缩文件。

读文件

它们以 **.read_<file-type>()**格式命名, 其中 <file-type> 是目标文件的类型。

您已经了解了.read_csv().read_excel(), 但是还有其他一些,包括:

  • read_json()
  • read_html()
  • read_sql()
  • read_pickle()

这些函数具有一个指定目标文件路径的参数。它可以是表示路径的任何有效字符串,无论是在本地计算机上还是在URL中。根据文件类型,其他对象也是可以接受的。

可选参数compression确定用于压缩文件的解压缩类型。

使用不同的文件类型

Pandas库提供了将数据保存到文件以及从文件加载数据的多种可能性。在本部分中,您将了解有关使用CSV和Excel文件的更多信息。您还将看到如何使用其他类型的文件,例如JSON,网页,数据库和Python pickle文件。

CSV文件

您已经学习了如何读写CSV文件。现在让我们更深入地研究细节。当您使用.to_csv()保存您的DataFrame,提供一个参数path_or_buff用于指定路径,名称和目标文件的扩展名。

You’ve already learned how to read and write CSV files. Now let’s dig a little deeper into the details. When you use .to_csv() to save your DataFrame, you can provide an argument for the parameter path_or_buff to specify the path, name, and extension of the target file.

path_or_buff.to_csv()得到的第一个参数.它可以是表示包含文件名及其扩展名的有效文件路径的任何字符串。但是,如果省略path_or_buff,则.to_csv()不会创建任何文件。相反,它将返回相应的字符串:

>>> df = pd.DataFrame(data=data).T
>>> s = df.to_csv()
>>> print(s)
,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
CHN,China,1398.72,9596.96,12234.78,Asia,
IND,India,1351.16,3287.26,2575.67,Asia,1947-08-15
USA,US,329.74,9833.52,19485.39,N.America,1776-07-04
IDN,Indonesia,268.07,1910.93,1015.54,Asia,1945-08-17
BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
PAK,Pakistan,205.71,881.91,302.14,Asia,1947-08-14
NGA,Nigeria,200.96,923.77,375.77,Africa,1960-10-01
BGD,Bangladesh,167.09,147.57,245.63,Asia,1971-03-26
RUS,Russia,146.79,17098.25,1530.75,,1992-06-12
MEX,Mexico,126.58,1964.38,1158.23,N.America,1810-09-16
JPN,Japan,126.22,377.97,4872.42,Asia,
DEU,Germany,83.02,357.11,3693.2,Europe,
FRA,France,67.02,640.68,2582.49,Europe,1789-07-14
GBR,UK,66.44,242.5,2631.23,Europe,
ITA,Italy,60.36,301.34,1943.84,Europe,
ARG,Argentina,44.94,2780.4,637.49,S.America,1816-07-09
DZA,Algeria,43.38,2381.74,167.56,Africa,1962-07-05
CAN,Canada,37.59,9984.67,1647.12,N.America,1867-07-01
AUS,Australia,25.47,7692.02,1408.68,Oceania,
KAZ,Kazakhstan,18.53,2724.9,159.41,Asia,1991-12-16

现在,您有了字符串s而不是CSV文件。您的DataFrame对象中还缺少一些值。例如,没有俄罗斯的大洲和几个国家(中国,日本等)的独立日。在数据科学和机器学习中,您必须谨慎处理缺失的值。默认情况下,Pandas使用NaN值替换缺少的值。

** 注意:** nan,代表“非数字”,是Python中一个特殊的浮点值。

您可以nan使用以下任何功能获取值:

df中与俄罗斯相对应的大陆是nan

>>> df.loc['RUS', 'CONT']
nan

本示例用于.loc[]获取具有指定行和列名称的数据。

当您将DataFrame文件保存到CSV文件时,空字符串('')将代表丢失的数据。您可以在文件data.csv和字符串s中看到它。如果要更改此行为,请使用可选参数na_rep

>>> df.to_csv('new-data.csv', na_rep='(missing)')

现在,'(missing)'文件中的字符串对应于中的nan值df。

,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
CHN,China,1398.72,9596.96,12234.78,Asia,(missing)
IND,India,1351.16,3287.26,2575.67,Asia,1947-08-15
USA,US,329.74,9833.52,19485.39,N.America,1776-07-04
IDN,Indonesia,268.07,1910.93,1015.54,Asia,1945-08-17
BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
PAK,Pakistan,205.71,881.91,302.14,Asia,1947-08-14
NGA,Nigeria,200.96,923.77,375.77,Africa,1960-10-01
BGD,Bangladesh,167.09,147.57,245.63,Asia,1971-03-26
RUS,Russia,146.79,17098.25,1530.75,(missing),1992-06-12
MEX,Mexico,126.58,1964.38,1158.23,N.America,1810-09-16
JPN,Japan,126.22,377.97,4872.42,Asia,(missing)
DEU,Germany,83.02,357.11,3693.2,Europe,(missing)
FRA,France,67.02,640.68,2582.49,Europe,1789-07-14
GBR,UK,66.44,242.5,2631.23,Europe,(missing)
ITA,Italy,60.36,301.34,1943.84,Europe,(missing)
ARG,Argentina,44.94,2780.4,637.49,S.America,1816-07-09
DZA,Algeria,43.38,2381.74,167.56,Africa,1962-07-05
CAN,Canada,37.59,9984.67,1647.12,N.America,1867-07-01
AUS,Australia,25.47,7692.02,1408.68,Oceania,(missing)
KAZ,Kazakhstan,18.53,2724.9,159.41,Asia,1991-12-16

当 Pandas 读取文件时, 默认情况下会将空字符串 ('') 和其他一些字符串视为缺失值:

  • 'nan'
  • '-nan'
  • 'NA'
  • 'N/A'
  • 'NaN'
  • 'null'

如果您不希望出现这种情况,则可以传递keep_default_na=Falseread_csv()函数。要为缺少的值指定其他标签,请使用参数na_values

>>> pd.read_csv('new-data.csv', index_col=0, na_values='(missing)')
 COUNTRY      POP      AREA       GDP       CONT     IND_DAY
CHN       China  1398.72   9596.96  12234.78       Asia         NaN
IND       India  1351.16   3287.26   2575.67       Asia  1947-08-15
USA          US   329.74   9833.52  19485.39  N.America  1776-07-04
IDN   Indonesia   268.07   1910.93   1015.54       Asia  1945-08-17
BRA      Brazil   210.32   8515.77   2055.51  S.America  1822-09-07
PAK    Pakistan   205.71    881.91    302.14       Asia  1947-08-14
NGA     Nigeria   200.96    923.77    375.77     Africa  1960-10-01
BGD  Bangladesh   167.09    147.57    245.63       Asia  1971-03-26
RUS      Russia   146.79  17098.25   1530.75        NaN  1992-06-12
MEX      Mexico   126.58   1964.38   1158.23  N.America  1810-09-16
JPN       Japan   126.22    377.97   4872.42       Asia         NaN
DEU     Germany    83.02    357.11   3693.20     Europe         NaN
FRA      France    67.02    640.68   2582.49     Europe  1789-07-14
GBR          UK    66.44    242.50   2631.23     Europe         NaN
ITA       Italy    60.36    301.34   1943.84     Europe         NaN
ARG   Argentina    44.94   2780.40    637.49  S.America  1816-07-09
DZA     Algeria    43.38   2381.74    167.56     Africa  1962-07-05
CAN      Canada    37.59   9984.67   1647.12  N.America  1867-07-01
AUS   Australia    25.47   7692.02   1408.68    Oceania         NaN
KAZ  Kazakhstan    18.53   2724.90    159.41       Asia  1991-12-16

在这里,您已将该字符串'(missing)'标记为新的缺少的数据标签,并且Pandas在读取文件时将其替换为nan

从文件加载数据时,Pandas默认将数据类型分配给每一列的值。您可以使用以下命令检查这些类型.dtypes

>>> df = pd.read_csv('data.csv', index_col=0)
>>> df.dtypes
COUNTRY     object
POP        float64
AREA       float64
GDP        float64
CONT        object
IND_DAY     object
dtype: object

您可以使用参数dtype指定所需的数据类型,使用参数parse_dates强制使用 datetimes

>>> dtypes = {'POP': 'float32', 'AREA': 'float32', 'GDP': 'float32'}
>>> df = pd.read_csv('data.csv', index_col=0, dtype=dtypes,
...                  parse_dates=['IND_DAY'])
>>> df.dtypes
COUNTRY            object
POP               float32
AREA              float32
GDP               float32
CONT               object
IND_DAY    datetime64[ns]
dtype: object
>>> df['IND_DAY']
CHN          NaT
IND   1947-08-15
USA   1776-07-04
IDN   1945-08-17
BRA   1822-09-07
PAK   1947-08-14
NGA   1960-10-01
BGD   1971-03-26
RUS   1992-06-12
MEX   1810-09-16
JPN          NaT
DEU          NaT
FRA   1789-07-14
GBR          NaT
ITA          NaT
ARG   1816-07-09
DZA   1962-07-05
CAN   1867-07-01
AUS          NaT
KAZ   1991-12-16
Name: IND_DAY, dtype: datetime64[ns]

现在您有了真实的日期,可以将它们保存为所需的格式:

>>> df = pd.read_csv('data.csv', index_col=0, parse_dates=['IND_DAY'])
>>> df.to_csv('formatted-data.csv', date_format='%B %d, %Y')

查看生成的文件:

,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
CHN,China,1398.72,9596.96,12234.78,Asia,
IND,India,1351.16,3287.26,2575.67,Asia,"August 15, 1947"
USA,US,329.74,9833.52,19485.39,N.America,"July 04, 1776"
IDN,Indonesia,268.07,1910.93,1015.54,Asia,"August 17, 1945"
BRA,Brazil,210.32,8515.77,2055.51,S.America,"September 07, 1822"
PAK,Pakistan,205.71,881.91,302.14,Asia,"August 14, 1947"
NGA,Nigeria,200.96,923.77,375.77,Africa,"October 01, 1960"
BGD,Bangladesh,167.09,147.57,245.63,Asia,"March 26, 1971"
RUS,Russia,146.79,17098.25,1530.75,,"June 12, 1992"
MEX,Mexico,126.58,1964.38,1158.23,N.America,"September 16, 1810"
JPN,Japan,126.22,377.97,4872.42,Asia,
DEU,Germany,83.02,357.11,3693.2,Europe,
FRA,France,67.02,640.68,2582.49,Europe,"July 14, 1789"
GBR,UK,66.44,242.5,2631.23,Europe,
ITA,Italy,60.36,301.34,1943.84,Europe,
ARG,Argentina,44.94,2780.4,637.49,S.America,"July 09, 1816"
DZA,Algeria,43.38,2381.74,167.56,Africa,"July 05, 1962"
CAN,Canada,37.59,9984.67,1647.12,N.America,"July 01, 1867"
AUS,Australia,25.47,7692.02,1408.68,Oceania,
KAZ,Kazakhstan,18.53,2724.9,159.41,Asia,"December 16, 1991"

日期格式现在有所不同。该格式'%B %d, %Y'表示日期将首先显示月份的全名,然后是日期,然后是逗号,最后是年。

您还可以使用其他几个可选参数 .to_csv():

  • sep 表示值分隔符。
  • decimal 表示小数点分隔符。
  • encoding 设置文件编码。
  • header 指定是否要在文件中写入列标签。

以下是传递参数sepheader的实例:

>>> s = df.to_csv(sep=';', header=False)
>>> print(s)
CHN;China;1398.72;9596.96;12234.78;Asia;
IND;India;1351.16;3287.26;2575.67;Asia;1947-08-15
USA;US;329.74;9833.52;19485.39;N.America;1776-07-04
IDN;Indonesia;268.07;1910.93;1015.54;Asia;1945-08-17
BRA;Brazil;210.32;8515.77;2055.51;S.America;1822-09-07
PAK;Pakistan;205.71;881.91;302.14;Asia;1947-08-14
NGA;Nigeria;200.96;923.77;375.77;Africa;1960-10-01
BGD;Bangladesh;167.09;147.57;245.63;Asia;1971-03-26
RUS;Russia;146.79;17098.25;1530.75;;1992-06-12
MEX;Mexico;126.58;1964.38;1158.23;N.America;1810-09-16
JPN;Japan;126.22;377.97;4872.42;Asia;
DEU;Germany;83.02;357.11;3693.2;Europe;
FRA;France;67.02;640.68;2582.49;Europe;1789-07-14
GBR;UK;66.44;242.5;2631.23;Europe;
ITA;Italy;60.36;301.34;1943.84;Europe;
ARG;Argentina;44.94;2780.4;637.49;S.America;1816-07-09
DZA;Algeria;43.38;2381.74;167.56;Africa;1962-07-05
CAN;Canada;37.59;9984.67;1647.12;N.America;1867-07-01
AUS;Australia;25.47;7692.02;1408.68;Oceania;
KAZ;Kazakhstan;18.53;2724.9;159.41;Asia;1991-12-16

JSON文件

使用.to_json()保存数据到json中:

>>> df = pd.DataFrame(data=data).T
>>> df.to_json('data-columns.json')

结果如下:

{"COUNTRY":{"CHN":"China","IND":"India","USA":"US","IDN":"Indonesia","BRA":"Brazil","PAK":"Pakistan","NGA":"Nigeria","BGD":"Bangladesh","RUS":"Russia","MEX":"Mexico","JPN":"Japan","DEU":"Germany","FRA":"France","GBR":"UK","ITA":"Italy","ARG":"Argentina","DZA":"Algeria","CAN":"Canada","AUS":"Australia","KAZ":"Kazakhstan"},"POP":{"CHN":1398.72,"IND":1351.16,"USA":329.74,"IDN":268.07,"BRA":210.32,"PAK":205.71,"NGA":200.96,"BGD":167.09,"RUS":146.79,"MEX":126.58,"JPN":126.22,"DEU":83.02,"FRA":67.02,"GBR":66.44,"ITA":60.36,"ARG":44.94,"DZA":43.38,"CAN":37.59,"AUS":25.47,"KAZ":18.53},"AREA":{"CHN":9596.96,"IND":3287.26,"USA":9833.52,"IDN":1910.93,"BRA":8515.77,"PAK":881.91,"NGA":923.77,"BGD":147.57,"RUS":17098.25,"MEX":1964.38,"JPN":377.97,"DEU":357.11,"FRA":640.68,"GBR":242.5,"ITA":301.34,"ARG":2780.4,"DZA":2381.74,"CAN":9984.67,"AUS":7692.02,"KAZ":2724.9},"GDP":{"CHN":12234.78,"IND":2575.67,"USA":19485.39,"IDN":1015.54,"BRA":2055.51,"PAK":302.14,"NGA":375.77,"BGD":245.63,"RUS":1530.75,"MEX":1158.23,"JPN":4872.42,"DEU":3693.2,"FRA":2582.49,"GBR":2631.23,"ITA":1943.84,"ARG":637.49,"DZA":167.56,"CAN":1647.12,"AUS":1408.68,"KAZ":159.41},"CONT":{"CHN":"Asia","IND":"Asia","USA":"N.America","IDN":"Asia","BRA":"S.America","PAK":"Asia","NGA":"Africa","BGD":"Asia","RUS":null,"MEX":"N.America","JPN":"Asia","DEU":"Europe","FRA":"Europe","GBR":"Europe","ITA":"Europe","ARG":"S.America","DZA":"Africa","CAN":"N.America","AUS":"Oceania","KAZ":"Asia"},"IND_DAY":{"CHN":null,"IND":"1947-08-15","USA":"1776-07-04","IDN":"1945-08-17","BRA":"1822-09-07","PAK":"1947-08-14","NGA":"1960-10-01","BGD":"1971-03-26","RUS":"1992-06-12","MEX":"1810-09-16","JPN":null,"DEU":null,"FRA":"1789-07-14","GBR":null,"ITA":null,"ARG":"1816-07-09","DZA":"1962-07-05","CAN":"1867-07-01","AUS":null,"KAZ":"1991-12-16"}}

该orient参数默认为'columns'。在这里,您将其设置为index。

>>> df.to_json('data-index.json', orient='index')

结果如下:

{"CHN":{"COUNTRY":"China","POP":1398.72,"AREA":9596.96,"GDP":12234.78,"CONT":"Asia","IND_DAY":null},"IND":{"COUNTRY":"India","POP":1351.16,"AREA":3287.26,"GDP":2575.67,"CONT":"Asia","IND_DAY":"1947-08-15"},"USA":{"COUNTRY":"US","POP":329.74,"AREA":9833.52,"GDP":19485.39,"CONT":"N.America","IND_DAY":"1776-07-04"},"IDN":{"COUNTRY":"Indonesia","POP":268.07,"AREA":1910.93,"GDP":1015.54,"CONT":"Asia","IND_DAY":"1945-08-17"},"BRA":{"COUNTRY":"Brazil","POP":210.32,"AREA":8515.77,"GDP":2055.51,"CONT":"S.America","IND_DAY":"1822-09-07"},"PAK":{"COUNTRY":"Pakistan","POP":205.71,"AREA":881.91,"GDP":302.14,"CONT":"Asia","IND_DAY":"1947-08-14"},"NGA":{"COUNTRY":"Nigeria","POP":200.96,"AREA":923.77,"GDP":375.77,"CONT":"Africa","IND_DAY":"1960-10-01"},"BGD":{"COUNTRY":"Bangladesh","POP":167.09,"AREA":147.57,"GDP":245.63,"CONT":"Asia","IND_DAY":"1971-03-26"},"RUS":{"COUNTRY":"Russia","POP":146.79,"AREA":17098.25,"GDP":1530.75,"CONT":null,"IND_DAY":"1992-06-12"},"MEX":{"COUNTRY":"Mexico","POP":126.58,"AREA":1964.38,"GDP":1158.23,"CONT":"N.America","IND_DAY":"1810-09-16"},"JPN":{"COUNTRY":"Japan","POP":126.22,"AREA":377.97,"GDP":4872.42,"CONT":"Asia","IND_DAY":null},"DEU":{"COUNTRY":"Germany","POP":83.02,"AREA":357.11,"GDP":3693.2,"CONT":"Europe","IND_DAY":null},"FRA":{"COUNTRY":"France","POP":67.02,"AREA":640.68,"GDP":2582.49,"CONT":"Europe","IND_DAY":"1789-07-14"},"GBR":{"COUNTRY":"UK","POP":66.44,"AREA":242.5,"GDP":2631.23,"CONT":"Europe","IND_DAY":null},"ITA":{"COUNTRY":"Italy","POP":60.36,"AREA":301.34,"GDP":1943.84,"CONT":"Europe","IND_DAY":null},"ARG":{"COUNTRY":"Argentina","POP":44.94,"AREA":2780.4,"GDP":637.49,"CONT":"S.America","IND_DAY":"1816-07-09"},"DZA":{"COUNTRY":"Algeria","POP":43.38,"AREA":2381.74,"GDP":167.56,"CONT":"Africa","IND_DAY":"1962-07-05"},"CAN":{"COUNTRY":"Canada","POP":37.59,"AREA":9984.67,"GDP":1647.12,"CONT":"N.America","IND_DAY":"1867-07-01"},"AUS":{"COUNTRY":"Australia","POP":25.47,"AREA":7692.02,"GDP":1408.68,"CONT":"Oceania","IND_DAY":null},"KAZ":{"COUNTRY":"Kazakhstan","POP":18.53,"AREA":2724.9,"GDP":159.41,"CONT":"Asia","IND_DAY":"1991-12-16"}}

通过设置orient='records',会产生一个列表结构的数据。

>>> df.to_json('data-records.json', orient='records')

结果如下:

[{"COUNTRY":"China","POP":1398.72,"AREA":9596.96,"GDP":12234.78,"CONT":"Asia","IND_DAY":null},{"COUNTRY":"India","POP":1351.16,"AREA":3287.26,"GDP":2575.67,"CONT":"Asia","IND_DAY":"1947-08-15"},{"COUNTRY":"US","POP":329.74,"AREA":9833.52,"GDP":19485.39,"CONT":"N.America","IND_DAY":"1776-07-04"},{"COUNTRY":"Indonesia","POP":268.07,"AREA":1910.93,"GDP":1015.54,"CONT":"Asia","IND_DAY":"1945-08-17"},{"COUNTRY":"Brazil","POP":210.32,"AREA":8515.77,"GDP":2055.51,"CONT":"S.America","IND_DAY":"1822-09-07"},{"COUNTRY":"Pakistan","POP":205.71,"AREA":881.91,"GDP":302.14,"CONT":"Asia","IND_DAY":"1947-08-14"},{"COUNTRY":"Nigeria","POP":200.96,"AREA":923.77,"GDP":375.77,"CONT":"Africa","IND_DAY":"1960-10-01"},{"COUNTRY":"Bangladesh","POP":167.09,"AREA":147.57,"GDP":245.63,"CONT":"Asia","IND_DAY":"1971-03-26"},{"COUNTRY":"Russia","POP":146.79,"AREA":17098.25,"GDP":1530.75,"CONT":null,"IND_DAY":"1992-06-12"},{"COUNTRY":"Mexico","POP":126.58,"AREA":1964.38,"GDP":1158.23,"CONT":"N.America","IND_DAY":"1810-09-16"},{"COUNTRY":"Japan","POP":126.22,"AREA":377.97,"GDP":4872.42,"CONT":"Asia","IND_DAY":null},{"COUNTRY":"Germany","POP":83.02,"AREA":357.11,"GDP":3693.2,"CONT":"Europe","IND_DAY":null},{"COUNTRY":"France","POP":67.02,"AREA":640.68,"GDP":2582.49,"CONT":"Europe","IND_DAY":"1789-07-14"},{"COUNTRY":"UK","POP":66.44,"AREA":242.5,"GDP":2631.23,"CONT":"Europe","IND_DAY":null},{"COUNTRY":"Italy","POP":60.36,"AREA":301.34,"GDP":1943.84,"CONT":"Europe","IND_DAY":null},{"COUNTRY":"Argentina","POP":44.94,"AREA":2780.4,"GDP":637.49,"CONT":"S.America","IND_DAY":"1816-07-09"},{"COUNTRY":"Algeria","POP":43.38,"AREA":2381.74,"GDP":167.56,"CONT":"Africa","IND_DAY":"1962-07-05"},{"COUNTRY":"Canada","POP":37.59,"AREA":9984.67,"GDP":1647.12,"CONT":"N.America","IND_DAY":"1867-07-01"},{"COUNTRY":"Australia","POP":25.47,"AREA":7692.02,"GDP":1408.68,"CONT":"Oceania","IND_DAY":null},{"COUNTRY":"Kazakhstan","POP":18.53,"AREA":2724.9,"GDP":159.41,"CONT":"Asia","IND_DAY":"1991-12-16"}]

您可以通过设置orient='split'获得另一个有趣的文件结构:

>>> df.to_json('data-split.json', orient='split')

结果如下:

{"columns":["COUNTRY","POP","AREA","GDP","CONT","IND_DAY"],"index":["CHN","IND","USA","IDN","BRA","PAK","NGA","BGD","RUS","MEX","JPN","DEU","FRA","GBR","ITA","ARG","DZA","CAN","AUS","KAZ"],"data":[["China",1398.72,9596.96,12234.78,"Asia",null],["India",1351.16,3287.26,2575.67,"Asia","1947-08-15"],["US",329.74,9833.52,19485.39,"N.America","1776-07-04"],["Indonesia",268.07,1910.93,1015.54,"Asia","1945-08-17"],["Brazil",210.32,8515.77,2055.51,"S.America","1822-09-07"],["Pakistan",205.71,881.91,302.14,"Asia","1947-08-14"],["Nigeria",200.96,923.77,375.77,"Africa","1960-10-01"],["Bangladesh",167.09,147.57,245.63,"Asia","1971-03-26"],["Russia",146.79,17098.25,1530.75,null,"1992-06-12"],["Mexico",126.58,1964.38,1158.23,"N.America","1810-09-16"],["Japan",126.22,377.97,4872.42,"Asia",null],["Germany",83.02,357.11,3693.2,"Europe",null],["France",67.02,640.68,2582.49,"Europe","1789-07-14"],["UK",66.44,242.5,2631.23,"Europe",null],["Italy",60.36,301.34,1943.84,"Europe",null],["Argentina",44.94,2780.4,637.49,"S.America","1816-07-09"],["Algeria",43.38,2381.74,167.56,"Africa","1962-07-05"],["Canada",37.59,9984.67,1647.12,"N.America","1867-07-01"],["Australia",25.47,7692.02,1408.68,"Oceania",null],["Kazakhstan",18.53,2724.9,159.41,"Asia","1991-12-16"]]}

data-split.json 包含一本字典,其中包含以下列表:

  • 列的名称
  • 行的标签
  • 包含数据值的内部列表(二维序列)

您还可以使用其他可选参数。例如,您可以设置index=False为放弃保存行标签。你可以操纵精度double_precision,并使用date_format和date_unit控制日期。当您在数据之间有时间序列时,这最后两个参数特别重要:

>>> df = pd.DataFrame(data=data).T
>>> df['IND_DAY'] = pd.to_datetime(df['IND_DAY'])
>>> df.dtypes
COUNTRY            object
POP                object
AREA               object
GDP                object
CONT               object
IND_DAY    datetime64[ns]
dtype: object

>>> df.to_json('data-time.json')

结果如下:

{"COUNTRY":{"CHN":"China","IND":"India","USA":"US","IDN":"Indonesia","BRA":"Brazil","PAK":"Pakistan","NGA":"Nigeria","BGD":"Bangladesh","RUS":"Russia","MEX":"Mexico","JPN":"Japan","DEU":"Germany","FRA":"France","GBR":"UK","ITA":"Italy","ARG":"Argentina","DZA":"Algeria","CAN":"Canada","AUS":"Australia","KAZ":"Kazakhstan"},"POP":{"CHN":1398.72,"IND":1351.16,"USA":329.74,"IDN":268.07,"BRA":210.32,"PAK":205.71,"NGA":200.96,"BGD":167.09,"RUS":146.79,"MEX":126.58,"JPN":126.22,"DEU":83.02,"FRA":67.02,"GBR":66.44,"ITA":60.36,"ARG":44.94,"DZA":43.38,"CAN":37.59,"AUS":25.47,"KAZ":18.53},"AREA":{"CHN":9596.96,"IND":3287.26,"USA":9833.52,"IDN":1910.93,"BRA":8515.77,"PAK":881.91,"NGA":923.77,"BGD":147.57,"RUS":17098.25,"MEX":1964.38,"JPN":377.97,"DEU":357.11,"FRA":640.68,"GBR":242.5,"ITA":301.34,"ARG":2780.4,"DZA":2381.74,"CAN":9984.67,"AUS":7692.02,"KAZ":2724.9},"GDP":{"CHN":12234.78,"IND":2575.67,"USA":19485.39,"IDN":1015.54,"BRA":2055.51,"PAK":302.14,"NGA":375.77,"BGD":245.63,"RUS":1530.75,"MEX":1158.23,"JPN":4872.42,"DEU":3693.2,"FRA":2582.49,"GBR":2631.23,"ITA":1943.84,"ARG":637.49,"DZA":167.56,"CAN":1647.12,"AUS":1408.68,"KAZ":159.41},"CONT":{"CHN":"Asia","IND":"Asia","USA":"N.America","IDN":"Asia","BRA":"S.America","PAK":"Asia","NGA":"Africa","BGD":"Asia","RUS":null,"MEX":"N.America","JPN":"Asia","DEU":"Europe","FRA":"Europe","GBR":"Europe","ITA":"Europe","ARG":"S.America","DZA":"Africa","CAN":"N.America","AUS":"Oceania","KAZ":"Asia"},"IND_DAY":{"CHN":null,"IND":-706320000000,"USA":-6106060800000,"IDN":-769219200000,"BRA":-4648924800000,"PAK":-706406400000,"NGA":-291945600000,"BGD":38793600000,"RUS":708307200000,"MEX":-5026838400000,"JPN":null,"DEU":null,"FRA":-5694969600000,"GBR":null,"ITA":null,"ARG":-4843411200000,"DZA":-236476800000,"CAN":-3234729600000,"AUS":null,"KAZ":692841600000}}\

如果您通过date_format='iso',那么您将获得ISO 8601格式的日期。另外,date_unit确定时间单位:

>>> df = pd.DataFrame(data=data).T
>>> df['IND_DAY'] = pd.to_datetime(df['IND_DAY'])
>>> df.to_json('new-data-time.json', date_format='iso', date_unit='s')

结果如下:

{"COUNTRY":{"CHN":"China","IND":"India","USA":"US","IDN":"Indonesia","BRA":"Brazil","PAK":"Pakistan","NGA":"Nigeria","BGD":"Bangladesh","RUS":"Russia","MEX":"Mexico","JPN":"Japan","DEU":"Germany","FRA":"France","GBR":"UK","ITA":"Italy","ARG":"Argentina","DZA":"Algeria","CAN":"Canada","AUS":"Australia","KAZ":"Kazakhstan"},"POP":{"CHN":1398.72,"IND":1351.16,"USA":329.74,"IDN":268.07,"BRA":210.32,"PAK":205.71,"NGA":200.96,"BGD":167.09,"RUS":146.79,"MEX":126.58,"JPN":126.22,"DEU":83.02,"FRA":67.02,"GBR":66.44,"ITA":60.36,"ARG":44.94,"DZA":43.38,"CAN":37.59,"AUS":25.47,"KAZ":18.53},"AREA":{"CHN":9596.96,"IND":3287.26,"USA":9833.52,"IDN":1910.93,"BRA":8515.77,"PAK":881.91,"NGA":923.77,"BGD":147.57,"RUS":17098.25,"MEX":1964.38,"JPN":377.97,"DEU":357.11,"FRA":640.68,"GBR":242.5,"ITA":301.34,"ARG":2780.4,"DZA":2381.74,"CAN":9984.67,"AUS":7692.02,"KAZ":2724.9},"GDP":{"CHN":12234.78,"IND":2575.67,"USA":19485.39,"IDN":1015.54,"BRA":2055.51,"PAK":302.14,"NGA":375.77,"BGD":245.63,"RUS":1530.75,"MEX":1158.23,"JPN":4872.42,"DEU":3693.2,"FRA":2582.49,"GBR":2631.23,"ITA":1943.84,"ARG":637.49,"DZA":167.56,"CAN":1647.12,"AUS":1408.68,"KAZ":159.41},"CONT":{"CHN":"Asia","IND":"Asia","USA":"N.America","IDN":"Asia","BRA":"S.America","PAK":"Asia","NGA":"Africa","BGD":"Asia","RUS":null,"MEX":"N.America","JPN":"Asia","DEU":"Europe","FRA":"Europe","GBR":"Europe","ITA":"Europe","ARG":"S.America","DZA":"Africa","CAN":"N.America","AUS":"Oceania","KAZ":"Asia"},"IND_DAY":{"CHN":null,"IND":"1947-08-15T00:00:00Z","USA":"1776-07-04T00:00:00Z","IDN":"1945-08-17T00:00:00Z","BRA":"1822-09-07T00:00:00Z","PAK":"1947-08-14T00:00:00Z","NGA":"1960-10-01T00:00:00Z","BGD":"1971-03-26T00:00:00Z","RUS":"1992-06-12T00:00:00Z","MEX":"1810-09-16T00:00:00Z","JPN":null,"DEU":null,"FRA":"1789-07-14T00:00:00Z","GBR":null,"ITA":null,"ARG":"1816-07-09T00:00:00Z","DZA":"1962-07-05T00:00:00Z","CAN":"1867-07-01T00:00:00Z","AUS":null,"KAZ":"1991-12-16T00:00:00Z"}}

您可以使用以下命令从JSON文件加载数据read_json():

>>> df = pd.read_json('data-index.json', orient='index',convert_dates=['IND_DAY'])

该参数convert_dates的用途与parse_dates使用它读取CSV文件时的用途相似。可选参数orient非常重要,因为它指定Pandas如何理解文件的结构。

HTML文档

HTML是一个纯文本文件,它使用超文本标记语言,以帮助浏览器显示网页。HTML文件的扩展名是.html和.htm。您将需要安装HTML解析器库(例如lxml或html5lib)才能使用HTML文件:

$pip install lxml html5lib
$conda install lxml html5lib

安装完这些库,就可以使用以下命令将您的内容另存DataFrame为HTML文件.to_html()

df = pd.DataFrame(data=data).T
df.to_html('data.html')

.to_html()如果不提供可选参数buf,则不会创建文件,该参数表示要写入的缓冲区。如果不使用此参数,则您的代码将返回与.to_json().to_csv() 相同的字符串。

以下是一些其他可选参数:

  • header 确定是否保存列名
  • index 确定是否保存行标签
  • classes 分配级联样式表(CSS)类
  • render_links 指定是否将URL转换为HTML链接
  • table_id 将CSS分配idtable
  • escape 决定是否将字符转换<, >, &等对HTML安全的字符串

您可以DataFrame使用read_html(),从合适的HTML文件中创建一个对象,该对象将返回一个DataFrame实例或它们的列表:

>>> df = pd.read_html('data.html', index_col=0, parse_dates=['IND_DAY'])

这与读取CSV文件时的操作非常相似。您还具有可帮助您处理日期,缺失值,精度,编码,HTML解析器等的参数。

Excel文件

您已经学习了如何使用Pandas读写Excel文件。但是,还有一些其他选项值得考虑。首先,当您使用时.to_excel(),可以使用可选参数指定目标工作表的名称sheet_name

>>> df = pd.DataFrame(data=data).T
>>> df.to_excel('data.xlsx', sheet_name='COUNTRIES')

在这里,您将创建data.xlsx,其工作表名为COUNTRIES

可选参数startrowstartcol两者默认都为0,意为开始写入数据的最左上单元:

>>> df.to_excel('data-shifted.xlsx', sheet_name='COUNTRIES', startrow=2, startcol=4)

在这里,您指定该表应从第三行和第五列开始。

现在生成的工作表如下所示:
mmst-pandas-rw-files-excel-shifted

SQL 文件

在下一个示例中,您将数据写入名为的数据库data.db。首先,您需要安装SQLAlchemy和数据库驱动程序。Python具有SQLite的内置驱动程序。

$ pip install sqlalchemy
$ conda install sqlalchemy

一旦安装了SQLAlchemy,使用create_engine()创建数据库引擎:

>>> from sqlalchemy import create_engine
>>> engine = create_engine('sqlite:///data.db', echo=False)

使用to_sql()指定数据类型并应用:

>>> dtypes = {'POP': 'float64', 'AREA': 'float64', 'GDP': 'float64','IND_DAY': 'datetime64'}
>>> df = pd.DataFrame(data=data).T.astype(dtype=dtypes)
>>> df.dtypes
COUNTRY            object
POP               float64
AREA              float64
GDP               float64
CONT               object
IND_DAY    datetime64[ns]
dtype: object

.astype() 是一种非常方便的方法,可用于一次设置多种数据类型。

创建DataFrame完之后,可以使用 .to_sql()命令将其保存到数据库中.to_sql()

>>> df.to_sql('data.db', con=engine, index_label='ID')

参数con用于指定要使用的数据库连接或引擎。可选参数index_label指定使用哪个行标签调用数据库列,其值一般为ID, Idid

数据库中存储的数据结构大致如下:

mmst-pandas-rw-files-db

第一列包含行标签。要忽略将它们写入数据库,请传递参数index=False.to_sql()

另外,您可以schema用来指定数据库模式和dtype确定数据库列的类型。您还可以使用if_exists,它表示如果已经存在具有相同名称和路径的数据库,该怎么办:

  • if_exists='fail' 引发ValueError,默认值
  • if_exists='replace' 删除表并插入新值
  • if_exists='append' 在表中插入新值

您可以使用 read_sql():命令从数据库加载数据

>>> df = pd.read_sql('data.db', con=engine, index_col='ID')
>>> df
 COUNTRY      POP      AREA       GDP       CONT    IND_DAY
ID
CHN       China  1398.72   9596.96  12234.78       Asia        NaT
IND       India  1351.16   3287.26   2575.67       Asia 1947-08-15
USA          US   329.74   9833.52  19485.39  N.America 1776-07-04
IDN   Indonesia   268.07   1910.93   1015.54       Asia 1945-08-17
BRA      Brazil   210.32   8515.77   2055.51  S.America 1822-09-07
PAK    Pakistan   205.71    881.91    302.14       Asia 1947-08-14
NGA     Nigeria   200.96    923.77    375.77     Africa 1960-10-01
BGD  Bangladesh   167.09    147.57    245.63       Asia 1971-03-26
RUS      Russia   146.79  17098.25   1530.75       None 1992-06-12
MEX      Mexico   126.58   1964.38   1158.23  N.America 1810-09-16
JPN       Japan   126.22    377.97   4872.42       Asia        NaT
DEU     Germany    83.02    357.11   3693.20     Europe        NaT
FRA      France    67.02    640.68   2582.49     Europe 1789-07-14
GBR          UK    66.44    242.50   2631.23     Europe        NaT
ITA       Italy    60.36    301.34   1943.84     Europe        NaT
ARG   Argentina    44.94   2780.40    637.49  S.America 1816-07-09
DZA     Algeria    43.38   2381.74    167.56     Africa 1962-07-05
CAN      Canada    37.59   9984.67   1647.12  N.America 1867-07-01
AUS   Australia    25.47   7692.02   1408.68    Oceania        NaT
KAZ  Kazakhstan    18.53   2724.90    159.41       Asia 1991-12-16

参数index_col指定带有行标签的列的名称。请注意,这会插入额外的一行,其以ID开头。您可以使用以下代码行解决此问题:

>>> df.index.name = None
>>> df
 COUNTRY      POP      AREA       GDP       CONT    IND_DAY
CHN       China  1398.72   9596.96  12234.78       Asia        NaT
IND       India  1351.16   3287.26   2575.67       Asia 1947-08-15
USA          US   329.74   9833.52  19485.39  N.America 1776-07-04
IDN   Indonesia   268.07   1910.93   1015.54       Asia 1945-08-17
BRA      Brazil   210.32   8515.77   2055.51  S.America 1822-09-07
PAK    Pakistan   205.71    881.91    302.14       Asia 1947-08-14
NGA     Nigeria   200.96    923.77    375.77     Africa 1960-10-01
BGD  Bangladesh   167.09    147.57    245.63       Asia 1971-03-26
RUS      Russia   146.79  17098.25   1530.75       None 1992-06-12
MEX      Mexico   126.58   1964.38   1158.23  N.America 1810-09-16
JPN       Japan   126.22    377.97   4872.42       Asia        NaT
DEU     Germany    83.02    357.11   3693.20     Europe        NaT
FRA      France    67.02    640.68   2582.49     Europe 1789-07-14
GBR          UK    66.44    242.50   2631.23     Europe        NaT
ITA       Italy    60.36    301.34   1943.84     Europe        NaT
ARG   Argentina    44.94   2780.40    637.49  S.America 1816-07-09
DZA     Algeria    43.38   2381.74    167.56     Africa 1962-07-05
CAN      Canada    37.59   9984.67   1647.12  N.America 1867-07-01
AUS   Australia    25.47   7692.02   1408.68    Oceania        NaT
KAZ  Kazakhstan    18.53   2724.90    159.41       Asia 1991-12-16

如果您想用来填充缺少的值nan,则可以使用.fillna()

>>> df.fillna(value=float('nan'), inplace=True)

.fillna() 将所有缺失的值替换为您传递给的任何值value。在这里,您通过了float('nan'),说用来填充所有缺少的值nan

另请注意,您不必传递 parse_dates=['IND_DAY']read_sql()。因为您的数据库能够检测到最后一列包含日期。但是,您可以根据需要通过parse_dates获得相同的结果。

您还可以使用其他功能来读取数据库,例如read_sql_table()read_sql_query()

Pickle文件

Pickling 是将Python对象转换为字节流的行为。 Unpickling是相反的过程。 Python pickle files 是保留Python对象的数据和层次结构的二进制文件。它们通常具有扩展名.pickle或.pkl。

可以使用.to_pickle()方法将DataFrame保存到一个pickle文件中:

>>> dtypes = {'POP': 'float64', 'AREA': 'float64', 'GDP': 'float64','IND_DAY': 'datetime64'}
>>> df = pd.DataFrame(data=data).T.astype(dtype=dtypes)
>>> df.to_pickle('data.pickle')

通过 read_pickle()读取数据:

>>> df = pd.read_pickle('data.pickle')
>>> df
 COUNTRY      POP      AREA       GDP       CONT    IND_DAY
CHN       China  1398.72   9596.96  12234.78       Asia        NaT
IND       India  1351.16   3287.26   2575.67       Asia 1947-08-15
USA          US   329.74   9833.52  19485.39  N.America 1776-07-04
IDN   Indonesia   268.07   1910.93   1015.54       Asia 1945-08-17
BRA      Brazil   210.32   8515.77   2055.51  S.America 1822-09-07
PAK    Pakistan   205.71    881.91    302.14       Asia 1947-08-14
NGA     Nigeria   200.96    923.77    375.77     Africa 1960-10-01
BGD  Bangladesh   167.09    147.57    245.63       Asia 1971-03-26
RUS      Russia   146.79  17098.25   1530.75        NaN 1992-06-12
MEX      Mexico   126.58   1964.38   1158.23  N.America 1810-09-16
JPN       Japan   126.22    377.97   4872.42       Asia        NaT
DEU     Germany    83.02    357.11   3693.20     Europe        NaT
FRA      France    67.02    640.68   2582.49     Europe 1789-07-14
GBR          UK    66.44    242.50   2631.23     Europe        NaT
ITA       Italy    60.36    301.34   1943.84     Europe        NaT
ARG   Argentina    44.94   2780.40    637.49  S.America 1816-07-09
DZA     Algeria    43.38   2381.74    167.56     Africa 1962-07-05
CAN      Canada    37.59   9984.67   1647.12  N.America 1867-07-01
AUS   Australia    25.47   7692.02   1408.68    Oceania        NaT
KAZ  Kazakhstan    18.53   2724.90    159.41       Asia 1991-12-16

检查数据类型:

>>> df.dtypes
COUNTRY            object
POP               float64
AREA              float64
GDP               float64
CONT               object
IND_DAY    datetime64[ns]
dtype: object

处理大数据

如果文件太大,无法保存或处理,则可以采用几种方法来减少所需的磁盘空间:

  • 压缩文件
  • 仅选择所需的列
  • 省略不需要的行
  • 强制使用不太精确的数据类型
  • 拆分数据分成块

压缩和解压缩文件

添加一个与所需压缩类型相对应的后缀,Panda即可判断文件类型:

  • '.gz'
  • '.bz2'
  • '.zip'
  • '.xz'
>>> df = pd.DataFrame(data=data).T
>>> df.to_csv('data.csv.zip')

常规.csv文件的大小为1048字节,而压缩文件只有766字节。

read_csv()函数可以照常打开此压缩文件:

>>> df = pd.read_csv('data.csv.zip', index_col=0,parse_dates=['IND_DAY'])
>>> df
 COUNTRY      POP      AREA       GDP       CONT    IND_DAY
CHN       China  1398.72   9596.96  12234.78       Asia        NaT
IND       India  1351.16   3287.26   2575.67       Asia 1947-08-15
USA          US   329.74   9833.52  19485.39  N.America 1776-07-04
IDN   Indonesia   268.07   1910.93   1015.54       Asia 1945-08-17
BRA      Brazil   210.32   8515.77   2055.51  S.America 1822-09-07
PAK    Pakistan   205.71    881.91    302.14       Asia 1947-08-14
NGA     Nigeria   200.96    923.77    375.77     Africa 1960-10-01
BGD  Bangladesh   167.09    147.57    245.63       Asia 1971-03-26
RUS      Russia   146.79  17098.25   1530.75        NaN 1992-06-12
MEX      Mexico   126.58   1964.38   1158.23  N.America 1810-09-16
JPN       Japan   126.22    377.97   4872.42       Asia        NaT
DEU     Germany    83.02    357.11   3693.20     Europe        NaT
FRA      France    67.02    640.68   2582.49     Europe 1789-07-14
GBR          UK    66.44    242.50   2631.23     Europe        NaT
ITA       Italy    60.36    301.34   1943.84     Europe        NaT
ARG   Argentina    44.94   2780.40    637.49  S.America 1816-07-09
DZA     Algeria    43.38   2381.74    167.56     Africa 1962-07-05
CAN      Canada    37.59   9984.67   1647.12  N.America 1867-07-01
AUS   Australia    25.47   7692.02   1408.68    Oceania        NaT
KAZ  Kazakhstan    18.53   2724.90    159.41       Asia 1991-12-16

您可以使用可选参数指定压缩类型,该参数compression可以采用以下任何值:

  • 'infer'
  • 'gzip'
  • 'bz2'
  • 'zip'
  • 'xz'
  • None

默认值compression='infer'指示Pandas应该从文件扩展名推断出压缩类型。

指定类型压缩pickle文件的方法如下:

>>> df = pd.DataFrame(data=data).T
>>> df.to_pickle('data.pickle.compress', compression='gzip')

指定类型压缩读取 data.pickle.compress

>>> df = pd.read_pickle('data.pickle.compress', compression='gzip')

选择列

Pandas read_csv()read_excel()函数具有可选参数usecols,可用于指定要从文件中加载的列。您可以将列名列表作为相应的参数传递:

>>> df = pd.read_csv('data.csv', usecols=['COUNTRY', 'AREA'])
>>> df
 COUNTRY      AREA
0        China   9596.96
1        India   3287.26
2           US   9833.52
3    Indonesia   1910.93
4       Brazil   8515.77
5     Pakistan    881.91
6      Nigeria    923.77
7   Bangladesh    147.57
8       Russia  17098.25
9       Mexico   1964.38
10       Japan    377.97
11     Germany    357.11
12      France    640.68
13          UK    242.50
14       Italy    301.34
15   Argentina   2780.40
16     Algeria   2381.74
17      Canada   9984.67
18   Australia   7692.02
19  Kazakhstan   2724.90

除了列名,您还可以传递其索引:

>>> df = pd.read_csv('data.csv',index_col=0, usecols=[0, 1, 3])
>>> df
 COUNTRY      AREA
CHN       China   9596.96
IND       India   3287.26
USA          US   9833.52
IDN   Indonesia   1910.93
BRA      Brazil   8515.77
PAK    Pakistan    881.91
NGA     Nigeria    923.77
BGD  Bangladesh    147.57
RUS      Russia  17098.25
MEX      Mexico   1964.38
JPN       Japan    377.97
DEU     Germany    357.11
FRA      France    640.68
GBR          UK    242.50
ITA       Italy    301.34
ARG   Argentina   2780.40
DZA     Algeria   2381.74
CAN      Canada   9984.67
AUS   Australia   7692.02
KAZ  Kazakhstan   2724.90

同样,read_sql()具有可选参数columns,该参数接受要读取的列名列表:

>>> df = pd.read_sql('data.db', con=engine, index_col='ID',columns=['COUNTRY', 'AREA'])
>>> df.index.name = None
>>> df
 COUNTRY      AREA
CHN       China   9596.96
IND       India   3287.26
USA          US   9833.52
IDN   Indonesia   1910.93
BRA      Brazil   8515.77
PAK    Pakistan    881.91
NGA     Nigeria    923.77
BGD  Bangladesh    147.57
RUS      Russia  17098.25
MEX      Mexico   1964.38
JPN       Japan    377.97
DEU     Germany    357.11
FRA      France    640.68
GBR          UK    242.50
ITA       Italy    301.34
ARG   Argentina   2780.40
DZA     Algeria   2381.74
CAN      Canada   9984.67
AUS   Australia   7692.02
KAZ  Kazakhstan   2724.90

省略行

在测试用于数据处理或机器学习的算法时,通常不需要整个数据集。仅加载一部分数据来加快处理过程很方便。Pandas read_csv()read_excel() 函数具有一些可选参数,可用于选择要加载的行:

  • skiprows: 如果是整数,则跳过文件开头的行数;如果是类似列表的对象,则跳过要从零开始的行的索引
  • skipfooter: 文件末尾要跳过的行数
  • nrows: 要读取的行数

以下是如何跳过具有从零开始的奇数索引,保留偶数的行的方法:

>>> df = pd.read_csv('data.csv', index_col=0, skiprows=range(1, 20, 2))
>>> df
 COUNTRY      POP     AREA      GDP       CONT     IND_DAY
IND       India  1351.16  3287.26  2575.67       Asia  1947-08-15
IDN   Indonesia   268.07  1910.93  1015.54       Asia  1945-08-17
PAK    Pakistan   205.71   881.91   302.14       Asia  1947-08-14
BGD  Bangladesh   167.09   147.57   245.63       Asia  1971-03-26
MEX      Mexico   126.58  1964.38  1158.23  N.America  1810-09-16
DEU     Germany    83.02   357.11  3693.20     Europe         NaN
GBR          UK    66.44   242.50  2631.23     Europe         NaN
ARG   Argentina    44.94  2780.40   637.49  S.America  1816-07-09
CAN      Canada    37.59  9984.67  1647.12  N.America  1867-07-01
KAZ  Kazakhstan    18.53  2724.90   159.41       Asia  1991-12-16

强制减少精确数据类型

首先,.dtypes再次获取数据类型:

>>> df = pd.read_csv('data.csv', index_col=0, parse_dates=['IND_DAY'])
>>> df.dtypes
COUNTRY            object
POP               float64
AREA              float64
GDP               float64
CONT               object
IND_DAY    datetime64[ns]
dtype: object

具有浮点数的列是64位浮点数。此类型的每个数字float64消耗64位或8个字节。每列有20个数字,需要160个字节。您可以使用以下方法进行验证.memory_usage()

>>> df.memory_usage()
Index      160
COUNTRY    160
POP        160
AREA       160
GDP        160
CONT       160
IND_DAY    160
dtype: int64

.memory_usage() 返回一个 Series实例,其中带有每列的内存使用情况(以字节为单位)。您可以方便地将其与.loc[]和组合.sum()以获取一组列的内存:

>>> df.loc[:, ['POP', 'AREA', 'GDP']].memory_usage(index=False).sum()
480

您也可以使用 .to_numpy().values从NumPy数组提取数据值。然后,使用该.nbytes属性获取数组项消耗的总字节数:

>>> df.loc[:, ['POP', 'AREA', 'GDP']].to_numpy().nbytes
480

结果是相同的480个字节。那么,如何节省内存?

在这种情况下,你可以指定你的数字列'POP','AREA'和'GDP'应具备的类型float32。使用可选参数dtype执行此操作:

>>> dtypes = {'POP': 'float32', 'AREA': 'float32', 'GDP': 'float32'}
>>> df = pd.read_csv('data.csv', index_col=0, dtype=dtypes,parse_dates=['IND_DAY'])

现在,您可以验证每个数字列需要80个字节,或者每个项目4个字节:

>>> df.dtypes
COUNTRY            object
POP               float32
AREA              float32
GDP               float32
CONT               object
IND_DAY    datetime64[ns]
dtype: object
>>> df.memory_usage()
Index      160
COUNTRY    160
POP         80
AREA        80
GDP         80
CONT       160
IND_DAY    160
dtype: int64
>>> df.loc[:, ['POP', 'AREA', 'GDP']].memory_usage(index=False).sum()
240
>>> df.loc[:, ['POP', 'AREA', 'GDP']].to_numpy().nbytes
240

除了节省内存外,您还可以通过使用float32代替float64,在某些情况下,大大减少处理数据所需的时间。

使用块遍历文件

处理超大型数据集的另一种方法是将数据分成较小的块,然后一次处理一个块。如果使用read_csv(), read_json()read_sql(),你可以指定可选参数chunksize:

>>> data_chunk = pd.read_csv('data.csv', index_col=0, chunksize=8)
>>> type(data_chunk)
<class 'pandas.io.parsers.TextFileReader'>
>>> hasattr(data_chunk, '__iter__')
True
>>> hasattr(data_chunk, '__next__')
True

chunksize 默认为 None ,其可为整数值,该整数值指示单个块中的项目数。当chunksize为整数时,read_csv()返回一个可迭代的变量,您可以在for循环中使用该可迭代的变量,以在每次迭代中仅获取和处理数据集的一部分:

>>> for df_chunk in pd.read_csv('data.csv', index_col=0, chunksize=8):
...     print(df_chunk, end='\n\n')
...     print('memory:', df_chunk.memory_usage().sum(), 'bytes',
...           end='\n\n\n')
...
 COUNTRY      POP     AREA       GDP       CONT     IND_DAY
CHN       China  1398.72  9596.96  12234.78       Asia         NaN
IND       India  1351.16  3287.26   2575.67       Asia  1947-08-15
USA          US   329.74  9833.52  19485.39  N.America  1776-07-04
IDN   Indonesia   268.07  1910.93   1015.54       Asia  1945-08-17
BRA      Brazil   210.32  8515.77   2055.51  S.America  1822-09-07
PAK    Pakistan   205.71   881.91    302.14       Asia  1947-08-14
NGA     Nigeria   200.96   923.77    375.77     Africa  1960-10-01
BGD  Bangladesh   167.09   147.57    245.63       Asia  1971-03-26

memory: 448 bytes

 COUNTRY     POP      AREA      GDP       CONT     IND_DAY
RUS     Russia  146.79  17098.25  1530.75        NaN  1992-06-12
MEX     Mexico  126.58   1964.38  1158.23  N.America  1810-09-16
JPN      Japan  126.22    377.97  4872.42       Asia         NaN
DEU    Germany   83.02    357.11  3693.20     Europe         NaN
FRA     France   67.02    640.68  2582.49     Europe  1789-07-14
GBR         UK   66.44    242.50  2631.23     Europe         NaN
ITA      Italy   60.36    301.34  1943.84     Europe         NaN
ARG  Argentina   44.94   2780.40   637.49  S.America  1816-07-09

memory: 448 bytes

 COUNTRY    POP     AREA      GDP       CONT     IND_DAY
DZA     Algeria  43.38  2381.74   167.56     Africa  1962-07-05
CAN      Canada  37.59  9984.67  1647.12  N.America  1867-07-01
AUS   Australia  25.47  7692.02  1408.68    Oceania         NaN
KAZ  Kazakhstan  18.53  2724.90   159.41       Asia  1991-12-16

memory: 224 bytes

注意: 您也可以通过iterator=True强制 read_csv()函数返回迭代器对象而不是DataFrame对象。

结论

您现在知道了如何将Pandas DataFrame对象中的数据和标签保存到不同类型的文件中。您还知道如何从文件加载数据并创建DataFrame对象。

使用Pandas read_csv().to_csv()方法读取和写入CSV文件。您还使用了类似的方法来读取和写入Excel,JSON,HTML,SQL和pickle文件。这些功能非常方便并且被广泛使用。它们使您可以在单个函数或方法调用中保存或加载数据。

您还学习了在处理大型数据文件时如何节省时间,内存和磁盘空间:

  • 压缩或解压缩文件
  • 选择要加载的行和列
  • 使用不太精确的数据类型
  • 将数据拆分为多个块,并一一处理

您已经在机器学习和数据科学过程中迈出了重要的一步!

Card image cap
开发者雷

尘世间一个小小的开发者,每天增加一些无聊的知识,就不会无聊了

要加油~~~

技术文档 >> 系列应用 >>
热推应用
Let'sLearnSwift
学习Swift的入门教程
PyPie
Python is as good as Pie
标签