Pandas
Contenido
Contenido
Pandas¶
Tomamos una muestra de la guía rápida de Pandas (10 minutes to pandas), junto con un par de complementos.
Creamos objetos¶
import numpy as np
import pandas as pd
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
dates = pd.date_range("20130101", periods=6)
dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df
A | B | C | D | |
---|---|---|---|---|
2013-01-01 | -0.255445 | 1.309822 | -0.168210 | -0.584129 |
2013-01-02 | 0.440632 | 0.473918 | 0.092115 | 0.994856 |
2013-01-03 | 1.140755 | -0.288707 | 1.075470 | 1.245641 |
2013-01-04 | 1.471408 | -1.539870 | 1.272889 | 1.255482 |
2013-01-05 | -1.197831 | -0.139475 | 1.067496 | -0.351955 |
2013-01-06 | 0.203198 | -0.573590 | 1.523726 | -0.788816 |
df2 = pd.DataFrame(
{
"A": 1.0,
"B": pd.Timestamp("20130102"),
"C": pd.Series(1, index=list(range(4)), dtype="float32"),
"D": np.array([3] * 4, dtype="int32"),
"E": pd.Categorical(["test", "train", "test", "train"]),
"F": "foo",
}
)
df2
A | B | C | D | E | F | |
---|---|---|---|---|---|---|
0 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
1 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
2 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
3 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
df2.dtypes
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
Dependiendo del editor de texto, se puede utilizar <TAB>
para tener un despliegue de métodos.
Normalmente tenemos un archivo (local o remoto), con los datos. Tomaremos un ejemplo del repositorio de datos de UCI, Machine Learning Repository
Se tomará el siguiente conjunto de datos Adult (1996). UCI Machine Learning Repository.
age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
datos = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data')
datos
39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
1 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
2 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
3 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
4 | 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
32555 | 27 | Private | 257302 | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | Wife | White | Female | 0 | 0 | 38 | United-States | <=50K |
32556 | 40 | Private | 154374 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
32557 | 58 | Private | 151910 | HS-grad | 9 | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K |
32558 | 22 | Private | 201490 | HS-grad | 9 | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 20 | United-States | <=50K |
32559 | 52 | Self-emp-inc | 287927 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 15024 | 0 | 40 | United-States | >50K |
32560 rows × 15 columns
datos = pd.read_csv(
'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
header = None)
datos
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
32556 | 27 | Private | 257302 | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | Wife | White | Female | 0 | 0 | 38 | United-States | <=50K |
32557 | 40 | Private | 154374 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
32558 | 58 | Private | 151910 | HS-grad | 9 | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K |
32559 | 22 | Private | 201490 | HS-grad | 9 | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 20 | United-States | <=50K |
32560 | 52 | Self-emp-inc | 287927 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 15024 | 0 | 40 | United-States | >50K |
32561 rows × 15 columns
columnas = ["age", "workclass", "fnlwgt", "education", "education_num",
"marital_status", "occupation","relationship", "race", "sex",
"capital_gain","capital_loss","hours-per-week", "native-country", "class"]
datos = pd.read_csv(
'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
header = None,
names = columnas
)
datos
age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours-per-week | native-country | class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
32556 | 27 | Private | 257302 | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | Wife | White | Female | 0 | 0 | 38 | United-States | <=50K |
32557 | 40 | Private | 154374 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
32558 | 58 | Private | 151910 | HS-grad | 9 | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K |
32559 | 22 | Private | 201490 | HS-grad | 9 | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 20 | United-States | <=50K |
32560 | 52 | Self-emp-inc | 287927 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 15024 | 0 | 40 | United-States | >50K |
32561 rows × 15 columns
Desplegar los datos¶
df.head()
A | B | C | D | |
---|---|---|---|---|
2013-01-01 | -0.255445 | 1.309822 | -0.168210 | -0.584129 |
2013-01-02 | 0.440632 | 0.473918 | 0.092115 | 0.994856 |
2013-01-03 | 1.140755 | -0.288707 | 1.075470 | 1.245641 |
2013-01-04 | 1.471408 | -1.539870 | 1.272889 | 1.255482 |
2013-01-05 | -1.197831 | -0.139475 | 1.067496 | -0.351955 |
df.tail()
A | B | C | D | |
---|---|---|---|---|
2013-01-02 | 0.440632 | 0.473918 | 0.092115 | 0.994856 |
2013-01-03 | 1.140755 | -0.288707 | 1.075470 | 1.245641 |
2013-01-04 | 1.471408 | -1.539870 | 1.272889 | 1.255482 |
2013-01-05 | -1.197831 | -0.139475 | 1.067496 | -0.351955 |
2013-01-06 | 0.203198 | -0.573590 | 1.523726 | -0.788816 |
df.to_numpy()
array([[-0.25544475, 1.30982242, -0.16820985, -0.58412855],
[ 0.44063178, 0.47391843, 0.09211497, 0.99485562],
[ 1.14075548, -0.28870699, 1.07547046, 1.24564126],
[ 1.47140771, -1.53987018, 1.27288894, 1.25548203],
[-1.19783097, -0.13947515, 1.06749632, -0.35195464],
[ 0.20319828, -0.57359001, 1.52372616, -0.78881627]])
df2.to_numpy()
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
dtype=object)
df.describe()
A | B | C | D | |
---|---|---|---|---|
count | 6.000000 | 6.000000 | 6.000000 | 6.000000 |
mean | 0.300453 | -0.126317 | 0.810581 | 0.295180 |
std | 0.965425 | 0.964107 | 0.683038 | 0.967690 |
min | -1.197831 | -1.539870 | -0.168210 | -0.788816 |
25% | -0.140784 | -0.502369 | 0.335960 | -0.526085 |
50% | 0.321915 | -0.214091 | 1.071483 | 0.321450 |
75% | 0.965725 | 0.320570 | 1.223534 | 1.182945 |
max | 1.471408 | 1.309822 | 1.523726 | 1.255482 |
df.T
2013-01-01 | 2013-01-02 | 2013-01-03 | 2013-01-04 | 2013-01-05 | 2013-01-06 | |
---|---|---|---|---|---|---|
A | -0.255445 | 0.440632 | 1.140755 | 1.471408 | -1.197831 | 0.203198 |
B | 1.309822 | 0.473918 | -0.288707 | -1.539870 | -0.139475 | -0.573590 |
C | -0.168210 | 0.092115 | 1.075470 | 1.272889 | 1.067496 | 1.523726 |
D | -0.584129 | 0.994856 | 1.245641 | 1.255482 | -0.351955 | -0.788816 |
df.sort_index(axis = 1, ascending = False)
D | C | B | A | |
---|---|---|---|---|
2013-01-01 | -0.584129 | -0.168210 | 1.309822 | -0.255445 |
2013-01-02 | 0.994856 | 0.092115 | 0.473918 | 0.440632 |
2013-01-03 | 1.245641 | 1.075470 | -0.288707 | 1.140755 |
2013-01-04 | 1.255482 | 1.272889 | -1.539870 | 1.471408 |
2013-01-05 | -0.351955 | 1.067496 | -0.139475 | -1.197831 |
2013-01-06 | -0.788816 | 1.523726 | -0.573590 | 0.203198 |
df.sort_values(by="B")
A | B | C | D | |
---|---|---|---|---|
2013-01-04 | 1.471408 | -1.539870 | 1.272889 | 1.255482 |
2013-01-06 | 0.203198 | -0.573590 | 1.523726 | -0.788816 |
2013-01-03 | 1.140755 | -0.288707 | 1.075470 | 1.245641 |
2013-01-05 | -1.197831 | -0.139475 | 1.067496 | -0.351955 |
2013-01-02 | 0.440632 | 0.473918 | 0.092115 | 0.994856 |
2013-01-01 | -0.255445 | 1.309822 | -0.168210 | -0.584129 |
Selección¶
df["A"]
2013-01-01 -0.255445
2013-01-02 0.440632
2013-01-03 1.140755
2013-01-04 1.471408
2013-01-05 -1.197831
2013-01-06 0.203198
Freq: D, Name: A, dtype: float64
df[0:3]
A | B | C | D | |
---|---|---|---|---|
2013-01-01 | -0.255445 | 1.309822 | -0.168210 | -0.584129 |
2013-01-02 | 0.440632 | 0.473918 | 0.092115 | 0.994856 |
2013-01-03 | 1.140755 | -0.288707 | 1.075470 | 1.245641 |
df["20130102":"20130104"]
A | B | C | D | |
---|---|---|---|---|
2013-01-02 | 0.440632 | 0.473918 | 0.092115 | 0.994856 |
2013-01-03 | 1.140755 | -0.288707 | 1.075470 | 1.245641 |
2013-01-04 | 1.471408 | -1.539870 | 1.272889 | 1.255482 |
Selección con etiqueta¶
df.loc[dates[0]]
A -0.255445
B 1.309822
C -0.168210
D -0.584129
Name: 2013-01-01 00:00:00, dtype: float64
df.loc[:, ["A", "B"]]
A | B | |
---|---|---|
2013-01-01 | -0.255445 | 1.309822 |
2013-01-02 | 0.440632 | 0.473918 |
2013-01-03 | 1.140755 | -0.288707 |
2013-01-04 | 1.471408 | -1.539870 |
2013-01-05 | -1.197831 | -0.139475 |
2013-01-06 | 0.203198 | -0.573590 |
df.loc["20130102":"20130104", ["A", "B"]]
A | B | |
---|---|---|
2013-01-02 | 0.440632 | 0.473918 |
2013-01-03 | 1.140755 | -0.288707 |
2013-01-04 | 1.471408 | -1.539870 |
df.loc["20130102", ["A", "B"]]
A 0.440632
B 0.473918
Name: 2013-01-02 00:00:00, dtype: float64
df.loc[dates[0], "A"]
-0.2554447489300441
Selección por posición¶
df.iloc[3]
A 1.471408
B -1.539870
C 1.272889
D 1.255482
Name: 2013-01-04 00:00:00, dtype: float64
df.iloc[3:5, 0:2]
A | B | |
---|---|---|
2013-01-04 | 1.471408 | -1.539870 |
2013-01-05 | -1.197831 | -0.139475 |
df.iloc[[1, 2, 4], [0, 2]]
A | C | |
---|---|---|
2013-01-02 | 0.440632 | 0.092115 |
2013-01-03 | 1.140755 | 1.075470 |
2013-01-05 | -1.197831 | 1.067496 |
df.iloc[1:3, :]
A | B | C | D | |
---|---|---|---|---|
2013-01-02 | 0.440632 | 0.473918 | 0.092115 | 0.994856 |
2013-01-03 | 1.140755 | -0.288707 | 1.075470 | 1.245641 |
Indexado lógico¶
df[df["A"] > 0]
A | B | C | D | |
---|---|---|---|---|
2013-01-02 | 0.440632 | 0.473918 | 0.092115 | 0.994856 |
2013-01-03 | 1.140755 | -0.288707 | 1.075470 | 1.245641 |
2013-01-04 | 1.471408 | -1.539870 | 1.272889 | 1.255482 |
2013-01-06 | 0.203198 | -0.573590 | 1.523726 | -0.788816 |
df[df > 0]
A | B | C | D | |
---|---|---|---|---|
2013-01-01 | NaN | 1.309822 | NaN | NaN |
2013-01-02 | 0.440632 | 0.473918 | 0.092115 | 0.994856 |
2013-01-03 | 1.140755 | NaN | 1.075470 | 1.245641 |
2013-01-04 | 1.471408 | NaN | 1.272889 | 1.255482 |
2013-01-05 | NaN | NaN | 1.067496 | NaN |
2013-01-06 | 0.203198 | NaN | 1.523726 | NaN |
Asignación¶
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
s1
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
2013-01-06 5
2013-01-07 6
Freq: D, dtype: int64
df.loc[:, "D"] = np.array([5] * len(df))
df
A | B | C | D | |
---|---|---|---|---|
2013-01-01 | 0.000000 | 1.309822 | -0.168210 | 5 |
2013-01-02 | 0.440632 | 0.473918 | 0.092115 | 5 |
2013-01-03 | 1.140755 | -0.288707 | 1.075470 | 5 |
2013-01-04 | 1.471408 | -1.539870 | 1.272889 | 5 |
2013-01-05 | -1.197831 | -0.139475 | 1.067496 | 5 |
2013-01-06 | 0.203198 | -0.573590 | 1.523726 | 5 |
Datos Faltantes¶
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
df1.loc[dates[0] : dates[1], "E"] = 1
df1
A | B | C | D | E | |
---|---|---|---|---|---|
2013-01-01 | 0.000000 | 1.309822 | -0.168210 | 5 | 1.0 |
2013-01-02 | 0.440632 | 0.473918 | 0.092115 | 5 | 1.0 |
2013-01-03 | 1.140755 | -0.288707 | 1.075470 | 5 | NaN |
2013-01-04 | 1.471408 | -1.539870 | 1.272889 | 5 | NaN |
df1.dropna(how="any")
A | B | C | D | E | |
---|---|---|---|---|---|
2013-01-01 | 0.000000 | 1.309822 | -0.168210 | 5 | 1.0 |
2013-01-02 | 0.440632 | 0.473918 | 0.092115 | 5 | 1.0 |
df1.fillna(value=5)
A | B | C | D | E | |
---|---|---|---|---|---|
2013-01-01 | 0.000000 | 1.309822 | -0.168210 | 5 | 1.0 |
2013-01-02 | 0.440632 | 0.473918 | 0.092115 | 5 | 1.0 |
2013-01-03 | 1.140755 | -0.288707 | 1.075470 | 5 | 5.0 |
2013-01-04 | 1.471408 | -1.539870 | 1.272889 | 5 | 5.0 |
pd.isna(df1)
A | B | C | D | E | |
---|---|---|---|---|---|
2013-01-01 | False | False | False | False | False |
2013-01-02 | False | False | False | False | False |
2013-01-03 | False | False | False | False | True |
2013-01-04 | False | False | False | False | True |
Operaciones¶
df.mean()
A 0.343027
B -0.126317
C 0.810581
D 5.000000
dtype: float64
df.mean(1)
2013-01-01 1.535403
2013-01-02 1.501666
2013-01-03 1.731880
2013-01-04 1.551107
2013-01-05 1.182548
2013-01-06 1.538334
Freq: D, dtype: float64
df.apply(np.cumsum)
A | B | C | D | |
---|---|---|---|---|
2013-01-01 | 0.000000 | 1.309822 | -0.168210 | 5 |
2013-01-02 | 0.440632 | 1.783741 | -0.076095 | 10 |
2013-01-03 | 1.581387 | 1.495034 | 0.999376 | 15 |
2013-01-04 | 3.052795 | -0.044836 | 2.272265 | 20 |
2013-01-05 | 1.854964 | -0.184311 | 3.339761 | 25 |
2013-01-06 | 2.058162 | -0.757901 | 4.863487 | 30 |