Pandas ¶

Tomamos una muestra de la guía rápida de Pandas (10 minutes to pandas), junto con un par de complementos.

Creamos objetos ¶

import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

  1.0
  3.0
  5.0
  NaN
  6.0
  8.0
dtype: float64

dates = pd.date_range("20130101", periods=6)
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df

	A	B	C	D
2013-01-01	-0.255445	1.309822	-0.168210	-0.584129
2013-01-02	0.440632	0.473918	0.092115	0.994856
2013-01-03	1.140755	-0.288707	1.075470	1.245641
2013-01-04	1.471408	-1.539870	1.272889	1.255482
2013-01-05	-1.197831	-0.139475	1.067496	-0.351955
2013-01-06	0.203198	-0.573590	1.523726	-0.788816

df2 = pd.DataFrame(

    {

        "A": 1.0,

        "B": pd.Timestamp("20130102"),

        "C": pd.Series(1, index=list(range(4)), dtype="float32"),

        "D": np.array([3] * 4, dtype="int32"),

        "E": pd.Categorical(["test", "train", "test", "train"]),

        "F": "foo",

    }

)

df2

	A	B	C	D	E	F
0	1.0	2013-01-02	1.0	3	test	foo
1	1.0	2013-01-02	1.0	3	train	foo
2	1.0	2013-01-02	1.0	3	test	foo
3	1.0	2013-01-02	1.0	3	train	foo

df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

Dependiendo del editor de texto, se puede utilizar <TAB> para tener un despliegue de métodos.

Normalmente tenemos un archivo (local o remoto), con los datos. Tomaremos un ejemplo del repositorio de datos de UCI, Machine Learning Repository

Se tomará el siguiente conjunto de datos Adult (1996). UCI Machine Learning Repository.

age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

datos = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data')
datos

	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	0	40	United-States	<=50K
0	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	0	13	United-States	<=50K
1	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	0	40	United-States	<=50K
2	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	0	40	United-States	<=50K
3	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	0	40	Cuba	<=50K
4	37	Private	284582	Masters	14	Married-civ-spouse	Exec-managerial	Wife	White	Female	0	0	40	United-States	<=50K
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
32555	27	Private	257302	Assoc-acdm	12	Married-civ-spouse	Tech-support	Wife	White	Female	0	0	38	United-States	<=50K
32556	40	Private	154374	HS-grad	9	Married-civ-spouse	Machine-op-inspct	Husband	White	Male	0	0	40	United-States	>50K
32557	58	Private	151910	HS-grad	9	Widowed	Adm-clerical	Unmarried	White	Female	0	0	40	United-States	<=50K
32558	22	Private	201490	HS-grad	9	Never-married	Adm-clerical	Own-child	White	Male	0	0	20	United-States	<=50K
32559	52	Self-emp-inc	287927	HS-grad	9	Married-civ-spouse	Exec-managerial	Wife	White	Female	15024	0	40	United-States	>50K

32560 rows × 15 columns

datos = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
    header = None)
datos

	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	0	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	0	13	United-States	<=50K
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	0	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	0	40	United-States	<=50K
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	0	40	Cuba	<=50K
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
32556	27	Private	257302	Assoc-acdm	12	Married-civ-spouse	Tech-support	Wife	White	Female	0	0	38	United-States	<=50K
32557	40	Private	154374	HS-grad	9	Married-civ-spouse	Machine-op-inspct	Husband	White	Male	0	0	40	United-States	>50K
32558	58	Private	151910	HS-grad	9	Widowed	Adm-clerical	Unmarried	White	Female	0	0	40	United-States	<=50K
32559	22	Private	201490	HS-grad	9	Never-married	Adm-clerical	Own-child	White	Male	0	0	20	United-States	<=50K
32560	52	Self-emp-inc	287927	HS-grad	9	Married-civ-spouse	Exec-managerial	Wife	White	Female	15024	0	40	United-States	>50K

32561 rows × 15 columns

columnas = ["age", "workclass", "fnlwgt", "education", "education_num",
            "marital_status", "occupation","relationship", "race", "sex",
            "capital_gain","capital_loss","hours-per-week", "native-country", "class"]

datos = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
    header = None,
    names = columnas
    )
datos

	age	workclass	fnlwgt	education	education_num	marital_status	occupation	relationship	race	sex	capital_gain	capital_loss	hours-per-week	native-country	class
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	0	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	0	13	United-States	<=50K
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	0	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	0	40	United-States	<=50K
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	0	40	Cuba	<=50K
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
32556	27	Private	257302	Assoc-acdm	12	Married-civ-spouse	Tech-support	Wife	White	Female	0	0	38	United-States	<=50K
32557	40	Private	154374	HS-grad	9	Married-civ-spouse	Machine-op-inspct	Husband	White	Male	0	0	40	United-States	>50K
32558	58	Private	151910	HS-grad	9	Widowed	Adm-clerical	Unmarried	White	Female	0	0	40	United-States	<=50K
32559	22	Private	201490	HS-grad	9	Never-married	Adm-clerical	Own-child	White	Male	0	0	20	United-States	<=50K
32560	52	Self-emp-inc	287927	HS-grad	9	Married-civ-spouse	Exec-managerial	Wife	White	Female	15024	0	40	United-States	>50K

32561 rows × 15 columns

Desplegar los datos ¶

df.head()

	A	B	C	D
2013-01-01	-0.255445	1.309822	-0.168210	-0.584129
2013-01-02	0.440632	0.473918	0.092115	0.994856
2013-01-03	1.140755	-0.288707	1.075470	1.245641
2013-01-04	1.471408	-1.539870	1.272889	1.255482
2013-01-05	-1.197831	-0.139475	1.067496	-0.351955

df.tail()

	A	B	C	D
2013-01-02	0.440632	0.473918	0.092115	0.994856
2013-01-03	1.140755	-0.288707	1.075470	1.245641
2013-01-04	1.471408	-1.539870	1.272889	1.255482
2013-01-05	-1.197831	-0.139475	1.067496	-0.351955
2013-01-06	0.203198	-0.573590	1.523726	-0.788816

df.to_numpy()

array([[-0.25544475,  1.30982242, -0.16820985, -0.58412855],
       [ 0.44063178,  0.47391843,  0.09211497,  0.99485562],
       [ 1.14075548, -0.28870699,  1.07547046,  1.24564126],
       [ 1.47140771, -1.53987018,  1.27288894,  1.25548203],
       [-1.19783097, -0.13947515,  1.06749632, -0.35195464],
       [ 0.20319828, -0.57359001,  1.52372616, -0.78881627]])

df2.to_numpy()

array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)

df.describe()

	A	B	C	D
count	6.000000	6.000000	6.000000	6.000000
mean	0.300453	-0.126317	0.810581	0.295180
std	0.965425	0.964107	0.683038	0.967690
min	-1.197831	-1.539870	-0.168210	-0.788816
25%	-0.140784	-0.502369	0.335960	-0.526085
50%	0.321915	-0.214091	1.071483	0.321450
75%	0.965725	0.320570	1.223534	1.182945
max	1.471408	1.309822	1.523726	1.255482

df.T

	2013-01-01	2013-01-02	2013-01-03	2013-01-04	2013-01-05	2013-01-06
A	-0.255445	0.440632	1.140755	1.471408	-1.197831	0.203198
B	1.309822	0.473918	-0.288707	-1.539870	-0.139475	-0.573590
C	-0.168210	0.092115	1.075470	1.272889	1.067496	1.523726
D	-0.584129	0.994856	1.245641	1.255482	-0.351955	-0.788816

df.sort_index(axis = 1, ascending = False)

	D	C	B	A
2013-01-01	-0.584129	-0.168210	1.309822	-0.255445
2013-01-02	0.994856	0.092115	0.473918	0.440632
2013-01-03	1.245641	1.075470	-0.288707	1.140755
2013-01-04	1.255482	1.272889	-1.539870	1.471408
2013-01-05	-0.351955	1.067496	-0.139475	-1.197831
2013-01-06	-0.788816	1.523726	-0.573590	0.203198

df.sort_values(by="B")

	A	B	C	D
2013-01-04	1.471408	-1.539870	1.272889	1.255482
2013-01-06	0.203198	-0.573590	1.523726	-0.788816
2013-01-03	1.140755	-0.288707	1.075470	1.245641
2013-01-05	-1.197831	-0.139475	1.067496	-0.351955
2013-01-02	0.440632	0.473918	0.092115	0.994856
2013-01-01	-0.255445	1.309822	-0.168210	-0.584129

Selección ¶

df["A"]

2013-01-01   -0.255445
2013-01-02    0.440632
2013-01-03    1.140755
2013-01-04    1.471408
2013-01-05   -1.197831
2013-01-06    0.203198
Freq: D, Name: A, dtype: float64

df[0:3]

	A	B	C	D
2013-01-01	-0.255445	1.309822	-0.168210	-0.584129
2013-01-02	0.440632	0.473918	0.092115	0.994856
2013-01-03	1.140755	-0.288707	1.075470	1.245641

df["20130102":"20130104"]

	A	B	C	D
2013-01-02	0.440632	0.473918	0.092115	0.994856
2013-01-03	1.140755	-0.288707	1.075470	1.245641
2013-01-04	1.471408	-1.539870	1.272889	1.255482

Selección con etiqueta ¶

df.loc[dates[0]]

A   -0.255445
B    1.309822
C   -0.168210
D   -0.584129
Name: 2013-01-01 00:00:00, dtype: float64

df.loc[:, ["A", "B"]]

	A	B
2013-01-01	-0.255445	1.309822
2013-01-02	0.440632	0.473918
2013-01-03	1.140755	-0.288707
2013-01-04	1.471408	-1.539870
2013-01-05	-1.197831	-0.139475
2013-01-06	0.203198	-0.573590

df.loc["20130102":"20130104", ["A", "B"]]

	A	B
2013-01-02	0.440632	0.473918
2013-01-03	1.140755	-0.288707
2013-01-04	1.471408	-1.539870

df.loc["20130102", ["A", "B"]]

A    0.440632
B    0.473918
Name: 2013-01-02 00:00:00, dtype: float64

df.loc[dates[0], "A"]

-0.2554447489300441

Selección por posición ¶

df.iloc[3]

A    1.471408
B   -1.539870
C    1.272889
D    1.255482
Name: 2013-01-04 00:00:00, dtype: float64

df.iloc[3:5, 0:2]

	A	B
2013-01-04	1.471408	-1.539870
2013-01-05	-1.197831	-0.139475

df.iloc[[1, 2, 4], [0, 2]]

	A	C
2013-01-02	0.440632	0.092115
2013-01-03	1.140755	1.075470
2013-01-05	-1.197831	1.067496

df.iloc[1:3, :]

	A	B	C	D
2013-01-02	0.440632	0.473918	0.092115	0.994856
2013-01-03	1.140755	-0.288707	1.075470	1.245641

Indexado lógico ¶

df[df["A"] > 0]

	A	B	C	D
2013-01-02	0.440632	0.473918	0.092115	0.994856
2013-01-03	1.140755	-0.288707	1.075470	1.245641
2013-01-04	1.471408	-1.539870	1.272889	1.255482
2013-01-06	0.203198	-0.573590	1.523726	-0.788816

df[df > 0]

	A	B	C	D
2013-01-01	NaN	1.309822	NaN	NaN
2013-01-02	0.440632	0.473918	0.092115	0.994856
2013-01-03	1.140755	NaN	1.075470	1.245641
2013-01-04	1.471408	NaN	1.272889	1.255482
2013-01-05	NaN	NaN	1.067496	NaN
2013-01-06	0.203198	NaN	1.523726	NaN

Asignación ¶

s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
s1

2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

df.loc[:, "D"] = np.array([5] * len(df))
df

	A	B	C	D
2013-01-01	0.000000	1.309822	-0.168210	5
2013-01-02	0.440632	0.473918	0.092115	5
2013-01-03	1.140755	-0.288707	1.075470	5
2013-01-04	1.471408	-1.539870	1.272889	5
2013-01-05	-1.197831	-0.139475	1.067496	5
2013-01-06	0.203198	-0.573590	1.523726	5

Datos Faltantes ¶

df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
df1.loc[dates[0] : dates[1], "E"] = 1
df1

	A	B	C	D	E
2013-01-01	0.000000	1.309822	-0.168210	5	1.0
2013-01-02	0.440632	0.473918	0.092115	5	1.0
2013-01-03	1.140755	-0.288707	1.075470	5	NaN
2013-01-04	1.471408	-1.539870	1.272889	5	NaN

df1.dropna(how="any")

	A	B	C	D	E
2013-01-01	0.000000	1.309822	-0.168210	5	1.0
2013-01-02	0.440632	0.473918	0.092115	5	1.0

df1.fillna(value=5)

	A	B	C	D	E
2013-01-01	0.000000	1.309822	-0.168210	5	1.0
2013-01-02	0.440632	0.473918	0.092115	5	1.0
2013-01-03	1.140755	-0.288707	1.075470	5	5.0
2013-01-04	1.471408	-1.539870	1.272889	5	5.0

pd.isna(df1)

	A	B	C	D	E
2013-01-01	False	False	False	False	False
2013-01-02	False	False	False	False	False
2013-01-03	False	False	False	False	True
2013-01-04	False	False	False	False	True

Operaciones ¶

df.mean()

A    0.343027
B   -0.126317
C    0.810581
D    5.000000
dtype: float64

df.mean(1)

2013-01-01    1.535403
2013-01-02    1.501666
2013-01-03    1.731880
2013-01-04    1.551107
2013-01-05    1.182548
2013-01-06    1.538334
Freq: D, dtype: float64

df.apply(np.cumsum)

	A	B	C	D
2013-01-01	0.000000	1.309822	-0.168210	5
2013-01-02	0.440632	1.783741	-0.076095	10
2013-01-03	1.581387	1.495034	0.999376	15
2013-01-04	3.052795	-0.044836	2.272265	20
2013-01-05	1.854964	-0.184311	3.339761	25
2013-01-06	2.058162	-0.757901	4.863487	30

Gráficas ¶

ts = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))
ts = ts.cumsum()
ts.plot(); # Investigar tarea moral

df = pd.DataFrame(

    np.random.randn(1000, 4), index=ts.index, columns=["A", "B", "C", "D"]

)

df = df.cumsum()
df.plot();

Guardar ¶

df.to_csv("foo.csv")

df.to_excel("foo.xlsx", sheet_name="Sheet1")

Machine Learning

Pandas

Contenido