To load the pandas package and start working with it, import the package: import pandas as pd.
To manually store data in a table, create a DataFrame. When using a Python dictionary of lists, the dictionary keys will be used as column headers and the values in each list as columns of the DataFrame.
A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns. It is similar to a spreadsheet, a SQL table or the data.frame in R: df = pd.DataFrame({<data>}).
When selecting a single column of a pandas DataFrame, the result is a pandas Series. To select the column, use the column label in between square brackets [].
You can create a Series from scratch as well: ages = pd.Series([22, 35, 58], name="Age")
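Remember
Import the package with import pandas as pd.
A table of data is stored as a pandas DataFrame.
Each column in a DataFrame is a Series.
You can do things by applying a method to a DataFrame or Series.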
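The points above can be sketched in a few lines. This is a minimal illustration, not part of the original tutorial data: the names and values in the dictionary are made up.

```python
import pandas as pd

# Hypothetical sample data; the names and values are made up for illustration.
df = pd.DataFrame(
    {
        "Name": ["Passenger A", "Passenger B", "Passenger C"],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)

# The dictionary keys became the column headers.
print(list(df.columns))

# Selecting one column with square brackets returns a Series.
ages = df["Age"]
print(type(ages).__name__)

# A Series can also be built directly; it has a name but no column labels.
ages_from_scratch = pd.Series([22, 35, 58], name="Age")

# Methods can be applied to either object, for example the maximum age.
print(df["Age"].max(), ages_from_scratch.max())
```

Both max() calls work on the same values here, which illustrates that a column of a DataFrame and a standalone Series behave alike.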
pandas provides the read_csv() function to read data stored as a csv file into a pandas DataFrame. pandas supports many different file formats or data sources out of the box (csv, excel, sql, json, parquet, …), each read by a function with the prefix read_*: titanic = pd.read_csv("data/titanic.csv")
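When displaying a DataFrame, the first and last 5 rows will be shown by default. To see the first N rows of a DataFrame, use the head() method with the required number of rows as argument: titanic.head(8). To see the last N rows instead, use the tail() method. For example, titanic.tail(10) will return the last 10 rows of the DataFrame.
To check how pandas interpreted the data type of each column, request the dtypes attribute: titanic.dtypes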
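A small sketch of reading and inspecting a table follows. To keep it self-contained, it reads a few made-up rows from an in-memory string instead of data/titanic.csv; the column names and values are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical stand-in for data/titanic.csv: a few made-up rows kept in memory.
csv_text = """PassengerId,Name,Age,Fare
1,Passenger A,22,7.25
2,Passenger B,38,71.28
3,Passenger C,,7.92
4,Passenger D,35,53.1
"""

# read_csv accepts a file path or a file-like object such as StringIO.
titanic = pd.read_csv(io.StringIO(csv_text))

print(titanic.head(2))  # first 2 rows
print(titanic.tail(1))  # last row
print(titanic.dtypes)   # an attribute, so no parentheses
```

Because one Age value is missing, pandas stores that column as floating point rather than integer, which is the kind of detail dtypes makes visible.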
Whereas read_* functions are used to read data into pandas, the to_* methods are used to store data. The to_excel() method stores the data as an Excel file. In the example here, the sheet is named passengers through sheet_name instead of the default Sheet1. By setting index=False, the row index labels are not saved in the spreadsheet: titanic.to_excel("titanic.xlsx", sheet_name="passengers", index=False). The equivalent read function read_excel() will reload the data to a DataFrame: titanic = pd.read_excel("titanic.xlsx", sheet_name="passengers")
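The effect of index=False can be seen in a round trip. The sketch below swaps in to_csv and read_csv with an in-memory buffer, since writing .xlsx files requires an optional Excel engine such as openpyxl to be installed; the index parameter plays the same role in both methods. The table contents are made up.

```python
import io

import pandas as pd

# Hypothetical two-row table, made up for illustration.
titanic = pd.DataFrame({"Name": ["Passenger A", "Passenger B"], "Age": [22.0, 38.0]})

# index=False: the row index labels are left out of the stored data.
buffer = io.StringIO()
titanic.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Default index=True: the index is written as an extra, unnamed first column.
buffer_with_index = io.StringIO()
titanic.to_csv(buffer_with_index)
buffer_with_index.seek(0)
reloaded_with_index = pd.read_csv(buffer_with_index)

print(list(reloaded.columns))
print(list(reloaded_with_index.columns))
```

Without index=False, the reloaded table gains a column that was never part of the original data, which is usually not what is wanted for a plain export.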
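Remember
Getting data into pandas from many different file formats or data sources is supported by read_* functions.
Exporting data out of pandas is provided by the different to_* methods.
The head/tail/info methods and the dtypes attribute are convenient for a first check.
To select a single column, use square brackets [] with the column name of the column of interest: ages = titanic["Age"].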
To select multiple columns, use a list of column names within the selection brackets []. The returned data type is a pandas DataFrame: age_sex = titanic[["Age", "Sex"]].
DataFrame.shape is an attribute (as noted in the tutorial on reading and writing, do not use parentheses for attributes) of a pandas Series and DataFrame containing the number of rows and columns: (nrows, ncolumns). A pandas Series is 1-dimensional and only the number of rows is returned: titanic["Age"].shape
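The difference between selecting one column and selecting a list of columns shows up in the type and shape of the result. A minimal sketch, using a made-up three-row table in place of the Titanic data:

```python
import pandas as pd

# Hypothetical three-row table, made up for illustration.
titanic = pd.DataFrame(
    {
        "Name": ["Passenger A", "Passenger B", "Passenger C"],
        "Age": [22.0, 38.0, 26.0],
        "Sex": ["male", "female", "female"],
    }
)

ages = titanic["Age"]              # single label -> Series
age_sex = titanic[["Age", "Sex"]]  # list of labels -> DataFrame

print(type(ages).__name__, ages.shape)        # Series (3,)
print(type(age_sex).__name__, age_sex.shape)  # DataFrame (3, 2)
```

Note the double brackets in the second selection: the outer pair selects from the DataFrame, the inner pair is an ordinary Python list.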
To select rows based on a conditional expression, use a condition inside the selection brackets []: above_35 = titanic[titanic["Age"] > 35]. The condition inside the selection brackets, titanic["Age"] > 35, checks for which rows the Age column has a value larger than 35.
Similar to the conditional expression, the isin() conditional function returns True for each row whose value is in the provided list. To filter the rows based on such a function, use the conditional function inside the selection brackets []. In this case, the condition inside the selection brackets, titanic["Pclass"].isin([2, 3]), checks for which rows the Pclass column is either 2 or 3: class_23 = titanic[titanic["Pclass"].isin([2, 3])]
The above is equivalent to filtering by rows for which the class is either 2 or 3 and combining the two statements with an | (or) operator: class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)].
The notna() conditional function returns True for each row whose value is not a null value. As such, this can be combined with the selection brackets [] to filter the data table: age_no_na = titanic[titanic["Age"].notna()]
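The row filters described above can be tried side by side. This is a sketch on a made-up five-row table; only the column names Age and Pclass come from the text. Each condition combined with | needs its own parentheses, because | binds more tightly than == in Python.

```python
import pandas as pd

# Hypothetical five-row table; None becomes a missing value (NaN) in Age.
titanic = pd.DataFrame(
    {
        "Age": [22.0, 38.0, None, 54.0, 2.0],
        "Pclass": [3, 1, 3, 1, 2],
    }
)

# Boolean condition: rows with a missing Age compare as False and are dropped.
above_35 = titanic[titanic["Age"] > 35]

# isin() and the explicit | combination select the same rows.
class_23 = titanic[titanic["Pclass"].isin([2, 3])]
class_23_or = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]

# notna() keeps only the rows where Age is known.
age_no_na = titanic[titanic["Age"].notna()]

print(above_35.index.tolist())
print(class_23.equals(class_23_or))
print(len(age_no_na))
```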
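When a subset of both rows and columns is made in one go, just using selection brackets [] is not sufficient anymore. The loc/iloc operators are required in front of the selection brackets []. When using loc/iloc, the part before the comma is the rows you want, and the part after the comma is the columns you want to select: adult_names = titanic.loc[titanic["Age"] > 35, "Name"].
When specifically interested in certain rows and/or columns based on their position in the table, use the iloc operator in front of the selection brackets []: titanic.iloc[9:25, 2:5]
Remember
When selecting subsets of data, square brackets [] are used.
Select specific rows and/or columns using loc when using the row and column names.
Select specific rows and/or columns using iloc when using the positions in the table.
You can assign new values to a selection based on loc/iloc.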
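The three uses of loc/iloc listed above can be sketched on a small table. The four rows below are made up; the slice bounds are chosen to fit this table and are not the ones from the Titanic example.

```python
import pandas as pd

# Hypothetical four-row table, made up for illustration.
titanic = pd.DataFrame(
    {
        "Name": ["Passenger A", "Passenger B", "Passenger C", "Passenger D"],
        "Age": [22.0, 38.0, 54.0, 2.0],
        "Pclass": [3, 1, 1, 3],
    }
)

# loc: rows by condition (or label) before the comma, columns by name after it.
adult_names = titanic.loc[titanic["Age"] > 35, "Name"]

# iloc: rows and columns by position; the end of each slice is excluded.
first_rows = titanic.iloc[0:2, 1:3]

# Assigning to a loc/iloc selection updates the table itself.
titanic.iloc[0:2, 0] = "anonymous"

print(adult_names.tolist())
print(list(first_rows.columns))
print(titanic["Name"].tolist())
```

As with ordinary Python slicing, iloc[0:2, 1:3] covers positions 0 and 1 for the rows and 1 and 2 for the columns.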
To create plots, import matplotlib as well: import matplotlib.pyplot as plt.
Use the index_col and parse_dates parameters of the read_csv function to define the first (0th) column as index of the resulting DataFrame and convert the dates in the column to Timestamp objects, respectively: air_quality = pd.read_csv("data/air_quality_no2.csv", index_col=0, parse_dates=True)
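The effect of index_col and parse_dates can be checked on the index of the result. The sketch below reads a few made-up rows from an in-memory string rather than data/air_quality_no2.csv; the column layout is an assumption based on the station names used in this text.

```python
import io

import pandas as pd

# Hypothetical stand-in for data/air_quality_no2.csv with made-up measurements.
csv_text = """datetime,station_antwerp,station_paris,station_london
2019-05-07 02:00:00,50.5,25.0,23.0
2019-05-07 03:00:00,45.0,27.7,19.0
2019-05-07 04:00:00,40.0,50.4,19.0
"""

# index_col=0 uses the first column as index; parse_dates=True parses that index as dates.
air_quality = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(type(air_quality.index).__name__)
print(air_quality.index[0])
print(air_quality.shape)
```

With the dates parsed into the index, the remaining three columns hold only numeric data, which is what the plotting methods operate on.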
With a DataFrame, pandas creates by default one line plot for each of the columns with numeric data:
air_quality.plot()
plt.show()
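To plot a specific column, use the selection method of the subset data tutorial in combination with the plot() method. Hence, the plot() method works on both Series and DataFrame:
air_quality["station_paris"].plot()
plt.show()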
Apart from the default line plot when using the plot function, a number of alternatives are available to plot data. Standard Python, such as calling dir() on air_quality.plot, gives an overview of the available plot methods. For example, a scatter plot compares the values of two columns:
air_quality.plot.scatter(x="station_london", y="station_paris", alpha=0.5)
plt.show()
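One way to get an overview of the available plot methods with standard Python is to list the public attributes of the plot accessor. This is a sketch on a made-up two-column table; the exact list printed depends on the installed pandas version.

```python
import pandas as pd

# Hypothetical two-row table, made up for illustration.
air_quality = pd.DataFrame(
    {"station_paris": [25.0, 27.7], "station_london": [23.0, 19.0]}
)

# Keep every attribute of the plot accessor that does not start with an underscore.
plot_methods = [
    method_name
    for method_name in dir(air_quality.plot)
    if not method_name.startswith("_")
]
print(plot_methods)
```

The names in the list, such as scatter, box and area, are the ones that can follow air_quality.plot in a call.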
One of the options is DataFrame.plot.box(), which refers to a boxplot. The box method is applicable on the air quality example data:
air_quality.plot.box()
plt.show()
Separate subplots for each of the data columns are supported by the subplots argument of the plot functions:
axs = air_quality.plot.area(figsize=(12, 4), subplots=True)
plt.show()
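With subplots=True the call returns one matplotlib Axes object per column, which can then be customized individually. The sketch below uses made-up measurements and selects the non-interactive Agg backend, an assumption made here so that it runs without a display; in an interactive session that line can be dropped and plt.show() used instead.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical measurements, made up for illustration.
air_quality = pd.DataFrame(
    {"station_paris": [25.0, 27.7, 50.4], "station_london": [23.0, 19.0, 19.0]},
    index=pd.to_datetime(
        ["2019-05-07 02:00:00", "2019-05-07 03:00:00", "2019-05-07 04:00:00"]
    ),
)

# One area plot per column, stacked vertically in a single figure.
axs = air_quality.plot.area(figsize=(12, 4), subplots=True)

# axs holds one Axes per column, so each subplot can be adjusted separately.
print(len(axs))
axs[0].set_ylabel("NO2")

plt.close("all")  # release the figure once done
```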
Remember
The .plot.* methods are applicable on both Series and DataFrames.
Source: https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/index.html