import pandas as pd
To manually store data in a table, create a DataFrame. When using a Python dictionary of lists, the dictionary keys are used as column headers and the values in each list as the columns of the DataFrame.
A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns. It is similar to a spreadsheet, a SQL table or the data.frame in R: df = pd.DataFrame({<data>}).
When selecting a single column of a pandas DataFrame, the result is a pandas Series. To select the column, use the column label in between square brackets []. You can create a Series from scratch as well: ages = pd.Series([22, 35, 58], name="Age").
Remember
- Import the package with import pandas as pd.
- A table of data is stored as a pandas DataFrame.
- Each column in a DataFrame is a Series.
- You can do things by applying a method to a DataFrame or Series.
pandas provides the read_csv() function to read data stored as a csv file into a pandas DataFrame. pandas supports many different file formats or data sources out of the box (csv, excel, sql, json, parquet, …), each of them with the prefix read_*: titanic = pd.read_csv("data/titanic.csv").
When displaying a DataFrame, the first and last 5 rows are shown by default. To see the first N rows of a DataFrame, use the head() method with the required number of rows as argument: titanic.head(8). For the last N rows, pandas provides the tail() method. For example, titanic.tail(10) returns the last 10 rows of the DataFrame.
To check how pandas interpreted the data type of each column, request the dtypes attribute: titanic.dtypes.
Whereas read_* functions are used to read data into pandas, the to_* methods are used to store data. The to_excel() method stores the data as an Excel file. In the example here, the sheet_name is set to passengers instead of the default Sheet1. By setting index=False, the row index labels are not saved in the spreadsheet: titanic.to_excel("titanic.xlsx", sheet_name="passengers", index=False).
The equivalent read function read_excel() reloads the data into a DataFrame: titanic = pd.read_excel("titanic.xlsx", sheet_name="passengers").
Remember
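- Getting data into pandas from many different file formats or data sources is supported by read_* functions.
- Exporting data out of pandas is provided by the different to_* methods.
- The head/tail/info methods and the dtypes attribute are convenient for a first check.
To select a single column, use square brackets [] with the column name of the column of interest: ages = titanic["Age"].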
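The read and write points above, and the single-column selection, can be tried without the Titanic file. The following is a minimal sketch that assumes a small illustrative table (the three rows and the names `passengers`, `buffer`, and `reloaded` are assumptions, not the tutorial data) and uses an in-memory CSV buffer in place of a file on disk.

```python
import io

import pandas as pd

# Hypothetical stand-in for the Titanic data: a dictionary of lists,
# where keys become column headers and each list becomes a column.
passengers = pd.DataFrame(
    {
        "Name": ["Braund", "Allen", "Bonnell"],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)

# to_* methods write data out; index=False leaves the row labels out.
buffer = io.StringIO()
passengers.to_csv(buffer, index=False)

# read_* functions read data back in (here from the in-memory buffer).
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# First checks: the first rows and the inferred column types.
print(reloaded.head(2))
print(reloaded.dtypes)

# Selecting a single column with [] returns a Series.
ages = reloaded["Age"]
print(type(ages).__name__)
```

CSV is used here only because it needs no extra dependency; the to_excel()/read_excel() pair follows the same write-then-read pattern.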
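To select multiple columns, use a list of column names within the selection brackets []. The returned data type is a pandas DataFrame: age_sex = titanic[["Age", "Sex"]].
DataFrame.shape is an attribute (as noted in the tutorial on reading and writing, do not use parentheses for attributes) of a pandas Series and DataFrame containing the number of rows and columns: (nrows, ncolumns). A pandas Series is 1-dimensional, so only the number of rows is returned: titanic["Age"].shape.
To select rows based on a conditional expression, use the condition inside the selection brackets []: above_35 = titanic[titanic["Age"] > 35].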
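A short sketch of the three ideas above (a list of column names, the shape attribute, and a conditional filter), assuming a small made-up table named `people` rather than the Titanic data:

```python
import pandas as pd

# Hypothetical data standing in for the Titanic table.
people = pd.DataFrame(
    {
        "Age": [22, 35, 58, 41],
        "Sex": ["male", "male", "female", "female"],
        "Pclass": [3, 1, 1, 2],
    }
)

# A list of column names inside [] returns a DataFrame.
age_sex = people[["Age", "Sex"]]
print(age_sex.shape)  # (number of rows, number of columns)

# A single column is a Series, so its shape has only one entry.
print(people["Age"].shape)

# A condition produces a boolean Series; inside [] it keeps the True rows.
older_than_35 = people["Age"] > 35
above_35 = people[older_than_35]
print(above_35)
```

Storing the condition in its own variable is optional; it is done here only to make the intermediate boolean Series visible.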
The condition inside the selection brackets, titanic["Age"] > 35, checks for which rows the Age column has a value larger than 35.
Similar to the conditional expression, the isin() conditional function returns True for each row whose value is in the provided list. To filter the rows based on such a function, use the conditional function inside the selection brackets []. In this case, the condition inside the selection brackets, titanic["Pclass"].isin([2, 3]), checks for which rows the Pclass column is either 2 or 3: class_23 = titanic[titanic["Pclass"].isin([2, 3])].
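To make the isin() filter concrete, here is a minimal sketch on a made-up Pclass column (the table `people` and its values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical passenger classes for illustration only.
people = pd.DataFrame({"Name": ["A", "B", "C", "D"], "Pclass": [1, 2, 3, 2]})

# isin() returns True for every row whose value is in the list.
mask = people["Pclass"].isin([2, 3])
print(mask)

# Inside the selection brackets, the mask keeps only the True rows.
class_23 = people[mask]
print(class_23)

# Summing a boolean Series counts the matching rows.
print(mask.sum())
```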
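The isin() filter above is equivalent to filtering for rows where the class is either 2 or 3 and combining the two conditions with the | (or) operator, with each condition wrapped in parentheses: class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)].
The notna() conditional function returns True for each row whose value is not a null value. As such, it can be combined with the selection brackets [] to filter the data table: age_no_na = titanic[titanic["Age"].notna()].
When a subset of both rows and columns is made in one go, using the selection brackets [] alone is not sufficient anymore. The loc/iloc operators are required in front of the selection brackets []. When using loc/iloc, the part before the comma is the rows you want, and the part after the comma is the columns you want to select: adult_names = titanic.loc[titanic["Age"] > 35, "Name"].
When specifically interested in certain rows and/or columns based on their position in the table, use the iloc operator in front of the selection brackets []: titanic.iloc[9:25, 2:5].
Remember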
- When selecting subsets of data, square brackets [] are used.
- Select specific rows and/or columns using loc when using the row and column names.
- Select specific rows and/or columns using iloc when using the positions in the table.
- You can assign new values to a selection based on loc/iloc.
import matplotlib.pyplot as plt
Use the index_col and parse_dates parameters of the read_csv function to define the first (0th) column as index of the resulting DataFrame and convert the dates in the column to Timestamp objects, respectively: air_quality = pd.read_csv("data/air_quality_no2.csv", index_col=0, parse_dates=True).
With a DataFrame, pandas creates by default one line plot for each of the columns with numeric data:
air_quality.plot()
plt.show()
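The effect of index_col and parse_dates can be checked without the air quality file. This sketch assumes a tiny made-up CSV (the values and the `datetime` column name are illustrative, not taken from the real file) read from an in-memory string:

```python
import io

import pandas as pd

# Hypothetical excerpt; the real file and its values are not reproduced here.
csv_text = """datetime,station_london,station_paris
2019-05-07 02:00:00,23.0,25.0
2019-05-07 03:00:00,19.0,27.7
2019-05-07 04:00:00,19.0,50.4
"""

# index_col=0 makes the first column the index;
# parse_dates=True converts that index to Timestamp objects.
air_quality = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(type(air_quality.index).__name__)
print(air_quality.index[0])
print(air_quality.dtypes)
```

With a date-typed index in place, the line plot uses the timestamps on the x-axis rather than plain row numbers.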
To plot a specific column, combine column selection with the plot() method. Hence, the plot() method works on both Series and DataFrame:
air_quality["station_paris"].plot()
plt.show()
Apart from the default line plot when using the plot function, a number of alternatives are available to plot data, and standard Python can be used to get an overview of the available plot methods. For example, a scatter plot of station_london against station_paris:
air_quality.plot.scatter(x="station_london", y="station_paris", alpha=0.5)
plt.show()
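An overview of the available plot methods can be obtained with standard Python by listing the public attributes of the plot accessor. This is a sketch on a made-up table (`df` and its values are assumptions); the exact list depends on the installed pandas version, so no output is shown.

```python
import pandas as pd

# Hypothetical data; only the plot accessor is inspected, nothing is drawn.
df = pd.DataFrame({"station_london": [23.0, 19.0], "station_paris": [25.0, 27.7]})

# Public names on the accessor correspond to the available plot kinds.
plot_methods = [name for name in dir(df.plot) if not name.startswith("_")]
print(plot_methods)
```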
One of the options is DataFrame.plot.box(), which refers to a boxplot. The box method is applicable to the air quality example data:
air_quality.plot.box()
plt.show()
Separate subplots for each of the data columns are supported by the subplots argument of the plot functions:
axs = air_quality.plot.area(figsize=(12, 4), subplots=True)
plt.show()
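As a rough check of what subplots=True returns, the sketch below draws the area plot on made-up data with a non-interactive Matplotlib backend. The `Agg` backend choice and the sample values are assumptions, made so the example runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, no window is opened

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical measurements for two stations.
air_quality = pd.DataFrame(
    {"station_london": [23.0, 19.0, 21.0], "station_paris": [25.0, 27.7, 50.4]},
    index=pd.to_datetime(
        ["2019-05-07 02:00", "2019-05-07 03:00", "2019-05-07 04:00"]
    ),
)

# With subplots=True, each column is drawn on its own axes,
# and the call returns one axes object per column.
axs = air_quality.plot.area(figsize=(12, 4), subplots=True)
print(len(axs))

# Release the figure once it is no longer needed.
plt.close("all")
```

Keeping the returned axes in a variable such as axs makes it possible to customize each subplot afterwards with regular Matplotlib calls.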
Remember
- The .plot.* methods are applicable on both Series and DataFrames.
Source: https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/index.html