NumPy - a Python library that provides functionality comparable to mathematical tools such as MATLAB and R. It significantly simplifies working with numeric data in Python and offers comprehensive mathematical functions.
import numpy as np
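For example, a minimal sketch of what NumPy enables (these values are illustrative, not from the module's dataset):
# Illustrative example: store grades in a NumPy array and compute the mean
grades = np.array([50, 50, 47, 97, 49])
print('Mean grade:', grades.mean())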
Pandas - popular Python library for data analysis and manipulation. Like a spreadsheet application for Python that provides easy-to-use functionality for data tables.
import pandas as pd
Jupyter notebooks - a popular way of running basic scripts interactively in your web browser.
Data exploration and analysis is typically an iterative process, in which the data scientist takes a sample of data and performs tasks such as cleaning it to handle errors and missing values, applying statistical techniques to better understand it, and visualizing it to determine relationships between variables, in order to analyze the data and test hypotheses.
NumPy provides a lot of the functionality and tools you need to work with numbers, such as arrays of numeric values. However, when you start to deal with two-dimensional tables of data, the Pandas package offers a more convenient structure to work with: the DataFrame.
# student_data is a 2-row NumPy array created earlier in the module
# (row 0 holds study hours, row 1 holds grades)
df_students = pd.DataFrame({'Name': ['Dan', 'Joann', 'Pedro', 'Rosie', 'Ethan', 'Vicky', 'Frederic', 'Jimmie',
                                     'Rhonda', 'Giovanni', 'Francesca', 'Rajab', 'Naiyana', 'Kian', 'Jenny',
                                     'Jakeem', 'Helena', 'Ismat', 'Anila', 'Skye', 'Daniel', 'Aisha'],
                            'StudyHours': student_data[0],
                            'Grade': student_data[1]})
You can use the DataFrame’s loc method to retrieve data for a specific index value. In addition to finding rows based on the index with loc, you can use the iloc method to find rows based on their ordinal position in the DataFrame (regardless of the index). iloc identifies data values in a DataFrame by position, which extends beyond rows to columns. The loc method returns rows whose index label is in the range of values from 0 to 5, which includes 0, 1, 2, 3, 4, and 5 (six rows). The iloc method, however, returns the rows in the positions included in the range 0 to 5; since integer ranges don’t include the upper-bound value, this means positions 0, 1, 2, 3, and 4 (five rows). There are also many different ways in Pandas to achieve the same result (such as looking up a value).
# Get the data for index value 5
df_students.loc[5]
# Get the rows with index values from 0 to 5
df_students.loc[0:5]
# Get data in the first five rows
df_students.iloc[0:5]
# 4 different ways to look up data
df_students.loc[df_students['Name']=='Aisha']
df_students[df_students['Name']=='Aisha']
df_students.query('Name=="Aisha"')
df_students[df_students.Name == 'Aisha']
In many real-world scenarios, data is loaded from sources such as files. The Pandas read_csv function is used to load data from text files into a DataFrame.
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/grades.csv
df_students = pd.read_csv('grades.csv',delimiter=',',header='infer')
df_students.head()
One of the most common issues data scientists need to deal with is incomplete or missing data. You can use the isnull method to identify which individual values are null. When the DataFrame is retrieved, the missing numeric values show up as NaN (not a number).
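For example, a minimal sketch of checking for missing values:
# Identify which individual values are null
df_students.isnull()
# Count the missing values in each column
df_students.isnull().sum()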
One common approach is to impute replacement values. For example, we can use the fillna method to replace missing values with the column mean:
df_students.StudyHours = df_students.StudyHours.fillna(df_students.StudyHours.mean())
Alternatively, it might be important to ensure that you only use data you know to be absolutely correct. In that case, you can drop rows or columns that contain null values by using the dropna method:
df_students = df_students.dropna(axis=0, how='any')
We can create a new column by building a Pandas Series containing the pass/fail indicator (True or False) and then concatenating that series as a new column (axis 1) in the DataFrame:
passes = pd.Series(df_students['Grade'] >= 60)
df_students = pd.concat([df_students, passes.rename("Pass")], axis=1)
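As a side note, the same result can be achieved with a direct column assignment, a common and slightly simpler Pandas idiom:
# Equivalent: assign the Boolean Series directly as a new column
df_students['Pass'] = df_students['Grade'] >= 60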
DataFrames provide a great way to explore and analyze tabular data, but sometimes a picture is worth a thousand rows and columns. The Matplotlib library provides the foundation for plotting data visualizations that can greatly enhance your ability to analyze the data.
# Ensure plots are displayed inline in the notebook
%matplotlib inline
from matplotlib import pyplot as plt
# Create a bar plot of name vs grade
plt.bar(x=df_students.Name, height=df_students.Grade)
# Display the plot
plt.show()
Note that we used the pyplot module from Matplotlib to plot the chart. It provides many ways to improve the visual elements of the plot. A plot is technically contained within a Figure. In the previous examples, the figure was created implicitly for you, but you can create it explicitly with a specific size.
# Create a figure for 2 subplots (1 row, 2 columns)
fig, ax = plt.subplots(1, 2, figsize = (10,4))
# Create a bar plot of name vs grade on the first axis
ax[0].bar(x=df_students.Name, height=df_students.Grade, color='orange')
ax[0].set_title('Grades')
ax[0].set_xticklabels(df_students.Name, rotation=90)
# Create a pie chart of pass counts on the second axis
pass_counts = df_students['Pass'].value_counts()
ax[1].pie(pass_counts, labels=pass_counts)
ax[1].set_title('Passing Grades')
ax[1].legend(pass_counts.keys().tolist())
# Add a title to the Figure
fig.suptitle('Student Data')
# Show the figure
fig.show()
When examining a variable (for example, a sample of student grades), data scientists are particularly interested in its distribution (in other words, how the different grade values are spread across the sample). The starting point for this exploration is often to visualize the data as a histogram and see how frequently each value for the variable occurs.
# Get the variable to examine
var_data = df_students['Grade']
# Create a Figure
fig = plt.figure(figsize=(10,4))
# Plot a histogram
plt.hist(var_data)
# Add titles and labels
plt.title('Data Distribution')
plt.xlabel('Value')
plt.ylabel('Frequency')
# Show the figure
fig.show()
To understand the distribution better, we can examine measures of central tendency, which is a way of describing statistics that represent the “middle” of the data. The goal of this analysis is to try to find a “typical” value. Common ways to define the middle of the data include:
The mean: a simple average, calculated by adding all of the values in the sample set together and dividing the total by the number of samples.
The median: the value in the middle of the range of all of the sample values.
The mode: the most commonly occurring value in the sample set (in some sample sets, there may be a tie for the most common value. In those cases, the dataset is described as bimodal or even multimodal).
# Get the variable to examine
var = df_students['Grade']
# Get statistics
min_val = var.min()
max_val = var.max()
mean_val = var.mean()
med_val = var.median()
mod_val = var.mode()[0]
print('Minimum:{:.2f}\nMean:{:.2f}\nMedian:{:.2f}\nMode:{:.2f}\nMaximum:{:.2f}\n'.format(min_val,
mean_val,
med_val,
mod_val,
max_val))
# Create a Figure
fig = plt.figure(figsize=(10,4))
# Plot a histogram
plt.hist(var)
# Add lines for the statistics
plt.axvline(x=min_val, color = 'gray', linestyle='dashed', linewidth = 2)
plt.axvline(x=mean_val, color = 'cyan', linestyle='dashed', linewidth = 2)
plt.axvline(x=med_val, color = 'red', linestyle='dashed', linewidth = 2)
plt.axvline(x=mod_val, color = 'yellow', linestyle='dashed', linewidth = 2)
plt.axvline(x=max_val, color = 'gray', linestyle='dashed', linewidth = 2)
# Add titles and labels
plt.title('Data Distribution')
plt.xlabel('Value')
plt.ylabel('Frequency')
# Show the figure
fig.show()
Another way to visualize the distribution of a variable is to use a box plot (sometimes called a box-and-whiskers plot). The box part of the plot shows where the middle two quartiles of the data reside (between the 25th and 75th percentiles), the whiskers extending from the box show the outer two quartiles, and the line in the box indicates the median value.
# Get the variable to examine
var = df_students['Grade']
# Create a Figure
fig = plt.figure(figsize=(10,4))
# Plot a box plot
plt.boxplot(var)
# Add titles and labels
plt.title('Data Distribution')
# Show the figure
fig.show()
If we have enough samples, we can calculate something called a probability density function, which estimates the distribution of grades for the full population. The Pandas plot.density method provides a helpful way to show this density. As expected from the histogram of the sample, the density shows the characteristic “bell curve” of what statisticians call a normal distribution, with the mean and mode at the center and symmetric tails.
def show_density(var_data):
    from matplotlib import pyplot as plt
    # Create a figure
    fig = plt.figure(figsize=(10,4))
    # Plot density
    var_data.plot.density()
    # Add titles and labels
    plt.title('Data Density')
    # Show the mean, median, and mode
    plt.axvline(x=var_data.mean(), color='cyan', linestyle='dashed', linewidth=2)
    plt.axvline(x=var_data.median(), color='red', linestyle='dashed', linewidth=2)
    plt.axvline(x=var_data.mode()[0], color='yellow', linestyle='dashed', linewidth=2)
    # Show the figure
    plt.show()
# Get the density of Grade
col = df_students['Grade']
show_density(col)
Data presented in educational material is often remarkably perfect, designed to show students how to find clear relationships between variables. “Real world” data is a bit less simple: it can contain many different issues that affect the utility of the data and our interpretation of the results. Real-world data will always have issues, but data scientists can often work around them, for example by identifying and excluding outliers, as in the percentile-based approach below.
When we have more data available, our sample becomes more reliable. This makes it easier to consider outliers as being values that fall below or above percentiles within which most of the data lie. For example, the following code uses the Pandas quantile function to exclude observations below the 1st percentile (the value above which 99% of the data reside).
# Calculate the 1st percentile
q01 = df_students.StudyHours.quantile(0.01)
# Get the variable to examine, excluding values below the 1st percentile
col = df_students[df_students.StudyHours > q01]['StudyHours']
# Call the show_distribution helper function defined earlier in the module
show_distribution(col)
Typical statistics that measure variability in the data include:
The range: the difference between the maximum and minimum values. There’s no built-in function for this, but it’s easy to calculate using the min and max functions.
The variance: the average of the squared difference from the mean. You can use the built-in var function to find this.
The standard deviation: the square root of the variance. You can use the built-in std function to find this.
Of these statistics, the standard deviation is generally the most useful. It provides a measure of variance in the data on the same scale as the data itself (so grade points for the Grade distribution and hours for the StudyHours distribution). The higher the standard deviation, the more variance there is when comparing values in the distribution to the distribution mean; in other words, the data is more spread out.
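A minimal sketch of computing these measures for the Grade column:
# Compute variability statistics for the Grade column
col = df_students['Grade']
rng = col.max() - col.min()
print('Range: {:.2f}\nVariance: {:.2f}\nStd.Dev: {:.2f}'.format(rng, col.var(), col.std()))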
Because descriptive statistics are such an important part of exploring data, the DataFrame object has a built-in describe method that returns the main descriptive statistics for all numeric columns.
df_students.describe()
To see if there’s a correlation between study time and grade, we can use a statistical correlation measurement to quantify the relationship between two columns. The correlation statistic is a value between -1 and 1 that indicates the strength of a relationship. Values above 0 indicate a positive correlation (high values of one variable tend to coincide with high values of the other), while values below 0 indicate a negative correlation (high values of one variable tend to coincide with low values of the other). Values close to 0 indicate little or no linear relationship.
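The line below assumes a df_normalized DataFrame in which the Grade and StudyHours columns have been rescaled to a common range; a minimal sketch using min-max scaling (note that Pearson correlation is unaffected by linear scaling, so the unscaled columns would give the same value):
# Min-max scale the numeric columns so both lie between 0 and 1
numeric = df_students[['StudyHours', 'Grade']]
df_normalized = (numeric - numeric.min()) / (numeric.max() - numeric.min())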
df_normalized.Grade.corr(df_normalized.StudyHours)
Another way to visualize the apparent correlation between two numeric columns is to use a scatter plot.
# df_sample is a sample of the student data (with StudyHours and Grade columns) created earlier in the module
# Create a scatter plot
df_sample.plot.scatter(title='Study Time vs Grade', x='StudyHours', y='Grade')
Source: Microsoft Learn