reading-notes

Module 1 - Create machine learning models

Explore and analyze data with Python

Explore data with NumPy and Pandas

Data exploration and analysis is typically an iterative process, in which the data scientist takes a sample of data and performs the following kinds of tasks to analyze it and test hypotheses:

  1. Clean data to handle errors, missing values, and other issues.
  2. Apply statistical techniques to better understand the data and how the sample might be expected to represent the real-world population of data, allowing for random variation.
  3. Visualize data to determine relationships between variables, and in the case of a machine learning project, identify features that are potentially predictive of the label.
  4. Revise the hypothesis and repeat the process.

NumPy provides a lot of the functionality and tools you need to work with numbers, such as arrays of numeric values. However, when you start to deal with two-dimensional tables of data, the Pandas package offers a more convenient structure to work with: the DataFrame.
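
The DataFrame below references a student_data array that isn't defined in these notes; a minimal sketch that would make it runnable, with illustrative placeholder values standing in for the module's actual numbers:

  import numpy as np
  import pandas as pd

  # Illustrative placeholder values: one study-hours figure and one grade per
  # student, 22 of each to match the list of names below
  study_hours = [10.0, 11.5, 9.0, 16.0, 9.25, 1.0, 11.5, 9.0, 8.5, 14.5, 15.5,
                 13.75, 9.0, 8.0, 15.5, 8.0, 9.0, 6.0, 10.0, 12.0, 12.5, 12.0]
  grades = [50, 50, 47, 97, 49, 3, 53, 42, 26, 74, 82, 62, 37, 15, 70, 27, 36,
            35, 48, 52, 63, 64]

  # A two-row NumPy array: row 0 holds study hours, row 1 holds grades
  student_data = np.array([study_hours, grades])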

  df_students = pd.DataFrame({'Name': ['Dan', 'Joann', 'Pedro', 'Rosie', 'Ethan', 'Vicky', 'Frederic', 'Jimmie',
                                       'Rhonda', 'Giovanni', 'Francesca', 'Rajab', 'Naiyana', 'Kian', 'Jenny',
                                       'Jakeem', 'Helena', 'Ismat', 'Anila', 'Skye', 'Daniel', 'Aisha'],
                              'StudyHours': student_data[0],
                              'Grade': student_data[1]})

In many real-world scenarios, data is loaded from sources such as files. The Pandas read_csv function is used to load data from text files.

  !wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/grades.csv
  df_students = pd.read_csv('grades.csv', delimiter=',', header='infer')
  df_students.head()

One of the most common issues data scientists need to deal with is incomplete or missing data. You can use the isnull method to identify which individual values are null. When the DataFrame is displayed, missing numeric values show up as NaN (not a number).
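
A short sketch of how that check (and one common fix) might look; the mean-fill choice here is one option, not the module's only approach:

  # Count the missing values in each column
  df_students.isnull().sum()

  # Show only the rows that contain at least one null value
  df_students[df_students.isnull().any(axis=1)]

  # One option: replace missing StudyHours values with the column mean
  df_students.StudyHours = df_students.StudyHours.fillna(df_students.StudyHours.mean())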

We can add a pass/fail indicator by creating a Pandas Series containing True or False for each student, and then concatenating that series as a new column (axis 1) of the DataFrame:

  passes = pd.Series(df_students['Grade'] >= 60)
  df_students = pd.concat([df_students, passes.rename("Pass")], axis=1)

Visualize data

DataFrames provide a great way to explore and analyze tabular data, but sometimes a picture is worth a thousand rows and columns. The Matplotlib library provides the foundation for plotting data visualizations that can greatly enhance your ability to analyze the data.

  # Ensure plots are displayed inline in the notebook
  %matplotlib inline

  from matplotlib import pyplot as plt

  # Create a bar plot of name vs grade
  plt.bar(x=df_students.Name, height=df_students.Grade)

  # Display the plot
  plt.show()

Note that we used the pyplot module from Matplotlib to plot the chart. This module provides many ways to improve the visual elements of the plot. A plot is technically contained within a Figure. In the previous example, the figure was created implicitly for you, but you can create it explicitly with a specific size.

  # Create a figure for 2 subplots (1 row, 2 columns)
  fig, ax = plt.subplots(1, 2, figsize=(10,4))

  # Create a bar plot of name vs grade on the first axis
  ax[0].bar(x=df_students.Name, height=df_students.Grade, color='orange')
  ax[0].set_title('Grades')
  ax[0].set_xticklabels(df_students.Name, rotation=90)

  # Create a pie chart of pass counts on the second axis
  pass_counts = df_students['Pass'].value_counts()
  ax[1].pie(pass_counts, labels=pass_counts)
  ax[1].set_title('Passing Grades')
  ax[1].legend(pass_counts.keys().tolist())

  # Add a title to the Figure
  fig.suptitle('Student Data')

  # Show the figure
  fig.show()

When examining a variable (for example, a sample of student grades), data scientists are particularly interested in its distribution (in other words, how all the different grade values are spread across the sample). The starting point for this exploration is often to visualize the data as a histogram and see how frequently each value of the variable occurs.

  # Get the variable to examine
  var_data = df_students['Grade']

  # Create a Figure
  fig = plt.figure(figsize=(10,4))

  # Plot a histogram
  plt.hist(var_data)

  # Add titles and labels
  plt.title('Data Distribution')
  plt.xlabel('Value')
  plt.ylabel('Frequency')

  # Show the figure
  fig.show()

To understand the distribution better, we can examine measures of central tendency, statistics that describe the “middle” of the data. The goal of this analysis is to find a “typical” value. Common ways to define the middle of the data include:

  - The mean: a simple average, calculated by adding all the values and dividing by the number of values.
  - The median: the value in the middle of the range of all the values.
  - The mode: the most commonly occurring value.
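
A quick sketch of getting these with Pandas (assuming the df_students DataFrame from earlier):

  # Get the variable to examine
  col = df_students['Grade']

  # Calculate the measures of central tendency
  print('Mean:', col.mean())
  print('Median:', col.median())
  print('Mode:', col.mode()[0])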

Another way to visualize the distribution of a variable is to use a box plot (sometimes called a box-and-whiskers plot). The box part of the plot shows where the inner two quartiles of the data reside. The whiskers extending from the box show the outer two quartiles. The line in the box indicates the median value.

  # Get the variable to examine
  var = df_students['Grade']

  # Create a Figure
  fig = plt.figure(figsize=(10,4))

  # Plot a box plot
  plt.boxplot(var)

  # Add titles and labels
  plt.title('Data Distribution')

  # Show the figure
  fig.show()

If we have enough samples, we can calculate something called a probability density function, which estimates the distribution of grades for the full population. Pandas provides a helpful plot.density method (built on Matplotlib) to show this density. As expected from the histogram of the sample, the density shows the characteristic “bell curve” of what statisticians call a normal distribution, with the mean and mode at the center and symmetric tails.

  def show_density(var_data):
    from matplotlib import pyplot as plt

    fig = plt.figure(figsize=(10,4))

    # Plot density
    var_data.plot.density()

    # Add titles and labels
    plt.title('Data Density')

    # Show the mean, median, and mode
    plt.axvline(x=var_data.mean(), color = 'cyan', linestyle='dashed', linewidth = 2)
    plt.axvline(x=var_data.median(), color = 'red', linestyle='dashed', linewidth = 2)
    plt.axvline(x=var_data.mode()[0], color = 'yellow', linestyle='dashed', linewidth = 2)

    # Show the figure
    plt.show()

  # Get the density of Grade
  col = df_students['Grade']
  show_density(col)

Examine real-world data

Data presented in educational material is often remarkably perfect, designed to show students how to find clear relationships between variables. Real-world data is a bit less simple: it can contain many different issues that affect both its utility and our interpretation of the results.

Real-world data will always have issues, but data scientists can often overcome these issues by:

  - Checking for missing values and badly recorded data.
  - Considering removing obvious outliers.
  - Examining summary statistics and visualizations to judge how well the sample represents the real-world population.

When we have more data available, our sample becomes more reliable. This makes it easier to consider outliers as being values that fall below or above the percentiles within which most of the data lie. For example, the following code uses the Pandas quantile function to exclude observations below the 0.01 quantile (the 1st percentile, that is, the value above which 99% of the data reside).

  # Calculate the 1st percentile (0.01 quantile)
  q01 = df_students.StudyHours.quantile(0.01)

  # Get the variable to examine, keeping only values above that threshold
  col = df_students[df_students.StudyHours > q01]['StudyHours']

  # Call the function (show_distribution is assumed to be defined earlier in the
  # module's notebook, where it plots the distribution with summary statistics)
  show_distribution(col)

Typical statistics that measure variability in the data include:

  - Range: the difference between the maximum and minimum values.
  - Variance: the average of the squared differences between each value and the mean.
  - Standard deviation: the square root of the variance, expressed in the same units as the data itself.

Of these statistics, the standard deviation is generally the most useful. It provides a measure of spread on the same scale as the data itself (so grade points for the Grade distribution and hours for the StudyHours distribution). The higher the standard deviation, the more the values in the distribution vary from the distribution mean; in other words, the data is more spread out.
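
A quick sketch of calculating these with Pandas (again assuming the df_students DataFrame from earlier):

  # Compare the variability of the two numeric columns
  for col_name in ['Grade', 'StudyHours']:
      col = df_students[col_name]
      print(col_name)
      print(' - Range:', col.max() - col.min())
      print(' - Variance:', col.var())
      print(' - Std.Dev:', col.std())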

Because descriptive statistics are such an important part of exploring data, there’s a built-in describe method of the DataFrame object that returns the main descriptive statistics for all numeric columns.

  df_students.describe()

To see if there’s a correlation between study time and grade, we can use a statistical correlation measurement to quantify the relationship between two columns. The correlation statistic is a value between -1 and 1 that indicates the strength of a relationship. Values above 0 indicate a positive correlation (high values of one variable tend to coincide with high values of the other), while values below 0 indicate a negative correlation (high values of one variable tend to coincide with low values of the other).
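
The df_normalized DataFrame used below isn't created in these notes; in the module, it holds the data with Grade and StudyHours scaled to a common range so they can be compared fairly. A minimal Pandas-only min-max scaling sketch (an assumption about how it might be produced):

  # Min-max normalize the numeric columns so both fall on a 0-1 scale
  numeric = df_students[['Grade', 'StudyHours']]
  df_normalized = (numeric - numeric.min()) / (numeric.max() - numeric.min())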

  df_normalized.Grade.corr(df_normalized.StudyHours)

Another way to visualize the apparent correlation between two numeric columns is to use a scatter plot.

  # Create a scatter plot (df_sample is assumed to be the cleaned, outlier-trimmed
  # sample of df_students created earlier in the module's notebook)
  df_sample.plot.scatter(title='Study Time vs Grade', x='StudyHours', y='Grade')

Summary

Source: Microsoft Learn