Example usage
In this demonstration, we will show how to use eda_mds for conducting Exploratory Data Analysis (EDA).
Imagine we are beginning a new data science project.
As with any project, exploratory data analysis (EDA) is a crucial first step to understand the nature of the data you are working with. eda_mds helps with this by:
characterizing null values using info_na
highlighting outliers with describe_outliers
summarizing categorical variables with cat_var_stats
calculating variable correlations with cor_eda
We will walk through each of these steps using the titanic dataset from seaborn-data, which is a messy dataset containing information about passengers of the RMS Titanic, including whether they survived.
# import modules
import pandas as pd
import numpy as np
from eda_mds import info_na, describe_outliers, cat_var_stats, cor_eda
# import the titanic dataset
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv')
info_na()
In this section, we will explore info_na(), a function in eda_mds that extends the behaviour of pd.DataFrame.info().
We will begin the exploratory data analysis with both functions and compare their output, along with the steps needed to obtain the same information, to motivate the use of info_na().
Missing data points can significantly affect model performance, and many models fail outright when given missing values, so characterizing these values is essential to quantifying data quality.
This will inform strategies to remove, impute, or otherwise replace missing values.
In some cases, missing values are concentrated in specific rows or columns.
Let’s see how we can achieve this functionality using base pandas:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 survived 891 non-null int64
1 pclass 891 non-null int64
2 sex 891 non-null object
3 age 714 non-null float64
4 sibsp 891 non-null int64
5 parch 891 non-null int64
6 fare 891 non-null float64
7 embarked 889 non-null object
8 class 891 non-null object
9 who 891 non-null object
10 adult_male 891 non-null bool
11 deck 203 non-null object
12 embark_town 889 non-null object
13 alive 891 non-null object
14 alone 891 non-null bool
dtypes: bool(2), float64(2), int64(4), object(7)
memory usage: 92.4+ KB
pandas.DataFrame.info shows us how many values in a dataset are non-null by column, alongside the data types.
Here, we can see that some columns, particularly deck, are missing significant amounts of data.
While this may seem like enough information at first glance, there are more questions to ask:
What about rows of data?
How much data will be lost if we remove, say, all rows with null values?
Is missing data randomly dispersed or is it focused in some rows?
Let’s see if we can answer these questions:
n_rows_any_null = df.isna().any(axis=1).sum()
n_rows = df.shape[0]
print(f"{n_rows_any_null} rows with any null value. ({n_rows_any_null / n_rows * 100:.2f}%)")
709 rows with any null value. (79.57%)
If we remove all rows with null values, we will lose about 80% of our dataset!
Thankfully, we can see that most of the missing values are in the deck column.
Are there any rows that have more than one null value, or all null values?
n_rows_all_null = df.isna().all(axis=1).sum()
mean_null_rows = df.isna().sum(axis=1).mean().round(2)
max_null_rows = df.isna().sum(axis=1).max()
print(f"{n_rows_all_null} rows have all-null values")
print(f"{mean_null_rows:0.2f}: average null values per row")
print(f"{max_null_rows}: max number of null values in a row")
0 rows have all-null values
0.98: average null values per row
2: max number of null values in a row
It appears that deck is the primary contributor of null values.
In this case, the largest number of null values in any row is two, and on average each row is missing about one value.
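The same questions can be asked per column and per row in one pass. The sketch below uses a small made-up frame (the values are illustrative, not taken from the titanic data) to rank columns by their share of missing values and to tabulate how many rows are missing 0, 1, or 2 values. It only uses standard pandas calls.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for the real data (illustrative values only).
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, np.nan, np.nan, "C"],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# Percentage of missing values per column, largest first.
null_share = toy.isna().mean().sort_values(ascending=False) * 100

# Number of rows that are missing 0, 1, 2, ... values.
nulls_per_row = toy.isna().sum(axis=1).value_counts().sort_index()

print(null_share.round(2))
print(nulls_per_row)
```

A column that dominates the first table while most rows sit at a count of one in the second is the pattern seen with deck above.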
This exercise shows the extra steps needed to more fully characterize a dataset.
While this is only a few extra lines of code, it becomes tedious over time.
info_na simplifies this process:
info_na(df)
type: <class 'pandas.core.frame.DataFrame'>
shape: (891, 15)
memory usage: 398.1 KB
--------
columns:
# column null count null % dtype
0 survived 0 0.00 int64
1 pclass 0 0.00 int64
2 sex 0 0.00 object
3 age 177 19.87 float64
4 sibsp 0 0.00 int64
5 parch 0 0.00 int64
6 fare 0 0.00 float64
7 embarked 2 0.22 object
8 class 0 0.00 object
9 who 0 0.00 object
10 adult_male 0 0.00 bool
11 deck 688 77.22 object
12 embark_town 2 0.22 object
13 alive 0 0.00 object
14 alone 0 0.00 bool
-----
rows:
total rows 891.00
any null count 709.00
any null % 79.57
all null count 0.00
all null % 0.00
mean null count 0.98
std.dev null count 0.62
max null count 2.00
min null count 0.00
We can see that many of the values we computed before are provided, alongside the information given by pandas.DataFrame.info.
This summarizes the primary use case of info_na(): characterizing missing values in a dataset in more detail - an essential task in most data science projects.
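For readers curious how the row-level figures could be assembled, here is a minimal sketch in plain pandas. The helper name row_null_summary is hypothetical and this is not the eda_mds implementation; in particular, using the sample standard deviation (the pandas default) is an assumption.

```python
import numpy as np
import pandas as pd

def row_null_summary(df: pd.DataFrame) -> pd.Series:
    """Collect row-level missing-value statistics (hypothetical helper)."""
    nulls_per_row = df.isna().sum(axis=1)   # null count for every row
    n_rows = len(df)
    any_null = int((nulls_per_row > 0).sum())
    all_null = int((nulls_per_row == df.shape[1]).sum())
    return pd.Series({
        "total rows": n_rows,
        "any null count": any_null,
        "any null %": any_null / n_rows * 100,
        "all null count": all_null,
        "all null %": all_null / n_rows * 100,
        "mean null count": nulls_per_row.mean(),
        "std.dev null count": nulls_per_row.std(),  # sample std (ddof=1), an assumption
        "max null count": nulls_per_row.max(),
        "min null count": nulls_per_row.min(),
    })

# Made-up example: the rows are missing 0, 2 and 1 values respectively.
toy = pd.DataFrame({"a": [1.0, np.nan, np.nan], "b": ["x", np.nan, "z"]})
summary = row_null_summary(toy)
print(summary.round(2))
```

The sketch assumes a non-empty frame; with zero rows the percentage lines would divide by zero.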
describe_outliers()
Numerical Insights
We’ll use describe_outliers() to first observe the distribution of each numeric column in the titanic dataset. This can be done simply by passing in our dataframe, df, without any additional parameters.
describe_outliers(df)
| | survived | pclass | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|
| dtype | int64 | int64 | float64 | int64 | int64 | float64 |
| Non-null count | 891 | 891 | 714 | 891 | 891 | 891 |
| mean | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| standard deviation | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min value | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% percentile | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% (median) | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% percentile | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max value | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |
| lower-tail outliers | 0 | 0 | 0 | 0 | 0 | 0 |
| upper-tail outliers | 0 | 0 | 11 | 46 | 213 | 116 |
The output resembles the result of pandas.DataFrame.describe(). It additionally includes counts of lower-tail and upper-tail outliers, along with the data type of each column.
Looking at the float64 columns, we can see that age has some null values and 11 upper-tail outliers.
Together with the mean (29.7) sitting above the median (28.0), this suggests that the age distribution is right-skewed.
Similarly, fare is more heavily right-skewed, with even more upper-tail outliers.
These distributions could be explored further, including possible correlations.
Adjusting Outlier Detection
Adjusting the threshold argument tunes the sensitivity of outlier detection. A higher value (above the default of 1.5) decreases sensitivity. In the example below, the upper-tail outliers for age drop from 11 to 5 with an increased threshold.
Note that outlier detection uses the standard IQR rule: a value is a lower-tail outlier if it is below Q1 - threshold*IQR, and an upper-tail outlier if it is above Q3 + threshold*IQR, where IQR = Q3 - Q1.
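The rule can be sketched in a few lines of pandas. The helper name count_iqr_outliers and the sample values are made up for illustration; this is not the code inside describe_outliers, and the quantiles rely on pandas' default linear interpolation.

```python
import pandas as pd

def count_iqr_outliers(values: pd.Series, threshold: float = 1.5) -> tuple:
    """Return (lower_tail, upper_tail) outlier counts under the IQR rule."""
    values = values.dropna()                      # ignore missing values
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower = int((values < q1 - threshold * iqr).sum())
    upper = int((values > q3 + threshold * iqr).sum())
    return lower, upper

# Made-up ages with one large value and one missing value.
ages = pd.Series([20, 22, 24, 26, 28, 30, 32, 34, 36, 60, None])
print(count_iqr_outliers(ages))                 # default threshold of 1.5
print(count_iqr_outliers(ages, threshold=3.0))  # wider fences, fewer flags
```

Here Q1 = 24.5 and Q3 = 33.5, so the upper fence is 47 at the default threshold and 60.5 at a threshold of 3.0. The value 60 is flagged in the first call but not in the second.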
describe_outliers(df, threshold=1.8)
| | survived | pclass | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|
| dtype | int64 | int64 | float64 | int64 | int64 | float64 |
| Non-null count | 891 | 891 | 714 | 891 | 891 | 891 |
| mean | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| standard deviation | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min value | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% percentile | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% (median) | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% percentile | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max value | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |
| lower-tail outliers | 0 | 0 | 0 | 0 | 0 | 0 |
| upper-tail outliers | 0 | 0 | 5 | 46 | 213 | 102 |
Options for Categorical Columns
While these summary statistics are primarily relevant for numerical columns, non-numerical columns can be included in the output by setting the numeric argument to False.
describe_outliers(df, threshold=1.8, numeric=False)
| | adult_male | age | alive | alone | class | deck | embark_town | embarked | fare | parch | pclass | sex | sibsp | survived | who |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dtype | bool | float64 | object | bool | object | object | object | object | float64 | int64 | int64 | object | int64 | int64 | object |
| Non-null count | 891 | 714 | 891 | 891 | 891 | 203 | 889 | 889 | 891 | 891 | 891 | 891 | 891 | 891 | 891 |
| mean | NaN | 29.699118 | NaN | NaN | NaN | NaN | NaN | NaN | 32.204208 | 0.381594 | 2.308642 | NaN | 0.523008 | 0.383838 | NaN |
| standard deviation | NaN | 14.526497 | NaN | NaN | NaN | NaN | NaN | NaN | 49.693429 | 0.806057 | 0.836071 | NaN | 1.102743 | 0.486592 | NaN |
| min value | NaN | 0.42 | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 1.0 | NaN | 0.0 | 0.0 | NaN |
| 25% percentile | NaN | 20.125 | NaN | NaN | NaN | NaN | NaN | NaN | 7.9104 | 0.0 | 2.0 | NaN | 0.0 | 0.0 | NaN |
| 50% (median) | NaN | 28.0 | NaN | NaN | NaN | NaN | NaN | NaN | 14.4542 | 0.0 | 3.0 | NaN | 0.0 | 0.0 | NaN |
| 75% percentile | NaN | 38.0 | NaN | NaN | NaN | NaN | NaN | NaN | 31.0 | 0.0 | 3.0 | NaN | 1.0 | 1.0 | NaN |
| max value | NaN | 80.0 | NaN | NaN | NaN | NaN | NaN | NaN | 512.3292 | 6.0 | 3.0 | NaN | 8.0 | 1.0 | NaN |
| lower-tail outliers | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | NaN |
| upper-tail outliers | NaN | 5.0 | NaN | NaN | NaN | NaN | NaN | NaN | 102.0 | 213.0 | 0.0 | NaN | 46.0 | 0.0 | NaN |
This displays all columns in the dataset, sorted alphabetically by column name. Examining the dtypes of both numeric and categorical columns is essential to verify correct encoding in case modifications are necessary.
Regarding categorical columns, a couple of notable observations are: two columns are encoded as booleans, and the deck column predominantly consists of NaN values. Further exploration of categorical columns can be accomplished using the cat_var_stats() function.
cat_var_stats()
This section will go through how to best use the cat_var_stats function in the eda_mds package. This function is designed to take a pandas.DataFrame as its argument.
After importing the dataset, let’s run our cat_var_stats function:
cat_var_stats(df)
Column: sex
Number of unique values: 2
Frequency of values:
male: 64.76%
female: 35.24%
------------------------------------
Column: embarked
Number of unique values: 3
Frequency of values:
S: 72.28%
C: 18.86%
Q: 8.64%
nan: 0.22%
------------------------------------
Column: class
Number of unique values: 3
Frequency of values:
Third: 55.11%
First: 24.24%
Second: 20.65%
------------------------------------
Column: who
Number of unique values: 3
Frequency of values:
man: 60.27%
woman: 30.42%
child: 9.32%
------------------------------------
Column: adult_male
Number of unique values: 2
Frequency of values:
True: 60.27%
False: 39.73%
------------------------------------
Column: deck
Number of unique values: 7
Frequency of values:
nan: 77.22%
C: 6.62%
E: 3.59%
G: 0.45%
D: 3.70%
A: 1.68%
B: 5.27%
F: 1.46%
Binning recommendations:
G, A, F values can be binned into "other" category as they are lower than binning threshold
------------------------------------
Column: embark_town
Number of unique values: 3
Frequency of values:
Southampton: 72.28%
Cherbourg: 18.86%
Queenstown: 8.64%
nan: 0.22%
------------------------------------
Column: alive
Number of unique values: 2
Frequency of values:
no: 61.62%
yes: 38.38%
------------------------------------
Column: alone
Number of unique values: 2
Frequency of values:
False: 39.73%
True: 60.27%
------------------------------------
cat_var_stats iterates over each categorical column and prints a short summary. An example output for the sex column is shown below:
Column: sex
Number of unique values: 2
Frequency of values:
male: 64.76%
female: 35.24%
It outputs the column name, the number of unique values, and finally the percentage of rows taken by each unique value.
For columns with underrepresented values, it also gives binning suggestions according to a threshold. This suggestion can be seen for the deck column of the titanic dataset.
Column: deck
Number of unique values: 7
Frequency of values:
nan: 77.22%
C: 6.62%
E: 3.59%
G: 0.45%
D: 3.70%
A: 1.68%
B: 5.27%
F: 1.46%
Binning recommendations:
G, A, F values can be binned into "other" category as they are lower than binning threshold
This output was generated with the default binning threshold of 2%, but a user can set their own threshold with the binning_threshold argument.
cat_var_stats(df, binning_threshold=4) # Let's run the function again with a user-defined threshold of 4%
Column: sex
Number of unique values: 2
Frequency of values:
male: 64.76%
female: 35.24%
------------------------------------
Column: embarked
Number of unique values: 3
Frequency of values:
S: 72.28%
C: 18.86%
Q: 8.64%
nan: 0.22%
------------------------------------
Column: class
Number of unique values: 3
Frequency of values:
Third: 55.11%
First: 24.24%
Second: 20.65%
------------------------------------
Column: who
Number of unique values: 3
Frequency of values:
man: 60.27%
woman: 30.42%
child: 9.32%
------------------------------------
Column: adult_male
Number of unique values: 2
Frequency of values:
True: 60.27%
False: 39.73%
------------------------------------
Column: deck
Number of unique values: 7
Frequency of values:
nan: 77.22%
C: 6.62%
E: 3.59%
G: 0.45%
D: 3.70%
A: 1.68%
B: 5.27%
F: 1.46%
Binning recommendations:
E, G, D, A, F values can be binned into "other" category as they are lower than binning threshold
------------------------------------
Column: embark_town
Number of unique values: 3
Frequency of values:
Southampton: 72.28%
Cherbourg: 18.86%
Queenstown: 8.64%
nan: 0.22%
------------------------------------
Column: alive
Number of unique values: 2
Frequency of values:
no: 61.62%
yes: 38.38%
------------------------------------
Column: alone
Number of unique values: 2
Frequency of values:
False: 39.73%
True: 60.27%
------------------------------------
With the new threshold of 4%, the binning recommendation now also includes ‘E’ and ‘D’.
Binning recommendations:
E, G, D, A, F values can be binned into "other" category as they are lower than binning threshold
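To act on such a recommendation, the rare levels can be collapsed with plain pandas. The following is a minimal sketch: the helper name bin_rare_levels, its parameters, and the sample data are hypothetical and not part of eda_mds. It assumes an object (string) column and computes each level's share over all rows, missing values included, as in the output above.

```python
import numpy as np
import pandas as pd

def bin_rare_levels(values: pd.Series, threshold_pct: float = 2.0,
                    other: str = "other") -> pd.Series:
    """Replace levels whose share of all rows is below threshold_pct with `other`."""
    # Percentage of rows per level; NaN is counted so shares match the report.
    share = values.value_counts(normalize=True, dropna=False) * 100
    # Rare levels, leaving missing values untouched.
    rare = [level for level in share[share < threshold_pct].index if pd.notna(level)]
    return values.where(~values.isin(rare), other)

# Made-up deck column: C 30%, B 15%, G 5%, missing 50%.
deck = pd.Series(["C"] * 6 + ["B"] * 3 + ["G"] + [np.nan] * 10)
binned = bin_rare_levels(deck, threshold_pct=10)
print(binned.value_counts(dropna=False))
```

Missing values are deliberately left as they are, so that binning and imputation remain separate decisions.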
cor_eda()
Calling cor_eda returns a data frame structured as a correlation matrix: each cell holds the correlation coefficient between the numerical column in its row and the numerical column in its column. It quantifies the strength and direction of the relationships between the data frame’s numerical features.
cor_eda(df)
| | survived | pclass | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|
| survived | 1.000000 | -0.359653 | -0.077221 | -0.017358 | 0.093317 | 0.268189 |
| pclass | -0.359653 | 1.000000 | -0.369226 | 0.067247 | 0.025683 | -0.554182 |
| age | -0.077221 | -0.369226 | 1.000000 | -0.308247 | -0.189119 | 0.096067 |
| sibsp | -0.017358 | 0.067247 | -0.308247 | 1.000000 | 0.383820 | 0.138329 |
| parch | 0.093317 | 0.025683 | -0.189119 | 0.383820 | 1.000000 | 0.205119 |
| fare | 0.268189 | -0.554182 | 0.096067 | 0.138329 | 0.205119 | 1.000000 |
The next call does the same as the one above, but sets na_handling to "mean" so that NAs are replaced with the column mean instead of the affected rows being dropped.
cor_eda(df, na_handling="mean")
| | survived | pclass | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|
| survived | 1.000000 | -0.338481 | -0.069809 | -0.035322 | 0.081629 | 0.257307 |
| pclass | -0.338481 | 1.000000 | -0.331339 | 0.083081 | 0.018443 | -0.549500 |
| age | -0.069809 | -0.331339 | 1.000000 | -0.232625 | -0.179191 | 0.091566 |
| sibsp | -0.035322 | 0.083081 | -0.232625 | 1.000000 | 0.414838 | 0.159651 |
| parch | 0.081629 | 0.018443 | -0.179191 | 0.414838 | 1.000000 | 0.216225 |
| fare | 0.257307 | -0.549500 | 0.091566 | 0.159651 | 0.216225 | 1.000000 |
Notice that the values of the correlation function are slightly different when the NA handling method is changed. This indicates that our numerical data contained NA values, and the method we choose to handle them will affect the outcome of this function.
The next call sets na_handling to "median", replacing NAs with the column median instead of dropping them.
cor_eda(df, na_handling="median")
| | survived | pclass | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|
| survived | 1.000000 | -0.338481 | -0.064910 | -0.035322 | 0.081629 | 0.257307 |
| pclass | -0.338481 | 1.000000 | -0.339898 | 0.083081 | 0.018443 | -0.549500 |
| age | -0.064910 | -0.339898 | 1.000000 | -0.233296 | -0.172482 | 0.096688 |
| sibsp | -0.035322 | 0.083081 | -0.233296 | 1.000000 | 0.414838 | 0.159651 |
| parch | 0.081629 | 0.018443 | -0.172482 | 0.414838 | 1.000000 | 0.216225 |
| fare | 0.257307 | -0.549500 | 0.096688 | 0.159651 | 0.216225 | 1.000000 |
We can see that, compared to using the mean for NA handling, only the correlations involving age change, and only slightly. This is because age is the only numerical column with missing values, and its mean (29.70) and median (28.0) are similar.
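The three strategies can be compared side by side in plain pandas. This is a sketch on made-up data, not the cor_eda implementation; treating "drop" as removing every row that has any missing numeric value is an assumption about how the default behaves.

```python
import numpy as np
import pandas as pd

# Made-up numeric data in which only age has missing values.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, np.nan, 54.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05, 51.86],
    "sibsp": [1, 1, 0, 1, 0, 0],
})
numeric = toy.select_dtypes(include="number")

# Drop every row with at least one missing value, then correlate.
corr_drop = numeric.dropna().corr()
# Replace missing values with each column's mean or median, then correlate.
corr_mean = numeric.fillna(numeric.mean()).corr()
corr_median = numeric.fillna(numeric.median()).corr()

print(corr_drop.round(3))
print(corr_mean.round(3))
print(corr_median.round(3))
```

Dropping rows changes every pairwise correlation because whole rows disappear. Imputation changes only the pairs that involve the imputed column, so fare and sibsp correlate identically under the mean and median strategies.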