Example usage

In this demonstration, we will show how to use eda_mds for conducting Exploratory Data Analysis (EDA).

Imagine we are beginning a new data science project. As with any project, exploratory data analysis (EDA) is a crucial first step to understand the nature of the data we are working with. eda_mds helps with this by:

characterizing null values using info_na
highlighting outliers with describe_outliers
summarizing categorical variables with cat_var_stats
calculating variable correlations with cor_eda

We will walk through each of these steps using the titanic dataset from seaborn-data, a messy dataset containing information about passengers aboard the RMS Titanic, including whether they survived.

# import modules
import pandas as pd
import numpy as np

from eda_mds import info_na, describe_outliers, cat_var_stats, cor_eda

# import the titanic dataset
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv')

`info_na()`

In this section, we will explore info_na(), a function within eda_mds that expands the behaviour of pd.DataFrame.info(). To motivate its use, we will start the exploratory data analysis with base pandas, then compare the output and the steps needed to obtain the same information with info_na().

Missing data points can significantly affect model performance, and many models fail outright when given null values, so characterizing them is essential to quantifying data quality. This informs the strategy for removing, imputing, or otherwise replacing missing values. In some cases, the missing values are concentrated in specific rows or columns. Let’s see how far we can get using base pandas:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   survived     891 non-null    int64  
 1   pclass       891 non-null    int64  
 2   sex          891 non-null    object 
 3   age          714 non-null    float64
 4   sibsp        891 non-null    int64  
 5   parch        891 non-null    int64  
 6   fare         891 non-null    float64
 7   embarked     889 non-null    object 
 8   class        891 non-null    object 
 9   who          891 non-null    object 
 10  adult_male   891 non-null    bool   
 11  deck         203 non-null    object 
 12  embark_town  889 non-null    object 
 13  alive        891 non-null    object 
 14  alone        891 non-null    bool   
dtypes: bool(2), float64(2), int64(4), object(7)
memory usage: 92.4+ KB

pandas.DataFrame.info shows us how many values in a dataset are non-null by column, alongside the data types. Here, we can see that some columns, particularly deck, are missing significant amounts of data.

While this may seem like enough information at first glance, there are more questions to ask:

What about rows of data?
How much data will be lost if we remove, say, all rows with null values?
Is missing data randomly dispersed or is it focused in some rows?

Let’s see if we can answer these questions:

n_rows_any_null = df.isna().any(axis=1).sum()
n_rows = df.shape[0]
print(f"{n_rows_any_null} rows with any null value. ({n_rows_any_null / n_rows * 100:.2f}%)")

709 rows with any null value. (79.57%)

If we remove all rows with null values, we will lose about 80% of our dataset! Thankfully, we can see that most of these null values are in the deck column.

Are there any rows that have more than one null value, or all null values?

n_rows_all_null = df.isna().all(axis=1).sum()
mean_null_rows = df.isna().sum(axis=1).mean().round(2)
max_null_rows = df.isna().sum(axis=1).max()

print(f"{n_rows_all_null} rows have all-null values")
print(f"{mean_null_rows:0.2f}: average null values per row")
print(f"{max_null_rows}: max number of null values in a row")

0 rows have all-null values
0.98: average null values per row
2: max number of null values in a row

It appears that deck is the primary contributor to null values.

In this case, we can see that no row has more than two null values and that, on average, each row is missing about one value.

This exercise shows the extra steps needed to more fully characterize a dataset. While this is only a few extra lines of code, it becomes tedious over time. info_na simplifies this process:

info_na(df)

type: <class 'pandas.core.frame.DataFrame'>
shape: (891, 15)
memory usage: 398.1 KB
--------
columns:
 #      column  null count  null %   dtype
 0    survived           0    0.00   int64
 1      pclass           0    0.00   int64
 2         sex           0    0.00  object
 3         age         177   19.87 float64
 4       sibsp           0    0.00   int64
 5       parch           0    0.00   int64
 6        fare           0    0.00 float64
 7    embarked           2    0.22  object
 8       class           0    0.00  object
 9         who           0    0.00  object
10  adult_male           0    0.00    bool
11        deck         688   77.22  object
12 embark_town           2    0.22  object
13       alive           0    0.00  object
14       alone           0    0.00    bool
-----
rows:
total rows            891.00
any null count        709.00
any null %             79.57
all null count          0.00
all null %              0.00
mean null count         0.98
std.dev null count      0.62
max null count          2.00
min null count          0.00

We can see that many of the values we computed before are provided, alongside the information given by pandas.DataFrame.info.

This summarizes the primary use case of info_na(): characterizing missing values in a dataset in more detail - an essential task in most data science projects.
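For readers who want to see how the row-level figures above can be computed by hand, here is a minimal sketch in plain pandas. The helper name row_null_summary and the toy DataFrame are hypothetical and only for illustration; this is not the implementation of info_na, just the same kind of row statistics gathered in one place.

```python
import numpy as np
import pandas as pd


def row_null_summary(df: pd.DataFrame) -> pd.Series:
    """Summarize missing values per row (hypothetical helper, not part of eda_mds)."""
    nulls_per_row = df.isna().sum(axis=1)  # number of nulls in each row
    n_rows = len(df)
    any_null = int((nulls_per_row > 0).sum())  # rows with at least one null
    all_null = int((nulls_per_row == df.shape[1]).sum())  # rows that are entirely null
    return pd.Series(
        {
            "total rows": n_rows,
            "any null count": any_null,
            "any null %": round(any_null / n_rows * 100, 2),
            "all null count": all_null,
            "mean null count": round(float(nulls_per_row.mean()), 2),
            "max null count": int(nulls_per_row.max()),
        }
    )


# Toy data: row 1 is entirely null, row 3 has a single null
toy = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, np.nan],
        "b": ["x", None, "z", "w"],
        "c": [1.0, np.nan, 3.0, 4.0],
    }
)
summary = row_null_summary(toy)
print(summary)
```

The sketch assumes a non-empty DataFrame; an empty one would need separate handling.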

`describe_outliers()`

Numerical Insights

We’ll use describe_outliers() to first observe the distribution of each numeric column in the titanic dataset. This can be done by passing in our dataframe, df, without any additional parameters.

describe_outliers(df)

	survived	pclass	age	sibsp	parch	fare
dtype	int64	int64	float64	int64	int64	float64
Non-null count	891	891	714	891	891	891
mean	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
standard deviation	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min value	0.0	1.0	0.42	0.0	0.0	0.0
25% percentile	0.0	2.0	20.125	0.0	0.0	7.9104
50% (median)	0.0	3.0	28.0	0.0	0.0	14.4542
75% percentile	1.0	3.0	38.0	1.0	0.0	31.0
max value	1.0	3.0	80.0	8.0	6.0	512.3292
lower-tail outliers	0	0	0	0	0	0
upper-tail outliers	0	0	11	46	213	116

The output resembles the result of pandas.DataFrame.describe(). It additionally includes counts of lower-tail and upper-tail outliers, along with data types for each column.

Looking at the float64 columns, we can see that age has some null values and 11 upper-tail outliers. Together with a mean (29.70) above the median (28.0), this suggests a right-skewed distribution. fare is even more heavily right-skewed: its mean (32.20) is more than twice its median (14.45), and it has 116 upper-tail outliers. These distributions could be explored further, including possible correlations.
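The link between a mean above the median and right skew can be checked numerically. The sketch below uses a made-up series (not the titanic data) and the standard pandas Series.skew() method; a positive value indicates a longer right tail.

```python
import pandas as pd

# Toy series with one large value pulling the mean to the right
values = pd.Series([1, 2, 3, 4, 100])

mean_above_median = values.mean() > values.median()
skewness = values.skew()  # sample skewness; positive means right-skewed

print(f"mean={values.mean():.1f}, median={values.median():.1f}, skew={skewness:.2f}")
```

The same two checks could be applied to df["age"] and df["fare"] to compare how strongly each column is skewed.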

Adjusting Outlier Detection

Adjusting the threshold argument tunes the sensitivity of outlier detection. A higher value (above the default of 1.5) decreases sensitivity. In the example below, the upper-tail outliers for age drop from 11 to 5 with an increased threshold.

Note that outlier detection uses the standard interquartile range (IQR) rule, with IQR = Q3 - Q1: a value is a lower-tail outlier if it is below Q1 - threshold * IQR, and an upper-tail outlier if it is above Q3 + threshold * IQR.

describe_outliers(df, threshold=1.8)

	survived	pclass	age	sibsp	parch	fare
dtype	int64	int64	float64	int64	int64	float64
Non-null count	891	891	714	891	891	891
mean	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
standard deviation	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min value	0.0	1.0	0.42	0.0	0.0	0.0
25% percentile	0.0	2.0	20.125	0.0	0.0	7.9104
50% (median)	0.0	3.0	28.0	0.0	0.0	14.4542
75% percentile	1.0	3.0	38.0	1.0	0.0	31.0
max value	1.0	3.0	80.0	8.0	6.0	512.3292
lower-tail outliers	0	0	0	0	0	0
upper-tail outliers	0	0	5	46	213	102
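The IQR rule described above can be sketched in a few lines of pandas. The function name count_iqr_outliers and the toy series are hypothetical, and the sketch assumes pandas' default linear interpolation for quantiles; describe_outliers may compute quartiles differently, so counts are not guaranteed to match it exactly.

```python
import pandas as pd


def count_iqr_outliers(values: pd.Series, threshold: float = 1.5) -> tuple[int, int]:
    """Count (lower-tail, upper-tail) outliers using the IQR rule (illustrative only)."""
    clean = values.dropna()  # ignore missing values when computing quartiles
    q1, q3 = clean.quantile(0.25), clean.quantile(0.75)
    iqr = q3 - q1
    lower = int((clean < q1 - threshold * iqr).sum())
    upper = int((clean > q3 + threshold * iqr).sum())
    return lower, upper


toy = pd.Series([1, 2, 3, 4, 5, 10])

default_counts = count_iqr_outliers(toy)                 # upper fence is 8.5, so 10 is flagged
relaxed_counts = count_iqr_outliers(toy, threshold=3.0)  # upper fence is 12.25, so nothing is flagged
print(default_counts, relaxed_counts)
```

Raising the threshold widens both fences, which is why fewer points are reported as outliers.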

Options for Categorical Columns

While these summary statistics are primarily meaningful for numerical columns, non-numerical columns can be included in the output by setting the numeric argument to False.

describe_outliers(df, threshold=1.8, numeric=False)

	adult_male	age	alive	alone	class	deck	embark_town	embarked	fare	parch	pclass	sex	sibsp	survived	who
dtype	bool	float64	object	bool	object	object	object	object	float64	int64	int64	object	int64	int64	object
Non-null count	891	714	891	891	891	203	889	889	891	891	891	891	891	891	891
mean	NaN	29.699118	NaN	NaN	NaN	NaN	NaN	NaN	32.204208	0.381594	2.308642	NaN	0.523008	0.383838	NaN
standard deviation	NaN	14.526497	NaN	NaN	NaN	NaN	NaN	NaN	49.693429	0.806057	0.836071	NaN	1.102743	0.486592	NaN
min value	NaN	0.42	NaN	NaN	NaN	NaN	NaN	NaN	0.0	0.0	1.0	NaN	0.0	0.0	NaN
25% percentile	NaN	20.125	NaN	NaN	NaN	NaN	NaN	NaN	7.9104	0.0	2.0	NaN	0.0	0.0	NaN
50% (median)	NaN	28.0	NaN	NaN	NaN	NaN	NaN	NaN	14.4542	0.0	3.0	NaN	0.0	0.0	NaN
75% percentile	NaN	38.0	NaN	NaN	NaN	NaN	NaN	NaN	31.0	0.0	3.0	NaN	1.0	1.0	NaN
max value	NaN	80.0	NaN	NaN	NaN	NaN	NaN	NaN	512.3292	6.0	3.0	NaN	8.0	1.0	NaN
lower-tail outliers	NaN	0.0	NaN	NaN	NaN	NaN	NaN	NaN	0.0	0.0	0.0	NaN	0.0	0.0	NaN
upper-tail outliers	NaN	5.0	NaN	NaN	NaN	NaN	NaN	NaN	102.0	213.0	0.0	NaN	46.0	0.0	NaN

This displays all columns in the dataset, sorted alphabetically by column name. Examining the dtypes of both numeric and categorical columns is essential to verify correct encoding in case modifications are necessary.

Regarding categorical columns, a couple of notable observations are: two columns are encoded as booleans, and the deck column predominantly consists of NaN values. Further exploration of categorical columns can be accomplished using the cat_var_stats() function.

`cat_var_stats()`

This section shows how to use the cat_var_stats function from the eda_mds package. The function takes a pandas.DataFrame as its argument.

With the dataset already imported, let’s run cat_var_stats:

cat_var_stats(df)

Column: sex
Number of unique values: 2
Frequency of values:
male: 64.76%
female: 35.24%
------------------------------------


Column: embarked
Number of unique values: 3
Frequency of values:
S: 72.28%
C: 18.86%
Q: 8.64%
nan: 0.22%
------------------------------------


Column: class
Number of unique values: 3
Frequency of values:
Third: 55.11%
First: 24.24%
Second: 20.65%
------------------------------------


Column: who
Number of unique values: 3
Frequency of values:
man: 60.27%
woman: 30.42%
child: 9.32%
------------------------------------


Column: adult_male
Number of unique values: 2
Frequency of values:
True: 60.27%
False: 39.73%
------------------------------------


Column: deck
Number of unique values: 7
Frequency of values:
nan: 77.22%
C: 6.62%
E: 3.59%
G: 0.45%
D: 3.70%
A: 1.68%
B: 5.27%
F: 1.46%
Binning recommendations:
G, A, F values can be binned into "other" category as they are lower than binning threshold
------------------------------------


Column: embark_town
Number of unique values: 3
Frequency of values:
Southampton: 72.28%
Cherbourg: 18.86%
Queenstown: 8.64%
nan: 0.22%
------------------------------------


Column: alive
Number of unique values: 2
Frequency of values:
no: 61.62%
yes: 38.38%
------------------------------------


Column: alone
Number of unique values: 2
Frequency of values:
False: 39.73%
True: 60.27%
------------------------------------

cat_var_stats iterates over each categorical column and prints a short summary of it. An example output for the column sex can be seen below:

Column: sex
Number of unique values: 2
Frequency of values:
male: 64.76%
female: 35.24%

It outputs the column name, the number of unique values and, finally, the percentage of rows taken up by each unique value.

For columns with underrepresented values, it also gives binning suggestions according to a threshold. This suggestion can be seen for the deck column of the titanic dataset.

Column: deck
Number of unique values: 7
Frequency of values:
nan: 77.22%
C: 6.62%
E: 3.59%
G: 0.45%
D: 3.70%
A: 1.68%
B: 5.27%
F: 1.46%
Binning recommendations:
G, A, F values can be binned into "other" category as they are lower than binning threshold

This output was generated using the default binning threshold of 2%, but users can define their own threshold with the binning_threshold argument.

cat_var_stats(df, binning_threshold=4)  # Let's run the function again with a user defined threshold

Column: sex
Number of unique values: 2
Frequency of values:
male: 64.76%
female: 35.24%
------------------------------------


Column: embarked
Number of unique values: 3
Frequency of values:
S: 72.28%
C: 18.86%
Q: 8.64%
nan: 0.22%
------------------------------------


Column: class
Number of unique values: 3
Frequency of values:
Third: 55.11%
First: 24.24%
Second: 20.65%
------------------------------------


Column: who
Number of unique values: 3
Frequency of values:
man: 60.27%
woman: 30.42%
child: 9.32%
------------------------------------


Column: adult_male
Number of unique values: 2
Frequency of values:
True: 60.27%
False: 39.73%
------------------------------------


Column: deck
Number of unique values: 7
Frequency of values:
nan: 77.22%
C: 6.62%
E: 3.59%
G: 0.45%
D: 3.70%
A: 1.68%
B: 5.27%
F: 1.46%
Binning recommendations:
E, G, D, A, F values can be binned into "other" category as they are lower than binning threshold
------------------------------------


Column: embark_town
Number of unique values: 3
Frequency of values:
Southampton: 72.28%
Cherbourg: 18.86%
Queenstown: 8.64%
nan: 0.22%
------------------------------------


Column: alive
Number of unique values: 2
Frequency of values:
no: 61.62%
yes: 38.38%
------------------------------------


Column: alone
Number of unique values: 2
Frequency of values:
False: 39.73%
True: 60.27%
------------------------------------

With the new threshold of 4%, the binning recommendation now also includes E and D.

Binning recommendations:
E, G, D, A, F values can be binned into "other" category as they are lower than binning threshold
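The binning recommendation can be approximated with value_counts. The sketch below is an assumption about the logic, not the cat_var_stats source: the helper name rare_categories is hypothetical, frequencies are computed with missing values included in the denominator, and NaN itself is never recommended for binning (which matches the outputs shown above, where nan appears in the frequencies but not in the recommendations).

```python
import pandas as pd


def rare_categories(column: pd.Series, binning_threshold: float = 2.0) -> list:
    """Return values whose frequency (in %) is below the threshold (illustrative only)."""
    # Percent frequency of each value, counting NaN as its own entry
    freq = column.value_counts(normalize=True, dropna=False) * 100
    rare = freq[freq < binning_threshold]
    # NaN is reported in the frequencies but not suggested for binning
    return [value for value in rare.index if pd.notna(value)]


# Toy column: a=50%, b=45%, c=3%, d=1%, missing=1%
toy = pd.Series(["a"] * 50 + ["b"] * 45 + ["c"] * 3 + ["d"] * 1 + [None] * 1)

default_rare = rare_categories(toy)                        # only "d" is below 2%
wider_rare = rare_categories(toy, binning_threshold=4.0)   # "c" and "d" are below 4%
print(default_rare, wider_rare)
```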

`cor_eda()`

Calling cor_eda returns a data frame structured as a correlation matrix. Each cell holds the correlation coefficient between the numerical columns named by its row and column, expressing the strength and direction of the relationship between those two features.

cor_eda(df)

	survived	pclass	age	sibsp	parch	fare
survived	1.000000	-0.359653	-0.077221	-0.017358	0.093317	0.268189
pclass	-0.359653	1.000000	-0.369226	0.067247	0.025683	-0.554182
age	-0.077221	-0.369226	1.000000	-0.308247	-0.189119	0.096067
sibsp	-0.017358	0.067247	-0.308247	1.000000	0.383820	0.138329
parch	0.093317	0.025683	-0.189119	0.383820	1.000000	0.205119
fare	0.268189	-0.554182	0.096067	0.138329	0.205119	1.000000

The next call computes the same matrix but changes how NAs are handled: instead of dropping them (the default), it replaces each NA with the mean of its column.

cor_eda(df, na_handling="mean")

	survived	pclass	age	sibsp	parch	fare
survived	1.000000	-0.338481	-0.069809	-0.035322	0.081629	0.257307
pclass	-0.338481	1.000000	-0.331339	0.083081	0.018443	-0.549500
age	-0.069809	-0.331339	1.000000	-0.232625	-0.179191	0.091566
sibsp	-0.035322	0.083081	-0.232625	1.000000	0.414838	0.159651
parch	0.081629	0.018443	-0.179191	0.414838	1.000000	0.216225
fare	0.257307	-0.549500	0.091566	0.159651	0.216225	1.000000

Notice that the values of the correlation function are slightly different when the NA handling method is changed. This indicates that our numerical data contained NA values, and the method we choose to handle them will affect the outcome of this function.

Alternatively, NAs can be replaced with the median of their column, again instead of dropping them.

cor_eda(df, na_handling="median")

	survived	pclass	age	sibsp	parch	fare
survived	1.000000	-0.338481	-0.064910	-0.035322	0.081629	0.257307
pclass	-0.338481	1.000000	-0.339898	0.083081	0.018443	-0.549500
age	-0.064910	-0.339898	1.000000	-0.233296	-0.172482	0.096688
sibsp	-0.035322	0.083081	-0.233296	1.000000	0.414838	0.159651
parch	0.081629	0.018443	-0.172482	0.414838	1.000000	0.216225
fare	0.257307	-0.549500	0.096688	0.159651	0.216225	1.000000

We can see that, compared to using the mean for NA handling, some values change slightly, while others remain the same. Among the numerical columns, only age contains missing values, so only the correlations involving age depend on the fill value; all other entries are identical.
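The three NA-handling options can be imitated with plain pandas, which makes it easier to see which entries of the matrix are affected. This is a sketch under assumptions: the function name corr_with_na_handling is hypothetical, "drop" is assumed to remove every row containing an NA before computing correlations, and the toy data is made up. It is not the cor_eda implementation.

```python
import numpy as np
import pandas as pd


def corr_with_na_handling(df: pd.DataFrame, na_handling: str = "drop") -> pd.DataFrame:
    """Correlation matrix of numeric columns with a chosen NA strategy (illustrative only)."""
    numeric = df.select_dtypes(include="number")  # bool and object columns are excluded
    if na_handling == "drop":
        prepared = numeric.dropna()  # assumption: drop whole rows with any NA
    elif na_handling == "mean":
        prepared = numeric.fillna(numeric.mean())
    elif na_handling == "median":
        prepared = numeric.fillna(numeric.median())
    else:
        raise ValueError(f"unknown na_handling: {na_handling!r}")
    return prepared.corr()


# Toy data: only column x has a missing value
toy = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 10.0, np.nan],
        "y": [2.0, 4.0, 6.0, 8.0, 10.0],
        "z": [5.0, 3.0, 8.0, 1.0, 7.0],
    }
)

mean_corr = corr_with_na_handling(toy, na_handling="mean")
median_corr = corr_with_na_handling(toy, na_handling="median")

# Entries involving x differ between the two fills; the y-z entry does not
print(mean_corr.loc["x", "y"], median_corr.loc["x", "y"])
print(mean_corr.loc["y", "z"], median_corr.loc["y", "z"])
```

In the toy data only x has a missing value, so only the correlations involving x change between the mean and median fills.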
