eda_mds.describe_outliers

Module Contents

Functions

describe_outliers(df[, threshold, numeric])

Enhance pandas.DataFrame.describe() with outlier counts for numeric columns.

eda_mds.describe_outliers.describe_outliers(df, threshold=1.5, numeric=True)

Enhance pandas.DataFrame.describe() with outlier counts for numeric columns.

This function extends the output of pandas.DataFrame.describe() with counts of lower-tail and upper-tail outliers for each numeric column in the DataFrame. Outliers are identified using the interquartile range (IQR) method, with a customizable threshold that controls how far beyond the quartiles a value must lie to count as an outlier.

Parameters:
  • df (pandas.DataFrame) – A DataFrame with at least one numeric column.

  • threshold (float, optional) – A non-negative scalar that multiplies the IQR to set the outlier cutoffs. A higher value moves the cutoffs further from the quartiles, so fewer values are counted as outliers. The default is 1.5.

  • numeric (bool, optional) – If True, only numeric columns are included in the output. If False, the output includes the dtype and count for non-numeric columns as well. The default is True.
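To illustrate how threshold affects sensitivity, here is a minimal sketch, not the library's implementation. The helper count_iqr_outliers is hypothetical, and the sketch assumes quartiles are computed with pandas.Series.quantile using its default linear interpolation.

```python
import pandas as pd


def count_iqr_outliers(values: pd.Series, threshold: float = 1.5) -> tuple[int, int]:
    """Hypothetical helper: return (lower, upper) outlier counts via the IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    # Strict comparisons: values exactly on a cutoff are not counted.
    lower = int((values < q1 - threshold * iqr).sum())
    upper = int((values > q3 + threshold * iqr).sum())
    return lower, upper


values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 20])

# Larger thresholds widen the cutoffs, so fewer values are counted.
counts = {t: count_iqr_outliers(values, t) for t in (0, 1.5, 3)}
print(counts)
```

With this sample, a threshold of 1.5 counts only the value 20 as an upper-tail outlier, and a threshold of 3 counts nothing.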

Returns:

A DataFrame summarizing the descriptive statistics and including outlier counts.

Return type:

pandas.DataFrame

Examples

>>> import pandas as pd
>>> from eda_mds.describe_outliers import describe_outliers
>>> data = {'numeric': [1, 2, 3, 4, 5, 100],
...         'categorical': ['a', 'b', 'c', 'd', 'e', 'f']}
>>> df = pd.DataFrame(data)
>>> describe_outliers(df, threshold=2, numeric=False)
# Returns a DataFrame with the descriptive statistics and outlier counts (output not shown).

Notes

Lower-tail outliers are values less than Q1 - (threshold * IQR). Upper-tail outliers are values greater than Q3 + (threshold * IQR). Here Q1 and Q3 are the first and third quartiles of the column, and IQR = Q3 - Q1.
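As a worked check of these formulas on the numeric column from the Examples section, the sketch below computes the cutoffs by hand. It assumes quartiles come from pandas.Series.quantile with its default linear interpolation, which may differ from what describe_outliers uses internally. The variable names are illustrative only.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 100])
threshold = 2

# Quartiles and interquartile range (assumed: default linear interpolation).
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Cutoffs from the formulas in the Notes.
lower_fence = q1 - threshold * iqr
upper_fence = q3 + threshold * iqr

# Count values strictly beyond each cutoff.
lower_count = int((values < lower_fence).sum())
upper_count = int((values > upper_fence).sum())

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"cutoffs=({lower_fence}, {upper_fence})")
print(f"lower={lower_count}, upper={upper_count}")
```

Under these assumptions Q1 = 2.25, Q3 = 4.75, and IQR = 2.5, so the cutoffs are -2.75 and 9.75, and only the value 100 is counted, as an upper-tail outlier.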