eda_mds.describe_outliers
Module Contents
Functions
Enhance pandas.DataFrame.describe() with outlier counts for numeric columns.
- eda_mds.describe_outliers.describe_outliers(df, threshold=1.5, numeric=True)
Enhance pandas.DataFrame.describe() with outlier counts for numeric columns.
This function extends the output of pandas.DataFrame.describe() with counts of lower-tail and upper-tail outliers for each numeric column in the DataFrame. Outliers are identified using the Interquartile Range (IQR) method, with a customizable threshold that controls how far from the quartiles a value must lie to count as an outlier.
- Parameters:
df (pandas.DataFrame) – A DataFrame with at least one numeric column.
threshold (float, optional) – A non-negative multiplier applied to the IQR when computing the outlier cutoffs. A higher value widens the cutoffs, so fewer values are counted as outliers. The default is 1.5.
numeric (bool, optional) – If True, only numeric columns are included in the output. If False, the output includes the dtype and count for non-numeric columns as well. The default is True.
- Returns:
A DataFrame summarizing the descriptive statistics and including outlier counts.
- Return type:
pandas.DataFrame
Examples
>>> import pandas as pd
>>> from eda_mds.describe_outliers import describe_outliers
>>> data = {'numeric': [1, 2, 3, 4, 5, 100], 'categorical': ['a', 'b', 'c', 'd', 'e', 'f']}
>>> df = pd.DataFrame(data)
>>> describe_outliers(df, threshold=2, numeric=False)
# Output will display the DataFrame with the descriptive statistics and outlier counts.
Notes
The IQR is the difference between the third and first quartiles, IQR = Q3 - Q1. Lower-tail outliers are values less than Q1 - (threshold * IQR). Upper-tail outliers are values greater than Q3 + (threshold * IQR).
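The rule in these notes can be illustrated with a minimal, standard-library sketch. The helper name count_iqr_outliers is hypothetical and is not part of eda_mds. The sketch also assumes quartiles are computed with linear interpolation (the "inclusive" method of statistics.quantiles, which should agree with the pandas default). The source does not state how describe_outliers computes its quartiles.

```python
from statistics import quantiles


def count_iqr_outliers(values, threshold=1.5):
    """Hypothetical helper: count lower- and upper-tail IQR outliers.

    Assumes linear-interpolation quartiles; the library's own quartile
    method is not documented here, so results may differ at the margins.
    """
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1

    # Cutoffs as described in the Notes.
    lower_cutoff = q1 - threshold * iqr
    upper_cutoff = q3 + threshold * iqr

    lower = sum(1 for v in values if v < lower_cutoff)
    upper = sum(1 for v in values if v > upper_cutoff)
    return lower, upper


# Same numeric column as in the Examples section.
values = [1, 2, 3, 4, 5, 100]
print(count_iqr_outliers(values, threshold=2))  # (0, 1)
```

With this data Q1 = 2.25 and Q3 = 4.75, so IQR = 2.5 and, at threshold=2, the cutoffs are -2.75 and 9.75. Only the value 100 falls outside them, as an upper-tail outlier.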