eda_mds

Package Contents

Functions

info_na(df)

Extend pandas.DataFrame.info() with row-level null value statistics.

cat_var_stats(df[, binning_threshold])

Generate summary statistics for categorical variables in a DataFrame.

cor_eda(dataset[, na_handling])

Calculate the correlation between numerical variables in a DataFrame.

describe_outliers(df[, threshold, numeric])

Enhance pandas.DataFrame.describe() with outlier counts for numeric columns.

Attributes

__version__

eda_mds.__version__
eda_mds.info_na(df)

Extend pandas.DataFrame.info() with row-level null value statistics.

This function enhances the output of DataFrame.info() with a summary of null values at the row level. It prints the type, shape, memory usage, and column information, together with the count and percentage of null values in rows, to characterize the DataFrame’s structure.

Parameters:

df (pandas.DataFrame) – The DataFrame to be analyzed for null value statistics.

Returns:

The function prints detailed descriptive information to the console and returns None.

Return type:

None

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from eda_mds import info_na
>>> df_example = pd.DataFrame(
...     [
...         [np.nan, 13, "hello"],
...         [np.nan, np.nan, "this"],
...         [37, 45, "is"],
...         [256, 31, ""],
...         [1, np.nan, "test"],
...     ],
...     index=["First", "Second", "Third", "Fourth", "Fifth"],
...     columns=["Column1", "ColumnNumber2", "Column3"],
... )
>>> info_na(df_example)
# Expected output format:
type: <class 'pandas.core.frame.DataFrame'>
shape: (5, 3)
memory usage: 692 B
...
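
The row-level statistics described above can be approximated with plain pandas. This is a minimal sketch, not the eda_mds implementation: the helper name `row_null_summary` and the keys it returns are hypothetical, and it assumes that "null values in rows" means rows containing at least one null value.

```python
import numpy as np
import pandas as pd


def row_null_summary(df: pd.DataFrame) -> dict:
    """Hypothetical helper: count rows that contain at least one null value."""
    # Boolean Series that is True for each row with one or more nulls.
    rows_with_na = df.isna().any(axis=1)
    n_rows = len(df)
    n_na_rows = int(rows_with_na.sum())
    # Guard against division by zero for an empty DataFrame.
    pct = 100.0 * n_na_rows / n_rows if n_rows else 0.0
    return {"rows": n_rows, "rows_with_na": n_na_rows, "pct_rows_with_na": pct}


df_example = pd.DataFrame(
    [
        [np.nan, 13, "hello"],
        [np.nan, np.nan, "this"],
        [37, 45, "is"],
        [256, 31, ""],
        [1, np.nan, "test"],
    ],
    index=["First", "Second", "Third", "Fourth", "Fifth"],
    columns=["Column1", "ColumnNumber2", "Column3"],
)
summary = row_null_summary(df_example)
```

Note that the empty string in the "Fourth" row is not a null value for pandas, so only three of the five rows are counted here.
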
eda_mds.cat_var_stats(df, binning_threshold=2)

Generate summary statistics for categorical variables in a DataFrame.

This function analyzes the categorical columns in the provided DataFrame. For each column it prints the number of unique values and the frequency of each value, and it recommends binning low-frequency categories based on a specified threshold.

Parameters:
  • df (pandas.DataFrame) – The DataFrame for which categorical variable stats are calculated.

  • binning_threshold (int, optional) – The percentage frequency threshold below which categories will be recommended for binning. Default is 2.

Returns:

The function prints the statistics and returns None.

Return type:

None

Examples

>>> import pandas as pd
>>> from eda_mds import cat_var_stats
>>> df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv')
>>> cat_var_stats(df)
Column: sex
Number of unique values: 2
Frequency of values:
male: 64.76%
female: 35.24%
------------------------------------
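
The binning recommendation can be illustrated with a short pandas sketch. It is not the library's code: the helper `low_frequency_categories` is a hypothetical name, and the sketch assumes that a category is flagged when its share of the non-null values, expressed in percent, is strictly below `binning_threshold`.

```python
import pandas as pd


def low_frequency_categories(series: pd.Series, binning_threshold: float = 2) -> list:
    """Hypothetical helper: categories whose percentage share is below the threshold."""
    # value_counts(normalize=True) returns fractions that sum to 1; scale to percent.
    pct = series.value_counts(normalize=True) * 100
    return sorted(pct[pct < binning_threshold].index)


# Toy data: 60% "red", 39% "blue", 1% "teal".
colors = pd.Series(["red"] * 60 + ["blue"] * 39 + ["teal"])
rare = low_frequency_categories(colors)
```

With the default threshold of 2, only `teal` falls below the cutoff in this toy series; raising the threshold flags more categories.
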
eda_mds.cor_eda(dataset, na_handling='drop')

Calculate the correlation between numerical variables in a DataFrame.

This function processes a given DataFrame to isolate numerical variables, handles missing values according to the specified method, calculates the correlation between each pair of numerical variables, and returns the results in a new DataFrame.

Parameters:
  • dataset (pandas.DataFrame) – The DataFrame to be analyzed. It may contain columns of other types; only the numerical variables are used.

  • na_handling (str, optional) – Method for handling missing values (NAs). The following options are available:

      - 'drop': Drop rows with any NAs (default).
      - 'mean': Replace NAs with the mean value of the column.
      - 'median': Replace NAs with the median value of the column.

Returns:

A DataFrame containing the correlation coefficients between each pair of numerical variables.

Return type:

pandas.DataFrame

Examples

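>>> from eda_mds import cor_eda
>>> # `data` is assumed to be an existing DataFrame with numeric columns `age` and `salary`.
>>> cor_eda(data, na_handling='mean')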
         age    salary
age     1.0000   0.9769
salary  0.9769   1.0000
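
The three documented `na_handling` options map naturally onto standard pandas calls. The sketch below is a hypothetical stand-in named `correlate_numeric`, not the cor_eda source. It assumes Pearson correlation (the pandas default), because the documentation does not state which coefficient is used, and the error raised for an unknown option is likewise an assumption.

```python
import numpy as np
import pandas as pd


def correlate_numeric(dataset: pd.DataFrame, na_handling: str = "drop") -> pd.DataFrame:
    """Hypothetical stand-in illustrating the three documented NA options."""
    # Keep only the numerical columns.
    numeric = dataset.select_dtypes(include="number")
    if na_handling == "drop":
        numeric = numeric.dropna()
    elif na_handling == "mean":
        numeric = numeric.fillna(numeric.mean())
    elif na_handling == "median":
        numeric = numeric.fillna(numeric.median())
    else:
        # Assumption: reject unknown options; the real behavior is not documented.
        raise ValueError(f"unknown na_handling: {na_handling!r}")
    # DataFrame.corr() computes pairwise Pearson correlation by default.
    return numeric.corr()


toy = pd.DataFrame(
    {
        "x": [1.0, 2.0, np.nan, 4.0],
        "y": [2.0, 4.0, 6.0, 8.0],
        "label": ["a", "b", "c", "d"],
    }
)
result = correlate_numeric(toy, na_handling="drop")
```

In this toy frame the `label` column is excluded, and after the row with the missing `x` is dropped the remaining points lie on a straight line, so the off-diagonal coefficient is 1 up to floating-point error.
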
eda_mds.describe_outliers(df, threshold=1.5, numeric=True)

Enhance pandas.DataFrame.describe() with outlier counts for numeric columns.

This function extends the output of pandas.DataFrame.describe() by counting and including lower-tail and upper-tail outliers for each numeric column in the DataFrame. The outlier count is determined using the Interquartile Range (IQR) method, with a customizable threshold for defining what constitutes an outlier.

Parameters:
  • df (pandas.DataFrame) – A DataFrame with at least one numeric column.

  • threshold (float, optional) – A non-negative scalar that adjusts the sensitivity of outlier detection. A higher value decreases the sensitivity. The default is 1.5.

  • numeric (bool, optional) – If True, only numeric columns are included in the output. If False, the output includes the dtype and count for non-numeric columns as well. The default is True.

Returns:

A DataFrame summarizing the descriptive statistics and including outlier counts.

Return type:

pandas.DataFrame

Examples

>>> import pandas as pd
>>> from eda_mds import describe_outliers
>>> data = {'numeric': [1, 2, 3, 4, 5, 100],
...         'categorical': ['a', 'b', 'c', 'd', 'e', 'f']}
>>> df = pd.DataFrame(data)
>>> describe_outliers(df, threshold=2, numeric=False)
# Output will display the DataFrame with the descriptive statistics and outlier counts.

Notes

Lower-tail outliers are values less than Q1 - (threshold * IQR). Upper-tail outliers are values greater than Q3 + (threshold * IQR). Here Q1 and Q3 are the first and third quartiles of the column, and IQR = Q3 - Q1.
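
The IQR rule from the note can be sketched in a few lines of pandas. The helper `count_iqr_outliers` is a hypothetical name used only for illustration, and the sketch assumes quartiles computed with pandas' default linear interpolation, which may differ from what describe_outliers does internally.

```python
import pandas as pd


def count_iqr_outliers(values: pd.Series, threshold: float = 1.5) -> tuple:
    """Hypothetical helper: (lower-tail count, upper-tail count) via the IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    # Values strictly outside the fences count as outliers.
    lower = int((values < q1 - threshold * iqr).sum())
    upper = int((values > q3 + threshold * iqr).sum())
    return lower, upper


numeric = pd.Series([1, 2, 3, 4, 5, 100])
counts = count_iqr_outliers(numeric, threshold=2)
```

For this series Q1 = 2.25 and Q3 = 4.75, so IQR = 2.5 and, with a threshold of 2, the fences are -2.75 and 9.75. Only the value 100 lies outside them, giving zero lower-tail and one upper-tail outlier.
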