eda_mds
Submodules
Package Contents
Functions
info_na(df) | Extend pandas.DataFrame.info() with row-level null value statistics.
cat_var_stats(df, binning_threshold=2) | Generate summary statistics for categorical variables in a DataFrame.
describe_outliers(df, threshold=1.5, numeric=True) | Enhance pandas.DataFrame.describe() with outlier counts for numeric columns.
Attributes
- eda_mds.__version__
- eda_mds.info_na(df)
Extend pandas.DataFrame.info() with row-level null value statistics.
This function enhances the DataFrame.info() method by adding a summary of null values at the row level. It prints type, shape, memory usage, and column information, along with new statistics such as the count and percentage of null values in rows, providing a comprehensive characterization of the DataFrame’s structure.
- Parameters:
df (pandas.DataFrame) – The DataFrame to be analyzed for null value statistics.
- Returns:
The function prints detailed descriptive information to the console and returns None.
- Return type:
None
Examples
>>> import numpy as np
>>> import pandas as pd
>>> df_example = pd.DataFrame(
...     [
...         [np.nan, 13, "hello"],
...         [np.nan, np.nan, "this"],
...         [37, 45, "is"],
...         [256, 31, ""],
...         [1, np.nan, "test"],
...     ],
...     index=["First", "Second", "Third", "Fourth", "Fifth"],
...     columns=["Column1", "ColumnNumber2", "Column3"],
... )
>>> info_na(df_example)
# Expected output format:
type: <class 'pandas.core.frame.DataFrame'>
shape: (5, 3)
memory usage: 692 B
...
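The row-level null statistics described above can be approximated with plain pandas. The following is a minimal sketch, not the implementation of info_na: the helper name row_null_summary and the keys of the returned dictionary are hypothetical, and it assumes "row-level" means rows that contain at least one null value.

```python
import numpy as np
import pandas as pd


def row_null_summary(df: pd.DataFrame) -> dict:
    """Hypothetical helper: summarize rows that contain at least one null."""
    # isna() marks NaN/None cells; any(axis=1) flags rows with at least one.
    rows_with_null = int(df.isna().any(axis=1).sum())
    total_rows = len(df)
    pct = 100.0 * rows_with_null / total_rows if total_rows else 0.0
    return {
        "rows_with_null": rows_with_null,
        "total_rows": total_rows,
        "pct_rows_with_null": pct,
    }


# Same data as the docstring example above.
df_example = pd.DataFrame(
    [
        [np.nan, 13, "hello"],
        [np.nan, np.nan, "this"],
        [37, 45, "is"],
        [256, 31, ""],
        [1, np.nan, "test"],
    ],
    index=["First", "Second", "Third", "Fourth", "Fifth"],
    columns=["Column1", "ColumnNumber2", "Column3"],
)

summary = row_null_summary(df_example)
print(summary)  # 3 of the 5 rows contain a null, i.e. 60%
```

Note that the empty string in the "Fourth" row is not treated as null by pandas, so that row is not counted.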
- eda_mds.cat_var_stats(df, binning_threshold=2)
Generate summary statistics for categorical variables in a DataFrame.
This function analyzes the categorical columns in the provided DataFrame. For each column, it prints the number of unique values and the percentage frequency of each value, and it recommends binning low-frequency categories based on a specified threshold.
- Parameters:
df (pandas.DataFrame) – The DataFrame for which categorical variable stats are calculated.
binning_threshold (int, optional) – The percentage frequency threshold below which categories will be recommended for binning. Default is 2.
- Returns:
The function prints the statistics and returns None.
- Return type:
None
Examples
>>> import pandas as pd
>>> df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv')
>>> cat_var_stats(df)
Column: sex
Number of unique values: 2
Frequency of values:
male: 64.76%
female: 35.24%
------------------------------------
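To illustrate how a percentage threshold can flag categories for binning, here is a small sketch using pandas value_counts. It is an assumption about the general approach, not the code behind cat_var_stats; the helper name low_frequency_categories and the synthetic data are hypothetical.

```python
import pandas as pd


def low_frequency_categories(series: pd.Series, binning_threshold: float = 2):
    """Hypothetical helper: percentage frequencies and categories below a threshold."""
    # normalize=True returns proportions; multiply by 100 for percentages.
    pct = series.value_counts(normalize=True) * 100
    # Categories whose share is strictly below the threshold are binning candidates.
    to_bin = list(pct[pct < binning_threshold].index)
    return pct, to_bin


# Synthetic column: 60% "male", 39% "female", 1% "unknown".
sex = pd.Series(["male"] * 60 + ["female"] * 39 + ["unknown"] * 1)

pct, to_bin = low_frequency_categories(sex, binning_threshold=2)
for value, share in pct.items():
    print(f"{value}: {share:.2f}%")
print("Recommended for binning:", to_bin)  # only "unknown" falls below 2%
```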
- eda_mds.cor_eda(dataset, na_handling='drop')
Calculate the correlation between numerical variables in a DataFrame.
This function processes a given DataFrame to isolate numerical variables, handles missing values according to the specified method, calculates the correlation between each pair of numerical variables, and returns the results in a new DataFrame.
- Parameters:
dataset (DataFrame) – The DataFrame to be analyzed. It may include a variety of variable types; only the numerical variables are used.
na_handling (str, optional) – Method for handling missing values (NAs). The following options are available:
- 'drop': Drop rows with any NAs (default).
- 'mean': Replace NAs with the mean value of the column.
- 'median': Replace NAs with the median value of the column.
- Returns:
A DataFrame containing the correlation coefficients between each pair of numerical variables.
- Return type:
DataFrame
Examples
>>> cor_eda(data, na_handling='mean')
           age  salary
age     1.0000  0.9769
salary  0.9769  1.0000
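The steps described for cor_eda (isolate numerical variables, handle NAs, compute pairwise correlations) can be sketched with standard pandas calls. This is an illustrative approximation only: the function name correlation_sketch and the sample data are hypothetical, and the use of Pearson correlation (the default for DataFrame.corr) is an assumption, since the method is not stated above.

```python
import numpy as np
import pandas as pd


def correlation_sketch(dataset: pd.DataFrame, na_handling: str = "drop") -> pd.DataFrame:
    """Hypothetical sketch: correlation matrix of the numeric columns."""
    # Keep only numeric columns; text columns are ignored.
    numeric = dataset.select_dtypes(include="number")

    if na_handling == "drop":
        numeric = numeric.dropna()                   # drop rows with any NA
    elif na_handling == "mean":
        numeric = numeric.fillna(numeric.mean())     # per-column mean
    elif na_handling == "median":
        numeric = numeric.fillna(numeric.median())   # per-column median
    else:
        raise ValueError("na_handling must be 'drop', 'mean', or 'median'")

    # Pearson correlation by default (an assumption for this sketch).
    return numeric.corr()


# Made-up data for illustration only.
data = pd.DataFrame(
    {
        "age": [25, 32, np.nan, 47],
        "salary": [40000, 52000, 58000, 75000],
        "name": ["a", "b", "c", "d"],
    }
)

result = correlation_sketch(data, na_handling="drop")
print(result)
```

Dropping rows discards information from the other columns in those rows, while mean or median imputation keeps every row but can bias the correlation estimates; which trade-off is acceptable depends on how many values are missing.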
- eda_mds.describe_outliers(df, threshold=1.5, numeric=True)
Enhance pandas.DataFrame.describe() with outlier counts for numeric columns.
This function extends the output of pandas.DataFrame.describe() by counting and including lower-tail and upper-tail outliers for each numeric column in the DataFrame. The outlier count is determined using the Interquartile Range (IQR) method, with a customizable threshold for defining what constitutes an outlier.
- Parameters:
df (pandas.DataFrame) – A DataFrame with at least one numeric column.
threshold (float, optional) – A non-negative scalar that adjusts the sensitivity of outlier detection. A higher value decreases the sensitivity. The default is 1.5.
numeric (bool, optional) – If True, only numeric columns are included in the output. If False, the output includes the dtype and count for non-numeric columns as well. The default is True.
- Returns:
A DataFrame summarizing the descriptive statistics and including outlier counts.
- Return type:
pandas.DataFrame
Examples
>>> import pandas as pd
>>> data = {'numeric': [1, 2, 3, 4, 5, 100],
...         'categorical': ['a', 'b', 'c', 'd', 'e', 'f']}
>>> df = pd.DataFrame(data)
>>> describe_outliers(df, threshold=2, numeric=False)
# Output will display the DataFrame with the descriptive statistics and outlier counts.
Notes
Lower-tail outliers are calculated as values less than Q1 - (threshold * IQR). Upper-tail outliers are calculated as values greater than Q3 + (threshold * IQR).
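The two fences in the note above can be worked through on the 'numeric' column from the example. The sketch below is not the implementation of describe_outliers: the helper name count_iqr_outliers is hypothetical, and it assumes quartiles are computed with the default linear interpolation of pandas Series.quantile.

```python
import pandas as pd


def count_iqr_outliers(values: pd.Series, threshold: float = 1.5):
    """Hypothetical helper: count lower-tail and upper-tail IQR outliers."""
    q1 = values.quantile(0.25)  # default linear interpolation (assumption)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower_fence = q1 - threshold * iqr
    upper_fence = q3 + threshold * iqr
    lower = int((values < lower_fence).sum())
    upper = int((values > upper_fence).sum())
    return lower, upper


numeric = pd.Series([1, 2, 3, 4, 5, 100])

# Q1 = 2.25, Q3 = 4.75, IQR = 2.5.
# With threshold=2 the fences are -2.75 and 9.75, so only 100 is an outlier.
lower, upper = count_iqr_outliers(numeric, threshold=2)
print(lower, upper)  # 0 1
```

A larger threshold widens both fences, which is why higher values make the detection less sensitive.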