Data Exploration

tubesml.explore.corr_target(data, target, cols, x_estimator=None, **kwargs)

Draws a scatterplot with a fitted linear regression line for each column in a list against the target, and prints a correlation matrix. An estimator can optionally be passed through x_estimator.

Parameters:

data – pandas DataFrame. The input dataframe.
target – str, name of the target column
cols – list of columns to consider for the plot against the target.
x_estimator – (optional) additional input for sns.regplot.
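
The quantities behind this kind of plot can be sketched with pandas and NumPy alone. This is an illustration, not the code tubesml runs: the toy dataframe and its column names are made up, and np.polyfit stands in for the linear fit that a regression plot draws.

```python
import numpy as np
import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({
    "rooms": [1, 2, 3, 4, 5],
    "age": [30, 10, 25, 5, 20],
    "price": [100.0, 150.0, 210.0, 240.0, 300.0],
})
cols, target = ["rooms", "age"], "price"

# Correlation matrix of the selected columns and the target
print(toy[cols + [target]].corr())

# Least-squares line: price ~ slope * rooms + intercept
slope, intercept = np.polyfit(toy["rooms"], toy[target], deg=1)
print(slope, intercept)
```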

tubesml.explore.find_cats(data, target, thrs=0.1, agg_func='mean', critical=0.05, ks=True, frac=1)

Identify categorical features that show meaningful differences in the target distribution.

The function evaluates each object-type column and determines whether it is potentially predictive. Two approaches are available:

Kolmogorov-Smirnov test (default): For each category level with sufficient frequency, the function checks whether the distribution of the target within that subset differs significantly from the rest of the population.
Descriptive statistic comparison: For each category level, the chosen aggregation statistic of the target is compared across groups. If the variability across groups exceeds a fraction of the overall target standard deviation, the feature is considered relevant.
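
The Kolmogorov-Smirnov approach can be sketched with scipy.stats.ks_2samp. This is a minimal illustration under stated assumptions, not tubesml's implementation: the helper name ks_flag_levels, the toy data, and the exact way the frequency threshold is applied are all hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def ks_flag_levels(data, col, target, thrs=0.1, critical=0.05):
    """Hypothetical helper: levels of `col` whose target distribution
    differs significantly from the rest of the population."""
    flagged = []
    freqs = data[col].value_counts(normalize=True)
    for level, freq in freqs.items():
        if freq < thrs:
            continue  # skip levels that are too rare
        in_level = data.loc[data[col] == level, target]
        rest = data.loc[data[col] != level, target]
        if len(rest) == 0:
            continue  # a single level leaves nothing to compare against
        p_value = ks_2samp(in_level, rest).pvalue
        if p_value < critical:
            flagged.append(level)
    return flagged


# Made-up data: the target is shifted by 3 standard deviations in group "b"
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "group": ["a"] * 200 + ["b"] * 200,
    "y": np.concatenate([rng.normal(0, 1, 200), rng.normal(3, 1, 200)]),
})
flagged = ks_flag_levels(toy, "group", "y")
print(flagged)
```

A column with at least one flagged level would then be a candidate for the returned list.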

Parameters:

data – pandas DataFrame. The input dataset containing categorical features and the target.
target – str. Name of the target column.
thrs – float, optional. Minimum relative frequency required for a category level to be considered (default is 0.1).
agg_func – str or callable, optional. Aggregation function applied to the target when not using KS (default is “mean”).
critical – float, optional. Significance threshold for the KS test (default is 0.05).
ks – bool, optional. If True, use the Kolmogorov–Smirnov test; otherwise use descriptive statistics (default is True).
frac – float, optional. Minimum fraction of the target’s overall standard deviation required for a feature to be selected when using descriptive statistics (default is 1).

Returns:

list. A list of categorical column names that show significant differences in the target distribution.
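
The descriptive-statistic comparison (ks=False) can be sketched in plain pandas. This is not the library's code: the helper name varies_across_levels is hypothetical, and "variability across groups" is assumed here to mean the standard deviation of the per-level statistics.

```python
import pandas as pd


def varies_across_levels(data, col, target, thrs=0.1, agg_func="mean", frac=1):
    """Hypothetical helper: True if the per-level statistic of the target
    spreads more than `frac` times the overall target standard deviation."""
    freqs = data[col].value_counts(normalize=True)
    frequent = freqs[freqs >= thrs].index  # keep sufficiently frequent levels
    subset = data[data[col].isin(frequent)]
    per_level = subset.groupby(col)[target].agg(agg_func)
    if len(per_level) < 2:
        return False
    # Assumption: spread across groups = std of the per-level statistics
    return bool(per_level.std() > frac * data[target].std())


# Made-up data: sales depend strongly on shop, barely on colour
toy = pd.DataFrame({
    "shop": ["north", "north", "south", "south", "east", "east"],
    "colour": ["red", "blue", "red", "blue", "red", "blue"],
    "sales": [10.0, 12.0, 50.0, 52.0, 90.0, 92.0],
})
selected = [c for c in ["shop", "colour"] if varies_across_levels(toy, c, "sales")]
print(selected)
```

With these numbers the shop means (11, 51, 91) have a standard deviation of 40, above the overall standard deviation of about 35.8, while the colour means (50, 52) barely differ, so only shop is kept.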

tubesml.explore.list_missing(data, verbose=True)

Find all the columns with missing values and report on the percentage of missing values.

Parameters:

data – pandas DataFrame. The input dataframe.
verbose – bool, default=True. If True, prints the percentage of missing values for each column that has any.

Return mis_cols:

A list of column names with missing values.
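
The same information can be obtained with a few lines of pandas, which may help when checking the output. This is a sketch, not the function's source: the helper name missing_report and the toy dataframe are hypothetical.

```python
import numpy as np
import pandas as pd


def missing_report(data):
    """Hypothetical helper: percentage of missing values per affected column."""
    pct = data.isna().mean() * 100
    return pct[pct > 0].sort_values(ascending=False)


# Made-up data with gaps in two of the three columns
toy = pd.DataFrame({
    "age": [31, np.nan, 45, np.nan],
    "city": ["Rome", "Oslo", None, "Lima"],
    "income": [1.0, 2.0, 3.0, 4.0],
})
report = missing_report(toy)
mis_cols = report.index.tolist()
print(report)
print(mis_cols)
```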

tubesml.explore.plot_bivariate(data, x, y, hue=None, **kwargs)

Scatterplot of the feature x vs the feature y with the possibility of adding a hue.

Parameters:

data – pandas DataFrame. The input dataframe.
x – str, name of the feature to plot on the x-axis.
y – str, name of the feature to plot on the y-axis.
hue – (optional) str, feature to use as hue.
kwargs – additional keyword arguments to pass to sns.scatterplot.

tubesml.explore.plot_correlations(data, target=None, limit=50, figsize=(12, 10), **kwargs)

Plots the correlation matrix of a dataframe. If a target feature is provided, only the features most correlated with the target are displayed. The number of features displayed is controlled by the parameter limit.

Parameters:

data – pandas DataFrame. The input dataframe.
target – str, default=None. If not None, the correlation matrix is ordered from the feature most correlated with the target column to the least. It must be present in data.
limit – int, default=50. Number of features to display. This is to avoid plots that are difficult to read.
figsize – tuple, default=(12,10). Size of the output figure.
kwargs – kwargs to be passed to sns.heatmap. For example, to display annotations. See the documentation of Seaborn for more options.

Return cor_target:

Returned only if target is provided: the correlation matrix of the features in data.
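
The ordering and truncation described above can be sketched without any plotting. This is an illustration only: the helper name top_correlations is hypothetical, and ranking by absolute correlation and counting the target itself within limit are assumptions, not documented behavior.

```python
import pandas as pd


def top_correlations(data, target, limit=50):
    """Hypothetical helper: correlation matrix restricted to the `limit`
    columns most correlated (in absolute value) with `target`."""
    corr = data.select_dtypes("number").corr()
    order = corr[target].abs().sort_values(ascending=False).index[:limit]
    return corr.loc[order, order]


# Made-up data: x1 tracks y closely, x2 is anti-correlated, x3 is unrelated
toy = pd.DataFrame({
    "x1": [1, 2, 3, 4, 6],
    "x2": [5, 3, 4, 1, 2],
    "x3": [2, 1, 2, 1, 2],
    "y": [2, 4, 6, 8, 10],
})
cor = top_correlations(toy, "y", limit=3)
print(cor.columns.tolist())
```

The resulting frame is what a heatmap would then display, for example through sns.heatmap.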

tubesml.explore.plot_distribution(data, column, bins=50, correlation=None)

Plots a histogram of a given column. If a Pandas Series is provided with the correlation values, it will be displayed in the title.

Parameters:

data – pandas DataFrame. The input dataframe.
column – str, name of the column to plot. If correlation is provided, make sure this column is in the index of that Series.
bins – int, number of bins.
correlation – (optional) pandas Series. Ideally the output of tubesml.plot_correlations.

tubesml.explore.segm_target(data, cat, target)

Study the distribution of a continuous target segmented by a categorical feature.

The function computes descriptive statistics of the target for each category level and provides visual support through a boxplot and kernel density plots. This helps assess how the target varies across different segments of the categorical feature.

Parameters:

data – pandas DataFrame. The input dataset containing the categorical feature and target.
cat – str. Name of the categorical column used to segment the data.
target – str. Name of the continuous target variable.

Returns:

pandas DataFrame. A summary table with count, mean, max, min, median, and standard deviation of the target for each category level.
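
A summary table with the same statistics can be built with a pandas groupby, which is a convenient way to check the numbers. This is a sketch on made-up data, not the function's source, and the column order is an assumption.

```python
import pandas as pd

# Made-up data: a continuous target and one categorical feature
toy = pd.DataFrame({
    "segment": ["a", "a", "a", "b", "b"],
    "price": [10.0, 20.0, 30.0, 5.0, 15.0],
})

# One row per category level, one column per statistic of the target
summary = toy.groupby("segment")["price"].agg(
    ["count", "mean", "max", "min", "median", "std"]
)
print(summary)
```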