Data Exploration

tubesml.explore.corr_target(data, target, cols, x_estimator=None, **kwargs)

Draws a scatterplot with a fitted linear regression line for each column in a list against the target, and prints the correlation matrix. An estimator can optionally be passed through x_estimator.

Parameters:
  • data – pandas DataFrame. The input dataframe.

  • target – str, name of the target column

  • cols – list of columns to consider for the plot against the target.

  • x_estimator – (optional) additional input for sns.regplot.

tubesml.explore.find_cats(data, target, thrs=0.1, agg_func='mean', critical=0.05, ks=True, frac=1)

Identify categorical features that show meaningful differences in the target distribution.

The function evaluates each object-type column and determines whether it is potentially predictive. Two approaches are available:

  • Kolmogorov-Smirnov test (default): For each category level with sufficient frequency, the function checks whether the distribution of the target within that subset differs significantly from the rest of the population.

  • Descriptive statistic comparison: For each category level, the chosen aggregation statistic of the target is compared across groups. If the variability across groups exceeds a fraction of the overall target standard deviation, the feature is considered relevant.

Parameters:
  • data – pandas DataFrame. The input dataset containing categorical features and the target.

  • target – str. Name of the target column.

  • thrs – float, optional. Minimum relative frequency required for a category level to be considered (default is 0.1).

  • agg_func – str or callable, optional. Aggregation function applied to the target when not using KS (default is “mean”).

  • critical – float, optional. Significance threshold for the KS test (default is 0.05).

  • ks – bool, optional. If True, use the Kolmogorov–Smirnov test; otherwise use descriptive statistics (default is True).

  • frac – float, optional. Minimum fraction of the target’s overall standard deviation required for a feature to be selected when using descriptive statistics (default is 1).

Returns:

list. A list of categorical column names that show significant differences in the target distribution.
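
The two approaches described above can be sketched with pandas and SciPy. This is a minimal illustration, not the tubesml implementation: the helper names ks_relevant_cats and stat_relevant_cats are hypothetical, and the sketch assumes that "variability across groups" means the standard deviation of the per-level statistic and that a column is selected as soon as one of its levels passes the KS test.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def ks_relevant_cats(data, target, thrs=0.1, critical=0.05):
    """Hypothetical helper: flag object columns with a two-sample KS test."""
    selected = []
    for col in data.select_dtypes(include="object").columns:
        freqs = data[col].value_counts(normalize=True)
        # Only levels that are frequent enough are tested
        for level in freqs[freqs >= thrs].index:
            in_level = data[col] == level
            inside = data.loc[in_level, target]
            outside = data.loc[~in_level, target]
            if len(outside) == 0:
                continue
            # A small p-value suggests the two target distributions differ
            if ks_2samp(inside, outside).pvalue < critical:
                selected.append(col)
                break  # assumption: one significant level is enough
    return selected


def stat_relevant_cats(data, target, agg_func="mean", frac=1):
    """Hypothetical helper: compare a per-level statistic of the target."""
    selected = []
    overall_std = data[target].std()
    for col in data.select_dtypes(include="object").columns:
        per_level = data.groupby(col)[target].agg(agg_func)
        # Assumed reading: spread of the per-level statistic vs. overall std
        if per_level.std() > frac * overall_std:
            selected.append(col)
    return selected


# Deterministic toy data: "group" shifts the target, "noise" does not
base = np.repeat(np.arange(50), 2)
df = pd.DataFrame({
    "group": pd.Series(["a"] * 100 + ["b"] * 100, dtype=object),
    "noise": pd.Series(["x", "y"] * 100, dtype=object),
    "y": np.concatenate([base, base + 100]).astype(float),
})

ks_cols = ks_relevant_cats(df, "y")
stat_cols = stat_relevant_cats(df, "y")
print(ks_cols, stat_cols)
```

In this toy dataset the target values are identical across the levels of "noise", so neither approach should select it, while "group" separates the target into two disjoint ranges.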

tubesml.explore.list_missing(data, verbose=True)

Find all the columns with missing values and report on the percentage of missing values.

Parameters:
  • data – pandas DataFrame. The input dataframe.

  • verbose – bool, default=True. If True, prints the percentage of missing values for each column that has missing values.

Return mis_cols:

A list of column names with missing values.
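
For readers who want to see what such a report involves, the behaviour described above can be approximated in plain pandas. The helper name missing_report and the printed format are assumptions made for illustration; the actual output format of list_missing may differ.

```python
import numpy as np
import pandas as pd


def missing_report(data, verbose=True):
    """Hypothetical helper approximating the described behaviour."""
    # Share of missing values per column, as a percentage
    pct_missing = data.isna().mean() * 100
    mis_cols = pct_missing[pct_missing > 0].index.tolist()
    if verbose:
        for col in mis_cols:
            print(f"{col}: {pct_missing[col]:.2f}% missing")
    return mis_cols


df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": ["x", "y", None, None],
    "c": [10, 20, 30, 40],
})
mis_cols = missing_report(df)
```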

tubesml.explore.plot_bivariate(data, x, y, hue=None, **kwargs)

Scatterplot of the feature x vs the feature y with the possibility of adding a hue.

Parameters:
  • data – pandas DataFrame. The input dataframe.

  • x – str, name of the feature to plot on the x-axis.

  • y – str, name of the feature to plot on the y-axis.

  • hue – (optional) str, feature to use as hue.

  • kwargs – additional keyword arguments passed to sns.scatterplot.

tubesml.explore.plot_correlations(data, target=None, limit=50, figsize=(12, 10), **kwargs)

Plots the correlation matrix of a dataframe. If a target feature is provided, only the features most correlated with the target are displayed; the limit parameter controls how many.

Parameters:
  • data – pandas DataFrame. The input dataframe.

  • target – str, default=None. If provided, the correlation matrix is displayed in order from the feature most correlated with the target column to the least. The column must be present in data.

  • limit – int, default=50. Number of features to display. This avoids plots that are difficult to read.

  • figsize – tuple, default=(12,10). Size of the output figure.

  • kwargs – kwargs to be passed to sns.heatmap. For example, to display annotations. See the documentation of Seaborn for more options.

Return cor_target:

Returned only if target is provided. The correlation matrix of the features in data.
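
The ordering and truncation by limit can be illustrated without any plotting. The sketch below is not the library's code: the helper name top_correlated is hypothetical, and it assumes that features are ranked by the absolute value of their correlation with the target and that the target itself counts toward limit.

```python
import pandas as pd


def top_correlated(data, target, limit=50):
    """Hypothetical helper: correlation matrix ordered by target correlation."""
    corr = data.corr(numeric_only=True)
    # Assumption: rank by absolute correlation with the target
    order = corr[target].abs().sort_values(ascending=False).index[:limit]
    return corr.loc[order, order]


df = pd.DataFrame({
    "x1": [2, 1, 4, 3, 5],
    "x2": [5, 3, 1, 4, 2],
    "x3": [5, 4, 3, 1, 2],
    "y": [1, 2, 3, 4, 5],
})
top = top_correlated(df, "y", limit=3)
print(top.round(2))
```

The resulting matrix could then be passed to sns.heatmap, which is the plotting function the kwargs parameter refers to.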

tubesml.explore.plot_distribution(data, column, bins=50, correlation=None)

Plots a histogram of a given column. If a pandas Series of correlation values is provided, the value for this column is displayed in the title.

Parameters:
  • data – pandas DataFrame. The input dataframe.

  • column – str, name of the column to plot. If correlation is provided, this column must be among the index labels of that Series.

  • bins – int, number of bins.

  • correlation – (optional) pandas Series. Ideally the output of tubesml.plot_correlations

tubesml.explore.segm_target(data, cat, target)

Study the distribution of a continuous target segmented by a categorical feature.

The function computes descriptive statistics of the target for each category level and provides visual support through a boxplot and kernel density plots. This helps assess how the target varies across different segments of the categorical feature.

Parameters:
  • data – pandas DataFrame. The input dataset containing the categorical feature and target.

  • cat – str. Name of the categorical column used to segment the data.

  • target – str. Name of the continuous target variable.

Returns:

pandas DataFrame. A summary table with count, mean, max, min, median, and standard deviation of the target for each category level.
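
The summary table described above corresponds to a grouped aggregation in pandas. The following sketch covers only the table, not the boxplot or the kernel density plots; the helper name segment_summary and the column order are assumptions for illustration.

```python
import pandas as pd


def segment_summary(data, cat, target):
    """Hypothetical helper: descriptive statistics of target per category level."""
    stats = ["count", "mean", "max", "min", "median", "std"]
    return data.groupby(cat)[target].agg(stats)


df = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B"],
    "price": [10.0, 20.0, 30.0, 5.0, 15.0],
})
summary = segment_summary(df, "city", "price")
print(summary)
```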