Model Inspection


tubesml.model_inspection.get_coef(pipe, feats=None)

Get dataframe with coefficients of a model in Pipeline.

The step before the model has to have a get_feature_names_out method.

If a simple estimator is provided, it creates a pipeline with a BaseTransformer. In that case, the feats input is not optional and there is no need for a get_feature_names_out method.

Parameters:

pipe – pipeline or estimator
feats – (optional) list of features the estimator uses.

Return result:

pandas DataFrame with a Feature column containing the feature names and a score column containing the coefficient values, ordered by absolute magnitude.
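The ordering by absolute magnitude can be illustrated without the library. The following is a minimal sketch in plain Python, not the tubesml implementation: the helper rank_by_abs, the feature names, and the coefficient values are all hypothetical, and putting the largest magnitude first is an assumption.

```python
def rank_by_abs(feats, coefs):
    """Pair feature names with coefficients, largest |coefficient| first.

    Descending order is an assumption made for this sketch.
    """
    rows = [{"Feature": f, "score": c} for f, c in zip(feats, coefs)]
    return sorted(rows, key=lambda r: abs(r["score"]), reverse=True)


# Made-up features and coefficients, standing in for a fitted linear model.
ranked = rank_by_abs(["age", "income", "height"], [0.5, -2.0, 1.25])
for row in ranked:
    print(row["Feature"], row["score"])
```

A negative coefficient with a large magnitude ranks above a smaller positive one, which is why the sign is kept in the score column while the sort uses the absolute value.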

tubesml.model_inspection.get_feature_importance(pipe, feats=None)

Get dataframe with the feature importance of a model in Pipeline.

The step before the model has to have a get_feature_names_out method.

If a simple estimator is provided, it creates a pipeline with a BaseTransformer. In that case, the feats input is not optional and there is no need for a get_feature_names_out method.

Parameters:

pipe – pipeline or estimator
feats – (optional) list of features the estimator uses.

Return result:

pandas DataFrame with a Feature column containing the feature names and a score column containing the feature importance values, ordered by magnitude.

tubesml.model_inspection.get_pdp(estimator, feature, data, grid_resolution=100)

Calculates the partial dependence of the model on a variable. It is a wrapper around sklearn.inspection.partial_dependence.

Parameters:

estimator – model or pipeline with a predict method. If the fit method was not previously called, it will throw an error.
feature – string or tuple of 2 strings. The feature for which to create the partial dependence. If it is a tuple, a 2-way partial dependence will be created.
data – pandas DataFrame. It must contain the features the estimator uses to generate predictions. If feature is not present in this dataframe, an error will be raised.
grid_resolution – Integer, default 100. The number of equally spaced points on the grid.

Returns:

pandas DataFrame with columns x (the feature values in the grid), feat (the feature name), y (the values of the partial dependence). If feature is a tuple, there is also another column x_1 with the values in the grid of the second feature of the tuple. If feature is a string, x_1 is empty.
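As a rough illustration of what a one-way partial dependence measures, the sketch below forces the feature to each grid value for every row and averages the predictions. It is plain Python and not the code used by get_pdp or scikit-learn: the helper partial_dependence_1d and the toy model are hypothetical, and the grid spanning the observed minimum to maximum is a simplifying assumption.

```python
def partial_dependence_1d(predict, rows, feature, grid_resolution=5):
    """Average prediction over `rows` with `feature` forced to each grid value.

    `grid_resolution` must be at least 2 in this sketch.
    """
    values = [row[feature] for row in rows]
    lo, hi = min(values), max(values)
    # Equally spaced grid between the observed minimum and maximum (assumption).
    step = (hi - lo) / (grid_resolution - 1)
    grid = [lo + i * step for i in range(grid_resolution)]
    out = []
    for x in grid:
        # Overwrite the feature in every row, predict, then average.
        preds = [predict({**row, feature: x}) for row in rows]
        out.append({"x": x, "feat": feature, "y": sum(preds) / len(preds)})
    return out


def toy_predict(row):
    """Hypothetical fitted model: a fixed linear function of two features."""
    return 2.0 * row["a"] + 1.0 * row["b"]


data = [{"a": 0.0, "b": 1.0}, {"a": 1.0, "b": 3.0}, {"a": 2.0, "b": 5.0}]
pdp = partial_dependence_1d(toy_predict, data, "a", grid_resolution=3)
for point in pdp:
    print(point)
```

The output rows mirror the x, feat, and y columns described above. For the linear toy model, y grows by 2 for every unit of x, which is the coefficient of the feature.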

tubesml.model_inspection.plot_feat_imp(data, n=-1, imp='shap', savename=None)

Plots a barplot with error bars of feature importance. It works with coefficients too.

Parameters:

data – pandas DataFrame with a mean and a std column. A KeyError is raised if either of these columns is missing. If shap is selected, the columns have to be shap_importance and shap_std.
n – int, default=-1. Number of features to display.
imp – string, default=shap. Allowed values are shap, standard, or both. If shap, the importances coming from shap values will be in the plot. If standard, the one coming from the model method. If both, 2 plots will be produced side by side.
savename – (optional) str with the name of the file to use to save the figure. If not provided, the function simply plots the figure.

tubesml.model_inspection.plot_learning_curve(estimator, X, y, scoring=None, ylim=None, cv=None, n_jobs=None, train_sizes=None, title=None)

Plot the learning curve and the scalability of the model. The estimate is an average across the folds in the cross-validation; the uncertainty is the unbiased standard deviation of the mean.

It may create issues when both the estimator and this function have n_jobs>1.

Moreover, it doesn’t behave well with early stopping, which produces no result. In that case, a RuntimeError is raised.

Parameters:

estimator – estimator or pipeline.
X – array-like of shape (n_samples, n_features). The training input samples.
y – array-like of shape (n_samples,) or (n_samples, n_outputs), optional. The target values (class labels) as integers or strings.
scoring – string. Scoring metric for the learning curve, see the sklearn documentation for the available options.
ylim – (optional) tuple with the limits to use in the y-axis of the plots.
cv – int, or KFold generator. The learning curves will be computed with prediction out of folds generated by this cross-validation choice.
n_jobs – int, number of jobs to run in parallel.
train_sizes – array-like of shape (n_ticks,), default=np.linspace(0.1, 1.0, 5). Relative or absolute numbers of training examples that will be used to generate the learning curve. If the dtype is float, it is regarded as a fraction of the maximum size of the training set (that is determined by the selected validation method), i.e. it has to be within (0, 1]. Otherwise it is interpreted as absolute sizes of the training sets. Note that for classification the number of samples usually has to be big enough to contain at least one sample from each class.
title – (optional) string for the figure title.
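The two interpretations of train_sizes can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the library's or scikit-learn's code: the helper resolve_train_sizes is hypothetical, and truncating fractional sizes to an integer is an assumption.

```python
def resolve_train_sizes(train_sizes, n_max_training_samples):
    """Turn relative (float) or absolute (int) sizes into absolute sample counts."""
    if all(isinstance(s, float) for s in train_sizes):
        # Floats are fractions of the largest available training set.
        if not all(0 < s <= 1 for s in train_sizes):
            raise ValueError("fractions must lie within (0, 1]")
        return [int(s * n_max_training_samples) for s in train_sizes]
    # Anything else is read as absolute training set sizes.
    return [int(s) for s in train_sizes]


# With 100 samples and 5-fold cross-validation, each training fold has 80 rows.
relative = resolve_train_sizes([0.25, 0.5, 1.0], 80)
absolute = resolve_train_sizes([10, 50], 80)
print(relative, absolute)
```

The maximum size depends on the cross-validation scheme, so the same fractions resolve to different sample counts when cv changes.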

tubesml.model_inspection.plot_partial_dependence(pdps, savename=None)

Plot all the pdps in the dataframe in a plot with 2 columns and as many rows as necessary. The function is a wrapper around tubesml.plot_pdp.

Parameters:

pdps – pandas DataFrame with the partial dependences. It must contain a feat, a x

tubesml.model_inspection.plot_pdp(data, feature, title, axes)

Plot the partial dependence of a feature on the given axes. If available, the uncertainty is plotted around it.

Parameters:

data – pandas DataFrame with the partial dependence. It must contain a feat and an x column, and either a y or a mean column. If there is an std column, it is plotted as the uncertainty around the mean.
feature – string. The feature to plot on the x-axis of the partial dependence.
title – string. The title on top of the plot.
axes – matplotlib axes. The plot is drawn on these axes.

Returns:

matplotlib axes with the plot.

tubesml.model_inspection.plot_shap_values(shap_values, features='all', savename=None)

Plots SHAP values and SHAP interaction values for the specified features used in the model.

Parameters:

shap_values – object. The SHAP values object containing attributes such as values and data.
features – str or list, optional. Features to plot. If a list is provided, only those features are plotted. If set to "all", all available features are used (default is “all”).

tubesml.model_inspection.plot_two_pdp(data, feature, title, axes)

This function is still in development. Plot a 2-way partial dependence.

Parameters:

data – pandas DataFrame with the partial dependence. It must contain feat, x, x_1, and y columns.
feature – tuple of strings. The features to plot on the x- and y-axis of the partial dependence.
title – string. The title on top of the plot.
axes – matplotlib axes. The plot is drawn on these axes.

Returns:

matplotlib axes with the plot.

tubesml.report.eval_classification(data, target, preds, proba=False, thrs=0.5, plot=True, **kwargs)

Evaluate a binary classifier using accuracy, ROC AUC, and diagnostic plots.

The function computes standard classification metrics and optionally produces visual diagnostics. When predictions are probabilities, a threshold is applied to derive class labels and an ROC curve can be plotted. Additional plots such as probability distributions and confusion matrices are generated depending on the plot level.

Parameters:

data – pandas DataFrame. Dataset used for plotting when proba=True and plot > 0.
target – array-like. Ground truth binary labels.
preds – array-like. Predicted labels or predicted probabilities.
proba – bool, optional. If True, preds is treated as predicted probabilities and thresholded to obtain class labels (default is False).
thrs – float, optional. Threshold applied to predicted probabilities when proba=True (default is 0.5).
plot – int or bool, optional. Controls the level of plotting (default is True, equivalent to 1):
    - 0: no plots
    - 1: confusion matrix or probability diagnostics
    - 2: also plot ROC curve
kwargs – dict, optional. Additional keyword arguments passed to plot_classification_probs.

Returns:

None. Prints evaluation metrics and displays plots depending on settings.
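The role of thrs when proba=True can be illustrated with a few lines of plain Python. This is a sketch rather than the function's actual code: the helpers threshold_labels and accuracy are hypothetical, the data is made up, and treating a probability equal to the threshold as the positive class is an assumption.

```python
def threshold_labels(probs, thrs=0.5):
    """Turn probabilities into 0/1 labels (>= thrs counts as 1, by assumption)."""
    return [int(p >= thrs) for p in probs]


def accuracy(target, labels):
    """Fraction of predicted labels that match the ground truth."""
    return sum(t == l for t, l in zip(target, labels)) / len(target)


target = [0, 1, 1, 0, 1]
probs = [0.2, 0.8, 0.4, 0.6, 0.9]

labels_default = threshold_labels(probs, thrs=0.5)
labels_lower = threshold_labels(probs, thrs=0.3)
print(accuracy(target, labels_default), accuracy(target, labels_lower))
```

Lowering the threshold recovers the positive case predicted at 0.4, so the accuracy on this toy data changes with thrs even though the probabilities are the same.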

tubesml.report.plot_classification_probs(data, true_label, pred_label, thrs=0.5, sample=None, feat=None, hue_feat=None, savename=None)

Visualize predicted probabilities against true labels with multiple diagnostic plots.

The function produces a set of four plots to help evaluate the behavior of a probabilistic classifier:

Histogram of predicted probabilities split by true label
Normalized confusion matrix
Barplot comparing mean true label vs. mean predicted probability
Scatterplot of a continuous feature vs. predicted probability, optionally segmented by a categorical feature

A continuous feature can be provided for the scatterplot; if missing, a dummy feature is created. A categorical feature may be used for segmentation, provided it has no more than five unique values.

Parameters:

data – pandas DataFrame. Input dataset containing the features and labels.
true_label – array-like. Ground truth binary labels.
pred_label – array-like. Predicted probabilities from a classifier.
thrs – float, optional. Threshold applied to predicted probabilities to derive class labels (default is 0.5).
sample – int, optional. If provided, randomly samples this many rows for the scatterplot to improve readability.
feat – str, optional. Name of a continuous feature to plot against predicted probabilities. If missing or invalid, a dummy feature is created.
hue_feat – str, optional. Name of a categorical feature used to segment the scatterplot. Ignored if not present or if it has more than five unique values.
savename – str, optional. If provided, saves the figure to this path instead of displaying it.

Returns:

None. Displays or saves a 2x2 grid of diagnostic plots.
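The rule for hue_feat (ignored if absent or if it has more than five unique values) can be sketched in plain Python. The helper usable_hue is hypothetical and works on a list of dicts instead of a DataFrame; it only mirrors the rule as described above, not the library's code.

```python
def usable_hue(rows, hue_feat, max_unique=5):
    """Return hue_feat if it can be used for segmentation, otherwise None."""
    if hue_feat is None or any(hue_feat not in row for row in rows):
        return None  # feature not provided or not present in the data
    n_unique = len({row[hue_feat] for row in rows})
    return hue_feat if n_unique <= max_unique else None


rows = [
    {"x": 1.0, "group": "a"},
    {"x": 2.0, "group": "b"},
    {"x": 3.0, "group": "a"},
]
many_values = [{"id": i} for i in range(6)]

print(usable_hue(rows, "group"))      # two unique values: usable
print(usable_hue(rows, "missing"))    # not in the data: ignored
print(usable_hue(many_values, "id"))  # six unique values: ignored
```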

tubesml.report.plot_confusion_matrix(true_label, pred_label, ax=None, thrs=0.5, proba=True)

Plot a normalized confusion matrix from true and predicted labels.

The function computes a confusion matrix and displays it as a heatmap. If predicted values are probabilities, a threshold can be applied to convert them into class labels. A custom matplotlib axis can be provided; otherwise, a new figure is created.

Parameters:

true_label – array-like. Ground truth labels.
pred_label – array-like. Predicted labels or predicted probabilities.
ax – matplotlib Axes, optional. Axis on which to draw the heatmap. If None, a new figure is created.
thrs – float, optional. Threshold applied to predicted probabilities when proba=True (default is 0.5).
proba – bool, optional. If True, pred_label is treated as probabilities and thresholded. If False, pred_label is assumed to contain class labels.

Returns:

matplotlib Axes or None. Returns the axis when provided; otherwise displays the plot.
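What a normalized confusion matrix contains can be shown with a small plain-Python sketch. The documentation does not state how the matrix is normalized, so dividing each row by the number of samples with that true label is an assumption here, as are the helper normalized_confusion, the binary 0/1 labels, and the >= comparison against thrs.

```python
def normalized_confusion(true_label, pred_label, thrs=0.5, proba=True):
    """Binary confusion matrix with each true-label row summing to 1."""
    if proba:
        labels = [int(p >= thrs) for p in pred_label]
    else:
        labels = list(pred_label)
    counts = [[0, 0], [0, 0]]  # rows: true label, columns: predicted label
    for t, p in zip(true_label, labels):
        counts[t][p] += 1
    return [
        [c / sum(row) if sum(row) else 0.0 for c in row]
        for row in counts
    ]


true_label = [0, 0, 1, 1, 1, 1]
probs = [0.1, 0.7, 0.8, 0.9, 0.3, 0.6]
matrix = normalized_confusion(true_label, probs, thrs=0.5, proba=True)
print(matrix)
```

With this normalization the diagonal reads as the per-class recall, which stays comparable between classes of different sizes.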

tubesml.report.plot_regression_predictions(data, true_label, pred_label, hue=None, feature=None, savename=None)

Plot the predictions against the true label, together with the distributions of both. The hue parameter shows the influence of a categorical feature. The predictions can also be plotted against one or more features in the data, which, together with a residuals plot, helps identify undesired patterns.

Parameters:

data – pandas DataFrame. Ideally, the dataframe used for training the model you are evaluating. It must have the same number of rows as true_label and pred_label.
true_label – pandas Series, numpy array, or list with the true values of the target variable.
pred_label – pandas Series, numpy array, or list with the predicted values of the target variable.
hue – (optional) str, name of the feature to use as hue in the scatter plot. It must be in data, otherwise it is ignored after a warning. For readability, it is also ignored when it has more than 5 unique values.
feature – (optional) str or list, feature(s) to use as x-axis in the scatter plot against the prediction. Using this option produces 2 more plots for each feature provided: one with the feature vs the prediction and one with the feature vs the residuals.
savename – (optional) str with the name of the file to use to save the figure. If not provided, the function simply plots the figure.
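The residuals shown in the extra plots can be sketched as the element-wise difference between labels and predictions. This is a plain-Python illustration and not the library's code: the helper residuals is hypothetical, and the sign convention (true value minus prediction) is an assumption.

```python
def residuals(true_label, pred_label):
    """Element-wise residuals, true minus predicted (assumed convention)."""
    if len(true_label) != len(pred_label):
        raise ValueError("true_label and pred_label must have the same length")
    return [t - p for t, p in zip(true_label, pred_label)]


true_values = [3.0, 5.0, 7.5]
predictions = [2.5, 5.0, 8.0]
res = residuals(true_values, predictions)
print(res)
```

Residuals that drift with a feature, rather than scattering around zero, are the kind of undesired pattern these plots are meant to expose.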