Model Inspection


tubesml.model_inspection.get_coef(pipe, feats=None)

Get dataframe with coefficients of a model in Pipeline.

The step before the model has to have a get_feature_names_out method.

If a simple estimator is provided, it creates a pipeline with a BaseTransformer. In that case, the feats input is not optional and there is no need for a get_feature_names_out method.

Parameters:

pipe – pipeline or estimator
feats – (optional) list of features the estimator uses.

Return result:

pandas DataFrame with a Feature column holding the feature names and a score column holding the coefficient values, ordered by absolute magnitude.

tubesml.model_inspection.get_feature_importance(pipe, feats=None)

Get dataframe with the feature importance of a model in Pipeline.

The step before the model has to have a get_feature_names_out method.

If a simple estimator is provided, it creates a pipeline with a BaseTransformer. In that case, the feats input is not optional and there is no need for a get_feature_names_out method.

Parameters:

pipe – pipeline or estimator
feats – (optional) list of features the estimator uses.

Return result:

pandas DataFrame with a Feature column holding the feature names and a score column holding the feature importance values, ordered by magnitude.
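For the bare-estimator case, where the feature list has to be passed explicitly, the same kind of table can be sketched directly from the feature_importances_ attribute of a fitted scikit-learn model. This is an illustration under stated assumptions (a DecisionTreeRegressor on synthetic data, descending order), not the code tubesml runs internally.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
feats = ["a", "b", "c"]  # the feature list a bare estimator cannot provide itself
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=feats)
y = 5.0 * X["a"] + 0.5 * X["b"] + rng.normal(scale=0.1, size=300)

model = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

# Pair each feature name with its importance and sort, largest first (assumed direction).
importance = (
    pd.DataFrame({"Feature": feats, "score": model.feature_importances_})
    .sort_values("score", ascending=False)
    .reset_index(drop=True)
)
print(importance)
```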

tubesml.model_inspection.get_pdp(estimator, feature, data, grid_resolution=100)

Calculates the partial dependence of the model on a variable. It is a wrapper around sklearn.inspection.partial_dependence.

Parameters:

estimator – model or pipeline with a predict method. If the fit method was not previously called, it will throw an error.
feature – string or tuple of 2 strings. The feature for which to create the partial dependence. If it is a tuple, a 2-way partial dependence will be created.
data – pandas DataFrame. It must contain the features the estimator uses to generate predictions. If feature is not present in this dataframe, an error will be raised.
grid_resolution – Integer, default 100. The number of equally spaced points on the grid.

Returns:

pandas DataFrame with columns x (the feature values in the grid), feat (the feature name), y (the values of the partial dependence). If feature is a tuple, there is also another column x_1 with the values in the grid of the second feature of the tuple. If feature is a string, x_1 is empty.

tubesml.model_inspection.plot_feat_imp(data, n=-1, imp='shap', savename=None)

Plots a barplot with error bars of feature importance. It works with coefficients too.

Parameters:

data – pandas DataFrame with a mean and a std column. A KeyError is raised if either of these columns is missing. If shap is selected, the columns have to be shap_importance and shap_std.
n – int, default=-1. Number of features to display.
imp – string, default=shap. Allowed values are shap, standard, or both. If shap, the importances coming from shap values will be in the plot. If standard, the ones coming from the model method. If both, 2 plots will be produced side by side.
savename – (optional) str with the name of the file to use to save the figure. If not provided, the function simply plots the figure.
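The expected input and the kind of figure produced can be sketched with plain matplotlib. This is a hypothetical illustration, not the tubesml plotting code: it assumes the feature names live in the index of the DataFrame, that the n largest importances are shown, and it uses the non-interactive Agg backend so that it runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Assumed layout: one row per feature, names in the index.
data = pd.DataFrame(
    {"mean": [0.50, 0.30, 0.15, 0.05], "std": [0.05, 0.04, 0.02, 0.01]},
    index=["a", "b", "c", "d"],
)

# Mimic the documented behaviour: fail early if a required column is missing.
missing = {"mean", "std"} - set(data.columns)
if missing:
    raise KeyError(f"Missing columns: {sorted(missing)}")

n = 3  # number of features to display
top = data.sort_values("mean", ascending=False).head(n)

fig, ax = plt.subplots(figsize=(6, 3))
# Horizontal bars with the standard deviation as error bars.
ax.barh(list(top.index), top["mean"], xerr=top["std"])
ax.invert_yaxis()  # largest importance on top
ax.set_xlabel("importance")
```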

tubesml.model_inspection.plot_learning_curve(estimator, X, y, scoring=None, ylim=None, cv=None, n_jobs=None, train_sizes=None, title=None)

Plot the learning curve and the scalability of the model. The estimate is an average across the folds in the cross-validation; the uncertainty is the unbiased standard deviation of the mean.

It may create issues when both the estimator and this function have n_jobs>1.

Moreover, it doesn’t behave well with early stopping, which produces no result. In that case, a RuntimeError is raised.

Parameters:

estimator – estimator or pipeline.
X – array-like of shape (n_samples, n_features). The training input samples.
y – array-like of shape (n_samples,) or (n_samples, n_outputs), optional. The target values (class labels) as integers or strings.
scoring – string. Scoring metric for the learning curve, see the sklearn documentation for the available options.
ylim – (optional) tuple with the limits to use in the y-axis of the plots.
cv – int or KFold generator. The learning curves will be computed with out-of-fold predictions generated by this cross-validation choice.
n_jobs – int, number of jobs to run in parallel.
train_sizes – array-like of shape (n_ticks,), default=np.linspace(0.1, 1.0, 5). Relative or absolute numbers of training examples that will be used to generate the learning curve. If the dtype is float, it is regarded as a fraction of the maximum size of the training set (that is determined by the selected validation method), i.e. it has to be within (0, 1]. Otherwise it is interpreted as absolute sizes of the training sets. Note that for classification the number of samples usually has to be big enough to contain at least one sample from each class.
title – (optional) string for the figure title.
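The numbers behind such a plot can be sketched with sklearn.model_selection.learning_curve. The example below is an assumption-laden illustration rather than the function's implementation: it uses synthetic data, a LogisticRegression, and takes the sample standard deviation across folds (ddof=1) as the uncertainty; whether tubesml additionally divides by the square root of the number of folds is not stated here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, learning_curve

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    cv=cv,
    scoring="accuracy",
    train_sizes=np.linspace(0.1, 1.0, 5),
)

# Scores have one row per training size and one column per fold.
test_mean = test_scores.mean(axis=1)
# ddof=1 gives the unbiased sample standard deviation across folds (assumed choice).
test_std = test_scores.std(axis=1, ddof=1)

for size, m, s in zip(train_sizes, test_mean, test_std):
    print(f"{size:4d} samples: {m:.3f} +/- {s:.3f}")
```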

tubesml.model_inspection.plot_partial_dependence(pdps, savename=None)

Plot all the pdps in the dataframe in a plot with 2 columns and as many rows as necessary. The function is a wrapper around tubesml.plot_pdp.

Parameters:

pdps – pandas DataFrame with the partial dependences. It must contain a feat, a x

tubesml.model_inspection.plot_pdp(data, feature, title, axes)

Plot the partial dependence of a feature in an ax. If available, the uncertainty is plotted around it.

Parameters:

data – pandas DataFrame with the partial dependence. It must contain a feat, a x, and either a y or a mean column. If there is a std column, it will be plotted as uncertainty around the mean.
feature – string. The feature to plot as x axis in the partial dependence.
title – string. The title on top of the plot.
axes – matplotlib axes. The plot will take place in this axes.

Returns:

matplotlib axes with the plot.

tubesml.model_inspection.plot_shap_values(shap_values, features='all', savename=None)

Plots the shap values and interaction for all the specified features used in the model.

Parameters:

shap_values – the model shap values object (values, data, etc.).
features – string or list, default="all". List of features to plot. If all, all the features will be used.

tubesml.model_inspection.plot_two_pdp(data, feature, title, axes)

This function is still in development. Plot a 2-way partial dependence.

Parameters:

data – pandas DataFrame with the partial dependence. It must contain a feat, a x, a x_1, and a y column.
feature – tuple of strings. The features to plot as x and y axis in the partial dependence.
title – string. The title on top of the plot.
axes – matplotlib axes. The plot will take place in this axes.

Returns:

matplotlib axes with the plot.

tubesml.report.plot_classification_probs(data, true_label, pred_label, thrs=0.5, sample=None, feat=None, hue_feat=None, savename=None)

Plot prediction vs true label when the prediction is a probability. It also plots a confusion matrix.

data, true_label, and pred_label must be of compatible size.

hue_feat is ignored when the unique values are more than 5 for readability.
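A confusion matrix needs class labels, so the predicted probabilities have to be cut at a threshold first. The sketch below shows that step with scikit-learn; it assumes that thrs plays this role and that probabilities greater than or equal to it count as the positive class, neither of which is confirmed by the text above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

true_label = np.array([0, 0, 1, 1, 1, 0])
pred_label = np.array([0.10, 0.60, 0.80, 0.40, 0.95, 0.30])  # predicted probabilities
thrs = 0.5

# Turn probabilities into class labels (assumed convention: >= thrs is positive).
pred_class = (pred_label >= thrs).astype(int)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(true_label, pred_class)
print(cm)
```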

tubesml.report.plot_regression_predictions(data, true_label, pred_label, hue=None, feature=None, savename=None)

Plot prediction vs true label and the distribution of both the label and the predictions. It also displays the influence of categorical features via the hue parameter. You can also display the prediction vs one or more features in the data. Together with a residuals plot, this helps identify undesired patterns.

Parameters:

data – pandas DataFrame. Ideally, the dataframe used for training the model you are evaluating. It must have the same number of rows as true_label and pred_label.
true_label – pandas Series, numpy array, or list with the true values of the target variable.
pred_label – pandas Series, numpy array, or list with the predicted values of the target variable.
hue – (optional) str, name of the feature to use as hue in the scatter plot. It must be in data or it will be ignored after a warning. It is ignored when the unique values are more than 5 for readability.
feature – (optional) str or list, feature(s) to use as x-axis in the scatter plot against the prediction. Using this option will produce 2 more plots for each feature provided, one with the feature vs the prediction and one with the feature vs the residuals.
savename – (optional) str with the name of the file to use to save the figure. If not provided, the function simply plots the figure.
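The quantities behind the residuals plot and the hue rule can be sketched with numpy and pandas alone. This is an illustration under assumptions, not the tubesml code: the column names are made up, and the residual is taken as true value minus prediction, a sign convention the text above does not specify.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical data: one numeric feature and one categorical feature.
data = pd.DataFrame(
    {"size": rng.uniform(50, 150, 100), "zone": rng.choice(list("ABC"), 100)}
)
true_label = 2.0 * data["size"] + rng.normal(scale=5.0, size=100)
pred_label = 2.0 * data["size"]

# Residuals as true minus predicted (assumed sign convention).
residuals = np.asarray(true_label) - np.asarray(pred_label)

# The hue is only usable if it is a column of data with at most 5 unique values.
hue = "zone"
use_hue = hue in data.columns and data[hue].nunique() <= 5

print(residuals[:5], use_hue)
```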