Model Inspection
- tubesml.model_inspection.get_coef(pipe, feats=None)
Get dataframe with coefficients of a model in Pipeline.
The step before the model must have a get_feature_names_out method. If a simple estimator is provided, it creates a pipeline with a BaseTransformer. In that case, the feats input is not optional and there is no need for a get_feature_names_out method.
- Parameters:
pipe – pipeline or estimator
feats – (optional) list of features the estimator uses.
- Return result:
pandas DataFrame with a Feature column with the feature names and a score column with the coefficient values ordered by absolute magnitude.
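The shape of the returned table can be illustrated without the library. The sketch below is not tubesml code: rank_coefficients is a hypothetical helper, the feature names and values are made up, and the descending order is an assumption, since the description above only says the values are ordered by absolute magnitude.

```python
# Hypothetical sketch, not tubesml's implementation.
# Pairs feature names with coefficients and orders the rows by |coefficient|.


def rank_coefficients(feats, coefs):
    """Return rows with "Feature" and "score" keys, largest |score| first.

    Descending order is an assumption made for this illustration.
    """
    if len(feats) != len(coefs):
        raise ValueError("feats and coefs must have the same length")
    rows = [{"Feature": f, "score": c} for f, c in zip(feats, coefs)]
    return sorted(rows, key=lambda row: abs(row["score"]), reverse=True)


# Made-up feature names and coefficients.
ranked = rank_coefficients(["age", "income", "tenure"], [0.4, -2.5, 1.1])
for row in ranked:
    print(row["Feature"], row["score"])
```

Sorting on the absolute value keeps large negative coefficients at the top, which is why a plain sort on the score column would not give the same table.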
- tubesml.model_inspection.get_feature_importance(pipe, feats=None)
Get dataframe with the feature importance of a model in Pipeline.
The step before the model must have a get_feature_names_out method. If a simple estimator is provided, it creates a pipeline with a BaseTransformer. In that case, the feats input is not optional and there is no need for a get_feature_names_out method.
- Parameters:
pipe – pipeline or estimator
feats – (optional) list of features the estimator uses.
- Return result:
pandas DataFrame with a Feature column with the feature names and a score column with the feature importance values ordered by magnitude.
- tubesml.model_inspection.get_pdp(estimator, feature, data, grid_resolution=100)
Calculates the partial dependence of the model on a variable. It is a wrapper around sklearn.inspection.partial_dependence
- Parameters:
estimator – model or pipeline with a predict method. If the fit method was not previously called, it will throw an error.
feature – string or tuple of 2 strings. The feature for which to create the partial dependence. If it is a tuple, a 2-way partial dependence will be created.
data – pandas DataFrame. It must contain the features the estimator uses to generate predictions. If feature is not present in this dataframe, an error will be raised.
grid_resolution – Integer, default 100. The number of equally spaced points on the grid.
- Returns:
pandas DataFrame with columns x (the feature values in the grid), feat (the feature name), y (the values of the partial dependence). If feature is a tuple, there is also another column x_1 with the values in the grid of the second feature of the tuple. If feature is a string, x_1 is empty.
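As a rough illustration of what a one-way partial dependence measures, the sketch below computes it by hand in plain Python. It is not how tubesml or scikit-learn implement it: partial_dependence_1d and toy_predict are hypothetical names, the data are made up, and the grid simply spans the minimum to the maximum of the feature, whereas a real implementation may trim the range.

```python
# Hypothetical sketch of one-way partial dependence; not tubesml or sklearn code.


def partial_dependence_1d(predict, rows, feature, grid_resolution=5):
    """Average prediction when `feature` is forced to each grid value.

    predict: callable taking a list of dict rows and returning a list of floats.
    rows: list of dicts, one per observation.
    Returns rows with the keys x, feat, and y, mirroring the columns above.
    """
    if grid_resolution < 2:
        raise ValueError("grid_resolution must be at least 2")
    values = [row[feature] for row in rows]  # KeyError if the feature is absent
    low, high = min(values), max(values)
    step = (high - low) / (grid_resolution - 1)
    grid = [low + i * step for i in range(grid_resolution)]  # equally spaced
    result = []
    for x in grid:
        # Overwrite the feature in every row, keep the other features as observed.
        modified = [{**row, feature: x} for row in rows]
        preds = predict(modified)
        result.append({"x": x, "feat": feature, "y": sum(preds) / len(preds)})
    return result


def toy_predict(rows):
    """Stand-in for a fitted model: a fixed linear function of a and b."""
    return [2.0 * row["a"] + row["b"] for row in rows]


data = [{"a": 0.0, "b": 1.0}, {"a": 1.0, "b": 3.0}, {"a": 2.0, "b": 5.0}]
pdp = partial_dependence_1d(toy_predict, data, "a", grid_resolution=3)
for point in pdp:
    print(point["x"], point["y"])
```

Because the toy model is linear in a with slope 2, the averaged prediction rises by 2 for each unit step on the grid, which is the pattern a partial dependence plot would show.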
- tubesml.model_inspection.plot_feat_imp(data, n=-1, imp='shap', savename=None)
Plots a barplot with error bars of feature importance. It works with coefficients too.
- Parameters:
data – pandas DataFrame with a mean and a std column. A KeyError is raised if any of these columns is missing. If shap is selected, the columns have to be shap_importance and shap_std.
n – int, default=-1. Number of features to display.
imp – string, default=shap. Allowed values are shap, standard, or both. If shap, the importances coming from shap values will be in the plot. If standard, the one coming from the model method. If both, 2 plots will be produced side by side.
savename – (optional) str with the name of the file to use to save the figure. If not provided, the function simply plots the figure.
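The column requirements above can be checked before plotting. The following is a minimal sketch with hypothetical helpers (required_columns, check_columns) that are not part of tubesml. Treating both as needing all four columns is an assumption, since the description does not spell that case out.

```python
# Hypothetical pre-flight check; these helpers are not part of tubesml.

REQUIRED = {
    "standard": ("mean", "std"),
    "shap": ("shap_importance", "shap_std"),
}


def required_columns(imp="shap"):
    """Columns expected for each allowed value of imp."""
    if imp == "both":
        # Assumption: "both" needs the columns of the two single modes.
        return REQUIRED["standard"] + REQUIRED["shap"]
    if imp not in REQUIRED:
        raise ValueError("imp must be 'shap', 'standard', or 'both'")
    return REQUIRED[imp]


def check_columns(columns, imp="shap"):
    """Raise KeyError when a required column is missing."""
    missing = [c for c in required_columns(imp) if c not in columns]
    if missing:
        raise KeyError(f"missing columns: {missing}")


check_columns(["feat", "mean", "std"], imp="standard")  # passes silently
print(required_columns("both"))
```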
- tubesml.model_inspection.plot_learning_curve(estimator, X, y, scoring=None, ylim=None, cv=None, n_jobs=None, train_sizes=None, title=None)
Plot the learning curve and scalability of the model. The estimate is an average across the cross-validation folds, and the uncertainty is the unbiased standard deviation of the mean.
It may create issues when both the estimator and this function have n_jobs>1.
Moreover, it doesn’t behave well with early stopping, which produces no result. In that case, a RuntimeError is raised.
- Parameters:
estimator – estimator or pipeline.
X – array-like of shape (n_samples, n_features). The training input samples.
y – array-like of shape (n_samples,) or (n_samples, n_outputs), optional. The target values (class labels) as integers or strings.
scoring – string. Scoring metric for the learning curve, see the sklearn documentation for the available options.
ylim – (optional) tuple with the limits to use in the y-axis of the plots.
cv – int, or KFold generator. The learning curves will be computed with prediction out of folds generated by this cross-validation choice.
n_jobs – int, number of jobs to run in parallel.
train_sizes – array-like of shape (n_ticks,), default=np.linspace(0.1, 1.0, 5). Relative or absolute numbers of training examples that will be used to generate the learning curve. If the dtype is float, it is regarded as a fraction of the maximum size of the training set (which is determined by the selected validation method), i.e. it has to be within (0, 1]. Otherwise it is interpreted as absolute sizes of the training sets. Note that for classification the number of samples usually has to be big enough to contain at least one sample from each class.
title – (optional) string for the figure title.
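The two readings of train_sizes described above can be sketched as follows. resolve_train_sizes is a hypothetical helper written for this illustration, not a tubesml or scikit-learn function, and the maximum training size of 80 assumes 100 samples with 5-fold cross-validation.

```python
# Hypothetical sketch of how train_sizes can be read; not library code.


def resolve_train_sizes(train_sizes, n_max):
    """Turn relative or absolute train sizes into absolute sample counts.

    n_max is the largest training set the cross-validation scheme allows.
    A sequence containing floats is read as fractions of n_max in (0, 1];
    a sequence of ints is read as absolute sizes.
    """
    as_fractions = any(isinstance(size, float) for size in train_sizes)
    resolved = []
    for size in train_sizes:
        if as_fractions:
            if not 0.0 < size <= 1.0:
                raise ValueError("fractions must be within (0, 1]")
            resolved.append(max(1, int(size * n_max)))
        else:
            if not 0 < size <= n_max:
                raise ValueError("absolute sizes must be within (0, n_max]")
            resolved.append(size)
    return resolved


# Assumed setup: 100 samples and 5 folds leave 80 rows in each training split.
print(resolve_train_sizes([0.25, 0.5, 1.0], n_max=80))
print(resolve_train_sizes([10, 40], n_max=80))
```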
- tubesml.model_inspection.plot_partial_dependence(pdps, savename=None)
Plot all the pdps in the dataframe in a plot with 2 columns and as many rows as necessary. The function is a wrapper around tubesml.plot_pdp
- Parameters:
pdps – pandas DataFrame with the partial dependences. It must contain at least a feat and an x column.
- tubesml.model_inspection.plot_pdp(data, feature, title, axes)
Plot partial dependence of a feature in an ax. If available, the uncertainty is plotted around it.
- Parameters:
data – pandas DataFrame with the partial dependence. It must contain a feat column, an x column, and either a y or a mean column. If there is an std column, it will be plotted as uncertainty around the mean.
feature – string. The feature to plot as x axis in the partial dependence.
title – string. The title on top of the plot.
axes – matplotlib axes. The plot will take place in this axes.
- Returns:
matplotlib axes with the plot.
- tubesml.model_inspection.plot_shap_values(shap_values, features='all', savename=None)
Plots the shap values and interaction for all the specified features used in the model.
- Parameters:
shap_values – the model shap values object (values, data, etc).
features – string or list, default = “all”. List of features to plot. If all, all the features will be used.
- tubesml.model_inspection.plot_two_pdp(data, feature, title, axes)
This function is still in development. Plot a 2-way partial dependence.
- Parameters:
data – pandas DataFrame with the partial dependence. It must contain feat, x, x_1, and y columns.
feature – tuple of strings. The features to plot as x and y axis in the partial dependence.
title – string. The title on top of the plot.
axes – matplotlib axes. The plot will take place in this axes.
- Returns:
matplotlib axes with the plot.
- tubesml.report.plot_classification_probs(data, true_label, pred_label, thrs=0.5, sample=None, feat=None, hue_feat=None, savename=None)
Plot prediction vs true label when the prediction is a probability. It also plots a confusion matrix. data, true_label, and pred_label must be of compatible size. hue_feat is ignored when it has more than 5 unique values, for readability.
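To make the role of thrs concrete, the sketch below turns probabilities into class labels and counts the four confusion-matrix cells in plain Python. confusion_counts is a hypothetical helper, not the function's internals, and counting a probability equal to the threshold as positive is an assumption.

```python
# Hypothetical sketch of thresholding probabilities; not tubesml code.


def confusion_counts(true_label, pred_prob, thrs=0.5):
    """Count true/false positives and negatives for binary labels 0 and 1.

    Assumption: a probability equal to thrs is counted as the positive class.
    """
    if len(true_label) != len(pred_prob):
        raise ValueError("true_label and pred_prob must have the same length")
    counts = {"tn": 0, "fp": 0, "fn": 0, "tp": 0}
    for truth, prob in zip(true_label, pred_prob):
        predicted = 1 if prob >= thrs else 0
        if truth == 1:
            counts["tp" if predicted == 1 else "fn"] += 1
        else:
            counts["fp" if predicted == 1 else "tn"] += 1
    return counts


# Made-up labels and predicted probabilities.
counts = confusion_counts([0, 0, 1, 1, 1], [0.1, 0.7, 0.4, 0.8, 0.95])
print(counts)
```

Raising thrs trades false positives for false negatives, which is what the confusion matrix in the plot lets you inspect.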
- tubesml.report.plot_regression_predictions(data, true_label, pred_label, hue=None, feature=None, savename=None)
Plot prediction vs true label and the distribution of both the label and the predictions. Display also the influence of categorical features via the hue parameter. You can also display the prediction vs a feature or more in the data. This will help identify non-desired patterns also with the help of a residuals plot.
- Parameters:
data – pandas DataFrame. Ideally, the dataframe used for training the model you are evaluating. It must have the same number of rows as true_label and pred_label.
true_label – pandas Series, numpy array, or list with the true values of the target variable.
pred_label – pandas Series, numpy array, or list with the predicted values of the target variable.
hue – (optional) str, name of the feature to use as hue in the scatter plot. It must be in data or it will be ignored after a warning. It is ignored when the unique values are more than 5 for readability.
feature – (optional) str or list, feature(s) to use as x-axis in the scatter plot against the prediction. Using this option will produce 2 more plots for each feature provided, one with the feature vs the prediction and one with the feature vs the residuals.
savename – (optional) str with the name of the file to use to save the figure. If not provided, the function simply plots the figure.
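Two of the rules above, the residuals plotted against a feature and the limit of 5 unique values for hue, can be sketched in plain Python. Both helpers are hypothetical and not part of tubesml. The sign convention (true minus predicted) is an assumption, as the description does not state it.

```python
# Hypothetical helpers illustrating the rules above; not tubesml code.

MAX_HUE_LEVELS = 5  # hue is ignored above this number of unique values


def residuals(true_label, pred_label):
    """Residuals as true minus predicted (sign convention assumed here)."""
    if len(true_label) != len(pred_label):
        raise ValueError("true_label and pred_label must have the same length")
    return [t - p for t, p in zip(true_label, pred_label)]


def hue_is_usable(values, max_levels=MAX_HUE_LEVELS):
    """True when the feature has few enough unique values to colour by."""
    return len(set(values)) <= max_levels


# Made-up targets and predictions.
res = residuals([3.0, 5.0, 7.5], [2.5, 5.0, 8.0])
print(res)
print(hue_is_usable(["north", "south", "north"]))
print(hue_is_usable(list(range(6))))
```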