Model Selection

This section provides three tools to support you in selecting your model.

Split the data into a train set and a test set.
Create out-of-fold predictions on the entire dataset while getting insights into what your model uses to reach those predictions.
Perform a grid search on your model or pipeline. You can also test which data processing steps (or which new features) lead to better results.

class tubesml.CV_score.CrossValidate(data, target, estimator, cv, test=None, target_proc=None, imp_coef=False, pdp=None, shap=False, class_pos=1, shap_sample=700, predict_proba=False, early_stopping=False, fit_params=None, regression=True, multiclass=False, check_shap_additivity=True)

Train and test a pipeline in k-fold cross-validation.

Parameters:

data – pandas DataFrame. Data to train and validate the estimator on.
target – numpy array or pandas Series. Target column.
estimator – sklearn compatible estimator. It must have a predict method and a get_params method. It can be a Pipeline. If it is not a Pipeline, it will be made one for compatibility with other functionalities.
cv – KFold or StratifiedKFold object. For cross-validation, the estimates will be done across these folds.
test – pandas DataFrame, default=None. Data to predict on within each fold. If provided, each model trained in each fold predicts on this set. The predictions are then averaged across the folds. If it is a classification problem and we are not predicting the probabilities, the most frequent class is used. If there is no majority class (it can happen with an even number of folds), the class is chosen at random.
target_proc – function, default=None. It must take target as input (in this context it can be one or more arrays or series) and 2 indices for train and validation. It must return 2 series with the train and validation targets.
imp_coef – bool, default=False. If True, returns the feature importance or the coefficient values averaged across the folds, with standard deviation on the mean.
pdp – string or list, default=None. If not None, returns the partial dependence of the given features averaged across the folds, with standard deviation on the mean. The partial dependence of 2 features simultaneously is not supported.
shap – bool, default=False. If True, it calculates the shap values for a sample of the data in each fold. In that case, the results will also contain the shap values (concatenated), and the feature importance will include the one derived from the shap values. WARNING: if you can’t guarantee the same number of features in each fold, the shap calculation will break.
class_pos – int, default=1. Position of the class of interest, relevant if using predict_proba and for some shap values explainers. If None, the probabilities of all classes are returned, but this conflicts with some shap values explainers.
shap_sample – int, default=700. Number of samples to calculate the shap values in each fold.
predict_proba – bool, default=False. If True, calls the predict_proba method instead of the predict one.
early_stopping – bool, default=False. If True, uses early stopping within the folds for the estimators that support it.
fit_params – dict, default=None. If a dictionary is provided, it will pass it to the fit method. This is useful to control the verbosity of the fit method as some packages like XGBoost and LightGBM do not do that in the estimator declaration.
regression – bool, default=True. If True, the predictions on the test set will be averaged across folds. Set it to False if the problem is a classification problem and you are not using predict_proba.
multiclass – bool, default=False. Set to True to deal with multiclass classification problems. The test set predictions in this case are a vote across the folds.
check_shap_additivity – bool, default=True. If False, it will skip the additivity check during the shap values calculation. This may be necessary for a few models when the discrepancy is really small. It often happens with infrequent dummies.
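
The test, regression and multiclass parameters describe two ways of combining per-fold predictions on the test set: averaging, or voting with a random tie-break. The following is a minimal sketch of that idea using only the Python standard library. It is not the tubesml implementation, and the helper names average_folds and vote_folds are hypothetical.

```python
import random
from collections import Counter
from statistics import mean


def average_folds(fold_preds):
    """Average per-row predictions across folds (regression or probabilities)."""
    return [mean(row) for row in zip(*fold_preds)]


def vote_folds(fold_preds, seed=None):
    """Pick the most frequent class per row; break ties at random."""
    rng = random.Random(seed)
    voted = []
    for row in zip(*fold_preds):
        counts = Counter(row)
        top = max(counts.values())
        # All classes that reached the highest count for this row.
        winners = sorted(c for c, n in counts.items() if n == top)
        voted.append(winners[0] if len(winners) == 1 else rng.choice(winners))
    return voted


# Three folds, each predicting on the same 4 test rows (made-up numbers).
reg_preds = [[1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 3.0, 6.0], [3.0, 2.0, 3.0, 5.0]]
clf_preds = [[0, 1, 1, 0], [0, 1, 0, 0], [1, 1, 0, 0]]

print(average_folds(reg_preds))
print(vote_folds(clf_preds, seed=0))
```

With an odd number of folds and two classes a tie cannot occur, which is why the random choice only matters in cases such as an even number of folds.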

Return oof:

numpy array with the out of fold predictions for the entire train set.

Return res_dict:

A dictionary with additional results. If imp_coef=True, it contains a pd.DataFrame with the coefficients or feature importances of the estimator under the key feat_imp. If early_stopping=True, it contains a list with the best iteration number per fold under the key iterations. If pdp is not None, it contains a pd.DataFrame with the partial dependence of the given features under the key pdp. If shap=True, it contains the shap values under the key shap_values, and the feature importance also includes the average shap values.

Return pred:

(optional) numpy array with the prediction done on the test set (if provided).

score(): Main method to loop over the folds, train and predict. It produces out of fold predictions and, if provided, an average prediction on the test set. It can also produce various insights on the model, like feature importance and pdp’s.

tubesml.model_selection.grid_search(data, target, estimator, param_grid, scoring, cv, random=False)

Calls a grid or a randomized search over a parameter grid.

Parameters:

data – pandas DataFrame. Data to tune the hyperparameters.
target – numpy array or pandas Series. Target column.
estimator – sklearn compatible estimator. It must have a predict method and a get_params method. It can be a Pipeline.
param_grid – dict. Dictionary of the parameter space to explore. In case the estimator is a pipeline, provide the keys in the format step__param.
scoring – string. Scoring metric for the grid search, see the sklearn documentation for the available options.
cv – KFold object or int. For cross-validation.
random – bool, default=False. If True, runs a randomized search instead of a grid search.

Returns:

A DataFrame with the results for each configuration.
A dictionary with the best parameters.
The best (fitted) estimator.
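
As an illustration of how a param_grid with step__param keys maps to individual configurations, the sketch below enumerates a grid with itertools.product. The step names scaler and model, the parameter values, and the helpers expand_grid and split_step_param are assumptions made for this example. The actual search is delegated to grid_search and is not reproduced here.

```python
from itertools import product


def expand_grid(param_grid):
    """Return every combination of a {name: [values]} grid as a list of dicts."""
    names = sorted(param_grid)
    return [dict(zip(names, combo))
            for combo in product(*(param_grid[name] for name in names))]


def split_step_param(key):
    """Split a 'step__param' key into (step, param); plain keys have no step."""
    step, sep, param = key.partition("__")
    return (step, param) if sep else (None, key)


# Hypothetical pipeline with steps named "scaler" and "model".
param_grid = {
    "scaler__with_mean": [True, False],
    "model__alpha": [0.1, 1.0, 10.0],
}

configs = expand_grid(param_grid)
print(len(configs))  # 2 * 3 = 6 configurations
print(configs[0])
print(split_step_param("model__alpha"))
```

A grid search evaluates every configuration in such a list, whereas a randomized search samples only some of them.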

tubesml.model_selection.make_test(train, test_size, random_state, strat_feat=None)

Creates a train and a test set, stratified on a feature or on a list of features.

Parameters:

train – pandas DataFrame.
test_size – float. The size of the test set. It must be between 0 and 1.
random_state – int. Random state used to split the data.
strat_feat – str or list, default=None. The feature or features to use to stratify the split.

Returns:

A train set and a test set.
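
The idea behind a stratified split can be sketched with the standard library alone: group the rows by the stratification feature, then take the same fraction of each group for the test set. The function stratified_split and the segment column below are hypothetical and only approximate the behaviour described for make_test. Rounding within each group is an assumption of this sketch.

```python
import random
from collections import defaultdict


def stratified_split(rows, test_size, random_state, strat_key):
    """Split rows into train/test, keeping each stratum's share roughly equal."""
    rng = random.Random(random_state)
    groups = defaultdict(list)
    for row in rows:
        groups[row[strat_key]].append(row)

    train, test = [], []
    for key in sorted(groups):
        members = groups[key][:]
        rng.shuffle(members)
        # Number of rows from this stratum that go to the test set.
        n_test = round(len(members) * test_size)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# 12 made-up rows: 8 in segment "a" and 4 in segment "b".
rows = [{"id": i, "segment": "a" if i < 8 else "b"} for i in range(12)]
train, test = stratified_split(rows, test_size=0.25, random_state=42,
                               strat_key="segment")
print(len(train), len(test))
```

Both sets keep the 2:1 ratio between the two segments that the full data has.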