Feature Engineering

This section describes the classes that create new features. They cover:

creating polynomial features
target encoding categorical features
using PCA to create new features or compress the data

In every case, the output is a pandas DataFrame, so any further manipulation of the data is as easy as the first.

class tubesml.poly.DfPolynomial(degree=2, interaction_only=False, include_bias=False, to_interact='all')

Wrapper around PolynomialFeatures.

Inherits from BaseTransformer.

Parameters:

degree – int, default=2. The degree of the polynomial features.
interaction_only – bool, default=False. If True, only interaction features are produced: features that are products of at most degree distinct input features (so not x[1] ** 2, x[0] * x[2] ** 3, etc.).
include_bias – bool, default=False. If True, include a bias column, the feature in which all polynomial powers are zero (i.e. a column of ones, which acts as an intercept term in a linear model). The column is added with the name BIAS_TERM.
to_interact – str or list of strings, default=’all’. Columns to consider for the interactions. If ‘all’, then all the columns of the DataFrame will be used. If a list of columns is provided, only those columns will be used for creating the interactions. All the other columns will still be in the output DataFrame.

fit(X, y=None)

Method to train the transformer.

Depending on the to_interact attribute, it fits on a different slice of the input DataFrame.

It also resets the columns attribute.

Parameters:

X – pandas DataFrame of shape (n_samples, n_features). The training input samples.
y – array-like of shape (n_samples,) or (n_samples, n_outputs). Not used. The target values (class labels) as integers or strings.

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → DfPolynomial

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

sample_weight – str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED. Metadata routing for the sample_weight parameter in score.

Returns:

self – object. The updated object.

transform(X, y=None)

Method to transform the input data.

It populates the columns attribute with the columns of the output data.

If a bias term is included, it will be called BIAS_TERM.

Parameters:

X – pandas DataFrame of shape (n_samples, n_features). The input samples.
y – array-like of shape (n_samples,) or (n_samples, n_outputs). Not used. The target values (class labels) as integers or strings.

Returns:

pandas DataFrame with polynomial features
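To illustrate how degree, interaction_only and to_interact shape the output columns, here is a dependency-free sketch that works on a single row stored as a dict. It is not the tubesml implementation: the function name polynomial_columns is hypothetical, and the term names ("a^2", "a b") and column order are assumptions borrowed from scikit-learn's naming convention.

```python
from itertools import combinations, combinations_with_replacement


def polynomial_columns(row, to_interact="all", degree=2, interaction_only=False):
    """Return the original columns plus polynomial terms up to `degree`.

    `row` maps column name -> numeric value for one sample.
    Hypothetical helper; term naming follows scikit-learn by assumption.
    """
    cols = list(row) if to_interact == "all" else list(to_interact)
    # interaction_only=True forbids repeating a column inside one term
    combine = combinations if interaction_only else combinations_with_replacement

    out = dict(row)  # columns outside `to_interact` pass through unchanged
    for d in range(2, degree + 1):
        for combo in combine(cols, d):
            value = 1.0
            for c in combo:
                value *= row[c]
            # Build a readable name such as "a^2" or "a b"
            parts = []
            for c in dict.fromkeys(combo):
                power = combo.count(c)
                parts.append(c if power == 1 else f"{c}^{power}")
            out[" ".join(parts)] = value
    return out


row = {"a": 2.0, "b": 3.0, "c": 5.0}
full = polynomial_columns(row, to_interact=["a", "b"])
inter = polynomial_columns(row, to_interact=["a", "b"], interaction_only=True)
print(full)   # squares and the cross term of a and b; c is untouched
print(inter)  # only the cross term "a b" is added
```

Column c is never combined with anything because it is not listed in to_interact, yet it is still present in the output, which mirrors the behaviour described for the to_interact parameter.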

class tubesml.encoders.TargetEncoder(to_encode=None, prior_weight=100, agg_func='mean')

Heavily inspired by MaxHalford.

Encodes categorical features with statistics of the target variable. For example, by using the mean target value.

Other aggregation functions are supported; for now, the function must be provided as a string accepted by the pandas agg method.

Inherits from BaseTransformer.

Parameters:

to_encode – str, list or None, default=None. Column or list of columns to encode according to agg_func. If None, all the non-numerical columns are encoded.
prior_weight – int or float, default=100. Weight given to the prior. The higher the value, the more important the prior is. The prior is the statistic of the target determined by agg_func.
agg_func – str, default=’mean’. Aggregation function to use for the target encoding.

fit(X, y)

Method to train the encoder by determining the posterior of each column.

If to_encode is None, it will encode all the non-numerical columns.

It also resets the columns attribute.

Parameters:

X – pandas DataFrame of shape (n_samples, n_features). The training input samples.
y – array-like of shape (n_samples,) or (n_samples, n_outputs). The target values (or class labels) as integers or floats.
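The role of prior_weight is easier to see with numbers. The sketch below is a pure-Python illustration for one column with agg_func='mean'. It assumes the additive smoothing formula (n * category_mean + prior_weight * prior) / (n + prior_weight); the documentation above does not spell out the formula, so the library's exact computation may differ. The function name fit_target_encoding is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def fit_target_encoding(categories, target, prior_weight=100):
    """Return (prior, {category: smoothed mean}) for one categorical column.

    Assumed formula (not confirmed by the library docs):
        (n * category_mean + prior_weight * prior) / (n + prior_weight)
    """
    prior = mean(target)  # global statistic of the target
    groups = defaultdict(list)
    for cat, y in zip(categories, target):
        groups[cat].append(y)

    posterior = {}
    for cat, ys in groups.items():
        n = len(ys)
        posterior[cat] = (n * mean(ys) + prior_weight * prior) / (n + prior_weight)
    return prior, posterior


colors = ["red", "red", "red", "blue"]
y = [1, 1, 0, 0]

prior, weak = fit_target_encoding(colors, y, prior_weight=0)
_, strong = fit_target_encoding(colors, y, prior_weight=100)
print(prior)   # global mean of the target
print(weak)    # raw per-category means
print(strong)  # values pulled towards the prior
```

With prior_weight=0 each category keeps its raw mean, while a large prior_weight pulls every category towards the global mean, which protects rare categories from overfitting.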

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → TargetEncoder

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

sample_weight – str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED. Metadata routing for the sample_weight parameter in score.

Returns:

self – object. The updated object.

transform(X, y=None)

Method to transform the input data.

It populates the columns attribute with the columns of the output data.

For each column to encode, it replaces each value with the posterior computed in the fit method. Missing values are filled in with the prior (i.e. the statistic of the target determined by agg_func).

Parameters:

X – pandas DataFrame of shape (n_samples, n_features). The input samples.
y – array-like of shape (n_samples,) or (n_samples, n_outputs). Not used. The target values (class labels) as integers or strings.

Returns:

pandas DataFrame with encoded features
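The replacement step can be pictured as a lookup with a fallback. This is a minimal sketch with a hypothetical helper, apply_target_encoding, and made-up fitted values. Treating categories that were not seen during fit the same way as missing values is an assumption, not something the text above states.

```python
def apply_target_encoding(values, posterior, prior):
    """Replace each category with its fitted value, falling back to the prior.

    Hypothetical helper: unseen categories and None both receive the prior.
    """
    return [posterior.get(v, prior) for v in values]


# Illustrative values, as if they had been learned during fit
posterior = {"red": 0.7, "blue": 0.2}
prior = 0.5

encoded = apply_target_encoding(["red", "green", None, "blue"], posterior, prior)
print(encoded)
```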

class tubesml.pca.DfPCA(n_components, svd_solver='auto', random_state=24, compress=False)

Wrapper around PCA that keeps the DataFrame structure. It can also return the same DataFrame in a compressed form, i.e. by applying PCA and then inverting it.

Inherits from BaseTransformer.

Parameters:

n_components – int, float or ‘mle’. Number of components to keep. If n_components is not set, all components are kept: n_components == min(n_samples, n_features). If n_components == 'mle' and svd_solver == 'full', Minka’s MLE is used to guess the dimension. Use of n_components == 'mle' will interpret svd_solver == 'auto' as svd_solver == 'full'. If 0 < n_components < 1 and svd_solver == 'full', select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components. If svd_solver == 'arpack', the number of components must be strictly less than the minimum of n_features and n_samples. Hence, the None case results in: n_components == min(n_samples, n_features) - 1.
svd_solver – {‘auto’, ‘full’, ‘arpack’, ‘randomized’}, default=‘auto’. If ‘auto’: the solver is selected by a default policy based on X.shape and n_components: if the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, then the more efficient ‘randomized’ method is enabled. Otherwise the exact full SVD is computed and optionally truncated afterwards. If ‘full’: run exact full SVD calling the standard LAPACK solver via scipy.linalg.svd and select the components by postprocessing. If ‘arpack’: run SVD truncated to n_components calling the ARPACK solver via scipy.sparse.linalg.svds. It requires strictly 0 < n_components < min(X.shape). If ‘randomized’: run randomized SVD by the method of Halko et al.
random_state – int, RandomState instance or None, default=24. Used when the ‘arpack’ or ‘randomized’ solvers are used. Pass an int for reproducible results across multiple function calls.
compress – bool, default=False. If True, it reverses the PCA via inverse_transform and returns a DataFrame with the original structure. It can be useful to remove noise from the data by compressing the information.
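When n_components is a float between 0 and 1, the number of components is chosen from the cumulative explained variance. The following is a small pure-Python sketch of that selection rule, assuming the per-component variances are already known and sorted in decreasing order. The helper name components_for_variance and the sample variances are made up for illustration.

```python
from itertools import accumulate


def components_for_variance(explained_variance, threshold):
    """Smallest number of components whose cumulative share of the
    variance is strictly greater than `threshold` (0 < threshold < 1).

    Hypothetical helper; expects variances sorted in decreasing order.
    """
    total = sum(explained_variance)
    for k, cumulative in enumerate(accumulate(explained_variance), start=1):
        if cumulative / total > threshold:
            return k
    return len(explained_variance)


variances = [4.0, 2.0, 1.0, 1.0]  # cumulative shares: 0.5, 0.75, 0.875, 1.0
k_80 = components_for_variance(variances, 0.8)
k_50 = components_for_variance(variances, 0.5)
print(k_80, k_50)
```

Note that a threshold of 0.5 is not satisfied by the first component alone, because the explained share has to be greater than the requested value, not merely equal to it.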

fit(X, y=None)

Method to train the transformer.

It also resets the columns attribute.

Parameters:

X – pandas DataFrame of shape (n_samples, n_features). The training input samples.
y – array-like of shape (n_samples,) or (n_samples, n_outputs). Not used. The target values (class labels) as integers or strings.

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → DfPCA

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

sample_weight – str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED. Metadata routing for the sample_weight parameter in score.

Returns:

self – object. The updated object.

transform(X, y=None)

Method to transform the input data.

It populates the columns attribute with the columns of the output data.

The resulting columns are named pca_{int}.

If compress=True, the inverse_transform method is called and the original columns are restored.

Parameters:

X – pandas DataFrame of shape (n_samples, n_features). The input samples.
y – array-like of shape (n_samples,) or (n_samples, n_outputs). Not used. The target values (class labels) as integers or strings.

Returns:

pandas DataFrame with pca columns or, if compress=True, pandas DataFrame with original columns
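The difference between the two return modes can be sketched without any dependency for the simplest case: two input columns reduced to one component. The code below is an illustration under stated assumptions, not the library's implementation. The function pca_1d_compress is hypothetical and uses the closed-form leading eigenvector of a 2x2 covariance matrix instead of an SVD solver.

```python
import math


def pca_1d_compress(points):
    """Project 2-D points onto their first principal axis and map them back.

    Returns (scores, rebuilt): the single PCA coordinate of each point
    (what compress=False would expose as one pca column) and the points
    rebuilt in the original two columns (the compress=True idea).
    """
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]

    # Entries of the 2x2 covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Closed-form direction of the leading eigenvector of a symmetric 2x2 matrix
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    dx, dy = math.cos(theta), math.sin(theta)

    scores = [x * dx + y * dy for x, y in centered]         # forward projection
    rebuilt = [(mx + s * dx, my + s * dy) for s in scores]  # inverse projection
    return scores, rebuilt


# Noisy samples scattered around the line y = 2x
points = [(0.0, 0.1), (1.0, 1.9), (2.0, 4.1), (3.0, 5.9)]
scores, rebuilt = pca_1d_compress(points)
print(scores)   # one value per sample
print(rebuilt)  # two values per sample, all lying on a single line
```

The rebuilt points keep the original two-column layout but lose the scatter perpendicular to the main axis, which is the noise-removal effect mentioned for compress=True.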