Feature Engineering
This section documents the classes that create new features. These include:
creating polynomial features
target encoding categorical features
using PCA to create new features or compress the data
In every case, the output is going to be a pandas DataFrame so that any further manipulation of the data can be as easy as the first.
- class tubesml.poly.DfPolynomial(degree=2, interaction_only=False, include_bias=False, to_interact='all')
Wrapper around PolynomialFeatures.
Inherits from BaseTransformer.
- Parameters:
degree – int, default=2 The degree of the polynomial features.
interaction_only – bool, default=False. If True, only interaction features are produced: features that are products of at most degree distinct input features (so not x[1] ** 2, x[0] * x[2] ** 3, etc.).
include_bias – bool, default=False. If True, then include a bias column, the feature in which all polynomial powers are zero (i.e. a column of ones, which acts as an intercept term in a linear model). The column is added with the name BIAS_TERM.
to_interact – str or list of strings, default='all'. Columns to consider for the interactions. If 'all', all the columns of the DataFrame are used. If a list of columns is provided, only those columns are used to create the interactions. All the other columns are still included in the output DataFrame.
- fit(X, y=None)
Method to train the transformer.
Depending on the to_interact attribute, it fits on different slices of the input DataFrame. It also resets the columns attribute.
- Parameters:
X – pandas DataFrame of shape (n_samples, n_features) The training input samples.
y – array-like of shape (n_samples,) or (n_samples, n_outputs), Not used The target values (class labels) as integers or strings.
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') DfPolynomial
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
sample_weight – str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED. Metadata routing for the sample_weight parameter in score.
- Returns:
self – object
The updated object.
- transform(X, y=None)
Method to transform the input data.
It populates the columns attribute with the columns of the output data.
If a bias term is included, it will be called BIAS_TERM.
- Parameters:
X – pandas DataFrame of shape (n_samples, n_features) The input samples.
y – array-like of shape (n_samples,) or (n_samples, n_outputs), Not used The target values (class labels) as integers or strings.
- Returns:
pandas DataFrame with polynomial features
- class tubesml.encoders.TargetEncoder(to_encode=None, prior_weight=100, agg_func='mean')
Heavily inspired by MaxHalford.
Encodes categorical features with statistics of the target variable. For example, by using the mean target value.
Other aggregation functions are allowed. For now, the function is assumed to be provided as a string accepted by the pandas agg method.
Inherits from BaseTransformer.
- Parameters:
to_encode – str, list or None, default=None. Column or list of columns to encode according to agg_func. If None, all the non-numerical columns are encoded.
prior_weight – int or float, default=100. Value to weight the prior. The higher the value, the more important the prior is. The prior is the statistic of the target determined by agg_func.
agg_func – str, default='mean'. Aggregation function to use for the target encoding.
- fit(X, y)
Method to train the encoder by determining the posterior of each column.
If to_encode is None, it will encode all the non-numerical columns. It also resets the columns attribute.
- Parameters:
X – pandas DataFrame of shape (n_samples, n_features) The training input samples.
y – array-like of shape (n_samples,) or (n_samples, n_outputs). The target values (or class labels) as integers or floats.
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') TargetEncoder
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
sample_weight – str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED. Metadata routing for the sample_weight parameter in score.
- Returns:
self – object
The updated object.
- transform(X, y=None)
Method to transform the input data.
It populates the columns attribute with the columns of the output data.
For each column to encode, it replaces each value with the posterior computed in the fit method. Missing values are filled in with the prior (i.e. the statistic of the target determined by agg_func).
- Parameters:
X – pandas DataFrame of shape (n_samples, n_features) The input samples.
y – array-like of shape (n_samples,) or (n_samples, n_outputs), Not used The target values (class labels) as integers or strings.
- Returns:
pandas DataFrame with encoded features
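The interplay between prior, posterior and prior_weight can be illustrated with plain pandas. This is a sketch under an explicit assumption: the posterior is computed with additive smoothing, (count * statistic + prior_weight * prior) / (count + prior_weight). The documentation above does not state the formula, so the one used by TargetEncoder may differ. The helper names fit_target_encoding and apply_target_encoding are hypothetical.

```python
import pandas as pd


def fit_target_encoding(x, y, prior_weight=100, agg_func="mean"):
    """Hypothetical helper: return the prior and the posterior of each category."""
    # Prior: the statistic of the whole target.
    prior = y.agg(agg_func)
    # Per-category count and statistic of the target.
    stats = y.groupby(x).agg(["count", agg_func])
    # Assumed additive smoothing: small categories are pulled towards the prior.
    posterior = (stats["count"] * stats[agg_func] + prior_weight * prior) / (
        stats["count"] + prior_weight
    )
    return prior, posterior


def apply_target_encoding(x, prior, posterior):
    """Hypothetical helper: map categories to posteriors, fill gaps with the prior."""
    return x.map(posterior).fillna(prior)


train = pd.DataFrame({"city": ["a", "a", "a", "b"], "target": [1, 1, 1, 0]})
prior, posterior = fit_target_encoding(train["city"], train["target"], prior_weight=2)

new = pd.Series(["a", "b", "z"], name="city")
encoded = apply_target_encoding(new, prior, posterior)
print(encoded.tolist())
```

With prior_weight=2 the prior is 0.75, category a is encoded as 0.9 and b as 0.5. The unseen category z has no posterior and falls back to the prior. A larger prior_weight moves every category closer to 0.75.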
- class tubesml.pca.DfPCA(n_components, svd_solver='auto', random_state=24, compress=False)
Wrapper around PCA that keeps the DataFrame structure. It can also return the same DataFrame in a compressed form, i.e. by applying and then inverting the PCA.
Inherits from BaseTransformer.
- Parameters:
n_components – int, float or 'mle'. Number of components to keep. If n_components is not set, all components are kept: n_components == min(n_samples, n_features). If n_components == 'mle' and svd_solver == 'full', Minka's MLE is used to guess the dimension. Use of n_components == 'mle' will interpret svd_solver == 'auto' as svd_solver == 'full'. If 0 < n_components < 1 and svd_solver == 'full', select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components. If svd_solver == 'arpack', the number of components must be strictly less than the minimum of n_features and n_samples. Hence, the None case results in: n_components == min(n_samples, n_features) - 1.
svd_solver – {'auto', 'full', 'arpack', 'randomized'}, default='auto'.
If auto: the solver is selected by a default policy based on X.shape and n_components: if the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, then the more efficient 'randomized' method is enabled. Otherwise the exact full SVD is computed and optionally truncated afterwards.
If full: run exact full SVD calling the standard LAPACK solver via scipy.linalg.svd and select the components by postprocessing.
If arpack: run SVD truncated to n_components calling ARPACK solver via scipy.sparse.linalg.svds. It requires strictly 0 < n_components < min(X.shape).
If randomized: run randomized SVD by the method of Halko et al.
random_state – int, RandomState instance or None, default=24. Used when the 'arpack' or 'randomized' solvers are used. Pass an int for reproducible results across multiple function calls.
compress – bool, default=False. If True, it reverses the PCA via inverse_transform and returns a DataFrame with the original structure. It can be useful to remove noise from the data by compressing the information.
- fit(X, y=None)
Method to train the transformer.
It also resets the columns attribute.
- Parameters:
X – pandas DataFrame of shape (n_samples, n_features) The training input samples.
y – array-like of shape (n_samples,) or (n_samples, n_outputs), Not used The target values (class labels) as integers or strings.
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') DfPCA
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
sample_weight – str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED. Metadata routing for the sample_weight parameter in score.
- Returns:
self – object
The updated object.
- transform(X, y=None)
Method to transform the input data.
It populates the columns attribute with the columns of the output data.
The resulting columns will have name pca_{int}.
If compress=True, the inverse_transform method is called and the original columns are restored.
- Parameters:
X – pandas DataFrame of shape (n_samples, n_features) The input samples.
y – array-like of shape (n_samples,) or (n_samples, n_outputs), Not used The target values (class labels) as integers or strings.
- Returns:
pandas DataFrame with pca columns or, if compress=True, pandas DataFrame with original columns
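The two output modes of DfPCA can be sketched with scikit-learn's PCA and pandas. This is an illustration of the idea rather than the tubesml code: the helper name pca_frame is hypothetical, and numbering the pca_{int} columns from 0 is an assumption, since the documentation above does not say where the index starts.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA


def pca_frame(df, n_components, compress=False, random_state=24):
    """Hypothetical helper: PCA that returns a DataFrame in either mode."""
    pca = PCA(n_components=n_components, random_state=random_state)
    scores = pca.fit_transform(df.to_numpy())
    if compress:
        # Project back to the original space: same columns, less information.
        restored = pca.inverse_transform(scores)
        return pd.DataFrame(restored, columns=df.columns, index=df.index)
    # Assumption: component columns are numbered from 0.
    names = [f"pca_{i}" for i in range(scores.shape[1])]
    return pd.DataFrame(scores, columns=names, index=df.index)


rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(20, 4)), columns=["a", "b", "c", "d"])

components = pca_frame(df, n_components=2)
compressed = pca_frame(df, n_components=2, compress=True)
print(components.columns.tolist())
print(compressed.columns.tolist())
```

With compress=False the result has one column per component. With compress=True the result keeps the four original column names, but its values are reconstructed from only two components, so the variation outside those components is discarded.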