Data Processing

This section covers the classes that process data to prepare it for further transformation or for model training. Each class preserves the DataFrame structure, which allows for more flexible steps in the pipeline.

class tubesml.scale.DfScaler(method='standard', feature_range=(0, 1))

Wrapper of several sklearn scalers that keeps the dataframe structure.

Inherits from BaseTransformer.

Parameters:
  • method – str, the method used to scale the data, default 'standard'. Allowed values: 'standard', 'robust', 'minmax'.

  • feature_range – tuple, the range to scale the data to. Relevant only if method=='minmax'.
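As a rough illustration of what feature_range means when method='minmax', the sketch below applies the usual min-max formula (the one documented for sklearn's MinMaxScaler) directly with pandas. It does not call tubesml; the data and the variable names low, high, data_min and data_max are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 30.0, 50.0]})
low, high = (-1, 1)  # plays the role of feature_range (assumed values)

data_min = df.min()  # per-column minimum, a Series indexed by column name
data_max = df.max()  # per-column maximum

# Rescale each column linearly so its minimum maps to `low`
# and its maximum maps to `high`
scaled = (df - data_min) / (data_max - data_min) * (high - low) + low

print(scaled)
```

Because every step is a pandas operation, the result is still a DataFrame with the original column names.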

Attributes:
mean_ – pandas Series with the mean of each feature in the input data.

It is relevant only if method='standard'. The index of the series is the columns attribute of the input dataframe.

center_ – pandas Series with the median of each feature in the input data.

It is relevant only if method='robust'. The index of the series is the columns attribute of the input dataframe.

min_ – pandas Series with the min of each feature in the input data.

It is relevant only if method='minmax'. The index of the series is the columns attribute of the input dataframe.

data_min_ – pandas Series with the min of each feature in the input data.

It is relevant only if method='minmax'. The index of the series is the columns attribute of the input dataframe.

data_max_ – pandas Series with the max of each feature in the input data.

It is relevant only if method='minmax'. The index of the series is the columns attribute of the input dataframe.

feature_range_ – pandas Series with the difference between max and min of each feature in the input data.

It is relevant only if method='minmax'. The index of the series is the columns attribute of the input dataframe.
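To make the shape of these attributes concrete, here is a small sketch that fits plain sklearn scalers and re-attaches the column names to their fitted statistics. It is only an assumption about how such Series could be built, not the tubesml implementation, and the sample data is invented.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler, StandardScaler

df = pd.DataFrame({"a": [1.0, 2.0, 6.0], "b": [10.0, 20.0, 60.0]})

# sklearn stores fitted statistics as NumPy arrays, without column names
standard = StandardScaler().fit(df)
robust = RobustScaler().fit(df)

# Re-attaching the column names gives Series indexed by the input columns
mean_ = pd.Series(standard.mean_, index=df.columns)
center_ = pd.Series(robust.center_, index=df.columns)

print(mean_)
print(center_)
```

With named entries, a statistic can be looked up by feature name (for example mean_["a"]) rather than by position.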

fit(X, y=None)

Method to train the scaler.

Depending on the method attribute, it calls a different sklearn scaler.

It also resets the columns attribute.

Parameters:
  • X – pandas DataFrame of shape (n_samples, n_features). The training input samples.

  • y – array-like of shape (n_samples,) or (n_samples, n_outputs). Not used. The target values (class labels) as integers or strings.

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → DfScaler

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • sample_weight – str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED. Metadata routing for sample_weight parameter in score.

Returns:

self – object. The updated object.

transform(X, y=None)

Method to transform the input data.

It populates the columns attribute with the columns of the output data.

Parameters:
  • X – pandas DataFrame of shape (n_samples, n_features). The input samples.

  • y – array-like of shape (n_samples,) or (n_samples, n_outputs), Not used. The target values (class labels) as integers or strings.

Returns:

pandas DataFrame with scaled data.
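The fit and transform flow described above can be sketched with a small stand-in class. FrameScaler is a hypothetical name used only for this illustration; it is not the tubesml class and makes no claim about how DfScaler is written internally. It dispatches on method, fits the matching sklearn scaler, and rebuilds a DataFrame on transform.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler


class FrameScaler:
    """Hypothetical stand-in for illustration, not the tubesml class."""

    def __init__(self, method="standard", feature_range=(0, 1)):
        self.method = method
        self.feature_range = feature_range

    def fit(self, X, y=None):
        # Pick the sklearn scaler that matches the requested method
        if self.method == "standard":
            self.scaler_ = StandardScaler()
        elif self.method == "robust":
            self.scaler_ = RobustScaler()
        elif self.method == "minmax":
            self.scaler_ = MinMaxScaler(feature_range=self.feature_range)
        else:
            raise ValueError(f"Unknown method: {self.method}")
        self.scaler_.fit(X)
        return self

    def transform(self, X, y=None):
        # sklearn returns a NumPy array; restore the index and column labels
        scaled = self.scaler_.transform(X)
        return pd.DataFrame(scaled, index=X.index, columns=X.columns)


train = pd.DataFrame({"a": [0.0, 5.0, 10.0]}, index=["r1", "r2", "r3"])
test = pd.DataFrame({"a": [2.5, 20.0]}, index=["r4", "r5"])

scaler = FrameScaler(method="minmax").fit(train)
out = scaler.transform(test)  # scaled with the statistics learned on train

print(out)
```

Note that the second test value falls outside the training range, so its scaled value lands outside feature_range; the scaler only uses the statistics seen during fit.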

class tubesml.dummy.Dummify(drop_first=False, match_cols=True, verbose=False)

Wrapper for pd.get_dummies.

It ensures that the pipeline won’t break if some column is missing or new after the first transform.

Using drop_first and match_cols together can cause problems, specifically when the dropped category is missing from data that is dummified after the first time. To avoid this, match_cols takes over the role of drop_first once the transformer has already been run. See test_match_columns_drop_first_equal for an example.
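The problem described above can be reproduced with pd.get_dummies alone. In this sketch (invented data, no tubesml involved), the category that was dropped the first time does not appear in the later data, so drop_first removes a different column.

```python
import pandas as pd

first = pd.DataFrame({"color": ["blue", "green", "red"]})
later = pd.DataFrame({"color": ["green", "red"]})  # "blue" never appears

# First call: "color_blue" is the first dummy, so it is the one dropped
cols_first = list(pd.get_dummies(first, drop_first=True).columns)

# Later call: "color_green" is now the first dummy and gets dropped instead
cols_later = list(pd.get_dummies(later, drop_first=True).columns)

print(cols_first)  # ['color_green', 'color_red']
print(cols_later)  # ['color_red']
```

The two outputs no longer share the same columns, which is the inconsistency that column matching is meant to prevent.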

The fit method simply passes the data through, as in the BaseTransformer.

Parameters:
  • drop_first – bool, default False. If True, the first dummy column is dropped.

  • match_cols – bool, default True. If True, it makes sure that all the columns found when calling the transformer the first time are found every other time the transform method is called. It thus adds the missing columns (with all 0 values) and removes the columns not previously found.

  • verbose – bool, default False. If True, it raises a UserWarning when the _match_columns method is invoked
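The column matching described for match_cols can be approximated with a reindex against the columns seen the first time. This is a sketch under that assumption, using only pandas; it is not necessarily how the _match_columns method is implemented, and reference_cols is a name chosen for the example.

```python
import pandas as pd

first = pd.DataFrame({"color": ["blue", "green"], "size": [1, 2]})
later = pd.DataFrame({"color": ["green", "red"], "size": [3, 4]})

# Columns produced the first time are remembered as the reference
reference_cols = pd.get_dummies(first).columns

# Later data: "color_blue" is missing and "color_red" is new
dummies = pd.get_dummies(later)

# reindex adds the missing columns filled with 0
# and drops the columns that are not in the reference
matched = dummies.reindex(columns=reference_cols, fill_value=0)

print(list(matched.columns))  # ['size', 'color_blue', 'color_green']
```

A downstream model trained on the first output therefore always receives the same set of columns in the same order.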

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → Dummify

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • sample_weight – str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED. Metadata routing for sample_weight parameter in score.

Returns:

self – object. The updated object.

transform(X, y=None)

Method to transform the input data and create dummy columns. If match_cols=True, it also calls the _match_columns method to make sure the data shape stays consistent across runs. This is not done the first time the transform method is called.

It populates the columns attribute with the columns of the output data. This is done only the first time the transformer is called and not every time it outputs new data.

Parameters:
  • X – pandas DataFrame of shape (n_samples, n_features). The input samples.

  • y – array-like of shape (n_samples,) or (n_samples, n_outputs). Not used. The target values (class labels) as integers or strings.

Returns:

pandas DataFrame with dummified columns.