Data Processing
This section covers the classes that process data to prepare it for further transformation or for model training. They preserve the DataFrame structure, which allows for more flexible steps in the pipeline.
- class tubesml.scale.DfScaler(method='standard', feature_range=(0, 1))
Wrapper of several sklearn scalers that keeps the dataframe structure.
Inherits from BaseTransformer.
- Parameters:
method – str, default 'standard'. The method used to scale the data. Allowed values: 'standard', 'robust', 'minmax'.
feature_range – tuple, default (0, 1). The range to scale the data to. Relevant only if method='minmax'.
- Attributes:
- mean_: pandas Series with the mean of each feature in the input data. It is relevant only if method='standard'. The index of the series is the columns attribute of the input dataframe.
- center_: pandas Series with the median of each feature in the input data. It is relevant only if method='robust'. The index of the series is the columns attribute of the input dataframe.
- min_: pandas Series with the min of each feature in the input data. It is relevant only if method='minmax'. The index of the series is the columns attribute of the input dataframe.
- data_min_: pandas Series with the min of each feature in the input data. It is relevant only if method='minmax'. The index of the series is the columns attribute of the input dataframe.
- data_max_: pandas Series with the max of each feature in the input data. It is relevant only if method='minmax'. The index of the series is the columns attribute of the input dataframe.
- feature_range_: pandas Series with the difference between max and min of each feature in the input data. It is relevant only if method='minmax'. The index of the series is the columns attribute of the input dataframe.
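As a rough illustration of the attributes above, the per-feature statistics can be pictured as plain pandas reductions over the input columns. This is a sketch that uses only pandas and does not call tubesml: the DataFrame `df`, its column names, and the `*_like` variable names are made up for the example, and the values actually stored by DfScaler come from the wrapped sklearn scaler.

```python
import pandas as pd

# Hypothetical input data; the column names are illustrative only.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 100.0]})

# Each reduction returns a pandas Series indexed by the DataFrame's columns,
# which is the shape the attributes above are described as having.
mean_like = df.mean()        # compare with mean_ (method='standard')
center_like = df.median()    # compare with center_ (method='robust')
data_min_like = df.min()     # compare with data_min_ (method='minmax')
data_max_like = df.max()     # compare with data_max_ (method='minmax')
range_like = data_max_like - data_min_like  # difference between max and min

print(mean_like)
```

Because every statistic is indexed by column name, it can be looked up per feature, for example `mean_like["a"]`.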
- fit(X, y=None)
Method to train the scaler.
Depending on the method attribute, it calls a different sklearn scaler. It also resets the columns attribute.
- Parameters:
X – pandas DataFrame of shape (n_samples, n_features). The training input samples.
y – array-like of shape (n_samples,) or (n_samples, n_outputs). Not used. The target values (class labels) as integers or strings.
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → DfScaler
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to score.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
sample_weight – str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED. Metadata routing for the sample_weight parameter in score.
- Returns:
self – object. The updated object.
- transform(X, y=None)
Method to transform the input data.
It populates the columns attribute with the columns of the output data.
- Parameters:
X – pandas DataFrame of shape (n_samples, n_features). The input samples.
y – array-like of shape (n_samples,) or (n_samples, n_outputs), Not used. The target values (class labels) as integers or strings.
- Returns:
pandas DataFrame with scaled data.
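To make the three scaling methods concrete, the following is a minimal pandas-only sketch of what the default sklearn scalers compute, written so that the result stays a DataFrame with the original index and columns. It is not the DfScaler implementation: the function `scale_df` and the sample data are hypothetical, and constant columns (which would divide by zero here) are not handled.

```python
import pandas as pd

def scale_df(df, method="standard", feature_range=(0, 1)):
    """Scale each column of df and return a DataFrame with the same index and columns."""
    if method == "standard":
        # StandardScaler uses the population standard deviation (ddof=0).
        return (df - df.mean()) / df.std(ddof=0)
    if method == "robust":
        # RobustScaler centers on the median and, by default, scales by the
        # interquartile range (75th minus 25th percentile).
        iqr = df.quantile(0.75) - df.quantile(0.25)
        return (df - df.median()) / iqr
    if method == "minmax":
        low, high = feature_range
        # Map each column to [0, 1] first, then stretch to [low, high].
        unit = (df - df.min()) / (df.max() - df.min())
        return unit * (high - low) + low
    raise ValueError(f"Unknown method: {method!r}")

# Hypothetical input data; the column names are illustrative only.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 100.0]})

standard = scale_df(df, method="standard")
robust = scale_df(df, method="robust")
minmax = scale_df(df, method="minmax", feature_range=(0, 1))

print(minmax)
```

Since the output keeps the column names, later pipeline steps can still select features by name, which is the flexibility the DataFrame-preserving design is aiming for.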
- class tubesml.dummy.Dummify(drop_first=False, match_cols=True, verbose=False)
Wrapper for pd.get_dummies.
It ensures that the pipeline won’t break if a column is missing or new after the first transform.
Using both drop_first and match_cols can cause problems, specifically if the dropped category is missing when dummies are created after the first time. To avoid them, match_cols takes over the role of drop_first once the transformer has already been run. See test_match_columns_drop_first_equal for an example.
The fit method simply passes the data through, as in BaseTransformer.
- Parameters:
drop_first – bool, default False. If True, the first dummy column is dropped.
match_cols – bool, default True. If True, it makes sure that all the columns found the first time the transformer is called are present every other time the transform method is called. It thus adds the missing columns (filled with 0) and removes the columns not previously found.
verbose – bool, default False. If True, it raises a UserWarning when the _match_columns method is invoked.
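The interaction between drop_first and a missing category can be seen with pd.get_dummies alone. The sketch below does not use tubesml; the `train` and `new` frames and the `color` column are invented for illustration. It shows that when the category dropped on the first call is absent from later data, a different category is dropped instead.

```python
import pandas as pd

# Hypothetical data: "blue" appears the first time but not in the later batch.
train = pd.DataFrame({"color": ["blue", "green", "red"]})
new = pd.DataFrame({"color": ["green", "red"]})

# drop_first removes the first category found in each call independently.
train_dummies = pd.get_dummies(train, drop_first=True)
new_dummies = pd.get_dummies(new, drop_first=True)

print(list(train_dummies.columns))  # ['color_green', 'color_red']
print(list(new_dummies.columns))    # ['color_red']
```

In the second call "green" is treated as the baseline, so the two outputs encode the same category differently. This is the kind of mismatch that matching against the columns seen on the first call is meant to prevent.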
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → Dummify
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to score.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
sample_weight – str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED. Metadata routing for the sample_weight parameter in score.
- Returns:
self – object. The updated object.
- transform(X, y=None)
Method to transform the input data and create dummy columns. If match_cols=True, it also calls the _match_columns method to make sure the data shape stays consistent across runs. This is not done the first time the transform method is called.
It populates the columns attribute with the columns of the output data. This is done only the first time the transformer is called and not every time it outputs new data.
- Parameters:
X – pandas DataFrame of shape (n_samples, n_features). The input samples.
y – array-like of shape (n_samples,) or (n_samples, n_outputs). Not used. The target values (class labels) as integers or strings.
- Returns:
pandas DataFrame with dummified columns.
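The column matching described for this transformer can be approximated with pandas alone. The following is a sketch under stated assumptions, not the library's code: the helper `match_columns` is a hypothetical stand-in for the private _match_columns method, and the sample frames are invented. It records the columns from the first call, then adds missing columns filled with 0 and drops unseen ones on later calls.

```python
import pandas as pd

def match_columns(dummies, expected_columns):
    """Hypothetical helper: add missing columns as 0 and drop columns not seen before."""
    return dummies.reindex(columns=expected_columns, fill_value=0)

# Hypothetical data: the later batch lacks "blue" and "red" and adds "purple".
train = pd.DataFrame({"color": ["blue", "green", "red"], "size": [1, 2, 3]})
new = pd.DataFrame({"color": ["green", "purple"], "size": [4, 5]})

# First call: create the dummies and record the resulting columns once.
first = pd.get_dummies(train)
expected_columns = list(first.columns)

# Later calls: create the dummies, then align them to the recorded columns.
later = match_columns(pd.get_dummies(new), expected_columns)

print(list(later.columns))
```

After alignment, `later` has exactly the columns recorded on the first call, so a model trained on the first output still receives the feature layout it expects.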