Data Cleaning
This section lists the classes and methods typically used to clean the data ahead of the transformations and models that come later in a pipeline.
- class tubesml.clean.DfImputer(imputer_type='simple', strategy='mean', fill_value=None, add_indicator=False, n_neighbors=5, weights='uniform')
Just a wrapper for the SimpleImputer that keeps the dataframe structure.
Inherits from BaseTransformer.
- Parameters:
strategy – str, the strategy to impute the missing values, default “mean”. Allowed values: “mean”, “median”, “most_frequent”, “constant”
fill_value – value to use to impute the missing values when the strategy is “constant”. It is ignored by any other strategy
add_indicator – bool, default=False. If True, a new column with binary values is created whenever missing values are found when the fit method is called. The column will be called missing_<column_name>
- Attributes:
- statistics_ pandas Series. The statistics per column, depending on the strategy chosen. The index of the series is the columns attribute of the input dataframe.
- imp sklearn.impute.SimpleImputer. Core transformer. Its fit and transform methods are used here.
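The behaviour described above can be approximated with plain pandas and scikit-learn. The following is a minimal sketch, not the tubesml implementation: it assumes the wrapper fits a SimpleImputer, exposes the learned statistics as a Series indexed by the input columns, rebuilds a DataFrame from the imputed array, and names indicator columns missing_<column_name>. The sample data and variable names are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative data with one missing value per column
df = pd.DataFrame({"age": [20.0, np.nan, 40.0], "income": [1.0, 2.0, np.nan]})

# Fit the core transformer on the DataFrame
imp = SimpleImputer(strategy="mean")
imp.fit(df)

# Per-column statistics as a Series indexed by the input columns
statistics = pd.Series(imp.statistics_, index=df.columns)

# transform returns a NumPy array by default, so the DataFrame is rebuilt around it
imputed = pd.DataFrame(np.asarray(imp.transform(df)), columns=df.columns, index=df.index)

# Assumed indicator behaviour: one binary column per feature that had missing values
columns_with_missing = [col for col in df.columns if df[col].isna().any()]
for col in columns_with_missing:
    imputed[f"missing_{col}"] = df[col].isna().astype(int)
```

Rebuilding the frame with the original columns and index is what lets later pipeline steps keep selecting features by name.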
- fit(X, y=None)
Method to train the imputer.
It also resets the columns attribute.
- Parameters:
X – pandas DataFrame of shape (n_samples, n_features). The training input samples.
y – array-like of shape (n_samples,) or (n_samples, n_outputs). Not used. The target values (class labels) as integers or strings.
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') DfImputer
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to score.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
sample_weight – str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED. Metadata routing for the sample_weight parameter in score.
- Returns:
self – object. The updated object.
- transform(X, y=None)
Method to transform the input data.
It populates the columns attribute with the columns of the output data.
- Parameters:
X – pandas DataFrame of shape (n_samples, n_features). The input samples.
y – array-like of shape (n_samples,) or (n_samples, n_outputs). Not used. The target values (class labels) as integers or strings.
- Returns:
pandas DataFrame with no missing values
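The usual fit and transform split can be sketched as follows: statistics are learned from the training data and then reused to fill new data. This is an approximation built directly on sklearn.impute.SimpleImputer, not the tubesml code; the helper to_frame and the sample frames are hypothetical names introduced for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative training and test frames
train = pd.DataFrame({"a": [1.0, 3.0, np.nan], "b": [10.0, np.nan, 30.0]})
test = pd.DataFrame({"a": [np.nan, 5.0], "b": [np.nan, np.nan]}, index=["x", "y"])

# Learn the per-column medians from the training data only
imp = SimpleImputer(strategy="median").fit(train)


def to_frame(imputer, X):
    """Hypothetical helper: transform X and restore its column names and index."""
    values = np.asarray(imputer.transform(X))
    return pd.DataFrame(values, columns=X.columns, index=X.index)


# Missing values in the test frame are filled with the training medians
test_clean = to_frame(imp, test)
```

Because the medians come from the training frame, column b can be filled even though every value in the test frame is missing.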