Data Exploration

tubesml.explore.corr_target(data, target, cols, x_estimator=None)

Draws a scatterplot with a linear regression fit for each column in a list against the target. A correlation matrix is also printed. An estimator can optionally be passed.

Parameters:
  • data – pandas DataFrame. The input dataframe.

  • target – str, name of the target column

  • cols – list of columns to consider for the plot against the target.

  • x_estimator – (optional) additional input for sns.regplot.

tubesml.explore.find_cats(data, target, thrs=0.1, agg_func='mean', critical=0.05, ks=True, frac=1)

Finds interesting categorical features, either by performing a Kolmogorov-Smirnov test or simply by comparing a descriptive statistic of the full population with the same statistic computed on each subset.
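The simpler of the two strategies, comparing a statistic on each subset with the full population, can be sketched with plain pandas. This is only an illustration and not tubesml's implementation: the helper name compare_subsets is hypothetical, and it assumes that thrs acts as a relative-difference threshold, which the description above does not confirm.

```python
import pandas as pd


def compare_subsets(data, cat, target, thrs=0.1, agg_func="mean"):
    """Hypothetical helper, not part of tubesml.

    Flags the levels of `cat` whose aggregated target differs from the
    full-population value by more than `thrs` (assumed to be relative).
    """
    # Statistic on the full population
    overall = data[target].agg(agg_func)
    # Same statistic on each subset defined by the categorical feature
    by_level = data.groupby(cat)[target].agg(agg_func)
    # Relative difference; this sketch assumes `overall` is not zero
    rel_diff = (by_level - overall).abs() / abs(overall)
    return rel_diff[rel_diff > thrs]


df = pd.DataFrame({
    "colour": ["red", "red", "blue", "blue", "green", "green"],
    "price": [10.0, 10.0, 20.0, 20.0, 15.0, 15.0],
})
flagged = compare_subsets(df, "colour", "price", thrs=0.1)
print(flagged)
```

Here the green subset has the same mean as the full population, so only the other two levels are flagged.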

tubesml.explore.list_missing(data, verbose=True)

Find all the columns with missing values and report on the percentage of missing values.

Parameters:
  • data – pandas DataFrame. The input dataframe.

  • verbose – bool, default=True. If True, it prints the percentage of missing values in each column with missing values

Return mis_cols:

A list of column names with missing values.
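As a rough illustration of the kind of report described above, the following pandas-only sketch computes the percentage of missing values per column and returns the affected column names. The helper name missing_report and the descending sort are assumptions made for the example; tubesml's own output format may differ.

```python
import numpy as np
import pandas as pd


def missing_report(data: pd.DataFrame, verbose: bool = True) -> list:
    """Hypothetical helper, not part of tubesml."""
    # Percentage of missing values in each column
    pct = data.isna().mean() * 100
    # Keep only columns with at least one missing value, largest share first
    pct = pct[pct > 0].sort_values(ascending=False)
    if verbose:
        print(pct)
    return list(pct.index)


df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": ["x", "y", None, None],
    "c": [1, 2, 3, 4],
})
mis_cols = missing_report(df)
```

Column c has no missing values, so it does not appear in the returned list.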

tubesml.explore.plot_bivariate(data, x, y, hue=None, **kwargs)

Scatterplot of the feature x vs the feature y with the possibility of adding a hue.

Parameters:
  • data – pandas DataFrame. The input dataframe.

  • x – str, name of the feature to plot on the x-axis.

  • y – str, name of the feature to plot on the y-axis.

  • hue – (optional) str, feature to use as hue.

  • kwargs – additional keyword arguments to pass to sns.scatterplot.

tubesml.explore.plot_correlations(data, target=None, limit=50, figsize=(12, 10), **kwargs)

Plots the correlation matrix of a dataframe. If a target feature is provided, only the features most correlated with the target are displayed. The number of features displayed is controlled by the parameter limit.

Parameters:
  • data – pandas DataFrame. The input dataframe.

  • target – str, default=None. If not None, the correlation matrix is displayed in order from the feature most correlated with the target column to the least. It must be present in data.

  • limit – int, default=50. Number of features to display. This is to avoid plots that are difficult to read.

  • figsize – tuple, default=(12,10). Size of the output figure.

  • kwargs – kwargs to be passed to sns.heatmap. For example, to display annotations. See the documentation of Seaborn for more options.

Return cor_target:

Returned only if target is provided: the correlation matrix of the features in data.
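The ordering and truncation controlled by target and limit can be approximated with pandas alone. The sketch below is not the library's code: it assumes features are ranked by the absolute value of their correlation with the target, and the column names are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "target": [1.0, 2.0, 3.0, 4.0, 5.0],
    "strong": [1.1, 1.9, 3.2, 3.9, 5.1],
    "weak": [1.0, -1.0, 1.0, -1.0, 1.0],
    "inverse": [5.0, 4.0, 3.0, 2.5, 1.0],
})
limit = 3

# Full pairwise correlation matrix (Pearson by default)
corr = df.corr()
# Rank features by absolute correlation with the target (assumption)
order = corr["target"].abs().sort_values(ascending=False).index[:limit]
# Keep only the top `limit` features, in ranked order, on both axes
subset = corr.loc[order, order]
print(subset)
```

The resulting subset is what one would then hand to sns.heatmap, with any extra keyword arguments such as annot=True.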

tubesml.explore.plot_distribution(data, column, bins=50, correlation=None)

Plots a histogram of a given column. If a Pandas Series is provided with the correlation values, it will be displayed in the title.

Parameters:
  • data – pandas DataFrame. The input dataframe.

  • column – str, name of the column to plot. If correlation is provided, make sure this column is in the index of that Series.

  • bins – int, number of bins.

  • correlation – (optional) pandas Series. Ideally the output of tubesml.plot_correlations

tubesml.explore.segm_target(data, cat, target)

Studies the target segmented by a categorical feature. It plots both a boxplot and a distplot for visual support

Parameters:
  • data – pandas DataFrame with the columns cat and target.

  • cat – str, name of the category used to cut the data

  • target – str, name of the continuous target variable
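A numerical counterpart to the boxplot can be obtained with a grouped summary. The snippet below is a minimal pandas sketch on made-up data and does not call tubesml; it only shows the per-segment statistics that such a plot visualises.

```python
import pandas as pd

df = pd.DataFrame({
    "segment": ["a", "a", "a", "b", "b", "b"],
    "value": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Count, mean, quartiles and extremes of the target for each category level
summary = df.groupby("segment")["value"].describe()
print(summary[["count", "mean", "50%"]])
```

The 50% column is the median, which corresponds to the central line of each box in a boxplot.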