5 Steps to Transform Messy Functions into Production-Ready Code

Motivation

Functions are essential in a data science project because they make the code more modular, reusable, readable, and testable. However, writing a messy function that tries to do too much can introduce maintenance hurdles and diminish the code’s readability.

In the following code, the function impute_missing_values is long, messy, and tries to do many things. Because the column names and fill values are hard-coded, nobody can reuse this function on a DataFrame with different column names without editing it.

def impute_missing_values(df):
    # Fill missing values with group statistics
    df["MSZoning"] = df.groupby("MSSubClass")["MSZoning"].transform(
        lambda x: x.fillna(x.mode()[0])
    )
    df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
        lambda x: x.fillna(x.median())
    )

    # Fill missing values with constant
    df["Functional"] = df["Functional"].fillna("Typ")

    df["Alley"] = df["Alley"].fillna("Missing")
    for col in ["GarageType", "GarageFinish", "GarageQual", "GarageCond"]:
        df[col] = df[col].fillna("Missing")

    for col in ("BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2"):
        df[col] = df[col].fillna("Missing")

    df["FireplaceQu"] = df["FireplaceQu"].fillna("Missing")

    df["PoolQC"] = df["PoolQC"].fillna("Missing")

    df["Fence"] = df["Fence"].fillna("Missing")

    df["MiscFeature"] = df["MiscFeature"].fillna("Missing")

    numeric_dtypes = ["int16", "int32", "int64", "float16", "float32", "float64"]
    for i in df.columns:
        if df[i].dtype in numeric_dtypes:
            df[i] = df[i].fillna(0)

    # Fill missing values with mode
    df["Electrical"] = df["Electrical"].fillna("SBrkr")
    df["KitchenQual"] = df["KitchenQual"].fillna("TA")
    df["Exterior1st"] = df["Exterior1st"].fillna(df["Exterior1st"].mode()[0])
    df["Exterior2nd"] = df["Exterior2nd"].fillna(df["Exterior2nd"].mode()[0])
    df["SaleType"] = df["SaleType"].fillna(df["SaleType"].mode()[0])
    for i in df.columns:
        if df[i].dtype == object:
            df[i] = df[i].fillna(df[i].mode()[0])
    return df

This example is adapted from the notebook titled How I Achieved Top 0.3% in a Kaggle Competition, with a few alterations.

Unfortunately, functions like this are common in data science projects. In this article, you will learn how to refactor this function so that it is reusable, maintainable, and scalable.

To refactor this function, we will follow these 5 steps:

  • Remove redundant code.
  • Split the function into smaller units.
  • Remove duplicates.
  • Make code extendable without alteration.
  • Strengthen code against invalid data inputs.

Feel free to play and fork the source code of this article here:

Remove redundant code

Problem 1

The first problem with this code is that each column is set to “Missing” individually.

def impute_missing_values(df):
    ...
    # Fill missing values with constant
    df["Functional"] = df["Functional"].fillna("Typ")

    df["Alley"] = df["Alley"].fillna("Missing")
    for col in ["GarageType", "GarageFinish", "GarageQual", "GarageCond"]:
        df[col] = df[col].fillna("Missing")

    for col in ("BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2"):
        df[col] = df[col].fillna("Missing")

    df["FireplaceQu"] = df["FireplaceQu"].fillna("Missing")

    df["PoolQC"] = df["PoolQC"].fillna("Missing")

    df["Fence"] = df["Fence"].fillna("Missing")

    df["MiscFeature"] = df["MiscFeature"].fillna("Missing")
    ...

Solution

To address this issue, let’s collect every column that should be filled with “Missing” into one list and fill them in a single loop.

def impute_missing_values(df):
    ...
    df["Functional"] = df["Functional"].fillna("Typ")

    for col in [
        "Alley",
        "GarageType",
        "GarageFinish",
        "GarageQual",
        "GarageCond",
        "BsmtQual",
        "BsmtCond",
        "BsmtExposure",
        "BsmtFinType1",
        "BsmtFinType2",
        "FireplaceQu",
        "PoolQC",
        "Fence",
        "MiscFeature",
    ]:
        df[col] = df[col].fillna("Missing")
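The loop can also be dropped entirely, because fillna works on a multi-column selection. The following is a minimal sketch on toy data; only the column names come from the article, and the values and the name missing_label_columns are made up for illustration.

```python
import pandas as pd

# Toy data: the values are invented, only the column names are from the article.
df = pd.DataFrame(
    {
        "Alley": [None, "Grvl"],
        "PoolQC": ["Ex", None],
        "LotArea": [8450, 9600],
    }
)

missing_label_columns = ["Alley", "PoolQC"]

# Select the columns as a sub-DataFrame and fill all of them in one call.
df[missing_label_columns] = df[missing_label_columns].fillna("Missing")
```

Columns outside the list, such as LotArea here, are left untouched.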

Problem 2

The second problem is that the code loops over every column in Python and checks each dtype one by one just to fill the numeric columns with 0, which is inefficient.

def impute_missing_values(df):
    ...
    numeric_dtypes = ["int16", "int32", "int64", "float16", "float32", "float64"]
    for i in df.columns:
        if df[i].dtype in numeric_dtypes:
            df[i] = df[i].fillna(0)

Solution

To make this code more efficient, select the numeric columns with select_dtypes and fill them all with a single .fillna() call.

def impute_missing_values(df):
    ...
    numerical_features = df.select_dtypes(include=["int64", "float64"]).columns
    df[numerical_features] = df[numerical_features].fillna(0)
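One detail worth checking when applying this: the list ["int64", "float64"] is narrower than the dtype list in the original loop, so a float32 or int16 column would be skipped. The sketch below, on made-up data, compares it with include="number", which select_dtypes treats as "all numeric dtypes". Whether the broader selection is wanted depends on your data, so treat it as an option rather than the article's method.

```python
import numpy as np
import pandas as pd

# Made-up frame with two float widths and one text column.
df = pd.DataFrame(
    {
        "small_float": pd.Series([1.5, np.nan], dtype="float32"),
        "big_float": pd.Series([np.nan, 2.0], dtype="float64"),
        "label": ["a", None],
    }
)

# Explicit dtype list: the float32 column is not matched.
narrow = df.select_dtypes(include=["int64", "float64"]).columns

# "number" matches every numeric dtype.
broad = df.select_dtypes(include="number").columns

# Fill all numeric columns in one call; the text column is left alone.
df[broad] = df[broad].fillna(0)
```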

Split the function into smaller units

Problem

The function impute_missing_values performs several distinct data cleaning tasks: substituting missing values with group statistics, with a constant, and with the most frequent value.

import pandas as pd


def impute_missing_values(df):
    # Fill missing values with group statistics
    df["MSZoning"] = df.groupby("MSSubClass")["MSZoning"].transform(
        lambda x: x.fillna(x.mode()[0])
    )
    df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
        lambda x: x.fillna(x.median())
    )

    # Fill missing values with constant
    df["Functional"] = df["Functional"].fillna("Typ")

    for col in [
        "Alley",
        "GarageType",
        "GarageFinish",
        "GarageQual",
        "GarageCond",
        "BsmtQual",
        "BsmtCond",
        "BsmtExposure",
        "BsmtFinType1",
        "BsmtFinType2",
        "FireplaceQu",
        "PoolQC",
        "Fence",
        "MiscFeature",
    ]:
        df[col] = df[col].fillna("Missing")

    numerical_df = df.select_dtypes(include=["int64", "float64"]).columns
    df[numerical_df] = df[numerical_df].fillna(0)

    # Fill missing values with mode
    categorical_df = df.select_dtypes(include=["object"]).columns
    fill_dict_categorical = df[categorical_df].mode().to_dict(orient="records")[0]
    df[categorical_df] = df[categorical_df].fillna(fill_dict_categorical)
    return df

This blend of tasks complicates testing, leading to cluttered tests that require data samples covering diverse data types, columns, and imputing methods.

import pandas as pd
from process_data import impute_missing_values


def test_impute_missing_values():
    df = pd.DataFrame(
        {
            "MSZoning": [None, "RL", None, "RM"],
            "MSSubClass": [20, 30, 30, 20],
            "LotFrontage": [80.0, None, 75.0, None],
            "Neighborhood": ["CollgCr", "Veenker", "Crawfor", "CollgCr"],
            "Functional": [None, "Typ", "Min1", "Min2"],
            "Alley": [None, "Grvl", None, None],
            "GarageType": [None, "Attchd", None, "Detchd"],
            "GarageFinish": ["Fin", None, "RFn", None],
            "GarageQual": [None, "TA", None, "TA"],
            "GarageCond": ["TA", None, "Fa", None],
            "BsmtQual": [None, "Gd", "TA", "Ex"],
            "BsmtCond": ["TA", "Gd", None, "Fa"],
            "BsmtExposure": ["No", None, "Mn", "Av"],
            "BsmtFinType1": ["GLQ", "ALQ", None, "Unf"],
            "BsmtFinType2": ["Unf", "BLQ", None, "LwQ"],
            "FireplaceQu": [None, "TA", "Gd", "Ex"],
            "PoolQC": ["Ex", None, "Fa", None],
            "Fence": [None, "MnPrv", "GdWo", None],
            "MiscFeature": [None, "Shed", "Gar2", None],
            "Electrical": ["SBrkr", None, "FuseF", "FuseA"],
            "KitchenQual": ["Gd", "TA", None, "Ex"],
            "Exterior1st": [None, "VinylSd", "Wd Sdng", "MetalSd"],
            "Exterior2nd": ["VinylSd", None, "Wd Shng", "MetalSd"],
            "SaleType": ["WD", None, "New", "COD"],
        }
    )

    imputed_df = impute_missing_values(df)

    assert not imputed_df.isnull().any().any()

Solution

We will break this function down into smaller functions, each handling a specific type of missing data:

  • impute_missing_values_with_group_statistic fills missing values in a target feature with the most frequent value, median, or mean of each group, where the groups are determined by another feature in the DataFrame.
def impute_missing_values_with_group_statistic(
    df: pd.DataFrame,
    group_feature: str,
    target_feature: str,
    strategy: Literal["most_frequent", "median", "mean"],
) -> pd.DataFrame:
    strategy_functions = {
        "most_frequent": lambda series: series.fillna(series.mode()[0]),
        "median": lambda series: series.fillna(series.median()),
        "mean": lambda series: series.fillna(series.mean()),
    }
    if strategy not in strategy_functions:
        raise ValueError(f"Invalid strategy: {strategy}")

    impute_function = strategy_functions[strategy]
    df[target_feature] = df.groupby(group_feature)[target_feature].transform(
        impute_function
    )
    return df
  • impute_missing_values_with_constant fills missing values in specified features with a constant value.
def impute_missing_values_with_constant(
    df: pd.DataFrame, features: list, impute_value: Union[str, int, float]
) -> pd.DataFrame:
    df[features] = df[features].fillna(impute_value)
    return df
  • impute_missing_values_with_statistics fills missing values in specified features with a statistic of each feature: its most frequent value, median, or mean.
def impute_missing_values_with_statistics(
    df: pd.DataFrame,
    features: list,
    strategy: Literal["most_frequent", "median", "mean"],
) -> pd.DataFrame:
    strategy_functions = {
        "most_frequent": lambda series: series.fillna(series.mode()[0]),
        "median": lambda series: series.fillna(series.median()),
        "mean": lambda series: series.fillna(series.mean()),
    }
    if strategy not in strategy_functions:
        raise ValueError(f"Invalid strategy: {strategy}")

    impute_function = strategy_functions[strategy]
    df[features] = df[features].apply(impute_function)
    return df

These functions will then be combined within the impute_missing_values function:

def impute_missing_values(df: pd.DataFrame) -> pd.DataFrame:
    categorical_features = df.select_dtypes(include=["object"]).columns
    numerical_features = df.select_dtypes(include=["int64", "float64"]).columns

    df = impute_missing_values_with_group_statistic(
        df,
        strategy="most_frequent",
        group_feature="MSSubClass",
        target_feature="MSZoning",
    )
    df = impute_missing_values_with_group_statistic(
        df,
        strategy="median",
        group_feature="Neighborhood",
        target_feature="LotFrontage",
    )

    df = impute_missing_values_with_constant(
        df, features=["Functional"], impute_value="Typ"
    )
    df = impute_missing_values_with_constant(
        df, features=numerical_features, impute_value=0
    )
    df = impute_missing_values_with_constant(
        df,
        features=[
            "Alley",
            "GarageType",
            "GarageFinish",
            "GarageQual",
            "GarageCond",
            "BsmtQual",
            "BsmtCond",
            "BsmtExposure",
            "BsmtFinType1",
            "BsmtFinType2",
            "FireplaceQu",
            "PoolQC",
            "Fence",
            "MiscFeature",
        ],
        impute_value="Missing",
    )
    df = impute_missing_values_with_statistics(
        df, features=categorical_features, strategy="most_frequent"
    )
    return df

Now, each imputing function can be tested individually, making the tests more concise and easier to understand.

def test_impute_missing_values_with_constant():
    df = pd.DataFrame({"A": [1, None, 3], "B": ["x", "y", None]})
    result = impute_missing_values_with_constant(
        df, features=["A", "B"], impute_value=0
    )
    expected = pd.DataFrame({"A": [1, 0, 3], "B": ["x", "y", 0]})
    pd.testing.assert_frame_equal(result, expected, check_dtype=False)
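The group-statistic function can be tested in the same focused way. Below is a sketch of such a test; it carries a compact copy of the function so that it runs on its own, and the column names Group and Value are placeholders chosen for the example.

```python
import pandas as pd


def impute_missing_values_with_group_statistic(
    df, group_feature, target_feature, strategy
):
    # Compact copy of the function shown above, so the test is self-contained.
    strategy_functions = {
        "most_frequent": lambda series: series.fillna(series.mode()[0]),
        "median": lambda series: series.fillna(series.median()),
        "mean": lambda series: series.fillna(series.mean()),
    }
    if strategy not in strategy_functions:
        raise ValueError(f"Invalid strategy: {strategy}")
    df[target_feature] = df.groupby(group_feature)[target_feature].transform(
        strategy_functions[strategy]
    )
    return df


def test_impute_missing_values_with_group_statistic():
    df = pd.DataFrame(
        {
            "Group": ["A", "A", "A", "B", "B"],
            "Value": [1.0, 3.0, None, 10.0, None],
        }
    )
    result = impute_missing_values_with_group_statistic(
        df, group_feature="Group", target_feature="Value", strategy="median"
    )
    # The median of group A is 2.0 and the median of group B is 10.0.
    assert result["Value"].tolist() == [1.0, 3.0, 2.0, 10.0, 10.0]


test_impute_missing_values_with_group_statistic()
```

The test needs only two columns, compared with the two dozen required to exercise the original function.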

Remove code duplicates

Problem

The two functions impute_missing_values_with_group_statistic and impute_missing_values_with_statistics follow the same pattern of actions:

  1. Check the value of strategy.
  2. Apply the corresponding statistical function to fill missing values.
def impute_missing_values_with_group_statistic(
    df: pd.DataFrame,
    group_feature: str,
    target_feature: str,
    strategy: Literal["most_frequent", "median", "mean"],
) -> pd.DataFrame:
    strategy_functions = {
        "most_frequent": lambda series: series.fillna(series.mode()[0]),
        "median": lambda series: series.fillna(series.median()),
        "mean": lambda series: series.fillna(series.mean()),
    }
    if strategy not in strategy_functions:
        raise ValueError(f"Invalid strategy: {strategy}")

    impute_function = strategy_functions[strategy]
    df[target_feature] = df.groupby(group_feature)[target_feature].transform(
        impute_function
    )
    return df


def impute_missing_values_with_statistics(
    df: pd.DataFrame,
    features: list,
    strategy: Literal["most_frequent", "median", "mean"],
) -> pd.DataFrame:
    strategy_functions = {
        "most_frequent": lambda series: series.fillna(series.mode()[0]),
        "median": lambda series: series.fillna(series.median()),
        "mean": lambda series: series.fillna(series.mean()),
    }
    if strategy not in strategy_functions:
        raise ValueError(f"Invalid strategy: {strategy}")
    
    impute_function = strategy_functions[strategy]
    df[features] = df[features].apply(impute_function)
    return df

If a change or a bug fix is required, it needs to be applied in multiple places. This increases the risk of inconsistencies if the change is not replicated exactly in all instances of the duplicated code.

def impute_missing_values_with_group_statistic(
    df: pd.DataFrame,
    group_feature: str,
    target_feature: str,
    strategy: Literal["most_frequent", "median", "mean"],
) -> pd.DataFrame:
    strategy_functions = {
        # CHANGE HERE
        "most_frequent": lambda series: series.fillna(series.mode().get(0, default=pd.NA)),
        "median": lambda series: series.fillna(series.median()),
        "mean": lambda series: series.fillna(series.mean()),
    }
    if strategy not in strategy_functions:
        raise ValueError(f"Invalid strategy: {strategy}")

    impute_function = strategy_functions[strategy]
    df[target_feature] = df.groupby(group_feature)[target_feature].transform(
        impute_function
    )
    return df


def impute_missing_values_with_statistics(
    df: pd.DataFrame,
    features: list,
    strategy: Literal["most_frequent", "median", "mean"],
) -> pd.DataFrame:
    strategy_functions = {
        # BUT FORGET TO CHANGE HERE
        "most_frequent": lambda series: series.fillna(series.mode()[0]),
        "median": lambda series: series.fillna(series.median()),
        "mean": lambda series: series.fillna(series.mean()),
    }
    if strategy not in strategy_functions:
        raise ValueError(f"Invalid strategy: {strategy}")
    
    impute_function = strategy_functions[strategy]
    df[features] = df[features].apply(impute_function)
    return df

Solution

To address these issues, the duplicated logic can be abstracted into a shared function called get_strategy_function.

def get_strategy_function(strategy: Literal["most_frequent", "median", "mean"]):
    strategy_functions = {
        "most_frequent": lambda series: series.fillna(
            series.mode().get(0, default=pd.NA)
        ),
        "median": lambda series: series.fillna(series.median()),
        "mean": lambda series: series.fillna(series.mean()),
    }
    if strategy not in strategy_functions:
        raise ValueError(f"Invalid strategy: {strategy}")

    return strategy_functions[strategy]


def impute_missing_values_with_group_statistic(
    df: pd.DataFrame,
    group_feature: str,
    target_feature: str,
    strategy: Literal["most_frequent", "median", "mean"],
) -> pd.DataFrame:
    impute_function = get_strategy_function(strategy)
    df[target_feature] = df.groupby(group_feature)[target_feature].transform(
        impute_function
    )
    return df


def impute_missing_values_with_statistics(
    df: pd.DataFrame,
    features: list,
    strategy: Literal["most_frequent", "median", "mean"],
) -> pd.DataFrame:
    impute_function = get_strategy_function(strategy)
    df[features] = df[features].apply(impute_function)
    return df

This centralizes the imputation logic in one place, which makes the codebase easier to manage, extend, and test: a fix such as the safer mode lookup now has to be made only once.
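To see why the shared function uses series.mode().get(0, default=pd.NA) rather than series.mode()[0], consider a column (or a group) in which every value is missing. The sketch below uses a made-up all-missing Series and shows only the lookup itself.

```python
import pandas as pd

# A column (or group) where every value is missing has no mode at all.
all_missing = pd.Series([None, None], dtype="object")
modes = all_missing.mode()  # an empty Series

try:
    modes[0]
    lookup_failed = False
except (KeyError, IndexError):
    # Looking up entry 0 of an empty Series raises a lookup error.
    lookup_failed = True

# .get() returns the supplied default instead of raising.
safe_value = modes.get(0, default=pd.NA)
```

With the plain [0] lookup, a single all-missing group would abort the whole imputation; with .get() the group simply stays missing.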

Make code extendable without alteration

Problem

The Open-Closed Principle states that software entities such as classes, modules, and functions should be open for extension but closed for modification: it should be possible to add new features without changing existing code that has already been developed and tested.

The impute_missing_values function violates the Open-Closed Principle because it requires changes to its code when the data sets change or when integrating a new imputation method.

def impute_missing_values(df: pd.DataFrame) -> pd.DataFrame:
    categorical_features = df.select_dtypes(include=["object"]).columns
    numerical_features = df.select_dtypes(include=["int64", "float64"]).columns

    df = impute_missing_values_with_group_statistic(
        df,
        strategy="most_frequent",
        group_feature="MSSubClass",
        target_feature="MSZoning",
    )
    df = impute_missing_values_with_group_statistic(
        df,
        strategy="median",
        group_feature="Neighborhood",
        target_feature="LotFrontage",
    )

    df = impute_missing_values_with_constant(
        df, features=["Functional"], impute_value="Typ"
    )
    df = impute_missing_values_with_constant(
        df, features=numerical_features, impute_value=0
    )
    df = impute_missing_values_with_constant(
        df,
        features=[
            "Alley",
            "GarageType",
            "GarageFinish",
            "GarageQual",
            "GarageCond",
            "BsmtQual",
            "BsmtCond",
            "BsmtExposure",
            "BsmtFinType1",
            "BsmtFinType2",
            "FireplaceQu",
            "PoolQC",
            "Fence",
            "MiscFeature",
        ],
        impute_value="Missing",
    )

    # USE NEW IMPUTING METHOD
    df = impute_missing_values_with_random_value(
        df, features=["NewFeature"]
    )
    return df

When you make changes to the code, you may need to re-test the entire codebase to ensure that the changes did not break anything.

import pandas as pd
from process_data import impute_missing_values


def test_impute_missing_values():
    df = pd.DataFrame(
        {
            "MSZoning": [None, "RL", None, "RM"],
            "MSSubClass": [20, 30, 30, 20],
            "LotFrontage": [80.0, None, 75.0, None],
            "Neighborhood": ["CollgCr", "Veenker", "Crawfor", "CollgCr"],
            "Functional": [None, "Typ", "Min1", "Min2"],
            "Alley": [None, "Grvl", None, None],
            "GarageType": [None, "Attchd", None, "Detchd"],
            "GarageFinish": ["Fin", None, "RFn", None],
            "GarageQual": [None, "TA", None, "TA"],
            "GarageCond": ["TA", None, "Fa", None],
            "BsmtQual": [None, "Gd", "TA", "Ex"],
            "BsmtCond": ["TA", "Gd", None, "Fa"],
            "BsmtExposure": ["No", None, "Mn", "Av"],
            "BsmtFinType1": ["GLQ", "ALQ", None, "Unf"],
            "BsmtFinType2": ["Unf", "BLQ", None, "LwQ"],
            "FireplaceQu": [None, "TA", "Gd", "Ex"],
            "PoolQC": ["Ex", None, "Fa", None],
            "Fence": [None, "MnPrv", "GdWo", None],
            "MiscFeature": [None, "Shed", "Gar2", None],
            "NewFeature": [...],  # ADD THIS
        }
    )

    imputed_df = impute_missing_values(df)

    assert not imputed_df.isnull().any().any()

This can be time-consuming and may delay the release of the software.

Solution

To solve this problem, we can define an abstract base class DataFrameImputer with an abstract method impute. Subsequently, we can create a variety of concrete imputer classes that inherit from DataFrameImputer. Each of these would implement the impute method to carry out a specific imputation strategy such as filling missing values with group statistics, constants, or statistics.

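from abc import ABC, abstractmethod
from typing import Iterable, Literal, Union

import pandas as pd

# Assumed alias: the article uses NumberOrStr without showing its definition.
NumberOrStr = Union[int, float, str]


class DataFrameImputer(ABC):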
    @abstractmethod
    def impute(self, df: pd.DataFrame) -> pd.DataFrame:
        pass


class GroupStatisticImputer(DataFrameImputer):
    def __init__(
        self,
        strategy: Literal["most_frequent", "median", "mean"],
        group_feature: str,
        target_feature: str,
    ):
        self.strategy = strategy
        self.group_feature = group_feature
        self.target_feature = target_feature

    def impute(self, df: pd.DataFrame) -> pd.DataFrame:
        impute_function = get_strategy_function(self.strategy)
        df[self.target_feature] = df.groupby(self.group_feature)[
            self.target_feature
        ].transform(impute_function)
        return df


class ConstantImputer(DataFrameImputer):
    def __init__(self, features: list, fill_value: Union[str, float, int]):
        self.features = features
        self.fill_value = fill_value

    def impute(self, df: pd.DataFrame) -> pd.DataFrame:
        df[self.features] = df[self.features].fillna(self.fill_value)
        return df


class StatisticsImputer(DataFrameImputer):
    def __init__(
        self,
        features: Iterable[NumberOrStr],
        strategy: Literal["most_frequent", "median", "mean"],
    ):
        self.features = features
        self.strategy = strategy

    def impute(self, df: pd.DataFrame) -> pd.DataFrame:
        impute_function = get_strategy_function(self.strategy)
        df[self.features] = df[self.features].apply(impute_function)
        return df
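A side benefit of the abstract base class is that Python itself enforces the contract. The sketch below repeats the base class and adds a hypothetical subclass, IncompleteImputer, that forgets to implement impute; the name is invented for this example.

```python
from abc import ABC, abstractmethod

import pandas as pd


class DataFrameImputer(ABC):
    @abstractmethod
    def impute(self, df: pd.DataFrame) -> pd.DataFrame:
        pass


class IncompleteImputer(DataFrameImputer):
    # Hypothetical subclass that does not implement impute().
    pass


try:
    IncompleteImputer()
    instantiated = True
except TypeError:
    # Python refuses to instantiate a class with unimplemented abstract methods.
    instantiated = False
```

The mistake surfaces when the imputer is created, not later in the middle of a pipeline run.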

Since all imputer classes have the impute method, the impute_missing_values function can accept a list of DataFrameImputer objects and apply each one's impute method to the DataFrame in a loop.

def impute_missing_values(
    df: pd.DataFrame, imputers: list[DataFrameImputer]
) -> pd.DataFrame:
    for imputer in imputers:
        df = imputer.impute(df)
    return df

By doing so, users can extend the functionalities of the impute_missing_values function without modifying the underlying code.

import numpy as np  # needed by RandomImputer below

df = pd.DataFrame(
    {
        "Cat": ["A", None, "B", "B"],
        "Num": [1, None, 3, None],
    }
)

class RandomImputer(DataFrameImputer):
    def __init__(self, features: Iterable[NumberOrStr]):
        self.features = features

    def impute(self, df: pd.DataFrame) -> pd.DataFrame:
        for feature in self.features:
            nan_mask = df[feature].isna()
            non_nan_values = df.loc[~nan_mask, feature].tolist()
            if non_nan_values:  
                df.loc[nan_mask, feature] = nan_mask.loc[nan_mask].apply(
                    lambda x: np.random.choice(non_nan_values)
                )
        return df

imputers = [
    ConstantImputer(features=["Num"], fill_value=0),
    RandomImputer(features=["Cat"]), # USE NEW IMPUTER
]

# DOESN'T CHANGE impute_missing_values
# The value filled into "Cat" is drawn at random, so the output below is one possible result.
impute_missing_values(df, imputers=imputers)

"""
  Cat  Num
0   A  1.0
1   A  0.0
2   B  3.0
3   B  0.0
"""

Strengthen code against invalid data inputs

Problem 1

The second half of the Robustness Principle states: “Be liberal in what you accept from others.” This means that your code should be able to handle inputs that contain unexpected variations.

However, the GroupStatisticImputer class violates this principle by not handling cases where the group_feature column contains missing values.

In that case the impute method raises no error, yet the output still contains missing values, which is not what the user expects. In the example below, the row with no group even loses its original value of 3.

class GroupStatisticImputer(DataFrameImputer):
    def __init__(
        self,
        strategy: Literal["most_frequent", "median", "mean"],
        group_feature: str,
        target_feature: str,
    ):
        self.strategy = strategy
        self.group_feature = group_feature
        self.target_feature = target_feature

    def impute(self, df: pd.DataFrame) -> pd.DataFrame:
        impute_function = get_strategy_function(self.strategy)
        df[self.target_feature] = df.groupby(self.group_feature)[
            self.target_feature
        ].transform(impute_function)
        return df

df = pd.DataFrame(
    {"Group": ["A", "A", None, "B"], "Value": [1, None, 3, None]}
)
imputer = GroupStatisticImputer(
    strategy="mean", group_feature="Group", target_feature="Value"
)

# CONTAINS MISSING VALUES
imputer.impute(df)
"""
  Group  Value
0     A    1.0
1     A    1.0
2  None    NaN
3     B    NaN
"""

Solution

To address this issue, we will raise a ValueError if the group_feature column contains missing values.


class GroupStatisticImputer(DataFrameImputer):
    def __init__(
        self,
        strategy: Literal["most_frequent", "median", "mean"],
        group_feature: str,
        target_feature: str,
    ):
        self.strategy = strategy
        self.group_feature = group_feature
        self.target_feature = target_feature

    def impute(self, df: pd.DataFrame) -> pd.DataFrame:
        if df[self.group_feature].isna().any():
            raise ValueError(
                f"Group feature {self.group_feature} cannot contain NaN values"
            )
        impute_function = get_strategy_function(self.strategy)
        df[self.target_feature] = df.groupby(self.group_feature)[
            self.target_feature
        ].transform(impute_function)
        return df

df = pd.DataFrame(
    {"Group": ["A", "A", None, "B"], "Value": [1, None, 3, None]}
)
imputer = GroupStatisticImputer(
    strategy="mean", group_feature="Group", target_feature="Value"
)
imputer.impute(df)

"ValueError: Group feature Group cannot contain NaN values"

Problem 2

Users may want to apply a single imputer instead of multiple ones. Yet, if they pass an individual imputer instance rather than a list, Python will try to iterate over the object, leading to a TypeError.

def impute_missing_values(
    df: pd.DataFrame, imputers: list[DataFrameImputer]
) -> pd.DataFrame:
    for imputer in imputers:
        df = imputer.impute(df)
    return df

df = pd.DataFrame(
    {
        "Group": ["A", "A", "B", "B"],
        "Value": [1, None, 3, None],
    }
)

# USE ONE IMPUTER
group_imputer = GroupStatisticImputer(
    strategy="mean", group_feature="Group", target_feature="Value"
)
impute_missing_values(df, group_imputer)

"""
def impute_missing_values(
        df: pd.DataFrame, imputers: list[DataFrameImputer]
    ) -> pd.DataFrame:
>       for imputer in imputers:
E       TypeError: 'GroupStatisticImputer' object is not iterable
"""

Solution

To accommodate this, we will enable users to provide a single DataFrameImputer instance as the imputers argument by turning the imputers into a list if it is not already one.


def check_variables_is_list(variables: Union[Any, Iterable[Any]]) -> Iterable[Any]:
    if isinstance(variables, list):
        return variables
    return [variables]

def impute_missing_values(
    df: pd.DataFrame, imputers: Union[DataFrameImputer, list[DataFrameImputer]]
) -> pd.DataFrame:
    imputers_ = check_variables_is_list(imputers) # Add this
    for imputer in imputers_:
        df = imputer.impute(df)
    return df
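One caveat with the isinstance(variables, list) check: a tuple of imputers is not a list, so it would be wrapped into a one-element list and the loop would then call .impute on the tuple itself. If that matters for your callers, a slightly more permissive helper is one option. The sketch below is a hypothetical variant (the name as_list is invented here), not part of the article's code.

```python
from typing import Any


def as_list(variables: Any) -> list:
    """Hypothetical variant of check_variables_is_list that also accepts tuples."""
    if isinstance(variables, (list, tuple)):
        # Copy into a new list so callers always get a list back.
        return list(variables)
    # Anything else, including a single imputer or a string, is wrapped as one item.
    return [variables]
```

The rest of this section keeps the article's check_variables_is_list.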

With check_variables_is_list in place, a single DataFrameImputer instance can be passed directly as the imputers argument:

df = pd.DataFrame(
    {
        "Group": ["A", "A", "B", "B"],
        "Value": [1, None, 3, None],
    }
)
group_imputer = GroupStatisticImputer(
    strategy="mean", group_feature="Group", target_feature="Value"
)
impute_missing_values(df, group_imputer)

"""
  Group  Value
0     A    1.0
1     A    1.0
2     B    3.0
3     B    3.0
"""

Conclusion

In this article, we have transformed a function from:

import pandas as pd


def impute_missing_values(df):
    # Fill missing values with group statistics
    df["MSZoning"] = df.groupby("MSSubClass")["MSZoning"].transform(
        lambda x: x.fillna(x.mode()[0])
    )
    df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
        lambda x: x.fillna(x.median())
    )

    # Fill missing values with constant
    df["Functional"] = df["Functional"].fillna("Typ")

    df["Alley"] = df["Alley"].fillna("Missing")
    for col in ["GarageType", "GarageFinish", "GarageQual", "GarageCond"]:
        df[col] = df[col].fillna("Missing")

    for col in ("BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2"):
        df[col] = df[col].fillna("Missing")

    df["FireplaceQu"] = df["FireplaceQu"].fillna("Missing")

    df["PoolQC"] = df["PoolQC"].fillna("Missing")

    df["Fence"] = df["Fence"].fillna("Missing")

    df["MiscFeature"] = df["MiscFeature"].fillna("Missing")

    numeric_dtypes = ["int16", "int32", "int64", "float16", "float32", "float64"]
    for i in df.columns:
        if df[i].dtype in numeric_dtypes:
            df[i] = df[i].fillna(0)

    # Fill missing values with mode
    df["Electrical"] = df["Electrical"].fillna("SBrkr")
    df["KitchenQual"] = df["KitchenQual"].fillna("TA")
    df["Exterior1st"] = df["Exterior1st"].fillna(df["Exterior1st"].mode()[0])
    df["Exterior2nd"] = df["Exterior2nd"].fillna(df["Exterior2nd"].mode()[0])
    df["SaleType"] = df["SaleType"].fillna(df["SaleType"].mode()[0])
    for i in df.columns:
        if df[i].dtype == object:
            df[i] = df[i].fillna(df[i].mode()[0])
    return df

to a more refined version:

import pandas as pd
from typing import Union
from utils import check_variables_is_list


def impute_missing_values(
    df: pd.DataFrame, imputers: Union[DataFrameImputer, list[DataFrameImputer]]
) -> pd.DataFrame:
    imputers_ = check_variables_is_list(imputers)
    for imputer in imputers_:
        df = imputer.impute(df)
    return df

This new function is scalable, maintainable, and testable, allowing for easy integration of new data and methods as needed.
