How to Structure a Data Science Project for Maintainability

Motivation

Structuring your data science project according to a consistent standard makes it much easier for your teammates to maintain and modify it.

But what kind of standard should you follow? Wouldn’t it be convenient if you had a template to create an ideal structure for your data science project?

That is why I created a repository named data-science-template. This repository is the result of years spent refining the best way to structure a data science project so that it is reproducible and maintainable.

In this article, you will learn how to use this template to incorporate best practices into your data science workflow.

Feel free to play with and fork the source code of this article.

Get Started

To download the template, start with installing Cookiecutter:

pip install cookiecutter

Create a project based on the template:

cookiecutter https://github.com/khuyentran1401/data-science-template --checkout dvc-poetry

You will then be prompted to enter some details about your project.

Now a project with the specified name is created in your current directory! Its structure, abbreviated here to the files discussed in this article, looks like this:
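.
├── config/
│   └── main.yaml              # Hydra configuration files
├── data/
│   ├── raw/                   # raw data
│   ├── processed/             # intermediate data
│   └── final/                 # final data
├── docs/                      # generated API documentation
├── src/
│   └── process.py             # Python source code
├── .pre-commit-config.yaml    # pre-commit hooks
├── Makefile                   # commands to set up the environment
└── pyproject.toml             # Poetry dependencies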

The tools used in this template are:

  • Poetry: A tool that manages Python dependencies
  • Hydra: A tool that manages configuration files
  • pre-commit plugins: Hooks that automate code review and formatting
  • pdoc: A tool that automatically creates API documentation for your project

In the next few sections, we will learn the functionalities of these tools.

Install Dependencies

Poetry is a Python dependency management tool and is an alternative to pip. Poetry allows you to:

  • Store flexible version constraints for packages in the “pyproject.toml” file, so your project can adapt to newer releases.
  • Store the exact version of each package and its dependencies in the “poetry.lock” file, ensuring reproducible installs.
  • Efficiently remove packages and their associated dependencies.
  • Efficiently resolve dependencies and address any conflicts promptly.
  • Package your project in a few lines of code.
https://mathdatasimplified.com/2023/06/12/poetry-a-better-way-to-manage-python-dependencies
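For instance, the dependency section of a “pyproject.toml” might declare flexible constraints like the following (the packages and versions here are illustrative, not prescribed by the template):

# pyproject.toml (illustrative excerpt)
[tool.poetry.dependencies]
python = "^3.8"   # any 3.x release from 3.8 onward
pandas = "^2.0"   # any 2.x release from 2.0 onward

Running poetry install resolves these constraints and records the exact versions it picked in “poetry.lock”.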

To activate a new environment, run:

poetry shell

To install all dependencies, run:

poetry install

To add a new PyPI library, run:

poetry add <library-name>

To remove a library, run:

poetry remove <library-name>
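Poetry also accepts a version constraint directly in the add command; for example (pandas is used here purely as an illustration):

poetry add "pandas>=2.0,<3.0"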

Check Issues in Your Code Before Committing

When committing Python code to Git, it is crucial to ensure that:

  1. Consistent style: The code follows consistent conventions for indentation, line length, and whitespace usage, which makes code reviews easier.
  2. Docstrings: The code includes docstrings to enhance maintainability.

However, verifying all these criteria manually before each commit can be overwhelming. Thus, this template uses pre-commit to automate the process. Pre-commit executes the hooks specified in the “.pre-commit-config.yaml” file to identify simple issues before committing the code.

This template uses the following hooks:

  • Ruff: An extremely fast Python linter written in Rust. It supports 500 lint rules, many of which are inspired by popular tools like Flake8, isort, pyupgrade, and others.
  • black: A code formatter for Python.
  • interrogate: Checks your code base for missing docstrings.
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/charliermarsh/ruff-pre-commit
    rev: "v0.0.275"
    hooks:
      - id: ruff
        args: [--fix, --exit-non-zero-on-fix]
  - repo: https://github.com/psf/black
    rev: 23.3.0
    hooks:
      - id: black
  - repo: https://github.com/econchick/interrogate
    rev: 1.5.0
    hooks:
      - id: interrogate
        args: [src, -v, -i, --fail-under=80]
        pass_filenames: false

To add pre-commit to git hooks, type:

pre-commit install

Now, whenever you run git commit, your code will be automatically checked and reformatted before being committed.

$ git commit -m 'my commit'
ruff.......................................Failed
- hook id: ruff
- exit code: 1
Found 3 errors (3 fixed, 0 remaining).
black....................................Failed
- hook id: black
- files were modified by this hook
reformatted src/process.py
All done! ✨ 🍰 ✨
1 file reformatted.
interrogate..........................Passed
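By default, the hooks run only on the files staged for commit. To check the entire code base at any time, you can run the hooks manually:

pre-commit run --all-files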

Makefile

A Makefile allows you to create short commands for repetitive tasks that usually involve multiple steps, such as environment setup.

This Makefile has two targets: dependencies and env.

The dependencies target initializes a new Git repository, installs the project dependencies, and sets up pre-commit hooks.

The env target runs the dependencies target first, then activates the virtual environment created by Poetry.

# Makefile
dependencies: 
	@echo "Initializing Git..."
	git init
	@echo "Installing dependencies..."
	poetry install
	poetry run pre-commit install
env: dependencies
	@echo "Activating virtual environment..."
	poetry shell

To set up the environment for your projects, others can simply execute a single command:

# Set up the environment
make env

This single command runs the entire sequence: initializing Git, installing dependencies, setting up pre-commit hooks, and activating the virtual environment.

Manage Configuration Files with Hydra

It is essential to avoid hard-coding values, as doing so gives rise to various issues:

  • Maintainability: When values are scattered throughout the codebase, updating them consistently becomes harder, which can lead to errors or inconsistencies.
  • Reusability: Hardcoded values limit the reusability of code for different scenarios.
  • Security concerns: Hardcoding sensitive information like passwords or API keys directly into the code is a security risk.
  • Testing and debugging: Hardcoded values make testing and debugging more challenging.

Configuration files solve these problems by storing parameters separately from the code, which enhances code maintainability and reproducibility.

https://mathdatasimplified.com/2023/05/25/stop-hard-coding-in-a-data-science-project-use-configuration-files-instead
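To make the contrast concrete, here is a minimal sketch (the paths and column names mirror the configuration shown below):

# Hard-coded: changing the path or columns means editing the source
import pandas as pd

df = pd.read_csv("data/raw/sample.csv")
df = df[["col1", "col2"]]

# Config-driven: the same logic reads its parameters from a config object,
# so the code stays untouched when the values change
def process(config):
    df = pd.read_csv(config.data.raw)
    return df[list(config.process.use_columns)]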

Among the numerous Python libraries available for creating configuration files, Hydra stands out as my preferred configuration management tool because of its impressive set of features, including:

  • Convenient parameter access
  • Command-line configuration override

and more.

Suppose the “main.yaml” file located in the “config” folder looks like this:

# main.yaml
data:
  raw: data/raw/sample.csv
  processed: data/processed/processed.csv
  final: data/final/final.csv
process:
  use_columns:
    - col1
    - col2

Within a Python script, you can effortlessly access a configuration file by applying a single decorator to your Python function. By using the dot notation (e.g., config.data.raw), you can conveniently access specific parameters from the configuration file.


import hydra
from omegaconf import DictConfig


@hydra.main(config_path="../config", config_name="main", version_base=None)
def process_data(config: DictConfig):
    """Function to process the data"""
    print(f"Process data using {config.data.raw}")
    print(f"Columns used: {config.process.use_columns}")


if __name__ == "__main__":
    process_data()

To override the configuration from the command line, type:

python src/process.py data.raw=data/raw/sample2.csv
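Hydra can also launch the script once per value in a comma-separated list via its --multirun flag, which is handy for quick parameter sweeps:

python src/process.py --multirun data.raw=data/raw/sample.csv,data/raw/sample2.csv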

Add API Documentation

As data scientists often collaborate with other team members on projects, it is important to create comprehensive documentation for these projects. However, manually creating documentation can be time-consuming.

This template leverages pdoc to generate API documentation based on the docstrings of your Python files and objects.
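Because the documentation is generated from docstrings, any documented object shows up automatically. For instance, a hypothetical function like the one below would be rendered with its docstring as the description:

# src/clean.py (hypothetical example)
import pandas as pd


def clean_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize column names to lowercase snake_case.

    pdoc renders this docstring as the function's API documentation.
    """
    df = df.copy()
    df.columns = [col.strip().lower().replace(" ", "_") for col in df.columns]
    return df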

To create documentation for your project, run:

make docs

Output:

Save the output to docs...
pdoc src --http localhost:8080
Starting pdoc server on localhost:8080
pdoc server ready at http://localhost:8080


The documentation is now saved as Markdown files, and you can access it by opening http://localhost:8080 in your web browser.

Conclusion

Congratulations! You have just learned how to structure your data science project using a data science template. This template is designed to be flexible, so feel free to adjust it to fit your own applications.


I love writing about data science concepts and playing with different data science tools. You can stay up-to-date with my latest posts.
