How to Set Up a Data Extraction Python Project?

I have started a project in which I want to extract source data from a Customer Relationship Management (CRM) system and load the raw data into Snowflake for transformation and then reporting. What I found a bit intimidating when starting this project was the blank Python project. I wanted to set up a sensible project structure that lent itself to key data engineering principles such as version control and Don't Repeat Yourself (DRY) code, to name a couple.

Consequently, I thought it would be useful to write up the structure I settled on, along with my justifications, to get people thinking about the importance of project structure and to offer a basic entry point for others who have a blank slate and do not necessarily want to download a structure from GitHub.

I opted to set this project up with PyCharm. I like that PyCharm establishes a virtual environment for the project upon creation, with the terminal already configured to run inside it. With some config files, another developer could download the repository and be ready to branch off and develop on the project.

PyCharm also offers a pretty effective debugger that I have grown familiar with, and a Git interface that I like for its seamless integration into the project.

PyCharm GUI

The interface has a panel at the bottom to navigate between a terminal pre-configured to the virtual environment, a package overview and a version control tab that lets you branch off an existing repository.

The panel down the left-hand side lets you navigate between the project file structure and options to commit changes to a repository/branch you are working on.

The Setup

This is the file structure I opted for with some placeholder files in the tests and capsule_api_connections subdirectories.

project_name
	capsule_api_connections/ 
		__init__.py 
		main.py 
		api_client.py 
		Opportunities.py
	tests/ 
		test_api_client.py 
		test_opportunities.py
		# Additional test files 
	venv/
		virtual environment files
	.gitignore 
	README.md
	pyproject.toml
	.env
	requirements.txt
	docs/
		docs related to project
		

The .gitignore is set up to ignore both the virtual environment and the .env file where environment variables are set, ensuring that secrets are not captured in version control.
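For anyone who would rather generate this skeleton than create it by hand, the layout and .gitignore described above could be scaffolded with a short script. This is only a sketch: the scaffold function and the two constants are hypothetical names of my choosing, the files are created empty, and the venv directory is left out because PyCharm (or python -m venv) creates it.

```python
from pathlib import Path

# Paths copied from the tree above; every file is created empty.
PROJECT_FILES = [
    "capsule_api_connections/__init__.py",
    "capsule_api_connections/main.py",
    "capsule_api_connections/api_client.py",
    "capsule_api_connections/Opportunities.py",
    "tests/test_api_client.py",
    "tests/test_opportunities.py",
    "README.md",
    "pyproject.toml",
    "requirements.txt",
    ".env",
]

# The two exclusions described in the text: the virtual environment and .env.
GITIGNORE_LINES = ["venv/", ".env"]


def scaffold(root: Path) -> None:
    """Create the empty project skeleton under `root` (hypothetical helper)."""
    for relative in PROJECT_FILES:
        path = root / relative
        # Creates `root` and any sub-directory that does not exist yet.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch(exist_ok=True)
    (root / "docs").mkdir(exist_ok=True)
    (root / ".gitignore").write_text("\n".join(GITIGNORE_LINES) + "\n")
```

Calling scaffold(Path("project_name")) from the directory that should contain the project produces the structure shown above.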

Having established the overarching structure, I'd now like to break down certain parts, credit the inspiration for the setup and outline some lessons from the process.

Inspiration for the Chosen Structure

My repository design choices were mainly motivated by the article on structuring a Python project. My other motivation was to build my project out with modules and avoid repeated sections of code, based on what I had learnt on my journey to the dbt Analytics Engineer Certification.

As a result, I made some sub-directories off the root folder along similar principles to a dbt project; in particular, a tests directory where, later in development, I will add automated tests to guide development, similar to the tests that are run in a dbt project after a build command. In an effort to modularize my code, I made a package directory to house my scripts and facilitate the use of relative imports, keeping the code as streamlined as possible.

The Dagster article on project structure was also interesting, and is the source of the graphic below explaining the package structure within a Python project.

Virtual Environment

PyCharm creates the virtual environment when a user makes a new project, and the terminal is pre-configured to run within it. This is good practice: each project gets its own Python environment, so its dependencies cannot conflict with the versions installed for other projects. I added the virtual environment to the .gitignore file as I did not want to clutter the repository with items not directly associated with the project. The dependencies themselves are managed through a pyproject.toml.

pyproject.toml

I opted to have a pyproject.toml rather than a setup.py, as online recommendations suggested that it was a more modern solution. Inside the file I specified package-specific metadata and also used setuptools to find the dependencies for the package. If the user has cloned the repository and is in the root directory where this pyproject.toml is located, the dependencies can be installed with the simple command:

pip install .

requirements.txt

requirements.txt offers an alternative, more traditional means of installing the dependencies. The following command installs everything specified in the requirements file:

pip install -r requirements.txt

Git Integration

Although not exclusive to PyCharm, I like its interface for configuring version control for the project and then creating a remote repository on GitHub. Subsequent commits, with commit messages and a check box for each file to track, make for a user-friendly experience.

Testing

Unit testing is an integral part of a Python project. As such, my proposed project structure has a tests sub-directory where I plan to use the unittest package, together with its unittest.mock module, to mock certain imports for individual modules and ensure that the code produces the expected output. The goal of setting up these test files is to eventually automate them as a check before a merge to the main branch.
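As an illustration of the kind of test I have in mind, here is a minimal sketch using unittest and unittest.mock. The function fetch_opportunity_names, the /opportunities path and the shape of the JSON payload are all assumptions made up for the example; they are not taken from the real API or from my modules.

```python
import unittest
from unittest.mock import Mock


def fetch_opportunity_names(session, base_url):
    """Hypothetical helper: return the name of every opportunity."""
    response = session.get(f"{base_url}/opportunities")
    response.raise_for_status()
    return [item["name"] for item in response.json()["opportunities"]]


class FetchOpportunityNamesTest(unittest.TestCase):
    def test_returns_names_from_payload(self):
        # The mock stands in for an HTTP session, so no network call is made.
        session = Mock()
        session.get.return_value.json.return_value = {
            "opportunities": [{"name": "Renewal"}, {"name": "Upsell"}]
        }

        names = fetch_opportunity_names(session, "https://api.example.test")

        self.assertEqual(names, ["Renewal", "Upsell"])
        session.get.assert_called_once_with(
            "https://api.example.test/opportunities"
        )
```

Saved as a file in the tests directory, a test like this can be run from the project root with python -m unittest discover tests.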

.env

Something that might be overlooked by those new to version control is the risk of including secrets or other sensitive information in a repository. The approach I opted for here was a simple setup of environment variables in my .env file, which are then loaded into the Python modules using the dotenv package. As stated earlier, the .env file is added to .gitignore so that it is not uploaded to the remote repository, thereby keeping the sensitive details private.
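A minimal sketch of that loading step might look like the following. The get_secret helper is a hypothetical name, and the import is wrapped in a try/except only so that the sketch still runs where the python-dotenv package is not installed; in the project itself the package would simply be a dependency.

```python
import os

try:
    # load_dotenv comes from the third-party python-dotenv package.
    from dotenv import load_dotenv
except ImportError:
    # Guard only so this sketch still runs without the package installed.
    load_dotenv = None


def get_secret(name: str) -> str:
    """Return an environment variable, reading .env first if possible."""
    if load_dotenv is not None:
        # By default, variables already set in the environment are kept.
        load_dotenv()
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

A module would then call something like get_secret("CAPSULE_API_TOKEN"), where the variable name is whatever key was chosen in the .env file. Failing loudly when the value is missing makes a forgotten .env file obvious straight away.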

There are alternative approaches, such as injecting the values at run time using a cloud platform or a GitHub Action, which I think makes sense for a production environment.

Python Modules

Since Python 3.3, the __init__.py file has not been compulsory for a directory to be recognized as a package. I have included it to explicitly mark the directory as a package. It is also a place to add any code that needs executing upon package import.
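The difference can be seen with a small experiment. The sketch below builds two throwaway directories in a temporary folder, one with an __init__.py and one without, and imports both. The directory names and the GREETING variable are made up for the demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def compare_package_kinds():
    """Import one directory with __init__.py and one without (demo names)."""
    with tempfile.TemporaryDirectory() as root:
        regular_dir = Path(root) / "demo_regular_pkg"
        regular_dir.mkdir()
        # Code in __init__.py runs once, when the package is first imported.
        (regular_dir / "__init__.py").write_text("GREETING = 'imported'\n")
        (Path(root) / "demo_namespace_pkg").mkdir()

        sys.path.insert(0, root)
        importlib.invalidate_caches()
        try:
            regular = importlib.import_module("demo_regular_pkg")
            namespace = importlib.import_module("demo_namespace_pkg")
        finally:
            sys.path.remove(root)

        return {
            "regular_greeting": regular.GREETING,
            "regular_has_init_file": regular.__file__.endswith("__init__.py"),
            # A namespace package has no __init__.py to point at.
            "namespace_file": getattr(namespace, "__file__", None),
        }
```

Both imports succeed, but only the directory with __init__.py runs code on import and exposes GREETING. The one without becomes a namespace package, which is why I prefer to keep the file and make the intent explicit.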

Documentation

As I go through the project I have a space to add any documentation for parts of the project that might need dedicated docs.

Author:
Edward Hayter
Powered by The Information Lab
© 2024 The Information Lab