Data analysis with Python and Poetry
In my current role I work alongside data scientists, supporting them with code and infrastructure, as we aim to use data to drive more effective development progress in low- and middle-income countries working towards the Sustainable Development Goals.
I’m always keen to expand my skills, so I have been looking for opportunities to carry out interesting data science projects and better understand the full range of work we do.
Recently I found some interesting data that I was keen to explore, with a problem in mind that I hoped it could solve. I wanted to explore the data locally first, using Python and Jupyter notebooks. Here is the best way I found to set that environment up.
Firstly, you will need to install Python locally. For this I use pyenv, which helps with managing Python versions across multiple projects.
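As a rough sketch (the version number here is only an example, pick whichever release you need), installing and pinning a Python version for the project looks something like:
pyenv install 3.11.4   # download and build this Python version
pyenv local 3.11.4     # pin it for the current directory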
We are going to be using Poetry to manage our dependencies for this project. For those familiar with the JavaScript ecosystem, Poetry takes a similar approach to how npm works with Node projects, in that it defines a pyproject.toml file which resembles a package.json file. Poetry will also handle our virtual environment for us, as we will come to later.
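If Poetry itself is not installed yet, there are a couple of routes described in its documentation; one option (shown here only as a sketch, check the docs for the current recommendation) is to install it with pipx:
pipx install poetry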
For now we need to set up our project structure. You can either bootstrap the entire project with poetry new my-project, or, as we will do here, create the folder structure you want and then use poetry init to bootstrap the necessary files. So we run
mkdir my-project
cd my-project
poetry init
On running init, Poetry walks you through the configuration options, most of which have sensible defaults. You will also have the opportunity to define any packages you would like to install. In this case we will search for and add pandas for the data analysis, plus jupyter and ipykernel to allow us to spin up the notebook.
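If you would rather skip the interactive prompts, the same dependencies can be added afterwards with poetry add, roughly:
poetry add pandas jupyter ipykernel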
Note that these packages are not installed globally on your machine. Running jupyter --version at this point will return something like zsh: command not found: jupyter. Instead, we need to open a shell within a virtual environment which has these packages installed. We do this by running
poetry install
poetry shell
We now have a terminal process running which has access to the aforementioned packages, without having needed to install them globally.
Within this process we can run
jupyter notebook
to start the Jupyter server and begin digging into our data with the other packages we have defined.
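If you prefer not to keep a dedicated shell open, poetry run will execute a single command inside the project's virtual environment, for example:
poetry run jupyter notebook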
Resources I found useful while developing this process:
[1] Hypermodern Python
[2] Poetry Documentation