On Scientific Python Best Practices

I’ve been spending some time with my scientist friends trying to help spread good coding practices. As someone who has run rm -rf / multiple times before demos in front of people whose names you’d probably recognize and not gotten fired, I feel uniquely qualified to talk about these issues.

Before we get into the code, I’m going to try to give a little background on why these things should be done. Good programming practices are tricky. There’s no real rule book, and it’s often not immediately clear why they’re useful. To make the situation worse, it’s more of a spectrum than an absolute: being overly scrupulous comes at the cost of speed and vice versa. To truly nail this stuff, experience is key, but in lieu of that, I’ll give you my secret comprehensive list of invaluable principles:

Calvin’s Amazing Coding Principles

  1. You’re an idiot
  2. See (1)

You may have a lot of great qualities, you may be a great person, but rest assured: you are stupid, you will make mistakes, and you will pay for those mistakes if you don’t use protection. Good coding practices are that protection. The more of them you use, the safer (and more annoying) things get.

Setting up Your Development Environment

This guide is for Mac, but Linux shouldn’t be all that different. If you want to develop on Windows, only God can help you there. We’re going to walk through the boilerplate I use every time I start a new Python project. I’ve been using it for a while, and it seems to work pretty well. I’ve also provided a GitHub template of my setup here. Feel free to use it!

Creating the Workspace

mkdir ~/ws
cd ~/ws

The first thing I always do, if it doesn’t exist already, is create a workspace directory. The reason I create this directory is to separate my coding projects from the rest of my file system. Sometimes science can be messy, and your filesystem can get dirtier than, well, I already made the protection joke, so pick your favorite dirty thing. If we’re using this workspace directory, we can shove all of the mess we make into it, and if everything goes wrong, we can terminate it with extreme prejudice.

Why do we do this? Well, we’re acting under the assumption that your code is dirty, sick, and highly contagious. If we aren’t careful, it can reach out and infect the rest of your computer. Setting up your code this way is kind of like putting it into a closet that’s also an incinerator. If the pathogens start to leak, you turn it on and burn it all. Then, if you’ve combined it with source control (see below), you can bring it all back nice and clean, the way it was before.
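Concretely, “turning on the incinerator” is just deleting the offending project directory; the part where you bring it back clean comes from the git setup below (the directory name here is a made-up placeholder):

# burn a misbehaving project to the ground; git (below) brings it back clean
rm -rf ~/ws/<my-cool-project>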

If you don’t think you need this because you have a thing for clean file structures and have never had problems before, please refer to principle (1). Principle (2) will also work, but (1) is more direct.

Setting up the Github Repository

You should always use a source control system, and that’s because you’re an idiot. You’ve likely heard this before, but I’ll make it even easier: if you’re working in science, you should always use GitHub. GitHub is good. People use it. It works. Don’t get fancy. If you don’t know how to use git, google it. Explaining my git workflow is a bit out of scope for this post, but it may come in the future. There are two simple reasons for using git/GitHub:

  1. Other people can see your code
  2. You can go back to previous versions when you screw up

The first is pretty self-explanatory: being able to share science is useful. The second is also self-explanatory, but I’m pretty sure some readers still think they aren’t idiots. Remember (1). I guarantee there will be a time when you work for 8 hours and realize that what you really want is to get the code back to where it was before you started messing with it. You can either spend another 8 hours retracing your steps or type git reset --hard. It’s your choice. I know you’re an idiot, but come on.

Once you’ve set up the git repository, you can clone it into your workspace directory, and you’re good to move on to the next steps.
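For concreteness, cloning into the workspace looks roughly like this (the <your-username>/<your-repo> placeholders are hypothetical; substitute whatever you actually named your repository on GitHub):

cd ~/ws
# clone the repository you just created on GitHub into the workspace
git clone git@github.com:<your-username>/<your-repo>.git
cd <your-repo>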

Setting up the Conda Environment

# environment.yml
name: <my-cool-name>
channels:
  - conda-forge
  - defaults
dependencies:
  - numpy
  - pip
  - pip:
    # packages that only exist on PyPI go here
    - <some package not in conda>

The next thing I do is set up a conda environment.yml file. Chances are you’re going to be a pansy and not write all your parallel matrix operations from scratch, so you’re going to need third-party libraries. Now, you could just type pip install a bunch of times, which is fine until it isn’t. What happens when your friend wants to run your code to make sure it works, or you’re working on two projects at once and want to install different versions of numpy? What do you do then? Actually, don’t answer that, I’m scared. This is why we use isolated, reproducible environments. There are many choices here, but you should use conda. It does a lot of fancy stuff for you (google SAT solver, it’s pretty cool), but the nice thing is it usually just works and is (mostly) idiot-proof. Copy and paste that environment file into your project, give it a cool name, look up how to switch between conda environments, and you’re all set.
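In case it saves you a search, switching works roughly like this (assuming you kept the name from the name: field above):

# create or update the environment from the file
conda env update -f environment.yml
# switch into the environment
conda activate <my-cool-name>
# switch back out when you're done
conda deactivate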

Writing the Makefile

# Makefile (note: recipe lines must be indented with a real tab, not spaces)
.DEFAULT_GOAL := help

install: ## Install the conda environment
    conda env update -f environment.yml

help: ## Show this help message
    @grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | sort | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-30s\033[0m %s\n", $$1, $$2}'

We’re doing fantastic, but the trouble with conda is that the commands can get hard to remember. Your installation is also sometimes more complex than just setting up the conda environment. For example, you might have a package you want to install in -e mode or some data files to download. Just remembering all these commands can get pretty complicated. Too complicated for an idiot. To combat this, the next thing I do is add a self-documenting Makefile. Make is an OG build system and the backbone of a lot of really cool code, but we’re not really going to use any of its cool features. Instead, we’re just going to have it run the series of commands our code needs to work. In this case, we’re just going to have it update our conda environment, but we could add more commands if we wanted to. It’s kind of like using an AK-47 to open a jar of pickles, but it actually works pretty well.

This makes installing our project as easy as typing make install. The self-documenting part is nice because typing make on its own will print out all the documented targets along with the documentation you put after the two hashes. Swingin’!
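As a sketch of what “more commands” might look like, here’s a hypothetical extra target for grabbing data files (the URL and filename are made up; substitute your own):

download-data: ## Download the raw data files
    mkdir -p data
    curl -L -o data/raw.csv https://example.com/raw.csv

Typing make download-data runs those commands, and the target shows up in the make listing automatically because of the ## comment.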

Writing the README.md

The last thing we’re going to do is add some documentation. Why do we need documentation? I mean, obviously it’s going to help other people, but screw them, what’s in it for me? Well, I haven’t said it in a while, so go take another long hard look at (1). We’re human, we forget things. You’ve probably already forgotten the rest of the steps before this, but that’s probably my fault. The README you write should always contain three things:

  1. A description of what the code does
  2. How to install the code
  3. How to use the code

It’s really hard to keep documentation updated, and it’s a constant struggle even on professional software teams, but it’s probably the most important thing you can do. The simple truth is that your code exists to do things. If nobody knows how to use it and it can’t do the things it’s supposed to, it’s just a bunch of garbled BS. You owe it to your code: keep your documentation up to date.
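If a skeleton helps, a minimal README.md hitting those three points might look something like this (the project name, environment name, and run command are placeholders from the setup above):

# <my-cool-project>

One or two sentences describing what the code actually does.

## Installation

    make install
    conda activate <my-cool-name>

## Usage

    python <your_script>.py  # replace with however your code actually runs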

Conclusions

And that’s all it takes. I mean it’s kind of a lot, but code is hard. I hope you’ll use some or all of these tips in your work. If you’re unconvinced, I’m not too worried because all I have to do is wait until you’re inevitably an idiot. Then I can say I told you so, which is honestly so much better than having good science code in the world.
