I made these notes to help my students:
- Get a broad overview of the world (the jungle?) of Python for scientific computing and data analysis.
- Quickly get a robust and usable Python setup (instead of getting lost by manually installing individual packages).
Last update: December 2018
1. Setting up a Python installation
Overview: to do scientific computing in Python, one needs:
- The Python interpreter (version 3.6, 3.7 or more recent; version 2.7 is becoming obsolete).
- Several extra packages from the so-called "Scientific Python Stack" that extend the language for scientific computing: NumPy, Matplotlib, pandas... (the equivalent of "toolboxes" in Matlab).
- A development environment (just one, or a few different ones to adapt to the development task at hand).
To install all these tools, I highly recommend using a Python distribution: it installs everything in one step, and then helps keep it all up to date over the months and years.
- Recommended: Anaconda Distribution by Anaconda (formerly Continuum Analytics). Free for everyone
- Alternative: Canopy by Enthought. However, as of December 2018, it hasn't been updated since April 2018, and is stuck with Python 3.5
- Slightly not recommended: on Linux distributions, use the Python packages from the distribution. These get easily outdated (esp. for pandas)
- Highly not recommended: installing the Python interpreter from python.org and then all the extra packages manually.
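Once a distribution is installed, it can be useful to check that the interpreter and the main packages are actually available. Here is a minimal sketch of such a check, using only the standard library. The helper name `check_setup` and the list of packages are assumptions chosen for illustration, so adapt them to your needs.

```python
import importlib
import sys

# Packages to look for: an illustrative selection, not an official list
PACKAGES = ["numpy", "scipy", "matplotlib", "pandas"]

def check_setup(packages):
    """Map each package name to its version string, or None if it cannot be imported."""
    report = {}
    for name in packages:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None
        else:
            # Most packages expose __version__, but it is only a convention
            report[name] = getattr(module, "__version__", "unknown")
    return report

if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    for name, version in check_setup(PACKAGES).items():
        print(name, ":", version if version is not None else "NOT INSTALLED")
```

With a complete distribution, every line should show a version number. A "NOT INSTALLED" line means the package is missing from the Python environment that ran the script.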
2. Getting started
a) Learning Python
Two online resources, selected among tons of others:
Software Carpentry's lesson Programming with Python.
“Our real goal isn’t to teach you Python, but to teach you the basic concepts that all programming depends on.”
A more comprehensive set of tutorials for scientific computing: Scipy Lecture Notes
“One document to learn numerics, science, and data with Python”
Python for Data Analysis
Also, many students coming from a Matlab background found Enthought’s Pandas Cheat Sheets very useful for an overview of the key functions for data analysis and visualization (NB: you need to give an email address to download them).
b) Development environment: (too?) many options
Unlike Matlab, there is no single environment (IDE) for computing/developing in Python. One has to choose among quite a few options, but they can all be classified into three groups:
- Integrated Development Environment (IDE)
- Text editor + Python shell
- Jupyter Notebook
For an easier transition for Matlab practitioners, I recommend option 1 (IDE). I'm using options 2 & 3 (text editor or Notebook). Notice that there is no lock-in: one can easily switch between development environments to adapt to the development task at hand.
Opt 1: IDE dedicated for scientific programming
Recommended for Matlab practitioners: an IDE dedicated to scientific programming. I know three such IDEs:
- Recommended: Spyder, a fully open source environment. It will be familiar to Matlab users (editor panel, console, variable explorer...) although it may not be as polished. Spyder is included in the Anaconda distribution, which is the recommended way to install it.
- Alternatives:
- Canopy comes with its own IDE (probably good, but I never tested it). Comes with a promising Interactive Graphical Debugger.
- Pyzo (formerly IEP). Open source, plays nicely with the Anaconda distribution. Its "design is aimed at simplicity and efficiency", which sounds good, but I have only heard about it and never tested it.
Notice: it is possible to use a general purpose IDE like PyCharm, but such environments can be overly complex and not fully adapted to scientific usage (they are intended for professionals spending 10h/day developing things like web apps).
Opt 2: Text editor + Python shell
Using an integrated environment (opt 1) is fine but not necessary. Instead, it is possible to work only with a text editor along with an interactive Python shell (the command line prompt).
Text editor
The text editor (not to be confused with a Word Processor like Microsoft Word) should be optimized for writing code, with features like automatic color-coding of key words.
On Windows: the open source Notepad++ is pretty nice (pre-installed on Supélec computers). Not recommended: the default Notepad of Windows (no color highlighting, etc...).
On Linux: too many to choose from (I'm using Gnome's Gedit).
Multi-platform alternatives:
- Atom is a pretty new (2014) & fancy (fashionable?) editor. I'm using it quite a lot, but for web development (HTML, Javascript, CSS) rather than for Python. However, it surely supports Python.
- Bonus point: nice support for version control with Git.
- Microsoft's free & open source Visual Studio Code, with its Python extension. See also the Python tutorial section of its manual. Also very fashionable.
- Python comes with IDLE as standard. I never used it much, but some like it.
Python shell (Console)
(this is the command line where you type Python code and run Python scripts)
To increase productivity, I strongly recommend using the IPython shell over the regular Python command line. Extra features include code completion, color highlighting... Since IPython is very popular among scientific Python users, it is included in Python distributions like Anaconda.
The IPython shell also integrates in the Spyder IDE (cf. Opt 1 above), so it's possible to take the best of both worlds (nice IDE + nice console).
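To make the "text editor + shell" workflow concrete, here is a minimal sketch of a script written in the editor and then run from the shell. The file name `demo_stats.py` and its content are hypothetical, chosen only for illustration.

```python
# demo_stats.py (hypothetical file name): a small script written in the text editor
import statistics

def summarize(values):
    """Return the mean and the sample standard deviation of a list of numbers."""
    return statistics.mean(values), statistics.stdev(values)

if __name__ == "__main__":
    # This part runs only when the file is executed as a script
    data = [1.0, 2.0, 4.0, 5.0]
    mean, std = summarize(data)
    print("mean =", mean, "std =", std)
```

From the regular command line, the script is run with `python demo_stats.py`. From the IPython shell, it is run with `%run demo_stats.py`, after which the variables of the script (`data`, `mean`, `std`) remain available in the session for interactive exploration, much like the Matlab workspace.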
Opt 3: Jupyter notebook
Jupyter notebook (a spin-off project of IPython) provides a very different environment to do computing. The interface (inspired by Mathematica) enables mixing code with text and images, including the output of code execution (numbers, graphs) into one document. It makes it easy to create computing narratives to go beyond a simple numerical script. I recommend it.
Just one word of advice. I use Jupyter notebooks a lot, but I don't want to recommend them without a soft warning to the novice user: the computing paradigm of notebooks is quite different from the {script + shell} paradigm (as in Matlab). Indeed, the execution of code in the Notebook is based on the notion of code cells, executed one after another (and one cell can be re-executed many times) within the same session: the state of the session depends on the order in time. The execution order is not based on the spatial order (from top to bottom) as in a regular script. The consequence: after many edits of code cells, it becomes difficult to remember what was executed to create the current state of the session. Therefore, I recommend refactoring the notebook regularly (say every 10-30 minutes) so that cells can be executed linearly on the next day, when starting from a blank session.
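The difference between time order and spatial order can be illustrated without a notebook. The sketch below is only a simulation: each "cell" is a string of code run with `exec` in a shared namespace, which is a simplification of what a real Jupyter kernel does. The helper `run_cells` and the cell contents are hypothetical.

```python
# Three "cells", listed in their spatial (top to bottom) order
cells = [
    "x = 1",       # cell 0
    "x = x + 10",  # cell 1
    "y = 2 * x",   # cell 2
]

def run_cells(cells, order):
    """Execute the cells in the given order within one shared session (namespace)."""
    session = {}
    for i in order:
        exec(cells[i], session)
    return session["x"], session["y"]

# Linear execution, as when starting from a blank session and running top to bottom
print(run_cells(cells, [0, 1, 2]))     # prints (11, 22)

# Interactive history: cell 1 was re-executed once before running cell 2
print(run_cells(cells, [0, 1, 1, 2]))  # prints (21, 42)
```

Same cells, two different final states: only the first one can be reproduced on the next day from a blank session. In Jupyter, "Restart & Run All" (in the Kernel menu) is a quick way to check that a notebook still runs linearly.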