Better stats with Python thanks to Patsy

Patsy, a Python package for describing statistical models and building design matrices using  "formulas" : a great addition in the ecosystem of Python statistical packages !

The first release a very nice package called Patsy was announced last week by Nathaniel Smith. Quoting Nathaniel's message, Patsy is « a Python package for describing statistical models and building design matrices using "formulas". Patsy's formulas are inspired by and largely compatible with those used in R ».

I should add to this message that Patsy is also King Arthur's assistant in the Monty Python and The Holy Grail film. While I discovered this by chance, I doubt it is a coincidence...

This new module is a great addition to the emerging ecosystem of Python statistical packages which already includes statsmodels and pandas as flagships, the latter being very actively developed. A complete description of Patsy's functionalities is provided on its overview page, but the best description to me is the clearly written quickstart section. Not all modules enjoy such a nice documentation, so let's enjoy !

Open Terminal Here

Friday afternoon, I finally got tired of a bug in "Open Terminal Here", a Bash script I was using so far. I made a Python replacement...

Open Terminal Here ?

"Open Terminal Here" just a simple Bash script meant to be launched from a right-click within a directory opened in Nautilus, the Gnome file manager. As the name states, the gnome-terminal should be launched with its working directory being set to the currently opened directory. Simply handy! Only one problem: setting the  working directory was failing whenever the name of directory contained non-ASCII characters. 🙁

This "Open Terminal Here" is not a standard functionality of Nautilus (as opposed to Dolphin where you can just press Maj+F4 to launch a "konsole"). Rather, you have to download an extension script from G-Scripts, a nice central repository, and drop it in a quite buried directory (~/.gnome2/nautilus-scripts). When it comes to the "Open Terminal Here" function, there are several scripts available. Choosing between these is up to the visitor...

"Open Terminal Here" in Bash

Here is the central part of the Bash script I was using so far:

base="`echo $NAUTILUS_SCRIPT_CURRENT_URI | cut -d'/' -f3- | sed 's/%20/ /g'`"
     while [ ! -z "$1" -a ! -d "$base/$1" ]; do shift; done

gnome-terminal --working-directory="$dir"

I had never paid attention to this code until Friday (especially since I have such a poor Bash understanding). I now realize that line 1 does most of the job, which is :

  1. retreive the $NAUTILUS_SCRIPT_CURRENT_URI environment variable.
  2. cut the "file://" prefix from the URI
  3. replace "%20" by real space characters " "

This Bash script works most of the time, except with directory names containing non-ASCII characters. Indeed, in a directory called "études notes" for example, Nautilus sets the URI variable to "file:///home/pierre/%C3%A9tudes%20notes". One can see how "é" gets encoded as "%C3%A9" and the Bash script doesn't perform any decoding.

Now "Open Terminal Here" with Python

In order to support non-ASCII, I tried to build my own script. I chose Python simply because that's the language I'm the most familiar with. The entire source is a GitHub Gist. Here are the main lines:

import os
from urllib import unquote
from subprocess import Popen

# 1a) Retrieve the URI of the current directory:
env = os.getenv("NAUTILUS_SCRIPT_CURRENT_URI", "/home/pierre")

# 1b) Process the URI to make it just a regular Path string
env = env.replace('file://', '') # Should fail with Python 3 ?!
env = unquote(env) # decode the URI

# 2) Launch gnome-terminal:
Popen(["gnome-terminal", '--working-directory=%s' % env])

Well, it's basically the same as in Bash, except with a useful call to the urllib.unquote function which performs all the URI decoding job. All imports are from the Python standard library: great !

And by the way, with Perl !

While writing this post, I realized I had on my laptop a second "Open Terminal Here" script. I had forgotten about it because it was disabled for I don't know what reason. It's a Perl script, also coming from G-Script. Here are the main lines:

use strict;

if ($_ and m#^file:///#) {
  exec "gnome-terminal --working-directory='$_'";

What can be said except that it also works ! The URI decoding is nicely performed by "chr(hex($1))". I actually think the author wrote this script for the very same reason I wrote a Python one: "he loves Perl!" 😉

Conclusion ?

I don't know if it's that useful to have three versions of a script to solve the very same tiny problem... Still, one can notice the vivid differences in the programming approach: while my Python scripts relies mostly function coming from standard modules, the Perl script is literally built on Regular Expressions. No surprise !

Finally, the Bash script just calls command line utilities (cut and sed), but doesn't decode the URI. Is there a command line program for that ?

And what about Haskell ?

I'll keep this for another day 😉 I need to get a Fistful of Monads first. Would be a nice experiment though...

Autoregressive processes : learning with an interactive tool

I recently got interested in second order autoregressive processes ("AR(2)" processes), after spending quite some time working only with first order ones... It was time to get an extra degree of freedom/complexity ! However, I didn't have much intuition about the behavior of these models and neither about how to "tune the knobs" that set this behavior so I spent some time on this subject.

For the sake of clarity, let's write down a definition/notation for an AR(2) process X_k:

X_k = a_1 X_{k-1} + a_2 X_{k-2} + Z_k

with the real numbers a1 and a2 being the two parameters that tune the "behavior" of the process. Z_k may be call the "input" in the context of linear filters, while "innovation" (i.e. a white noise input) would better fit in the context of random processes. A nice property of AR(2) processes, as opposed to AR(1) is their ability to capture slowly oscillating behaviors, like ocean waves. Also, within the AR(2) model, AR(1) is included as the special case a2 = 0.

Bits of a non-interactive study

In order to understand the interplay of the two parameters a1 and a2, I spent extensive time doing paper-based analytical/theoretical studies. I also created a IPython Notebook which I should publish when there is proper Notebook -> HTML conversion system (edit: now available on nbviewer). Most of the AR(2) behavior is governed by the characteristic polynomial P(z) = z^2 - a_1 z - a_2. This polynomial helps in showing that there are two regimes : the one with two real roots (when  a_2 > - a_1^2/4 ) and the one with two complex-conjugate roots (when  a_2 < - a_1^2/4 ) where I can get my long-awaited slow oscillations. A third regime is the special case of a double root  a_2 = - a_1^2/4 which I don't think is of a special interest.

Beyond the study of the root, I also computed the variance of the process and its spectral density. I found both not so easy to compute !

Creating an interactive tool

After the static paper-based study and the still semi-static computer-based study, where I was able to derive and plot the impulse response and some realization of random trajectories, I wanted to get a real-time interactive experience of the interplay of the two parameters a1 and a2. To this end I wrote a tiny Python script, based on the event handling machinery of Matplotlib. More specifically, I started from the "looking glass" example script which I altered progressively as my understanding of event handling in Matplotlib was getting better. Here is a frozen screen capture of the resulting plots:

screen capture from the Interactive AR2 demo
Interactive demo of some properties of an auto-regressive AR(2) process

I put the code of this script on a GitHub Gist. This code is quite crude but it should work (assuming that you have NumPy, SciPy and Matplotlib installed). I've implemented the plots of three properties of the process:

  1. the root locus diagram, where one can see the transition between the two real roots and the two complex-conjugate roots as the  a_2 = - a_1^2/4 boundary is crossed.
  2. the Impulse Response (output of the filter for a discrete impulse  Z_k = \delta_k ), implemented simply using scipy.signal.lfilter.
  3. the Frequency Response, which I normalized using the variance of the process so that it always has unit integral. This latter property is mainly based on my own analytical derivations so that it should not be taken as a gold reference !

My feeling about interactive plotting with Matplotlib

Overall, I felt pretty at ease with the event handling machinery of Matplotlib. I feel that writing an interactive plot didn't add too many heavy lines of code compared to a regular static plot. Also, the learning curve was very smooth, since I didn't have to learn a new plotting syntax, as opposed to some of my older attempts at dynamic plotting with Traits and Chaco. Still, I feel that if I wanted to add several parameters in a small interactive GUI, Traits would be the way to go because I haven't seen a nice library of predefined widgets in Matplotlib (it's not it's purpose anyway). So I keep the Traits and Chaco packages for future iterations !

Electricity consumption peaks

Thanks to the publicly available RTE éCO2mix data about French electricity, I was able to analyze the French electricity consumption since June 24th 2000 (15 minutes time-step averages), that is a bit more than 11 years. This is the first data analysis based on my éCO2mix downloading tool I announced a few days ago. It shows the tremendous increase (44 %) in the peak consumption value over this period.

Note : tools (Python script) and data that generated these curves and tables are openly available on GitHub.

Consumption evolution

To get a glimpse at what is going on with French electricity consumption, here is a plot over those 11 years.

weekly electricity consumption 2001-2011 (with records)
French electricity consumption : weekly averages, with records pointed with red diamonds

How to read this plot

  • The read line shows weekly averages so that weekly and daily variations are smoothed out.
  • The light blue shaded region which is the range between min and max consumption during the given week. This gives an idea of intra-week variability that the red curve doesn't show.
  • The red diamonds shows the successive consumption records (see below)

Pattern and trend

On this plot, one can see the seasonal consumption pattern which is basically:

  • less consumption during summers
  • more consumption during winters

The reason for this being roughly that there is a bigger need for heating and lightning during winter than during summer.

Also, there is a general increasing trend. More precisely, one can see that consumption during summer is just slowing increasing, while winter demand is getting bigger and bigger. I will now illustrate this with the analysis of consumption records.

Consumption records

I searched for successive records in the consumption data (starting June 2000). I found 27 records ("I" being my Python code...) which are all listed in the table below. All records (but one) happened around 7pm in winter, that is during the so called "evening peak", when people come back home and put both heating and lighting on (and start watching TV...).

Also, all these records happened during weekdays other than Saturday or Sunday. This is related to the weekly pattern which contains a significant drop on weekends.

A tremendous increase of peak demand

If I summarize these successive records over the years, here is what I get :

  • During Winter 2000-2001, the highest peak was 70.5 GW (at 7pm on Tuesday January 9th, 2001)
  • During Winter 2005-2006, the highest peak was 86.2 GW (at 7pm on Friday January 27th 2006)
  • During Winter 2011-2012 an historical record was set at 101.7 GW (at 7pm on Wednesday February 8th, 2012)

This is +44 % increase in peak demand, which I feel is quite huge over this not-so-long period of 11 years. This requires the commissioning of new dedicated power plants to avoid black outs during winter.

Rationale for this increase

I'm not a specialist of this topic, but the basic idea is that this tremendous increase is French specific (if compared to similar European countries like Germany or UK).

It comes from the fact that a high proportion of house heating is electric (around 30 %, and even more in new buildings). Each new building with an electric heating system is prone to be an additional peak load during winter, and they are a lot.

This is summarized by the "thermal sensitivity" of French electricity consumption : there is an increase of +2.3 GW each time the outside air temperature goes down by 1°C. In Germany, it is only +0.5 GW...

And why do French people rely so much on electric heating ? Perhaps because of the price and also the culture of "cheap nuclear electricity". French regulated price for electrical energy is about 0.11 €/kWh for residential customers (including taxes). The regulated price has been frozen for some years by the French government to avoid angering people. Also, the installation cost of electric heaters is lower than of a fuel/gas based central heating system.

However, in my opinion, French electricity price is bound to increase of maybe 30 % in the coming years, partly due to the need to either stop or rebuild nuclear power plants. It is sad that a lot of peoplee who will have to pay this extra cost where trapped in the first place by an artificially low electricity price.

Electricity consumption peaks table

French electricity consumption peaks between June 2000 and February 2012 (demand records computed with 15 minutes averages)
DatePower (GW)WeekdayHour

RTE éco2mix tools

RTE, the French Transmission system operator (TSO) launched in November 2010 the éco2mix web application where real-time updated data about the French electricity market (national consumption + detailed production) was made publicly available. Their Flash application displays this information nicely, but the raw data is also available for download, which is far more interesting...

I've started writing simple tools (Python-based) to download this data set and then process it. As of now, these tools are openly available on a GitHub repository. Various custom plotting functions should follow.

Interestingly enough, the éco2mix website proposes data only up to one month prior the current date. However, it appeared that the underlying data server can serve daily files as early as year 1900 ! Only those files are quite empty... 😉

The real daily data files are available starting June 26th 2000, but those only contain the national electricity consumption, along with the D-1 consumption forecast. The detailed electricity production data (Nuclear, Hydro, Wind, ...) has been only available since about 2010, that is when the éco2mix application was launched.