
This repository consists of a collection of python examples intended as an introduction to the use of python in data analysis, especially for the advanced laboratories in physics at the University of Freiburg. In previous years, code examples were provided for ROOT; material showing how to use python for the same tasks was missing. This document tries to fill that gap.

If you think this tutorial is useful, lacks essential information, or is unclear, don't hesitate to give feedback.

Table of Contents

  1. Installation
  2. Prerequisites and Structure
  3. 'Hello World' Example
  4. Numpy Arrays
  5. Plotting Functions
  6. Plotting Data Points
  7. Reading, Plotting and Fitting Experimental Data
  8. Further Reading

Installation

To get started with python for data analysis in the advanced laboratories, you need the python interpreter. In this document, we will use python3. The additional packages numpy, scipy and matplotlib are useful for data analysis and data presentation.

Ubuntu

To install all the packages on Ubuntu, you can run the following commands.

$ sudo apt-get update
$ sudo apt-get install -y python3 python3-pip
$ pip3 install --user numpy scipy matplotlib

The --user argument makes pip install the python packages in your home directory, where they take precedence over (and potentially hide) older versions installed with apt-get.
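To verify that the installation succeeded, you can, for example, import numpy and print its version. This one-liner is just a quick check and not part of any of the following examples.

$ python3 -c "import numpy; print(numpy.__version__)"

If this prints a version number without an error, you are ready to go.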

Windows

Since I'm not using python on Windows myself, I don't have first-hand experience with it. However, I think Anaconda is a good solution for Windows users since it provides all required packages.
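With Anaconda, the scientific packages can typically be installed with the conda package manager. The following command is a sketch and assumes that conda is available on your command line.

$ conda install numpy scipy matplotlib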

Prerequisites and Structure

This tutorial assumes that you have some experience with python, which includes variable assignment, function calls and definitions, if statements, and for loops. The tutorial uses only the very basics, such as variable assignments and function calls, but it is certainly advisable to know about control structures.

To catch up on these aspects, you can refer to the python documentation. If you are already familiar with another programming language, it should be quite intuitive to switch to python.

This tutorial is divided into several examples, which might depend on each other. The examples show code snippets which you are supposed to copy to a text editor and save to a file; the scripts can then be executed in a terminal. Besides this modus operandi, you are invited to use the interactive mode of python or ipython instead and copy the code directly into the python interpreter. Recently, jupyter notebooks have become very popular. I recommend trying out these different options and choosing the one most suited to you.
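For example, a short interactive session in the plain python interpreter could look like this (the >>> prompt is printed by python, you only type the statements):

$ python3
>>> import math
>>> math.sqrt(2)
1.4142135623730951
>>> exit()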

Side remark: This repository does not contain ready-made python example scripts or plots; the only code resource is this README file. The idea behind this is that there shouldn't be duplicated code snippets, which would eventually go out of sync. The repository is set up such that each commit triggers continuous integration tasks on the server, which parse the examples from the README and execute them with the doxec package. This means you can download ready-made scripts and plots produced by the continuous integration task.

Let's get going!

'Hello World' Example

The first example is basically a 'Hello World' script, to check whether python is running correctly. Create a file named hello_world.py and add the following content.

# Load math library with sqrt function.
import math

print("Example 1:")

# Strings can be formatted with the % operator. The placeholder %g prints a
# floating point number in decimal or exponent notation (e.g. 3e9), depending
# on its magnitude.
print("  Square root of 2 = %g" % math.sqrt(2))

To run the example, open a terminal and tell the python interpreter to run your code.

$ python3 hello_world.py
Example 1:
  Square root of 2 = 1.41421

Have you seen the expected output? Congratulations, you can move on to real-life examples.

Numpy Arrays

The standard data structure to store numerical data is a numpy array. Numpy arrays are defined in the numpy package and are implemented in a very efficient way.

To get started with numpy arrays, create a file np_arrays.py and add all lines listed in this section. The first line should be an import statement.

# Import the numpy library.
import numpy as np

In this example, we create the numpy array numbers containing my favorite numbers from the python list [4, 9, 16, 36, 49].

# Create a numpy array by passing a list to np.array.
numbers = np.array([4, 9, 16, 36, 49])

Having all these numbers in a numpy array makes bulk computations very efficient. Assume we want to calculate the square root of all these numbers. We can simply use numpy's sqrt method to perform the same operation on all elements of the array at the same time.

# Calculate the square root for each item in the array numbers.
roots = np.sqrt(numbers)

Since the resulting variable roots is also a numpy array, we can perform similar operations on this variable.

# Execute other computations for each element separately.
something_else = 1.5 * roots - 4

Numpy arrays overload the typical arithmetic operations, such that the above statement benefits from numpy's efficient, vectorized (i.e. performing the same operation on many values) implementation. You should always look for a way to use such vectorized statements and try to avoid manually looping over all the values. Using a python loop to run over $`10^3`$ values is probably fine, but you don't want to wait for a python loop iterating over $`10^8`$ or $`10^9`$ values.
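For illustration, here is the explicit loop that the vectorized np.sqrt call replaces. This snippet is only a side-by-side comparison and not part of np_arrays.py.

# For comparison only: an explicit python loop computing the same result as
# np.sqrt(numbers). Fine for a handful of values, slow for millions.
import math
roots_loop = np.array([math.sqrt(n) for n in numbers])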

Finally, add a print statement to check that all the calculations have been carried out as expected.

print("The result is", something_else)

When executed, you should get the following printout. You can convince yourself that the calculations are correct.

$ python3 np_arrays.py
The result is [-1.   0.5  2.   5.   6.5]

Numpy offers many other functionalities which are beyond the scope of this basic introduction. It is worth glancing at the documentation.
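To give you a taste, here is a small sketch of a few other operations on the numbers array from above; none of these lines are needed for the examples that follow.

# Aggregations and slicing work directly on the array.
print(numbers.sum())    # sum of all elements: 114
print(numbers.mean())   # arithmetic mean: 22.8
print(numbers[1:3])     # slicing works like for lists: [ 9 16]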

Plotting Functions

One major aspect of data analysis is the visualization of data. This includes the generation of diagrams and plots. You can use the powerful matplotlib library to create high-quality plots in a python environment. The goal of this example is to plot the cropped parabola

f(x) = \left\{\begin{array}{lr} x^2, & \text{for } x < 2\\ 4, & \text{for } 2 \leq x \end{array}\right.,

which looks like this:

Plot of cropped parabola

Create the file func_plot.py and add the following lines.

# Import the numpy library.
import numpy as np

# Import the powerful plotting library.
import matplotlib.pyplot as plt

Plotting a function with matplotlib means plotting many points connected by a line. First, we create an array with 200 equidistant $`x`$-values in the interval $`[-2.5, 3]`$. This array functions as a grid, for which we calculate the $`y`$-values.

x = np.linspace(-2.5, 3, 200)

We can easily calculate the square of all these values with x**2. Cropping the right part is a bit more tricky. First, we create an index array of 1's and 0's, which indicates for each entry whether $`x \geq 2`$. This index array has the same length as our $`x`$-grid. The first elements of the index array are 0's, since the corresponding $`x`$-values are less than two. At some point in the array, the values change to 1, since from then on the corresponding $`x`$-values satisfy $`x \geq 2`$. The index array can be used to select a subset of $`y`$-values, namely all $`y`$-values for which $`x \geq 2`$. Finally, we can assign the value $`4`$ to this subset, and thereby effectively crop the parabola. The python implementation of the algorithm outlined above is rather short.

# Calculate the regular parabola.
y = x**2

# Create index array.
idx = (x >= 2)

# Set all y-values to 4, for which x >= 2.
y[idx] = 4

# One can get rid of the intermediate index array and combine both lines into
# the statement y[x >= 2] = 4

The final step of this example is to call matplotlib, which plots the points and connects them with a line. Additionally, we can add axis labels and save the resulting figure.

# Plot a line specified by x- and y-arrays.
plt.plot(x, y)

# Set axis labels. LaTeX expressions can be used.
plt.xlabel("$x$")
plt.ylabel("Cropped Parabola")

# Save the figure. Different output formats are available.
plt.savefig("cropped_parabola.eps")

Run your script and check that the file cropped_parabola.eps is created.

$ python3 func_plot.py

Plotting Data Points

A typical task in the advanced laboratory is to compare measured data points to an expected function. Let's assume the expected function is the cropped parabola $`f(x)`$ from the previous example. We will use random data points in this example, since we haven't actually measured real data which is expected to follow $`f(x)`$.

This example is based on the code from the previous example. Copy the file from the previous example to data_plot.py, so that we can append the following code snippets to it. Keep the plotting code from the previous example as it is.

We generate the pseudo data points by adding random deviations to the expected $`y`$-values. Let's pretend we have measured data points for all half-integer $`x`$-values in the interval $`[-2.5, 3]`$.

# Create x-value grid for the "measured" data.
x_data = np.array([-2.5, -2, -1.5, -1, -.5, 0, .5, 1, 1.5, 2, 2.5, 3])

We evaluate the function $`f(x)`$ again for the x_data values. The resulting array y_data matches the curve from the previous example perfectly. We then draw random deviations from a centered normal distribution with a standard deviation of 0.3 and add them to y_data. The third argument of numpy.random.normal specifies how many random samples we want to draw; we use len(y_data) since we want to draw an independent random deviation for each data point.

# Calculate square of x_data points.
y_data = x_data**2

# Crop y_data points if x_data >= 2.
y_data[x_data >= 2] = 4

# Add random fluctuations to y_data.
y_data += np.random.normal(0, 0.3, len(y_data))
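As an aside, if you want the pseudo data to be reproducible between runs, you can fix the seed of numpy's random number generator before drawing the deviations, for example:

# Optional: fix the seed before calling np.random.normal to make the
# pseudo data reproducible between runs.
np.random.seed(42)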

Important: Forging data, manipulating data or copying from others is a clear violation of "good scientific practices". Violations will have consequences.

Finally, we can add this to our plot. Since our data points are subject to statistical fluctuations, we would like to use matplotlib's errorbar method, which draws our data points as dots with error bars.

# Draw with error bars, similar to plot(). The third parameter is the size of
# the error bar in $`y`$ direction.
plt.errorbar(x_data, y_data, 0.3, fmt="ko", capsize=0)

# Save figure.
plt.savefig("measurement.eps")

The character k in the format parameter fmt sets the color to black; the o in fmt changes the style to large dots. The optional parameter capsize modifies the style of the error bars. You can play with these options to see what happens when you change them, or have a look at the documentation for more information.
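As a sketch of what such a change could look like (this line is an alternative, not part of data_plot.py): the format string rs-- draws red squares connected by a dashed line.

# Alternative style: red squares connected by a dashed line, with small caps.
plt.errorbar(x_data, y_data, 0.3, fmt="rs--", capsize=3)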

After running data_plot.py, you should have a plot similar to this.

Plot of cropped parabola with data points

Reading, Plotting and Fitting Experimental Data

In this example, we are given experimental data from a radioactive decay. The experimental setup consisted of a radioactive sample, a detector, and a multi-channel analyzer. The recorded data in decay.txt consist of two tab-separated columns. The first column is called channel. Each channel corresponds to a certain energy range. The multi-channel analyzer maintains a counter for each channel. A detected decay causes the multi-channel analyzer to increment the internal counter which corresponds to the energy of the measured decay. The second column stores these counts. Open the file with your favorite text editor and have a look at the data.

As usual, create the file decay.py and add the import statements needed for this example.

# Import the numpy library.
import numpy as np

# Import the scipy library with fit routines.
import scipy.optimize

# Import the scipy library with probability distributions and statistical tests.
import scipy.stats

# Import the plotting library.
import matplotlib.pyplot as plt

To inspect the provided data, we can plot the raw data points first. Numpy provides the function genfromtxt, which reads a whitespace-separated file into a numpy array. The function returns a two-dimensional array: the outer array has one entry for each line in the text file, and the inner array has two entries in our case, one for the channel and one for the event count. We can use transpose() to flip the matrix, such that the outer array has two entries, one with an array of channel values and the other with event counts. Since our measured event counts stem from radioactive decay, we know that they follow a Poisson distribution. Therefore, the uncertainty on each event count is simply the square root of the number of events.
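As a small aside (not part of decay.py), the effect of transpose() can be seen on a tiny hand-made array:

# A 3x2 array with one row per line of a file becomes a 2x3 array with
# one row per column after transposing.
demo = np.array([[1, 10],
                 [2, 20],
                 [3, 30]])
print(demo.transpose())
# [[ 1  2  3]
#  [10 20 30]]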

# Read both columns from the text file.
channel, count = np.genfromtxt("decay.txt").transpose()

# Calculate the uncertainty on the number of events per channel.
s_count = np.sqrt(count)

# Create and save a raw version of the plot with data points.
# The label will later be used to identify the curves in a legend.
plt.errorbar(channel, count, s_count, fmt='.k', capsize=0, label="Data")
plt.savefig("decay_raw.eps")

The plot of the raw data is shown below. It shows the channel on the $`x`$-axis and the number of events per bin on the $`y`$-axis. From the experimental setup, we expect a linearly rising background plus a Gaussian peak.

Data only for the decay

Judging from the plot, it looks like the assumed model could describe the data. We would like to fit this model to our data to determine the optimal values of the model parameters and their uncertainties (and the covariance matrix). Let's give a more formal version of the expected model

n(c) = A \exp\left(-\frac{(c-m)^2}{2 s^2}\right) + y_0 + bc,

where $`n(c)`$ is the expected number of events in channel $`c`$; $`A`$, $`m`$ and $`s`$ are the height, center, and width of the Gaussian, respectively. The parameters $`y_0`$ and $`b`$ are the usual parameters of a linear curve, which is assumed to describe our background. The model can be implemented in python as a function. The first parameter should be the $`x`$-value; all following arguments are the free parameters of the model. The return value corresponds to the $`y`$-value, in our case the expected number of events.

def model(channel, m, s, A, y0, b):
    return A * np.exp(-0.5 * (channel - m)**2 / s**2) + y0 + b * channel

Please note that we are making an approximation with this definition. Strictly speaking, comparing the return values of our model to the measured counts is not exactly correct. The variable channel corresponds to the energy measured with the setup. Let's assume channel $`c_i`$ corresponds to the energy $`E_i`$. If we measure $`n_i`$ events in channel $`c_i`$, this means that we have measured $`n_i`$ events in the energy interval $`[\frac{1}{2}(E_{i-1} + E_i), \frac{1}{2}(E_i + E_{i+1})]`$. The proper way is to integrate our continuous function $`n(c)`$ in each bin $`[c_i - \frac{1}{2}, c_i + \frac{1}{2}]`$ and compare these bin-wise integrals to the measured data. The procedure shown here is a good approximation if the function can be considered linear within each bin. However, the parameters $`A`$ and $`b`$ are then not normalized to the bin width.

To fit this model to our experimental data, we can use the function curve_fit provided by the scipy package. The function curve_fit performs a least-squares fit and returns the optimal parameters and the covariance matrix. The fit might not converge on its own. We can guide the optimization procedure by providing suitable start values for the free parameters. From the plot, I read off a height $`A = 50`$, a center $`m = 60`$ and a width $`s = 10`$ for the Gaussian part, and $`y_0 = 20`$ and $`b = 1`$ for the linear part. These values don't have to be accurate; a rough estimate is usually enough to get a stable fit result. More information on the fitting method can be found in the documentation.

# Define the initial values of the free parameters.
# Remember that we defined our model as n(c; m, s, A, y0, b)
p0 = (60, 10, 50, 20, 1)

# Perform the actual fit. The parameters are
#   (1) Model to fit
#   (2) Array of x-values
#   (3) Array of y-values to which the model should be fitted
#   (4) Array with initial values for the free parameters
#   (5) Array with uncertainties on the y-values.
popt, pcov = scipy.optimize.curve_fit(model, channel, count, p0, s_count)

To visualize the fitted model, we need to evaluate our model with the optimized parameters popt.

# Evaluate the model with the optimized parameters.
fit_count = model(channel, *popt)

# Plot a curve representing the fitted model.
plt.plot(channel, fit_count, label="Linear + Gauss")

# Add axis labels.
plt.xlabel("Channel")
plt.ylabel("Counts")

# Add a legend to identify data and our fit. This method uses values passed to
# the optional argument 'label' of plot() and errorbar().
plt.legend()

# Save the figure.
plt.savefig("decay.eps")

The result should look like this.

Data and fit for the decay

Usually, we want to measure some quantity with an experimental setup. For this, we need the optimized parameters and the covariance matrix returned by the fit. Let's assume we are interested in the best-fit values of the parameters and their uncertainties. The uncertainties are the square roots of the diagonal entries of the covariance matrix. We can add the following print statements to display this information.

print("Optimal parameters:")
print("  m = %g +- %g"  % (popt[0], np.sqrt(pcov[0][0])))
print("  s = %g +- %g"  % (popt[1], np.sqrt(pcov[1][1])))
print("  A = %g +- %g"  % (popt[2], np.sqrt(pcov[2][2])))
print("  y0 = %g +- %g" % (popt[3], np.sqrt(pcov[3][3])))
print("  b = %g +- %g"  % (popt[4], np.sqrt(pcov[4][4])))
print()  # print blank line
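Equivalently, all five uncertainties can be extracted at once from the diagonal of the covariance matrix; a short sketch:

# Square roots of the diagonal elements of the covariance matrix.
perr = np.sqrt(np.diag(pcov))
print("Uncertainties:", perr)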

A $`\chi^2`$-test can be performed to assess the goodness of this fit. In a counting experiment like this one, we can rely on scipy's chisquare, which returns the $`\chi^2`$ value and the $`p`$-value. The chisquare method assumes that the uncertainties are the square roots of the expected number of events. If this is not the case, we have to compute the $`\chi^2`$ manually. The following example shows both the usage of chisquare and the manual computation; the print statements for both methods produce the same output. Please note that our model has five free parameters. The total number of degrees of freedom is the number of data points reduced by the number of free parameters of our model.

# Delta degrees of freedom
ddof = 5

print("chi^2 from scipy:")

# Call chisquare and print.
chi2, p = scipy.stats.chisquare(count, fit_count, ddof=ddof - 1)
print("  chi2 / dof = %g / %d" % (chi2, len(count) - ddof))
print("  p-value = %g" % p)

print()  # print blank line

print("Manual chi^2 test:")

# Calculate chi^2
uncertainty = np.sqrt(fit_count)
chi2 = ((count - fit_count)**2 / uncertainty**2).sum()

# Calculate p-value and print
p = scipy.stats.distributions.chi2.sf(chi2, len(count) - ddof)
print("  chi2 / dof = %g / %d" % (chi2, len(count) - ddof))
print("  p-value = %g" % p)

The additional - 1 in the call of chisquare is necessary because the method implicitly assumes that the total number of events is fixed, and therefore reduces the number of degrees of freedom by one. However, in our case the number of decays is not fixed, so we have one more degree of freedom than chisquare assumes. The - 1 in the ddof argument corrects for this.

If you run decay.py, you should see the following fit results.

$ python3 decay.py
Optimal parameters:
  m = 61.756 +- 0.696195
  s = 8.59757 +- 0.708479
  A = 43.2461 +- 3.24047
  y0 = 20.6589 +- 1.0885
  b = 0.841156 +- 0.0193703

chi^2 from scipy:
  chi2 / dof = 120.577 / 123
  p-value = 0.544943

Manual chi^2 test:
  chi2 / dof = 120.577 / 123
  p-value = 0.544943

Congratulations! You have mastered the first steps to analyze experimental data with python.

Further Reading

This tutorial cannot cover all the topics which might be relevant for the advanced laboratories. Here is a list of online resources which might be useful.