This repository consists of a collection of python examples intended as an introduction to the usage of python in data analysis, especially for the advanced laboratories in physics at the University of Freiburg. In previous years, code examples for ROOT have been provided, but material on the usage of python was missing. The code examples in this repository follow the examples from the ROOT introductions.
Table of Contents
- Installation
- Prerequisites and Structure
- 'Hello World' Example
- Numpy Arrays
- Plotting Functions
- Plotting Data Points
- Reading, Plotting and Fitting Experimental Data
Installation
To get started with python for data analysis in the advanced laboratories you need the python interpreter. In this document we will use python3. The additional packages numpy, scipy and matplotlib are useful for data analysis and data presentation. To install all these packages on Ubuntu, you can run the following command line.
sudo apt-get install python3 python3-numpy python3-scipy python3-matplotlib
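If you are unsure whether the installation succeeded, a quick sanity check (not part of the tutorial files) is to import the three packages and print their versions:
# quick check that the required packages are installed
import numpy
import scipy
import matplotlib
print(numpy.__version__, scipy.__version__, matplotlib.__version__)
If one of the imports fails with an ImportError, the corresponding package is not installed correctly.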
Prerequisites and Structure
This tutorial assumes that you have a little experience in python, which includes variable assignment, function calling and function definition, if statements and for loops. The tutorial uses only the very basics, such as variable assignments and function calls, but it is certainly advisable to know about control structures.
To catch up on these aspects, you can refer to the python documentation.
The tutorial is structured into several examples, which might depend on each other. The examples show code snippets which you are supposed to copy to a text editor. The scripts can then be executed in a terminal. You are invited to use the interactive mode of python or ipython instead of copying the code to a text editor. Recently jupyter notebooks have become very popular. I recommend trying out these different platforms and choosing the one most suited for you.
This repository does not contain ready-made python example scripts or plots. The only resource with code is this README file. The idea behind this is that there should be no duplication of code snippets, which would eventually get out of sync. The repository is set up such that each commit triggers continuous integration tasks, which parse the examples from the README and execute them with the doxec package. This means you can download ready-made scripts and plots produced by the continuous integration task.
'Hello World' Example
The first example is basically a 'Hello World' script, to check whether python is running correctly. Create a file named hello_world.py and add the following content.
# load math library with sqrt function
import math
print("Example 1:")
# Strings can be formatted with the % operator. The placeholder %g prints a
# floating point number as decimal or with exponent depending on its
# magnitude.
print(" Square root of 2 = %g" % math.sqrt(2))
To run the example, open a terminal and tell the python interpreter to run your code.
$ python3 hello_world.py
Example 1:
Square root of 2 = 1.41421
Have you seen the expected output? Congratulations, you can move on to real-life examples.
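As an aside, %g is only one of several placeholders supported by the % operator; for example, %d formats integers and %.2f prints a float with two decimal places. A small illustration, not part of hello_world.py:
# %d formats an integer, %.2f a float with two decimal places
print("%d items cost %.2f euros" % (3, 2.5))
This prints 3 items cost 2.50 euros.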
Numpy Arrays
The standard data structure to store numerical data is the numpy array. Numpy arrays are defined in the numpy package and are implemented in a very efficient way.
To get started with numpy arrays, create a file np_arrays.py and add all lines listed in this chapter. The first line should be an import statement.
import numpy as np
In this example we create a numpy array numbers containing my favorite numbers from a python list.
numbers = np.array([4, 9, 16, 36, 49])
Having all these numbers in a numpy array makes bulk computations very efficient. Assume we want to calculate the square root of all these numbers: we can simply use numpy's sqrt method to perform the same operation on all elements of the array at the same time.
roots = np.sqrt(numbers)
Since the resulting variable roots is also a numpy array, we can perform similar operations on this variable.
something_else = 1.5 * roots - 4
Numpy arrays overload the typical arithmetic operations, such that the above statement benefits from numpy's efficient, vectorized (i.e. performing the same operation on many values) implementation. You should always think about a way to use such vectorized statements, and try to avoid manually looping over all the values. Using a python loop to run over 10^3 values is probably fine, but you don't want to wait for a python loop iterating over 10^6 or 10^9 values.
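As an aside, not part of np_arrays.py, the following sketch contrasts the two approaches; both compute the same square roots, but the vectorized call is dramatically faster for large arrays (the array size is an arbitrary illustration):
import numpy as np
values = np.arange(1, 10**6)
# slow: explicit python loop over one million values
roots_loop = np.array([v**0.5 for v in values])
# fast: a single vectorized numpy call
roots_vec = np.sqrt(values)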
Finally, add a print statement to check that all the calculations are as expected.
print("The result is", something_else)
When executed you should get the following printout.
$ python3 np_arrays.py
The result is [-1. 0.5 2. 5. 6.5]
Numpy offers many other functionalities which are beyond the scope of this basic introduction. It is definitely worth glancing at the documentation.
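As a small teaser of what numpy offers beyond this tutorial, here are a few frequently used operations; the values are purely illustrative:
import numpy as np
a = np.array([1.0, 2.0, 3.0, 4.0])
print(a.sum())   # 10.0, sum of all elements
print(a.mean())  # 2.5, arithmetic mean
print(a[1:3])    # [2. 3.], slicing works as for python lists
print(a * a)     # [1. 4. 9. 16.], element-wise product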
Plotting Functions
One major aspect of data analysis is data presentation, which includes the generation of diagrams and plots. You can use the powerful library matplotlib to create publication-quality plots from python. The goal of this example is to plot the cropped parabola f(x), which is limited to y=4 for x >= 2.
f(x) = \left\{\begin{array}{lr}
x^2, & \text{for } x < 2\\
4, & \text{for } 2 \leq x
\end{array}\right.
The final plot should look like this.
Create the file func_plot.py and add the following lines.
import numpy as np
import matplotlib.pyplot as plt
Plotting a function with matplotlib means plotting many points connected by a line. First we create an array with 200 equidistant values in the interval [-2.5, 3]. This array functions as a grid of x-values, for which we calculate the y values.
x = np.linspace(-2.5, 3, 200)
We can easily calculate the square of all these values with x**2. Cropping the right part is a bit more complex. First we create an index array of boolean values, which indicate whether x >= 2. This index array can be used to select the subset of y-values for which the condition holds. Finally we can assign the value 4 to this subset, thereby effectively cropping the parabola. The full example reads:
y = x**2
idx = (x >= 2)
y[idx] = 4
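As an aside, the same cropping can be written in a single expression with numpy's where function, which selects values element-wise depending on a condition; this is equivalent to the three lines above:
y = np.where(x >= 2, 4, x**2)  # 4 where x >= 2, x**2 elsewhere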
The final step of this example is to plot the points and connect them with a line using matplotlib's plot method. We can also add axis labels and save the resulting figure.
plt.plot(x, y)
plt.xlabel("$x$")  # latex syntax can be used
plt.ylabel("cropped parabola")
plt.savefig("cropped_parabola.eps")
Run your script and check that the file cropped_parabola.eps is created.
$ python3 func_plot.py
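If you prefer to inspect the plot interactively instead of opening the eps file, you can additionally call matplotlib's show method at the end of the script; this requires a graphical display:
plt.show()  # opens an interactive plot window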
Plotting Data Points
A typical task in the advanced laboratory might be to compare measured data points to an expected function. Let's assume the expected function is the cropped parabola f(x) from the previous example. Since we haven't actually measured real data, we will use random data points which are expected to follow f(x).
This example builds on the code from the previous example. Copy the file from the previous example to data_plot.py, such that we can append to it and keep the function plotting code. Please add the code snippets in this section to the file data_plot.py.
We generate the pseudo data points by adding random deviations to the expected y-values. Let's assume we measured data points for all half-integer x-values.
x_data = np.array([-2.5, -2, -1.5, -1, -.5, 0, .5, 1, 1.5, 2, 2.5, 3])
We evaluate the function f(x) again for the x_data values. The random deviations are drawn from a centered normal distribution with a standard deviation of 0.3. The third argument of numpy.random.normal specifies how many random samples we want to draw. We use len(x_data), since we want to draw a random deviation for each x-value.
y_data = x_data**2
y_data[x_data >= 2] = 4
y_data += np.random.normal(0, 0.3, len(y_data))
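Since the deviations are random, your numbers and plots will differ from run to run. If you want reproducible pseudo data, you can seed numpy's random number generator before the call to numpy.random.normal; the seed value 42 is an arbitrary choice:
np.random.seed(42)  # optional: makes the random deviations reproducible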
Finally we can add these data points to our plot. Since our data points are subject to statistical fluctuations, we would like to use matplotlib's errorbar method, which draws our data points with error bars.
plt.errorbar(x_data, y_data, 0.3, fmt="ko", capsize=0)
plt.savefig("measurement.eps")
The character k in the format parameter fmt sets the color to black, the o in fmt changes the style to large dots. The optional parameter capsize modifies the style of the error bars. You can play with these options and see what happens, or have a look at the documentation.
After running data_plot.py you should have a plot similar to this.
Reading, Plotting and Fitting Experimental Data
In this example we are given experimental data from a radioactive decay. The experimental setup consisted of a radioactive probe, a detector and a multi-channel analyzer. The recorded data in decay.txt consist of two tab-separated columns. The first column is called channel. Each channel corresponds to a certain energy range. The multi-channel analyzer maintains a counter for each channel. For each recorded decay, the internal counter which corresponds to the measured energy is incremented. The second column stores these counts. Open the file with your favorite text editor and have a look at the data.
As usual, create the file decay.py and add the import statements which we need for this example.
import numpy as np
import scipy.optimize # provides fit routines
import scipy.stats # provides statistical tests and distributions
import matplotlib.pyplot as plt
To inspect the provided data, we can plot the raw data points first. Numpy provides the function loadtxt, which reads a whitespace-separated file into a numpy array. The function returns a two-dimensional array. The outer array has one entry for each line; the inner arrays have two entries in our case, one for the channel and the other one for the event count. We can use transpose() to flip the matrix, such that the outer array has two entries, one with an array of channel values and the other one with event counts. Since our measured event counts stem from radioactive decay, we know that the event counts follow a Poisson distribution. Therefore, the uncertainty of the event counts is simply the square root of the number of events.
channel, count = np.loadtxt("decay.txt").transpose()
s_count = np.sqrt(count)
plt.errorbar(channel, count, s_count, fmt='.k', capsize=0, label="Data")
plt.savefig("decay_raw.eps")
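As a side note, loadtxt can perform the transposition for you: passing the keyword argument unpack=True returns the columns directly, which is equivalent to the transpose() call above:
channel, count = np.loadtxt("decay.txt", unpack=True)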
The plot of the raw data is shown below. The plot shows the channel on the
x-axis and the number of events per bin on the y-axis. From the experimental setup we expect a
linearly rising background plus a Gaussian peak.
Judging from the plot, it looks like the assumed model describes the data. We would like to fit this model to our data to determine the optimal values of the model parameters and their uncertainties (and the covariance matrix). Let's write down a more formal version of the expected model:
n(c) = A exp\left(-\frac{(c-m)^2}{2 s^2}\right) + y_0 + bc,
where n(c) is the expected number of events in channel c; A, m and s are the height, center and width of the Gaussian, respectively. The parameters y_0 and b are the usual parameters of a linear curve, which is assumed to describe our background. The model can be implemented in python as a function. The first parameter should be the x value; all following arguments are free parameters of the model. The return value corresponds to the y value, in our case the expected number of events.
def model(channel, m, s, A, y0, b):
return A * np.exp(-0.5 * (channel - m)**2 / s**2) + y0 + b * channel
Please note that we are going to make an approximation with this definition. Strictly speaking, comparing the return values of our model to the measured counts is not correct. The variable channel corresponds to the radiation energy measured with the setup. Let's assume channel c_i corresponds to energy E_i. If we measure n_i events in channel c_i, this means that we have measured n_i events in the energy interval [\frac{1}{2}(E_{i-1} + E_i), \frac{1}{2}(E_i + E_{i+1})]. The proper way is to integrate our continuous function n(c) over each bin [c_i - \frac{1}{2}, c_i + \frac{1}{2}]. The procedure shown here is a good approximation if the function can be considered linear within each bin. However, the parameters A and b are not normalized to the bin width in this case.
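If you want to convince yourself that the approximation is justified, you can compare the model value at a bin center with the integral of the model over that bin, for example with scipy's quad routine. The channel and parameter values below are just illustrative rough estimates:
import scipy.integrate
params = (60, 10, 50, 20, 1)  # rough estimates for m, s, A, y0, b
center_value = model(65, *params)
integral, _ = scipy.integrate.quad(model, 64.5, 65.5, args=params)
print(center_value, integral)  # nearly identical if the model is locally linear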
To fit this model to our experimental data, we can use the function curve_fit provided by the scipy package. The function curve_fit performs a least squares fit and returns the optimal parameters and the covariance matrix. The fit might not converge on its own. We can guide the optimization procedure by providing suitable start values for the free parameters. From the plot I read off a height A = 50, a center m = 60 and a width s = 10 for the Gaussian part, and y_0 = 20, b = 1 for the linear part. These values don't have to be accurate. A rough estimate is usually enough to get a stable fit result.
p0 = (60, 10, 50, 20, 1)
popt, pcov = scipy.optimize.curve_fit(model, channel, count, p0, np.sqrt(count))
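One detail worth knowing: by default curve_fit treats the sigma argument as relative weights when scaling the returned covariance matrix. Since our square-root uncertainties are absolute uncertainties, it is arguably more correct to pass absolute_sigma=True; for a good fit like this one the difference is small:
popt, pcov = scipy.optimize.curve_fit(model, channel, count, p0,
                                      np.sqrt(count), absolute_sigma=True)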
To visualize the fitted model, we need to evaluate our model with the optimized parameters popt. We are now ready to add the fitted curve to the plot and save it.
fit_count = model(channel, *popt)
plt.plot(channel, fit_count, label="Linear + Gauss")
plt.xlabel("Channel")
plt.ylabel("Counts")
plt.ylim(0, 1.1 * max(count))
plt.legend()
plt.savefig("decay.eps")
The result should look like this.
Usually we want to measure some quantity with the experimental setup. For this we need the optimized parameters and the covariance matrix returned by the fit. Let's assume we are interested in the best fit values of the parameters and their uncertainties. We can add the following printouts to display this kind of information.
print("Optimal parameters:")
print(" m = %g +- %g" % (popt[0], np.sqrt(pcov[0][0])))
print(" s = %g +- %g" % (popt[1], np.sqrt(pcov[1][1])))
print(" A = %g +- %g" % (popt[2], np.sqrt(pcov[2][2])))
print(" y0 = %g +- %g" % (popt[3], np.sqrt(pcov[3][3])))
print(" b = %g +- %g" % (popt[4], np.sqrt(pcov[4][4])))
print()
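The off-diagonal elements of the covariance matrix tell you how strongly the parameters are correlated. A compact way to inspect this is to normalize pcov into a correlation matrix; this is a small sketch, not part of the original example:
sigmas = np.sqrt(np.diag(pcov))         # parameter uncertainties
corr = pcov / np.outer(sigmas, sigmas)  # correlation matrix with entries in [-1, 1]
print(corr)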
A \chi^2-test can also be performed to assess the goodness of this fit. In a counting experiment like this one, we can rely on scipy's chisquare, which returns the \chi^2 and the p-value. The chisquare method assumes that the uncertainties are the square root of the expectation. If this is not the case, we have to compute the \chi^2 manually. The following example shows both methods; the print statements produce the same output. Please note that the number of degrees of freedom is reduced by five, since we have five free parameters in our model.
print("chi^2 from scipy:")
chi2, p = scipy.stats.chisquare(count, fit_count, ddof=5)
print(" chi2 / ndf = %g / %d" % (chi2, len(count) - 6))
print(" p-value = %g" % p)
print()
print("Manual chi^2 test:")
uncertainty = np.sqrt(fit_count)
chi2 = ((count - fit_count)**2 / uncertainty**2).sum()
p = scipy.stats.distributions.chi2.sf(chi2, len(count) - 6)
print(" chi2 / ndf = %g / %d" % (chi2, len(count) - 6))
print(" p-value = %g" % p)
If you run decay.py you should see the following fit results.
$ python3 decay.py
Optimal parameters:
m = 61.756 +- 0.696195
s = 8.59757 +- 0.708479
A = 43.2461 +- 3.24047
y0 = 20.6589 +- 1.0885
b = 0.841156 +- 0.0193703
chi^2 from scipy:
chi2 / ndf = 120.577 / 122
p-value = 0.519418
Manual chi^2 test:
chi2 / ndf = 120.577 / 122
p-value = 0.519418