This repository contains a collection of python examples intended as an
introduction to the use of python in data analysis, especially for the
advanced laboratories in physics at the University of Freiburg. In previous
years, code examples for [ROOT](https://root.cern.ch/) have been provided.
Material showing how to use python for the same tasks was missing. This document
tries to fill the gap.

If you think this tutorial is useful, lacks important information, or is
unclear, don't hesitate to give [feedback](mailto:frank@sauerburger.com).

# Table of Contents
1. [Installation](#installation)
2. [Prerequisites and About the Tutorial](#prerequisites-and-about-the-tutorial)
3. ['Hello World' Example](#hello-world-example)
4. [Numpy Arrays](#numpy-arrays)
5. [Plotting Functions](#plotting-functions)
6. [Plotting Data Points](#plotting-data-points)
7. [Reading, Plotting and Fitting Experimental Data](#reading-plotting-and-fitting-experimental-data)
8. [Further Reading](#further-reading)

# Installation
To get started with python for data analysis in the advanced laboratories you
need the python interpreter. In this document, we will use `python3`. The
additional packages `numpy`, `scipy` and `matplotlib` are useful for data
analysis and data presentation.

## Ubuntu
To install all the packages on Ubuntu, you
can run the following commands.

<!-- console -->
```bash
$ sudo apt-get update
$ sudo apt-get install -y python3 python3-pip
$ pip3 install --user numpy scipy matplotlib
```

The `--user` argument for `pip` installs the python packages in your home
directory, where they take precedence over potentially older versions installed
system-wide with `apt-get`.

## Windows
Since I'm not using python on Windows myself, I don't have first-hand experience
with it. However, I think [Anaconda](https://www.continuum.io/downloads) is a
good solution for Windows users since it provides all the required packages.

# Prerequisites and About the Tutorial

This tutorial assumes that you have some experience with python, including
variable assignment, function calls, function definitions, if statements, and
for loops. The tutorial itself uses only the very basics, such as variable
assignments and function calls, but it is certainly advisable to know about
control structures.

To catch up on these aspects, you can refer to the [python
documentation](https://docs.python.org/3/tutorial/). If you are already
familiar with another programming language, it should be quite intuitive to
switch to python.

This tutorial is divided into several examples, which may build on each
other. The examples show code snippets which you are supposed to copy to a text
editor. The scripts can then be executed in a terminal. Besides this modus
operandi, you are invited to use the interactive mode of python or ipython
instead and copy the code directly into the interpreter. Recently,
[jupyter](http://jupyter.org/) notebooks have become very popular. I recommend
trying out these different options and choosing the one best suited for you.

This repository does not contain ready-made python example scripts or plots. The
only code resource is this README file. The idea behind this is that code
snippets should not be duplicated, since duplicates eventually fall out of sync.
The repository is set up such that each commit triggers continuous integration
tasks on the server, which parse the examples from the README and execute them
with the [doxec](https://gitlab.sauerburger.com/frank/doxec) package. This means
you can download [ready-made scripts and
plots](https://gitlab.sauerburger.com/frank/FP-python-examples/-/jobs/artifacts/master/download?job=doxec_test)
produced by the continuous integration task.

Let's get going!

# 'Hello World' Example
The first example is basically a 'Hello World' script, to check whether python
is running correctly. Create a file named `hello_world.py` and add the following
content.

<!-- write hello_world.py -->
```python
# Load math library with sqrt function.
import math

print("Example 1:")

# Strings can be formatted with the % operator. The placeholder %g prints a
# floating point number as decimal or with exponent depending on its
# magnitude.
print("  Square root of 2 = %g" % math.sqrt(2))
```

To run the example, open a terminal and tell the python interpreter to run your
code.
<!-- console_output -->
```sh
$ python3 hello_world.py
Example 1:
  Square root of 2 = 1.41421
```

Have you seen the expected output? Congratulations, you can move on to real-life
examples.
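
As an aside, you can convince yourself how `%g` switches between decimal and
exponent notation with a quick experiment. This snippet is only an
illustration and not part of the example script:

```python
# %g picks decimal or exponent notation depending on the magnitude.
for value in (0.00001234, 1.234, 12340000.0):
    print("%g" % value)  # prints 1.234e-05, 1.234, 1.234e+07
```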

# Numpy Arrays 
The standard data structure to store numerical data is a numpy array. Numpy
arrays are defined in the numpy package and are implemented in a very
efficient way. 

To get started with numpy arrays, create a file `np_arrays.py` and add all lines
listed in this section. The first line should be an import statement.
<!-- write np_arrays.py -->
```python
# Import the numpy library.
import numpy as np
```
In this example, we create a numpy array `numbers` containing my favorite numbers from
the python list `[4, 9, 16, 36, 49]`.
<!-- append np_arrays.py -->
```python
# Create a numpy array by passing a list to np.array.
numbers = np.array([4, 9, 16, 36, 49])
```
Having all these numbers in a numpy array makes bulk computations very efficient.
Assume we want to calculate the square root of all these numbers. We can
simply use numpy's `sqrt` function to perform the same operation on all elements
of the array at once.
<!-- append np_arrays.py -->
```python
# Calculate the square root for each item in the array numbers.
roots = np.sqrt(numbers)
```

Since the resulting variable `roots` is also a numpy array, we can perform
similar operations on this variable.
<!-- append np_arrays.py -->
```python
# Perform other calculations on each element separately.
something_else = 1.5 * roots - 4
```
Numpy arrays overload the typical arithmetic operations, such that the above
statement benefits from numpy's efficient, vectorized (i.e. performing the same
operation on many values) implementation. You should always look for a way to
use such vectorized statements and try to avoid manually looping over all the
values. Using a python loop to run over $`10^3`$ values is probably fine, but you
don't want to wait for a python loop iterating over $`10^6`$ or $`10^9`$ values.
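
To get a feeling for the difference, you can time both variants with the
`timeit` module. This is just a rough sketch; the exact numbers depend on your
machine:

```python
import timeit

import numpy as np

values = np.arange(10**6, dtype=float)

def loop_sqrt():
    # Explicit python loop: one interpreter step per element.
    return [v**0.5 for v in values]

def vectorized_sqrt():
    # Vectorized: the loop runs inside numpy's compiled code.
    return np.sqrt(values)

print("loop:       %.3f s" % timeit.timeit(loop_sqrt, number=10))
print("vectorized: %.3f s" % timeit.timeit(vectorized_sqrt, number=10))
```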

Finally, add a print statement to check that all the calculations have been
carried out as expected. 
<!-- append np_arrays.py -->
```python
print("The result is", something_else)
```

When executed, you should get the following printout. You can convince yourself
that the calculations are correct.
<!-- console_output -->
```bash
$ python3 np_arrays.py
The result is [-1.   0.5  2.   5.   6.5]
```

Numpy offers much more functionality, which is beyond the scope of this basic
introduction. It is definitely worth glancing at the
[documentation](https://docs.scipy.org/doc/numpy/index.html).
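
As a small taste of what else is available, here are a few commonly used
operations on the `numbers` array from above (a quick sketch, not part of
`np_arrays.py`):

```python
print(numbers.sum())   # 114, sum of all elements
print(numbers.mean())  # 22.8, arithmetic mean
print(numbers[1:3])    # [ 9 16], slicing works like for python lists
```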

# Plotting Functions
One major aspect of data analysis is the visualization of data. This includes the
generation of diagrams and plots. You can use the powerful library matplotlib to
create high-quality plots in a python environment. The goal of this example is to plot
the cropped parabola

```math
    f(x) = \left\{\begin{array}{lr}
              x^2, & \text{for } x < 2\\
              4, & \text{for } 2 \leq x
            \end{array}\right.,
```

which looks like this:
![Plot of cropped parabola](https://gitlab.sauerburger.com/frank/FP-python-examples/-/jobs/artifacts/master/raw/cropped_parabola.png?job=doxec_test)

Create the file `func_plot.py` and add the following lines.

<!-- Add additional files for non-X11 environment in CI -->
<!-- write func_plot.py
```python
import matplotlib
matplotlib.use('Agg')
```
-->
<!-- append func_plot.py -->
```python
# Import the numpy library.
import numpy as np

# Import the powerful plotting library.
import matplotlib.pyplot as plt
```

Plotting a function with matplotlib means plotting many points connected by a
line. First, we create an array with 200 equidistant $`x`$-values in the interval
$`[-2.5, 3]`$. This array serves as a grid for which we calculate the
$`y`$-values.
<!-- append func_plot.py -->
```python
x = np.linspace(-2.5, 3, 200)
```
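
Note that `np.linspace` includes both endpoints by default. A quick check, not
part of the script:

```python
# Five equidistant points, including both endpoints.
print(np.linspace(0, 1, 5))  # [0.   0.25 0.5  0.75 1.  ]
```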

We can easily calculate the square of all these values with `x**2`. Cropping the
right part is a bit more complex. First we create a boolean index array whose
entries indicate whether the corresponding $`x`$-value satisfies $`x \geq 2`$.
This index array has the same length as our $`x`$-grid. The first elements of
the index array are `False`, since the corresponding $`x`$-values are less than
two. At some point in the array, the value changes to `True`, since from there
on the corresponding $`x`$-values satisfy $`x\geq2`$. The index array can be
used to select a subset of $`y`$-values, namely all $`y`$-values for which
$`x\geq 2`$. Finally, we can assign the value $`4`$ to this subset and thereby
effectively crop the parabola. The python implementation of the algorithm
outlined above is rather short.
<!-- append func_plot.py -->
```python
# Calculate the regular parabola.
y = x**2

# Create index array.
idx = (x >= 2)

# Set all y-values to 4, for which x >= 2.
y[idx] = 4

# One can get rid of the intermediate index array and combine both lines into
# the single statement y[x >= 2] = 4.
```
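
If the masking mechanism seems opaque, here is a minimal standalone sketch of
the same idea on a small array:

```python
a = np.array([1, 2, 3, 4])
mask = (a >= 3)  # array([False, False,  True,  True])
a[mask] = 0      # assign to the selected subset only
print(a)         # [1 2 0 0]
```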

The final step of this example is to call matplotlib, which plots the points and
connects them with a line. Additionally, we can add axis labels and save the
resulting figure.
<!-- append func_plot.py -->
```python
# Plot a line specified by x- and y-arrays.
plt.plot(x, y)

# Set axis labels. LaTeX expressions can be used.
plt.xlabel("$x$") 
plt.ylabel("Cropped Parabola")

# Save the figure. Various different output formats are available.
plt.savefig("cropped_parabola.eps")
```
<!-- append func_plot.py
```python
# We need also a png version of the plot to embed it into the README.md.
plt.savefig("cropped_parabola.png")
```
-->

Run your script and check that the file `cropped_parabola.eps` is created.
<!-- console -->
```bash
$ python3 func_plot.py
```

<!-- console_output
```
$ ls cropped_parabola.eps
cropped_parabola.eps
$ ls cropped_parabola.png
cropped_parabola.png
```
-->

# Plotting Data Points
A typical task in the advanced laboratory might be to compare measured data
points to an expected function. Let's assume the expected function is the cropped
parabola $`f(x)`$ from the previous example. Since we haven't actually measured
real data, we will generate random pseudo data points which are expected to
follow $`f(x)`$.

This example builds on the code from the previous one. Copy `func_plot.py` to
`data_plot.py`, such that we can append the following code snippets to it. Keep
the plotting code from the previous example as it is.
<!-- console
```bash
$ cp func_plot.py data_plot.py
```
-->

We generate the pseudo data points by adding random deviations to the expected
$`y`$-values. Let's pretend we have measured data points for all half-integer
$`x`$-values in the interval $`[-2.5, 3]`$.
<!-- append data_plot.py -->
```python
# Create x-value grid for the "measured" data.
x_data = np.array([-2.5, -2, -1.5, -1, -.5, 0, .5, 1, 1.5, 2, 2.5, 3])
```

We evaluate the function $`f(x)`$ again for the `x_data` values. The resulting
array `y_data` matches the curve from the previous example perfectly. We draw
random deviations from a centered normal distribution with a standard deviation
of 0.3 and add them to `y_data`. The third argument of `numpy.random.normal`
specifies how many random samples we want to draw. We use `len(x_data)` since we
want to draw an independent random deviation for each $`x`$-value.
<!-- append data_plot.py -->
```python
# Calculate square of x_data points.
y_data = x_data**2

# Crop y_data points if x_data >= 2.
y_data[x_data >= 2] = 4

# Add random fluctuations to y_data.
y_data += np.random.normal(0, 0.3, len(y_data))
```
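
Since the deviations are drawn randomly, the pseudo data changes on every run.
If you want reproducible numbers, you can seed numpy's random generator before
drawing; the seed value `42` below is arbitrary:

```python
# Optional: fix the seed to make the pseudo data reproducible.
np.random.seed(42)
```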

Finally, we can add this to our plot. Since our data points are subject to
statistical fluctuations, we would like to use matplotlib's `errorbar` method,
which draws our data points as dots with error bars.

<!-- append data_plot.py -->
```python
# Draw with error bars, similar to plot(). The third parameter is the size of
# the error bar in $`y`$ direction.
plt.errorbar(x_data, y_data, 0.3, fmt="ko", capsize=0)

# Save figure.
plt.savefig("measurement.eps")
```

The character `k` in the format parameter `fmt` sets the color to *black*, the
`o` in `fmt` changes the style to *large dots*. The optional parameter `capsize`
modifies the style of the error bars. You can play with these options to see
their effect, or have a look at the
[documentation](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.errorbar.html)
for more information.
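
For instance, a purely cosmetic variation could use red squares with small caps
on the error bars. This is only an illustration and not part of `data_plot.py`:

```python
# Hypothetical styling variation: red squares ('rs') with caps of size 3.
plt.errorbar(x_data, y_data, 0.3, fmt="rs", capsize=3)
```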

<!-- append data_plot.py
```python
# We need also a png version of the plot to embed it into the README.md.
plt.savefig("measurement.png")
```
-->

<!-- console_output
```
$ python3 data_plot.py
$ ls measurement.eps
measurement.eps
$ ls measurement.png
measurement.png
```
-->

After running `data_plot.py`, you should have a plot similar to this.
![Plot of cropped parabola with data points](https://gitlab.sauerburger.com/frank/FP-python-examples/-/jobs/artifacts/master/raw/measurement.png?job=doxec_test)

# Reading, Plotting and Fitting Experimental Data
We are given with experimental data from a radioactive decay in this example.
The experimental setup consisted of a radioactive probe, a detector, and a
multi-channel-analyzer. The recorded data in `decay.txt` consist of two
tab-separated columns. The first column is called channel. Each channel
corresponds to a certain energy range. The multi-channel-analyzer maintains a
counter for each channel. A detected decay causes the multi-channel-analyzer to increment the internal counter which
corresponds to the energy of the measured decay. The second column stored
these counts. Open the file with your favorite text editor and have a look at the
data.

As usual, create the file `decay.py` and add the import statements needed
for this example.

<!-- Add additional files for non-X11 environment in CI -->
<!-- write decay.py
```python
import matplotlib
matplotlib.use('Agg')
```
-->
<!-- append decay.py -->
```python
# Import the numpy library.
import numpy as np

# Import the scipy library with fit routines.
import scipy.optimize

# Import the scipy library with probability distributions and statistical tests.
import scipy.stats

# Import the plotting library.
import matplotlib.pyplot as plt
```

To inspect the provided data, we can plot the raw data points first. Numpy
provides the function `loadtxt`, which reads a whitespace-separated file into a
numpy array. The function returns a two-dimensional array: the outer array has
one entry for each line in the text file, and each inner array has two entries
in our case, one for the channel and one for the event count. We can use
`transpose()` to flip the matrix, such that the outer array has two entries, one
with the array of channel values and the other with the event counts. Since our
measured event counts stem from radioactive decay, we know that the event counts
follow a Poisson distribution. Therefore, the uncertainties on the event counts
are simply the square roots of the numbers of events.

<!-- append decay.py -->
```python
# Read both columns from the text file.
channel, count = np.loadtxt("decay.txt").transpose()

# Calculate the uncertainty on the number of events per channel.
s_count = np.sqrt(count)

# Create and save a raw version of the plot with data points.
# The label will later be used to identify the curves in a legend.
plt.errorbar(channel, count, s_count, fmt='.k', capsize=0, label="Data")
plt.savefig("decay_raw.eps")
```
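
Incidentally, `np.loadtxt` also accepts the keyword argument `unpack=True`,
which is equivalent to the transpose above:

```python
# Equivalent way to read the two columns directly.
channel, count = np.loadtxt("decay.txt", unpack=True)
```
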
<!-- append decay.py
```
plt.savefig("decay_raw.png")
```
-->
<!-- console 
```
$ python3 decay.py
```
-->

The plot of the raw data is shown below, with the channel on the $`x`$-axis and
the number of events per channel on the $`y`$-axis. From the experimental setup,
we expect a linearly rising background plus a Gaussian peak.
![Data only for the decay](https://gitlab.sauerburger.com/frank/FP-python-examples/-/jobs/artifacts/master/raw/decay_raw.png?job=doxec_test)

Judging from the plot, it looks like the assumed model could describe the data. We
would like to fit this model to our data to determine the optimal values of the model
parameters and their uncertainties (and the covariance matrix). Let's give a more
formal version of the expected model 
```math
  n(c) = A \exp\left(-\frac{(c-m)^2}{2 s^2}\right) + y_0 + bc,
```
where $`n(c)`$ is the expected number of events in channel $`c`$; $`A`$, $`m`$
and $`s`$ are the height, center, and width of the Gaussian, respectively. The
parameters $`y_0`$ and $`b`$ are the usual parameters of a linear curve, which is
assumed to describe our background. The model can be implemented in python as a
function. The first parameter should be the $`x`$-value; all following arguments
are the free parameters of the model. The return value corresponds to the
$`y`$-value, in our case the expected number of events.

<!-- append decay.py -->
```python
def model(channel, m, s, A, y0, b):
    return A * np.exp(-0.5 * (channel - m)**2 / s**2) + y0 + b * channel
```

Please note that we are making an approximation with this definition.
Strictly speaking, comparing the return values of our model to the measured
counts is not correct. The variable _channel_ corresponds to the energy
measured with the setup. Let's assume channel $`c_i`$ corresponds to energy
$`E_i`$. If we measure $`n_i`$ events in channel $`c_i`$, this means that we
have measured $`n_i`$ events in the energy interval $`[\frac{1}{2}(E_{i-1} + E_i),
\frac{1}{2}(E_i + E_{i+1})]`$. The proper way is to integrate our continuous
function $`n(c)`$ over each bin $`[c_{i} - \frac{1}{2}, c_{i} + \frac{1}{2}]`$ and
compare these bin-wise integrals to the measured data. The procedure shown here
is a good approximation if the function can be considered linear within each
bin. However, the parameters $`A`$ and $`b`$ are not normalized to the bin width
in this case.
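
To get a feeling for the size of this approximation, one can compare the point
evaluation to the bin-wise integral for a single channel, e.g. with
`scipy.integrate.quad`. This is only a sketch; it assumes the `model` function
from above and uses rough, hypothetical parameter values:

```python
import scipy.integrate

# Rough, hypothetical parameter values (m, s, A, y0, b) for this check only.
params = (60, 10, 50, 20, 1)

c = 60  # pick a single channel
integral, _ = scipy.integrate.quad(model, c - 0.5, c + 0.5, args=params)
point = model(c, *params)
print("bin integral = %g, point value = %g" % (integral, point))
```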


To fit this model to our experimental data, we can use the function `curve_fit`
provided by the scipy package. The function `curve_fit` performs a least-squares
fit and returns the optimal parameters and the covariance matrix. The fit might
not converge on its own. We can guide the optimization procedure by providing
suitable start values for the free parameters. From the plot, I read off a height
$`A=50`$, a center $`m=60`$ and a width $`s = 10`$ for the Gaussian part, and
$`y_0 = 20`$ and $`b = 1`$ for the linear part. These values don't have to be
accurate; a rough estimate is usually enough to get a stable fit result. More
information on the fitting method can be found in the
[documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html).

<!-- append decay.py -->
```python
# Define the initial values of the free parameters.
# Remember, that we defined our model as n(c; m, s, A, y0, b)
p0 = (60, 10, 50, 20, 1)

# Perform the actual fit. The parameters are
#   (1) Model to fit
#   (2) Array of x-values
#   (3) Array of y-values to which the model should be fitted
#   (4) Array with initial values for the free parameters
#   (5) Array with uncertainties on the y-values.
popt, pcov = scipy.optimize.curve_fit(model, channel, count, p0, s_count)
```
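
Note that by default `curve_fit` treats the uncertainties as relative weights
and rescales the covariance matrix (`absolute_sigma=False`). Since our Poisson
uncertainties are absolute, one can argue for passing `absolute_sigma=True`
instead; this changes the reported parameter uncertainties but not the best-fit
values. The example output below corresponds to the call as written above.

```python
# Variant that treats s_count as absolute uncertainties.
popt, pcov = scipy.optimize.curve_fit(model, channel, count, p0, s_count,
                                      absolute_sigma=True)
```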

To visualize the fitted model, we need to evaluate our model with the optimized
parameters `popt`. 

<!-- append decay.py -->
```python
# Evaluate the model with the optimized parameters.
fit_count = model(channel, *popt)

# Plot a curve representing the fitted model.
plt.plot(channel, fit_count, label="Linear + Gauss")

# Add axis labels.
plt.xlabel("Channel")
plt.ylabel("Counts")

# Add a legend to identify data and our fit. This method uses values passed to
# the optional argument 'label' of plot() and errorbar().
plt.legend()

# Save the figure.
plt.savefig("decay.eps")
```
<!-- append decay.py
```
plt.savefig("decay.png")
```
-->
<!-- console 
```
$ python3 decay.py
```
-->
The result should look like this.
![Data and fit for the decay](https://gitlab.sauerburger.com/frank/FP-python-examples/-/jobs/artifacts/master/raw/decay.png?job=doxec_test)

Usually, we want to measure some quantity with an experimental setup. For this,
we need the optimized parameters and the covariance matrix returned by the
fit. Let's assume we are interested in the best-fit values of the parameters and
their uncertainties. The uncertainties are the square roots of the diagonal
entries of the covariance matrix. We can add the following print statements to
display this information.

<!-- append decay.py -->
```python
print("Optimal parameters:")
print("  m = %g +- %g"  % (popt[0], np.sqrt(pcov[0][0])))
print("  s = %g +- %g"  % (popt[1], np.sqrt(pcov[1][1])))
print("  A = %g +- %g"  % (popt[2], np.sqrt(pcov[2][2])))
print("  y0 = %g +- %g" % (popt[3], np.sqrt(pcov[3][3])))
print("  b = %g +- %g"  % (popt[4], np.sqrt(pcov[4][4])))
print()  # print blank line
```
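
The same uncertainties can also be extracted more compactly with `np.diag`:

```python
# Equivalent: all parameter uncertainties at once.
uncertainties = np.sqrt(np.diag(pcov))
print(uncertainties)
```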

A $`\chi^2`$-test can be performed to assess the goodness of this fit. In
a counting experiment like this one, we can rely on scipy's `chisquare`, which
returns the $`\chi^2`$ and the $`p`$-value. The `chisquare` method assumes that
the uncertainties are the square roots of the expected numbers of events. If
this is not the case, we have to compute the $`\chi^2`$ manually. The following
example shows both the usage of `chisquare` and the manual computation. The
print statements for both methods produce the same output. Please note that the
number of degrees of freedom is the number of data points minus the number of
free parameters, and our model has five free parameters.

<!-- append decay.py -->
```python
# Number of free parameters of our model.
dof = 5

print("chi^2 from scipy:")

# Call chisquare and print.
chi2, p = scipy.stats.chisquare(count, fit_count, ddof=dof)
print("  chi2 / dof = %g / %d" % (chi2, len(count) - dof))
print("  p-value = %g" % p)

print()  # print blank line

print("Manual chi^2 test:")

# Calculate chi^2
uncertainty = np.sqrt(fit_count)
chi2 = ((count - fit_count)**2 / uncertainty**2).sum()

# Calculate p-value and print
p = scipy.stats.distributions.chi2.sf(chi2, len(count) - 1 - dof)
print("  chi2 / dof = %g / %d" % (chi2, len(count) - dof))
print("  p-value = %g" % p)

```

If you run `decay.py`, you should see the following fit results.
<!-- console_output -->
```bash
$ python3 decay.py
Optimal parameters:
  m = 61.756 +- 0.696195
  s = 8.59757 +- 0.708479
  A = 43.2461 +- 3.24047
  y0 = 20.6589 +- 1.0885
  b = 0.841156 +- 0.0193703

chi^2 from scipy:
  chi2 / dof = 120.577 / 123
  p-value = 0.519418

Manual chi^2 test:
  chi2 / dof = 120.577 / 123
  p-value = 0.519418
```

Congratulations! You have mastered the first steps to analyze experimental data
with python.

# Further Reading
This tutorial can not cover all topics which can be relevant for the advanced
laboratories. Here is a list of online resources, which might be useful.

 - [Python](https://docs.python.org/3/) documentation
 - [Numpy and Scipy](https://docs.scipy.org/doc/) documentation
 - [Matplotlib](http://matplotlib.org/contents.html) documentation
 - [Orthogonal distance regression](https://docs.scipy.org/doc/scipy/reference/odr.html) - a least-squares fit which considers
   uncertainties in $`x`$ and $`y`$ directions.
 - Histograms in
   [numpy](https://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram.html)
   and
   [matplotlib](https://matplotlib.org/api/pyplot_api.html?highlight=matplotlib%20pyplot%20hist#matplotlib.pyplot.hist).