From c41a7739a79ab4dd6728d26359f8c118af8e2555 Mon Sep 17 00:00:00 2001
From: Frank Sauerburger <frank@sauerburger.com>
Date: Mon, 7 Aug 2017 17:52:29 +0200
Subject: [PATCH] Add preliminary version of real-life section

Add a preliminary version, i.e. without proof reading, of the section describing
how to read, plot and fit data.
---
 README.md | 215 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 decay.txt | 128 ++++++++++++++++++++++++++++++++
 2 files changed, 342 insertions(+), 1 deletion(-)
 create mode 100644 decay.txt

diff --git a/README.md b/README.md
index c0660de..5ea9bfc 100644
--- a/README.md
+++ b/README.md
@@ -5,6 +5,15 @@ years code examples for [ROOT](https://root.cern.ch/) have been provided.
 Material on the usage of python was missing. The code examples shown in this
 repository follow the examples shown in the ROOT introductions.
 
+# Table of Contents
+1. [Installation](#installation)
+2. [Prerequisites and Structure](#prerequisites-and-structure)
+3. ['Hello World' Example](#hello-world-example)
+4. [Numpy Arrays](#numpy-arrays)
+5. [Plotting Functions](#numpy-arrays)
+6. [Plotting Data Points](#plotting-data-points)
+7. [Reading, Plotting and Fitting Experimental Data](#reading-plotting-and-fitting-experimental-data)
+
 # Installation
 To get started with python for data analysis in the advanced laboratories you
 need the python interpreter. In this document we will use `python3`. The
@@ -245,7 +254,7 @@ $ cp func_plot.py data_plot.py
 -->
 
 We generate the pseudo data points by adding random deviations to the expected
-y-values. Lets assume we measured data points for all half-integer x-values.
+y-values. Lets assume we measured data points for all quarter-integer x-values.
 <!-- append data_plot.py -->
 ```python
 x_data = np.array([-2.5, -2, -1.5, -1, -.5, 0, .5, 1, 1.5, 2, 2.5, 3])
@@ -299,3 +308,207 @@ measurement.png
 After running `data_plot.py` you should have a plot similar to this.
 ![Plot of cropped parabola with data points](https://srv.sauerburger.com/esel/FP-python-examples/-/jobs/artifacts/master/raw/measurement.png?job=doxec_test)
 
+# Reading, Plotting and Fitting Experimental data
+We are given with experimental data from a radioactive decay in this example.
+The experimental setup consisted of a radioactive probe, a detector and a
+multi-channel-analyzer. The recorded data in `decay.txt` consist of two
+tab-separated columns. The first column is called channel. Each channel
+corresponds to a certain energy range. The multi-channel-analyzer maintains a
+counter for each channel. For each recorded decay the internal counter which
+corresponds to the measured energy is incremented. The seconds column stored
+these counts. Open the file with out favorite text editor and have a look at the
+data.
+
+As usual, create the file `decay.py` and the import statements, which we need
+for this example.
+
+<!-- Add additional files for non-X11 environment in CI -->
+<!-- write decay.py
+```python
+import matplotlib
+matplotlib.use('Agg')
+```
+-->
+<!-- write decay.py -->
+```python
+import numpy as np
+import scipy.optimize  # provides fit routines
+import scipy.stats  # provides statistical tests and distributions
+import matplotlib.pyplot as plt
+```
+
+To inspect the provided data, we can plot the raw data points first. Numpy
+provides the function `loadtxt`, which read a whitespace-separated file into a
+numpy array. The function returns a two dimensional array. The outer array has
+one entry for each line, and the inner two entries in our case, one for the
+channel and the other one for the event count. We can use `transpose()` to flip
+the matrix, such that the outer array has two entries, one with an array of
+channel values and the other one with event counts. Since our measured event
+counts stem from radioactive decay, we know that the event counts follow a
+Poisson distribution. Therefore, the uncertainty of the events counts is simply
+the square root of the number of events.
+
+<!-- append decay.py -->
+```python
+channel, count = np.loadtxt("decay.txt").transpose()
+s_count = np.sqrt(count)
+plt.errorbar(channel, count, s_count, fmt='.k', capsize=0, label="Data")
+plt.savefig("decay_raw.eps")
+```
+<!-- append decay.py
+```
+plt.savefig("decay_raw.png")
+```
+-->
+<!-- console 
+```
+$ python3 decay.py
+```
+-->
+
+The plot of the raw data is shown below. The plot shows the channel on the
+x-axis and the number of events per bin on the y-axis. From the experimental setup we expect a
+linearly rising background plus a Gaussian peak.
+![Data only for the decay](https://srv.sauerburger.com/esel/FP-python-examples/-/jobs/artifacts/master/raw/decay_raw.png?job=doxec_test)
+
+Judging from the plot, it looks like the assumed model describes the data. We
+would like this model to our data to determine the optimal value of the model
+parameter and their uncertainties (and the covariance matrix). Lets give a more
+formal version of the expected model 
+```math
+  n(c) = A exp\left(-\frac{(c-m)^2}{2 s^2}\right) + y_0 + bc,
+```
+where $`n(c)`$ is the expected number of events in channel $`c`$; $`A, m`$
+and  $`s`$ are the height, center and width of the Gaussian respectively. The
+parameters $`y_0`$ and $`b`$ are the usual parameter of a linear curve which is
+assumed to describe our background. The model can be implemented in python as a
+function. The first parameter should be the x value, all following arguments are
+free parameters of the model. The return value corresponds to the y value, in
+our case the expected number of events.
+
+<!-- append decay.py -->
+```
+def model(channel, m, s, A, y0, b):
+  return  A * np.exp(-0.5 * (channel - m)**2 / s**2) + y0 + b * channel
+```
+
+Please note that we are going to make an approximation with this definiton.
+Strictly speaking, comparing thhe return values of our model to the measured
+count is not correct. The variable channel corresponds to the radiation energy
+measured with the setup. Lets assume channel $`c_i`$ corresponds to energy
+$`E_i`$. If we measure $`n_i`$ events in channel $`c_i`$, this means that we
+have measured $`n_i`$ in the energy interval $`[\frac{1}{2}(E_{i-1} + E_i),
+\frac{1}{2}(E_i} + E_{i+}]`$. The proper way is to integrate our continuous
+function $`n(c)`$ in each bin $`[c_{i} - \frac{1}{2}, c_{i} } \frac{1}{2}]`$. The
+prodcedure show here is a good approximation, if the function can be considered
+is linear within each bin. However, the parameter $`A`$ and $`b`$ are not normalized to
+the bin width in this case.
+
+
+To fit this model to our experimental data, we can use the function `curve_fit`
+provided by the scipy package. The function `curve_fit` performs a least square
+fit and returns the optimal parameters and the covariance matrix. The fit might
+not converge on its one. We can guide optimization procedure by providing
+suitable start values of the free parameters. From the plot I read off a height
+$`A=50`$, a center $`m=60`$ and a width $`s = 10`$ for the Gaussian part and
+$`y_0 = 20`$, $`b = 1`$ for the linear part. These values don't have to be
+accurate. They should be a rough estimation, this is usually enough to get a
+stable fit result.
+
+<!-- append decay.py -->
+```
+p0 = (60, 10, 50, 20, 1)
+popt, pcov = scipy.optimize.curve_fit(model, channel, count, p0, np.sqrt(count))
+```
+
+To visualize the fitted model, we need to evaluate our model with the optimized
+parameters `popt`. We are now ready to add the fitted curve to the plot and save
+it.
+
+<!-- append decay.py -->
+```
+fit_count = model(channel, *popt)
+plt.plot(channel, model(channel, *popt), label="Linear + Gauss")
+
+plt.xlabel("Channel")
+plt.ylabel("Counts")
+plt.ylim(0, 1.1 * max(count))
+plt.legend()
+plt.savefig("decay.eps")
+```
+<!-- append decay.py
+```
+plt.savefig("decay.png")
+```
+-->
+<!-- console 
+```
+$ python3 decay.py
+```
+-->
+The result should look like this.
+![Data and fit for the decay](https://srv.sauerburger.com/esel/FP-python-examples/-/jobs/artifacts/master/raw/decay.png?job=doxec_test)
+
+Usually we want to measure some quantity with the experimental setup. For this
+we need the optimized parameters and the covariance matrix returned by the
+fit. Lets assume we are interested in the best fit value of the parameters and
+their uncertainties. We can add the following print outs, to display this kind
+of information.
+
+<!-- append decay.py -->
+```python
+print("Optimal parameters:")
+print("  m = %g +- %g"  % (popt[0], np.sqrt(pcov[0][0])))
+print("  s = %g +- %g"  % (popt[1], np.sqrt(pcov[1][1])))
+print("  A = %g +- %g"  % (popt[2], np.sqrt(pcov[2][2])))
+print("  y0 = %g +- %g" % (popt[3], np.sqrt(pcov[3][3])))
+print("  b = %g +- %g"  % (popt[4], np.sqrt(pcov[4][4])))
+print()
+```
+
+A $`\chi^2`$-test can also be performed, to assess the goodness of this fit. In
+a counting experiment like this one, we can rely on scipy's `chisquare`, which
+returns the $`\chi^2`$ and the $`p`$-value. The `chisquare` method assumes, that
+the uncertainties are the square root of the expectation. If this is not the
+case, we have to compute the $`\chi^2`$ manually. The following example shows
+both examples. The print statements produce the same output. Please note, that
+we have five degrees of freedom, since we have five free parameters in our
+model.
+
+<!-- append decay.py -->
+```python
+print("chi^2 from scipy:")
+chi2, p = scipy.stats.chisquare(count, fit_count, ddof=5)
+print("  chi2 / ndf = %g / %d" % (chi2, len(count) - 6))
+print("  p-value = %g" % p)
+
+print()
+
+print("Manual chi^2 test:")
+uncertainty = np.sqrt(fit_count)
+chi2 = ((count - fit_count)**2 / uncertainty**2).sum()
+p = scipy.stats.distributions.chi2.sf(chi2, len(count) - 6)
+print("  chi2 / ndf = %g / %d" % (chi2, len(count) - 6))
+print("  p-value = %g" % p)
+
+```
+
+If you run the `decay.py` you should see the following fit results.
+<!-- console_output -->
+```bash
+$ python3 decay.py
+Optimal parameters:
+  m = 61.756 +- 0.696195
+  s = 8.59757 +- 0.708479
+  A = 43.2461 +- 3.24047
+  y0 = 20.6589 +- 1.0885
+  b = 0.841156 +- 0.0193703
+
+chi^2 from scipy:
+  chi2 / ndf = 120.577 / 122
+  p-value = 0.519418
+
+Manual chi^2 test:
+  chi2 / ndf = 120.577 / 122
+  p-value = 0.519418
+```
diff --git a/decay.txt b/decay.txt
new file mode 100644
index 0000000..b8892c8
--- /dev/null
+++ b/decay.txt
@@ -0,0 +1,128 @@
+0	21
+1	19
+2	18
+3	23
+4	37
+5	30
+6	23
+7	28
+8	32
+9	24
+10	32
+11	32
+12	39
+13	32
+14	31
+15	37
+16	32
+17	31
+18	36
+19	30
+20	35
+21	39
+22	39
+23	39
+24	30
+25	45
+26	50
+27	46
+28	49
+29	53
+30	41
+31	45
+32	57
+33	48
+34	39
+35	65
+36	61
+37	53
+38	53
+39	51
+40	48
+41	53
+42	54
+43	71
+44	64
+45	69
+46	59
+47	81
+48	73
+49	93
+50	68
+51	80
+52	98
+53	88
+54	102
+55	99
+56	106
+57	113
+58	95
+59	107
+60	107
+61	117
+62	106
+63	107
+64	115
+65	125
+66	127
+67	136
+68	120
+69	110
+70	99
+71	106
+72	103
+73	96
+74	92
+75	107
+76	93
+77	86
+78	72
+79	108
+80	92
+81	94
+82	83
+83	101
+84	110
+85	105
+86	110
+87	94
+88	87
+89	106
+90	74
+91	92
+92	98
+93	98
+94	102
+95	94
+96	114
+97	91
+98	117
+99	107
+100	100
+101	97
+102	106
+103	91
+104	94
+105	96
+106	98
+107	128
+108	120
+109	105
+110	104
+111	129
+112	126
+113	120
+114	115
+115	122
+116	127
+117	131
+118	142
+119	128
+120	117
+121	139
+122	128
+123	122
+124	123
+125	129
+126	137
+127	110
-- 
GitLab