diff --git a/LICENSE b/LICENSE index 540b0f80620c764949c8e548c0e660f509254371..1fb05390940e454fbbc11a536bf5fcd8ec730184 100644 --- a/LICENSE +++ b/LICENSE @@ -1,6 +1,6 @@ MIT License -Copyright (c) 2017 Frank Sauerburger +Copyright (c) 2017-18 Frank Sauerburger Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal diff --git a/README.md b/README.md index 15348509257dd4ec8f82e0b2d6a2cccd325cb510..661be6e452ae0d9e8e809ffbda40c61fdcdf6059 100644 --- a/README.md +++ b/README.md @@ -1,8 +1,9 @@ This repository consists of a collection of python examples intended as an -introduction on the usage of python in data analysis, especially for the +introduction on the use of python in data analysis, especially for the advanced laboratories in physics at the University of Freiburg. In previous -years code examples for [ROOT](https://root.cern.ch/) have been provided. -Material showing how to use python for the same task was missing. +years, code examples for [ROOT](https://root.cern.ch/) have been provided. +Material showing how to use python for the same task was missing. This document +tries to fill the gap. If you think this tutorial is useful, lacks important information, or is unclear, don't hesitate to give [feedback](mailto:frank@sauerburger.com). @@ -19,7 +20,7 @@ unclear, don't hesitate to give [feedback](mailto:frank@sauerburger.com). # Installation To get started with python for data analysis in the advanced laboratories you -need the python interpreter. In this document we will use `python3`. The +need the python interpreter. In this document, we will use `python3`. The additional packages `numpy`, `scipy` and `matplotlib` are useful for data analysis and data presentation. @@ -39,8 +40,8 @@ directory, which hides potentially older packages installed with `apt-get`. ## Windows Since I'm not using python on Windows myself, I don't have first-hand experience -with it. 
However I think [Anaconda](https://www.continuum.io/downloads) is a -good solution for windows users, which provides all required packages. +with it. However, I think [Anaconda](https://www.continuum.io/downloads) is a +good solution for windows users since it provides all required packages. # Prerequisites and About the Tutorial @@ -51,17 +52,17 @@ and function calls, but it is certainly advisable to know about control structures. To catch up on these aspects, you can refer to the [python -documentation](https://docs.python.org/3/tutorial/). If your are already +documentation](https://docs.python.org/3/tutorial/). If you are already familiar with another programming language, it should be quite intuitive to switch to python. -This tutorial is structured into several examples, which might depend on each +This tutorial is divided into several examples, which might depend on each other. The examples show code snippets which you are supposed to copy to a text editor. The scripts can then be executed in a terminal. Besides this modus operandi, you are invited to use -the interactive mode of python or ipython instead and copying the code directly -to the python interpreter. Recently [jupyter](http://jupyter.org/) notebooks have been -become very popular. I recommend you to try out these different platforms and +the interactive mode of python or ipython instead and copy the code directly +to the python interpreter. Recently [jupyter](http://jupyter.org/) notebooks have +become very popular. I recommend you to try out these different options and choose the one most suited for you. This repository does not contain ready-made python example scripts or plots. 
The @@ -70,9 +71,9 @@ shouldn't be duplications of code snippets, which will be out-of-sync eventually The repository is set up, such that each commit triggers continues integration tasks on the server, which parses the examples from the README and executes them with the -[doxec](https://srv.sauerburger.com/frank/doxec) package. This means, you can +[doxec](https://gitlab.sauerburger.com/frank/doxec) package. This means you can download [ready-made scripts and -plots](https://srv.sauerburger.com/esel/FP-python-examples/-/jobs/artifacts/master/download?job=doxec_test) +plots](https://gitlab.sauerburger.com/frank/FP-python-examples/-/jobs/artifacts/master/download?job=doxec_test) produced by the continues integration task. Let's get going! @@ -91,7 +92,7 @@ print("Example 1:") # Strings can be formatted with the % operator. The placeholder %g prints a # floating point numbers as decimal or with exponent depending on its -# magnitute. +# magnitude. print(" Square root of 2 = %g" % math.sqrt(2)) ``` @@ -108,8 +109,8 @@ Have you seen the expected output? Congratulations, you can move on to real-life examples. # Numpy Arrays -The standard data structure to store numerical data are numpy arrays. Numpy -arrays are defined in the numpy package, and are implemented in a very +The standard data structure to store numerical data is a numpy array. Numpy +arrays are defined in the numpy package and are implemented in a very efficient way. To get stared with numpy arrays create a file `np_arrays.py` and add all lines @@ -119,7 +120,7 @@ listed in this section. The first line should be an import statement. # Import the numpy library. import numpy as np ``` -In this example we create a numpy array `numbers` containing my favorite numbers from +In this example, we create a numpy array `numbers` containing my favorite numbers from the python list `[4, 9, 16, 36, 49]`. 
<!-- append np_arrays.py --> ```python @@ -132,7 +133,7 @@ simply use numpy's `sqrt` method do perform the same operation on all elements of the array at the same time. <!-- append np_arrays.py --> ```python -# Calculte the square root for each item in the array numbers. +# Calculate the square root for each item in the array numbers. roots = np.sqrt(numbers) ``` @@ -150,7 +151,7 @@ use such vectorized statements, and try to avoid manually looping over all the values. Using a python loop to run over $`10^3`$ values is probably fine, but you don't want to wait for a python loop iterating over 10^6 or 10^9 values. -Finally add a print statement to check that all the calculations have been +Finally, add a print statement to check that all the calculations have been carried out as expected. <!-- append np_arrays.py --> ```python @@ -183,7 +184,7 @@ the cropped parabola ``` which looks like this: - + Create the file `func_plot.py` and add the following lines. @@ -204,7 +205,7 @@ import matplotlib.pyplot as plt ``` Plotting a function with matplotlib means plotting many points connected by a -line. First we create an array with 200 equidistant $`x`$-values in the interval +line. First, we create an array with 200 equidistant $`x`$-values in the interval $`[-2.5, 3]`$. This array functions as a grid, for which we calculate the $`y`$ values. <!-- append func_plot.py --> @@ -217,16 +218,16 @@ right part is a bit more complex. First we create an index array of `1`'s and `0`'s, which indicate whether $`x \geq 2`$. This index array has the same length as our $`x`$-grid. The first elements of the index array are `0`'s, since the -corresponding $`x`$-value is below two. At some point in the array, the value +corresponding $`x`$-value is less than two. At some point in the array, the value changes to `1`, since then the corresponding $`x`$ values satisfy $`x\geq2`$. The index array can be used to select a subset of -$`y`$-values, namely all $`y`$-values, for which $`x\geq 2`$. 
Finally we can +$`y`$-values, namely all $`y`$-values, for which $`x\geq 2`$. Finally, we can assign the value $`4`$ to this subset, and therefore -effectively cropping the parabola. The implementation in python of the algorithm outlined +effectively crop the parabola. The implementation in python of the algorithm outlined above is rather short. <!-- append func_plot.py --> ```python -# Calculate the regualar parabola. +# Calculate the regular parabola. y = x**2 # Create index array. @@ -234,19 +235,22 @@ idx = (x >= 2) # Set all y-values to 4, for which x >= 2. y[idx] = 4 + +# One can get rid of the intermediate index array and combine both lines into +# the statement y[x >= 2] = 4 ``` The final step of this example is to call matplotlib, which plots the points and -connects the with a line. Additionally, We can add axis labels and save the +connects them with a line. Additionally, we can add axis labels and save the resulting figure. <!-- append func_plot.py --> ```python # Plot a line specified by x- and y-arrays. plt.plot(x, y) -# Set axis label. Latex expression can be used. +# Set axis labels. Latex expression can be used. plt.xlabel("$x$") -plt.ylabel("cropped parabola") +plt.ylabel("Cropped Parabola") # Save the figure. Various different output formats are available. plt.savefig("cropped_parabola.eps") @@ -275,14 +279,14 @@ cropped_parabola.png # Plotting Data Points A typical task in the advanced laboratory might be to compare measured data -points to an expected function. Lets assume the expected function is the cropped +points to an expected function. Let's assume the expected function is the cropped parabola $`f(x)`$ from the previous example. We will use random data points in -this example, since we haven't actually measured real data, which is expected to +this example since we haven't actually measured real data, which is expected to follow $`f(x)`$. This example is based on the code from the previous example. 
Copy the file from -the previous examples to `data_plot.py`, such that we an append the following -code snippets to `data_plot.py` and keep +the previous example to `data_plot.py`, such that we can append the following +code snippets to `data_plot.py`. Keep the plotting code from the previous example as it is. <!-- console ```bash @@ -291,11 +295,11 @@ $ cp func_plot.py data_plot.py --> We generate the pseudo data points by adding random deviations to the expected -$`y`$-values. Lets pretend we have measured data points for all half-integer +$`y`$-values. Let's pretend we have measured data points for all half-integer $`x`$-values in the interval $`[-2.5, 3]`$. <!-- append data_plot.py --> ```python -# Create x-value grid for the measured data. +# Create x-value grid for the "measured" data. x_data = np.array([-2.5, -2, -1.5, -1, -.5, 0, .5, 1, 1.5, 2, 2.5, 3]) ``` @@ -303,8 +307,8 @@ We evaluate the function $`f(x)`$ again for the `x_data` values. The resulting array `y_data` matches the curve from the previous example perfectly. We draw random deviations from a centered normal distribution with a standard deviation of 0.3 and add them to `y_data`. The third argument of `numpy.random.normal` specifies, how -many random samples we want to draw. We use `len(x_data)`, since we want to draw -a random deviation for each $`x`$-value. +many random samples we want to draw. We use `len(x_data)` since we want to draw +an independent random deviation for each $`x`$-value. <!-- append data_plot.py --> ```python # Calculate square of x_data points. @@ -317,13 +321,14 @@ y_data[x_data >= 2] = 4 y_data += np.random.normal(0, 0.3, len(y_data)) ``` -Finally we can add this to our plot. Since our data points are subject to +Finally, we can add this to our plot. Since our data points are subject to statistical fluctuations, we would like to use matplotlib's `errorbar` method, -which draws our data points with as dots with error bars. 
+which draws our data points as dots with error bars. <!-- append data_plot.py --> ```python -# Draw with error bars, similar to plot(). +# Draw with error bars, similar to plot(). The third parameter is the size of +# the error bar in $`y`$ direction. plt.errorbar(x_data, y_data, 0.3, fmt="ko", capsize=0) # Save figure. @@ -333,7 +338,7 @@ plt.savefig("measurement.eps") The character `k` in the format parameter `fmt` sets the color to *black*, the `o` in `fmt` changes the style to *large dots*. The optional parameter `capsize` modifies the style of the error bars. You can play with these options to see -what happens or have a look at the +what happens if you change the options, or have a look at the [documentation](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.errorbar.html) for more information about the options. @@ -354,21 +359,21 @@ measurement.png ``` --> -After running `data_plot.py` you should have a plot similar to this. - +After running `data_plot.py`, you should have a plot similar to this. + # Reading, Plotting and Fitting Experimental Data We are given with experimental data from a radioactive decay in this example. -The experimental setup consisted of a radioactive probe, a detector and a +The experimental setup consisted of a radioactive probe, a detector, and a multi-channel-analyzer. The recorded data in `decay.txt` consist of two tab-separated columns. The first column is called channel. Each channel corresponds to a certain energy range. The multi-channel-analyzer maintains a -counter for each channel. Decay causes the multi-channel-analyzer to increment the internal counter which -corresponds to the energy of the measured decay. The seconds column stored -these counts. Open the file with out favorite text editor and have a look at the +counter for each channel. A detected decay causes the multi-channel-analyzer to increment the internal counter which +corresponds to the energy of the measured decay. 
The second column stored +these counts. Open the file with your favorite text editor and have a look at the data. -As usual, create the file `decay.py` and add the import statements, which we need +As usual, create the file `decay.py` and add the import statements needed for this example. <!-- Add additional files for non-X11 environment in CI --> @@ -395,7 +400,7 @@ import matplotlib.pyplot as plt To inspect the provided data, we can plot the raw data points first. Numpy provides the function `loadtxt`, which reads a whitespace-separated file into a -numpy array. The function returns a two dimensional array. The outer array has +numpy array. The function returns a two-dimensional array. The outer array has one entry for each line in the text file. The inner array has two entries in our case, one for the channel and the other one for the event count. We can use `transpose()` to flip @@ -410,11 +415,11 @@ the square roots of the number of events. # Read both columns from the text file. channel, count = np.loadtxt("decay.txt").transpose() -# Calculate the uncertaintiy on the number of events per channel. +# Calculate the uncertainty on the number of events per channel. s_count = np.sqrt(count) # Create and save a raw version of the plot with data points. -# The label will be used later to identify the curves in a legend. +# The label will later be used to identify the curves in a legend. plt.errorbar(channel, count, s_count, fmt='.k', capsize=0, label="Data") plt.savefig("decay_raw.eps") ``` @@ -430,20 +435,21 @@ $ python3 decay.py --> The plot of the raw data is shown below. The plot shows the channel on the -$`x`$-axis and the number of events per bin on the $`y`$-axis. From the experimental setup we expect a +$`x`$-axis and the number of events per bin on the $`y`$-axis. From the +experimental setup, we expect a linearly rising background plus a Gaussian peak. - + Judging from the plot, it looks like the assumed model could describe the data. 
We would like to fit this model to our data to determine the optimal values of the model -parameters and their uncertainties (and the covariance matrix). Lets give a more +parameters and their uncertainties (and the covariance matrix). Let's give a more formal version of the expected model ```math n(c) = A \exp\left(-\frac{(c-m)^2}{2 s^2}\right) + y_0 + bc, ``` where $`n(c)`$ is the expected number of events in channel $`c`$ ; $`A, m`$ -and $`s`$ are the height, center and width of the Gaussian, respectively. The -parameters $`y_0`$ and $`b`$ are the usual parameter of a linear curve which is +and $`s`$ are the height, center, and width of the Gaussian, respectively. The +parameters $`y_0`$ and $`b`$ are the usual parameters of a linear curve which is assumed to describe our background. The model can be implemented in python as a function. The first parameter should be the $`x`$-value, all following arguments are free parameters of the model. The return value corresponds to the $`y`$-value, in @@ -457,7 +463,7 @@ def model(channel, m, s, A, y0, b): Please note that we are making an approximation with this definition. Strictly speaking, comparing the return values of our model to the measured -count is not correct. The variable channel corresponds to the radiation energy +count is not correct. The variable _channel_ corresponds to the energy measured with the setup. Lets assume channel $`c_i`$ corresponds to energy $`E_i`$. If we measure $`n_i`$ events in channel $`c_i`$, this means that we have measured $`n_i`$ in the energy interval $`[\frac{1}{2}(E_{i-1} + E_i), @@ -472,7 +478,7 @@ the bin width in this case. To fit this model to our experimental data, we can use the function `curve_fit` provided by the scipy package. The function `curve_fit` performs a least square fit and returns the optimal parameters and the covariance matrix. The fit might -not converge on its one. We can guide the optimization procedure by providing +not converge on its own. 
We can guide the optimization procedure by providing suitable start values of the free parameters. From the plot I read off a height $`A=50`$, a center $`m=60`$ and a width $`s = 10`$ for the Gaussian part and $`y_0 = 20`$ and $`b = 1`$ for the linear part. These values don't have to be @@ -482,15 +488,15 @@ stable fit result. More information on the fitting method can be found in the <!-- append decay.py --> ```python -# Define the intial values of the free parameters. +# Define the initial values of the free parameters. # Remember, that we defined our model as n(c; m, s, A, y0, b) p0 = (60, 10, 50, 20, 1) # Perform the actual fit. The parameters are # (1) Model to fit # (2) Array of x-values -# (3) Array of y-values to which the model shold be fitted -# (4) Array with inital values for the free parameters +# (3) Array of y-values to which the model should be fitted +# (4) Array with initial values for the free parameters # (5) Array with uncertainties on the y-values. popt, pcov = scipy.optimize.curve_fit(model, channel, count, p0, s_count) ``` @@ -511,7 +517,7 @@ plt.xlabel("Channel") plt.ylabel("Counts") # Add a legend to identify data and our fit. This method uses values passed to -# the the optional arguemnt 'label' of plot() and errorbar(). +# the optional argument 'label' of plot() and errorbar(). plt.legend() # Save the figure. @@ -528,11 +534,11 @@ $ python3 decay.py ``` --> The result should look like this. - + -Usually we want to measure some quantity with an experimental setup. For this +Usually, we want to measure some quantity with an experimental setup. For this, we need the optimized parameters and the covariance matrix returned by the -fit. Lets assume we are interested in the best fit value of the parameters and +fit. Let's assume we are interested in the best fit value of the parameters and their uncertainties. The uncertainties are the square roots of the diagonal of the covariance matrix. 
We can add the following print statements, to display this kind of information. @@ -548,19 +554,19 @@ print(" b = %g +- %g" % (popt[4], np.sqrt(pcov[4][4]))) print() # print blank line ``` -A $`\chi^2`$-test can also be performed, to assess the goodness of this fit. In +A $`\chi^2`$-test can be performed to assess the goodness of this fit. In a counting experiment like this one, we can rely on scipy's `chisquare`, which returns the $`\chi^2`$ and the $`p`$-value. The `chisquare` method assumes, that the uncertainties are the square root of the expected number of events. If this is not the case, we have to compute the $`\chi^2`$ manually. The following example shows both, the usage of `chisquare` and the manual computation. The print statements -for each method produce the same output. Please note, that. -we have five degrees of freedom, since we have five free parameters in our +for both methods produce the same output. Please note that +we have five degrees of freedom since we have five free parameters in our model. <!-- append decay.py --> ```python -# Degrees of freedom in our model. +# Degrees of freedom of our model. dof = 5 print("chi^2 from scipy:") @@ -605,12 +611,12 @@ Manual chi^2 test: p-value = 0.519418 ``` -Congratulations! You have mastered the first steps to analysis experimental data +Congratulations! You have mastered the first steps to analyze experimental data with python. # Further Reading This tutorial can not cover all topics which can be relevant for the advanced -laboratories. Here is a list with online resources, which might be useful. +laboratories. Here is a list of online resources, which might be useful. - [Python](https://docs.python.org/3/) documentation - [Numpy and Scipy](https://docs.scipy.org/doc/) documentation