From 09f0a59719990bba4e57d84b95809f1be199d8f9 Mon Sep 17 00:00:00 2001
From: Frank Sauerburger <frank@sauerburger.com>
Date: Mon, 29 Aug 2022 16:34:57 +0200
Subject: [PATCH] Add talk notebook

---
 .gitignore        |   3 +
 PyHEP_uhepp.ipynb | 503 ++++++++++++++++++++++++++++++++++++++++++++++
 requirements.txt  |   1 +
 3 files changed, 507 insertions(+)
 create mode 100644 .gitignore
 create mode 100644 PyHEP_uhepp.ipynb
 create mode 100644 requirements.txt

diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..5dbaea4
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,3 @@
+*.json
+*.pdf
+.ipynb_checkpoints/
diff --git a/PyHEP_uhepp.ipynb b/PyHEP_uhepp.ipynb
new file mode 100644
index 0000000..91245ac
--- /dev/null
+++ b/PyHEP_uhepp.ipynb
@@ -0,0 +1,503 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<h1 style=\"font-weight: 200; font-size: 5rem; border-bottom: 1px solid #000\">PyHEP: uhepp</h1>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Install dependency"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "! pip install -r requirements.txt &> /dev/null"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "and import packages:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import ROOT\n",
+    "import uhepp"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Please make sure ROOT is available on your system."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Inspecting ROOT file\n",
+    "\n",
+    "Inspecting the root file. It contains three TH1F histograms, one for\n",
+    " - measure data,\n",
+    " - expected background, and\n",
+    " - expected signal."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "root_file = ROOT.TFile.Open(\"toy_data.root\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "root_file.ls()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "c = ROOT.TCanvas()\n",
+    "root_file.Get(\"data\").Draw()\n",
+    "c.Draw()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data_yield = uhepp.from_th1(root_file.Get(\"data\"))\n",
+    "data_yield"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "bkg_yield = uhepp.from_th1(root_file.Get(\"bkg\"))\n",
+    "sig_yield = uhepp.from_th1(root_file.Get(\"signal\"))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Create your first plot\n",
+    "\n",
+    "Creating a plot with uhepp involved two tightly coupled but still separated tasks. First, we need to add the raw histogram contents. Afterward, we can set up the visual appearance of the plot.\n",
+    "\n",
+    "Create list of bin edges from histogram."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "th1f = root_file.Get(\"data\")\n",
+    "bin_edges = [th1f.GetXaxis().GetBinUpEdge(i) for i in range(th1f.GetNbinsX() + 1)]\n",
+    "len(bin_edges)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Note there are 41 bin edges for 40 regular bins. The yield objects create above contains 42 bins, including the underflow and overflow bins."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "hist = uhepp.UHeppHist(\"$m$\", bin_edges)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The raw data is stored in the yields attribute. It is another dictionary mapping arbitrary internal names to the binned data. The binned data is stored as Yield objects. The yield objects couple the value with its uncertainties. A yield object comes close to a ROOT TH1 object (with some key distinctions). Yield objects can be added, scales, etc., while propagating uncertainties. Let's first create the yield objects from the sample data."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Finally, let's add the yields to our histogram."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "hist.yields = {\n",
+    "    \"sig\":  sig_yield,\n",
+    "    \"bkg\":  bkg_yield,\n",
+    "    \"data\": data_yield\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The names that we use here as dictionary keys, can be arbitrary strings. You are encouraged to use descriptive names, which makes editing the histogram much easier. We will use these names later to refer to the yields when we specify the main plot's content or the ratio plot. In this example, we've created a single background entry. In a real-world histogram, you would have several different physics processes with one entry per process. You are encouraged to use a fine-grained process list here. Merging two or more yield entries in the visual specification is easy."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Visual settings\n",
+    "\n",
+    "From the visual point of view, uhepp histograms are made up of a list of stacks. A stack consists of stack items. The bin contents of stack items within a stack are added (*stacked*). Separate stacks are down on top of each other. A stack in this context does not only refer to a bar histogram (`stepfilled`). It could be only the outline of bar histogram (`step`) or `points` usually used for measured data.\n",
+    "\n",
+    "Let's start by setting up a stack for the signal and background expectation (here `mc` for Monte Carlo). This stack consists of one stack item for the `signal` and one for the `background`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "background_si = uhepp.StackItem([\"bkg\"], \"Background\")\n",
+    "signal_si = uhepp.StackItem([\"sig\"], \"Signal\", color=\"red\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The creation of a `StackItem` takes at least two arguments. The first argument is a list of yield names we've defined in the previous section. Here you see the real power of separating the yield data from their stack definition. By passing multiple processes to the first argument, we can merge histograms with a single line of Python code. The second argument is used as the label in the legend. This is also the time to specify the color, line style, and other style settings. We choose to have the signal in red and leave the background color up to the system.\n",
+    "\n",
+    "The stack items need to be combined into a `Stack`. This tells the renderer to *stack* the bars of each stack item vertically. Finally, we can add the `mc_stack` to the stack list of our histogram object."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "mc_stack = uhepp.Stack([background_si, signal_si])\n",
+    "hist.stacks.append(mc_stack)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We proceed similarly with data. Here we have a single item in the stack. The type of the stack should be `points`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data_si = uhepp.StackItem([\"data\"], \"Data\")\n",
+    "\n",
+    "data_stack = uhepp.Stack([data_si], bartype=\"points\")\n",
+    "hist.stacks.append(data_stack)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To improve the output, we can also make the label of the x-axis more verbose. In the beginning, we've set the symbol already to `m`. Here we add a verbose label `Mass` and the unit `GeV`. On top, we can add some metadata. Some tools use the filename attribute to have a default filename, for example, when you render the plot to graphics file."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "hist.variable = \"Mass\"\n",
+    "hist.unit = \"GeV\"\n",
+    "hist.filename = \"higgs_mass_dist\"\n",
+    "hist.author = \"Your name\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now it is time to see how far we've got."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "hist.render()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "So far so good. However, we realize that a ratio plot between data and the `background` plus `signal` model would have been nice. Adding a ratio plot is a matter of adding a `RatioItem` to the `ratio` list property. The two mandatory arguments of `RatioItem` are two lists of yield names: First, the one of the numerator. Second, the one for the denominator."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "ratio_si = uhepp.RatioItem([\"data\"], [\"sig\", \"bkg\"], bartype=\"points\")\n",
+    "hist.ratio = [ratio_si]\n",
+    "\n",
+    "hist.render()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Can you update the above example, to have the comparison of data against the background-only model?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You might interject that the binning is a bit too fine, and the statistical power of each bin is relatively low. No problem. We can rebin the histogram on the fly during rendering. Simply set an alternative binning. The original histograms are not modified, so you can always undo this step without rerunning your analysis framework."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "coarser_bins = hist.bin_edges[::4]  # Keep only every fourth bin edge\n",
+    "hist.rebin_edges = coarser_bins\n",
+    "hist.render()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To save the result as a graphics file, call the `render` method with a filename."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "hist.render(\"mass_dist.pdf\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Local storage\n",
+    "\n",
+    "At this point, we might want to save the result not only in a graphics file but also as a uhepp plot. The whole point of uhepp is that you can save the intermediate result between raw data and graphics files such that you can make modifications to the visual appearance without reprocessing your data set.\n",
+    "\n",
+    "The uhepp format does not define the syntax of the file format, only its semantics. You can use any semi-structured format that supports lists and maps. Popular choices are JSON or YAML. We will use JSON in this tutorial."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "hist.to_json(\"mass_hist.json\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Uhep plots stored in JSON or YAML can be modifed with many popular tools. You could even change labels or style settings manually with a text editor. Loading previously saved histograms is also easy."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Could be in a new Python session\n",
+    "import uhepp\n",
+    "loaded_hist = uhepp.from_json(\"mass_hist.json\")\n",
+    "loaded_hist.render()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Think about your recent work. Was there a moment when you wanted to modify the labels, axis ranges, collaboration badges, color scheme, binning, histogram content etc., after the plots have been produced? Had you stored the intermediate data in uhepp format, these tasks are trivial."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Online storage"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In high-energy physics, a common task is to processes enormous data sets. Often this is only possible with a computing cluster. The resulting histograms and plots are stored on a remote file system. To see what the analysis code did, one has to copy the files to the local computer.\n",
+    "\n",
+    "Today, high-energy experiments can only be built and maintained by large collaborations. The actual analysis of the data takes place in a team. Showing histograms to colleagues is a daily recurring task. To me, it was both tedious and unsatisfactory to prepare a PDF plot books.\n",
+    "\n",
+    "Wouldn't it be nice to have a central hub for all plots? Computing nodes can push histograms in uhepp format to the hub. The analyzer can review plots in realtime; colleagues can browse through the plots online, extract yields, download them, or even upload modified (e.g., rebinned) versions themself. Students could also benefit from a central hub. On the hub, a student can collect plots and histograms throughout the whole time as a student. Naturally, things like color schemes, process compositions, etc., evolve during the time as a student, however, in the thesis, it would be nice to present the material uniformly. If all the material is stored on the hub in uhepp format, it's easy to rebrand, recolor the histogram.\n",
+    "\n",
+    "Sounds cool? Then sign up.\n",
+    "\n",
+    " - Go to [https://uhepp.org/login](https://uhepp.org/login) and login using your CERN account and CERN's Single Sign-On.\n",
+    " - To use the API, you need an API token that is linked to your account. Go to [https://uhepp.org/tokens/new](https://uhepp.org/tokens/new) and create a token. Follow the instructions given on the token creation page. You need to add the token and the API endpoint to your environment variables.\n",
+    " - Plots on the uhepp hub are organized in collections. We will push the toy plot from the previous section to a new collection. Navigate to [https://uhepp.org/c/new](https://uhepp.org/c/new) and create a new collection. On the collection page, you see the ID of the collection. We need to specify this ID when we want to push or pull plots. For this tutorial, we assume the ID is `1`.\n",
+    " \n",
+    "Now, you are ready to push your first plot."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "receipt = hist.push(1)\n",
+    "receipt"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The return value of `push()` is a push receipt. By default, it renders to the URL where you can view the plot online. Every plot has a unique identifier (uuid). You can see the uuid in the URL. The push receipt gives you programmatic access to the generated uuid."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "receipt.uuid"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you have the uuid of a plot, you can also download it. The uhepp module provides the `pull()` method for this purpose."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# This could be a new shell again or a different compute\n",
+    "import uhepp\n",
+    "uuid = 'a378d2b0-cde2-4266-be9b-85945d94880d'\n",
+    "hist = uhepp.pull(uuid)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "hist.render()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "At this point, we might decide that the binning should be a bit finer again, and signal should be `C1` orange instead of red. So let's change this."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "hist.rebin_edges = hist.bin_edges[::2]  # Drop only every second bin edges\n",
+    "hist.stacks[0].content[1].color = 'C1'  # 0-th stack, 1-st stack-item\n",
+    "hist.render()"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/requirements.txt b/requirements.txt
new file mode 100644
index 0000000..f5ce36f
--- /dev/null
+++ b/requirements.txt
@@ -0,0 +1 @@
+uhepp
-- 
GitLab