# EDM4hep - The common event data model EDM4hep is the common and shared Event Data Model (EDM) of the Key4hep project. Here we will give a brief introduction to EDM4hep as well as some of the technicalities behind it. We will also guide you towards documentation and try to give you the knowledge to make sense of it. ## Important resources - EDM4hep doxygen API reference page: [edm4hep.web.cern.ch](https://edm4hep.web.cern.ch) - EDM4hep github repository: [github.com/key4hep/EDM4hep](https://github.com/key4hep/EDM4hep) - podio documentation page (including API reference): [key4hep.web.cern.ch/podio](https://key4hep.web.cern.ch/podio) - podio github repository: [github.com/AIDASoft/podio](https://github.com/AIDASoft/podio) ## Doxygen API documentation We start with having a look at the [EDM4hep doxygen API reference page](https://edm4hep.web.cern.ch): ### The overview diagram  You see a diagrammatic overview of EDM4hep with all the available data types, broadly organized into different categories. The arrows depict two ways data types can be related / linked with each other - ["Relations"](#relations) (black arrows) - ["Links"](#links) (purple-ish arrows) #### Relations These are relations defined within the data types, and which are directly accessible from the data types. They come in two flavors, depending on the multiplicity of the relation - `OneToOneRelations` - `OneToManyRelations` Data types can relate to other instances of the same type (e.g. `MCParticle`s usually form a hierarchy of mothers/daughters). Relations are directed, i.e. it is possible to go from one object to a related object, but vice versa this does usually not hold. For example, a `ReconstructedParticle` can point to multiple `Tracks` or `Clusters`, but those do not point to a `ReconstructedParticle`. #### Links These are relations that are in a sense "external" to the data model definition. They are currently mainly used to connect MC and RECO information, as a direct link via a relation is not desirable as it would mix the two worlds. In contrast to relations, links are not directed, i.e. it is possible to access both involved objects from the link. ### The table of available types Just below the diagram is an overview table of all the types that are defined in EDM4hep. Here they are organized into - `Components` - Very simple types, that are used throughout the `Datatypes` - `Datatypes` - The data types that are defined in EDM4hep - `Links` - The available links between different data types - `Generator related (meta-)data` - Data types related to generator metadata - `Interfaces` - Abstractions for accessing different types by a shared set of properties  Clicking on any of these links will take you to the [`edm4hep.yaml`](https://github.com/key4hep/EDM4hep/blob/master/edm4hep.yaml) definition file of EDM4hep, jumping directly to the definition of the respective datatype or component. For more information on this file check out the section about [podio](#podio---the-technical-infrastructure-on-which-things-run). In principle it is possible to have *very educated guesses* on how the interface of the classes will look like from this. ### Navigating the doxygen reference page To see all the available classes simply click on [`Classes -> Class Index`](https://edm4hep.web.cern.ch/classes.html) or on [`Classes -> Class List`](https://edm4hep.web.cern.ch/annotated.html). Doing the latter and expanding the `edm4hep` namespace gives you something like this  Clicking on any of the links in this list will take you to the reference page for that class, e.g. for the [`ReconstructedParticle`](https://edm4hep.web.cern.ch/classedm4hep_1_1_reconstructed_particle.html)  #### Why are there so many classes and do I need all of them? If you look at the list you will realize there are many classes that are all named very similar, e.g. - **`CaloHitContribution`** - **`CaloHitContributionCollection`** - `CaloHitContributionCollectionData` - `CaloHitContributionCollectionIterator` - `CaloHitContributionData` - `CaloHitContributionMutableCollectionIterator` - `CaloHitContributionObj` - `CaloHitContributionSIOBlock` - **`MutableCaloHitContribution`** From all of these classes **only the ones marked bold are truly "visible" and intended for use**. The others are internal classes or simple helper types that you will most likely only ever see in compiler errors, especially if you follow the *Almost Always Auto* style of writing c++ code (see [Herb Sutter's original blog post](https://herbsutter.com/2013/08/12/gotw-94-solution-aaa-style-almost-always-auto/) or [a slightly easier to digest summary](http://cginternals.github.io/guidelines/articles/almost-always-auto/)). To understand why these classes exist and what their purpose is, we have to make a slight detour to [podio](#podio---the-technical-infrastructure-on-which-things-run). ### Some utility functionality EDM4hep also brings a bit of utility functionality. You can find it in the [`edm4hep::utils` namespace](https://edm4hep.web.cern.ch/namespaceedm4hep_1_1utils.html) (click on `Namespaces -> Namespace List`, then expand the `edm4hep` namespace and then click on `utils` to arrive at this link). # podio - The technical infrastructure on which things run podio is an EDM toolkit that is used by and developed further in the Key4hep context. The main purpose is to have an efficiently implemented, thread safe EDM starting from a high level description. For more (gory) details have a look at the [github repository](https://github.com/AIDASoft/podio). Here we will describe the code generation, and its implications for EDM4hep. A [bit further down](#the-podioframe-container) we will describe how to read and write podio (root) files and the [`podio-dump`](#podio-dump) tool to inspect files without having to open them. ## podio code generation The podio code generator is a python script that reads in the EDM definition in **yaml** format, does a few basic validation checks on the definition, and then generates all the necessary code via the Jinja2 template engine. ![]()<img src="https://raw.githubusercontent.com/key4hep/key4hep-tutorials/4b0cb1387169538c3580ab953c7bb179e42a8470/edm4hep_analysis/images/podio_generate.svg" width="320"> The generated code should (among other things) - be efficient, - offer an easy to use interface, - offer performant I/O. Having automated code generation has a few advantages: - Freeing the user from the repetitive task of implementing all the types themselves - Freeing the user from having to deal with all the details of how to do things efficiently - Making it very easy to roll out improved implementations (or bug fixes) via simply regenerating the code ### The three layers of podio To achieve the goals stated above podio favors composition over inheritance and uses **plain-old-data (POD)** types wherever possible. To achieve this podio employs a layered design, which makes it possible to have an efficient memory layout and performant I/O implementation, while still offering an easy to use interface ![]()<img src="https://raw.githubusercontent.com/key4hep/key4hep-tutorials/4b0cb1387169538c3580ab953c7bb179e42a8470/edm4hep_analysis/images/podio_layers.png" width="320"> - The *User Layer* is the top most layer and it **offers the full functionality** and is the **only layer with which users interact directly**. It consists mainly of the collections and lightweight handle classes, i.e. - `XYZCollection` - `XYZ` - `MutableXYZ` - The *Object Layer* consists of the `XYZObj` classes, that take care of all resource management and which also enable the relations between different objects. - The *POD Layer* at the very bottom is where all the actual data lives in simple `XYZData` POD structs. These are the things that are actually stored in, e.g. root files that are written by podio. ### Basics of generated code - value semantics The generated c++ code offers so called *value semantics*. The exact details of what this actually means are not very important, the main point **is that you can treat all objects as values and you don't have to worry about inefficient copies or managing resources:** ```cpp auto recos = edm4hep::ReconstructedParticleCollection(); // ... fill, e.g. via auto p = recos.create(); // or via auto p2 = edm4hep::ReconstructedParticle(); recos.push_back(p2); // Loop over a collection for (auto reco : recos) { auto vtx = reco.getStartVertex(); // do something with the vertex // loop over related tracks for (auto track : reco.getTracks()) { // do something with this track } } ``` This looks very similar to the equivalent python code (if you squint a bit, and ignore the `auto`s, `;` and `{}` ;) ) ```python recos = edm4hep.ReconstructedParticleCollection() # ... fill, e.g. via p = recos.create() # or via p2 = edm4hep.ReconstructedParticle() recos.push_back(p2) # Loop over a collection for reco in recos: vtx = reco.getStartVertex() # do something with the vertex # loop over related tracks for track in reco.getTracks(): # do something with the tracks ``` The python interface is functionally equivalent to the one c++ interface, since that is implemented via PyROOT. There are some additions that make the python interface more *pythonic*, e.g. `len(recos)` is equivalent to `recos.size()`. Nevertheless, the doxygen reference is valid for both interfaces. ### Guessing the interface from the yaml definition Since all code is generated, it is usually pretty straight forward to guess how the interface will look like just from looking at the definition in the yaml file. For EDM4hep the general rule is to get a `Member` variable, a `OneToOneRelation`, a `OneToManyRelation` or a `VectorMember` is to **simply stick a `get` in front of the name in the yaml file and to capitalize the first letter.**, e.g. ```yaml Members: - edm4hep::Vector3f momentum // the momentum in [GeV] ``` will turn into something like ```cpp const edm4hep::Vector3f& getMomentum() const; ``` Similar, but in slightly more nuanced rules apply for the methods that are generated for setting a value. For `Member` variables and `OneToOneRelation`s the general rule is to **stick a `set` in front of the name in the yaml file and to capitalize the first letter**, e.g. (continuing from above) ```cpp void setMomentum(edm4hep::Vector3f value); ``` For the `OneToManyRelation`s and `VectorMember`s the rule is to **stick a `addTo` in front of the name in the yaml file and to capitalize the first letter**, e.g. ```yaml OneToManyRelation: - MCParticle daughters // the daughters of this particle ``` will be generated to ```cpp void addToDaughters(MCParticle daughter); ``` ### Why is there a `XYZ` and a `MutableXYX`? The underlying technical reasons are rather complex, dive quite deepish into c++ nuances, and definitely far beyond the scope of this tutorial. In short: We need two different handle classes in order to control whether users are allowed to modify things or not. As one of the main goals of podio generated EDMs is to be thread safe the default generated class for each data type allows only for immutable read access, i.e. it provides only the `get` methods. Only the `Mutable` classes actually have the `set` methods, and can hence be used to actually modify objects. The most important implication of this is the following: **Everything that you read from file, or that you get from the Gaudi TES is immutable.** I.e. there is no way for you to change or update the values that you read. The only way to "update" values (or collections) is to actually copy the contents and then store the updated values back. Independent copies of objects can be obtained with the clone` method. ### Writing function interfaces The `Mutable` objects implicitly convert to an instance of a default class. Hence, **always use the default classes when specifying function interfaces** (obviously this only works if you only need read access in the function). **There is no implicit conversion from the default, immutable objects to the `Mutable` objects!** As an example ```cpp void printE(edm4hep::MCParticle particle) { std::cout << particle.getEnergy() << '\n'; } void printEMutable(edm4hep::MutableMCParticle particle) { std::cout << particle.getEnergy() << '\n'; } int main() { auto mutP = edm4hep::MutableMCParticle(); p.setEnergy(3.14); printE(mutP); // Works due to implicit conversion printEMutable(mutP); // Obviously also works // Now we create an immutable object auto P = edm4hep::MCParticle(); printE(P); // Obviously works printEMutable(P); // BREAKS: No conversion from default to Mutable return 0; } ``` ### Subset collections Similar to LCIO, podio generated EDMs offer a *subset collection functionality*. This allows to create collections of objects, that are actually part of another collection, e.g. to simply collect all the muons that are present in a larger collection of reconstructed particles: ![]()<img src="https://raw.githubusercontent.com/key4hep/key4hep-tutorials/4b0cb1387169538c3580ab953c7bb179e42a8470/edm4hep_analysis/images/podio_subset_collections.svg" width="200"> To create a subset collection, simply do ```cpp auto muons = edm4hep::ReconstructedParticleCollection(); muons.setSubsetCollection(); // You can now add objects that are part // of another collection to this one via push_back muons.push_back(recos[0]); ``` Reading a subset collection works exactly the same as reading a normal collection. This is handled in a transparent way, such that you usually don't even realize that you are operating on a subset collection. ## The `podio::Frame` container The `podio::Frame` is a *generalized event*. It is a container that aggregates all relevant data (and some meta data). It also defines an implicit *interval of validity* (but that is less relevant for this tutorial). It provides a thread safe interface for data access - Immutable read access only for collections that are stored inside the a `Frame` - All data that is inside a `Frame` is owned by it, and this is also reflected in its interface. ![]()<img src="https://raw.githubusercontent.com/key4hep/key4hep-tutorials/4b0cb1387169538c3580ab953c7bb179e42a8470/edm4hep_analysis/images/frame_concept.svg" width="300"> Here we will just briefly introduce the main functionality, for more details see the [documentation in podio](https://github.com/AIDASoft/podio/blob/master/doc/frame.md). ### Getting collections from a `Frame` Assuming that `event` is a `podio::Frame` in the following code examples, getting a collection can be done via (c++) ```cpp auto& mcParticles = event.get<edm4hep::MCParticleCollection>("MCParticles"); ``` or (python) ```python mcParticles = event.get("MCParticles") ``` This retrieves the collection that is stored under the name `MCParticles` with type `edm4hep::MCParticleCollection`. If no such collection exists, it will simply return an empty collection of the desired type. As you can see, the type is automatically inferred in python. **Note that `get` returns a const&, so it is required to actually put the `&` behind `auto` in c++**, otherwise there will be a compilation error complaining about a copy-constructor being marked `delete`. ### Putting a collection into a `Frame` When putting a collection into a `Frame` you give up ownership of this collection. To signal this to the users, it is necessary to *move* the collection into a `Frame`. Again assuming `event` is a `podio::Frame` in the following examples, this looks like this ```cpp auto recos = edm4hep::ReconstructedParticleCollection(); event.put(std::move(recos), "ReconstructedParticles"); ``` Note the requirement to explicitly use `std::move` in this case. At this point `recos` is *moved* into the `event`, and you are left with an object [*in a valid but unspecified state*](https://stackoverflow.com/a/12095473) that you should under normal circumstances no longer use after this point. (Technically we do enough that you still can use this, but don't expect the results to match your expectations). ## Reading EDM4hep files EDM4hep files are read with tools provided by podio. As podio supports multiple different backends there are several, *low level* readers that support all the necessary functionality. You can obviously use these readers directly, but we recommend to use the `Reader` class and the `makeReader` function that will dispatch to the correct low level reader automatically. ```cpp #include <podio/Reader.h> #include <edm4hep/MCParticleCollection.h> int main() { auto reader = podio::makeReader("some_file_containing_edm4hep.data.root"); // Loop over all events for (size_t i = 0; i < reader.getEvents(); ++i) { auto event = reader.readNextEvent(); auto& mcParticles = event.get<edm4hep::MCParticleCollection>("MCParticles"); // do more stuff with this event } return 0; } ``` ### The available low level readers - `ROOTReader` - The default reader for TTree based files - `ROOTLegacyReader` - The reader for an old podio format based on TTrees - `RNTupleReader` - A reader for RNTuple based files - `SIOReader` - The reader for reading files using the SIO backend - `SIOLegacyReader` - The reader for the SIO backend with an old podio format The `Legacy` readers are stated here mainly for completeness, in case you need to read a rather old file that still used the `EventStore` which has been removed from podio some time ago. See [here](#how-do-i-figure-out-if-a-file-is-legacy) for more information on how to figure out whether the file you are interested in is a legacy file or not. As podio is a rather low level tool, also the interface of these readers feel somewhat low level. This is mostly visible in the fact, that you have to provide a `category` (name) when getting the number of entries, or when reading the next entry. This is because in principle podio can handle multiple different categories of Frames in one file. **For the purpose of this tutorial and also for the majority of use cases, simply use `"events"` as category name.** Readers in podio do not return a `podio::Frame` directly, rather they just return some *frame data* from which a `podio::Frame` can be constructed. Putting all of these things together, a simple event loop looks like this in c++: ```cpp #include "podio/ROOTReader.h" #include "podio/Frame.h" #include "edm4hep/MCParticleCollection.h" int main() { auto reader = podio::ROOTReader(); reader.openFile("some_file_containing_edm4hep_data.root"); // Loop over all events for (size_t i = 0; i < reader.getEntries("events"); ++i) { auto event = podio::Frame(reader.readNextEntry("events")); auto& mcParticles = event.get<edm4hep::MCParticleCollection>("MCParticles"); // do more stuff with this event } return 0; } ``` The equivalent python code looks like this ```python from podio import root_io reader = root_io.Reader("some_file_containing_edm4hep_data.root") # if you want to read legacy files use root_io.LegacyReader for event in reader.get("events"): mcParticles = event.get("MCParticles") # do more stuff with this event ``` ## ROOT file layout of podio generated EDMs podio generated EDMs, i.e. also EDM4hep, use ROOT as their default I/O backend. Since everything is based on PODs, the produced root files are pretty straight forward to read and interpret (with some caveats). They are already almost flat ntuples.   ### How do I figure out if a file is legacy? 1. Use [`podio-dump`](#podio-dump) and it will tell you ```console $podio-dump /home/workarea/data/rv02-02.sv02-02.mILD_l5_o1_v02.E250-SetA.I402003.Pe2e2h.eL.pR.n000.d_dstm_15089_0_edm4hep.root input file: /home/workarea/data/rv02-02.sv02-02.mILD_l5_o1_v02.E250-SetA.I402003.Pe2e2h.eL.pR.n000.d_dstm_15089_0_edm4hep.root Frame categories in this file (this is a legacy file!): [...] ``` 2. Peek inside the root file and look at the contents ![]()<img src="https://raw.githubusercontent.com/key4hep/key4hep-tutorials/4b0cb1387169538c3580ab953c7bb179e42a8470/edm4hep_analysis/images/initial_browser_edm4hep.png" width="200"> ![]()<img src="https://raw.githubusercontent.com/key4hep/key4hep-tutorials/4b0cb1387169538c3580ab953c7bb179e42a8470/edm4hep_analysis/images/initial_browser_legacy_edm4hep.png" width="200"> ## `podio-dump` The `podio-dump` utility allows to inspect EDM4hep files from the command line. The synopsis looks like this ``` console $podio-dump --help usage: podio-dump [-h] [-c CATEGORY] [-e ENTRIES] [-d] [--dump-edm DUMP_EDM] [--version] inputfile Dump contents of a podio file to stdout positional arguments: inputfile Name of the file to dump content from options: -h, --help show this help message and exit -c CATEGORY, --category CATEGORY Which Frame category to dump -e ENTRIES, --entries ENTRIES Which entries to print. A single number, comma separated list of numbers or "first:last" for an inclusive range of entries. Defaults to the first entry. -d, --detailed Dump the full contents not just the collection info --dump-edm DUMP_EDM Dump the specified EDM definition from the file in yaml format --version show program's version number and exit ``` By default it prints how many events are present in the file and also a summary of the contents of the first event. This overview consists of the names, data types and number of elements of the collections that are stored in this event. Using the `--detailed` flag, `podio-dump` will print the complete contents of all collections in ASCII format. This can be quite a bit of information. Using the `--entries` flag it is possible to choose which events to look at. The `--categories` flag is an advanced feature and not necessary for this tutorial. `podio-dump` will also tell you whether the file that is passed to it is a *legacy file* in which case you will need the `ROOTLegacyReader` or the `SIOLegacyReader` to read it.