This folder contains the data set used for the Process Discovery
Contest 2020 (PDC 2020). The data set contains 192
training logs, 192 corresponding test logs, 192 corresponding
ground truth logs, and 96 models. The logs are all stored using the
IEEE XES file format (see either https://www.xes-standard.org/ or
https://ieeexplore.ieee.org/document/7740858), while the models are
workflow nets (a subclass of Petri nets) stored in the PNML file
format (see
https://www.iso.org/obp/ui/#iso:std:iso-iec:15909:-2:ed-1:v1:en).

The data set is generated from a single base model that allows for
the following characteristics A-G to be configured:

A: Dependent tasks, also known as long-term dependencies. Possible
values are 0 for No and 1 for Yes. If Yes, then all transitions that
bypass the dependent tasks are disabled.

B: Loops. Possible values are 0 for No, 1 for Simple, and 2 for
Complex. If No, then all transitions that start a loop are
disabled. If Simple, then all transitions that are a shortcut
between the loop and the main flow are disabled.

C: OR constructs. Possible values are 0 for No and 1 for Yes. If
No, then all transitions that only take some inputs for an OR-join
and all transitions that generate only some outputs for an OR-split
are disabled.

D: Routing constructs, also known as invisible tasks. Possible
values are 0 for No and 1 for Yes. If Yes, then some transitions
are made invisible.

E: Optional tasks. Possible values are 0 for No and 1 for Yes. If
Yes, then some invisible transitions are added to allow skipping of
some (visible) transitions.

F: Duplicate tasks, also known as recurrent activities. Possible
values are 0 for No and 1 for Yes. If Yes, then some transitions
are relabeled to existing labels.

G: Noise. Possible values are 0 for No and 1 for Yes. If Yes, then
noise is introduced in approximately 1 out of 5 traces. Noise is
introduced by deleting one event (40%), moving one event in the
trace (20%), or copying one event in the trace (40%).
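
The noise step described under G can be illustrated with a minimal
Python sketch. It is not the generator used for the contest: the
function names add_noise and noisy_log are hypothetical, a trace is
assumed to be a plain list of activity labels, and the affected event
and its new position are assumed to be chosen uniformly at random,
which the description above does not specify.

```python
import random


def add_noise(trace, rng):
    """Return a copy of `trace` with one noise operation applied.

    The operation is drawn with the weights stated above:
    delete 40%, move 20%, copy 40%. Positions are chosen uniformly,
    which is an assumption of this sketch.
    """
    noisy = list(trace)
    if not noisy:
        return noisy
    kind = rng.choices(["delete", "move", "copy"], weights=[40, 20, 40])[0]
    i = rng.randrange(len(noisy))
    if kind == "delete":
        del noisy[i]
    elif kind == "move":
        # The event may land on its old position, leaving the trace unchanged.
        event = noisy.pop(i)
        noisy.insert(rng.randrange(len(noisy) + 1), event)
    else:  # copy: duplicate one event somewhere in the trace
        noisy.insert(rng.randrange(len(noisy) + 1), noisy[i])
    return noisy


def noisy_log(log, rng, rate=0.2):
    """Apply noise to roughly `rate` of the traces (1 out of 5 by default)."""
    return [add_noise(t, rng) if rng.random() < rate else list(t) for t in log]


if __name__ == "__main__":
    rng = random.Random(2020)
    print(noisy_log([["a", "b", "c", "d"]] * 5, rng))
```

Passing an explicit random.Random instance keeps the sketch
reproducible; the rate parameter is likewise an illustrative choice.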

The models and logs were generated in the following way from the
base model. For each of the 96 possible combinations of values for
A-F (2 x 3 x 2 x 2 x 2 x 2), the corresponding
model pdc_2020_ABCDEF0.pnml is generated from the base model. From
every model pdc_2020_ABCDEF0.pnml, four logs are generated: (1) a
training log pdc_2020_ABCDEF0.xes without noise, (2) a training log
pdc_2020_ABCDEF1.xes where about 20% of the traces contain noise,
(3) a test log pdc_2020_ABCDEF0.xes where about 50% of the traces
contain noise, and (4) a test log pdc_2020_ABCDEF1.xes where about
50% of the traces contain noise. Every ground truth log
pdc_2020_ABCDEFG.xes is then generated by classifying the test log
pdc_2020_ABCDEFG.xes using the model pdc_2020_ABCDEF0.pnml. In each
ground truth log, the additional boolean “pdc:isPos” attribute
denotes whether the trace is positive (fits the model, true) or
negative (does not fit the model, false).
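
The naming scheme and the ground truth attribute can be explored with
a short Python sketch. It makes several assumptions: the helper names
(configurations, file_names, read_labels) are hypothetical, the
“pdc:isPos” attribute is assumed to be stored as a standard XES
boolean element directly under each trace, and the embedded SAMPLE is
a made-up fragment for illustration, not an excerpt from the data set.

```python
import io
import itertools
import xml.etree.ElementTree as ET

# Number of possible values per characteristic A-F; only B (loops) has three.
RANGES = {"A": 2, "B": 3, "C": 2, "D": 2, "E": 2, "F": 2}


def configurations():
    """Yield every ABCDEF digit string, e.g. '020101' (96 in total)."""
    for values in itertools.product(*(range(n) for n in RANGES.values())):
        yield "".join(str(v) for v in values)


def file_names(config):
    """Return the model name and the two log names (G=0, G=1) for a config."""
    model = f"pdc_2020_{config}0.pnml"
    logs = [f"pdc_2020_{config}{g}.xes" for g in (0, 1)]
    return model, logs


def _local(tag):
    """Strip an XML namespace prefix such as '{http://...}trace'."""
    return tag.rsplit("}", 1)[-1]


def read_labels(source):
    """Return one boolean per trace that carries a pdc:isPos attribute."""
    root = ET.parse(source).getroot()
    labels = []
    for element in root.iter():
        if _local(element.tag) != "trace":
            continue
        for attr in element:
            if _local(attr.tag) == "boolean" and attr.get("key") == "pdc:isPos":
                labels.append(attr.get("value") == "true")
    return labels


# Hypothetical two-trace fragment, for illustration only.
SAMPLE = """<log xmlns="http://www.xes-standard.org/">
  <trace>
    <boolean key="pdc:isPos" value="true"/>
    <event><string key="concept:name" value="t1"/></event>
  </trace>
  <trace>
    <boolean key="pdc:isPos" value="false"/>
    <event><string key="concept:name" value="t2"/></event>
  </trace>
</log>"""

if __name__ == "__main__":
    print(len(list(configurations())))
    print(file_names("120101"))
    print(read_labels(io.StringIO(SAMPLE)))
```

read_labels accepts a file name or an open file object, so it could
be pointed at a ground truth log directly, provided the attribute
layout matches the assumption above.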
