Grouped tasks
While defining the column mapping, we can optionally define a set of columns used to group together events that share the same values in those columns. Events are grouped within their case, and a grouped tasks aggregation needs to be defined for each dimension and metric.
The purpose of this functionality is to merge multiple similar events into a single event.
Grouped Tasks Example
Let's take the following events as an example:
CaseId | Activity | StartDate | EndDate | Country | City | Price |
---|---|---|---|---|---|---|
1 | A | 10/10/10 08:38 | 11/10/10 08:38 | France | Paris | 10 |
1 | B | 10/10/10 09:40 | 11/10/10 09:40 | Germany | Berlin | 20 |
1 | A | 10/10/10 10:42 | 11/10/10 10:42 | France | Toulouse | 30 |
1 | C | 10/10/10 11:50 | 11/10/10 11:50 | Germany | Munich | 10 |
1 | C | 10/10/10 12:50 | 11/10/10 12:50 | Germany | Hamburg | 20 |
2 | A | 10/10/10 08:20 | 11/10/10 08:20 | France | Rennes | 5 |
2 | B | 10/10/10 09:30 | 11/10/10 09:30 | Germany | Berlin | 10 |
2 | A | 10/10/10 10:40 | 11/10/10 10:40 | France | Bordeaux | 25 |
2 | A | 10/10/10 11:50 | 11/10/10 11:50 | USA | New York | 10 |
When you add this file to an empty project, during the selection of the activity/task column, a little box "Columns to use as unique tasks identifiers" appears as shown in this image:
Here, the columns selected for grouping are CaseId and Country. Combined with the Activity column, on which the grouping is configured, this means that when the Mining processes the file, all events that share the same values for the CaseId, Activity and Country columns are merged into a single event.
You can also uncheck the checkbox if you do not want to merge similar events into one.
When a group is defined, you must then define a grouped tasks aggregation for every dimension and metric that you add to the column mapping. Because the events merged into one event may have different values for their dimensions and metrics, the aggregation declares which value is kept for the new event.
For the dimensions:
As shown in the image below, "First value not null" is selected in the "Aggregation of grouped tasks" box for the City dimension column.
For the grouped tasks aggregation of a dimension column, you have the choice between:
- First value not null
- Last value not null
For the metrics:
As shown in the image below, "Average value" is selected in the "Aggregation of grouped tasks" box for the Price metric column.
For the grouped tasks aggregation of a metric column, you have the choice between:
- First value not null
- Last value not null
- Minimum value
- Maximum value
- Average value
- Median value
- Sum of values
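The aggregation options listed above can be illustrated with a small Python sketch. This is not the Mining's implementation: the function names and lookup tables are hypothetical, and ignoring null values in the numeric aggregations is an assumption made for the example.

```python
from statistics import mean, median


def first_not_null(values):
    """Return the first value that is not None, or None if every value is null."""
    return next((v for v in values if v is not None), None)


def last_not_null(values):
    """Return the last value that is not None, or None if every value is null."""
    return next((v for v in reversed(values) if v is not None), None)


def _non_null(values):
    # Assumption: numeric aggregations simply skip null values.
    return [v for v in values if v is not None]


# Hypothetical lookup tables keyed by the option labels shown in the UI.
DIMENSION_AGGREGATIONS = {
    "First value not null": first_not_null,
    "Last value not null": last_not_null,
}

METRIC_AGGREGATIONS = {
    **DIMENSION_AGGREGATIONS,
    "Minimum value": lambda vs: min(_non_null(vs)),
    "Maximum value": lambda vs: max(_non_null(vs)),
    "Average value": lambda vs: mean(_non_null(vs)),
    "Median value": lambda vs: median(_non_null(vs)),
    "Sum of values": lambda vs: sum(_non_null(vs)),
}

# Values of the two events of CaseId 1 with Activity A and Country France.
cities = ["Paris", "Toulouse"]
prices = [10, 30]

city = DIMENSION_AGGREGATIONS["First value not null"](cities)
price = METRIC_AGGREGATIONS["Average value"](prices)
print(city, price)
```

With the settings used in this example, the City of the merged event is Paris and its Price is the average of 10 and 30.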
Consequently, within a case, all the events that have the same values for the CaseId, Activity and Country columns are grouped together, and the new values of the dimension and metric columns are computed according to their respective grouped tasks aggregation.
If the timestamp columns are not among the columns used for grouping (here, StartDate and EndDate were not selected in the "Columns to use as unique tasks identifiers" box), no aggregation needs to be defined for them, unlike for the dimensions and metrics:
- The lowest timestamp of all the events of a group is used as the start timestamp of the new single event.
- The highest timestamp of all the events of a group is used as the end timestamp of the new single event.
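The timestamp rule can be sketched in a few lines of Python. The helper name `merged_time_span` is hypothetical, and the date format (day/month/two-digit year) is an assumption, since the example data does not state it.

```python
from datetime import datetime

# Assumption: the example timestamps are day/month/year with a 24-hour clock.
FMT = "%d/%m/%y %H:%M"


def merged_time_span(events):
    """Return (start, end) of the merged event: the lowest and highest
    timestamps found among all events of the group."""
    stamps = [
        datetime.strptime(event[field], FMT)
        for event in events
        for field in ("StartDate", "EndDate")
    ]
    return min(stamps), max(stamps)


# The two events of CaseId 1 with Activity A and Country France.
group = [
    {"StartDate": "10/10/10 08:38", "EndDate": "11/10/10 08:38"},
    {"StartDate": "10/10/10 10:42", "EndDate": "11/10/10 10:42"},
]

start, end = merged_time_span(group)
print(start.strftime(FMT), "->", end.strftime(FMT))
# prints: 10/10/10 08:38 -> 11/10/10 10:42
```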
Once the final column mapping is validated, some events are grouped together during the ingestion of the file, as follows:
For CaseId 1:
- The first and third events of this case have the same values for their Activity (A) and Country (France) columns. Consequently, they are grouped together to only make one event of activity A and of country France.
- The second event of this case is not grouped, as no other event in this case has an Activity named B and a Country named Germany.
- The fourth and fifth events of this case have the same values for their Activity (C) and Country (Germany) columns. Consequently, they are grouped together to only make one event of activity C and of country Germany.
For CaseId 2:
- The first and third events of this case have the same values for their Activity (A) and Country (France) columns. Consequently, they are grouped together to only make one event of activity A and of country France.
- The second event of this case is not grouped, as no other event in this case has an Activity named B and a Country named Germany.
- The fourth event of this case is not grouped: it has the same Activity (A) as the first and third events of this case, but its Country (USA) is different.
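The grouping decisions described above can be checked with a short Python sketch. The function name `find_groups` is hypothetical; it only illustrates how a (CaseId, Activity, Country) key determines which events end up in the same group.

```python
from collections import defaultdict

# (CaseId, Activity, Country) for each row of the example table, in file order.
ROWS = [
    ("1", "A", "France"),
    ("1", "B", "Germany"),
    ("1", "A", "France"),
    ("1", "C", "Germany"),
    ("1", "C", "Germany"),
    ("2", "A", "France"),
    ("2", "B", "Germany"),
    ("2", "A", "France"),
    ("2", "A", "USA"),
]


def find_groups(rows):
    """Map each (CaseId, Activity, Country) key to the positions,
    counted within the case, of the events that share it."""
    position_in_case = defaultdict(int)
    groups = defaultdict(list)
    for case_id, activity, country in rows:
        position_in_case[case_id] += 1
        groups[(case_id, activity, country)].append(position_in_case[case_id])
    return dict(groups)


groups = find_groups(ROWS)
for key, positions in groups.items():
    status = "grouped" if len(positions) > 1 else "not grouped"
    print(key, positions, status)
```

Keys that collect more than one position correspond to the events that are merged; keys with a single position are left unchanged.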
Grouping the similar events together gives the following list of events:
CaseId | Activity | StartDate | EndDate | Country | City | Price |
---|---|---|---|---|---|---|
1 | A | 10/10/10 08:38 | 11/10/10 10:42 | France | Paris | 20 |
1 | B | 10/10/10 09:40 | 11/10/10 09:40 | Germany | Berlin | 20 |
1 | C | 10/10/10 11:50 | 11/10/10 12:50 | Germany | Munich | 15 |
2 | A | 10/10/10 08:20 | 11/10/10 10:40 | France | Rennes | 15 |
2 | B | 10/10/10 09:30 | 11/10/10 09:30 | Germany | Berlin | 10 |
2 | A | 10/10/10 11:50 | 11/10/10 11:50 | USA | New York | 10 |
For CaseId 1:
- The first event of this case in the new list of events was created by grouping the first and third events of this case in the initial list of events (before grouping).
  - CaseId was 1 for the two events that were grouped, so it stays at 1 for the new single event.
  - Activity was A for the two events that were grouped, so it stays at A for the new single event.
  - StartDate was 10/10/10 08:38 for the first event that was grouped, and 10/10/10 10:42 for the second one. The lowest timestamp (10/10/10 08:38) is used as the start timestamp of the new single event.
  - EndDate was 11/10/10 08:38 for the first event that was grouped, and 11/10/10 10:42 for the second one. The highest timestamp (11/10/10 10:42) is used as the end timestamp of the new single event.
  - Country was France for the two events that were grouped, so it stays at France for the new single event.
  - City was Paris for the first event that was grouped, and Toulouse for the second one. In the column mapping, First value not null was defined as the aggregation of grouped tasks for this dimension, consequently, as Paris is the first value to come, it is the one used for the new single event.
  - Price was 10 for the first event that was grouped, and 30 for the second one. In the column mapping, Average value was defined as the aggregation of grouped tasks for this metric, consequently, 20 is the value of this metric for the new single event (20 being the result of the average of 10 and 30).
- The second event of this case in the new list of events is identical to the second event of this case in the initial list of events (before grouping), as we couldn't group it with other events.
- The third event of this case in the new list of events was created by grouping the fourth and fifth events of this case in the initial list of events (before grouping).
  - CaseId was 1 for the two events that were grouped, so it stays at 1 for the new single event.
  - Activity was C for the two events that were grouped, so it stays at C for the new single event.
  - StartDate was 10/10/10 11:50 for the first event that was grouped, and 10/10/10 12:50 for the second one. The lowest timestamp (10/10/10 11:50) is used as the start timestamp of the new single event.
  - EndDate was 11/10/10 11:50 for the first event that was grouped, and 11/10/10 12:50 for the second one. The highest timestamp (11/10/10 12:50) is used as the end timestamp of the new single event.
  - Country was Germany for the two events that were grouped, so it stays at Germany for the new single event.
  - City was Munich for the first event that was grouped, and Hamburg for the second one. In the column mapping, First value not null was defined as the aggregation of grouped tasks for this dimension, consequently, as Munich is the first value to come, it is the one used for the new single event.
  - Price was 10 for the first event that was grouped, and 20 for the second one. In the column mapping, Average value was defined as the aggregation of grouped tasks for this metric, consequently, 15 is the value of this metric for the new single event (15 being the result of the average of 10 and 20).
For CaseId 2:
- The first event of this case in the new list of events was created by grouping the first and third events of this case in the initial list of events (before grouping).
  - CaseId was 2 for the two events that were grouped, so it stays at 2 for the new single event.
  - Activity was A for the two events that were grouped, so it stays at A for the new single event.
  - StartDate was 10/10/10 08:20 for the first event that was grouped, and 10/10/10 10:40 for the second one. The lowest timestamp (10/10/10 08:20) is used as the start timestamp of the new single event.
  - EndDate was 11/10/10 08:20 for the first event that was grouped, and 11/10/10 10:40 for the second one. The highest timestamp (11/10/10 10:40) is used as the end timestamp of the new single event.
  - Country was France for the two events that were grouped, so it stays at France for the new single event.
  - City was Rennes for the first event that was grouped, and Bordeaux for the second one. In the column mapping, First value not null was defined as the aggregation of grouped tasks for this dimension, consequently, as Rennes is the first value to come, it is the one used for the new single event.
  - Price was 5 for the first event that was grouped, and 25 for the second one. In the column mapping, Average value was defined as the aggregation of grouped tasks for this metric, consequently, 15 is the value of this metric for the new single event (15 being the result of the average of 5 and 25).
- The second event of this case in the new list of events is identical to the second event of this case in the initial list of events (before grouping), as we couldn't group it with other events.
- The third event of this case in the new list of events is identical to the fourth event of this case in the initial list of events (before grouping), as we couldn't group it with other events.
This new list of events is then used as the data in the Mining project.
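For readers who want to verify the walkthrough, the whole example can be reproduced with a short Python sketch. It is only an illustration under stated assumptions, not the Mining's actual ingestion code: the names `parse` and `group_tasks` are hypothetical, the date format is assumed to be day/month/year, and the output is assumed to be ordered by case and then by start timestamp.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

FMT = "%d/%m/%y %H:%M"  # assumption: day/month/two-digit year
COLUMNS = ["CaseId", "Activity", "StartDate", "EndDate", "Country", "City", "Price"]

RAW = """\
1|A|10/10/10 08:38|11/10/10 08:38|France|Paris|10
1|B|10/10/10 09:40|11/10/10 09:40|Germany|Berlin|20
1|A|10/10/10 10:42|11/10/10 10:42|France|Toulouse|30
1|C|10/10/10 11:50|11/10/10 11:50|Germany|Munich|10
1|C|10/10/10 12:50|11/10/10 12:50|Germany|Hamburg|20
2|A|10/10/10 08:20|11/10/10 08:20|France|Rennes|5
2|B|10/10/10 09:30|11/10/10 09:30|Germany|Berlin|10
2|A|10/10/10 10:40|11/10/10 10:40|France|Bordeaux|25
2|A|10/10/10 11:50|11/10/10 11:50|USA|New York|10
"""


def parse(raw):
    """Turn the pipe-separated example rows into typed dictionaries."""
    events = []
    for line in raw.strip().splitlines():
        row = dict(zip(COLUMNS, line.split("|")))
        row["StartDate"] = datetime.strptime(row["StartDate"], FMT)
        row["EndDate"] = datetime.strptime(row["EndDate"], FMT)
        row["Price"] = int(row["Price"])
        events.append(row)
    return events


def group_tasks(events, key_columns=("CaseId", "Activity", "Country")):
    """Merge events sharing the same key, using 'First value not null'
    for City and 'Average value' for Price, as in the example mapping."""
    groups = defaultdict(list)
    for event in events:
        groups[tuple(event[c] for c in key_columns)].append(event)

    merged = []
    for members in groups.values():
        stamps = [m[f] for m in members for f in ("StartDate", "EndDate")]
        merged.append({
            "CaseId": members[0]["CaseId"],
            "Activity": members[0]["Activity"],
            "StartDate": min(stamps),   # lowest timestamp of the group
            "EndDate": max(stamps),     # highest timestamp of the group
            "Country": members[0]["Country"],
            "City": next((m["City"] for m in members if m["City"] is not None), None),
            "Price": mean([m["Price"] for m in members]),
        })
    # Assumption: output ordered by case, then by start timestamp.
    return sorted(merged, key=lambda e: (e["CaseId"], e["StartDate"]))


result = group_tasks(parse(RAW))
for e in result:
    print(e["CaseId"], e["Activity"], e["StartDate"].strftime(FMT),
          e["EndDate"].strftime(FMT), e["Country"], e["City"], e["Price"], sep=" | ")
```

Under these assumptions, the six printed rows match the grouped table shown above.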