Using SQL Queries

In the last section, it was shown how to load the entire dataframe and do operations with Pandas. With SQL you can load the dataframe and do the operations in the same query, hence why it is faster. Those requests are often more powerful and performant than the Pandas methods presented above. However, the Pandas methods are simpler to use and understand.

In those requests, the name of the table is accessible through the name attribute and must be given in between double quotes. The use of f-strings is highly recommended.

A description of the tables is available here. It contains a glossary and descriptions of the different elements in a table.

Therefore, the following request returns all the columns of the given datasource for the specified process key.

ds = my_project.edges_datasource
edges_filtered = ds.request(f'SELECT * FROM \"{ds.name}\" WHERE "processkey" = \'<Your Process Key>\'')

We can also calculate the mean value of the duration column. For example:

edge_duration_mean = ds.request(f'SELECT AVG(duration) FROM \"{ds.name}\"')

In a similar manner, the subsequent requests return respectively the minimum and the maximum value of the enddate column:

edge_enddate_min = ds.request(f'SELECT MIN(enddate) FROM \"{ds.name}\"')
edge_enddate_max = ds.request(f'SELECT MAX(enddate) FROM \"{ds.name}\"')

Finally, the following SQL statement lists the number of cases by detail destination, sorted high to low:

count = ds.request(f'SELECT COUNT(caseid), detail_7_lpkr_destination FROM \"{ds.name}\" GROUP BY detail_7_lpkr_destination ORDER BY COUNT(caseid) DESC')