Using Pandas methods

Pandas methods are the simplest option for handling your dataset (compared to SQL queries), but they may be less performant. Pandas can also be used to easily plot graphs.

If you want to see the structure of the datasource, you can use the dataframe's .columns attribute:

df = my_project.edges_datasource.load_dataframe(load_limit=1000)
print(df.columns)

Operations can be performed on the dataframe. For example, we can calculate the mean value of the duration column:

duration_mean = df['duration'].to_numpy().mean()
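Note that going through NumPy changes how missing values are treated. The sketch below uses a small hand-made dataframe (a stand-in for the loaded one, with an assumed missing duration) to compare the NumPy mean with the Pandas Series mean:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loaded dataframe; one duration is missing.
df = pd.DataFrame({"duration": [10.0, 20.0, np.nan, 30.0]})

# NumPy's mean propagates NaN, so a single missing value makes the result NaN.
numpy_mean = df["duration"].to_numpy().mean()

# The Pandas Series method skips NaN by default (skipna=True).
pandas_mean = df["duration"].mean()

print(numpy_mean)   # nan
print(pandas_mean)  # 20.0
```

If the duration column may contain empty values, df['duration'].mean() is probably the safer choice.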

We can also get the maximum or minimum of the enddate column:

enddate_max = df['enddate'].max()
enddate_min = df['enddate'].min()
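These calls compare values according to the column's type. If enddate happens to be loaded as text rather than as dates (an assumption; check df.dtypes on your own data), the comparison is alphabetical. A minimal sketch with made-up dates in an assumed month/day/year format:

```python
import pandas as pd

# Hypothetical data: dates stored as month/day/year strings.
df = pd.DataFrame({"enddate": ["09/01/2022", "10/01/2021"]})

# On strings, max() compares character by character.
text_max = df["enddate"].max()

# Converting to datetime first gives a chronological comparison.
parsed = pd.to_datetime(df["enddate"], format="%m/%d/%Y")
enddate_max = parsed.max()

print(text_max)     # 10/01/2021
print(enddate_max)  # 2022-09-01 00:00:00
```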

In the next example, we group the dataframe by case ID:

by_caseid = df.groupby('caseid')
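The grouped object does not hold any result by itself; values are computed once an aggregation is applied to it. As a rough illustration, with made-up caseid and duration values standing in for the real data:

```python
import pandas as pd

# Hypothetical stand-in for the loaded dataframe.
df = pd.DataFrame({
    "caseid": ["A", "A", "B", "B", "B"],
    "duration": [5, 15, 10, 20, 30],
})

by_caseid = df.groupby("caseid")

# Sum of the durations within each case.
total_duration = by_caseid["duration"].sum()

# Number of rows belonging to each case.
rows_per_case = by_caseid.size()

print(total_duration)
print(rows_per_case)
```

Both results are Series indexed by case ID, so total_duration['A'] gives the total for case A.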

Furthermore, the Pandas .describe() method can be applied to our dataframe:

stats_summary = df.describe()

This method returns a statistical summary of the dataframe. By default, it performs the following operations for each numeric column:

count the number of non-null values
calculate the mean (average) value
calculate the standard deviation
get the minimum value
calculate the 25th percentile
calculate the 50th percentile (the median)
calculate the 75th percentile
get the maximum value

It then stores the result of all previous operations in a new dataframe (here, stats_summary).
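Individual statistics can then be read back from the summary by row label. The following sketch uses a small hand-made dataframe (the values are illustrative only) to show the shape of the result:

```python
import pandas as pd

# Hypothetical stand-in for the loaded dataframe.
df = pd.DataFrame({
    "caseid": ["A", "A", "B", "B"],
    "duration": [10.0, 20.0, 30.0, 40.0],
})

stats_summary = df.describe()

# Each statistic is a row; each summarized column is a column.
# The non-numeric caseid column is left out by default.
print(list(stats_summary.index))
print(list(stats_summary.columns))

# Read single values with .loc[row_label, column_name].
duration_mean = stats_summary.loc["mean", "duration"]
duration_median = stats_summary.loc["50%", "duration"]
```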

If need be, you can directly use the datasource's connection and cursor methods, as specified in the Python Database API (PEP 249):

ds = my_project.edges_datasource
ds.connection
ds.cursor
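For readers unfamiliar with the Python Database API, the sketch below shows the usual execute/fetch pattern. It runs against an in-memory SQLite database from the standard library, which is only a stand-in: the table name, the columns, and the "?" placeholder style are assumptions and may differ for the datasource's actual driver.

```python
import sqlite3

# Stand-in connection: an in-memory SQLite database, NOT the datasource itself.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Hypothetical table and rows, created only for this illustration.
cursor.execute("CREATE TABLE edges (caseid TEXT, duration REAL)")
cursor.executemany(
    "INSERT INTO edges VALUES (?, ?)",
    [("A", 10.0), ("A", 20.0), ("B", 30.0)],
)
connection.commit()

# Run a query, then fetch all result rows as a list of tuples.
cursor.execute(
    "SELECT caseid, AVG(duration) FROM edges GROUP BY caseid ORDER BY caseid"
)
rows = cursor.fetchall()
print(rows)  # [('A', 15.0), ('B', 30.0)]

cursor.close()
connection.close()
```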