Using Pandas methods

Pandas methods are the simplest way to work with your dataset (compared to SQL queries), but they may be less performant. Pandas can also be used to plot graphs easily.

If you want to see the structure of the datasource, you can use the dataframe's .columns attribute:

df = my_project.edges_datasource.load_dataframe(load_limit=1000)
print(df.columns)

Operations can be performed on the dataframe. For example, we can calculate the mean value of the duration column:

duration_mean = df['duration'].mean()

We can also get the maximum or minimum of the enddate column:

enddate_max = df['enddate'].max()
enddate_min = df['enddate'].min()
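
As a minimal sketch of the operations above, the following builds a small hypothetical dataframe (the values are invented for illustration; only the column names duration, enddate and caseid come from the examples on this page) and computes the same statistics. Note that the Pandas .mean() method skips missing values by default.

```python
import pandas as pd

# Hypothetical stand-in for the dataframe returned by load_dataframe();
# the values are made up, only the column names match the examples above.
df = pd.DataFrame({
    "caseid": ["A", "A", "B"],
    "duration": [10.0, 20.0, 60.0],
    "enddate": pd.to_datetime(["2023-01-01", "2023-01-03", "2023-01-02"]),
})

# Mean of a numeric column (missing values are skipped by default)
duration_mean = df["duration"].mean()

# Earliest and latest values of a datetime column
enddate_min = df["enddate"].min()
enddate_max = df["enddate"].max()

# Time span covered by the loaded rows, as a Timedelta
span = enddate_max - enddate_min

print(duration_mean, enddate_min, enddate_max, span)
```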

We can also group the dataframe by case ID. This returns a GroupBy object, to which aggregations can then be applied:

by_caseid = df.groupby('caseid')
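
Grouping only becomes useful once an aggregation is applied to the groups. The sketch below assumes a small hypothetical dataframe (invented values, same column names as above) and computes the total and mean duration per case.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from this page.
df = pd.DataFrame({
    "caseid": ["A", "A", "B"],
    "duration": [10.0, 20.0, 60.0],
})

by_caseid = df.groupby("caseid")

# One row per case ID, with the sum and mean of its durations
per_case = by_caseid["duration"].agg(["sum", "mean"])

print(per_case)
```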

Furthermore, the Pandas .describe() method can be applied to our dataframe:

stats_summary = df.describe()

This method returns a statistical summary of the dataframe. For a dataframe with mixed column types, only numeric columns are summarized by default. For each of those columns, it does the following operations:

  • count the number of non-null values
  • calculate the mean (average) value
  • calculate the standard deviation
  • get the minimum value
  • calculate the 25th percentile
  • calculate the 50th percentile (the median)
  • calculate the 75th percentile
  • get the maximum value

It then returns the results of all these operations as a new dataframe (here, stats_summary).
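
The summary can be inspected like any other dataframe: each statistic is a row label and each summarized column is a column. A minimal sketch on hypothetical data (invented values, column names as above):

```python
import pandas as pd

# Hypothetical sample data; the text column is skipped by describe() by default.
df = pd.DataFrame({
    "caseid": ["A", "A", "B", "B"],
    "duration": [10.0, 20.0, 30.0, 40.0],
})

stats_summary = df.describe()

# Row labels of the summary: one per statistic
print(list(stats_summary.index))

# Look up a single statistic by row label and column name
median_duration = stats_summary.loc["50%", "duration"]
print(median_duration)
```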

If need be, you can directly use the datasource's connection and cursor, which can be used as specified in the Python Database API Specification (PEP 249):

ds = my_project.edges_datasource
ds.connection
ds.cursor
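
To illustrate what a Python Database API workflow generally looks like, here is a sketch that uses the standard library's sqlite3 module purely as a stand-in driver. The table name, the data and the "?" placeholder style are assumptions specific to this example; the datasource's own connection may use a different placeholder style and SQL dialect.

```python
import sqlite3

# sqlite3 is only a stand-in DB-API 2.0 driver for this sketch;
# with a real datasource you would use its own connection and cursor.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Hypothetical table and rows, invented for illustration
cursor.execute("CREATE TABLE edges (caseid TEXT, duration REAL)")
cursor.executemany(
    "INSERT INTO edges VALUES (?, ?)",  # "?" is sqlite3's placeholder style
    [("A", 10.0), ("A", 20.0), ("B", 60.0)],
)
connection.commit()

# Standard DB-API pattern: execute a query, then fetch the rows
cursor.execute(
    "SELECT caseid, AVG(duration) FROM edges GROUP BY caseid ORDER BY caseid"
)
rows = cursor.fetchall()
print(rows)

cursor.close()
connection.close()
```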