Pandas dataframe directly as DB media column type

Because I want to work with Pandas DataFrames in memory (using data science libraries, not DB rows), and because they take a bit of tidying (especially dates), it would be nice to avoid encoding problems and store/read DataFrames DIRECTLY SOMEHOW as a DB media column type (without going via CSV).

Possible?
How?


You can get to a media object from a DataFrame by serializing the DataFrame.

You can get to a DataFrame from a media object by reading the media object’s bytes.

Here is an example showing some back-and-forth conversions for both json and csv content types.


clone the app:
https://anvil.works/build#clone:FIMSJJBV5EYKZQ7K=NJH74Q6AGAH4HHPWH7N2TVID
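The back-and-forth can be sketched without the Anvil-specific parts: serialize the DataFrame to bytes (which is what you would wrap in a Media object, e.g. with `anvil.BlobMedia`), then read those bytes back into a DataFrame. A minimal sketch with a made-up DataFrame:

```python
import pandas as pd
from io import BytesIO

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

# DataFrame -> bytes: these are the bytes a Media object would carry,
# e.g. anvil.BlobMedia('text/csv', csv_bytes, name='data.csv')
csv_bytes = df.to_csv(index=False).encode('utf-8')
json_bytes = df.to_json(orient='records').encode('utf-8')

# bytes -> DataFrame: in Anvil the bytes would come from media.get_bytes()
df_from_csv = pd.read_csv(BytesIO(csv_bytes))
df_from_json = pd.read_json(BytesIO(json_bytes), orient='records')

print(df_from_csv.equals(df))   # True
print(df_from_json.equals(df))  # True
```

Both round trips recover the original frame here, but note that CSV/JSON can lose dtype information for less plain columns (dates in particular).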


Thanks, that’s close.
I was hoping the raw DF could somehow be stored, because of problems I’m having with string date parsing.

@paul.k.pallaghy did you have any luck with date parsing? I had a similar issue and wondered how you got around it.


I hope @paul.k.pallaghy got it to work. But another option would be to pickle the DataFrame, which can then be stored as a bytes object. This is the more Pandas/Python way to do it than JSON. The relevant Pandas methods are pandas.to_pickle and pandas.read_pickle.
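A small sketch of the pickle round trip through an in-memory buffer (the buffer stands in for the media column’s bytes; the DataFrame is made up):

```python
import pandas as pd
from io import BytesIO

df = pd.DataFrame({'when': pd.to_datetime(['2021-01-01', '2021-06-15']),
                   'score': [1.5, 2.5]})

# DataFrame -> pickled bytes (these are what you'd store in the media column)
buf = BytesIO()
df.to_pickle(buf)
pickled_bytes = buf.getvalue()

# bytes -> DataFrame: dtypes (including datetime64) survive intact,
# unlike a CSV round trip where dates come back as strings
restored = pd.read_pickle(BytesIO(pickled_bytes))
print(restored.dtypes['when'])  # datetime64[ns]
print(restored.equals(df))      # True
```

The usual pickle caveat applies: only unpickle bytes you wrote yourself, since unpickling untrusted data can execute arbitrary code.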

Another option for serialization is HDF5, but this really only has advantages for very large datasets.

If the issue is dtypes (column types) in Pandas there are other methods to handle this after the DataFrame has been created, including some very spiffy date handlers.
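For instance, if a CSV round trip has left everything as strings, the dtypes can be fixed up after the fact. A minimal sketch with made-up columns:

```python
import pandas as pd

# as it might arrive from a CSV round trip: everything is a string
df = pd.DataFrame({'when': ['2021-01-01 09:30', '2021-06-15 17:45'],
                   'score': ['1.5', '2.5']})

# repair the dtypes after the DataFrame has been created
df['when'] = pd.to_datetime(df['when'])   # strings -> datetime64[ns]
df['score'] = df['score'].astype(float)   # strings -> float64

print(df.dtypes)
```

(`pd.read_csv` can also do the date part up front via its `parse_dates` argument.)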


Not much luck.
I stand by my assertion that FORCED conversion to time zone-aware (without ANY recourse) is a fool’s errand.

@meredydd can you please see that it is not just me that REALLY dislikes FORCED zone-aware datetimes (between server & local).

I was actually referring to this @paul.k.pallaghy:

@robert I have not had a chance to try this but I will!


I recently dealt with the issue of timezone-aware datetimes in Data Tables. Having timezone-aware datetimes in Data Tables is fairly annoying, but in the long run it is probably better.

Since I tend to use Pandas for my data work in server modules, I had to use string methods to cut up the datetime string and then convert it to a Pandas datetime. It’s definitely annoying, but it works.

Once you get the data table into Pandas, you can convert it like so:

pd.to_datetime(datetime_entry[:10] + ' ' + datetime_entry[11:16], utc=True).tz_convert('Asia/Ulaanbaatar').strftime('%Y-%m-%d %H:%M')

I used Asia/Ulaanbaatar since that’s where I’m at. By default Anvil stores datetimes in UTC+0 (it would be nice to be able to customize this per app).
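For what it’s worth, if the stored value is already a full ISO 8601 string, `pd.to_datetime` can parse it directly without the slicing. A small sketch (the timestamp value is made up):

```python
import pandas as pd

# Anvil stores datetimes in UTC; an ISO 8601 string parses directly
stamp = '2021-06-15T09:30:00+00:00'

# parse as UTC, then shift to the local zone (Ulaanbaatar is UTC+8, no DST)
local = pd.to_datetime(stamp, utc=True).tz_convert('Asia/Ulaanbaatar')
print(local.strftime('%Y-%m-%d %H:%M'))  # 2021-06-15 17:30
```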

The code here demonstrates pulling a data table as a CSV, storing it as a Pandas dataframe, then converting the datetime feature to a Pandas datetime feature (which is waaaay easier for me than normal Python datetimes).

# imports needed for this snippet (plus Anvil's app_tables)
import pandas as pd
from io import BytesIO

# pull the entire submissions table from the Data Table and store it in a dataframe
table = app_tables.submissions.search(pset=pset_num).to_csv().get_bytes()
df = pd.read_csv(BytesIO(table))

# loop through each name in the dataframe and collect the leaderboard rows
# (DataFrame.append was removed in pandas 2.0, so gather dicts and build the frame once)
rows = []
for user in df['name'].unique():
  user_df = df[df['name'] == user]
  latest_entry = user_df['entry'].max()
  rows.append({
      'name': user,
      'last_score': round(user_df.sort_values(by='entry')['score'].iloc[-1], 3),
      'best_score': round(user_df['score'].min(), 3),
      'latest_entry': pd.to_datetime(latest_entry[:10] + ' ' + latest_entry[11:16], utc=True).tz_convert('Asia/Ulaanbaatar').strftime('%Y-%m-%d %H:%M'),
      'entries': user_df.shape[0],
  })

# build the leaderboard dataframe from the collected rows
leaderboard = pd.DataFrame(rows, columns=['name', 'last_score', 'best_score', 'latest_entry', 'entries'])