Code Snippet - Remove Nans From Excel Data

serpenttracker · April 7, 2020, 10:49am

So in case anyone else has tried to do this or it might come up I thought I would share. I’m making a mass snake importer for my platform so people can bring their data over from the competition.

It was a mess and wasn’t even formatted right, etc lol. So there were also a lot of nan values I didn’t want to deal with so after some different testing I found this to work pretty well and is pretty fast.

import pandas as pd

def to_dict_dropna(data):
  return [{k:v for k,v in m.items() if pd.notnull(v)} for m in data.to_dict(orient='records')]

reptile_df = pd.read_csv('/tmp/reptiscan_import/reptiles.xls',delimiter="\t")
reptiles_data = to_dict_dropna(reptile_df)

Which now gives me nice pretty data which came from a old windows 97 formatted tab delimited excel file lol.

{
'id': 236080, 
'name': 'Sir Hiss', 
'origin': 'Twilight Balls',
'sex': 'Male', 'born': '2018-01-01'
}

(admin - moved to show & tell)

stucork · April 7, 2020, 10:57am

related post here - i’ve found the same problem before and came up with pretty much the same solution.

serpenttracker · April 7, 2020, 10:59am

Are you deleting out the tmp dir each time or how you think that should work? My concern is if I have a bunch of people importing data I don’t want the same folders and stuff. The issue is the data this comes from typically will have a same name and a zip file with the same file names like reptiles.xls, etc.

So I’m thinking after I’m done with the import cleaning it up or using some uuid or something to ensure I have a specific folder for the import inside the /tmp if that makes sense.

stucork · April 7, 2020, 11:12am

that was an example for someone who had a persistent csv file in a datatable

I guess it depends how you get the mediaobject
if it’s a mediaobject that you send from the client
and
if you use (from the docs) https://anvil.works/docs/media#files-in-server-modules

with anvil.media.TempFile(media_object) as file_name:
    # Now there is a file in the filesystem containing the contents of media_object.
    # The file_name variable is a string of its full path.

then file_name will be some random string
/tmp/4x1fqtt89ktff0gjtwqwimktbkldmtsr

serpenttracker · April 7, 2020, 11:14am

yea but in my case its a zip file that then gets extracted to a folder, I guess maybe I can use this file name as the folder name or something.

robert · April 7, 2020, 4:08pm

Could you show a print of the messy Excel file as it was imported by pandas? Did a simple reptile_df.dropna() not work?

serpenttracker · April 7, 2020, 4:20pm

Can’t show the data it’s personal info but yea that just wiped out rows it seemed and caused issues.

robert · April 7, 2020, 4:26pm

Ah gotcha. I know what you mean as the dropna method is basically a hammer and you needed a scalpel it seems.

campopianoa · April 7, 2020, 4:28pm

Good post on stack overflow:

serpenttracker · April 7, 2020, 4:35pm

Yep this came from that I think I adapted it some can’t remember. I know the compat option is for older versions I think