Removing duplicate rows from a data table

Hi all,
I am a bit of a Python newbie and I have been finding the Anvil community forum a huge help on my learning journey. (A massive shout out to @nickantonaccio for his 'Learn the Python Anvil Framework' tutorial/bible - this really helped to get me started).

I’m currently playing around with an app that pulls data from an external data source based on a keyword search entered by the user. It then stores the retrieved data frame in a data table. I have had success with this part; however, I am finding that I create a lot of duplicate rows in the data table when subsequent searches are executed and added to the table.

If the data was in a dataframe I could use, for example,

df.drop_duplicates(subset=['ID'])

Is there an equivalent method I could use for searching the data table? I’ve had a look around the forums and tutorials and I can’t see anything, but I apologise if I have missed something.

Thanks in advance for any input/comments.

It’s not built in, but it should be easy for you to build using Queries.

You could also take the approach of filtering after pulling from the external source, but before adding to the data table. That way you never add the duplicates in the first place (assuming that’s the goal).

The general logic would be:

results = fetch_from_external_source()  # however you pull the external data

for result in results:
    # look for an existing row with the same key value
    row = app_tables.whatever.get(keyfield=result['keyfield'])

    if row is None:
        # no existing row matches, so this result is new -- add it
        # (this assumes the result's keys match the table's column names)
        app_tables.whatever.add_row(**result)
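The same check-before-insert idea can be sketched in plain Python, with a dict standing in for the data table and a made-up 'ID' key column, to show why duplicates from a second search never get stored:

```python
# 'table' stands in for the Anvil data table, keyed by the unique ID column.
table = {}

def add_if_new(table, result):
    """Add result to the table only if its ID is not already present."""
    key = result['ID']
    if key not in table:
        table[key] = result
        return True   # row was added
    return False      # duplicate, skipped

# Two searches whose results overlap on ID 2
search1 = [{'ID': 1, 'name': 'alpha'}, {'ID': 2, 'name': 'beta'}]
search2 = [{'ID': 2, 'name': 'beta'}, {'ID': 3, 'name': 'gamma'}]

for result in search1 + search2:
    add_if_new(table, result)

print(sorted(table))  # [1, 2, 3] -- ID 2 is only stored once
```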

You can query the table ordering by the ID column, then go through the result and delete every row that has the same ID as the previous row.


Thanks @jshaffstall, I did consider this but the challenge I have is it is usually subsequent searches that are causing duplicates to be added.
So Search1 pulls in unique data with no duplicates within the results and these are then added to the data table.
Search2 pulls in new data and whilst there are no duplicates within the results of Search2 there may be some rows that are duplicates of Search1 which would only become apparent when they are added to the data table.

Thanks @stefano.menci and @p.colbert for your responses. I did look at queries but couldn’t get my head around how I would make it work. I will have another look and consider how I may sort by ID and delete as suggested.

The algorithm I posted handles this case. The search for duplicates is not within the search results, but the entire table. The results of Search2 would still be unable to add duplicates (based on whatever key field you use in the app_tables.get call) to the data table.


Ah, I’m showing my ignorance! Thank you so much, this does exactly what I need it to do!
I’ve slightly amended it to check two columns: if both columns match, the row isn’t added; if only one matches, it is still added. You’ve saved me many hours, so thanks again.
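For anyone following along, the two-column variant described here might look something like this (column names `col_a` and `col_b` are made up, and a plain list stands in for the data table):

```python
# Stand-in for the data table: a list of dicts.
table = []

def add_unless_both_match(table, result):
    """Skip the result only when BOTH columns match an existing row."""
    for row in table:
        if row['col_a'] == result['col_a'] and row['col_b'] == result['col_b']:
            return False  # full match on both columns: treat as a duplicate
    table.append(result)  # at most one column matched, so still add it
    return True

add_unless_both_match(table, {'col_a': 1, 'col_b': 'x'})
add_unless_both_match(table, {'col_a': 1, 'col_b': 'y'})  # only col_a matches: added
add_unless_both_match(table, {'col_a': 1, 'col_b': 'x'})  # both match: skipped

print(len(table))  # 2
```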

Right… on to the next challenge!


Here is a quick example:

# sentinel value that won't match any real ID
previous_row_id = 'non_existing_id'
for row in app_tables.table_1.search(tables.order_by('ID')):
    this_row_id = row['ID']
    if this_row_id == previous_row_id:
        # same ID as the previous row, so this one is a duplicate
        row.delete()
    previous_row_id = this_row_id

And here is how you can test it on a server console (careful, these actions can’t be undone, make sure you test it on a test table first!):
(screenshot of a server console session)
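To see what the loop above does without touching a real table, the same sort-then-delete-adjacent-duplicates logic can be tried on a plain Python list of dicts (test data made up here):

```python
# Rows with duplicate IDs, in no particular order.
rows = [
    {'ID': 'b', 'val': 2},
    {'ID': 'a', 'val': 1},
    {'ID': 'b', 'val': 3},
    {'ID': 'a', 'val': 4},
]

deduped = []
previous_row_id = None  # sentinel: nothing precedes the first row
for row in sorted(rows, key=lambda r: r['ID']):
    if row['ID'] == previous_row_id:
        continue  # same ID as the previous row: drop the duplicate
    deduped.append(row)
    previous_row_id = row['ID']

print([r['ID'] for r in deduped])  # ['a', 'b']
```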


Hey @stefano.menci, thank you so much for taking the time to pull all that together, much appreciated!
I’ll work through it and give it a go.

Merry Christmas!