Server code took too long

def to_db(csv_object, job_name):
  df = pd.DataFrame([row.replace('"', '').split(",") for row in csv_object.get_bytes().decode(encoding='UTF-8').split("\n")])
  df.to_sql(name = job_name, con = engine, if_exists = 'replace')
  return True

Is there any way to make this function run faster?
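For reference, a minimal sketch of the same function using pd.read_csv, which does the parsing and quote handling in C rather than in a Python loop (the function name is made up; it assumes csv_object.get_bytes() returns the raw CSV bytes and engine is the SQLAlchemy engine already set up):

import io
import pandas as pd

def to_db_sketch(csv_object, job_name):
  # read_csv handles quoted fields, delimiters and blank trailing lines itself
  df = pd.read_csv(io.BytesIO(csv_object.get_bytes()), encoding='utf-8')
  # engine is assumed to be the same SQLAlchemy engine as in the original snippet
  df.to_sql(name=job_name, con=engine, if_exists='replace', index=False)
  return True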

Which bit is running slowly? The comprehension loop or the call to “to_sql”?
That might help determine if it’s the database write or not.

edit - actually, I’m no good with Pandas - probably best left to someone else. Sorry!

I can’t even test that, because I think the problem is bugs in my list comprehension.

[row.replace('"', '').split(",") for row in csv_object.get_bytes().decode(encoding='UTF-8').split("\n")]

For some reason, some rows are missing information, e.g.:

column1, column2, column3
val1, val2, val3,

gets turned into

column1, column2, column3
val1, '', ' ',

I am trying the below now:
[row.split(",") for row in csv_object.get_bytes().decode(encoding='UTF-8').split("\n")]
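
For reference, the stdlib csv module handles quoted fields and commas inside quotes, which avoids the blank-field problem a plain split(",") runs into; a minimal sketch over the same decoded string:

import csv
import io

text = csv_object.get_bytes().decode('utf-8')
# csv.reader strips the quotes and keeps commas inside quoted fields intact;
# "if row" drops the empty row produced by a trailing newline
rows = [row for row in csv.reader(io.StringIO(text)) if row]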

Do you have a sample CSV containing “good” and “bad” rows that I could run here to test? I might be able to help if I can watch it process something.

I had a garbage-in problem: my CSV had the column names duplicated at the end of the file, with a different number of columns. So I was getting garbage out, which is why it errored.

So is it all sorted now?

Yes, my code worked, but I forgot to double-check the input CSV. I thought the problem was the size of the CSV, but it was that the data was bad. Thanks!

I was short on time previously, but I have a script that has the information you were asking for:

t1 = t.time()  # t is the time module (import time as t)
csv = csv_object.get_bytes().decode(encoding='UTF-8').split("\n")
print(t.time()-t1)

0.0034492015838623047

df_string = [row.replace('"', '').split(",") for row in csv[(completed+1):(completed+1+chunk_size)]]
print(t.time()-t1)

0.0036323070526123047

df = pd.DataFrame(df_string, columns=header)
print(t.time()-t1)

0.006586551666259766

df.to_sql(name = job_name, 
          con = engine, 
          if_exists = 'append', 
          index = False)
print(t.time()-t1)

3.614392042160034

It seems that the to_sql method takes forever.
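
One thing that may be worth trying before giving up on ElephantSQL: to_sql can batch rows into multi-row INSERT statements, which often helps a lot over a remote connection. A sketch with the same df and engine (the chunksize value is a guess to tune):

df.to_sql(name = job_name,
          con = engine,
          if_exists = 'append',
          index = False,
          method = 'multi',   # one INSERT per batch instead of one per row
          chunksize = 1000)   # rows per batch; tune for your data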

It seems that Uplink is the only way to go; ElephantSQL takes too long for the time they allocate to sync our server modules.

3.6 seconds does seem like a long time to perform an SQL action (I’m assuming it’s an INSERT of some sort?)

Out of curiosity, how big is the data you’re writing to the DB? And how big is the table already? Indexing could be what’s slowing it down.

Just seems slightly excessive to me.

31 columns and around 30k rows would be the largest CSV.

I tried dumping the entire CSV into the ElephantSQL database, but it would take too long and give me the “server code took too long” error.

So I am now trying to call Anvil server functions a chunk at a time, but it still gives me the “server code took too long” error even with 2 rows per server call.

Any reason you can’t use the PostgreSQL equivalent of MySQL’s LOAD INFILE? I think it’s “COPY” - PostgreSQL: Documentation: 16: COPY

Without waiting for an answer :slight_smile: I’m going to assume you can, so the way I upload 100k+ rows in a few seconds is to write the uploaded file to /tmp/<randomfilename> (you can do this with Anvil) and then LOAD DATA LOCAL INFILE '/tmp/<filename>' etc. etc.

Have a read of this thread :

I got the code in there, but I seem to still need a bug exterminator for this app lol.

error is:
ProgrammingError: (psycopg2.ProgrammingError) must be superuser to COPY to or from a file HINT: Anyone can COPY to stdout or from stdin. psql's \copy command also works for anyone.
code is:

@anvil.server.callable
def to_db(csv_object, job_name):
  
  if job_name == 'final':
    tmp_name = "/tmp/final"
    with open(tmp_name, 'wb') as fp:
      fp.write(csv_object.get_bytes())
    engine.execute("COPY final FROM '/tmp/final' DELIMITER ','")

Hmm…MySQL doesn’t give me that problem.
I need to do some reading :slight_smile:
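
For reference, the hint in that error points at the workaround: COPY FROM a server-side file path needs superuser, but COPY ... FROM STDIN does not. A minimal sketch streaming the upload through psycopg2’s copy_expert (assuming engine is the same SQLAlchemy engine and the "final" table already exists with columns matching the CSV):

import io
import anvil.server

@anvil.server.callable
def to_db(csv_object, job_name):
  if job_name == 'final':
    buf = io.BytesIO(csv_object.get_bytes())
    # raw_connection() exposes the underlying psycopg2 connection
    conn = engine.raw_connection()
    try:
      with conn.cursor() as cur:
        # FROM STDIN streams the data over the existing connection, so no superuser is needed;
        # HEADER skips the column-name row in the CSV
        cur.copy_expert("COPY final FROM STDIN WITH (FORMAT csv, HEADER)", buf)
      conn.commit()
    finally:
      conn.close()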