I’ve uploaded a 500,000 row CSV to Google Drive. All good.
I want to try using PyMySQL to do a “LOAD DATA LOCAL INFILE …” to get the data in that file into the database, as 500k individual inserts are not desirable (plus 500k rows is a small sample anyway).
Is it possible to refer to that Google Drive file in the SQL statement? In the normal scheme of things I would point the SQL to the path of the file, but the file is not really on a “path” as such, is it?
Can’t quite get my head around it … is it even possible?
Very broadly speaking, I’m trying to do this:
fdir = app_files.anviltest
f = fdir.get("new_textfile.txt")

with connection.cursor() as cursor:
    cursor.execute("drop table if exists fred")
    cursor.execute("create table fred(myfield varchar(50))")
    sql = "LOAD DATA LOCAL INFILE %s INTO TABLE fred" % f
    cursor.execute(sql)
which bombs out with PyMySQL complaining of “None” being in the SQL string.
EDIT - actually, the IDE doesn’t autocomplete the folder name “anviltest” in the server module, but it does on the client form. I know the folder is there because I can use the FileLoader control to put files in it (in this case I uploaded new_textfile.txt). Should I be able to? Anvil doesn’t complain on that line, though.
EDIT 2 - ok, a typo on my side has changed the error to this:
"You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '<Google Drive File: new_file.txt> INTO TABLE fred' at line 1"
PyMySQL is obviously looking for a traditional path - any suggestions?
I have one solution, and that is to upload the file not to Google Drive but to another server using “Uplink”. The local process can then create a file to be loaded from a more traditional file system. I haven’t tried that yet, but it should work.
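Roughly what I have in mind for the Uplink route (untested, and the key and function name are placeholders I’ve made up):

import anvil.server

# This script runs on the other server, which has a normal filesystem
anvil.server.connect("YOUR-UPLINK-KEY")

@anvil.server.callable
def save_csv(csv_media):
    # Write the uploaded Media object out as a normal local file,
    # which MySQL could then load with LOAD DATA LOCAL INFILE
    with open("/tmp/upload.csv", "wb") as f:
        f.write(csv_media.get_bytes())
    return "/tmp/upload.csv"

anvil.server.wait_forever()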
Would still like to know if there’s a self-contained way in Anvil to achieve the same result.
Hi David,
You’re quite right, the MySQL client library requires a real local filename, and the Google Drive file is not local. Thankfully, that’s easy to fix. Just write the contents of that Google Drive file into a local file, then use that:
import random, os

# Generate a temporary random local file name
tmp_name = "/tmp/%s" % "".join([random.choice("0123456789abcdef") for x in range(32)])

# Read the Google Drive file into the temporary local file
with open(tmp_name, 'wb') as f:
    f.write(app_files.my_file.get_bytes())

### Do whatever you need to do with the local file
### ...

# Delete the local file now we're finished with it
os.unlink(tmp_name)
Does that do what you want?
I’ve been assuming that we don’t have the ability to access a local file system.
So that local file is on your server?
Correct, your server modules are executed in real Python running on a real server. There is strict isolation between users, and no guarantee that anything will persist between server calls (that’s what the anvil.server.session object is for).
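For example, a minimal sketch of using the session object in a server module (the key name here is just for illustration):

import anvil.server

@anvil.server.callable
def remember_filename(name):
    # anvil.server.session behaves like a dictionary and persists for
    # the lifetime of the user's session, across server calls
    anvil.server.session['uploaded_filename'] = name

@anvil.server.callable
def get_filename():
    return anvil.server.session.get('uploaded_filename')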
Yay! It works!
I am uploading the file directly to a tmp file (bypassing Google Drive) and executing the load from there. Works perfectly; 5 seconds for a 100k file is perfectly acceptable for now.
Here’s what I’m doing. This is only a guide with no error checking, so beware! It also only works with a single-field file/database table.
On the client (form):

def file_loader_1_change(self, files, **event_args):
    # This method is called when a new file is loaded into this FileLoader
    for f in files:
        anvil.server.call('read_csv', f)
On the server:

import anvil.server
import pymysql
import random, os

def test_db(tmp_name):
    # local_infile=True is required for LOAD DATA LOCAL INFILE to be allowed
    connection = pymysql.connect(host='x.x.x.x',
                                 user='xxx',
                                 password='xxx',
                                 port=XXX,
                                 db='xxx',
                                 local_infile=True)
    with connection.cursor() as cursor:
        cursor.execute("drop table if exists fred")
        cursor.execute("create table fred(myfield varchar(50))")
        sql = "LOAD DATA LOCAL INFILE '%s' INTO TABLE fred" % tmp_name
        cursor.execute(sql)
    connection.commit()
    connection.close()
@anvil.server.callable
def read_csv(csv_object):
    csv_bytes = csv_object.get_bytes()
    # Convert bytes to a string
    csv_string = str(csv_bytes, "utf-8")
    # Create a list of lines split on \n
    line_list = csv_string.split('\n')
    # Generate a temporary random local file name
    tmp_name = "/tmp/%s" % "".join([random.choice("0123456789abcdef") for x in range(32)])
    with open(tmp_name, 'w') as fp:
        for line in line_list:
            fp.write(line + "\n")
    test_db(tmp_name)
    os.unlink(tmp_name)
Great, glad that works. Is there a reason you decode the CSV file, split it up, then write it to the temporary file line by line? Why not just dump the entire file in directly with a single fp.write(csv_object.get_bytes())? Of course, you will need to open fp in 'wb' mode for that to work correctly.
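Something like this, sketching just the changed part of read_csv:

@anvil.server.callable
def read_csv(csv_object):
    # Generate a temporary random local file name
    tmp_name = "/tmp/%s" % "".join([random.choice("0123456789abcdef") for x in range(32)])
    # Dump the raw bytes straight to disk - note the 'wb' mode,
    # since get_bytes() returns bytes rather than a string
    with open(tmp_name, 'wb') as fp:
        fp.write(csv_object.get_bytes())
    test_db(tmp_name)
    os.unlink(tmp_name)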
It was part of another routine to verify the contents of the uploaded file. I just cut and pasted it for speed.
In the real world I never like saving unverified data to anywhere.