1

I am attempting to copy a csv (which has a header and quote character ") with python 3.6 to a table on a remote postgres 10 server. It is a large CSV (2.5M rows, 800MB) and while I previously imported it into a dataframe and then used dataframe.to_sql, this was very memory intensive so I switched to using COPY.

Using COPY with psycopg2 or sqlalchemy would work fine but the remote server does not have access to the local file system.

Using psql in the terminal I have successfully run the query below to populate the table. I don't think using \copy is possible with psycopg2 or sqlalchemy.

\copy table (col1, col2) FROM file_location CSV HEADER QUOTE '"' NULL ''

However when I try to use a one line psql -c command like below, it does not work and I get the error:

ERROR: COPY quote must be a single one-byte character.

psql -U user -h ip -d db -w pw -c "\copy table (col1, col2) FROM file_location CSV HEADER QUOTE '"' NULL ''"

Could you tell me why this is the case?

This one-line -c psql statement would be easier to implement with the subprocess module in python than having to open a terminal and execute a command which I'm not sure how to do. If you could suggest a workaround or different methodology that would be great.

====== Per Andrew's suggestion to escape the quote character this worked on the command line. However when implementing it in python like below, a new error comes up:

/bin/sh: -c: line 0: unexpected EOF while looking for matching `''

/bin/sh: -c: line 1: syntax error: unexpected end of file

"\"\copy table (col1, col2) FROM file_location CSV HEADER QUOTE '\"' NULL ''\""
cmd = f'psql -U {user} -h {ip} -d {db} -w {pw} -c {copy_statement}'
subprocess.call(cmd, shell=True)
eamon1234
  • 1,555
  • 3
  • 19
  • 38
  • do you have any network connection between the two computers? – brddawg Oct 16 '17 at 13:27
  • The situation arose as I'm converting a local codebase to use docker. Using docker compose there's two containers (python3.6 conda and postgis). I could stick with the original COPY if you could suggest how to configure docker-compose? Is it as simple as opening up a port? The error when using COPY with docker says 'file does not exist' (in relation to the postgis server container). – eamon1234 Oct 16 '17 at 13:36
  • you could use a bat file to run execute a terminal command using the method that was successful – brddawg Oct 16 '17 at 13:51
  • Thanks for the suggestion. I'm on a mac, but yea doing a .sh might be the way to go, or I could just add that step to the Dockerfile for the python container – eamon1234 Oct 16 '17 at 13:59
  • 1
    Is it a shell escaping issue? try `'\"'` for the `QUOTE` argument. – Andrew Gelnar Oct 16 '17 at 14:45
  • this might help: https://stackoverflow.com/questions/6930242/error-copy-delimiter-must-be-a-single-one-byte-character it suggests reformatting the source file. If you needed a format after that, post-processing should work on the psql – brddawg Oct 16 '17 at 14:46
  • Thanks Andrew, that worked in the terminal, see my edit to the OP above for the problem in Python now. – eamon1234 Oct 16 '17 at 15:15

1 Answers1

2

Try not to use shell=True if you can avoid it. better to tokenize the command yourself to help sh.

subprocess.call(["psql", "-U", "{user}", "-h", "{ip}", "-d", "{db}", "-w", "{pw}", "-c", "{copy statement}"])

In this case then your copy statement could be as it is passed to psql verbatim, because there are no shell quoting issues to take into account. (N.B. still have to quote this for python, so the string would remain as is).


If you still want to use shell=True then you have to escape the string literal for both python and shell

"\"\copy table (col1, col2) FROM file_location CSV HEADER QUOTE '\\\"' NULL ''\""

will create a string in python which will be

"\copy table (col1, col2) FROM file_location CSV HEADER QUOTE '\"' NULL ''\"

Which is what we found out we needed on our shell in the first place!


Edit (clarifying something from the comments):

subprocess.call, when not using shell=True, takes an iterable of arguments.

So you could have

psql_command = "\"\copy table (col1, col2) FROM file_location CSV HEADER QUOTE '\\\"' NULL ''\""
# user, hostname, password, dbname all defined elsewhere above.
command = ["psql",
    "-U", user,
    "-h", hostname,
    "-d", dbname,
    "-w", password,
    "-c", psql_command,
]

subprocess.call(command)

See https://docs.python.org/2/library/subprocess.html#subprocess.call or https://docs.python.org/3/library/subprocess.html#subprocess.call

extra edit :- Please note that to avoid shell injection, you should be using the method described here. See the warning section of https://docs.python.org/2/library/subprocess.html#frequently-used-arguments

Andrew Gelnar
  • 633
  • 4
  • 14
  • By tokenize it myself you mean hand-building the query in the subprocess.call? So no variables, just strings? In that case, your solution worked however it means I have to hard code the variables in the strings which was not what I was aiming for. – eamon1234 Oct 16 '17 at 15:31
  • @user578582 - you can still use variables - `subprocess.call` takes a list of tokens as its first arguments, so you can use any arbitrary array of strings. I'll update the answer with an example. – Andrew Gelnar Oct 16 '17 at 15:39