2

The following command works well:

$ psql -c "copy (select * from foo limit 3) to stdout csv header"

# output
column1,column2
val1,val2
val3,val4
val5,val6

However, the following does not:

$ psql -c "copy (select * from foo limit 3) to program 'gzip -f --stdout' csv header"

# output
COPY 3

Why do I get COPY 3 as the output of this command? I would expect the output to be the compressed CSV string, after passing through gzip.

The command below works, for instance:

$ psql -c "copy (select * from foo limit 3) to stdout csv header" | gzip -f -c

# output (this garbage is just the compressed string and is as expected)
߉T`M�A �0 ᆬ}6�BL�I+�^E�gv�ijAp���qH�1����� FfВ�,Д���}������+��

How can I write a single SQL command that pipes the result directly into gzip and sends the compressed output to STDOUT?

Jivan
  • gzip'ing the traffic is the job of a network layer, not the job of SQL statements. It used to be available for postgres through OpenSSL, until compression was booted out of SSL for [security reasons](https://en.wikipedia.org/wiki/CRIME). You can still use compressing VPNs but probably not if the remote is a public cloud service. – Daniel Vérité Mar 19 '21 at 14:33
  • `gzip'ing the traffic is the job of a network layer, not the job of SQL statements` -> this is arbitrary. The SQL commands in question are perfectly capable of outputting a gzip-compressed CSV file; I see no fundamental reason why this would be "their job" any more than compressing an STDOUT output would be. – Jivan Mar 19 '21 at 15:14
  • 1
    An SQL client is not like `wget`. It's not meant to stream unstructured contents until the server closes the connection. COPY uses a row-by-row protocol where the client needs to interpret the bytes it receives. That's not possible if they're gzipped and the protocol doesn't know about it. – Daniel Vérité Mar 19 '21 at 17:22
  • @DanielVérité I see, makes sense. thanks for the explanation – Jivan Mar 19 '21 at 18:08

4 Answers

3

When you use COPY ... TO PROGRAM, the PostgreSQL server process (backend) starts a new process and pipes the COPY data to that process's standard input. The standard output of that process is lost. It only makes sense to use COPY ... TO PROGRAM if the called program writes the data to a file or something similar.
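
For illustration, the kind of invocation where TO PROGRAM does make sense looks like this (the path is only an example; it must be writable by the server's operating system user, and COPY ... TO PROGRAM requires superuser rights or membership in pg_execute_server_program):

COPY (SELECT * FROM foo LIMIT 3) TO PROGRAM 'gzip > /tmp/foo.csv.gz' CSV HEADER;

The compressed file ends up on the database server, not on the client.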

If your goal is to compress the data that go across the network, you could use sslmode=require sslcompression=on in your connect string to use the SSL network compression feature I built into PostgreSQL 9.2. Unfortunately this has been deprecated and most OpenSSL binaries are shipped with the feature disabled.
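
Assuming, purely as an example, a host db.example.com and a database mydb, and OpenSSL builds on both ends that still allow compression, the connect string form looks roughly like this:

$ psql "host=db.example.com dbname=mydb sslmode=require sslcompression=on" -c "copy (select * from foo limit 3) to stdout csv header"

The client still receives plain CSV; only the bytes on the wire are compressed.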

There is currently a native network compression patch under development, but it is questionable whether that will make v14.

Other than that, you cannot get what you want at the moment.

Laurenz Albe
  • "There is no way to achieve what you want" + the fact that PostgreSQL doesn't offer built-in gzip compression unless going through specific lib install (which makes it unavailable on AWS RDS) is annoying. I don't understand why such a basic requirement (gzip the output of a postgresql command) has not been taken care of. – Jivan Mar 19 '21 at 12:52
  • Redirecting `COPY ... TO STDOUT` into a file and then compressing that file would all happen in the client, therefore it wouldn't avoid sending all the uncompressed data over the network. – Jivan Mar 19 '21 at 12:58
  • I have yet to see a PostgreSQL installation without zlib, and I doubt that Amazon does that. To explain the matter in more detail, I have added more information to the answer. – Laurenz Albe Mar 19 '21 at 15:49
  • Interesting. Thanks for sharing this. As a matter of understanding, is there any reason why either adding a `compression` option to `copy to stdout`, or adding an option to retrieve the output piped through a program in the case of `copy to program 'gzip -f -c'`, has never been considered? – Jivan Mar 19 '21 at 16:12
  • I guess it has never occurred to anybody to do something as weird as calling a program on the database server, piping data into it, reading its standard output and relaying the result back to the client. – Laurenz Albe Mar 19 '21 at 16:30
  • good point... I believe the broader "compression" topic was thought to be solved with OpenSSL, and then this happened... which caught people off-guard – Jivan Mar 19 '21 at 16:36
2

copy is running gzip on the server and not forwarding the STDOUT from gzip on to the client.

You can use \copy instead, which would run gzip on the client:

psql -q -c "\copy (select * from foo limit 3) to program 'gzip -f --stdout' csv header"

This is fundamentally the same as piping to gzip, which you show in your question.
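
Since gzip inherits psql's standard output, you can redirect the compressed stream into a local file (the file name is just an example):

$ psql -q -c "\copy (select * from foo limit 3) to program 'gzip -f --stdout' csv header" > foo.csv.gz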

Jeremy
  • Thanks. When I do this, there is no output at all, which makes me believe nothing is sent to STDOUT. I tried with `-o 'hello.csv.gz'` but it again just sends "COPY 3" to this file. – Jivan Mar 19 '21 at 12:57
  • actually I was wrong, I just forgot the `\` in `\copy` — it seems to work with it – Jivan Mar 19 '21 at 13:16
  • ah but you say `it would run gzip on the client` — why is it so? is there absolutely no way to run that on the server and capture the output? – Jivan Mar 19 '21 at 13:46
  • 1
    No, you can't run gzip on the server and capture the output unless you save the output to a file and copy that file. What is the use case? Are you trying to reduce network out? – Jeremy Mar 19 '21 at 13:55
  • I am indeed trying to reduce network out. The use case is a query executed on Amazon RDS PostgreSQL which returns a massive load (that we can't make smaller apart from gzip'ing it). Because the query targets RDS, we can't access a file created by the server's psql. – Jivan Mar 19 '21 at 14:03
2

If the goal is to compress the output of copy so it transfers faster over the network, then...

psql "postgresql://ip:port/dbname?sslmode=require&sslcompression=1"

It should display "compression active" if it's enabled. That probably requires some server config variable to be enabled though.

Or you can simply use ssh:

ssh user@dbserver "psql -c \"copy (select * from foo limit 3) to stdout csv header\" | gzip -f -c" >localfile.csv.gz

But... of course, you need ssh access to the db server.

If you don't have ssh access to the db server, maybe you have ssh access to another box in the same datacenter with a fast network link to the db server; in that case you can ssh to that box instead of the db server. Data will be transferred uncompressed between that box and the database, compressed on the box, and piped via ssh to your local machine. That even saves CPU on the database server, since it won't be doing the compression.
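
For example, with a hypothetical jumphost near the database and placeholder names for the database host and database:

$ ssh user@jumphost "psql -h dbserver -d mydb -c \"copy (select * from foo limit 3) to stdout csv header\" | gzip -f -c" > localfile.csv.gz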

If that doesn't work, well then, why not put the ssh command into the "to program" and have the server send the data via ssh to your machine? You'll have to set up your router and open a port, but it can be done. Of course you'll have to find a way to put the password on the ssh command line, which is usually a big no-no, but maybe just this once. Or just use netcat instead, which doesn't require a password.

Also, if you want speed, please, use zstd instead of gzip.

Here's an example with netcat. I just tested it and it worked.

On the destination machine, which is 192.168.0.1:

nc -lp 65001 | zstd -d >file.csv

In another terminal:

psql -c "copy (select * from foo) to program 'zstd -9 |nc -N 192.168.0.1 65001' csv header" test

Note the -N option for netcat: it shuts down the socket once it reaches EOF on its input, so the receiving side knows when the transfer is finished.

bobflux
-1

You can use copy to PROGRAM:

COPY foo_table TO PROGRAM 'gzip > /tmp/foo_table.csv.gz' DELIMITER ',' CSV HEADER;
mshabou
  • I want the compressed output to be sent to STDOUT, not to a file – Jivan Mar 19 '21 at 11:47
  • if you need to do this from psql, you can use \copy, which is a psql command, different from copy, which is an SQL command – mshabou Mar 19 '21 at 12:07
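
For completeness, a sketch of that client-side variant: with \copy the program runs on the machine where psql runs, so the gzipped file (name is just an example) is created locally rather than on the server:

$ psql -c "\copy foo_table to program 'gzip > foo_table.csv.gz' csv header"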