
I have a PostgreSQL database dump produced by pg_dump version 9.5.2, which contains DDL statements as well as INSERT INTO statements for each table in the given database. The dump looks like this:

SET statement_timeout = 0;
SET lock_timeout = 0;
SET client_encoding = 'UTF8';

CREATE TABLE unimportant_table (
    id integer NOT NULL,
    col1 character varying
);

CREATE TABLE important_table (
    id integer NOT NULL,
    col2 character varying NOT NULL,
    unimportant_col character varying NOT NULL
);

INSERT INTO unimportant_table VALUES (123456, 'some data split into
-  multiple
-  lines
just for fun');

INSERT INTO important_table VALUES (987654321, 'some important data', 'another crap split into

-  lines');

...
-- thousands of inserts into both tables

The dump file is really large and it is produced by another company, so I am not able to influence the export process. I need to create 2 files from this dump:

  1. All DDL statements (all statements that don't start with INSERT INTO)
  2. All INSERT INTO important_table statements (I want to restore only some tables from the dump)

If all statements were on a single line, with no newline characters inside the data, it would be very easy to create the 2 SQL scripts with grep, for example:

grep -v '^INSERT INTO .*;$' my_dump.sql > ddl.sql
grep -o '^INSERT INTO important_table .*;$' my_dump.sql > important_table.sql
# Create empty structures
psql < ddl.sql
# Import only one table for now
psql < important_table.sql

At first I thought about using grep, but I could not find out how to process multiple lines at once. Then I tried sed, but it returns only single-line inserts. I also used https://regex101.com/ to work out the right regular expressions, but I don't know how to combine them with grep or sed:

^(?!(INSERT INTO)).*$                              -- for ddl
^INSERT INTO important_table(\s|[[:alnum:]])*;$    -- for inserts

I found the similar question pcregrep multiline SQL match, but it has no answer. I don't mind whether the solution uses grep, sed, or whatever you suggest, but it should work on Ubuntu 18.04.4 LTS.
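
For illustration, this is the kind of pcregrep call I was hoping would work for the inserts (with -M the pattern can match across lines, but it stops at the first semicolon, so it breaks as soon as the data itself contains one):

pcregrep -M '^INSERT INTO important_table[^;]*;' my_dump.sql > important_table.sql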

32cupo

1 Answer

Here is a bash-based solution that uses perl one-liners to prepare your SQL dump data for the subsequent grep statements.
In my approach, the goal is to get each SQL statement onto a single line, through a script that I called prepare.sh. It got a little more complicated because I wanted to accommodate semicolons and quotes within your insert data strings (these, along with the line breaks, are represented by their hex codes in the intermediate output):

EDIT: In response to @32cupo's comment, below is a modified set of scripts that avoids feeding the whole stream to xargs at once with large data sets (although I don't have huge dump files to test it with):

#!/bin/bash
# Protect each statement-terminating semicolon with a placeholder,
# encode backslashes, line breaks and quotes as hex escape sequences,
# then turn the placeholders back into semicolons followed by real newlines.
perl -pe 's/;(?=\s*$)/__ENDOFSTATEMENT__/g' \
| perl -pe 's/\\/\\\\x5c/g' \
| perl -pe 's/\n/\\\\x0a/g' \
| perl -pe 's/"/\\\\x22/g' \
| perl -pe 's/'\''/\\\\x27/g' \
| perl -pe 's/__ENDOFSTATEMENT__/;\n/g'
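
To illustrate the intermediate format: after prepare.sh, the multi-line insert into unimportant_table from the dump below comes out roughly as this single line (shown for orientation; the hex escapes carry the encoded newlines and quotes):

INSERT INTO unimportant_table VALUES (123456, \\x27some data split into\\x0a-  multiple\\x0a-  lines\\x0a;just for fun\\x27);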

Then, a separate script (called ddl.sh) includes your grep statement for the DDL (and, with the help of the loop, only feeds smaller chunks (lines) into xargs):

#!/bin/bash
# Keep only collapsed statements that do not start with INSERT INTO,
# then decode the hex escapes one statement at a time.
while read -r line; do
    <<<"$line" xargs -I{} echo -e "{}"
done < <(grep -viE '^(\\\\x0a)*insert into')
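
Reading the grep output line by line means each xargs invocation handles only a single (already collapsed) statement, rather than the whole dump at once.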

Another separate script (called important_table.sh) includes your grep statement for the inserts into important-table:

#!/bin/bash
# Keep only the collapsed INSERT INTO important_table statements,
# then decode the hex escapes one statement at a time.
while read -r line; do
    <<<"$line" xargs -I{} echo -e "{}"
done < <(grep -iE '^(\\\\x0a)*insert into important_table')
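
If a single collapsed statement ever exceeds the xargs argument limit as well, the decoding can be done without xargs entirely. Here is a sketch of such a variant for the important_table case (untested against a real large dump; it substitutes a perl one-liner for the xargs/echo -e step, decoding the \\x5c escape last so that restored backslashes are not re-interpreted):

#!/bin/bash
# Hypothetical xargs-free variant of important_table.sh:
# grep as before, then decode the hex escapes with perl.
grep -iE '^(\\\\x0a)*insert into important_table' \
| perl -pe 's/\\\\x0a/\n/g; s/\\\\x22/"/g; s/\\\\x27/\x27/g; s/\\\\x5c/\\/g'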

Here is the set of scripts in action (please also note that I spiced up your insert data with some semicolons and quotes):

~/$ cat dump.sql
SET statement_timeout = 0;
SET lock_timeout = 0;
SET client_encoding = 'UTF8';

CREATE TABLE unimportant_table (
    id integer NOT NULL,
    col1 character varying
);

CREATE TABLE important_table (
    id integer NOT NULL,
    col2 character varying NOT NULL,
    unimportant_col character varying NOT NULL
);

INSERT INTO unimportant_table VALUES (123456, 'some data split into
-  multiple
-  lines
;just for fun');

INSERT INTO important_table VALUES (987654321, 'some important ";data"', 'another crap split into
-  lines;');

...
-- thousands of inserts into both tables
~/$ cat dump.sql | ./prepare.sh | ./ddl.sh >ddl.sql
~/$ cat ddl.sql
SET statement_timeout = 0;

SET lock_timeout = 0;

SET client_encoding = 'UTF8';


CREATE TABLE unimportant_table (
    id integer NOT NULL,
    col1 character varying
);


CREATE TABLE important_table (
    id integer NOT NULL,
    col2 character varying NOT NULL,
    unimportant_col character varying NOT NULL
);


...
-- thousands of inserts into both tables

~/$ cat dump.sql | ./prepare.sh | ./important_table.sh > important_table.sql
~/$ cat important_table.sql


INSERT INTO important_table VALUES (987654321, 'some important ";data"', 'another crap split into
-  lines;');
lab9
  • Thank you for your answer. When I tried your code on your example it worked, and also when I executed it on a truncated dump, but when I tried it on my dumps (9GB, 16GB), it created only empty files. I tried to run only `cat large-dump.sql | ./prepare.sh` and it outputs nothing. It seems that I am using a VM with not enough memory for my dumps, so I will try it with a larger VM and let you know. – 32cupo Aug 05 '20 at 11:50
  • After resizing the VM, it is failing with `xargs: argument line too long`. Can you suggest any change to avoid xargs? – 32cupo Aug 05 '20 at 14:53
  • hello @32cupo, I see. 9GB - 16GB is overwhelming xargs. I will check and see if there are different ways. – lab9 Aug 06 '20 at 11:18
  • @32cupo, I've edited the answer to provide a modified set of scripts. – lab9 Aug 06 '20 at 12:02