
I have to parse large CSVs (approx. 1 GB), map the headers to the database columns, and format every row. E.g. the CSV has "Gender": "Male", but my database only accepts enum('M', 'F', 'U').
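
Concretely, the per-row mapping step looks roughly like this (the header map, column names, and helper names here are only illustrative):

```js
// Sketch of the per-row cleanup: headerMap comes from the user's mapping form,
// normalizers coerce free-text values into what the schema accepts.
// All names (headerMap, normalizers, cleanRow) are illustrative.
const headerMap = { Gender: 'gender', 'First Name': 'first_name' };

const normalizers = {
  gender: (value) => {
    const v = String(value || '').trim().toLowerCase();
    if (v === 'male' || v === 'm') return 'M';
    if (v === 'female' || v === 'f') return 'F';
    return 'U'; // anything unrecognised maps to the enum's "unknown"
  },
};

function cleanRow(rawRow) {
  const row = {};
  for (const [csvHeader, dbColumn] of Object.entries(headerMap)) {
    const value = rawRow[csvHeader];
    row[dbColumn] = normalizers[dbColumn] ? normalizers[dbColumn](value) : value;
  }
  return row;
}
```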

Since the files are so large, I have to use Node streams to transform the file and then use LOAD DATA INFILE to upload it all at once.
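
Roughly, my current pipeline looks like this (csv-parse is assumed for the parsing step, cleanRow is the helper sketched above, and the paths, table, and column names are made up):

```js
const fs = require('fs');
const { Transform } = require('stream');
const { parse } = require('csv-parse'); // any streaming CSV parser would do
const mysql = require('mysql');

// Re-serialise each cleaned row as a CSV line for LOAD DATA INFILE.
const clean = new Transform({
  objectMode: true,
  transform(record, _enc, done) {
    done(null, Object.values(cleanRow(record)).join(',') + '\n');
  },
});

fs.createReadStream('input.csv')
  .pipe(parse({ columns: true }))            // objects keyed by the CSV header
  .pipe(clean)
  .pipe(fs.createWriteStream('/tmp/cleaned.csv'))
  .on('finish', () => {
    // LOCAL INFILE has to be enabled on both the server and the client.
    const conn = mysql.createConnection({ /* host, user, password, database */ });
    conn.query(
      "LOAD DATA LOCAL INFILE '/tmp/cleaned.csv' INTO TABLE people " +
      "FIELDS TERMINATED BY ',' (gender, first_name)",
      (err) => {
        if (err) throw err;
        conn.end();
      }
    );
  });
```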

I would like granular control over the inserts, which LOAD DATA INFILE doesn't provide: if a single line has incorrect data, the whole upload fails. I am currently using mysqljs, which doesn't provide an API to check whether the pool has reached its queueLimit, so I can't pause the stream reliably.
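
What I want, conceptually, is per-row INSERTs where a bad row is logged and skipped instead of aborting the whole load, with the parser paused while MySQL works so the pool's queue never grows. A sketch, under the same assumptions as above (csv-parse, cleanRow, illustrative table and column names):

```js
const fs = require('fs');
const { parse } = require('csv-parse');
const mysql = require('mysql');

const pool = mysql.createPool({ /* host, user, password, database */ connectionLimit: 10 });
const parser = fs.createReadStream('input.csv').pipe(parse({ columns: true }));
const rejected = [];

parser.on('data', (record) => {
  parser.pause();                                        // stop reading ahead
  pool.query('INSERT INTO people SET ?', cleanRow(record), (err) => {
    if (err) rejected.push({ record, code: err.code });  // keep going on bad rows
    parser.resume();                                     // pull the next row
  });
});

parser.on('end', () => {
  pool.end(() => console.log('done,', rejected.length, 'rows rejected'));
});
```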

I am wondering if I can use Apache Kafka or Spark to stream the insert instructions so they are applied to the database sequentially. I have skimmed through the docs and read some tutorials, but none of them show how to connect either of them to the database; they are mostly consumer/producer examples.
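
The closest thing I can picture for the Kafka route is producing the cleaned rows to a topic and having a small consumer drain it into MySQL at whatever pace the database allows. kafkajs and the topic/table names below are my own assumptions, not something from the docs:

```js
const { Kafka } = require('kafkajs');
const mysql = require('mysql');

const kafka = new Kafka({ clientId: 'csv-loader', brokers: ['localhost:9092'] });
const consumer = kafka.consumer({ groupId: 'csv-to-mysql' });
const pool = mysql.createPool({ /* host, user, password, database */ });

async function run() {
  await consumer.connect();
  await consumer.subscribe({ topic: 'cleaned-rows' });
  await consumer.run({
    // eachMessage is awaited one message at a time per partition, so rows
    // reach MySQL sequentially and the topic absorbs any backlog.
    eachMessage: async ({ message }) => {
      const row = JSON.parse(message.value.toString());
      try {
        await new Promise((resolve, reject) =>
          pool.query('INSERT INTO people SET ?', row, (err) => (err ? reject(err) : resolve()))
        );
      } catch (err) {
        console.error('rejected row', err.code); // skip bad rows instead of failing the load
      }
    },
  });
}

run().catch(console.error);
```

(Kafka Connect's JDBC sink, mentioned in the comments, apparently covers the same ground through configuration alone.)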

I know there are multiple ways of solving this problem, but I am very interested in a way to seamlessly integrate streams with databases. If streams can work with I/O, why not with databases? I am pretty sure big companies don't use LOAD DATA INFILE or repeatedly push chunks of data into an array and insert them into the database.
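
To make "seamlessly integrate streams with databases" concrete, this is the kind of thing I mean: a Writable whose write callback only fires once MySQL has accepted the row, so the ordinary pipe/backpressure machinery throttles the file read and I never have to inspect the pool's queue. A sketch only (highWaterMark is arbitrary; pool and cleanRow are the ones from the sketches above):

```js
const fs = require('fs');
const { Transform, Writable } = require('stream');
const { parse } = require('csv-parse');

// Writable sink: done() is only called after MySQL acknowledges the row, so
// .pipe() pauses the upstream parser whenever more than highWaterMark rows
// are waiting. No need to look at the pool's internal queue at all.
const mysqlSink = new Writable({
  objectMode: true,
  highWaterMark: 16,
  write(row, _enc, done) {
    pool.query('INSERT INTO people SET ?', row, (err) => {
      // Passing err to done() would abort the whole pipe; logging and moving
      // on keeps the load running (the granular control I'm after).
      if (err) console.error('rejected row', err.code);
      done();
    });
  },
});

fs.createReadStream('input.csv')
  .pipe(parse({ columns: true }))
  .pipe(new Transform({
    objectMode: true,
    transform(record, _enc, cb) { cb(null, cleanRow(record)); },
  }))
  .pipe(mysqlSink)
  .on('finish', () => pool.end(() => console.log('load complete')));
```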

Yasin Yaqoobi
  • Why aren't you just reading the stream and running the INSERT statements in sequence? – Evert Mar 08 '18 at 05:42
  • Kafka Connect shows examples of reading/writing from/to JDBC connections – OneCricketeer Mar 08 '18 at 07:16
  • Can you explain your technology choices/constraints? If you have a CSV file, and you want to get it into a database, there are simpler ways than using node, or indeed than using Kafka. – Robin Moffatt Mar 08 '18 at 09:39
  • @Evert I can't do inserts in sequence because of back pressure. It crashes the program after a while. – Yasin Yaqoobi Mar 08 '18 at 14:45
  • @RobinMoffatt The problem is that I have to parse and clean/transform every row. To make things worse, these rows are not always in the same order, so I built a form where the user maps the CSV header to the right column. I am just curious about streams and database operations. How do big companies do it? I am pretty sure they don't use LOAD DATA INFILE or repeatedly populate an array with chunks of data and insert it. – Yasin Yaqoobi Mar 08 '18 at 14:51
  • I have some doubt whether parallelizing to the same mysql table will actually benefit you. The inserts still have to happen in order, so even if you parallelize, mysql will put them in sequence. If sequential inserts in mysql are your bottleneck, have you considered trying to optimize that? – Evert Mar 08 '18 at 18:51
  • @Evert the problem is that the read operation is too fast for mysql. Even if I optimize inserts it won't be fast enough and will build back-pressure causing the app to crash. – Yasin Yaqoobi Mar 08 '18 at 18:56
  • @YasinYaqoobi well, the solution sounds pretty simple then. Don't continue reading the file. Read the file as you are doing inserts and don't read ahead. – Evert Mar 08 '18 at 20:14

0 Answers