I have to parse large CSVs (approx. 1 GB), map the headers to the database columns, and reformat every row. For example, the CSV has a "Gender" column with values like "Male", but my database column only accepts enum('M', 'F', 'U').
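For context, the per-row mapping itself is simple. Here is a minimal sketch of the kind of transform I mean (assuming the csv-parse and csv-stringify packages; the file names, column names, and mapping values are made up):

```js
const fs = require('fs');
const { Transform } = require('stream');
const { parse } = require('csv-parse');
const { stringify } = require('csv-stringify');

// Hypothetical mapping from the CSV's values to the enum the table accepts.
const GENDER_MAP = { Male: 'M', Female: 'F' };

// Object-mode transform: rewrites each parsed row so it matches the schema.
const mapRow = new Transform({
  objectMode: true,
  transform(row, _encoding, callback) {
    row.Gender = GENDER_MAP[row.Gender] || 'U'; // unknown values fall back to 'U'
    callback(null, row);
  },
});

fs.createReadStream('input.csv')
  .pipe(parse({ columns: true }))      // rows become objects keyed by header
  .pipe(mapRow)                        // reformat every row
  .pipe(stringify({ header: false }))  // back to CSV text
  .pipe(fs.createWriteStream('cleaned.csv'));
```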
Since the files are so large, I have to use Node streams to transform the file and then use LOAD DATA INFILE to upload it all at once.
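The bulk-load step is then a single statement against the cleaned file. This is roughly what I'm doing now (a sketch, assuming LOCAL INFILE is enabled on both the client and the server; the table and file names are placeholders):

```js
const mysql = require('mysql');

const connection = mysql.createConnection({
  host: 'localhost',
  user: 'user',
  password: 'secret',
  database: 'mydb',
});

// One statement loads the whole cleaned file; there is no per-row hook,
// so a single bad line can abort the entire load.
connection.query(
  "LOAD DATA LOCAL INFILE 'cleaned.csv' INTO TABLE people " +
  "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'",
  (err, result) => {
    if (err) throw err;
    console.log('rows loaded:', result.affectedRows);
    connection.end();
  }
);
```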
I would like granular control over the inserts, which LOAD DATA INFILE doesn't provide: if a single line has incorrect data, the whole upload fails. I am currently using mysqljs, which doesn't expose an API to check whether the pool has reached its queueLimit, so I can't pause the stream reliably.
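To illustrate the kind of granular control I mean, here is a sketch of per-row inserts with hand-rolled backpressure: I count in-flight queries myself and pause/resume the parser around that count, because the pool's internal queue isn't observable (the threshold, table, and connection details are made up):

```js
const fs = require('fs');
const mysql = require('mysql');
const { parse } = require('csv-parse');

const pool = mysql.createPool({
  connectionLimit: 10,
  queueLimit: 50, // mysqljs accepts this option but gives no way to ask how full the queue is
  host: 'localhost',
  user: 'user',
  password: 'secret',
  database: 'mydb',
});

const parser = fs.createReadStream('input.csv').pipe(parse({ columns: true }));

// Workaround: track in-flight queries manually and use that as backpressure.
let inFlight = 0;
const HIGH_WATER = 50; // made-up threshold

parser.on('data', (row) => {
  inFlight++;
  if (inFlight >= HIGH_WATER) parser.pause();

  pool.query('INSERT INTO people SET ?', row, (err) => {
    if (err) console.error('bad row skipped:', err.message); // one bad row doesn't kill the run
    inFlight--;
    if (inFlight < HIGH_WATER) parser.resume();
  });
});

parser.on('end', () => {
  // naive drain: wait for outstanding queries before closing the pool
  const timer = setInterval(() => {
    if (inFlight === 0) { clearInterval(timer); pool.end(); }
  }, 100);
});
```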
I am wondering whether I can use Apache Kafka or Spark to stream the insert instructions and have them applied to the database sequentially. I have skimmed the docs and read some tutorials, but none of them show how to connect these tools to the database; they are mostly consumer/producer examples.
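What I imagine on the Kafka side is a consumer that applies each message as one insert, something like the sketch below (assuming the kafkajs and mysql2 packages, which I haven't verified against my setup; the broker, topic, and table names are made up):

```js
const { Kafka } = require('kafkajs');
const mysql = require('mysql2/promise');

const kafka = new Kafka({ clientId: 'csv-loader', brokers: ['localhost:9092'] });
const consumer = kafka.consumer({ groupId: 'row-writers' });

async function run() {
  const db = await mysql.createConnection({
    host: 'localhost',
    user: 'user',
    password: 'secret',
    database: 'mydb',
  });

  await consumer.connect();
  await consumer.subscribe({ topic: 'csv-rows', fromBeginning: true });

  await consumer.run({
    eachMessage: async ({ message }) => {
      if (!message.value) return;
      // Each message is one already-formatted row, produced by the CSV transform.
      const row = JSON.parse(message.value.toString());
      try {
        await db.query('INSERT INTO people SET ?', row);
      } catch (err) {
        console.error('bad row skipped:', err.message); // keep going on a single failure
      }
      // Offsets are committed after eachMessage resolves, so rows within a
      // partition are applied in order.
    },
  });
}

run().catch(console.error);
```

Is this a reasonable way to wire a consumer to the database, or is there a more idiomatic pattern?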
I know there are multiple ways of solving this problem, but I am very interested in a way to seamlessly integrate streams with databases. If streams can work with I/O, why not databases? I am pretty sure big companies don't use LOAD DATA INFILE or repeatedly push chunks of data into an array and then insert them into the database.