
I am working in Node.js. We open a read stream from a CSV, load that data into chunks, process each line of those chunks with a regex and compare it against another data file, then pass the result into a write stream that writes to a new file.

The issue is that the second read stream (called beta), which reads the comparison file, sometimes takes longer to run than the first read stream (called alpha). When that happens, not all of the comparison data is ready and readable, which results in null values. I believe the right approach is to hold alpha's execution so it doesn't run until beta has fired readStream.on('end'), but I cannot get that to work: in all my attempts with promises, await, and while loops, the program either freezes completely or doesn't wait for beta at all and ends execution before beta is even done.

The only solution I have found is to move beta directly into the main code and hardcode alpha inside beta's readStream.on('end') handler. However, because there are multiple permutations of how beta will run based on the data we're using, that means repeating alpha inside each instance of beta in the main code, locked in switch or if statements, and I do not like that at all.
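For reference, the promise-based attempt was shaped roughly like this (simplified; waitForEnd is a placeholder name, not the real code):

function waitForEnd(stream) {
    //Resolve once the stream emits 'end', reject if it errors out
    return new Promise((resolve, reject) => {
        stream.on('end', resolve);
        stream.on('error', reject);
    });
}

The idea was to await this on beta's stream before kicking off alpha, but in practice it never held execution the way I expected.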

This has been a massive problem for a few days and I am at the end of my rope. Nothing I have tried has worked.

To note: the secondary script that runs beta is referenced by a require statement, and its variables and functions are returned via module.exports. I tried putting the entire beta stream into a function on module.exports and waiting for a return of true, or for a flag value to change, or calling alpha from that function by passing it into the secondary script and returning it; in every case, execution still never waits for beta to finish before it takes off and does its own thing.
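Concretely, the exported-function version looked roughly like this (simplified, with placeholder names such as loadRefData):

In postValid.js:

const fs = require("fs");
const csv = require("csv-parser");

//Read the entire reference file, resolving only once 'end' has fired
module.exports.loadRefData = function (country) {
    return new Promise((resolve, reject) => {
        fs.createReadStream(country + " ADDRESS REF DATA.csv")
            .pipe(csv({headers: false, separator: '\t'}))
            .on('data', (data) => {
                //Each line of reference data gets processed and stored here
            })
            .on('end', () => resolve(true))
            .on('error', reject);
    });
};

And in the main script:

const postValid = require("./postValid.js");

async function main() {
    //Beta must fully finish before alpha (ThisFunc) is allowed to start
    await postValid.loadRefData(country);
    ThisFunc();
}
main();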

I'm sorry for the long text

This is the beta file stream as it is currently implemented. I would prefer it be contained in postValid.js, since that is where the data from this stream gets used, and there are going to be multiple permutations of postValid based on the data.

const fs = require("fs");
const csv = require("csv-parser");
const postValid = require("./postValid.js");

postValid.country = country;
fs.createReadStream(postValid.country + " ADDRESS REF DATA.csv")
    .pipe(csv({e: null, headers: false, separator: '\t'}))
    //Indicate start of reading
    .on('resume', () => console.log("Reading complete postal code file..."))
    //Pass data to be buffered and chunked to the processing script
    .on('data', (data) => {
        //Each line of data gets processed as needed and stored here
    })
    .on('end', () => {
        postValid.complete = true;
        console.log("Done reading");
        ThisFunc();
    });

This is the alpha file stream as it currently stands. It's in the file I want, but I have to lock it inside a function to make this work, and I would rather it simply be the first thing to run for future possible data sets.

function ThisFunc() {
    //Do a quick parse of the CSV to get a row count
    let initLen = 0;
    fs.createReadStream(fileName)
        .pipe(csv())
        .on('resume', () => {
            console.log("Getting file length...");
            postValid.complete = false;
        })
        .on('data', () => initLen++)
        .on('end', () => {
            console.log("The csv is " + initLen + " lines long, beginning processing");

            //Chunk length is 1/100 of the file length, rounded up
            const chunkLen = Math.ceil(initLen / 100);

            //Read the CSV file stream
            fs.createReadStream(fileName)
                .pipe(csv())
                //Indicate start of reading
                .on('resume', () => {
                    console.log("Loading...");
                })
                //Pass data to be buffered and chunked to the processing script
                .on('data', (data) => {
                    //Lines of data get passed into the processing script here
                })
                //End of reading
                .on('end', () => {
                    //Read the final chunk of lines that don't fit and process them
                    console.log("100%");

                    //Print results (results, finalAdd, and postCodes come from the processing code omitted here)
                    console.log("File Length: " + results.length);
                    console.log("Processed: " + finalAdd.length);
                    console.log("Found Code: " + postCodes);
                });
        });
}

I removed a bit of the code to keep it legal; this should contain all the important bits. The files can be anything from several million lines to just a few thousand.

  • Can you show us the code you have so far and describe where the problem in that code occurs? It's a whole lot easier for us to offer a correction to your code rather than write a whole example from scratch. Also, how big are these files? – jfriend00 Mar 24 '20 at 15:17
  • I added the code in, modified for legal reasons – TheBasementNerd Mar 24 '20 at 16:42
  • If you really want to read the entire file before reading the other, why are you using streams at all? – Bergi Mar 24 '20 at 18:07
  • Which csv module are you using? Please provide a link to the NPM or Github page for it. I need to see what events it emits. Also, I'm with Bergi in that I don't understand why you're using streams at all if you want to entirely read one file before reading the other. – jfriend00 Mar 24 '20 at 18:39
  • Also, your code shows three streams, but your text talks about comparing one stream to another. I don't understand why there are three streams and I don't see any code attempting to do comparisons so can't really tell from this code what the objective of the code is. – jfriend00 Mar 24 '20 at 18:40
  • @Bergi That's a fair point, I was using a stream to read the first one so I figured it was best to use for this as well, but since I'm holding the data anyways it might be best to just use readline or something. I'm not sure if that'll work with this csv-parser I'm using though, which is for reading the csv extremely fast and parsing out bits I need – TheBasementNerd Mar 24 '20 at 18:42
  • @jfriend00 The first stream you see in the alpha code is used to get the length of the file, so that it can be broken down into chunks for processing. The beta stream stores the file in a buffer, which is used later in the code in the actual comparison function; the streams are not directly compared because I have to assemble the data in a way that the file doesn't have. Also, the npm package is csv-parser – TheBasementNerd Mar 24 '20 at 18:43
  • Well, there isn't enough info here about what you're actually trying to accomplish (the end result you're trying to achieve or where the problem occurs in this code) for me to understand how to help. This is just the shell of creating three streams and doesn't show any of the code that is trying to actually accomplish something. – jfriend00 Mar 24 '20 at 18:50
  • I am trying to take lines of data (the alpha stream), process them (different and irrelevant file), then validate them compared to data from another file (the beta stream reads that file), and then write them out (different and irrelevant file). I am not putting more details in because I don't know how confidential the project is intended to be, I just need help trying to make the alpha stream only run when the beta stream is finished, but without hardcoding alpha into the end of beta as it currently is, which requires them to be in the same file. When in separate files, alpha finishes first – TheBasementNerd Mar 24 '20 at 18:53
  • I did say in the post where my problem occurs, what is occurring, and what I need to solve. I need alpha to run after beta, but with them in separate files – TheBasementNerd Mar 24 '20 at 18:56
  • Well, you get a completion event from the beta stream. When that event occurs, you can call some function in another module and do whatever you want with the alpha stream in that function. This is just event driven programming. When this event occurs, call this function. – jfriend00 Mar 24 '20 at 19:10
  • I tried that, again as stated in the post. The alpha stream kept firing before it finished – TheBasementNerd Mar 24 '20 at 19:13
