
I am using Alpakka and Akka to process a CSV file. Since I have a bunch of CSV files that have to be added to the same stream, I would like to add a field that contains information from the file name or request. Currently I have something like this:

val source = FileIO.fromPath(Paths.get("10002070.csv"))
  .via(CsvParsing.lineScanner())

This streams a sequence of Lists (lines) of ByteStrings (fields). The goal would be something like:

val filename = "10002070.csv"
val source = FileIO.fromPath(Paths.get(filename))
    .via(CsvParsing.lineScanner())
    .via(AddCSVFieldHere(filename))

Creating a structure similar to:

10002070.csv,max,estimated,12,1,0

Where the filename is a field non-existent in the original source.

I think it does not look very pretty to inject values mid-stream. Also, eventually I would like to determine the file names passed to the parser in a stream stage that reads a directory.

What is the correct/canonical way to pass values through stream stages for later re-use?

Jeffrey Chung
Falk Schuetzenmeister
  • I'd use fan-out, with one stream for the content and another for the file name, and zip them whenever necessary: https://doc.akka.io/docs/akka/2.5.3/scala/stream/stream-graphs.html – Emiliano Martinez Mar 26 '18 at 19:36
  • Yeah, that is a solution I am thinking of. The problem I am running into is that after fanning out, the streams have different numbers of elements (1 vs. the number of lines). How can I make sure that the stream from the directory reading gets advanced only once the file stream comes to an end (the files have different sizes)? Maybe mergePreferred would do that, but I am a little bit afraid of a race condition. – Falk Schuetzenmeister Mar 26 '18 at 19:41
  • Are you trying to create one stream for the whole directory? Maybe it's better to spawn one stream for each file. I think that could be better; otherwise you would need some kind of join process, if I have understood well. – Emiliano Martinez Mar 26 '18 at 19:55

1 Answer


You could transform the stream with map to add the file name to each List[ByteString]:

val fileName = "10002070.csv"
val source =
  FileIO.fromPath(Paths.get(fileName))
    .via(CsvParsing.lineScanner())
    .map(List(ByteString(fileName)) ++ _)

For example:

Source.single(ByteString("""header1,header2,header3
                           |1,2,3
                           |4,5,6""".stripMargin))
  .via(CsvParsing.lineScanner())
  .map(List(ByteString("myfile.csv")) ++ _)
  .runForeach(row => println(row.map(_.utf8String)))

// The above code prints the following:
// List(myfile.csv, header1, header2, header3)
// List(myfile.csv, 1, 2, 3)
// List(myfile.csv, 4, 5, 6)

The same approach is applicable in the more general case in which you don't know the file names upfront. If you want to read all the files in a directory (assuming that all of these files are csv files), concatenate the files into a single stream, and preserve the file name in each stream element, then you could do so with Alpakka's Directory utility in the following manner:

val source =
  Directory.ls(Paths.get("/my/dir")) // Source[Path, NotUsed]
    .flatMapConcat { path =>
       FileIO.fromPath(path)
         .via(CsvParsing.lineScanner())
         .map(List(ByteString(path.getFileName.toString)) ++ _)
    }
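The `map` stage in both snippets does nothing Akka-specific per element: it simply prepends one field to each parsed row. A minimal plain-Scala sketch of that per-row transformation (using `String` instead of `ByteString`, so it runs without the Akka/Alpakka dependencies; the object and method names are illustrative, not from any library):

```scala
// Minimal model of the `map` stage above, without Akka/Alpakka:
// each parsed CSV row is a List of fields, and tagging a row
// just prepends the file name as an extra field.
object TagRows {
  def addFileNameField(fileName: String)(row: List[String]): List[String] =
    fileName :: row

  def main(args: Array[String]): Unit = {
    val rows = List(
      List("header1", "header2", "header3"),
      List("1", "2", "3"),
      List("4", "5", "6")
    )
    val tagged = rows.map(addFileNameField("myfile.csv"))
    tagged.foreach(println)
    // List(myfile.csv, header1, header2, header3)
    // List(myfile.csv, 1, 2, 3)
    // List(myfile.csv, 4, 5, 6)
  }
}
```

Because the tagging logic is a pure function on `List[String]`, it can be unit-tested on its own and then dropped into the stream's `map` unchanged.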
Jeffrey Chung