I have two CSV files, both sorted by a numeric key:

File 1: numbers, sorted (~1 GB)
File 2: numbers plus extra data, sorted (~20 GB)

I need to look up every number from file 1 in file 2 and do some processing on the matching records (numbers in file 2 that are not present in file 1 are skipped).
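For example, with made-up data like this:

File 1:      File 2:
100          50,a
120          100,b
             110,c
             120,d

only the rows 100,b and 120,d from file 2 would be processed.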
So far I have:
import java.nio.file.Paths

import cats.effect.{Blocker, ContextShift, ExitCode, IO, IOApp, Sync}
import fs2.{Pipe, Stream, text}
import fs2.io.file.readAll

object MainQueue extends IOApp {

  override def run(args: List[String]): IO[ExitCode] =
    program[IO].compile.drain.as(ExitCode.Success)

  def program[F[_]: Sync: ContextShift]: Stream[F, Unit] =
    for {
      number <- numberStream
      record <- records
                  .through(parser)
                  .through(findRecord(number))
      _ <- Stream.emit(println(s"$number <-> $record")) // debug output
    } yield ()

  // Drops records until one is >= the number we are looking for,
  // then keeps just that one; `head` halts the rest of the stream.
  def findRecord[F[_]](phone: Long): Pipe[F, Long, Long] =
    _.dropWhile { r =>
      println(s"Reading $r") // debug output
      r < phone
    }.head

  // Stand-in for the ~1 GB file of sorted numbers.
  def numberStream[F[_]]: Stream[F, Long] =
    Stream(100L, 120L)

  // TODO: make the stream continue instead of halting and restarting.
  def records[F[_]: Sync: ContextShift]: Stream[F, String] =
    Stream
      .resource(Blocker[F])
      .flatMap { blocker =>
        readAll[F](Paths.get("small.csv"), blocker, 4096)
      }
      .through(text.utf8Decode)
      .through(text.lines)

  def parser[F[_]]: Pipe[F, String, Long] = ??? // TODO: parse a CSV line into its numeric key

  def writer[F[_]]: Pipe[F, Long, Unit] =
    _.map(v => println(s"Found: $v"))
}
Which prints:
Reading 50
Reading 100
100 <-> 100
Reading 50
Reading 100
Reading 120
120 <-> 120
Which means the second stream restarts from the beginning for each value in file 1. How do I keep the position of the last read and continue from there? The numbers are sorted, so there is no point in starting over every time. I am super new to Scala and fs2, so an explanation of what I am misunderstanding would be much appreciated.
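From what I have read about fs2 so far, I suspect the answer involves a single merge-style pass written with Pull, so the record stream is only pulled through once. Below is a minimal sketch of what I think that looks like, assuming both streams emit their keys in ascending order; `matching` is just a name I made up, not an fs2 API:

def matching[F[_]](keys: Stream[F, Long], recs: Stream[F, Long]): Stream[F, Long] = {
  def go(keys: Stream[F, Long], recs: Stream[F, Long]): Pull[F, Long, Unit] =
    keys.pull.uncons1.flatMap {
      case None => Pull.done // no more lookup keys
      case Some((k, keysRest)) =>
        // Skip records below the current key, then look at the first candidate.
        recs.dropWhile(_ < k).pull.uncons1.flatMap {
          case None => Pull.done // records exhausted
          case Some((r, recsRest)) =>
            val emit = if (r == k) Pull.output1(r) else Pull.pure(())
            // Push the candidate back so the next key can still match it.
            emit >> go(keysRest, recsRest.cons1(r))
        }
    }
  go(keys, recs).stream
}

The idea would be to replace the for-comprehension with something like matching(numberStream, records.through(parser)), but I am not sure this is the idiomatic way, hence the question.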
Thanks!