
I have two (or more) parallel text files stored in S3, i.e. line 1 in the first file corresponds to line 1 in the second file, and so on. I want to read these files as columns of a single dask dataframe. What would be the best/easiest/fastest way to do this?

PS. I can read each file into a separate dataframe, but then I cannot join them on the index, because the index values appear to be neither unique nor monotonic. The correspondence between lines is defined purely by their position in each file.
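
For concreteness, a minimal sketch of the setup being described, with made-up bucket and file names, and assuming the lines contain no delimiter characters that would confuse the CSV parser:

```python
import dask.dataframe as dd

# Made-up paths; each file holds one logical column, one value per line.
left = dd.read_csv("s3://my-bucket/col_a.txt", header=None, names=["col_a"])
right = dd.read_csv("s3://my-bucket/col_b.txt", header=None, names=["col_b"])

# The natural attempt: align the two frames on their index.
# This does not pair rows by line position, because each partition
# gets its own index starting at 0 and the divisions are unknown.
combined = left.join(right)
```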

evilkonrex

1 Answer

Unfortunately dask.dataframe splits large files into partitions by byte ranges, not by line number. It is fairly hard to seek to a particular line in a large file without first reading through everything that precedes it.

MRocklin
  • Right, would it be possible to generate a global monotonic (or at least unique) index? I suppose I could do it manually by using map_partitions() and combining local (partition-internal) index value with partition number. I was wondering whether something similar is already available in the framework. – evilkonrex Oct 18 '17 at 16:30
  • You would probably have to cook up something manually. I don't know of any existing code to do this. – MRocklin Oct 18 '17 at 16:35
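
For later readers, here is a minimal sketch of the kind of manual fix discussed in the comments, with made-up S3 paths and assuming plain-text files with one record per line. Instead of combining the partition number with the partition-local index, it uses a cumulative sum over a column of ones to give every row its global line number, then makes that the index so the two frames can be joined by position. This is just one way to cook it up, not a built-in Dask recipe.

```python
import dask.bag as db
import dask.dataframe as dd


def read_lines(path, column):
    """Read a plain-text file into a one-column dask dataframe."""
    lines = db.read_text(path)  # one bag element per line
    records = lines.map(lambda line: {column: line.rstrip("\n")})
    return records.to_dataframe(meta={column: object})


def add_line_numbers(df):
    """Attach a global, monotonically increasing line number and use it
    as the index, so row position becomes an explicit join key."""
    df = df.assign(line=1)
    df["line"] = df["line"].cumsum() - 1  # cumsum runs across all partitions
    return df.set_index("line", sorted=True)


left = add_line_numbers(read_lines("s3://my-bucket/col_a.txt", "col_a"))
right = add_line_numbers(read_lines("s3://my-bucket/col_b.txt", "col_b"))

# Both frames now share the same monotonic line-number index,
# so joining on the index pairs rows by their original position.
combined = left.join(right)
```

As far as I understand, set_index(..., sorted=True) avoids a full shuffle because the new index is already in order, though computing the divisions still requires a pass over the data.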