
I am trying to ingest some files, and each one is read as a single string column (expected, since they are fixed-width files). I have to split that single value into separate columns, which means I must transform the dataframe, but because it is a streaming dataframe I have to go through writeStream. This is an example of the input:

"64 Apple 32.32128Orange12.1932 Banana 2.45"

Expected dataframe:

64, Apple, 32.32
128, Orange, 12.19
32, Banana, 2.45

Notice that every column has a fixed width (3, 6, and 5 characters); these widths are what META_SIZES holds. Therefore each row is 14 characters long (the sum of the column widths).
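To make the layout concrete, here is a plain-Python sketch (outside Spark) of slicing the example string by those widths; the names `META_SIZES` and `chars_per_row` follow the question's code below:

```python
META_SIZES = [3, 6, 5]            # fixed column widths: count, fruit, price
chars_per_row = sum(META_SIZES)   # 14 characters per row

data = "64 Apple 32.32128Orange12.1932 Banana 2.45"

rows = []
for i in range(len(data) // chars_per_row):
    row, offset = [], i * chars_per_row
    for size in META_SIZES:
        row.append(data[offset : offset + size].strip())
        offset += size
    rows.append(row)

# rows == [['64', 'Apple', '32.32'], ['128', 'Orange', '12.19'], ['32', 'Banana', '2.45']]
```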

I tried using foreach as in the following example, but it does not produce any output:

two_d = []

streamingDF = (
    spark.readStream.format("cloudFiles")
        .option("encoding", sourceEncoding)
        .option("badRecordsPath", badRecordsPath)
        .options(**cloudfiles_config)
        .load(sourceBasePath)
)

def process_row(row):
    string = row.value              # foreach passes a Row object, not a plain string
    rows = len(string) // chars_per_row
    for i in range(rows):
        current_index = 0
        two_d.append([])
        for j in range(len(META_SIZES)):
            start = i * chars_per_row + current_index
            two_d[i].append(string[start : start + META_SIZES[j]].strip())
            current_index += META_SIZES[j]
        print(two_d[i])

query = streamingDF.writeStream.foreach(process_row).start()

I will probably use withColumn to add the columns instead of building the list, or, if it is possible and better, turn that list into a streaming dataframe.
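One way to get the parsed values back into a DataFrame (rather than a driver-side list, which foreach on the executors will never populate) would be to package the slicing as a plain function and apply it per micro-batch with foreachBatch. The sketch below shows only the pure-Python part; the foreachBatch wiring in the trailing comment is a hypothetical, untested outline:

```python
META_SIZES = [3, 6, 5]

def parse_fixed_width(text, sizes=META_SIZES):
    """Split a concatenated fixed-width string into row tuples."""
    row_len = sum(sizes)
    out = []
    # step through the string one full row at a time
    for start in range(0, len(text) - len(text) % row_len, row_len):
        fields, offset = [], start
        for size in sizes:
            fields.append(text[offset : offset + size].strip())
            offset += size
        out.append(tuple(fields))
    return out

# Hypothetical wiring inside the streaming job (not runnable here):
# def handle_batch(batch_df, batch_id):
#     parsed = [t for r in batch_df.collect() for t in parse_fixed_width(r.value)]
#     spark.createDataFrame(parsed, ["Count", "Fruit", "Price"]).write.mode("append")...
# query = streamingDF.writeStream.foreachBatch(handle_batch).start()
```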

Edit: I added an input example and explained META_SIZES

1 Answer


Assuming the inputs are something like the following:

...
"64 Apple 32.32"
"128 Orange 12.19"
"32 Banana 2.45"
...

You can do this.

from pyspark.sql import functions

stream_lines = (
    spark.readStream.format("cloudFiles")
        .option("encoding", sourceEncoding)
        .option("badRecordsPath", badRecordsPath)
        .options(**cloudfiles_config)
        .load(sourceBasePath)
)

# remove this line if strings are already utf-8
lines = stream_lines.select(stream_lines['value'].cast('string'))

lengths = (lines.withColumn('Count', functions.split(lines['value'], ' ').getItem(0))
                .withColumn('Fruit', functions.split(lines['value'], ' ').getItem(1))
                .withColumn('Price', functions.split(lines['value'], ' ').getItem(2)))

Note that "value" is the default column name when reading strings with readStream. If cloudfiles_config contains anything that changes the name of the input column, you will have to adjust the column name in the code above.

Dharman
BoomBoxBoy
    Unfortunately there is no delimiter. The input is more like: ..."64 Apple 32.32128Orange12.1932 Banana 2.45"... Notice the lack of line breaks, and that every "column" is made up of the same number of characters; each row likewise has a fixed length, the sum of the column widths. – Iñaki Zabaleta Nov 15 '21 at 08:37
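Since there is no delimiter, split won't work here. A fixed-width alternative is pyspark.sql.functions.substring, which takes a 1-based start position and a length. The snippet below computes those (start, length) pairs from the widths in plain Python; the Spark calls in the trailing comment are a hedged sketch that assumes each 'value' holds one 14-character row:

```python
META_SIZES = [3, 6, 5]

# Spark SQL's substring() uses 1-based start positions
positions = []
start = 1
for size in META_SIZES:
    positions.append((start, size))
    start += size

# positions == [(1, 3), (4, 6), (10, 5)]

# Hypothetical use on the streaming DataFrame (not runnable here):
# from pyspark.sql import functions as F
# names = ["Count", "Fruit", "Price"]
# for name, (pos, size) in zip(names, positions):
#     lines = lines.withColumn(name, F.trim(F.substring("value", pos, size)))
```

If a single 'value' instead holds many concatenated rows, the string would first have to be broken into 14-character chunks (e.g. with the parsing function shown in the question's edit) before these per-column substrings apply.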