
I am trying to ingest some files, and each one is read as a single string column (expected, since they are fixed-width files). I have to split that single value into separate columns, which means I must transform the dataframe, but because it is a streaming dataframe I have to go through writeStream. This is an example of the input:

"64 Apple 32.32128Orange12.1932 Banana 2.45"

Expected dataframe:

64, Apple, 32.32
128, Orange, 12.19
32, Banana, 2.45

Notice that every column has a fixed width (3, 6, and 5 characters); these widths are what META_SIZES holds. Therefore each row is 14 characters long (the sum of the column widths).
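To make the layout concrete, here is a plain-Python sketch (outside Spark) of slicing the example string by those widths; the names `META_SIZES` and `chars_per_row` follow the question's code below:

```python
META_SIZES = [3, 6, 5]            # fixed column widths: count, fruit, price
chars_per_row = sum(META_SIZES)   # 14 characters per row

data = "64 Apple 32.32128Orange12.1932 Banana 2.45"

rows = []
for i in range(len(data) // chars_per_row):
    row, offset = [], i * chars_per_row
    for size in META_SIZES:
        row.append(data[offset : offset + size].strip())
        offset += size
    rows.append(row)

# rows == [['64', 'Apple', '32.32'], ['128', 'Orange', '12.19'], ['32', 'Banana', '2.45']]
```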

I tried using foreach as in the following example, but it does not produce any output:

two_d = []

streamingDF = (
    spark.readStream.format("cloudFiles")
        .option("encoding", sourceEncoding)
        .option("badRecordsPath", badRecordsPath)
        .options(**cloudfiles_config)
        .load(sourceBasePath)
)

def process_row(row):
    string = row.value              # foreach passes a Row object, not a plain string
    rows = len(string) // chars_per_row
    for i in range(rows):
        current_index = 0
        two_d.append([])
        for j in range(len(META_SIZES)):
            start = i * chars_per_row + current_index
            two_d[i].append(string[start : start + META_SIZES[j]].strip())
            current_index += META_SIZES[j]
        print(two_d[i])

query = streamingDF.writeStream.foreach(process_row).start()

I will probably use withColumn to add the columns instead of building the list, or, if it is possible and better, turn that list into a streaming dataframe.
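One way to get the parsed values back into a DataFrame (rather than a driver-side list, which foreach on the executors will never populate) would be to package the slicing as a plain function and apply it per micro-batch with foreachBatch. The sketch below shows only the pure-Python part; the foreachBatch wiring in the trailing comment is a hypothetical, untested outline:

```python
META_SIZES = [3, 6, 5]

def parse_fixed_width(text, sizes=META_SIZES):
    """Split a concatenated fixed-width string into row tuples."""
    row_len = sum(sizes)
    out = []
    # step through the string one full row at a time
    for start in range(0, len(text) - len(text) % row_len, row_len):
        fields, offset = [], start
        for size in sizes:
            fields.append(text[offset : offset + size].strip())
            offset += size
        out.append(tuple(fields))
    return out

# Hypothetical wiring inside the streaming job (not runnable here):
# def handle_batch(batch_df, batch_id):
#     parsed = [t for r in batch_df.collect() for t in parse_fixed_width(r.value)]
#     spark.createDataFrame(parsed, ["Count", "Fruit", "Price"]).write.mode("append")...
# query = streamingDF.writeStream.foreachBatch(handle_batch).start()
```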

Edit: I added an input example and explained META_SIZES

1 Answer


Assuming the inputs are something like the following:

...
"64 Apple 32.32"
"128 Orange 12.19"
"32 Banana 2.45"
...

You can do this.

from pyspark.sql import functions

stream_lines = (
    spark.readStream.format("cloudFiles")
        .option("encoding", sourceEncoding)
        .option("badRecordsPath", badRecordsPath)
        .options(**cloudfiles_config)
        .load(sourceBasePath)
)

# remove this line if strings are already utf-8
lines = stream_lines.select(stream_lines['value'].cast('string'))

lengths = (lines.withColumn('Count', functions.split(lines['value'], ' ').getItem(0))
                .withColumn('Fruit', functions.split(lines['value'], ' ').getItem(1))
                .withColumn('Price', functions.split(lines['value'], ' ').getItem(2)))

Note that "value" is the default column name when reading strings with readStream. If cloudfiles_config contains anything that changes the name of the input column, you will have to adjust the column name in the code above.

Dharman
BoomBoxBoy
    Unfortunately there is no delimiter. The input is more like: ..."64 Apple 32.32128Orange12.1932 Banana 2.45"... Notice the lack of line breaks, and that every "column" is made up of the same number of characters; each row likewise has a fixed length, the sum of the column widths. – Iñaki Zabaleta Nov 15 '21 at 08:37
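Since there is no delimiter, split won't work here. A fixed-width alternative is pyspark.sql.functions.substring, which takes a 1-based start position and a length. The snippet below computes those (start, length) pairs from the widths in plain Python; the Spark calls in the trailing comment are a hedged sketch that assumes each 'value' holds one 14-character row:

```python
META_SIZES = [3, 6, 5]

# Spark SQL's substring() uses 1-based start positions
positions = []
start = 1
for size in META_SIZES:
    positions.append((start, size))
    start += size

# positions == [(1, 3), (4, 6), (10, 5)]

# Hypothetical use on the streaming DataFrame (not runnable here):
# from pyspark.sql import functions as F
# names = ["Count", "Fruit", "Price"]
# for name, (pos, size) in zip(names, positions):
#     lines = lines.withColumn(name, F.trim(F.substring("value", pos, size)))
```

If a single 'value' instead holds many concatenated rows, the string would first have to be broken into 14-character chunks (e.g. with the parsing function shown in the question's edit) before these per-column substrings apply.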