I am trying to ingest some files, and each one is read as a single-column string (which is expected, since they are fixed-width files). I have to split that single value into separate columns, which means I need to access the dataframe's contents, but because it is a streaming dataframe I have to go through writeStream. This is an example of the input:
"64 Apple 32.32128Orange12.1932 Banana 2.45"
Expected dataframe:
64, Apple, 32.32
128, Orange, 12.19
32, Banana, 2.45
Notice how every column has a fixed width (3, 6, and 5 characters); these widths are what META_SIZES holds. Each row is therefore 14 characters long (the sum of the column widths).
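For reference, META_SIZES and chars_per_row are defined roughly like this:

META_SIZES = [3, 6, 5]           # fixed width of each column
chars_per_row = sum(META_SIZES)  # 14 characters per logical row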
I tried using foreach as in the following example, but it does not seem to do anything:
two_d = []
streamingDF = (
    spark.readStream.format("cloudFiles")
    .option("encoding", sourceEncoding)
    .option("badRecordsPath", badRecordsPath)
    .options(**cloudfiles_config)
    .load(sourceBasePath)
)

def process_row(string):
    rows = round(len(string) / chars_per_row)
    for i in range(rows):
        current_index = 0
        two_d.append([])
        for j in range(len(META_SIZES)):
            two_d[i].append(string[(i * chars_per_row + current_index) : (i * chars_per_row + current_index + META_SIZES[j])].strip())
            current_index += META_SIZES[j]
        print(two_d[i])
query = streamingDF.writeStream.foreach(process_row).start()
I will probably use withColumn to add the columns instead of building the list, or turn that list into a streaming dataframe if that is possible and works better. A rough sketch of what I have in mind is below.
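This is only a sketch of the withColumn idea, assuming the Auto Loader source is plain text and the raw file content arrives in a column called value (rename it if your input column is named differently); chunk_rows, parsed, and the col_0/col_1/col_2 names are just placeholders I made up here:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

# Split the raw file content into 14-character logical rows.
@F.udf(ArrayType(StringType()))
def chunk_rows(value):
    return [value[i:i + chars_per_row]
            for i in range(0, len(value), chars_per_row)]

parsed = streamingDF.withColumn("row", F.explode(chunk_rows(F.col("value"))))

# Carve each fixed-width row into trimmed columns with substring (1-based positions).
start = 1
for idx, size in enumerate(META_SIZES):
    parsed = parsed.withColumn(f"col_{idx}", F.trim(F.substring("row", start, size)))
    start += size

query = (
    parsed.drop("value", "row")
    .writeStream
    .format("memory")          # placeholder sink just to inspect the result
    .queryName("parsed_rows")
    .start()
)

The memory sink and query name are only there so I can look at the output; the real job would write to my actual sink.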
Edit: I added an input example and explained META_SIZES