Cleaning up text files for spark dataframe

Asked Mar 09 '23 at 16:27

Active Mar 09 '23 at 17:23

Viewed 82 times

I want to create a Spark DataFrame by reading some text files. However, the text files have some unusual formatting. This is one example of the text file:

These are the problems I am facing:

In the first few lines, there are some headers which consist of 3 lines (e.g. the Student Identification Number takes 3 lines).
Each student's data consists of 2 lines, where the Code and Transfer columns are not on the same line.
There are another few lines of headers in the middle of the file (after No. 00000025), which should be omitted.
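
The three problems above could be handled in one pre-processing pass before the data reaches Spark. The following is only a minimal sketch in plain Python, not a tested solution for the actual file: the names `RECORD_START` and `merge_records` are hypothetical, and the rule that a student's first line begins with an 8-digit number is an assumption guessed from the value 00000025 mentioned above. Every line that does not start a record (the multi-line headers at the top and in the middle) is dropped, and each record line is joined with the line that follows it.

```python
import re
from typing import Iterable, Iterator

# Assumption: the first line of each student record begins with an
# 8-digit running number such as 00000025. Adjust to the real layout.
RECORD_START = re.compile(r"^\s*\d{8}\b")


def merge_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield one merged text line per two-line student record."""
    it = iter(lines)
    for line in it:
        if not RECORD_START.match(line):
            # Header, separator or blank line: skip it, wherever it occurs.
            continue
        # Assumption: the very next physical line holds the remaining
        # columns (e.g. Code and Transfer) of the same student.
        continuation = next(it, "")
        yield line.rstrip() + " " + continuation.strip()
```

Because the two lines of a record must stay together, one possible way to apply this in PySpark is to read each file whole with `spark.sparkContext.wholeTextFiles(path)` and call `flatMap(lambda kv: merge_records(kv[1].splitlines()))` on the result, then split each merged line into columns according to the real column layout before building the DataFrame.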

I expect to read the text file into a Spark DataFrame like this:

edited Mar 09 '23 at 17:23

Abdennacer Lachiheb

asked Mar 09 '23 at 16:27

Mash

Take a look at this page, please - https://medium.com/@11amitvishwas/how-to-handle-bad-records-corrupt-records-in-apache-spark-392f2991cbb5 – Urmat Zhenaliev Mar 09 '23 at 17:03

0 Answers