
I want to create a Spark DataFrame by reading some text files. However, the text files have unusual formatting. Here is an example of one of the files:

[image: sample text file]

These are the problems I am facing:

  1. The first few lines contain headers that span 3 lines (e.g. the Student Identification Number header takes 3 lines).

  2. Each student's data consists of 2 lines, where the Code and Transfer columns are not on the same line.

  3. A few more header lines appear in the middle of the file (after No. 00000025) and should be omitted.

I expect to read the text file into a Spark DataFrame like this:

[image: expected DataFrame]
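One possible way to approach the three problems above is to pre-process the raw lines in plain Python before handing them to Spark. The sketch below is only an illustration under stated assumptions: the actual file layout is not reproduced here, so the rule that a student record starts with an 8-digit number (such as 00000025) is a guess, and the names `RECORD_START` and `parse_records` are hypothetical. It also assumes every record's first line is directly followed by its continuation line.

```python
import re

# Assumption: the first line of each student record begins with an 8-digit
# number such as "00000025". Adjust this pattern to the real file layout.
RECORD_START = re.compile(r"^\s*\d{8}\b")


def parse_records(lines):
    """Pair each record's first line with its continuation line.

    Lines that neither start a record nor directly follow a record's first
    line (top headers, repeated mid-file headers) are skipped.
    """
    rows = []
    pending = None  # first line of a record still waiting for its second line
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if RECORD_START.match(line):
            if pending is not None:
                # Previous record had no continuation line; keep it anyway.
                rows.append((pending, ""))
            pending = line
        elif pending is not None:
            # Continuation line: completes the pending record.
            rows.append((pending, line))
            pending = None
        # Otherwise: a header line, so it is dropped.
    if pending is not None:
        rows.append((pending, ""))
    return rows
```

Each returned tuple holds the two raw lines of one student, so the individual columns still have to be cut out (for example by fixed-width slicing). Assuming an existing SparkSession called `spark`, the result could then be loaded with something like `spark.createDataFrame(rows, ["line1", "line2"])`.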

Abdennacer Lachiheb
Mash

Take a look at this page, please - https://medium.com/@11amitvishwas/how-to-handle-bad-records-corrupt-records-in-apache-spark-392f2991cbb5 – Urmat Zhenaliev Mar 09 '23 at 17:03
