The problem is pretty weird. If I work with the uncompressed file, there is no issue. But if I work with the bz2-compressed file, I get an index out of bounds error.
From what I've read, the spark-csv parser apparently fails to detect the end-of-line character and reads the whole file as one huge line. What puzzles me is that it works on the uncompressed .csv but not on the .csv.bz2 file.
Also, like I said, it only happens when doing a DataFrame union. I tried an RDD union through the Spark context instead and got the same error.
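
One way I could test the end-of-line theory outside Spark is to read both files with plain Python and compare how many fields each row has. This is only a rough sketch: `field_count_histogram` is a helper name I made up, and it assumes a comma delimiter and UTF-8 text, which may not match the real data.

```python
import bz2
import csv
from collections import Counter


def field_count_histogram(path, delimiter=","):
    """Return a Counter mapping 'number of fields' -> 'number of rows'.

    Hypothetical helper: assumes UTF-8 text and a single-character
    delimiter. Files ending in .bz2 are decompressed on the fly.
    """
    opener = bz2.open if path.endswith(".bz2") else open
    # newline="" lets the csv module handle line terminators itself.
    with opener(path, mode="rt", newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        return Counter(len(row) for row in reader)
```

If the histogram for the .csv.bz2 file equals the one for the uncompressed .csv, the line terminators survive compression and the problem is more likely in how the file is being read. If the compressed file shows a single row with a huge field count, that would support the "one huge line" explanation.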