
I'm using pyspark to read and process some data from local .plt files. Here is what the file looks like:

Geolife trajectory
WGS 84
Altitude is in Feet
Reserved 3
0,2,255,My Track,0,0,2,8421376
0
39.984094,116.319236,0,492,39744.2451967593,2008-10-23,05:53:05
39.984198,116.319322,0,492,39744.2452083333,2008-10-23,05:53:06
39.984224,116.319402,0,492,39744.2452662037,2008-10-23,05:53:11
39.984211,116.319389,0,492,39744.2453240741,2008-10-23,05:53:16
......

As shown above, I'm not interested in the first 6 rows; what I want are the rows starting from the 7th row. So I want to use a Spark session to read this file from the 7th row onward. Here is the code I've tried, but it failed:

from pyspark.sql import SparkSession
session = SparkSession.builder.appName('file reader').master('local[*]').getOrCreate()
df = session.read.\
     option('delimiter', ',').\
     option('header', 'false').\
     csv('test.plt')
df.show()

Could somebody give me some advice? Thank you for your attention.

YQ.Wang
  • Take a look at this question: `https://stackoverflow.com/questions/44077404/how-to-skip-lines-while-reading-a-csv-file-as-a-dataframe-using-pyspark` – chrymxbrwn Jan 24 '19 at 09:09
  • @chrymxbrwn Unfortunately, that question uses the Spark context rather than a Spark session (a sketch adapting it to a session follows these comments). – YQ.Wang Jan 24 '19 at 09:14
  • You should exploit some special property of a column to filter out the top 6 rows. That would be the most efficient way to do it. – cph_sto Jan 24 '19 at 09:43
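
For reference, here is a minimal sketch of the approach from the linked question, adapted to a SparkSession: the session still exposes the RDD API via `session.sparkContext`, and all columns come out as strings unless you cast them. The variable names are only illustrative.

rdd = session.sparkContext.textFile('test.plt')

# number every line, keep only lines with index >= 6 (i.e. from the 7th line on),
# then split the remaining lines on commas
rows = (rdd.zipWithIndex()
           .filter(lambda pair: pair[1] >= 6)
           .map(lambda pair: pair[0].split(',')))

df = rows.toDF(['a', 'b', 'c', 'd', 'e', 'f', 'g'])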

3 Answers

from pyspark.sql.types import *
from pyspark.sql import SparkSession
session = SparkSession.builder.appName('file reader').master('local[*]').getOrCreate()
schema = StructType([StructField("a", FloatType()),
                     StructField("b", FloatType()),
                     StructField("c", IntegerType()),
                     StructField("d", IntegerType()),
                     StructField("e", FloatType()),
                     StructField("f", StringType()),
                     StructField("g", StringType())])
# rows that do not match the schema (the first 6 header lines) are dropped
df = session.read.option('mode', 'DROPMALFORMED').csv('test.plt', schema)
Arnon Rotem-Gal-Oz
  • You just need to pass your custom schema like above and use it while reading the CSV, setting the reader mode to "DROPMALFORMED". You will not get these rows. – Subhasish Guha Jan 24 '19 at 12:15
  • @Arnon, just a quick remark - this `option` will work more than 99.99% of the time, but it's very strict. It's not rare in big data that there is an additional value, say at the 8th position, call it `h`. Then this record will be filtered out, though that is understandable. That said, `DROPMALFORMED` is a very handy option. – cph_sto Jan 24 '19 at 13:28
  • @cph_sto - if you want to retain the problematic lines and deal with them later, you can add another field to the schema (e.g. StructField("corrupt", StringType())) and then set another option instead of the mode one - option('columnNameOfCorruptRecord', 'corrupt') - in which case you get all the problematic records in this field and you can filter / try to handle them later (a sketch follows these comments). – Arnon Rotem-Gal-Oz Jan 24 '19 at 18:01
  • Yes, I did just that an hour back. Since all the malformed records are put out in the corrupt column, I could see which records were classified as malformed and why. Thanks for bringing this option to my notice. – cph_sto Jan 24 '19 at 18:08
  • you can always go to https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html to see the list of options available – Arnon Rotem-Gal-Oz Jan 24 '19 at 19:33
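
For reference, a minimal sketch of the corrupt-record option discussed in the comments above. The column name `corrupt` and the variable names are only illustrative; the schema is the one from the answer, extended with one string column.

from pyspark.sql.types import FloatType, IntegerType, StringType, StructType, StructField

# the schema from the answer above, plus a string column to capture malformed lines
schema_with_corrupt = StructType([StructField("a", FloatType()),
                                  StructField("b", FloatType()),
                                  StructField("c", IntegerType()),
                                  StructField("d", IntegerType()),
                                  StructField("e", FloatType()),
                                  StructField("f", StringType()),
                                  StructField("g", StringType()),
                                  StructField("corrupt", StringType())])

df = session.read.option('mode', 'PERMISSIVE')\
                 .option('columnNameOfCorruptRecord', 'corrupt')\
                 .csv('test.plt', schema_with_corrupt)

bad = df.where(df.corrupt.isNotNull())                # the 6 header lines land here
good = df.where(df.corrupt.isNull()).drop('corrupt')  # the trajectory rows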

In addition to the great method suggested by @Arnon Rotem-Gal-Oz, we can also exploit some special property of a column, if one is present.

In YQ.Wang's data, we can see that the 6th column is a date, and the chances are pretty negligible that the 6th column in the header will also be a date. So the idea is to check for this special property of the 6th column. to_date() converts a string to a date; if this column is not a date, then to_date() will return null, and we will filter out all such rows using the .where() clause -

from pyspark.sql.functions import to_date, col
from pyspark.sql.types import FloatType, StringType, StructType, StructField

# 'spark' is the active SparkSession (called 'session' in the question);
# 'schema' is the same StructType defined in the answer above
df = spark.read.schema(schema)\
               .format("csv")\
               .option("header", "false")\
               .option("sep", ',')\
               .load('test.plt')\
               .where(to_date(col('f'), 'yyyy-MM-dd').isNotNull())
df.show()
+---------+----------+----+---+---------+----------+--------+
|        a|         b|   c|  d|        e|         f|       g|
+---------+----------+----+---+---------+----------+--------+
|39.984093| 116.31924|   0|492|39744.246|2008-10-23|05:53:05|
|  39.9842| 116.31932|   0|492|39744.246|2008-10-23|05:53:06|
|39.984222|116.319405|   0|492|39744.246|2008-10-23|05:53:11|
| 39.98421| 116.31939|   0|492|39744.246|2008-10-23|05:53:16|
+---------+----------+----+---+---------+----------+--------+

There are downsides to this method too: for example, if the date is missing, the whole row gets filtered out.

cph_sto

Assuming that the data from the 7th line onwards follows the pattern you have shown:

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col

session = SparkSession.builder.appName('file reader').master('local[*]').getOrCreate()
data = session.read.text('test.plt')          # a single string column named 'value'

header = [row.value for row in data.head(6)]  # the first six rows

filtered = (data.filter(~col('value').isin(header))
                .withColumn("a", split(col("value"), ",").getItem(0))
                .withColumn("b", split(col("value"), ",").getItem(1))
                .withColumn("c", split(col("value"), ",").getItem(2))
                .withColumn("d", split(col("value"), ",").getItem(3))
                .withColumn("e", split(col("value"), ",").getItem(4))
                .withColumn("f", split(col("value"), ",").getItem(5))
                .withColumn("g", split(col("value"), ",").getItem(6))
                .drop("value"))
Dave Canton