I have a sequence file whose values look like
(string_value, json_value)
I don't care about the string value.
In Scala I can read the file by
val reader = sc.sequenceFile[String, String]("/path...")
val data = reader.map{case (x, y) => (y.toString)}
val jsondata = spark.read.json(data)
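For reference, a self-contained sketch of that Scala pipeline looks roughly like this (the session setup is only there to make the snippet complete, and I'm assuming Spark 2.x, where spark.read.json still accepts an RDD of JSON strings):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val sc = spark.sparkContext

// read the (key, value) pairs from the sequence file
val reader = sc.sequenceFile[String, String]("/path...")
// keep only the JSON string from each pair
val data = reader.map { case (x, y) => y.toString }
// parse the JSON strings into a DataFrame
val jsondata = spark.read.json(data)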
I am having a hard time converting this to PySpark. I have tried using
reader = sc.sequenceFile("/path", "org.apache.hadoop.io.Text", "org.apache.hadoop.io.Text")
data = reader.map(lambda x,y: str(y))
jsondata = spark.read.json(data)
The errors are cryptic, but I can provide them if that helps. My question is: what is the right syntax for reading these sequence files in PySpark 2?
I think I am not converting the tuple elements to strings correctly. I get similar errors if I do something simple like
m = sc.parallelize([(1, 2), (3, 4)])
m.map(lambda x,y: y.toString).collect()
or
m = sc.parallelize([(1, 2), (3, 4)])
m.map(lambda x,y: str(y)).collect()
Thanks!