
I have a DataFrame and I want to slice every value in one of its columns, but I don't know how to do this.

My DataFrame

+-------------+------+
|    studentID|gender|
+-------------+------+
|1901000200   |     M|
|1901000500   |     M|
|1901000500   |     M|
|1901000500   |     M|
|1901000500   |     M|
+-------------+------+

I have converted studentID to a string, but I am not able to remove the leading 190 from it. I want the output below.

+-------------+------+
|    studentID|gender|
+-------------+------+
|   1000200   |     M|
|   1000500   |     M|
|   1000500   |     M|
|   1000500   |     M|
|   1000500   |     M|
+-------------+------+

I tried the method below, but it gives me an error.

students_data = students_data.withColumn('studentID',F.lit(students_data["studentID"][2:]))

TypeError: startPos and length must be the same type. Got <class 'int'> and <class 'NoneType'>, respectively.
Subham
  • Yes, I did it the same way, but when I tried to convert `studentID` back to int, it gave me a weird negative integer value. – Subham Mar 09 '20 at 07:06

1 Answer

from pyspark.sql import functions as F

# replicating the sample data from the OP.
students_data = sqlContext.createDataFrame(
[[1901000200,'M'],
[1901000500,'M'],
[1901000500,'M'],
[1901000500,'M'],
[1901000500,'M']],
["studentID", "gender"])

# Unlike a Python list slice, a Column slice needs both values: col[a:b] is passed on as
# substr(startPos=a, length=b), where startPos is 1-based and b is a length, not an end index.
# If you aren't sure how long the values are, use an arbitrarily large length such as 10000.
students_data = students_data.withColumn(
  'studentID',
  F.col("studentID")[4:10000].cast("string"))

students_data.show()

Output:

+---------+------+
|studentID|gender|
+---------+------+
|  1000200|     M|
|  1000500|     M|
|  1000500|     M|
|  1000500|     M|
|  1000500|     M|
+---------+------+
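
To make the indexing rule concrete, here is a plain-Python sketch that needs no Spark. `spark_substr` is a hypothetical helper, not part of PySpark; it only mimics how I understand `Column.substr(startPos, length)` to behave for a positive `startPos`, which is also where a slice such as `[2:]` ends up (with `None` as the length).

```python
def spark_substr(value, start_pos, length):
    """Hypothetical model of Column.substr(startPos, length) for start_pos >= 1."""
    # A slice like col[2:] supplies None as the length, so the types differ.
    if type(start_pos) != type(length):
        raise TypeError("startPos and length must be the same type.")
    text = str(value)          # a numeric column is treated as a string here
    begin = start_pos - 1      # startPos is 1-based, Python indexes are 0-based
    return text[begin:begin + length]


print(spark_substr(1901000200, 4, 10000))  # 1000200
print(spark_substr(1901000200, 4, 3))      # 100

try:
    spark_substr(1901000200, 2, None)      # what [2:] amounts to
except TypeError as err:
    print(err)
```

So `[4:10000]` reads as "start at the 4th character and take up to 10000 characters", not "take index 4 up to index 10000". If you prefer to spell that out, `F.substring("studentID", 4, 10000)` should express the same thing.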
Sunny Shukla