8

Closely related to: Spark Dataframe column with last character of other column but I want to extract multiple characters from the -1 index.


I have the following pyspark dataframe df

+----------+----------+
|    number|event_type|
+----------+----------+
|0342224022|        11|
|0112964715|        11|
+----------+----------+

I want to extract 3 characters from the last index of the number column.

I tried the following:

from pyspark.sql.functions import substring 
df.select(substring(df['number'], -1, 3), 'event_type').show(2)

# which returns:

+----------------------+----------+
|substring(number,-1,3)|event_type|
+----------------------+----------+
|                     2|        11|
|                     5|        11|
+----------------------+----------+

The below is the expected output (and I'm not sure what the output above is):

+----------------------+----------+
|substring(number,-1,3)|event_type|
+----------------------+----------+
|                   022|        11|
|                   715|        11|
+----------------------+----------+

What am I doing wrong?

Note: Spark version 1.6.0

akilat90
  • 5,436
  • 7
  • 28
  • 42

2 Answers2

21

This is how you use substring. Your position will be -3 and the length is 3.

pyspark.sql.functions.substring(str, pos, len)

You need to change your substring function call to:

from pyspark.sql.functions import substring
df.select(substring(df['number'], -3, 3), 'event_type').show(2)
#+------------------------+----------+
#|substring(number, -3, 3)|event_type|
#+------------------------+----------+
#|                     022|        11|
#|                     715|        11|
#+------------------------+----------+
pault
  • 41,343
  • 15
  • 107
  • 149
pissall
  • 7,109
  • 2
  • 25
  • 45
  • Shouldn't the position be -1, as per the documentation: https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html?highlight=lo#pyspark.sql.functions.substring I'm currently on mobile. I will update once I have access to a computer. – akilat90 Apr 12 '18 at 10:16
  • You need to specify the starting position, and then the length till which you want to look at. So starting from position -3, look for 3 string characters. – pissall Apr 12 '18 at 10:22
  • My starting position should be the last index. I didn't get how the last index of a string became -3 instead of -1 – akilat90 Apr 12 '18 at 10:27
  • 1
    You're talking about reverse numbers. Please try the solution and then tell me if it gives you the desired output. You'll understand yourself. – pissall Apr 12 '18 at 10:47
  • 2
    You want last 3 numbers, right? You can't count the length in reverse. You need to specify a starting position, which will be -3 which is first character from the last 3 numbers. And then you look for 3 characters from that position. – pissall Apr 12 '18 at 10:48
  • 1
    Yes, this works. Your third comment was helpful. Thanks. – akilat90 Apr 12 '18 at 16:14
  • @pissall is it possible to extract the first characters excluding the last 3 .. is that suppose to be substring(string,1,-3) does not seem to work – E B Jan 03 '19 at 23:18
  • @EB You need to specify index and length. `-3` is not a length. Do you have variable or fixed length of the items you want to sub? – pissall Jan 04 '19 at 04:55
0

This can also be done with SQL code thanks to "expr" function:

from pyspark.sql.functions import expr

df.select(expr('RIGHT(number, 3) AS substring'), 'event_type').show(2)
gcollar
  • 86
  • 1
  • 9