
I am trying to convert the Python code below to PySpark. Please let me know what's wrong in the PySpark version of the code:

Original Python version:

for i in range(0, km_data.count()):
    if i == 0:
        km_data['risk'].iloc[i] = not_lapsed + lapsed
    else:
        km_data['risk'].iloc[i] = km_data['risk'].iloc[i-1] - (km_data['lapsed'].iloc[i-1]) - (km_data['censored'].iloc[i])

PySpark version used:

for i in range(0, km_data.count()):
    if i == 0:
        km_data.collect()[i]['risk'] = not_lapsed + lapsed
    else:
        km_data.collect()[i]['risk'] = km_data.collect()[i-1]['risk'] - (km_data.collect()[i-1]['lapsed']) - (km_data.collect()[i-1]['censored'])

Basically, I am looking for an equivalent of iloc in PySpark that can help me get the results. Please ignore indentation issues, as I typed this code on a mobile device.

  • Do you want to share the sample input and output in table format for better visibility? – dsk Jul 08 '20 at 05:39
  • @dsk, actually, using collect() I am getting an error that the Row object does not support assignment. – Shashank Paliwal Jul 08 '20 at 05:47
  • That is because you are assigning a value here: km_data.collect()[i]['risk'] = ... Try using a when() and otherwise() combination, which is nothing but if/else in Python (see the sketch after these comments). – dsk Jul 08 '20 at 05:54
  • @dsk could you please show me an example of using when and otherwise with a loop? – Shashank Paliwal Jul 08 '20 at 05:58
  • Please follow this - https://stackoverflow.com/questions/39982135/apache-spark-dealing-with-case-statements – dsk Jul 08 '20 at 06:00
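
For reference, a minimal generic sketch of the when()/otherwise() pattern suggested in the comments (df, age, and flag are illustrative names only, not columns from the question):

from pyspark.sql import functions as F

# if/else on a column: when() supplies the conditional branch, otherwise() the fallback
df = df.withColumn("flag", F.when(F.col("age") < 18, "minor").otherwise("adult"))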

1 Answer


collect() can be used here, and I can see you did the same:

X = df.collect()[0]['age']
# or by position:
X = df.collect()[0][1]  # row 0, col 1

Is there anything else you are looking for?
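
Note, however, that the Row objects returned by collect() are immutable, which is why the assignment in the question's loop fails. One way to get the same result without assigning to rows is to unroll the recurrence into cumulative sums over a window. This is only a minimal sketch: it assumes km_data has a column that defines the row order iloc relied on (called t here, a hypothetical name) and that not_lapsed and lapsed are scalar totals computed beforehand, as in the original pandas code.

from pyspark.sql import functions as F, Window

# Hypothetical ordering column "t"; replace with whatever defines the row order.
w_prev = Window.orderBy("t").rowsBetween(Window.unboundedPreceding, -1)
w_cur = Window.orderBy("t").rowsBetween(Window.unboundedPreceding, Window.currentRow)

km_data = (
    km_data
    # lapsed summed over all strictly previous rows (null for the first row, hence coalesce)
    .withColumn("cum_lapsed_prev", F.coalesce(F.sum("lapsed").over(w_prev), F.lit(0)))
    # censored summed up to the current row, minus the first row's value,
    # so the first row contributes nothing and risk[0] = not_lapsed + lapsed
    .withColumn("cum_censored", F.sum("censored").over(w_cur) - F.first("censored").over(w_cur))
    # unrolled recurrence: risk[i] = risk[0] - sum(lapsed[0..i-1]) - sum(censored[1..i])
    .withColumn("risk", F.lit(not_lapsed + lapsed) - F.col("cum_lapsed_prev") - F.col("cum_censored"))
    .drop("cum_lapsed_prev", "cum_censored")
)

Since the window has no partitionBy, Spark will pull all rows into a single partition; for a small lifetable-style DataFrame like this that is usually acceptable.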

dsk