from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
spark = SparkSession.builder \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()
data = [('James','Smith','M',3000), ('Anna','Rose','F',4100),
        ('Robert','Williams','M',6200)]
columns = ["firstname","lastname","gender","salary"]
df = spark.createDataFrame(data=data, schema=columns)
df2 = df.select(lit("D").alias("S"), "*")
df2.show()
Output:
+---+---------+--------+------+------+
|  S|firstname|lastname|gender|salary|
+---+---------+--------+------+------+
|  D|    James|   Smith|     M|  3000|
|  D|     Anna|    Rose|     F|  4100|
|  D|   Robert|Williams|     M|  6200|
+---+---------+--------+------+------+

Required Output:

  • Need to add an extra trailer row "T" with the row count in column "firstname", like below. Column "firstname" can be of any type.
+---+---------+--------+------+------+
|  S|firstname|lastname|gender|salary|
+---+---------+--------+------+------+
|  D|    James|   Smith|     M|  3000|
|  D|     Anna|    Rose|     F|  4100|
|  D|   Robert|Williams|     M|  6200|
|  T|        3|        |      |      |
+---+---------+--------+------+------+

I tried creating a new dataframe with the trailer values and applying a union, as suggested in most Stack Overflow solutions, but both dataframes must have the same number of columns. Is there a better way to get the count into the trailer, irrespective of the type of the "firstname" column?

RickyS

1 Answer


Since you want to create a new row irrespective of column type, you can write a function that takes the column name as an input, and returns a dictionary containing all of the necessary information for the new row including the number of entries in that column.

To create an output pyspark dataframe like the one you've shown, every column will have to be a string type, because the new row has to contain an empty string '' for the columns lastname, gender, and salary. You cannot have mixed types within a pyspark column (see here), so when you union df2 with total_row_df, any column that is a string type in total_row_df forces the corresponding column of df2 to be coerced to string in the resulting dataframe.

from pyspark.sql.functions import count

def create_total_row(col_name):
    """Build a dict for the trailer row: 'T' in column S, the non-null count
    of col_name in its own column, and empty strings everywhere else."""
    total_row = {}
    for col in df2.columns:
        if col == 'S':
            total_row[col] = 'T'
        elif col == col_name:
            # count() ignores nulls and returns an integer
            total_row[col] = df2.select(count(df2[col_name])).collect()[0][0]
        else:
            total_row[col] = ''
    return total_row

total_row = create_total_row('firstname')
total_row_df = spark.createDataFrame([total_row])
df2.union(total_row_df).show()

Result:

+---+---------+--------+------+------+
|  S|firstname|lastname|gender|salary|
+---+---------+--------+------+------+
|  D|    James|   Smith|     M|  3000|
|  D|     Anna|    Rose|     F|  4100|
|  D|   Robert|Williams|     M|  6200|
|  T|        3|        |      |      |
+---+---------+--------+------+------+
Derek O
  • Thanks a lot, it helped. But I am looking for a solution without using union and new dataframes. – RickyS Oct 14 '22 at 16:51
  • Is there any particular reason for this constraint? It seems like creating a one-row dataframe and performing a union is pretty straightforward, and it's what's suggested [here](https://stackoverflow.com/questions/52685609/add-new-rows-to-pyspark-dataframe). I suppose you could do something like `df_rows = df.collect()` to retrieve a list of rows, then append the new row to df_rows, but that isn't very performant as far as I know – Derek O Oct 14 '22 at 23:58
  • No, there is no constraint like that, but I was looking for any other alternatives we have. – RickyS Oct 18 '22 at 09:29