
I have a list of around 17k string elements. I have to create new columns in a DataFrame, one per element in the list, named after that element and with every value set to integer 0.

How do I do this?

Example list

['V1045','71752','31231']

Format required:

ID    V1045   71752    31231
1     0       0        0
2     0       0        0
3     0       0        0
4     0       0        0

The DataFrame has around 700,000 rows.


2 Answers


You can easily generate that data.

This list will be used for the column names:

l = ['ID', 'V1045', '71752', '31231']

Then create a range with the required row indices, using static zeroes as the values:

df = sc.parallelize(range(700000))\
       .map(lambda i: [i, 0, 0, 0])\
       .toDF(l)

Calling df.show(5) prints something like:

+---+-----+-----+-----+
| ID|V1045|71752|31231|
+---+-----+-----+-----+
|  0|    0|    0|    0|
|  1|    0|    0|    0|
|  2|    0|    0|    0|
|  3|    0|    0|    0|
|  4|    0|    0|    0|
+---+-----+-----+-----+
only showing top 5 rows
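
For the real case you would not hard-code the three zeroes. A minimal sketch of the same approach generalized to the full list (here `names` stands in for the ~17k-element list; untested at that scale):

names = ['V1045', '71752', '31231']    # stand-in for the ~17k-element list

# One row per index, with one zero for every name in the list
df = sc.parallelize(range(700000)) \
       .map(lambda i: [i] + [0] * len(names)) \
       .toDF(['ID'] + names)

Keep in mind that a Python list of ~17k values is built for each of the 700,000 rows, so this can be slow and memory-hungry.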
ernest_k
  • You could also use `sc.range()` for example: `df = sc.range(100).map(lambda x: [x] + [0]*len(l)).toDF(["ID"] + l)` – pault Apr 04 '18 at 14:35
  • @ErnestKiwele see [this answer](https://stackoverflow.com/a/48709440/5858851) for a detailed explanation of why `sc.range()` is better. – pault Apr 04 '18 at 14:50

If you already have a DataFrame, the easiest way to add columns is to use withColumn(). You can set the value 0 for every row using pyspark.sql.functions.lit().

For example:

import pyspark.sql.functions as f

l = ['V1045', '71752', '31231']
for new_col in l:
    df = df.withColumn(new_col, f.lit(0))

df.show(n=5)
#+---+-----+-----+-----+
#| ID|V1045|71752|31231|
#+---+-----+-----+-----+
#|  0|    0|    0|    0|
#|  1|    0|    0|    0|
#|  2|    0|    0|    0|
#|  3|    0|    0|    0|
#|  4|    0|    0|    0|
#+---+-----+-----+-----+
#only showing top 5 rows

Remember that Spark is lazy, so these operations are not actually executed one at a time in a loop; the whole chain collapses into a single projection, as the physical plan shows.

df.explain()
#== Physical Plan ==
#*Project [ID#111L, 0 AS V1045#114, 0 AS 71752#118, 0 AS 31231#123]
#+- Scan ExistingRDD[ID#111L]
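
That said, chaining withColumn roughly 17,000 times can make the query planning itself slow, because each call re-analyzes the plan. A sketch of an alternative (not from the original answer) that builds all the zero columns in one select, assuming `new_cols` holds the 17k names:

import pyspark.sql.functions as f

new_cols = ['V1045', '71752', '31231']    # stand-in for the ~17k names
df = df.select('*', *[f.lit(0).alias(c) for c in new_cols])

This produces the same single Project in the physical plan, but the plan is only analyzed once.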

You probably shouldn't use sc.parallelize(range()), especially if you're using Python 2, as explained in this post.

pault
  • Thanks for your answer. But the thing is, I tried this and it was not successful just because of the sheer size of the dataframe that will be made. There are about 700,000 rows and around 17,000 new columns (The list has 17,000 string elements) have to be added, which leads me to a memory error. – Rahul Khandelwal Apr 05 '18 at 09:50
  • I don't have documentation to back this up (will look for it) but Spark does not scale well for wide DataFrames (in practice I have found the limit to be around 1400 columns, but I suppose it depends on the size of the data). You can have as many rows as you want because rows are split amongst executors, but a single row is not split. You may want to consider breaking your DataFrame into multiple tables and using an ID as a primary key to join them. – pault Apr 05 '18 at 12:46
  • Or maybe you should look into [`SparseVector`](http://spark.apache.org/docs/2.2.0/api/python/pyspark.mllib.html#pyspark.mllib.linalg.SparseVector) or [`DenseVector`](http://spark.apache.org/docs/2.2.0/api/python/pyspark.mllib.html#pyspark.mllib.linalg.DenseVector). There's likely a better way to do what you're trying to do. – pault Apr 05 '18 at 14:25
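
Building on that last comment, a minimal sketch of the sparse-vector idea. This is an assumption, not something from the thread: it uses the DataFrame-oriented pyspark.ml.linalg classes rather than the linked pyspark.mllib ones, and a hypothetical features column name. Instead of 17,000 literal zero columns, each row carries one all-zero SparseVector of length 17,000:

from pyspark.ml.linalg import SparseVector, VectorUDT
import pyspark.sql.functions as f

n = 17000                              # number of elements in the list
zero_vec = SparseVector(n, {})         # all zeros, stored sparsely
make_zeros = f.udf(lambda: zero_vec, VectorUDT())

df = df.withColumn('features', make_zeros())

Whether this fits depends on what the zero columns are needed for downstream.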