
We have a requirement of making an n*n matrix in PySpark for some calculation. We tried to do that in PySpark like below:

similarity_matrix = np.zeros(shape=(data1.count(),data1.count()))

similarity_matrix = spark.createDataFrame(similarity_matrix)

Here data1 is our DataFrame with 80K rows. Is there any way to do this in PySpark? We are getting a memory error while doing this.

DennisLi
  • I believe you can find that in the ML libraries. One I found by googling: `from pyspark.mllib.linalg import DenseMatrix` ... `DenseMatrix(2,2,range(4))` (a small sketch of this follows these comments). – PIG Oct 09 '19 at 14:37
  • Please post the error message. Currently your approach isn't going to work as you can't create a dataframe from a numpy array like that. That causes the following error: `TypeError: Can not infer schema for type: `. Maybe you want to look at [this](https://stackoverflow.com/questions/45063591/creating-spark-dataframe-from-numpy-matrix). – cronoik Oct 09 '19 at 14:43
  • I see the below error: ```Exception: Traceback (most recent call last): File "/tmp/zeppelin_pyspark-4172965528199358051.py", line 326, in exec(code) File "", line 1, in MemoryError``` – user11571614 Oct 10 '19 at 02:12
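
The `DenseMatrix` mentioned in the first comment is a local, driver-side matrix type from `pyspark.mllib.linalg`; a minimal sketch of that API is shown below. Note that a dense 80000 x 80000 matrix of doubles needs roughly 51 GB, so a local matrix alone would not avoid the memory error.

from pyspark.mllib.linalg import DenseMatrix

# 2x2 matrix; the values 0..3 fill it column by column
m = DenseMatrix(2, 2, range(4))
print(m.toArray())  # [[0. 2.] [1. 3.]]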

1 Answer

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

ss = SparkSession.builder.getOrCreate()
size = 80000
schema = StructType([StructField(str(i), IntegerType(), True) for i in range(size)])
rdd = ss.range(size).rdd.map(lambda x: [0] * size)  # one all-zero row of plain ints per id
df = ss.createDataFrame(rdd, schema)
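
A quick sanity check on the result (the count triggers a full job and can be slow with such a wide schema):

print(len(df.columns))  # 80000 columns
print(df.count())       # 80000 rows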

or in Scala, keeping each row as a single array column instead of 80000 separate columns:

scala> val df = spark.range(80000).map(s=>new Array[Int](80000))
df: org.apache.spark.sql.Dataset[Array[Int]] = [value: array<int>]

scala> df.first.size
res35: Int = 80000

scala> df.count
res36: Long = 80000
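
A rough PySpark equivalent of the Scala version above, with one array column per row (a sketch, assuming Spark 2.4+ for `array_repeat` and an existing SparkSession):

from pyspark.sql import SparkSession
from pyspark.sql.functions import array_repeat, lit

spark = SparkSession.builder.getOrCreate()
df = spark.range(80000).select(array_repeat(lit(0), 80000).alias("value"))  # 80000 zeros per row
df.printSchema()   # value: array<integer>
print(df.count())  # 80000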
chlebek