
We have a requirement of making an n*n matrix in PySpark for some calculation. We tried to do that in PySpark like below:

similarity_matrix = np.zeros(shape=(data1.count(),data1.count()))

similarity_matrix = spark.createDataFrame(similarity_matrix)

Here data1 is our DataFrame with 80K rows. Is there any way to do this in PySpark? We are getting a memory error while doing this.

DennisLi
  • I believe you can find that in the ML libraries. One I found by googling: `from pyspark.mllib.linalg import DenseMatrix` ... `DenseMatrix(2,2,range(4))` (a small sketch of this follows these comments). – PIG Oct 09 '19 at 14:37
  • Please post the error message. Currently your approach isn't going to work as you can't create a dataframe from a numpy array like that. That causes the following error: `TypeError: Can not infer schema for type: `. Maybe you want to look at [this](https://stackoverflow.com/questions/45063591/creating-spark-dataframe-from-numpy-matrix). – cronoik Oct 09 '19 at 14:43
  • I see the below error: ```Exception: Traceback (most recent call last): File "/tmp/zeppelin_pyspark-4172965528199358051.py", line 326, in exec(code) File "", line 1, in MemoryError``` – user11571614 Oct 10 '19 at 02:12
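
The `DenseMatrix` mentioned in the first comment is a local, driver-side matrix type from `pyspark.mllib.linalg`; a minimal sketch of that API is shown below. Note that a dense 80000 x 80000 matrix of doubles needs roughly 51 GB, so a local matrix alone would not avoid the memory error.

from pyspark.mllib.linalg import DenseMatrix

# 2x2 matrix; the values 0..3 fill it column by column
m = DenseMatrix(2, 2, range(4))
print(m.toArray())  # [[0. 2.] [1. 3.]]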

1 Answer

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

ss = SparkSession.builder.getOrCreate()
size = 80000
schema = StructType([StructField(str(i), IntegerType(), True) for i in range(size)])
rdd = ss.range(size).rdd.map(lambda x: [0] * size)  # one all-zero row of plain ints per id
df = ss.createDataFrame(rdd, schema)
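
A quick sanity check on the result (the count triggers a full job and can be slow with such a wide schema):

print(len(df.columns))  # 80000 columns
print(df.count())       # 80000 rows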

or in Scala, keeping each row as a single array column instead of 80000 separate columns:

scala> val df = spark.range(80000).map(s=>new Array[Int](80000))
df: org.apache.spark.sql.Dataset[Array[Int]] = [value: array<int>]

scala> df.first.size
res35: Int = 80000

scala> df.count
res36: Long = 80000
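
A rough PySpark equivalent of the Scala version above, with one array column per row (a sketch, assuming Spark 2.4+ for `array_repeat` and an existing SparkSession):

from pyspark.sql import SparkSession
from pyspark.sql.functions import array_repeat, lit

spark = SparkSession.builder.getOrCreate()
df = spark.range(80000).select(array_repeat(lit(0), 80000).alias("value"))  # 80000 zeros per row
df.printSchema()   # value: array<integer>
print(df.count())  # 80000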
chlebek