I am trying to generate molecular descriptors using RDKit and then run machine learning on them with Spark. I have generated the descriptors, and I have found example code for a Random Forest that loads its DataFrame from a file in svmlight format. I could create such a file with dump_svmlight_file, but writing to a file doesn't feel very "Sparky".
I have come this far:
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
import numpy as np
from sklearn.datasets import dump_svmlight_file
from pyspark.ml import Pipeline
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
df = spark.read.option("header","true")\
.option("delimiter", '\t').csv("acd_logd_100.smiles")
mols = df.select("canonical_smiles").rdd.flatMap(lambda x : x)\
.map(lambda x: Chem.MolFromSmiles(x))\
.map(lambda x: AllChem.GetMorganFingerprintAsBitVect(x, 2, nBits=1024))\
.map(lambda x: np.array(x))
spark.createDataFrame(mols)
But clearly I can't create a DataFrame from my RDD of np.arrays like this (I get a strange error: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()).
I guess I also need to add the y values and somehow tell the Random Forest implementation which column in the DataFrame is X and which is y, but I can't even create a DataFrame from this data yet. How do I do this?
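As a minimal, Spark-free reproduction of what I think is going on (this is my assumption, not confirmed against Spark's source): schema inference seems to truth-test each row object somewhere, and a multi-element NumPy array refuses to be truth-tested, which would produce exactly this ValueError.

```python
import numpy as np

# A bit-vector fingerprint converted to a NumPy array, as in my pipeline.
arr = np.array([1, 0, 1])

def truth_test_raises(a):
    """Return True if truth-testing the object raises ValueError."""
    try:
        bool(a)
        return False
    except ValueError:
        return True

print(truth_test_raises(arr))           # True: the bare array is ambiguous
print(truth_test_raises(arr.tolist()))  # False: a plain Python list is fine
```

If that guess is right, converting each array to a plain Python structure (e.g. with .tolist()) before createDataFrame might sidestep the error, though I haven't verified this.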
EDIT:
I have tried to go via pyspark.ml.linalg.Vectors to create a DataFrame, loosely based on "Creating Spark dataframe from numpy matrix", but I cannot seem to create a Vector like this:
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
import numpy as np
from sklearn.datasets import dump_svmlight_file
from pyspark.ml import Pipeline
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
df = spark.read.option("header","true")\
.option("delimiter", '\t').csv("acd_logd_100.smiles")
mols = df.select("canonical_smiles").rdd.flatMap(lambda x : x)\
.map(lambda x: Chem.MolFromSmiles(x))\
.map(lambda x: AllChem.GetMorganFingerprintAsBitVect(x, 2, nBits=1024))\
.map(lambda x: np.array(x))\
.map(lambda x: Vectors.sparse(x))
print(mols.take(5))
mydf = spark.createDataFrame(mols,schema=["features"])
I get:
TypeError: only size-1 arrays can be converted to Python scalars
which I don't understand at all.
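Again as a Spark-free reproduction of what I suspect (my assumption): Vectors.sparse(size, *args) expects a scalar size as its first argument, so passing the whole fingerprint array makes it call int() on a multi-element array, which NumPy rejects with exactly this message.

```python
import numpy as np

# A multi-element array standing in for my fingerprint.
arr = np.array([0, 1, 0, 1])

def scalar_conversion_raises(a):
    """Return True if int() on the object raises TypeError."""
    try:
        int(a)
        return False
    except TypeError:
        return True

print(scalar_conversion_raises(arr))   # True: multi-element array can't become a scalar
print(scalar_conversion_raises(1024))  # False: a plain int size converts fine
```

If so, Vectors.sparse would need the vector length and the index/value data as separate arguments, rather than the raw array, but I haven't confirmed the right call.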