I am trying to generate molecular descriptors using RDKit and then run machine learning on them with Spark. I have generated the descriptors, and I have found example code for a Random Forest that loads its DataFrame from a file in svmlight format. I could create such a file with dump_svmlight_file, but writing to a file doesn't feel very "Sparky".
I have come this far:
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
import numpy as np
from sklearn.datasets import dump_svmlight_file
from pyspark.ml import Pipeline
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
df = spark.read.option("header","true")\
.option("delimiter", '\t').csv("acd_logd_100.smiles")
mols = df.select("canonical_smiles").rdd.flatMap(lambda x : x)\
.map(lambda x: Chem.MolFromSmiles(x))\
.map(lambda x: AllChem.GetMorganFingerprintAsBitVect(x, 2, nBits=1024))\
.map(lambda x: np.array(x))
spark.createDataFrame(mols)
But clearly I can't create a DataFrame from my RDD of np.arrays like this (I get a strange error: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()).
I guess I also need to add the y values and somehow tell the Random Forest implementation which column in the DataFrame is X and which is y, but I can't even create a DataFrame from this data yet. How do I do this?
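As a minimal, Spark-free reproduction of what I think is going on (this is my assumption, not confirmed against Spark's source): schema inference seems to truth-test each row object somewhere, and a multi-element NumPy array refuses to be truth-tested, which would produce exactly this ValueError.

```python
import numpy as np

# A bit-vector fingerprint converted to a NumPy array, as in my pipeline.
arr = np.array([1, 0, 1])

def truth_test_raises(a):
    """Return True if truth-testing the object raises ValueError."""
    try:
        bool(a)
        return False
    except ValueError:
        return True

print(truth_test_raises(arr))           # True: the bare array is ambiguous
print(truth_test_raises(arr.tolist()))  # False: a plain Python list is fine
```

If that guess is right, converting each array to a plain Python structure (e.g. with .tolist()) before createDataFrame might sidestep the error, though I haven't verified this.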
EDIT:
I have tried to go via pyspark.ml.linalg.Vectors to create a DataFrame, loosely based on "Creating Spark dataframe from numpy matrix", but I cannot seem to create a Vector like this:
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
import numpy as np
from sklearn.datasets import dump_svmlight_file
from pyspark.ml import Pipeline
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
df = spark.read.option("header","true")\
.option("delimiter", '\t').csv("acd_logd_100.smiles")
mols = df.select("canonical_smiles").rdd.flatMap(lambda x : x)\
.map(lambda x: Chem.MolFromSmiles(x))\
.map(lambda x: AllChem.GetMorganFingerprintAsBitVect(x, 2, nBits=1024))\
.map(lambda x: np.array(x))\
.map(lambda x: Vectors.sparse(x))
print(mols.take(5))
mydf = spark.createDataFrame(mols,schema=["features"])
I get:
TypeError: only size-1 arrays can be converted to Python scalars
which I don't understand at all.
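Again as a Spark-free reproduction of what I suspect (my assumption): Vectors.sparse(size, *args) expects a scalar size as its first argument, so passing the whole fingerprint array makes it call int() on a multi-element array, which NumPy rejects with exactly this message.

```python
import numpy as np

# A multi-element array standing in for my fingerprint.
arr = np.array([0, 1, 0, 1])

def scalar_conversion_raises(a):
    """Return True if int() on the object raises TypeError."""
    try:
        int(a)
        return False
    except TypeError:
        return True

print(scalar_conversion_raises(arr))   # True: multi-element array can't become a scalar
print(scalar_conversion_raises(1024))  # False: a plain int size converts fine
```

If so, Vectors.sparse would need the vector length and the index/value data as separate arguments, rather than the raw array, but I haven't confirmed the right call.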