Questions tagged [apache-spark-mllib]

MLlib is the low-level, RDD-based machine learning library for Apache Spark.

2241 questions
1 vote · 1 answer

K-Means on time series data with Apache Spark

I have a data pipeline system where all events are stored in Apache Kafka. There is an event processing layer, which consumes and transforms that data (time series) and then stores the resulting data set into Apache Cassandra. Now I want to use…
1 vote · 0 answers

What is the maximum number of columns supported by an Apache Spark DataFrame?

Spark version: 1.5.2 with YARN 2.7.1.2.3.0.0-2557. I'm running into a problem while exploring data through spark-shell: I'm trying to create a really wide DataFrame with 3000 columns. Code as below: val valueFunctionUDF = udf((valMap:…
EdwinGuo • 1,765 • 2 • 21 • 27
1 vote · 1 answer

Non-integer ids in Spark MLlib ALS

I'd like to use: val ratings = data.map(_.split(',') match { case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toFloat) }) val model = ALS.train(ratings, rank, numIterations, alpha) However, the user data I get…
ZMath_lin • 523 • 2 • 6 • 14
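MLlib's ALS `Rating` requires `Int` user and item ids, so non-integer ids have to be indexed before training. A minimal pure-Scala sketch of that indexing step (the ids here are hypothetical; on a real RDD you would use `distinct().zipWithIndex()`, or `StringIndexer` in the newer `spark.ml` API):

```scala
// Hypothetical string user ids parsed from the CSV.
val userIds = Seq("u-01", "u-02", "u-01", "u-03")

// distinct + zipWithIndex assigns each id a stable Int code.
val userIndex: Map[String, Int] = userIds.distinct.zipWithIndex.toMap

// userIndex("u-02") can now be passed as the Int user field of Rating(...).
```

The same dictionary (kept, e.g., as a broadcast variable) lets you translate predictions back to the original string ids afterwards.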
1 vote · 2 answers

How to print a Map[String, Array[Float]] in Scala?

I am using the word2vec function from Spark's MLlib library. I want to print the word vectors that I get as output from the getVectors function. My code looks like this: import org.apache.spark._ import org.apache.spark.rdd._ import…
Aditi • 820 • 11 • 27
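`Array[Float]` inherits Java's default `toString`, so printing the map directly yields references like `[F@4f2b3c`. A sketch that formats each vector with `mkString` (the words and values are made up; `getVectors` on a trained Word2VecModel returns the same `Map[String, Array[Float]]` shape):

```scala
val vectors: Map[String, Array[Float]] = Map(
  "spark" -> Array(0.1f, 0.2f),
  "mllib" -> Array(0.3f, 0.4f)
)

// mkString renders the array contents instead of the object reference.
vectors.foreach { case (word, vec) =>
  println(s"$word -> ${vec.mkString("[", ", ", "]")}")
}
```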
1 vote · 0 answers

Training error when using PySpark ALS

I run Spark on a virtual machine and use the ALS library to train my data. rawRatings = sc.textFile('data/ratings.csv').map(lambda x: x.replace('\t', ',')) parsedRatings = rawRatings.map(lambda x: x.split(',')).map(lambda x: Rating(int(x[0]),…
1 vote · 0 answers

Spark Correlation Coefficient

I have a specific application in which I am trying to verify the strong positive relationship between many of the time series that I am reading. I should elaborate more: I have a lot of actors which are distributed, and they generate…
1 vote · 0 answers

How to deal with categoricalFeaturesInfo?

How do I deal with categoricalFeaturesInfo in RandomForest? I created a list of variables like this: alllist = listdouble + listint + listcategorielfeatures But when I create the LabeledPoint I lose this order. How can I keep the type of my variables, like…
malouke • 529 • 2 • 5 • 6
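RandomForest's `categoricalFeaturesInfo` is a `Map[Int, Int]` from feature index (position in the assembled feature vector) to number of categories, so the key step is pinning down where each categorical column lands after concatenation. A sketch under the question's own layout, doubles then ints then categorical features (the counts and arities are hypothetical):

```scala
val numDouble = 4               // length of listdouble (hypothetical)
val numInt = 2                  // length of listint (hypothetical)
val categoryArities = Seq(3, 5) // categories per categorical feature

// Categorical features occupy the last positions of the assembled vector.
val categoricalFeaturesInfo: Map[Int, Int] =
  categoryArities.zipWithIndex.map { case (arity, i) =>
    (numDouble + numInt + i) -> arity
  }.toMap
// e.g. feature 6 has 3 categories and feature 7 has 5.
```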
1 vote · 2 answers

Why does ALS.trainImplicit give better predictions for explicit ratings?

Edit: I tried a standalone Spark application (instead of PredictionIO) and my observations are the same. So this is not a PredictionIO issue, but still confusing. I am using PredictionIO 0.9.6 and the Recommendation template for collaborative…
1 vote · 1 answer

How to keep record information when working with MLlib

I'm working on a classification problem in which I have to use the MLlib library. The classification algorithms in MLlib (say, Logistic Regression) require an RDD[LabeledPoint]. A LabeledPoint has only two fields, a label and a feature vector. When…
HHH • 6,085 • 20 • 92 • 164
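Since LabeledPoint has no room for extra fields, a common pattern is to keep them alongside: build pairs of (id, LabeledPoint), train on the values only, and join the predictions back to the ids afterwards. A pure-Scala sketch of the keying step (`Record` and its fields are hypothetical stand-ins; on Spark the same shape would be an RDD of pairs):

```scala
case class Record(id: String, label: Double, features: Array[Double])

val records = Seq(
  Record("r1", 1.0, Array(0.5, 0.2)),
  Record("r2", 0.0, Array(0.1, 0.9))
)

// Keep the id as the key; pass only (label, features) to the learner,
// then join model predictions back by id to recover the full records.
val keyed: Map[String, (Double, Array[Double])] =
  records.map(r => r.id -> (r.label, r.features)).toMap
```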
1 vote · 0 answers

Best practice for mapping a String to a unique Integer in distributed mode

I have a dataset with 40K entries; each entry looks like the following: product/productId: B00004CK40 review/userId: A39IIHQF18YGZA review/profileName: C. A. M. Salas review/helpfulness: 0/0 review/score: 4.0 review/time: 1175817600…
Jay • 717 • 11 • 37
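`hashCode`-based mappings look tempting but can collide; the collision-free route is to index the distinct values. On an RDD that is `distinct().zipWithUniqueId()` (or `zipWithIndex()` when consecutive ids matter). A pure-collection sketch of the same idea, where the second product id is hypothetical:

```scala
val productIds = Seq("B00004CK40", "B00004CK40", "B000XXXXXX")

// One integer per distinct string, assigned deterministically by position.
val idMap: Map[String, Long] =
  productIds.distinct.zipWithIndex.map { case (p, i) => p -> i.toLong }.toMap
```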
1 vote · 3 answers

What is setNumClasses in LogisticRegressionWithLBFGS (Spark MLlib)?

I couldn't understand the significance of setNumClasses here, and I couldn't find anything about it in the Spark MLlib documentation. new LogisticRegressionWithLBFGS() .setNumClasses(10)
Naresh • 5,073 • 12 • 67 • 124
1 vote · 1 answer

Inverse of a Spark RowMatrix

I am trying to invert a Spark RowMatrix. The function I am using is below. def computeInverse(matrix: RowMatrix): BlockMatrix = { val numCoefficients = matrix.numCols.toInt val svd = matrix.computeSVD(numCoefficients, computeU = true) val…
Debasish • 113 • 1 • 9
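For reference, the SVD route computes the Moore–Penrose pseudo-inverse rather than a true inverse when the matrix is singular or non-square. With the decomposition returned by computeSVD:

```latex
A = U \Sigma V^{\top}
\qquad\Longrightarrow\qquad
A^{+} = V \, \Sigma^{-1} U^{\top}
```

where singular values below some tolerance are treated as zero (their reciprocals dropped) before inverting \(\Sigma\), which keeps the computation numerically stable.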
1 vote · 1 answer

Converting a text-type independent variable to numeric for Spark Naive Bayes

I have a question about Naive Bayes with numeric and non-numeric features. I have 5 independent parameters on which I want to classify data: Male,Suspicion of Alcohol,Weekday,12am-4am,75,30-39 Male,Moving Traffic…
mahendra singh • 384 • 1 • 13
1 vote · 1 answer

Understanding Spark MLlib LDA input format

I am trying to implement LDA using Spark MLlib, but I am having difficulty understanding the input format. I was able to run its sample implementation, taking input from a file that contains only numbers, as shown: 1 2 6 0 2 3 1 1 0 0 3 1 3 0 1 3 0 0…
Amit Kumar • 2,685 • 2 • 37 • 72
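In the RDD-based API, LDA.run expects a corpus of (document id, term-count vector) pairs: each line of numbers in the sample file is one document, giving the count of each vocabulary term in that document. A pure-Scala sketch of parsing one such line, treating the first counts from the question's sample as one hypothetical document:

```scala
// One document: term 0 appears once, term 1 twice, term 2 six times, ...
val line = "1 2 6 0 2 3 1 1 0 0 3"
val termCounts: Array[Double] = line.trim.split(' ').map(_.toDouble)

// In Spark this becomes (docId, Vectors.dense(termCounts)), and the
// RDD of such pairs is the corpus passed to LDA.run.
```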
1 vote · 0 answers

Why can we not define our own folds when using CrossValidator?

I have been using the cross-validation process to train a Naive Bayes model, and I realized that it uses the kFold method to get the random sampling of data used to create the folds. This method returns an Array[(RDD[T], RDD[T])] of tuples, which I…