Questions tagged [apache-spark-mllib]

MLlib is a low-level, RDD-based machine learning library for Apache Spark.

2241 questions
1 vote • 0 answers

Debug ArrayIndexOutOfBoundsException in PySpark MLlib

I'm trying to get started with mllib in PySpark, and after having built a dataset I'm trying to run a basic logistic regression. > train.take(4) [LabeledPoint(0.0, (4,[485,909,1715,2023],[1.0,1.0,1.0,1.0])), LabeledPoint(0.0,…
Patrick McCarthy • 2,478 • 2 • 24 • 40
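A common cause of this kind of ArrayIndexOutOfBoundsException is a sparse feature vector whose declared size is not larger than its largest feature index. Below is a minimal Scala sketch of the same setup (the vector size of 2024 and the use of LogisticRegressionWithLBFGS are assumptions, not the asker's code; `sc` is an existing SparkContext):

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // The declared sparse-vector size must be strictly greater than the largest
    // feature index (2023 here), otherwise training can fail at runtime.
    val numFeatures = 2024
    val train = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.sparse(numFeatures, Array(485, 909, 1715, 2023), Array(1.0, 1.0, 1.0, 1.0))),
      LabeledPoint(1.0, Vectors.sparse(numFeatures, Array(10, 200), Array(1.0, 1.0)))
    ))
    val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(train)
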
1 vote • 1 answer

Efficient way to compute row/column sums of an IndexedRowMatrix in Apache Spark

I have a matrix in CoordinateMatrix format in Scala. The matrix is sparse and the entries look like (upon coo_matrix.entries.collect): Array[org.apache.spark.mllib.linalg.distributed.MatrixEntry] = Array( MatrixEntry(0,0,-1.0),…
Kent Carlevi • 133 • 1 • 11
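One straightforward approach, sketched in Scala from the CoordinateMatrix entries mentioned in the excerpt (`coo_matrix` is the asker's matrix; an IndexedRowMatrix can be converted first with toCoordinateMatrix()):

    // Each MatrixEntry carries (i, j, value): summing by i gives row sums,
    // summing by j gives column sums. Only non-zero entries are stored,
    // so the sums are exact for a sparse matrix.
    val rowSums = coo_matrix.entries
      .map(e => (e.i, e.value))
      .reduceByKey(_ + _)          // RDD[(rowIndex, rowSum)]
    val colSums = coo_matrix.entries
      .map(e => (e.j, e.value))
      .reduceByKey(_ + _)          // RDD[(colIndex, colSum)]
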
1 vote • 2 answers

Convert String to Double in Scala / Spark?

I have a JSON data set that contains a price in a string like "USD 5.00". I'd like to convert the numeric portion to a Double for use in an MLlib LabeledPoint, and have managed to split the price string into an array of strings. The below creates a data…
schnee • 1,050 • 2 • 9 • 20
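A minimal Scala sketch of the conversion itself (the variable names are made up; only the "USD 5.00" format comes from the question):

    import scala.util.Try

    val price = "USD 5.00"
    // Take the token after the currency code and parse it as a Double.
    val amount: Double = price.split("\\s+")(1).toDouble                  // 5.0
    // A more defensive variant that yields None instead of throwing on bad input:
    val safeAmount: Option[Double] = Try(price.split("\\s+").last.toDouble).toOption
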
1 vote • 1 answer

CountVectorizerModel error with Apache Spark - Java API

I am working with the sample code from the Apache Spark documentation: https://spark.apache.org/docs/latest/ml-features.html#countvectorizer import java.util.Arrays; import org.apache.spark.SparkConf; import…
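For reference, the CountVectorizer example from the linked page looks roughly like this in Scala (a sketch, not the asker's Java code; `sqlContext` is assumed to exist):

    import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}

    // A DataFrame with an array<string> column, as in the documentation example.
    val df = sqlContext.createDataFrame(Seq(
      (0, Array("a", "b", "c")),
      (1, Array("a", "b", "b", "c", "a"))
    )).toDF("id", "words")

    // Fit a vocabulary of size 3, keeping terms that appear in at least 2 documents.
    val cvModel: CountVectorizerModel = new CountVectorizer()
      .setInputCol("words")
      .setOutputCol("features")
      .setVocabSize(3)
      .setMinDF(2)
      .fit(df)

    cvModel.transform(df).show()
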
1 vote • 0 answers

Linear regression with Spark: wrong predictions

I am trying to run linear regression with Spark, but it gives me really wrong predictions. The data source: The program: def linear_regression(data): """ Run the linear regression algorithm on the data to perform the prediction """ …
rom • 3,592 • 7 • 41 • 71
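The program in the question is PySpark, but the usual culprits are the same in every API: unscaled features, no intercept, and SGD's default step size. A hedged Scala sketch of those fixes (`data` is assumed to be an RDD[LabeledPoint] with dense features):

    import org.apache.spark.mllib.feature.StandardScaler
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

    // Standardize the features (withMean = true requires dense vectors).
    val scaler = new StandardScaler(withMean = true, withStd = true)
      .fit(data.map(_.features))
    val scaled = data.map(p => LabeledPoint(p.label, scaler.transform(p.features))).cache()

    // Fit an intercept and use a smaller step size than the default of 1.0.
    val lr = new LinearRegressionWithSGD()
    lr.setIntercept(true)
    lr.optimizer.setNumIterations(200).setStepSize(0.1)
    val model = lr.run(scaled)
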
1 vote • 3 answers

How to fix the NumberFormatException (NumberFormatException.java:65) thrown when implementing classification code in Apache Spark?

import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf import org.apache.spark.rdd.RDD import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.linalg.Vectors import…
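A NumberFormatException during parsing almost always means a header row, an empty field, or a non-numeric token in the input. A defensive-parsing sketch in Scala (the CSV layout and file name are assumptions; `sc` is an existing SparkContext):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import scala.util.Try

    val raw = sc.textFile("data.csv")                       // assumed input path
    val parsed = raw.flatMap { line =>
      val parts = line.split(",").map(_.trim)
      // Keep only rows where every field parses; bad rows (e.g. the header) are dropped.
      Try(LabeledPoint(parts.head.toDouble, Vectors.dense(parts.tail.map(_.toDouble)))).toOption
    }
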
1 vote • 1 answer

Features with High Cardinality (How to Vectorize them?)

I am trying to run a machine learning problem using scikit-learn on a dataset, and one of the columns (features) has high cardinality: around 300K unique values. How do I vectorize such a feature? Using DictVectorizer would not be a solution as the…
Gayatri • 2,197 • 4 • 23 • 35
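The question is about scikit-learn, but the same problem comes up with MLlib, where feature hashing is a common workaround for a 300K-value categorical column. A Scala sketch (the bucket count is an arbitrary choice, not a recommendation):

    import org.apache.spark.mllib.feature.HashingTF

    // Hash each categorical value into a fixed-size sparse vector instead of
    // materializing a 300K-wide one-hot encoding.
    val hashingTF = new HashingTF(1 << 18)                        // 262,144 buckets
    val vec = hashingTF.transform(Seq("category_value_12345"))    // sparse mllib Vector
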
1 vote • 1 answer

TF-IDF RDDs into readable format using Spark

I am trying to calculate TF-IDF for documents of strings, referring to http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf. import org.apache.spark.rdd.RDD import org.apache.spark.SparkContext import…
Mrunmayee • 495 • 3 • 9 • 16
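For reference, the core of the linked TF-IDF example in Scala, plus a take() to print a few vectors in readable form (`sc` and the input path are assumptions):

    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Each document is a sequence of terms.
    val documents: RDD[Seq[String]] = sc.textFile("docs.txt").map(_.split(" ").toSeq)

    val hashingTF = new HashingTF()
    val tf: RDD[Vector] = hashingTF.transform(documents)
    tf.cache()
    val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)

    // Inspect a few TF-IDF vectors; each prints as a sparse (size, indices, values) triple.
    tfidf.take(3).foreach(println)
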
1 vote • 1 answer

Can't use Vector from Spark MLlib in a DataFrame

When I try to use a UDF that returns a Vector object, Spark throws the following exception: Cause: java.lang.UnsupportedOperationException: Not supported DataType: org.apache.spark.mllib.linalg.VectorUDT@f71b0bce How can I use Vector in my…
Zyoma • 1,528 • 10 • 17
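One pattern that can work is letting Spark infer the UDF's return type from the Scala function rather than passing a DataType explicitly. A hedged sketch (`df` and its array<double> column "raw" are assumed names):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.sql.functions.udf

    // The mllib Vector type carries its own VectorUDT, so a UDF whose return type
    // is inferred from the function signature can produce a vector column directly.
    val toVector = udf { (xs: Seq[Double]) => Vectors.dense(xs.toArray) }
    val withVec = df.withColumn("features", toVector(df("raw")))
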
1 vote • 2 answers

Spark's OnlineLDAOptimizer causing IndexOutOfBoundsException in Java

I'm using Latent Dirichlet Allocation in the Java version of Spark. The following line works fine: LDAModel ldaModel = new LDA() .setK( NUM_TOPICS ) .setMaxIterations( MAX_ITERATIONS ) …
Ben Allison • 7,244 • 1 • 15 • 24
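A Scala sketch of the same configuration (the parameter values and the `corpus` RDD are assumptions). With the online optimizer, every count vector must have exactly the vocabulary length; a shorter vector is a typical source of IndexOutOfBoundsException:

    import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}

    val lda = new LDA()
      .setK(10)
      .setMaxIterations(50)
      .setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(0.05))

    // corpus: RDD[(Long, Vector)] of (document id, term-count vector), all vectors the same length.
    val ldaModel = lda.run(corpus)
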
1 vote • 1 answer

How to use RowMatrix.columnSimilarities (similarity search)

TL;DR: I am trying to train off of an existing data set (Seq[Words] with corresponding categories), and use that trained data set to filter another data set using category similarity. I am trying to train a corpus of data and then use it for text…
Justin Pihony • 66,056 • 18 • 147 • 180
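A minimal sketch of the call itself; note that columnSimilarities compares columns, so each item to compare must be laid out as a column of the RowMatrix (`tfidf` is an assumed RDD[Vector]):

    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val mat = new RowMatrix(tfidf)
    val exact  = mat.columnSimilarities()        // exact cosine similarities, as a CoordinateMatrix
    val approx = mat.columnSimilarities(0.1)     // DIMSUM approximation with threshold 0.1

    // Each MatrixEntry(i, j, sim) is the similarity between columns i and j.
    approx.entries.take(5).foreach(println)
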
1 vote • 2 answers

Handling missing values in SVM in Apache Spark MLlib

I have a classification task and want to use the Apache Spark MLlib SVM algorithm. My input data is n-dimensional, and in the feature vectors some dimensions may be missing. How should I handle the missing values? I think it…
hard coder • 5,449 • 6 • 36 • 61
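MLlib's SVMWithSGD has no built-in handling for missing dimensions, so the usual approach is to impute before training. A mean-imputation sketch in Scala, assuming missing values are encoded as Double.NaN in an RDD[LabeledPoint] called `data`:

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Per-feature means computed over the non-missing values only.
    val featureMeans = data
      .flatMap(_.features.toArray.zipWithIndex.collect { case (v, i) if !v.isNaN => (i, (v, 1L)) })
      .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
      .mapValues { case (sum, count) => sum / count }
      .collectAsMap()

    // Replace each NaN with the mean of its feature, then train as usual.
    val imputed = data.map { p =>
      val filled = p.features.toArray.zipWithIndex.map {
        case (v, i) => if (v.isNaN) featureMeans.getOrElse(i, 0.0) else v
      }
      LabeledPoint(p.label, Vectors.dense(filled))
    }
    val model = SVMWithSGD.train(imputed, 100)
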
1 vote • 1 answer

How to predict values in MLlib

Hi, I am new to Spark MLlib. I already have an R model and am trying to build the same model with Spark MLlib. Here is the R code: delhi <- read.delim("UItrain.txt", na.strings = "") delhi$lnprice <- log(delhi$price) heddel <- lm(lnprice ~ bedrooms+…
arun abimaniyu • 167 • 2 • 12
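A rough Scala sketch of the lm(lnprice ~ bedrooms + …) idea in MLlib terms; the RDD name, the column layout, and the SGD parameters are all assumptions based only on the R snippet:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

    // delhiRows is assumed to be an RDD[Array[Double]] with price in column 0
    // and the predictors (bedrooms, ...) in the remaining columns.
    val training = delhiRows.map { r =>
      LabeledPoint(math.log(r(0)), Vectors.dense(r.drop(1)))   // label = ln(price)
    }.cache()

    val model = LinearRegressionWithSGD.train(training, 200, 0.01)   // iterations, step size
    val predictions = training.map(p => (p.label, model.predict(p.features)))
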
1 vote • 1 answer

Spark MLlib example, NoSuchMethodError: org.apache.spark.sql.SQLContext.createDataFrame()

I'm following the documentation example "Example: Estimator, Transformer, and Param" and I got this error message: 15/09/23 11:46:51 INFO BlockManagerMaster: Registered BlockManager Exception in thread "main" java.lang.NoSuchMethodError: …
keypoint • 2,268 • 4 • 31 • 59
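A NoSuchMethodError on a Spark class usually means mismatched Spark module versions between compile time and runtime. A hedged sbt sketch of keeping them aligned (the version number is a placeholder):

    // build.sbt: keep spark-core, spark-sql and spark-mllib on the same version,
    // and mark them "provided" if the cluster supplies Spark at runtime.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"  % "1.5.0" % "provided",
      "org.apache.spark" %% "spark-sql"   % "1.5.0" % "provided",
      "org.apache.spark" %% "spark-mllib" % "1.5.0" % "provided"
    )
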
1 vote • 1 answer

Spark MLlib LDA: what are the possible reasons for always generating very similar LDA topics?

I am applying the MLlib LDA example on various corpora downloaded from [link]. I am filtering out the stopwords, and also excluding the very frequent terms and the very rare terms. The problem is that I am always getting topics…
Rami • 8,044 • 18 • 66 • 108
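The knobs that usually influence topic diversity in MLlib's LDA are the topic count and the two concentration priors. A sketch of where they are set (the values are illustrative guesses, not a fix; `corpus` is an assumed RDD of term-count vectors):

    import org.apache.spark.mllib.clustering.LDA

    val lda = new LDA()
      .setK(20)                    // too few topics often yields near-identical ones
      .setDocConcentration(1.1)    // alpha: document-topic prior
      .setTopicConcentration(1.1)  // beta: topic-word prior
      .setMaxIterations(100)

    // corpus: RDD[(Long, Vector)] built after the stopword/frequency filtering described above.
    val ldaModel = lda.run(corpus)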