machine learning - Generating vector from text data for KMeans using Spark


I am new to Spark and machine learning. I am trying to cluster data like the following using KMeans:

1::hi how
2::i fine, how

In this data the separator is ::, and the actual text to cluster is in the second column. After reading the official Spark page and numerous articles, I have written the following code, but I am not able to generate the vector to provide as input to the KMeans.train step.

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

val sc = new SparkContext("local", "test")
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val rawData = sc.textFile("data/mllib/km.txt").map(line => line.split("::")(1))
val sentenceData = rawData.toDF("sentence")

val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)

val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)

val clusters = KMeans.train(featurizedData, 2, 10)

I am getting the following error:

<console>:27: error: type mismatch;
 found   : org.apache.spark.sql.DataFrame
    (which expands to)  org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
 required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
       val clusters = KMeans.train(featurizedData, 2, 10)

Please suggest how to process the input data for KMeans.

Thanks in advance.
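For reference, the type mismatch happens because hashingTF.transform returns a DataFrame, while the RDD-based KMeans.train in org.apache.spark.mllib expects an RDD of mllib vectors. One way to bridge the two, shown here only as a minimal sketch (assuming Spark 2.x, where Vectors.fromML is available, and the column names used above), would be:

// Pull the feature column out of the DataFrame and convert each ml vector
// to an mllib vector, so KMeans.train gets the RDD[Vector] it expects.
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.linalg.{Vectors => MLlibVectors}

val vectorRDD = featurizedData
  .select("rawFeatures")
  .rdd
  .map(row => MLlibVectors.fromML(row.getAs[MLVector]("rawFeatures")))
vectorRDD.cache()  // KMeans makes several passes over the data

val clusters = KMeans.train(vectorRDD, 2, 10)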

Finally, it is working after replacing the following code:

val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)
val clusters = KMeans.train(featurizedData, 2, 10)

with

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans

val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val kmeans = new KMeans().setK(2).setFeaturesCol("features").setPredictionCol("prediction")
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, kmeans))
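To actually run this, the pipeline still has to be fit on the sentence DataFrame and then applied to it. A short usage sketch (assuming the same sentenceData DataFrame built above) would be:

// Fit the whole pipeline (tokenizer -> hashingTF -> kmeans) on the sentences
val model = pipeline.fit(sentenceData)

// transform adds a "prediction" column with the assigned cluster per sentence
val predictions = model.transform(sentenceData)
predictions.select("sentence", "prediction").show(false)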
