Spark Window Function top N items performance issues


I am trying to get the top N items in a dataset.

Initially I did this:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rank, row_number, upper}

var df = Seq((1, "row1"), (2, "row2"), (1, "row11"), (1, null)).toDF()
df = df.select($"_1".alias("p_int"), $"_2".alias("p_string"))

val resultdf = df
  .where($"p_string".isNotNull)
  .select(
    $"p_int",
    $"p_int" + 1,
    upper($"p_string"),
    rank().over(Window.partitionBy($"p_int").orderBy($"p_string")) as "rankindex",
    row_number().over(Window.partitionBy($"p_int").orderBy($"p_string")) as "rownumber"
  )
  .where($"rownumber" <= 2)
```

But I want to avoid the performance cost of the where($"rownumber" <= 10) operation,

so I decided to try the following:

```scala
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rank, row_number, upper}

var df = Seq((1, "row1"), (2, "row2"), (1, "row11"), (1, null)).toDF()
df = df.select($"_1".alias("p_int"), $"_2".alias("p_string"))

val test = df
  .where($"p_string".isNotNull)
  .select(
    $"p_int",
    $"p_int" + 1,
    upper($"p_string"),
    rank().over(Window.partitionBy($"p_int").orderBy($"p_string")) as "rankindex",
    row_number().over(Window.partitionBy($"p_int").orderBy($"p_string")) as "rownumber"
  )

implicit val encoder = RowEncoder(test.schema)
var temp = test.mapPartitions(_.take(2))
```

However, testing shows that this does not produce the correct output.

Any thoughts on why? Wouldn't calling take on the iterator obtained from the windowed dataset return the first N elements of each partition's iterator?

The partitions of the Dataset don't have a one-to-one correspondence with the PARTITION BY clause. The magic in over (partition by ...) happens at a lower level, and a single physical partition can process multiple ids.
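The mismatch can be seen without Spark at all. A minimal pure-Scala sketch, modelling one physical partition as a plain iterator of (p_int, p_string) rows from two different window groups:

```scala
// One physical partition can contain rows for several values of p_int.
// mapPartitions(_.take(2)) keeps 2 rows per *physical* partition,
// not the top 2 rows per p_int group.
val physicalPartition = Iterator((1, "row1"), (1, "row11"), (2, "row2"))

val taken = physicalPartition.take(2).toList
// taken == List((1, "row1"), (1, "row11"))
// the group with p_int == 2 has been dropped entirely
```

So even when the window function itself produced correct row numbers, taking the first N elements of each physical partition discards whole groups.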

It also wouldn't save you any work. To correctly generate the row numbers Spark has to shuffle, sort and scan the data anyway. You'd need lower-level mechanisms to avoid the full shuffle and sort (for example an Aggregator with a binary heap).
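To make that last suggestion concrete, here is a hedged pure-Scala sketch of the heap idea (the helper topNPerKey is hypothetical, not Spark's Aggregator API): a bounded PriorityQueue (binary heap) per key keeps at most N elements, so no group is ever fully sorted.

```scala
import scala.collection.mutable

// Sketch: top-N smallest strings per key using a bounded binary heap.
// A max-heap of size n per key lets us evict the largest element as soon
// as the heap overflows, which is the core of an Aggregator-based top-N.
def topNPerKey(rows: Iterator[(Int, String)], n: Int): Map[Int, List[String]] = {
  val heaps = mutable.Map.empty[Int, mutable.PriorityQueue[String]]
  for ((k, v) <- rows) {
    // default Ordering[String] gives a max-heap: the head is the largest,
    // so dequeueing on overflow keeps the n smallest values for this key
    val heap = heaps.getOrElseUpdate(k, mutable.PriorityQueue.empty[String])
    heap.enqueue(v)
    if (heap.size > n) heap.dequeue()
  }
  heaps.map { case (k, h) => k -> h.toList.sorted }.toMap
}
```

In Spark this logic would live in the reduce/merge methods of a custom Aggregator, so each partition only ships at most N rows per key to the final merge instead of shuffling and sorting every row.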


