Spark window function top-N items: performance issues
I am trying to get the top N items in a dataset. Initially I did this:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

var df = Seq((1, "row1"), (2, "row2"), (1, "row11"), (1, null)).toDF()
df = df.select($"_1".alias("p_int"), $"_2".alias("p_string"))

val resultDf = df
  .where($"p_string".isNotNull)
  .select(
    $"p_int",
    $"p_int" + 1,
    upper($"p_string"),
    rank().over(Window.partitionBy($"p_int").orderBy($"p_string")) as "rankIndex",
    row_number().over(Window.partitionBy($"p_int").orderBy($"p_string")) as "rowNumber"
  )
  .where($"rowNumber" <= 2)
```
But I want to avoid the performance cost of the `where($"rowNumber" <= 10)` operation, so I decided on the following:

```scala
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

var df = Seq((1, "row1"), (2, "row2"), (1, "row11"), (1, null)).toDF()
df = df.select($"_1".alias("p_int"), $"_2".alias("p_string"))

val test = df
  .where($"p_string".isNotNull)
  .select(
    $"p_int",
    $"p_int" + 1,
    upper($"p_string"),
    rank().over(Window.partitionBy($"p_int").orderBy($"p_string")) as "rankIndex",
    row_number().over(Window.partitionBy($"p_int").orderBy($"p_string")) as "rowNumber"
  )

implicit val encoder = RowEncoder(test.schema)
var temp = test.mapPartitions(_.take(2))
```
However, testing seems to show that this does not produce the correct output. Any thoughts why? Wouldn't the `take` function on the iterator obtained from the windowed dataset return the first N elements of the iterator?
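For what it's worth, here is a plain-Scala sketch (no Spark; the data is made up) of why `mapPartitions(_.take(2))` can go wrong: `take` operates on the iterator of a *physical* partition, which may contain rows from several `p_int` groups, so it keeps the first two rows of the partition rather than the first two rows of each group.

```scala
// Simulate one physical Spark partition that holds rows from TWO logical
// groups (p_int = 1 and p_int = 2), sorted the way the window would sort them.
val physicalPartition = Iterator(
  (1, "row1"),   // p_int = 1, rowNumber 1
  (1, "row11"),  // p_int = 1, rowNumber 2
  (1, "row12"),  // p_int = 1, rowNumber 3
  (2, "row2")    // p_int = 2, rowNumber 1
)

// What mapPartitions(_.take(2)) actually keeps: the first 2 rows of the
// physical partition, regardless of which group they belong to.
val kept = physicalPartition.take(2).toList
// kept == List((1, "row1"), (1, "row11"))
// The (2, "row2") row is dropped, even though it is the top-1 row of group 2.
```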
The partitions of a Dataset don't have a one-to-one correspondence with the `PARTITION BY` clause. The magic in `over (partition by ...)` happens at a lower level, and a single physical partition can process multiple ids.

You also don't save any work: to correctly generate row numbers, Spark still has to shuffle, sort, and scan the data. You'd need lower-level mechanisms to avoid the full shuffle and sort (for example an `Aggregator` with a binary heap).
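To illustrate the binary-heap idea, here is a plain-Scala sketch (the function name `topNPerKey` is mine, and this is not a Spark `Aggregator`, just the core bookkeeping one would put inside its `reduce`/`merge` methods): each key keeps a bounded heap of at most N candidates, so no more than N rows per key are ever retained, instead of sorting entire groups.

```scala
import scala.collection.mutable

// Top-N smallest strings per key using a bounded heap of size n per key.
// scala's PriorityQueue is a max-heap under the default Ordering, so the
// largest retained value sits on top and is evicted first.
def topNPerKey(rows: Seq[(Int, String)], n: Int): Map[Int, List[String]] = {
  val heaps = mutable.Map.empty[Int, mutable.PriorityQueue[String]]
  for ((key, value) <- rows) {
    val heap = heaps.getOrElseUpdate(key, mutable.PriorityQueue.empty[String])
    heap.enqueue(value)
    if (heap.size > n) heap.dequeue() // drop the largest, keep the n smallest
  }
  heaps.map { case (k, h) => k -> h.toList.sorted }.toMap
}

val data = Seq((1, "row1"), (2, "row2"), (1, "row11"), (1, "row0"))
val top2 = topNPerKey(data, 2)
// top2(1) == List("row0", "row1"); top2(2) == List("row2")
```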