spark sql - whether to use row transformation or UDF -
i having input table (i) 100 columns , 10 million records. want output table (o) has 50 columns , these columns derived columns of i.e. there 50 functions map column(s) of 50 columns of o i.e. o1 = f(i1) , o2 = f(i2, i3) ..., o50 = f(i50, i60, i70).
in spark sql can in 2 ways:
- row transformation entire row of parsed (ex: map function) 1 one produce row of o.
- use udf guess work @ column level i.e. take existing column(s) of input , produce 1 of corresponding column of o i.e. use 50 udf functions.
i want know 1 of above 2 more efficient (higher distributed , parallel processing) , why or if equally fast/performant, given processing entire input table , producing entirely new output table o i.e. bulk data processing.
i going write whole thing catalyst optimizer, simpler note jacek laskowski says in book mastering apache spark 2:
"use higher-level standard column-based functions dataset operators whenever possible before reverting using own custom udf functions since udfs blackbox spark , not try optimize them."
jacek notes comment on spark development team:
"there simple cases in can analyze udfs byte code , infer doing, pretty difficult in general."
this why spark udfs should never first option.
that same sentiment echoed in cloudera post, author states "...using apache spark’s built-in sql query functions lead best performance , should first approach considered whenever introducing udf can avoided."
however, author correctly notes may change in future spark gets smarter, , in meantime, can use expression.gencode
, described in chris fregly’s talk, if don't mind tightly coupling catalyst optimizer.
Comments
Post a Comment