spark sql - whether to use row transformation or UDF -


I have an input table (I) with 100 columns and 10 million records. I want an output table (O) that has 50 columns, each of which is derived from columns of I; that is, there are 50 functions mapping column(s) of I to the 50 columns of O, e.g. o1 = f(i1), o2 = f(i2, i3), ..., o50 = f(i50, i60, i70).

In Spark SQL this can be done in two ways (both sketched in the code after the list below):

  1. Row transformation: the entire row of I is parsed (e.g. with a map function), one row at a time, to produce a row of O.
  2. UDFs, which I guess work at the column level, i.e. take existing column(s) of I and produce the corresponding column of O, so 50 UDF functions in all.
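
A minimal sketch of the two approaches for a couple of the 50 derivations. The column names (i1, i2, i3), the case classes, and the functions f1/f2 are illustrative assumptions, not part of the original question:

    // Hypothetical input/output schemas and derivation functions.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    case class In(i1: Int, i2: Int, i3: Int)
    case class Out(o1: Int, o2: Int)

    object DeriveColumns {
      // Hypothetical derivations: o1 = f1(i1), o2 = f2(i2, i3).
      def f1(i1: Int): Int = i1 * 2
      def f2(i2: Int, i3: Int): Int = i2 + i3

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("derive-columns").getOrCreate()
        import spark.implicits._

        val input = Seq(In(1, 2, 3), In(4, 5, 6)).toDS()

        // Approach 1: row transformation - deserialize each row, apply the
        // functions, and build a whole row of the output table O.
        val viaMap = input.map(r => Out(f1(r.i1), f2(r.i2, r.i3)))

        // Approach 2: one UDF per derived column, all applied in a single select.
        val f1Udf = udf(f1 _)
        val f2Udf = udf(f2 _)
        val viaUdfs = input.select(f1Udf($"i1").as("o1"), f2Udf($"i2", $"i3").as("o2"))

        viaMap.show()
        viaUdfs.show()
        spark.stop()
      }
    }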

I want to know which of the above two is more efficient (better distributed and parallel processing) and why, or whether they are equally fast/performant, given that I am processing the entire input table I and producing an entirely new output table O, i.e. bulk data processing.

I was going to write a whole thing about the Catalyst optimizer, but it is simpler to note what Jacek Laskowski says in his book Mastering Apache Spark 2:

"use higher-level standard column-based functions dataset operators whenever possible before reverting using own custom udf functions since udfs blackbox spark , not try optimize them."

Jacek also notes a comment from someone on the Spark development team:

"there simple cases in can analyze udfs byte code , infer doing, pretty difficult in general."

This is why Spark UDFs should never be your first option.
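
A hedged illustration of that advice: the same derived column written once as a UDF (a blackbox for Catalyst) and once with built-in column functions that Catalyst can see and optimize. The column names and data here are assumptions, not from the question:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{concat_ws, udf, upper}

    object BuiltInVsUdf {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("builtin-vs-udf").getOrCreate()
        import spark.implicits._

        val df = Seq(("ada", "lovelace"), ("alan", "turing")).toDF("first", "last")

        // UDF version: Spark cannot look inside the Scala closure.
        val fullNameUdf = udf((first: String, last: String) => (first + " " + last).toUpperCase)
        val withUdf = df.select(fullNameUdf($"first", $"last").as("full_name"))

        // Built-in version: expressed entirely in Catalyst expressions.
        val withBuiltins = df.select(upper(concat_ws(" ", $"first", $"last")).as("full_name"))

        withUdf.explain()       // the plan shows an opaque UDF call
        withBuiltins.explain()  // the plan shows upper(concat_ws(...)) expressions
        spark.stop()
      }
    }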

That same sentiment is echoed in a Cloudera post, where the author states "...using Apache Spark’s built-in SQL query functions will often lead to the best performance and should be the first approach considered whenever introducing a UDF can be avoided."

However, the author also rightly notes that this may change in the future as Spark gets smarter, and in the meantime you can use Expression.genCode, as described in Chris Fregly’s talk, if you don't mind tightly coupling to the Catalyst optimizer.

