spark sql - whether to use row transformation or UDF -


I have an input table (I) with 100 columns and 10 million records. I want an output table (O) that has 50 columns, each of which is derived from columns of I; that is, there are 50 functions mapping column(s) of I to the 50 columns of O, e.g. o1 = f(i1), o2 = f(i2, i3), ..., o50 = f(i50, i60, i70).

In Spark SQL this can be done in two ways (both sketched in the code after the list below):

  1. Row transformation: the entire row of I is parsed (e.g. with a map function), one row at a time, to produce a row of O.
  2. UDFs, which I guess work at the column level, i.e. take existing column(s) of I and produce the corresponding column of O, so 50 UDF functions in all.
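
A minimal sketch of the two approaches for a couple of the 50 derivations. The column names (i1, i2, i3), the case classes, and the functions f1/f2 are illustrative assumptions, not part of the original question:

    // Hypothetical input/output schemas and derivation functions.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    case class In(i1: Int, i2: Int, i3: Int)
    case class Out(o1: Int, o2: Int)

    object DeriveColumns {
      // Hypothetical derivations: o1 = f1(i1), o2 = f2(i2, i3).
      def f1(i1: Int): Int = i1 * 2
      def f2(i2: Int, i3: Int): Int = i2 + i3

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("derive-columns").getOrCreate()
        import spark.implicits._

        val input = Seq(In(1, 2, 3), In(4, 5, 6)).toDS()

        // Approach 1: row transformation - deserialize each row, apply the
        // functions, and build a whole row of the output table O.
        val viaMap = input.map(r => Out(f1(r.i1), f2(r.i2, r.i3)))

        // Approach 2: one UDF per derived column, all applied in a single select.
        val f1Udf = udf(f1 _)
        val f2Udf = udf(f2 _)
        val viaUdfs = input.select(f1Udf($"i1").as("o1"), f2Udf($"i2", $"i3").as("o2"))

        viaMap.show()
        viaUdfs.show()
        spark.stop()
      }
    }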

I want to know which of the above two is more efficient (better distributed and parallel processing) and why, or whether they are equally fast/performant, given that I am processing the entire input table I and producing an entirely new output table O, i.e. bulk data processing.

I was going to write a whole thing about the Catalyst optimizer, but it is simpler to note what Jacek Laskowski says in his book Mastering Apache Spark 2:

"use higher-level standard column-based functions dataset operators whenever possible before reverting using own custom udf functions since udfs blackbox spark , not try optimize them."

Jacek also notes a comment from someone on the Spark development team:

"there simple cases in can analyze udfs byte code , infer doing, pretty difficult in general."

This is why Spark UDFs should never be your first option.
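
A hedged illustration of that advice: the same derived column written once as a UDF (a blackbox for Catalyst) and once with built-in column functions that Catalyst can see and optimize. The column names and data here are assumptions, not from the question:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{concat_ws, udf, upper}

    object BuiltInVsUdf {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("builtin-vs-udf").getOrCreate()
        import spark.implicits._

        val df = Seq(("ada", "lovelace"), ("alan", "turing")).toDF("first", "last")

        // UDF version: Spark cannot look inside the Scala closure.
        val fullNameUdf = udf((first: String, last: String) => (first + " " + last).toUpperCase)
        val withUdf = df.select(fullNameUdf($"first", $"last").as("full_name"))

        // Built-in version: expressed entirely in Catalyst expressions.
        val withBuiltins = df.select(upper(concat_ws(" ", $"first", $"last")).as("full_name"))

        withUdf.explain()       // the plan shows an opaque UDF call
        withBuiltins.explain()  // the plan shows upper(concat_ws(...)) expressions
        spark.stop()
      }
    }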

That same sentiment is echoed in a Cloudera post, where the author states "...using Apache Spark’s built-in SQL query functions will often lead to the best performance and should be the first approach considered whenever introducing a UDF can be avoided."

However, the author also rightly notes that this may change in the future as Spark gets smarter, and in the meantime you can use Expression.genCode, as described in Chris Fregly’s talk, if you don't mind tightly coupling to the Catalyst optimizer.

