Apache Spark - How to unnest array with keys to join on afterwards?
I have two tables, namely table1 and table2. table1 is big, whereas table2 is small. Also, I have a UDF whose interface is defined below:
--table1--
id
1
2
3

--table2--
category
a
b
c
d
e
f
g

udf: foo(id: Int): List[String]
I intend to call the UDF first to get the corresponding categories, foo(table1.id), which returns a WrappedArray. Then I want to join on every category in table2 for some more manipulation. The expected result should be like this:
--view--
id,category
1,a
1,c
1,d
2,b
2,c
3,e
3,f
3,g
I tried to find an unnest method in Hive, but had no luck. Can anyone help me out? Thanks!
I believe you want to use the explode function or the Dataset's flatMap operator.

The explode function creates a new row for each element in the given array or map column.

The flatMap operator returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.

After you execute your UDF foo(id: Int): List[String], you'll end up with a Dataset with a column of type array.
val fooUDF = udf { id: Int => ('a' to ('a'.toInt + id).toChar).map(_.toString) }

// table1 with fooUDF applied
val table1 = spark.range(3).withColumn("foo", fooUDF('id))

scala> table1.show
+---+---------+
| id|      foo|
+---+---------+
|  0|      [a]|
|  1|   [a, b]|
|  2|[a, b, c]|
+---+---------+

scala> table1.printSchema
root
 |-- id: long (nullable = false)
 |-- foo: array (nullable = true)
 |    |-- element: string (containsNull = true)

scala> table1.withColumn("fooExploded", explode($"foo")).show
+---+---------+-----------+
| id|      foo|fooExploded|
+---+---------+-----------+
|  0|      [a]|          a|
|  1|   [a, b]|          a|
|  1|   [a, b]|          b|
|  2|[a, b, c]|          a|
|  2|[a, b, c]|          b|
|  2|[a, b, c]|          c|
+---+---------+-----------+
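The flatMap route mentioned above can be sketched similarly. This is a minimal sketch, assuming the same table1 as above; the intermediate tuple type and the column names are illustrative choices, not part of the original answer.

```scala
// flatMap-based alternative: convert to a typed Dataset, emit one
// (id, category) tuple per array element, then name the columns.
val flattened = table1
  .select($"id", $"foo")
  .as[(Long, Seq[String])]
  .flatMap { case (id, foos) => foos.map(category => (id, category)) }
  .toDF("id", "category")
```

Prefer explode when you are staying in the untyped DataFrame API; flatMap is handy when you already work with a typed Dataset.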
With that, the join should be quite easy.
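For completeness, the join step can be sketched like this. It is a sketch only, assuming table2 is a DataFrame with a single "category" column as in the question; the variable names are illustrative.

```scala
// Explode the array into one row per category, then equi-join with
// table2 on "category" to produce the (id, category) view.
val exploded = table1.withColumn("category", explode($"foo"))
val view = exploded.join(table2, "category").select("id", "category")
```

An inner join keeps only categories that actually exist in table2, which matches the expected view in the question.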