apache spark - How to unnest array with keys to join on afterwards?


I have 2 tables, namely table1 and table2. table1 is big, whereas table2 is small. I also have a UDF whose interface is defined below:

--table1--
id
1
2
3

--table2--
category
b
c
d
e
f
g

udf: foo(id: Int): List[String]

I intend to call the UDF first to get the corresponding categories: foo(table1.id), which returns a WrappedArray. Then I want to join every category with table2 for more manipulation. The expected result should be like this:

--view--
id,category
1,a
1,c
1,d
2,b
2,c
3,e
3,f
3,g

I tried to find an unnest method in Hive, but had no luck. Can anyone help me out? Thanks!

I believe you want to use the explode function or the Dataset's flatMap operator.

The explode function creates a new row for each element in the given array or map column.

The flatMap operator returns a new Dataset by first applying a function to all elements of the Dataset, and then flattening the results.
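For reference, here is a minimal, self-contained sketch of the flatMap route. The sample data and column names are illustrative, not taken from the question:

import spark.implicits._

// Illustrative data: each id paired with the categories foo(id) would return.
val withArrays = Seq((1L, Seq("a", "c", "d")), (2L, Seq("b", "c"))).toDS

// flatMap emits one (id, category) row per element of the array.
val flattened = withArrays
  .flatMap { case (id, cats) => cats.map(c => (id, c)) }
  .toDF("id", "category")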

After you execute the UDF foo(id: Int): List[String], you'll end up with a Dataset with a column of type array.

val fooUDF = udf { id: Int => ('a' to ('a'.toInt + id).toChar).map(_.toString) }

// table1 with fooUDF applied
val table1 = spark.range(3).withColumn("foo", fooUDF('id))

scala> table1.show
+---+---------+
| id|      foo|
+---+---------+
|  0|      [a]|
|  1|   [a, b]|
|  2|[a, b, c]|
+---+---------+

scala> table1.printSchema
root
 |-- id: long (nullable = false)
 |-- foo: array (nullable = true)
 |    |-- element: string (containsNull = true)

scala> table1.withColumn("fooExploded", explode($"foo")).show
+---+---------+-----------+
| id|      foo|fooExploded|
+---+---------+-----------+
|  0|      [a]|          a|
|  1|   [a, b]|          a|
|  1|   [a, b]|          b|
|  2|[a, b, c]|          a|
|  2|[a, b, c]|          b|
|  2|[a, b, c]|          c|
+---+---------+-----------+

With that, the join should be quite easy.
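For example, assuming table2 has been loaded as a DataFrame with a single category column (not shown above), the final view could be built roughly like this:

val view = table1
  .withColumn("category", explode($"foo"))  // one row per (id, category) pair
  .join(table2, "category")                 // keep only categories present in table2
  .select("id", "category")

view.show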

