apache spark - How to unnest array with keys to join on afterwards?


I have 2 tables, namely table1 and table2. table1 is big, whereas table2 is small. Also, I have a udf function with the interface defined below:

--table1--
id
1
2
3

--table2--
category
b
c
d
e
f
g

udf: foo(id: Int): List[String]

I intend to call the udf first to get the corresponding categories: foo(table1.id), which returns a WrappedArray. Then I want to join it with every category in table2 for more manipulation. The expected result should be like this:

--view--
id,category
1,a
1,c
1,d
2,b
2,c
3,e
3,f
3,g

I tried to find an unnest method in Hive, but with no luck. Can anyone help me out? Thanks!

I believe you want to use the explode function or the Dataset's flatMap operator.

The explode function creates a new row for each element in the given array or map column.

The flatMap operator returns a new Dataset by first applying a function to all elements of the Dataset, and then flattening the results.
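As a minimal sketch of the flatMap route (assuming a SparkSession in scope as `spark`, as in the spark-shell, and a `table1` with an `id` column and an array column `foo` as produced below; the names `exploded` and `categories` are illustrative):

```scala
import spark.implicits._

// Each input row (id, foo) becomes one output row per category in foo.
val exploded = table1
  .as[(Long, Seq[String])]
  .flatMap { case (id, categories) =>
    categories.map(category => (id, category))
  }
  .toDF("id", "category")
```

This produces the same rows as the explode-based version, at the cost of going through the typed Dataset API.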

After you execute your udf foo(id: Int): List[String], you'll end up with a Dataset with a column of type array.

val fooUDF = udf { id: Int => ('a' to ('a'.toInt + id).toChar).map(_.toString) }

// table1 with fooUDF applied
val table1 = spark.range(3).withColumn("foo", fooUDF('id))

scala> table1.show
+---+---------+
| id|      foo|
+---+---------+
|  0|      [a]|
|  1|   [a, b]|
|  2|[a, b, c]|
+---+---------+

scala> table1.printSchema
root
 |-- id: long (nullable = false)
 |-- foo: array (nullable = true)
 |    |-- element: string (containsNull = true)

scala> table1.withColumn("fooExploded", explode($"foo")).show
+---+---------+-----------+
| id|      foo|fooExploded|
+---+---------+-----------+
|  0|      [a]|          a|
|  1|   [a, b]|          a|
|  1|   [a, b]|          b|
|  2|[a, b, c]|          a|
|  2|[a, b, c]|          b|
|  2|[a, b, c]|          c|
+---+---------+-----------+

With that, the join should be quite easy.
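For completeness, a sketch of that final step (assuming `table2` is a DataFrame with a single `category` column; the name `view` is illustrative):

```scala
import org.apache.spark.sql.functions.explode

// Explode the array into one row per category, then inner-join on it.
val view = table1
  .withColumn("category", explode($"foo"))
  .join(table2, Seq("category"))
  .select($"id", $"category")
```

Joining on the exploded column keeps only (id, category) pairs whose category actually appears in table2.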

