I want to measure Spark's performance of dataframe aggregation. Count or Collect action? -
i want create performance results statistics of dataframe on spark. calling count() action after groupby , measuring time took.
df.groupby('student').sum('marks').count()
however, found if use collect() instead of count(), results took 10 time more time. why?
and, method should use count() or collect() if performing benchmarking tests above.
thanks.
spark dataframes use query optimizer (called catalyst) speed spark pipelines. in case, there 2 possibilities what's happening:
collect more expensive count. involves taking of data distributed across cluster, serializing it, sending across network driver, , deserializing it. count involves computing number once per task , sending (much smaller).
catalyst counting number of unique "student" values. result of "sum" never used, never needs computed!
Comments
Post a Comment