Avoid reduceByKey when the input and output value types are different
For example, consider writing a transformation that finds all the unique strings corresponding to each key. One way would be to use map to transform each element into a Set and then combine the Sets with reduceByKey
:
This code results in tons of unnecessary object creation because a new Set must be allocated for each record. It’s better to use aggregateByKey
, which performs the map-side aggregation more efficiently:
PreviousAvoid groupByKey when performing an associative reductive operationNextAvoid the flatMap-join-groupBy pattern
Last updated