Avoid groupByKey when grouping multiple items by key

As already shown in [21], let's suppose we've got an RDD of items like:

(3922774869,10,1)
(3922774869,11,1)
(3922774869,12,2)
(3922774869,13,2)
(1779744180,10,1)
(1779744180,11,1)
(3922774869,14,3)
(3922774869,15,2)
(1779744180,16,1)
(3922774869,12,1)
(3922774869,13,1)
(1779744180,14,1)
(1779744180,15,1)
(1779744180,16,1)
(3922774869,14,2)
(3922774869,15,1)
(1779744180,16,1)
(1779744180,17,1)
(3922774869,16,4)
...

which represent (id, age, count). We want to group those lines to generate a dataset in which each line represents the age distribution of one id, i.e. (id, (age, count), (age, count), ...), where each (id, age) pair is unique.

The easiest way is to first reduce by both fields and then use groupBy:
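A minimal sketch of that approach, assuming the input RDD is named `rdd` and holds (Long, Int, Int) triples:

```scala
// Key by (id, age), sum the counts, then regroup by id alone
val grouped = rdd
  .map { case (id, age, cnt) => ((id, age), cnt) }
  .reduceByKey(_ + _)                          // sum counts per (id, age)
  .map { case ((id, age), cnt) => (id, (age, cnt)) }
  .groupByKey()                                // collect (age, count) pairs per id
```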

This returns an RDD[(Long, Iterable[(Int, Int)])]; for the input above it would contain these two records:
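Summing the counts per (id, age) pair over the rows shown above (ignoring the elided rows, and up to ordering within each Iterable):

(3922774869, List((10,1), (11,1), (12,3), (13,3), (14,5), (15,3), (16,4)))
(1779744180, List((10,1), (11,1), (14,1), (15,1), (16,3), (17,1)))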

But if you have a very large dataset, you should avoid groupByKey in order to reduce shuffling: groupByKey sends every value across the network, whereas aggregateByKey combines values locally on each partition before the shuffle.

Instead you can use aggregateByKey:
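A sketch of the aggregateByKey version, under the same assumption that the input RDD is named `rdd`; it accumulates the (age, count) pairs into a per-partition mutable HashSet and merges the sets, instead of shuffling every pair:

```scala
import scala.collection.mutable

val ageDist = rdd
  .map { case (id, age, cnt) => ((id, age), cnt) }
  .reduceByKey(_ + _)                          // sum counts per (id, age)
  .map { case ((id, age), cnt) => (id, (age, cnt)) }
  .aggregateByKey(mutable.HashSet.empty[(Int, Int)])(
    (set, pair) => set += pair,                // add a pair within a partition
    (s1, s2)    => s1 ++= s2                   // merge sets across partitions
  )
```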

This will result in:
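That is, an RDD[(Long, mutable.HashSet[(Int, Int)])] holding the same distributions as before (set ordering is arbitrary):

(3922774869, Set((10,1), (11,1), (12,3), (13,3), (14,5), (15,3), (16,4)))
(1779744180, Set((10,1), (11,1), (14,1), (15,1), (16,3), (17,1)))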

And you will be able to print the values as:
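For example, collecting to the driver and printing one id per line (assuming the RDD above is named `ageDist`):

```scala
ageDist.collect().foreach { case (id, dist) =>
  // Sort the (age, count) pairs by age for readable output
  println(s"$id -> ${dist.toList.sorted.mkString(", ")}")
}
```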

Shuffling can be a major bottleneck. Having many big HashSets (depending on your dataset) could also be a problem. However, it's more likely that you'll have a large amount of RAM than a fast network: local in-memory reads and writes are much faster than sending records across distributed machines, so trading some memory for less shuffling is usually a win.

Here are more functions to prefer over groupByKey:

  • combineByKey can be used when you are combining elements but your return type differs from your input value type.

  • foldByKey merges the values for each key using an associative function and a neutral "zero value".
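Hedged sketches of both, on the same dataset and still assuming an input RDD named `rdd` of (Long, Int, Int) triples:

```scala
// combineByKey: input values are (age, count) pairs, but the result type
// is List[(Int, Int)] -- a different type, which reduceByKey cannot produce
val pairs = rdd.map { case (id, age, cnt) => (id, (age, cnt)) }
val combined = pairs.combineByKey(
  (v: (Int, Int)) => List(v),                            // createCombiner
  (acc: List[(Int, Int)], v: (Int, Int)) => v :: acc,    // mergeValue
  (a: List[(Int, Int)], b: List[(Int, Int)]) => a ::: b  // mergeCombiners
)

// foldByKey: sum the counts per (id, age), starting from the zero value 0
val totals = rdd
  .map { case (id, age, cnt) => ((id, age), cnt) }
  .foldByKey(0)(_ + _)
```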
