which represent (id, age, count), and we want to group those lines to generate a dataset in which each line represents the age distribution of one id, like this ((id, age) is unique):
Shuffling can be a major bottleneck. Holding many large HashSets (depending on your dataset) could also be a problem. However, you are more likely to have plenty of RAM than a fast network, so combining values in memory is generally quicker than reading and writing them across distributed machines.
Here are more functions to prefer over groupByKey:
combineByKey can be used when you are combining elements but your return type differs from your input value type. You can see an example here
foldByKey merges the values for each key using an associative function and a neutral "zero value".
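To make the difference between the two concrete, here is a rough sketch in plain Python (not Spark itself) that models the three functions combineByKey takes (createCombiner, mergeValue, mergeCombiners) and treats foldByKey as the special case where the result type equals the value type. The helper names combine_by_key and fold_by_key, the sample rows, and the two-partition layout are all made up for illustration; a real job would call the methods on a pair RDD instead.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Hypothetical plain-Python model of the combineByKey contract."""
    # Step 1: combine inside each partition, before anything is shuffled.
    per_partition = []
    for part in partitions:
        local = {}
        for key, value in part:
            if key in local:
                local[key] = merge_value(local[key], value)
            else:
                local[key] = create_combiner(value)
        per_partition.append(local)

    # Step 2: merge the per-partition combiners for each key.
    result = {}
    for local in per_partition:
        for key, combiner in local.items():
            if key in result:
                result[key] = merge_combiners(result[key], combiner)
            else:
                result[key] = combiner
    return result


def fold_by_key(partitions, zero, func):
    """Modeled as combine_by_key where one associative func does every step."""
    return combine_by_key(partitions, lambda v: func(zero, v), func, func)


def add_age(dist, value):
    # Merge one (age, count) value into an existing {age: count} dict.
    age, count = value
    dist[age] = dist.get(age, 0) + count
    return dist


def merge_dists(left, right):
    # Merge two {age: count} dicts built on different partitions.
    for age, count in right.items():
        left[age] = left.get(age, 0) + count
    return left


# Made-up (id, (age, count)) rows, split across two pretend partitions.
partitions = [
    [("a", (20, 3)), ("a", (30, 1)), ("b", (20, 5))],
    [("a", (40, 2)), ("b", (50, 2))],
]

# Input values are (age, count) tuples, the result per id is a dict:
# the return type differs from the value type, so combine_by_key fits.
distributions = combine_by_key(
    partitions,
    lambda value: {value[0]: value[1]},
    add_age,
    merge_dists,
)

# Summing counts per id keeps the same type (int), so fold_by_key is enough.
count_partitions = [[(key, value[1]) for key, value in part] for part in partitions]
totals = fold_by_key(count_partitions, 0, lambda x, y: x + y)

print(distributions)  # {'a': {20: 3, 30: 1, 40: 2}, 'b': {20: 5, 50: 2}}
print(totals)         # {'a': 6, 'b': 7}
```

The point of the sketch is that each partition is reduced to one small combiner per key before the merge step, which is why these functions tend to move less data than groupByKey.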