Shuffling can be a major bottleneck. Having many large HashSets (depending on your dataset) could also be a problem. However, you are more likely to have plenty of RAM than a low-latency network, so holding the sets in memory usually costs less than the reads and writes across distributed machines that a shuffle requires.
Here are more functions to prefer over groupByKey:
combineByKey can be used when you are combining elements but your return type differs from your input value type.
foldByKey merges the values for each key using an associative function and a neutral "zero value".
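To make the difference concrete, here is a minimal sketch in plain Python (no Spark required) that models what combineByKey and foldByKey do. The two lists stand in for RDD partitions, and the helpers `combine_by_key` and `fold_by_key` are hypothetical names made up for this illustration, not Spark APIs. In PySpark the corresponding calls would be `rdd.combineByKey(createCombiner, mergeValue, mergeCombiners)` and `rdd.foldByKey(zeroValue, func)`.

```python
# Hypothetical data: two "partitions" of (key, value) pairs standing in for an RDD.
partitions = [
    [("a", 1), ("b", 2), ("a", 1), ("a", 3)],
    [("a", 3), ("b", 4), ("b", 2)],
]


def combine_by_key(parts, create_combiner, merge_value, merge_combiners):
    """Model of combineByKey: combine within each partition, then merge across them."""
    per_partition = []
    for part in parts:
        local = {}
        for key, value in part:
            if key in local:
                # Key already seen in this partition: fold the value into its combiner.
                local[key] = merge_value(local[key], value)
            else:
                # First value for this key in this partition: start a new combiner.
                local[key] = create_combiner(value)
        per_partition.append(local)

    # In a real cluster, only these per-partition combiners would be shuffled.
    merged = {}
    for local in per_partition:
        for key, combiner in local.items():
            if key in merged:
                merged[key] = merge_combiners(merged[key], combiner)
            else:
                merged[key] = combiner
    return merged


def fold_by_key(parts, zero_value, func):
    """Model of foldByKey: same idea, with one function and a neutral zero value."""
    return combine_by_key(parts, lambda v: func(zero_value, v), func, func)


# Input values are ints, but the result per key is a set: the return type
# differs from the input value type, which is the combineByKey use case.
unique_values = combine_by_key(
    partitions,
    create_combiner=lambda v: {v},
    merge_value=lambda s, v: s | {v},
    merge_combiners=lambda s1, s2: s1 | s2,
)

# Sum per key; 0 is the neutral zero value for addition.
sums = fold_by_key(partitions, 0, lambda x, y: x + y)

print(unique_values)
print(sums)
```

The point of the sketch is that values are merged inside each partition first, so duplicates collapse before anything moves between machines. With groupByKey, every individual value would be shuffled instead.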