rdd.groupByKey().mapValues(_.sum) will produce the same results as
rdd.reduceByKey(_ + _). However, the former will transfer the entire dataset across the network, while the latter will compute local sums for each key in each partition and combine those local sums into larger sums after shuffling.
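The difference in shuffle volume can be sketched without a Spark cluster. The following plain-Scala sketch simulates two partitions of (word, 1) pairs (the partition boundaries and data are made up for illustration): the groupByKey-style path ships every raw record before counting, while the reduceByKey-style path pre-aggregates inside each partition and ships only one record per key per partition. Both paths agree on the final counts.

```scala
// Plain-Scala simulation of map-side combine; no Spark required.
object ShuffleSketch {
  // Two "partitions" of (word, 1) pairs, as a word-count job would see them.
  val partitions: Seq[Seq[(String, Int)]] = Seq(
    Seq(("spark", 1), ("rdd", 1), ("spark", 1)),
    Seq(("spark", 1), ("rdd", 1))
  )

  // groupByKey-style: every raw record crosses the network before counting.
  val shuffledByGroup: Seq[(String, Int)] = partitions.flatten

  // reduceByKey-style: each partition sums locally first, so only one
  // record per key per partition is shuffled.
  val shuffledByReduce: Seq[(String, Int)] =
    partitions.flatMap { part =>
      part.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }
    }

  // Final reduce-side aggregation, identical for both strategies.
  def finalCounts(records: Seq[(String, Int)]): Map[String, Int] =
    records.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    // Same result either way, but 5 records shuffled vs only 4.
    assert(finalCounts(shuffledByGroup) == finalCounts(shuffledByReduce))
    println(s"${shuffledByGroup.size} records vs ${shuffledByReduce.size} records shuffled")
  }
}
```

On a toy dataset the saving is one record; on a real dataset with many repeated keys per partition, the pre-aggregated path shuffles orders of magnitude less data.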
As an example, let's use word count: we can process an RDD and find the frequency of each word using either transformation.

Word count using reduceByKey:

val wordPairsRDD = rdd.map(word => (word, 1))
val wordCountsWithReduce = wordPairsRDD.reduceByKey(_ + _).collect()
With reduceByKey, each worker node first counts the words in its own partitions, and only those per-partition counts are shuffled over the network to compute the final result.
On the other hand, if we use groupByKey for word count, as follows:

val wordCountsWithGroup = rdd.groupByKey().map(t => (t._1, t._2.sum)).collect()
With groupByKey, every worker node shuffles its raw (word, 1) pairs over the network, and the words are counted only after the shuffle, so a lot of unnecessary data is transferred.
So avoid groupByKey whenever the same result can be expressed with reduceByKey, which combines values on each partition before shuffling.
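For aggregations that don't fit a single commutative function, Spark also offers combineByKey, which performs the same map-side combine; reduceByKey is essentially the special case where the combiner is the value itself. As a hedged sketch of that contract (the function names mirror Spark's createCombiner / mergeValue / mergeCombiners parameters, but this is plain Scala over made-up partitioned data, not the actual Spark implementation):

```scala
// Plain-Scala sketch of combineByKey semantics over pre-partitioned data.
object CombineSketch {
  def combineByKeyLocal[K, V, C](
      partitions: Seq[Seq[(K, V)]],
      createCombiner: V => C,      // build a combiner from the first value of a key
      mergeValue: (C, V) => C,     // fold another value into a partition-local combiner
      mergeCombiners: (C, C) => C  // merge combiners from different partitions
  ): Map[K, C] = {
    // Map side: one combiner per key inside each partition.
    val perPartition: Seq[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.updated(k, acc.get(k).map(mergeValue(_, v)).getOrElse(createCombiner(v)))
      }
    }
    // Reduce side: only the combiners are "shuffled" and merged.
    perPartition.flatten.groupBy(_._1).map { case (k, cs) =>
      (k, cs.map(_._2).reduce(mergeCombiners))
    }
  }

  // Word count expressed through the combineByKey contract:
  // with these three functions it reduces to reduceByKey(_ + _).
  val counts: Map[String, Int] = combineByKeyLocal[String, Int, Int](
    Seq(Seq(("spark", 1), ("spark", 1)), Seq(("spark", 1), ("rdd", 1))),
    createCombiner = identity,
    mergeValue = _ + _,
    mergeCombiners = _ + _
  )
}
```

Like reduceByKey, anything built on this contract pre-aggregates per partition, so it keeps the shuffle small where groupByKey would not.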