Apache Spark - Best Practices and Tuning
Avoid groupByKey when performing an associative reductive operation


For example, rdd.groupByKey().mapValues(_.sum) will produce the same results as rdd.reduceByKey(_ + _). However, the former will transfer the entire dataset across the network, while the latter will compute local sums for each key in each partition and combine those local sums into larger sums after shuffling.

As in the previous section, let's use the word count example: you can process an RDD and compute the frequency of each word using either of the two transformations, groupByKey or reduceByKey.

Word count using reduceByKey:

val wordPairsRDD = rdd.map(word => (word, 1))
val wordCountsWithReduce = wordPairsRDD
  .reduceByKey(_ + _) // partial sums are computed within each partition before the shuffle
  .collect()

The diagram below shows how the RDD is processed and shuffled over the network:

As the diagram shows, each worker node first processes its own partition and counts the words on its own machine, and only those partial counts are shuffled to produce the final result.
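
To make the "local sums first" idea concrete, the sketch below emulates it by hand with mapPartitions. This is not what Spark actually executes (reduceByKey uses an internal map-side combiner); it assumes wordPairsRDD from the snippet above is an RDD[(String, Int)] and only makes the behavior visible:

// Emulate the map-side combine: sum within each partition first, so at most
// one (word, partialSum) pair per word per partition crosses the network.
val locallySummed = wordPairsRDD.mapPartitions { iter =>
  val sums = scala.collection.mutable.Map.empty[String, Int]
  iter.foreach { case (word, n) => sums(word) = sums.getOrElse(word, 0) + n }
  sums.iterator
}
// The final reduceByKey then merges the per-partition partial sums.
val wordCountsEmulated = locallySummed
  .reduceByKey(_ + _)
  .collect()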

On the other hand, if we use groupByKey for the word count, as follows:

val wordCountsWithGroup = wordPairsRDD
  .groupByKey()               // every raw (word, 1) pair is shuffled, unsummed
  .map(t => (t._1, t._2.sum)) // sums are only computed after the shuffle
  .collect()

The diagram below shows how the RDD is processed and shuffled over the network when using groupByKey:

As you can see, every worker node shuffles its raw pairs, and the words are only counted on the node that receives them, so with groupByKey a lot of unnecessary data is transferred over the network.
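
You can check this difference programmatically. The sketch below inspects the shuffle dependency of each result RDD; note that ShuffleDependency is a developer-level API, and the sketch assumes the default partitioner so that the first dependency of each result is the shuffle itself, so treat it as a diagnostic, not production code:

import org.apache.spark.ShuffleDependency
import org.apache.spark.rdd.RDD

// Returns whether the shuffle that produced this RDD combined values map-side.
def usesMapSideCombine(rdd: RDD[_]): Boolean =
  rdd.dependencies.head.asInstanceOf[ShuffleDependency[_, _, _]].mapSideCombine

println(usesMapSideCombine(wordPairsRDD.reduceByKey(_ + _))) // true: partial sums before the shuffle
println(usesMapSideCombine(wordPairsRDD.groupByKey()))       // false: raw pairs cross the network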

So, whenever the operation is an associative reduction, prefer reduceByKey and avoid groupByKey as much as possible.
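
Putting it all together, here is a minimal, self-contained sketch (local mode, with made-up input words) that runs both versions and confirms they produce the same counts, so the only difference is the shuffle volume:

import org.apache.spark.{SparkConf, SparkContext}

object WordCountComparison {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("WordCountComparison").setMaster("local[2]"))

    // Made-up input, for illustration only.
    val rdd = sc.parallelize(Seq("spark", "rdd", "spark", "tuning", "rdd", "spark"))
    val wordPairsRDD = rdd.map(word => (word, 1))

    val withReduce = wordPairsRDD.reduceByKey(_ + _).collect().toMap
    val withGroup  = wordPairsRDD.groupByKey().map(t => (t._1, t._2.sum)).collect().toMap

    // Same result either way; only the amount of shuffled data differs.
    assert(withReduce == withGroup)
    println(withReduce) // e.g. Map(spark -> 3, rdd -> 2, tuning -> 1)

    sc.stop()
  }
}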

[2]