Apache Spark - Best Practices and Tuning

Joining a large and a small Dataset

A technique to improve performance is to analyze the sizes of the DataFrames involved and pick the best join strategy accordingly.

If the smaller DataFrame is small enough to fit into the memory of each worker, we can turn a ShuffleHashJoin or SortMergeJoin into a BroadcastHashJoin. In a broadcast join, the smaller DataFrame is broadcast to all worker nodes. The BROADCAST hint guides Spark to broadcast the smaller DataFrame when joining it with the bigger one:

// The BROADCAST hint marks smallDf so it is replicated to every executor instead of shuffling largeDf
largeDf.join(smallDf.hint("broadcast"), Seq("id"))

This way, the larger DataFrame does not need to be shuffled at all.
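
For reference, the same join can be expressed with the broadcast function from org.apache.spark.sql.functions, and explain() lets us verify that the physical plan really uses a BroadcastHashJoin. Below is a minimal, self-contained sketch; the largeDf and smallDf DataFrames and the shared id column are hypothetical stand-ins for the tables being joined:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-join-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data: a large DataFrame of ids and a small lookup table
val largeDf = spark.range(0L, 10000000L).toDF("id")
val smallDf = Seq((1L, "a"), (2L, "b"), (3L, "c")).toDF("id", "label")

// broadcast() marks the small side; largeDf stays partitioned in place
val joined = largeDf.join(broadcast(smallDf), Seq("id"))

// The physical plan should show BroadcastHashJoin rather than SortMergeJoin
joined.explain()

Both forms are equivalent: the hint and the broadcast function simply mark the smaller side so the planner chooses the broadcast strategy.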

In more recent versions, Spark has increased the maximum size of a broadcast table from 2 GB to 8 GB; it is still not possible to broadcast tables larger than 8 GB.
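
Spark also applies this strategy automatically when the estimated size of one side of the join is below spark.sql.autoBroadcastJoinThreshold (10 MB by default). A small sketch of tuning it, assuming a SparkSession named spark is in scope and with the 100 MB value chosen purely for illustration:

// Raise the automatic broadcast threshold to 100 MB (illustrative value);
// tables estimated below this size are broadcast even without a hint.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024)

// Set it to -1 to disable automatic broadcast joins and rely on explicit hints only.
// spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)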
