Joining a large and a small RDD
If the small RDD fits into the memory of each worker, we can turn it into a broadcast variable and turn the entire operation into a so-called map-side join for the larger RDD. This way the larger RDD does not need to be shuffled at all. This situation is common when the smaller RDD is a dimension table.
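As a rough illustration, the idea can be sketched in PySpark as below. This is a minimal sketch, not a definitive implementation: the helper names `join_partition` and `map_side_join` are hypothetical, both inputs are assumed to be pair RDDs of `(key, value)` tuples, the keys of the small RDD are assumed to be unique, and the join is assumed to be an inner join.

```python
def join_partition(rows, lookup):
    """Inner-join one partition of (key, value) rows against an in-memory dict.

    Hypothetical helper: `lookup` is the small table as a plain dict.
    """
    for key, value in rows:
        if key in lookup:  # inner join: rows without a match are dropped
            yield key, (value, lookup[key])


def map_side_join(sc, large_rdd, small_rdd):
    """Join large_rdd with small_rdd without shuffling large_rdd.

    Assumptions: `sc` is an existing SparkContext, both RDDs hold
    (key, value) pairs, and keys in small_rdd are unique.
    """
    # Collect the small RDD to the driver as a dict and broadcast it,
    # so each executor receives a single read-only copy.
    small_lookup = sc.broadcast(small_rdd.collectAsMap())

    # mapPartitions is a narrow transformation: every partition of the
    # large RDD is joined in place against the broadcast dict.
    return large_rdd.mapPartitions(
        lambda rows: join_partition(rows, small_lookup.value)
    )
```

With an existing SparkContext `sc`, calling `map_side_join(sc, facts, dims).collect()` would return pairs of the form `(key, (fact_value, dim_value))`. Because the small table has to be collected on the driver before it is broadcast, this approach only makes sense when it fits comfortably in driver memory as well as in executor memory.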