Joining a large and a small RDD
If the small RDD fits into the memory of each worker, we can turn it into a broadcast variable and turn the entire operation into a so-called map-side join for the larger RDD [23]. This way the larger RDD does not need to be shuffled at all. This situation arises easily when the smaller RDD is a dimension table.
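// Collect the small RDD to the driver and broadcast it as a lookup map
val smallLookup = sc.broadcast(smallRDD.collect.toMap)
// Join on the map side: keys without a match yield None and are dropped by flatMap
largeRDD.flatMap { case (key, value) =>
  smallLookup.value.get(key).map { otherValue =>
    (key, (value, otherValue))
  }
}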
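The snippet above behaves like an inner join, and because of toMap it keeps only one value per key of the small side. The following is a minimal sketch of those semantics, not Spark code: plain Scala collections stand in for largeRDD and for smallLookup.value so that it runs without a cluster. The object name MapSideJoinSketch, the helper names innerJoin and leftOuterJoin, and the sample data are hypothetical and only serve as illustration. The left outer variant is an assumption about how one might keep unmatched records of the large side.

```scala
// Plain Scala collections stand in for Spark (assumption for illustration):
// `large` plays the role of largeRDD, `lookup` the role of smallLookup.value.
object MapSideJoinSketch {

  // Inner join: a key missing from the lookup yields None, so flatMap drops the record.
  def innerJoin[K, V, W](large: Seq[(K, V)], lookup: Map[K, W]): Seq[(K, (V, W))] =
    large.flatMap { case (key, value) =>
      lookup.get(key).map(other => (key, (value, other)))
    }

  // Left outer join: keep every record of the large side and wrap the match in an Option.
  def leftOuterJoin[K, V, W](large: Seq[(K, V)], lookup: Map[K, W]): Seq[(K, (V, Option[W]))] =
    large.map { case (key, value) => (key, (value, lookup.get(key))) }

  // Hypothetical sample data; key 1 appears twice on the small side.
  val large: Seq[(Int, String)] = Seq((1, "a"), (2, "b"), (3, "c"))
  val small: Seq[(Int, String)] = Seq((1, "x"), (1, "y"), (3, "z"))

  // toMap keeps a single value per key; the later entry overwrites the earlier one.
  val lookup: Map[Int, String] = small.toMap
}
```

With an RDD the same two function bodies can be passed to flatMap and map, reading the map from smallLookup.value. If the small side may contain duplicate keys, the result differs from a regular join, which would emit one row per matching pair; grouping the values per key before broadcasting is one way around that.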