Apache Spark - Best Practices and Tuning
Avoid the flatMap-join-groupBy pattern
Sometimes an application needs to join two datasets that are already grouped by key while keeping them grouped. A tempting approach is to flatMap each grouped dataset back into individual records, join the flat records, and then groupByKey again. Instead, you can just use cogroup, which joins the two datasets and keeps them grouped. That avoids all the overhead associated with unpacking and repacking the groups.
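The difference can be sketched in plain Python. This models grouped datasets as dicts of key to list of values; the function names `cogroup` and `flatmap_join_groupby` are illustrative here, not the Spark API:

```python
# Two datasets that are already grouped by key: key -> list of values.
grouped_a = {"x": [1, 2], "y": [3]}
grouped_b = {"x": [10], "y": [20, 30], "z": [40]}

def cogroup(a, b):
    """Join the two pre-grouped datasets directly, keeping groups intact."""
    return {k: (a.get(k, []), b.get(k, []))
            for k in a.keys() | b.keys()}

def flatmap_join_groupby(a, b):
    """The anti-pattern: unpack each group into flat (key, value) records,
    combine by key, then regroup -- same result, extra intermediate work."""
    flat_a = [(k, v) for k, vs in a.items() for v in vs]   # "flatMap"
    flat_b = [(k, v) for k, vs in b.items() for v in vs]
    keys = {k for k, _ in flat_a} | {k for k, _ in flat_b}
    return {k: ([v for kk, v in flat_a if kk == k],        # "groupBy"
                [v for kk, v in flat_b if kk == k])
            for k in keys}

# Both routes yield the same grouped result, but the second one pays
# for flattening and regrouping every record.
print(cogroup(grouped_a, grouped_b) == flatmap_join_groupby(grouped_a, grouped_b))  # True
```

In Spark the same idea applies to pair RDDs: `a.cogroup(b)` produces `(key, (Iterable[V], Iterable[W]))` directly, without materializing and reshuffling the individual records.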