Apache Spark - Best Practices and Tuning
References
[1] Controlling Parallelism in Spark
http://www.bigsynapse.com/spark-input-output
[2] Avoid GroupByKey
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
[3] Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Stra…
http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
[4] How-to: Tune Your Apache Spark Jobs (Part 1)
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
[5] How-to: Tune Your Apache Spark Jobs (Part 2)
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
[6] Top 5 Mistakes to Avoid When Writing Apache Spark Applications
https://intellipaat.com/blog/top-5-mistakes-writing-apache-spark-applications/
[7] Spark best practices
https://robertovitillo.com/2015/06/30/spark-best-practices/
[8] Best practice for retrieving big data from RDD to local machine
http://stackoverflow.com/questions/21698443/spark-best-practice-for-retrieving-big-data-from-rdd-to-local-machine
[9] Optimizing Spark Machine Learning for Small Data
http://eugenezhulenev.com/blog/2015/09/16/spark-ml-for-big-and-small-data/
[10] Tuning and Debugging in Apache Spark
http://www.slideshare.net/pwendell/tuning-and-debugging-in-apache-spark
[11] Advantage of Broadcast Variables
http://stackoverflow.com/questions/26884871/advantage-of-broadcast-variables
[12] When to use Broadcast variable?
https://blog.knoldus.com/2016/04/30/broadcast-variables-in-spark-how-and-when-to-use-them/
[13] Implement treeReduce and treeAggregate
https://issues.apache.org/jira/browse/SPARK-2174
[14] Shufflling and repartitioning of RDD’s in apache spark
https://blog.knoldus.com/2015/06/19/shufflling-and-repartitioning-of-rdds-in-apache-spark/
[15] Resource Allocation Configuration for Spark on YARN
https://www.mapr.com/blog/resource-allocation-configuration-spark-yarn
[16] Apache Spark: Config Cheatsheet
http://c2fo.io/c2fo/spark/aws/emr/2016/07/06/apache-spark-config-cheatsheet/
[17] Tuning Java Garbage Collection for Apache Spark Applications
https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
[18] How to set Apache Spark Executor memory
http://stackoverflow.com/questions/26562033/how-to-set-apache-spark-executor-memory
[19] How to interpret RDD.treeAggregate
http://stackoverflow.com/questions/29860635/how-to-interpret-rdd-treeaggregate
[20] Apache Spark 1.1: MLlib Performance Improvements
https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html
[21] Spark group multiple rdd items by key
http://stackoverflow.com/questions/36447057/spark-group-multiple-rdd-items-by-key
[22] Spark Corner Cases
http://codingjunkie.net/spark-corner-cases/
[23] Writing efficient Spark jobs
http://fdahms.com/2015/10/04/writing-efficient-spark-jobs/
[24] Efficient Data Storage for Analytics with Parquet 2.0
[25] Understanding Query Plans and Spark UIs
[26] Spark best practices