References

[1] Controlling Parallelism in Spark
- http://www.bigsynapse.com/spark-input-output
[2] Avoid GroupByKey
- https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
[3] Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Stra…
- http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
[4] How-to: Tune Your Apache Spark Jobs (Part 1)
- http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
[5] How-to: Tune Your Apache Spark Jobs (Part 2)
- http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
[6] Top 5 Mistakes to Avoid When Writing Apache Spark Applications
- https://intellipaat.com/blog/top-5-mistakes-writing-apache-spark-applications/
[7] Spark best practices
- https://robertovitillo.com/2015/06/30/spark-best-practices/
[8] Best practice for retrieving big data from RDD to local machine
- http://stackoverflow.com/questions/21698443/spark-best-practice-for-retrieving-big-data-from-rdd-to-local-machine
[9] Optimizing Spark Machine Learning for Small Data
- http://eugenezhulenev.com/blog/2015/09/16/spark-ml-for-big-and-small-data/
[10] Tuning and Debugging in Apache Spark
- http://www.slideshare.net/pwendell/tuning-and-debugging-in-apache-spark
[11] Advantage of Broadcast Variables
- http://stackoverflow.com/questions/26884871/advantage-of-broadcast-variables
[12] When to use Broadcast variable?
- https://blog.knoldus.com/2016/04/30/broadcast-variables-in-spark-how-and-when-to-use-them/
[13] Implement treeReduce and treeAggregate
- https://issues.apache.org/jira/browse/SPARK-2174
[14] Shuffling and repartitioning of RDDs in Apache Spark
- https://blog.knoldus.com/2015/06/19/shufflling-and-repartitioning-of-rdds-in-apache-spark/
[15] Resource Allocation Configuration for Spark on YARN
- https://www.mapr.com/blog/resource-allocation-configuration-spark-yarn
[16] Apache Spark: Config Cheatsheet
- http://c2fo.io/c2fo/spark/aws/emr/2016/07/06/apache-spark-config-cheatsheet/
[17] Tuning Java Garbage Collection for Apache Spark Applications
- https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
[18] How to set Apache Spark Executor memory
- http://stackoverflow.com/questions/26562033/how-to-set-apache-spark-executor-memory
[19] How to interpret RDD.treeAggregate
- http://stackoverflow.com/questions/29860635/how-to-interpret-rdd-treeaggregate
[20] Apache Spark 1.1: MLlib Performance Improvements
- https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html
[21] Spark group multiple rdd items by key
- http://stackoverflow.com/questions/36447057/spark-group-multiple-rdd-items-by-key
[22] Spark Corner Cases
- http://codingjunkie.net/spark-corner-cases/
[23] Writing efficient Spark jobs
- http://fdahms.com/2015/10/04/writing-efficient-spark-jobs/
[24] Efficient Data Storage for Analytics with Parquet 2.0
- Efficient Data Storage for Analytics with Parquet 2.0
[25] Understanding Query Plans and Spark UIs
- Understanding Query Plans and Spark UIs
[26] Spark best practices
- Spark best practices
