Apache Spark - Best Practices and Tuning
  • Introduction
  • RDD
    • Don’t collect large RDDs
    • Don't use count() when you don't need to return the exact number of rows
    • Avoiding Shuffles: "Fewer stages run faster"
    • Picking the Right Operators
      • Avoid List of Iterators
      • Avoid groupByKey when grouping multiple items by key
      • Avoid groupByKey when performing an associative reductive operation
      • Avoid reduceByKey when the input and output value types are different
      • Avoid the flatMap-join-groupBy pattern
      • Use TreeReduce/TreeAggregate instead of Reduce/Aggregate
      • Hash-partition before transformation over pair RDD
      • Use coalesce to repartition when decreasing the number of partitions
    • TreeReduce and TreeAggregate Demystified
    • When to use Broadcast variable
    • Joining a large and a small RDD
    • Joining a large and a medium size RDD
  • Dataframe
    • Joining a large and a small Dataset
    • Joining a large and a medium size Dataset
  • Storage
    • Use the Best Data Format
    • Cache Judiciously and use Checkpointing
  • Parallelism
    • Use the right level of parallelism
    • How to estimate the size of a Dataset
    • How to estimate the number of partitions and the executor and driver params (YARN Cluster Mode)
  • Serialization and GC
    • Tuning Java Garbage Collection
    • Serialization
  • References
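The "TreeReduce and TreeAggregate Demystified" entry above comes down to one idea: combine partition results in levels, pairwise, instead of pulling every partition's result straight to the driver. A minimal plain-Python sketch of that idea (no Spark involved; `tree_reduce` and its inputs are illustrative names, not Spark's API):

```python
from functools import reduce

def tree_reduce(partitions, op):
    """Reduce partition-local results in a multi-level tree, the way
    Spark's treeReduce avoids funnelling all results to one node."""
    # First, reduce each partition locally (what each executor does).
    results = [reduce(op, part) for part in partitions]
    # Then combine pairwise, level by level, until one value remains.
    while len(results) > 1:
        pairs = [results[i:i + 2] for i in range(0, len(results), 2)]
        results = [reduce(op, pair) for pair in pairs]
    return results[0]

parts = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]
total = tree_reduce(parts, lambda a, b: a + b)  # same result as a flat sum
```

With a flat `reduce`, the driver would combine one value per partition all at once; the tree variant spreads that combining work over intermediate levels, which is why it scales better with many partitions.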
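Likewise, "Joining a large and a small RDD" above refers to the map-side (broadcast) join pattern: ship a copy of the small dataset to every task and join locally, so the large dataset is never shuffled. A plain-Python sketch of the pattern (the `broadcast_join` name and sample data are illustrative, not Spark's API):

```python
def broadcast_join(large, small):
    """Map-side join: build a lookup table from the small side (the
    'broadcast' copy) and probe it row by row on the large side."""
    lookup = dict(small)  # small enough to fit in each task's memory
    return [(k, (v, lookup[k])) for k, v in large if k in lookup]

orders = [("u1", 10), ("u2", 25), ("u3", 7), ("u1", 4)]
users = [("u1", "alice"), ("u3", "carol")]
joined = broadcast_join(orders, users)
# inner-join semantics: only keys present on the small side survive
```

In real Spark this corresponds to wrapping the small dataset in a broadcast variable (or letting the optimizer pick a broadcast hash join for small Datasets) instead of calling a shuffle-based `join`.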
References


  • [1] Controlling Parallelism in Spark (http://www.bigsynapse.com/spark-input-output)
  • [2] Avoid GroupByKey (https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html)
  • [3] Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Stra… (http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs)
  • [4] How-to: Tune Your Apache Spark Jobs (Part 1) (http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/)
  • [5] How-to: Tune Your Apache Spark Jobs (Part 2) (http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/)
  • [6] Top 5 Mistakes to Avoid When Writing Apache Spark Applications (https://intellipaat.com/blog/top-5-mistakes-writing-apache-spark-applications/)
  • [7] Spark best practices (https://robertovitillo.com/2015/06/30/spark-best-practices/)
  • [8] Best practice for retrieving big data from RDD to local machine (http://stackoverflow.com/questions/21698443/spark-best-practice-for-retrieving-big-data-from-rdd-to-local-machine)
  • [9] Optimizing Spark Machine Learning for Small Data (http://eugenezhulenev.com/blog/2015/09/16/spark-ml-for-big-and-small-data/)
  • [10] Tuning and Debugging in Apache Spark (http://www.slideshare.net/pwendell/tuning-and-debugging-in-apache-spark)
  • [11] Advantage of Broadcast Variables (http://stackoverflow.com/questions/26884871/advantage-of-broadcast-variables)
  • [12] When to use Broadcast variable? (https://blog.knoldus.com/2016/04/30/broadcast-variables-in-spark-how-and-when-to-use-them/)
  • [13] Implement treeReduce and treeAggregate (https://issues.apache.org/jira/browse/SPARK-2174)
  • [14] Shufflling and repartitioning of RDD’s in apache spark (https://blog.knoldus.com/2015/06/19/shufflling-and-repartitioning-of-rdds-in-apache-spark/)
  • [15] Resource Allocation Configuration for Spark on YARN (https://www.mapr.com/blog/resource-allocation-configuration-spark-yarn)
  • [16] Apache Spark: Config Cheatsheet (http://c2fo.io/c2fo/spark/aws/emr/2016/07/06/apache-spark-config-cheatsheet/)
  • [17] Tuning Java Garbage Collection for Apache Spark Applications (https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html)
  • [18] How to set Apache Spark Executor memory (http://stackoverflow.com/questions/26562033/how-to-set-apache-spark-executor-memory)
  • [19] How to interpret RDD.treeAggregate (http://stackoverflow.com/questions/29860635/how-to-interpret-rdd-treeaggregate)
  • [20] Apache Spark 1.1: MLlib Performance Improvements (https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html)
  • [21] Spark group multiple rdd items by key (http://stackoverflow.com/questions/36447057/spark-group-multiple-rdd-items-by-key)
  • [22] Spark Corner Cases (http://codingjunkie.net/spark-corner-cases/)
  • [23] Writing efficient Spark jobs (http://fdahms.com/2015/10/04/writing-efficient-spark-jobs/)
  • [24] Efficient Data Storage for Analytics with Parquet 2.0
  • [25] Understanding Query Plans and Spark UIs
  • [26] Spark best practices