Apache Spark - Best Practices and Tuning
Parallelism

How to estimate the size of a Dataset

An approximate calculation of the size of a dataset is:

number of megabytes = M = (N*V*W) / 1024^2

where:

    N  =  number of records

    V  =  number of variables

    W  =  average width in bytes of a variable
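
As a minimal Scala sketch of this formula (the helper name estimateSizeMB and its parameter names are illustrative, not part of any Spark API):

    // Back-of-the-envelope estimate of a dataset's size in megabytes.
    // N records, V variables, W average width in bytes per variable.
    def estimateSizeMB(numRecords: Long, numVariables: Long, avgWidthBytes: Double): Double =
      (numRecords * numVariables * avgWidthBytes) / (1024.0 * 1024.0)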

In approximating W, remember:

    Type of variable                                 Width
    -----------------------------------------------  --------------
    Integers, -127 <= x <= 100                       1
    Integers, -32,767 <= x <= 32,740                 2
    Integers, -2,147,483,647 <= x <= 2,147,483,620   4
    Floats, single precision                         4
    Floats, double precision                         8
    Strings                                          maximum length
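
These per-type widths can also be kept in a small Scala map (the map and the type labels are illustrative assumptions mirroring the table above):

    // Approximate width in bytes per variable type, following the table above.
    // String columns are instead counted at their maximum length.
    val approxWidthBytes: Map[String, Int] = Map(
      "1-byte integer"         -> 1,
      "2-byte integer"         -> 2,
      "4-byte integer"         -> 4,
      "single-precision float" -> 4,
      "double-precision float" -> 8
    )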

Say that you have a dataset of 20,000 records. That dataset contains:

     1  string identifier of length 20           20
    10  small integers (1 byte each)             10
     4  standard integers (2 bytes each)          8
     5  floating-point numbers (4 bytes each)    20
    ---------------------------------------------
    20  variables total                          58

Thus the average width of a variable is:

W = 58/20 = 2.9  bytes

The size of your dataset is:

M = 20000*20*2.9/1024^2 ≈ 1.11 megabytes
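
Plugging the numbers from this worked example into the hypothetical estimateSizeMB helper sketched earlier:

    val n = 20000L                         // N: number of records
    val v = 20L                            // V: number of variables
    val w = 58.0 / 20.0                    // W: average width = 2.9 bytes
    val sizeMB = estimateSizeMB(n, v, w)   // ≈ 1.11 megabytes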

This result slightly understates the size of the dataset because we have not included any variable labels, value labels, or notes that you might add to the data. That does not amount to much. For instance, imagine that you added variable labels to all 20 variables and that the average length of the text of the labels was 22 characters.

That would amount to a total of 20*22 = 440 bytes, or 440/1024^2 ≈ 0.00042 megabytes.

Explanation of formula

M = 20000*20*2.9/1024^2 ≈ 1.11 megabytes

N*V*W is, of course, the total size of the data in bytes. The 1024^2 in the denominator rescales the result to megabytes.

Yes, the result is divided by 1024^2 even though 1000^2 = a million. Computer memory comes in binary increments. Although we think of k as standing for kilo, in the computer business, k is really a “binary” thousand, 2^10 = 1,024. A megabyte is a binary million, that is, a binary k squared:

1 MB = 1024 KB = 1024*1024 = 1,048,576 bytes

With memory being cheap, we sometimes talk about gigabytes. Here is how a binary gigabyte works:

1 GB = 1024 MB = 1024*1024*1024 = 1,073,741,824 bytes
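
The same binary multipliers, written as Scala constants (illustrative only, not Spark configuration):

    // Binary unit multipliers.
    val KB: Long = 1L << 10   // 1,024 bytes
    val MB: Long = 1L << 20   // 1,048,576 bytes
    val GB: Long = 1L << 30   // 1,073,741,824 bytes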