# How to estimate the size of a Dataset

An approximate calculation of the size of a dataset is:

```
number of megabytes = M = (N*V*W) / 1024^2
```

where:

```
    N  =  number of records

    V  =  number of variables

    W  =  average width in bytes of a variable
```
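
The formula can be wrapped in a small helper. This is only a minimal sketch: the function name `estimate_size_mb` and its parameter names are made up for illustration and do not come from any library.

```python
def estimate_size_mb(n_records: int, n_variables: int, avg_width_bytes: float) -> float:
    """Approximate dataset size in binary megabytes: M = (N * V * W) / 1024^2."""
    return (n_records * n_variables * avg_width_bytes) / 1024**2


# 1,048,576 records with a single 1-byte variable occupy exactly one megabyte.
print(estimate_size_mb(1_048_576, 1, 1))  # 1.0
```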

In approximating **W**, remember:

| **Type of variable**                           | **Width (bytes)** |
| ---------------------------------------------- | ----------------- |
| Integers, -127 <= x <= 100                     | 1                 |
| Integers, -32,767 <= x <= 32,740               | 2                 |
| Integers, -2,147,483,647 <= x <= 2,147,483,620 | 4                 |
| Floats single precision                        | 4                 |
| Floats double precision                        | 8                 |
| Strings                                        | maximum length    |
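
As a rough illustration, the widths in the table can be encoded as a lookup and averaged to obtain **W**. The type labels (`int8`, `int16`, and so on) and the helper names are hypothetical, chosen only for this sketch.

```python
# Hypothetical type labels; the widths in bytes follow the table above.
WIDTH_BYTES = {"int8": 1, "int16": 2, "int32": 4, "float32": 4, "float64": 8}


def variable_width(var_type: str, max_length: int = 0) -> int:
    """Width in bytes of one variable; a string uses its maximum length."""
    if var_type == "string":
        return max_length
    return WIDTH_BYTES[var_type]


def average_width(variables: list[tuple[str, int]]) -> float:
    """Average width W over (type, max_length) pairs; max_length only matters for strings."""
    widths = [variable_width(var_type, max_length) for var_type, max_length in variables]
    return sum(widths) / len(widths)


# One 12-character string, one 4-byte integer, one double: (12 + 4 + 8) / 3
print(average_width([("string", 12), ("int32", 0), ("float64", 0)]))  # 8.0
```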

Say that you have a 20,000-observation dataset. That dataset contains:

```
    1  string identifier of length 20                     20

    10  small integers (1 byte each)                      10

    4  standard integers (2 bytes each)                    8

    5  floating-point numbers (4 bytes each)              20

    --------------------------------------------------------

    20  variables total                                   58
```

Thus the average width of a variable is:

```
W = 58/20 = 2.9  bytes
```

The size of your dataset is:

```
M = 20000*20*2.9/1024^2 = 1.11 megabytes
```
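
The worked example can be reproduced in a few lines. This is a sketch that uses the counts and widths listed above; the variable names are illustrative only.

```python
# (count, width_in_bytes) for each group of variables in the example
groups = [
    (1, 20),  # string identifier of length 20
    (10, 1),  # small integers
    (4, 2),   # standard integers
    (5, 4),   # floating-point numbers
]
n_records = 20_000

n_variables = sum(count for count, _ in groups)            # 20 variables
row_width = sum(count * width for count, width in groups)  # 58 bytes per record
avg_width = row_width / n_variables                        # 2.9 bytes per variable

size_mb = n_records * n_variables * avg_width / 1024**2
print(f"V = {n_variables}, W = {avg_width}, M = {size_mb:.2f} MB")
```

Note that `N*V*W` is the same as the number of records times the bytes per record, so the average width is only an intermediate convenience.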

This result slightly understates the size of the dataset because we have not included any variable labels, value labels, or notes that you might add to the data. That does not amount to much. For instance, imagine that you added variable labels to all 20 variables and that the average length of the text of the labels was 22 characters.

That would amount to a total of 20\*22 = 440 bytes, or 440/1024^2 = 0.00042 megabytes.
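
The same check in code, as a quick sketch that assumes each label character occupies one byte:

```python
n_labels = 20          # one label per variable
avg_label_length = 22  # characters, assumed to take 1 byte each

label_bytes = n_labels * avg_label_length  # 440 bytes
label_mb = label_bytes / 1024**2
print(f"{label_bytes} bytes = {label_mb:.5f} MB")  # 440 bytes = 0.00042 MB
```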

### **Explanation of formula**

```
M = 20000*20*2.9/1024^2 = 1.11 megabytes
```

N\*V\*W is, of course, the total size of the data in bytes. The 1,024^2 in the denominator rescales the result to megabytes.

Yes, the result is divided by 1,024^2 even though 1,000^2 = a million. Computer memory comes in binary increments. Although we think of k as standing for kilo, in the computer business, k is really a “binary” thousand, 2^10 = 1,024. A megabyte is a binary million, that is, a binary k squared:

```
1 MB = 1024 KB = 1024*1024 = 1,048,576 bytes
```

With cheap memory, we sometimes talk about a gigabyte. Here is how a binary gig works:

```
1 GB = 1024 MB = 1024^3 = 1,073,741,824 bytes
```
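
These binary units are easy to define as constants. The names `KB`, `MB`, `GB` and the helper `bytes_to_mb` below are illustrative choices for this sketch, not part of any library.

```python
KB = 2**10   # a "binary thousand": 1,024 bytes
MB = KB**2   # a "binary million"
GB = KB**3   # a "binary billion"


def bytes_to_mb(n_bytes: float) -> float:
    """Convert a byte count to binary megabytes."""
    return n_bytes / MB


print(f"{MB:,}")  # 1,048,576
print(f"{GB:,}")  # 1,073,741,824
```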

