
# When to use Broadcast variable

As the documentation for [Spark Broadcast variables](http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables) states, they are immutable shared variables that are cached on each worker node of a Spark cluster.

![](/files/-Li31ADGdV9TT_OtTF3e)

## When to use Broadcast variable?

Before running each task on the available executors, Spark computes the task’s closure. The closure is the set of variables and methods that must be visible to the executor for it to perform its computations on the RDD.

If a huge array, for example some reference data, is accessed from Spark closures, the array is shipped to each Spark node along with the closure.\
For example, on a 10-node cluster with 100 partitions (10 partitions per node), the array is distributed at least 100 times (10 times to each node).\
If you use a broadcast variable instead, it is distributed once per node using an efficient p2p protocol.
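
The numbers above can be turned into a rough estimate of how much data is moved in a single pass over the RDD. The sketch below is plain Scala with no Spark dependency; the object and method names are illustrative, and the array size of one million `Int` values (about 4 MB of payload) is an assumption chosen only to make the arithmetic concrete.

```scala
// Back-of-the-envelope estimate of the bytes moved for one pass over the RDD.
// All names are illustrative helpers, not Spark APIs.
object ShippingCost {
  // Without broadcast: one copy per task, and one task per partition.
  def closureBytes(partitions: Int, arrayBytes: Long): Long =
    partitions.toLong * arrayBytes

  // With broadcast: one copy per node, cached there for all of its tasks.
  def broadcastBytes(nodes: Int, arrayBytes: Long): Long =
    nodes.toLong * arrayBytes

  def main(args: Array[String]): Unit = {
    val nodes = 10
    val partitions = 100
    val arrayBytes = 4L * 1000000L // assumed: 1,000,000 Ints at 4 bytes each

    println(s"closure:   ${closureBytes(partitions, arrayBytes)} bytes")  // 400000000
    println(s"broadcast: ${broadcastBytes(nodes, arrayBytes)} bytes")     // 40000000
  }
}
```

Under these assumptions the broadcast moves a tenth of the data, and the gap widens as the number of partitions per node grows.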

```scala
val array: Array[Int] = ??? // some huge array
val broadcasted = sc.broadcast(array)
```

And suppose we have some RDD:

```scala
val rdd: RDD[Int] = ???
```

In this case the array is shipped with the closure each time:

```scala
rdd.map(i => array.contains(i))
```

whereas with the broadcast variable you get a huge performance benefit, because each node reuses its cached copy:

```scala
rdd.map(i => broadcasted.value.contains(i))
```

## Things to remember while using Broadcast variables:

Once a value has been broadcast to the nodes, we shouldn’t change it, so that every node has exactly the same copy of the data. A modified value might be sent to another node later, which would give unexpected results.
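
One way to respect this rule when the reference data does change is to publish a new broadcast instead of mutating the old value in place. The sketch below is an assumption about how that could look, not a pattern prescribed by this guide: the object `BroadcastRefresh` and its `refresh` helper are hypothetical, while `unpersist()` and `sc.broadcast` are the standard Spark calls. Using an immutable `Set[Int]` as the payload means the value cannot be modified by accident.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

object BroadcastRefresh {
  // Hypothetical helper: replaces a broadcast rather than mutating its value.
  def refresh(
      sc: SparkContext,
      old: Broadcast[Set[Int]],
      updated: Set[Int]): Broadcast[Set[Int]] = {
    old.unpersist()       // drop the cached copies held by the executors
    sc.broadcast(updated) // publish the new data under a fresh broadcast handle
  }
}
```

A caller would then switch to the returned handle, for example `val v2 = BroadcastRefresh.refresh(sc, v1, v1.value + 8)`, and reference `v2` in any closures created afterwards.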

