You can tune the number of partitions by asking those questions. The example above is a best-case scenario in which all the tasks are balanced and there is no skew in data size. In our specific use case, where we deal with billions of rows, we have found that a partition count in the range of 10,000 works most efficiently. Suppose we have to apply an aggregate function, such as a groupBy or a Window function, to a dataset containing a time series of billions of rows, where the column id is a unique identifier and the column timestamp holds the event time. To balance the data well, you can repartition the DataFrame/Dataset as follows:
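The snippet below is a minimal, spark-shell-style sketch of that idea: the Dataset name events, its toy rows, and the 10,000-partition target are illustrative assumptions, not our production code. It hash-partitions on id before running a groupBy and a Window aggregation, so that all rows of a given series end up in the same partition.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, row_number}
import org.apache.spark.sql.expressions.Window

val spark = SparkSession.builder().appName("repartition-sketch").getOrCreate()
import spark.implicits._

// Toy stand-in for the billions-of-rows time series (id, timestamp, value).
val events = Seq(
  ("a", "2024-01-01 00:00:00", 1.0),
  ("a", "2024-01-01 00:01:00", 2.0),
  ("b", "2024-01-01 00:00:30", 3.0)
).toDF("id", "timestamp", "value")

// Target partition count; roughly 10k is what worked for our workloads.
val numPartitions = 10000

// Hash-partition on id so every row of a given series lands in the same
// partition before the aggregation runs.
val repartitioned = events.repartition(numPartitions, col("id"))

// groupBy aggregation: one result row per id.
val counts = repartitioned.groupBy(col("id")).agg(count("*").as("n_rows"))

// Window function: rank each row within its id by timestamp.
val byTime = Window.partitionBy(col("id")).orderBy(col("timestamp"))
val ranked = repartitioned.withColumn("row_num", row_number().over(byTime))

counts.show()
ranked.show()
```

Because both the groupBy and the Window are keyed on id, repartitioning on that column up front lets the subsequent aggregations reuse the same data layout instead of triggering a second shuffle per operation.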