To ensure a balanced distribution, partition on a property that has neither too many nor too few distinct values. For example, if the table contains customer information, the country is a suitable partitioning property because many customers share each country value, while there are still enough distinct countries to spread the data. Email addresses are not suitable because they typically have as many distinct values as there are customer entries.
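As a rough illustration of this check, the following Python sketch profiles a candidate partitioning property by counting its distinct values and measuring how large its biggest group is. The function name `key_profile`, the record layout, and the sample data are hypothetical and are not part of the product.

```python
from collections import Counter


def key_profile(records, key):
    """Return (number of distinct values, share of records in the largest group)."""
    if not records:
        raise ValueError("records must not be empty")
    # Count how many records fall under each value of the candidate property.
    counts = Counter(record[key] for record in records)
    largest_share = max(counts.values()) / len(records)
    return len(counts), largest_share


# Hypothetical sample data: six customers spread over three countries.
customers = [
    {"email": "a@example.com", "country": "PL"},
    {"email": "b@example.com", "country": "PL"},
    {"email": "c@example.com", "country": "US"},
    {"email": "d@example.com", "country": "US"},
    {"email": "e@example.com", "country": "DE"},
    {"email": "f@example.com", "country": "DE"},
]

for candidate in ("country", "email"):
    distinct, largest_share = key_profile(customers, candidate)
    print(f"{candidate}: {distinct} distinct values, largest group {largest_share:.0%}")
```

In this sample, `country` yields a few evenly sized groups, while `email` yields one group per customer, which matches the guidance above.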
Also consider the number of nodes assigned to the data flow service and the number of threads defined in the batch data flow run that uses the data set as its source. In a batch run, each thread processes one partition at a time and picks up the next available partition when it finishes. Because some threads finish their partitions faster than others, having more partitions than threads keeps the faster threads from sitting idle. The optimal number of partitions is therefore the number of nodes in the data flow service instance, multiplied by the number of threads defined in the batch data flow run, multiplied by two:
number of partitions = number of nodes × number of threads × 2
For example, if one node is assigned to the batch data flow service and five threads are defined in the data flow run, the optimal number of partitions is 1 × 5 × 2 = 10.
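The formula above can be sketched as a small Python helper. The function name `optimal_partition_count` and its parameters are assumptions made for illustration; the product does not expose such a function.

```python
def optimal_partition_count(nodes: int, threads: int) -> int:
    """Apply the rule of thumb: number of nodes x number of threads x 2."""
    if nodes < 1 or threads < 1:
        raise ValueError("nodes and threads must both be at least 1")
    # The factor of two leaves spare partitions for threads that finish early.
    return nodes * threads * 2


# One node in the data flow service, five threads in the data flow run.
print(optimal_partition_count(nodes=1, threads=5))  # prints 10
```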