Set chunk size when parallelising tasks
In this scenario:
vids = [...] # a large list of variables
result = pool.map(some_func, vids) # apply some_func to every variable in parallel
the multiprocessing.Pool splits vids into chunks and delegates each chunk to a worker process. The problem is that some variables take much longer to process than others, so a worker handed a chunk of slow variables can still be running long after the other workers have finished and gone idle. Reducing the chunk size spreads the work more evenly across workers.
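As a minimal, self-contained sketch (the toy some_func and the size of vids are assumed purely for illustration; only the chunksize argument matters here), passing an explicit chunksize to pool.map hands out smaller batches so idle workers keep picking up new variables:

import multiprocessing

def some_func(v):
    return sum(range(v))  # stand-in for per-variable work of varying cost

if __name__ == "__main__":
    vids = list(range(1, 10001))  # stand-in for the large list of variables
    with multiprocessing.Pool() as pool:
        # chunksize=1 assigns one variable at a time, so a worker that finishes
        # a cheap variable immediately takes the next one instead of waiting
        # behind a large pre-assigned chunk.
        result = pool.map(some_func, vids, chunksize=1)

Very small chunks add scheduling overhead, so in practice you tune chunksize somewhere between 1 and the default; pool.imap_unordered with a small chunksize is another option when the order of results does not matter.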
Alternatively, you could try to parallelise across columns rather than across variables, although this may not be achievable in a general way.