How do you parallelize a function with multiple arguments in Python? It turns out that it is not much different than for a function with one argument, but I could not find any documentation of that online.
An “embarrassingly parallel” computing task is one in which each calculation is independent of the ones that came before it. For example, squaring each number in the list [1, 2, 3, 4, 5] is embarrassingly parallel: the square of 2 does not depend on the square of 1, the square of 3 does not depend on the square of 2, and so on. Instead of each calculation waiting for the previous one to complete, multiple calculations can run simultaneously on different processors. Because a calculation is performed on each element of the list, for-loops (or similar structures) are usually used.
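As a concrete baseline, the squaring example can be written as an ordinary serial for loop, where each calculation simply waits its turn:

numbers = [1, 2, 3, 4, 5]
squares = []
for n in numbers:           # each iteration is independent of the others
    squares.append(n ** 2)
print(squares)              # [1, 4, 9, 16, 25]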
It is easy enough to find examples of how to parallelize a for loop in R or Python. The canonical Python example uses the joblib library:
>>> from math import sqrt
>>> from joblib import Parallel, delayed
>>> Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

This code uses 2 processors to take the square root of the square of each of the numbers [0, 1, 2, 3, 4, 5, 6, 7, 8, 9].
That example assumes your function takes only one argument. What happens if you have multiple arguments, e.g. if you have a nested loop? For example, what if you want to calculate the products of the pairwise combinations of [1, 2, 3, 4, 5] and [10, 11, 12, 13, 14, 15]? I could not find a simple explanation online for this case, which is surprising; the top Google results for “joblib parallel” do not address it, the closest being the canonical example above.
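For reference, the plain serial version of that nested-loop calculation is just two for loops (ordinary Python, before any parallelization):

products = []
for i in [1, 2, 3, 4, 5]:
    for j in [10, 11, 12, 13, 14, 15]:   # inner loop supplies the second factor
        products.append(i * j)
# products holds all 30 pairwise products, starting [10, 11, 12, 13, 14, 15, 20, 22, ...]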
It turns out that passing multiple arguments is simple enough, requiring just a slight modification of the canonical code.
>>> from joblib import Parallel, delayed
>>> def multiple(a, b):
...     return a * b
...
>>> Parallel(n_jobs=2)(delayed(multiple)(a=i, b=j) for i in range(1, 6) for j in range(10, 16))

This code defines a function that takes two arguments and multiplies them together. The slightly confusing part is that the values of the arguments to multiple() are supplied outside of the call to that function, by the for clauses that follow it, and keeping track of the loops can get confusing if there are many arguments to pass. Below is the code I ended up writing to generate sample network data, where each network is defined by four parameters.
>>> from joblib import Parallel, delayed
>>> vertices = [100, 1000, 10000]
>>> edge_probabilities = [.1, .2, .3, .4, .5, .6]
>>> power_exponents = [2, 2.5, 3, 3.5, 4]
>>> graph_types = ['Erdos_Renyi', 'Barabasi', 'Watts_Strogatz']
>>> Parallel(n_jobs=6)(delayed(makeGraph)(graph_type=graph, nodes=vertex, edge_probability=prob, power_exponent=exponent) for vertex in vertices for prob in edge_probabilities for exponent in power_exponents for graph in graph_types)

makeGraph is a function I created. It is too long to show here, but the idea should be clear: I tell the computer to use 6 processors to run makeGraph, and its arguments take on every combination of values from the lists I have already defined.
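If the chain of for clauses becomes hard to follow, one alternative (a sketch, not from the original code) is to build the argument combinations with itertools.product and unpack each combination in a single for clause. The makeGraph stub below is only a placeholder for the real function, which is not shown:

from itertools import product
from joblib import Parallel, delayed

def makeGraph(graph_type, nodes, edge_probability, power_exponent):
    # Placeholder for the real makeGraph, which is too long to show here.
    return (graph_type, nodes, edge_probability, power_exponent)

vertices = [100, 1000, 10000]
edge_probabilities = [.1, .2, .3, .4, .5, .6]
power_exponents = [2, 2.5, 3, 3.5, 4]
graph_types = ['Erdos_Renyi', 'Barabasi', 'Watts_Strogatz']

# itertools.product yields every (vertex, prob, exponent, graph) combination,
# so the generator expression needs only one for clause.
combinations = product(vertices, edge_probabilities, power_exponents, graph_types)

results = Parallel(n_jobs=6)(
    delayed(makeGraph)(graph_type=graph, nodes=vertex,
                       edge_probability=prob, power_exponent=exponent)
    for vertex, prob, exponent, graph in combinations
)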
This was very helpful, thank you for sharing this!
Any idea how this actually works? I’m not familiar with the syntax of a generator expression right after an object instantiation, i.e. a = Object(...)(generator expression).
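To unpack that syntax: Parallel(n_jobs=2) constructs a Parallel object, and that object is then called with the generator expression as its single argument, which works because the class defines __call__. Similarly, delayed(sqrt)(i ** 2) builds a description of a task rather than calling sqrt immediately. Here is a toy sketch of the pattern (not joblib’s actual implementation, and it runs the tasks serially rather than in parallel):

from math import sqrt

class Runner:
    """Toy stand-in for Parallel: configured on construction, then called with tasks."""
    def __init__(self, n_jobs):
        self.n_jobs = n_jobs          # configuration is stored on the instance

    def __call__(self, tasks):
        # 'tasks' is whatever iterable was passed in (here, a generator
        # expression); joblib would dispatch these to worker processes.
        return [func(*args) for func, args in tasks]

def deferred(func):
    """Toy stand-in for delayed(): capture func and its arguments without calling them."""
    return lambda *args: (func, args)

print(Runner(n_jobs=2)(deferred(sqrt)(i ** 2) for i in range(10)))
# [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]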