
Poor Man's Parallelism


I really like orchestration tools such as Ansible or SaltStack. They can make running tasks on a group of machines a breeze. But sometimes you can’t or don’t want to install these tools on a machine. In cases like these, it is helpful to know how to parallelize some tasks in the shell.

You can do this via Unix/shell job control:

cmd="systemctl enable --now docker.service"
hosts=(host{1..4})

for host in "${hosts[@]}"
do
	ssh "$host" "$cmd" &   # the trailing & backgrounds the local ssh, not the remote command
done
wait   # block until every backgrounded ssh job has finished

However, from experience, this can be very error prone. For example, the placement of the & is important: it has to background the local ssh command, not the command running on the remote machine. Additionally, what if you had a lot of hosts and didn’t want to run them all at once? Instead, you would want a bounded pool of processes.

There are a few ways of doing this, and most of them are messy or fairly non-portable. On systems with util-linux installed you might use flock or lockfile, but then you essentially have to implement a semaphore out of mutex locks and shell arithmetic. If you don’t have util-linux, you can accomplish the same thing by taking advantage of the atomicity of mkdir on most (but not all) file systems.
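
To make that concrete, here is a minimal sketch of the mkdir-as-a-lock idea; the lock directory name /tmp/pool.lock and the retry interval are arbitrary choices for illustration:

# a rough sketch: mkdir's atomicity acts as a mutex around a shared job counter
lockdir=/tmp/pool.lock
until mkdir "$lockdir" 2>/dev/null; do
	sleep 1   # another process holds the lock; back off and retry
done
# ... critical section: check and update how many jobs are currently running ...
rmdir "$lockdir"   # release the lock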

Rather than doing any of this, take a look at the fairly pervasive xargs command. According to the manpage, xargs “builds and executes command lines from standard input.” It has an option -P <NUM_PROCS> that determines how many commands to run in parallel. With this, it is just a matter of formatting the input in a way that xargs understands.

cmd="systemctl enable --now docker.service"
hosts=(host{1..4})
numprocs=8
echo -n "${hosts[@]}" | xargs -P $numprocs -d" " -I{} ssh {} $cmd

Admittedly this looks a bit cryptic. It helps to know that -d changes the delimiter from the default newline to a space, and -I <REPLACE_STR> sets the replacement string so that ssh {} $cmd becomes ssh host1 $cmd for the first host, ssh host2 $cmd for the second, and so on; because each replacement consumes one input item, a separate ssh runs per host. The -n to echo keeps a trailing newline from being glued onto the last hostname. The xargs command also accepts an input file option (-a <file>) where we could put each host on its own line to simplify the call.
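
As a rough sketch of that file-based variant (assuming GNU xargs and an illustrative hosts.txt listing one host per line):

cmd="systemctl enable --now docker.service"
# hosts.txt is assumed to contain one hostname per line; newline is xargs' default delimiter
xargs -a hosts.txt -P 8 -I{} ssh {} $cmd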

Now we can easily create process pools in a mostly portable fashion in shell scripts. There are lots of useful things you could do with this, but here are two recipes that I came up with:

#copy a file to many nodes
function pcopy(){
	filename=$1
	dest=$2
	shift 2
	echo -n "$*" | xargs -d" " -P 8 -I{} scp "$filename" {}:"$dest"
}
pcopy somefile.txt /tmp host{1..8}

#retrieve files from many nodes
function pget(){
	filename=$1
	shift
	echo -n "$*" | xargs -d" " -P 8 -I{} scp {}:"$filename" "$(basename "$filename")".{}
}
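
For symmetry with pcopy, a call could look like the following; /var/log/syslog is only a stand-in for whatever remote file you want to collect:

pget /var/log/syslog host{1..8}

This leaves local copies named syslog.host1 through syslog.host8 in the current directory.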

Happy shell scripting!

Author
Robert Underwood
Robert is an Assistant Computer Scientist in the Mathematics and Computer Science Division at Argonne National Laboratory focusing on data and I/O for large-scale scientific applications, including AI for Science, using techniques from lossy compression and data management. He currently co-leads the AuroraGPT Data Team with Ian Foster. In addition to AI, Robert develops LibPressio, a library that allows users to experiment with and quickly adopt advanced compressors; it averages over 200 unique monthly downloads and is used at over 17 institutions worldwide. He is also a contributor to the R&D100-winning SZ family of compressors and other compression libraries. He regularly mentors students and is the early career ambassador for Argonne to the Joint Laboratory for Extreme Scale Computing.