
Poor Man's Parallelism


I really like orchestration tools such as Ansible or SaltStack. They can make running tasks on a group of machines a breeze. But sometimes you can’t or don’t want to install these tools on a machine. In cases like these, it is helpful to know how to parallelize some tasks in the shell.

You can do this via Unix/shell job control:

cmd="systemctl enable --now docker.service"
hosts=(host{1..4})

for host in "${hosts[@]}"
do
	ssh "$host" "$cmd" &   # the trailing & backgrounds the local ssh, not the remote command
done
wait   # block until every backgrounded ssh job has finished

However, from experience, this can be very error prone. For example, the placement of the & is important: it has to background the local ssh command, not the command running on the remote machine. Additionally, what if you had a lot of hosts and didn’t want to run them all at once? Instead, you would want a bounded pool of processes.

There are a few ways of doing this, and most of them are messy or fairly non-portable. On systems with util-linux installed you might use flock or lockfile, but then you essentially have to implement a semaphore out of mutex locks and shell arithmetic. If you don’t have util-linux, you can accomplish the same thing by taking advantage of the atomicity of mkdir on most (but not all) file systems.
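
To make that concrete, here is a minimal sketch of the mkdir-as-a-lock idea; the lock directory name /tmp/pool.lock and the retry interval are arbitrary choices for illustration:

# a rough sketch: mkdir's atomicity acts as a mutex around a shared job counter
lockdir=/tmp/pool.lock
until mkdir "$lockdir" 2>/dev/null; do
	sleep 1   # another process holds the lock; back off and retry
done
# ... critical section: check and update how many jobs are currently running ...
rmdir "$lockdir"   # release the lock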

Rather than doing any of this, take a look at the fairly pervasive xargs command. According to the manpage, xargs “builds and executes command lines from standard input.” It has an option -P <NUM_PROCS> that determines how many commands to run in parallel. With this, it is just a matter of formatting the input in a way that xargs understands.

cmd="systemctl enable --now docker.service"
hosts=(host{1..4})
numprocs=8
echo -n "${hosts[@]}" | xargs -P $numprocs -d" " -I{} ssh {} $cmd

Admittedly this looks a bit cryptic. It helps to know that -d changes the delimiter from the default newline to a space, and -I <REPLACE_STR> sets the replacement string so that ssh {} $cmd becomes ssh host1 $cmd for the first host, ssh host2 $cmd for the second, and so on; because each replacement consumes one input item, a separate ssh runs per host. The -n to echo keeps a trailing newline from being glued onto the last hostname. The xargs command also accepts an input file option (-a <file>) where we could put each host on its own line to simplify the call.
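
As a rough sketch of that file-based variant (assuming GNU xargs and an illustrative hosts.txt listing one host per line):

cmd="systemctl enable --now docker.service"
# hosts.txt is assumed to contain one hostname per line; newline is xargs' default delimiter
xargs -a hosts.txt -P 8 -I{} ssh {} $cmd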

Now we can easily create process pools in a mostly portable fashion in shell scripts. There are lots of useful things you could do with this, but here are two recipes that I came up with:

#copy a file to many nodes
function pcopy(){
	filename=$1
	dest=$2
	shift 2
	echo -n "$*" | xargs -d" " -P 8 -I{} scp "$filename" {}:"$dest"
}
pcopy somefile.txt /tmp host{1..8}

#retrieve files from many nodes
function pget(){
	filename=$1
	shift
	echo -n "$*" | xargs -d" " -P 8 -I{} scp {}:"$filename" "$(basename "$filename")".{}
}
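
For symmetry with pcopy, a call could look like the following; /var/log/syslog is only a stand-in for whatever remote file you want to collect:

pget /var/log/syslog host{1..8}

This leaves local copies named syslog.host1 through syslog.host8 in the current directory.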

Happy shell scripting!

Author
Robert Underwood
Robert is an Assistant Computer Scientist in the Mathematics and Computer Science Division at Argonne National Laboratory focusing on data and I/O for large-scale scientific applications, including AI for Science, using techniques from lossy compression and data management. He currently co-leads the AuroraGPT Data Team with Ian Foster. In addition to AI, Robert develops LibPressio, a library that allows users to experiment with and quickly adopt advanced compressors; it averages over 200 unique monthly downloads and is used at over 17 institutions worldwide. He is also a contributor to the R&D100-winning SZ family of compressors and other compression libraries. He regularly mentors students and is the early career ambassador for Argonne to the Joint Laboratory for Extreme Scale Computing.