Copyright © 2004 Gene Michael Stover. All rights reserved. Permission to copy, store, & view this document unmodified & in its entirety is granted.
As of 2004-Nov-08, this is still under construction. I hope to complete it within a couple of weeks.
I had an idea for an algorithm to distribute a computation job among lots of computers.
I used to be interested in distributed systems, but I haven't kept up-to-date with it since 1992, so I have little doubt that this algorithm has long since been written up. On the other hand, if the state of the literature now is what it was in 1992, it has been documented in the extremely abstract, not for programmers.
So here is my algorithm, presented by a programmer for programmers.
The slaves have a really simple algorithm. Here is the pseudocode:
You have a Big Job to compute. It could be almost anything.
You divide your Big Job into a lot of smaller Jobs.
My algorithm provides a priority queue into which you place your Jobs.
My algorithm has a loop that doles the Jobs to the Slaves.
RPC functions, but few assumptions about how they are implemented.
The coordinator's loop of my algorithm assumes that the queue has been primed with all those small Jobs. Here is the loop in pseudo-Lisp.
(defun idpa-loop (q done?)
(loop until (done?)
do (let ((request (rpc-recv)))
;; Handle the request. Doing so might
;; alter the queue or change other things
;; so that done? returns true.
(handle request)
;; Send the next job to the slave.
(let ((job (queue-get q)))
(incf (job-outstanding job))
(rpc-reply job request)))))
A Slave initiates communication with the Coordinator. That communication is effectively a Remote Procedure Call (RPC) in which the Coordinator acts as a server & the Slave acts as a client. The Slave always reports the result of the last job it did, with some kind of nil result value allowed the first time the Slave contacts the Coordinator.
The Coordinator replies to the remote procedure call with the next Job from the queue, except that we allow a reply that means ``There are no more Jobs right now. Ask me again in T seconds.'' Let's call this special case the Nil Job.
The important stuff is in the priority queue's weight function.
Let's define the relevant parts of a job. A Job has
The priority queue must always provide the next Job that we should hand to a Slave. Features of the priority queue's ordering of Jobs include:
These goals turn out to be easily accomplished.
I assume that the priority queue's implementation makes us of some kind of an isLess function to determine the relative order of the Jobs in the queue. Here is pseudocode for an isLess function.
Here is a possible implementation of that pseudocode in pseudo-Lisp.
(defun is-less (x y)
"Return true if & only if Job X should appear
in the priority queue before Job Y."
(or (prerequisit-of x y)
(and (not (prerequisit-of y x))
(< (job-outstanding x) (job-outstanding y)))))
Let's see how the algorithm works in theory with a small number of jobs & slaves.
Assume we have a Big Job to compute, & we have already converted it into smaller Jobs. For this example, assume there are only three such Jobs: J0, J1, & J2. None of the three Jobs are prerequisites for the others, so they may be executed in any order.
We create a priority queue that uses the appropriate queue-ordering function, & we stuff the three jobs into it. Then we start the loop
Let's assume we have three Slaves: S0, S1, & S3.
The first Slave, S0, requests a job, & the main loop hands it J0. Then S1 requests a job, & the main loop hands it J1. Then S2 requests a job, & the main loop hands it J2.
When S0 finishes cits work, it sends the reply to the Coordinator & requests
Gene Michael Stover 2008-04-20