Latest News:
  New DANCE compiler 2.02
 
       
   
       
 
United States Patent 5,867,649
Larson February 2, 1999

Dance/multitude concurrent computation

Abstract

This invention computes by constructing a lattice of states. Every lattice of states corresponding to correct execution satisfies the temporal logic formula comprising a Definitive Axiomatic Notation for Concurrent Execution (DANCE) program. This invention integrates into a state-lattice computational model: a polymorphic, strong type system; visibility-limiting domains; first-order assertions; and logic for proving a program correctness. This invention includes special hardware means for elimination of cache coherency among other things, necessary to construct state-lattices concurrently. The model of computation for the DANCE language consisting of four, interrelated, logical systems describing state-lattices, types, domains, and assertions. A fifth logical system interrelates the other four systems allowing proofs of program correctness. The method of the present invention teaches programs as temporal formulas satisfied by lattices of states corresponding to correct execution. State-lattices that are short and bushy allow application of many processors simultaneously thus reducing execution time.


Inventors: Larson; Brian Ralph (Mpls., MN)
Assignee: Multitude Corporation (Bayport, MN)
Appl. No.: 589933
Filed: January 23, 1996

U.S. Class: 395/200.31
Intern'l Class: G06F 015/16
Field of Search: 364/DIG. 1 MS File,DIG. 2 MS File 395/377,376,379,382,385,390,391,561,800,705,706,712,200.31,800.28,800.3,800.36 340/825.8


References Cited [Referenced By]

U.S. Patent Documents
4814978 Mar., 1989Dennis395/377.
4833468 May., 1989Larson et al.340/825.
5029080 Jul., 1991Otsuki395/377.


Other References

Reference Manual for the ADA Programming Language, US Dept. of Defense, Jul. 1980.
Communicating Sequential Processes, C.A.R. Hoare, Comm. of the ACM, Aug. 1978, vol. 21, No. 8.
An Optimum Algorithm for Mutual Exclusion in Computer Networks, Glen Ricant et al. Comm. of the ACM, Jan. 1981, vol. 21, No. 1.
An Algorithm for Mutual Exclusion in Decentralized Systems, M. Maekawa, CAM Transactions on Computer Systems, vol. 3, No. 2, May 1985.
The Information Structure of Distributed Mutual Exclusion Algorithms, Sanders, ACM Transactions on Computer Systems, vol. 5, No. 3, Aug. 1987.
A Tree-Based Algorithm for Distributed Mutal Exclusion, K. Raymond, ACM Transactions on Computer Systems, vol. 2, #1, Feb. 1989.
Algorithms for Scalable Synchronization on Shaved Memory Multiprocessors, J. Mollor-Crumley, ACM Transactions on Computer Systems, vol. 9, No. 1, Feb. 1991.
Verification of Sequential and Concurrent Programs, Apt. & Olderog; Springer-Verlag, 1991.
The Science of Programming, D. Gries Springer-Verlag, 1981.
Computer Architecture R. Karin Prentice-Hall (1989) (CPT. 5).
The Design and Analysis of Parallel Algorithms, S. Akl, Prentice-Hall, 1989.
Executing Temporal Logic Programs, B.C. Moszkowski, Cambridge University Press, Copyright 1986, pp. 1-125.

Primary Examiner: Harrell; Robert B.
Attorney, Agent or Firm: Haugen and Nikolai, P.A.

Claims



1. A method of operating a plurality of computing means comprising the steps of:

(a) transforming a temporal logic formula into machine executable form;

(b) loading the machine executable form for the temporal logic formula and data to be manipulated into said plurality of computing means; and

(c) constructing a lattice to satisfy said temporal logic formula, the lattice having a plurality of nodes, each node representing a state value that is a value of a program variable at a given moment, the plurality of nodes being interconnected by directed arcs, each directed arc representing a change in state affected by a computation associated with that arc.

2. The method of claim 1 wherein the state values are elements of strong types, including either constructed types or types defined using existential and bounded universal quantification.

3. The method of claim 1 wherein visibility of said program variables in a particular state is restricted to domains.

4. The method of claim 1 in which the state values may occupy computing means located in disparate geographical locations utilizing remote procedure call semantics.

5. The method of claim 1 and further including the step of proving correctness of the lattice construction automatically.

6. The method of claim 1 and further including the step of proving correctness of the lattice construction manually.

7. A computing system comprising:

(a) a plurality of computing means operatively coupled by an interconnection network means for communicating with one another, the network means including means for performing global memory access routing, request combining and response decombining;

(b) each of the computing means having an instruction set including combinable operations and write back capability;

(c) the computing means further having a virtual memory address translation mechanism for enforcing access constraints for segments or pages of memory, said constraints being selected from the group consisting of:

(i) memory shared, but no caching allowed,

(ii) memory protected and temporary exclusive caching with processor indicated write back,

(iii) discretionary memory access controls with exclusive demanded authorization user access rights,

(iv) mandatory memory access controls in which a user may read values at an assigned security level or lower, write values at said assigned security level or higher, and only invoke combinable operations on values at the assigned security level.
Description



BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates to a method for the execution of plural computer programs being executed by a multiplicity of processors in a parallel configuration, and more specifically to performing multiple state transitions simultaneously.

2. Discussion of the Prior Art

New Parallel Language Needed

I surveyed ways to program parallel machines. Despite contentions to the contrary, I do not consider SIMD (single-instruction, multiple-datastream) machines like the Connection Machine (CM-2) to be parallel processors. SIMD machines are restricted to performing the same operation in many cells and thus can perform only a narrow range of applications. Most applications require much data-dependent branching that degrades performance on SIMD machines, such as the Cray YMP-8 and CM-2 alike. FORTRAN, with a vectorizing compiler, is appropriate for SIMD machines. Every attempt to wring efficient, massive parallelism from conventional FORTRAN programs has failed. Only programs expressed with a language embodying an inherently parallel model of computation can execute efficiently in parallel.

The challenge to those wishing to exact optimum performance from highly parallel processors is to coordinate activities in processing nodes that are doing different things. Modifications to special-purpose languages like LISP and PROLOG to incorporate futures and guards, respectively, are problematic, addressing restricted application domains of largely academic interest.

Then there are the many extensions of sequential languages. Such extensions fall into two classes: "processes" and "doall". Ada is perhaps the best example of a process-based language. Each Ada task (i.e. process) represents a single thread of control that cooperates with other tasks through message passing. Communicating Sequential Processes described in "Communicating Sequential Processes," C. A. R. Hoare, Communications of the ACM, Vol. 21, No. 8, August 1978 forms the underlying model of computation for message-passing, process-based languages. However, deciding whether an arbitrary collection of processes is deadlock-free is undecidable and proving that a particular collection of processes is deadlock-free is onerous. Furthermore, such pairwise communication and synchronization is unnecessarily restrictive thus limiting the degree of parallelism. Finally, a programmer writes (essentially) a plurality of sequential programs, rather than a single, parallel program. The "doall" is just a FORTRAN do-loop in which all the iterations are instead performed concurrently; again a collection of sequential programs.

Current, Conventional, Concurrent Computing

Traditional, sequential computing causes a sequence of states to be created so that the last state holds the data desired. FIG. 1 illustrates sequential computing starting at an initial state (101), proceeding through intermediate states (103), and terminating in a final state (102). The values of program variables at a particular instant of time comprise a state, depicted as circles. In state diagrams, time always flows down the page, that is, the state at the top of the figure occurs first followed by the state(s) connected by a line. Broken lines indicate repetition or "more of the same."

In "parallel" execution several sequences of state transitions occur concurrently. FIG. 2 depicts parallel execution in doall style. The state diagram shows state transitions occurring as "parallel" sequences. The conception of parallel processing as a collection of sequential executions characterizes the prior art that this invention supersedes.

Two mechanisms allow communication between sequential processes: message-passing or shared-memory. Load balancing and problem partitioning have produced many academic papers for either message-passing or shared-memory paradigms--nearly all yielding poor performance.

Mutual exclusion (MutEx) via message passing (i.e. for distributed systems, no shared state) is horribly inefficient. If the application needs MutEx often, performance will suffer. "An Algorithm for Mutual Exclusion in Computer Networks," Glenn Ricart & Ashok K. Agrawala, Communications of the ACM, Vol. 24, No. 1, January 1981, for a discussion of prior art O(N) request/reply algorithm, which really needs O(N.sup.2) messages when everyone wants the lock. "A.sqroot.N Algorithm for Mutual Exclusion in Decentralized Systems," Mamoru Maekawa, ACM Transactions on Computer Systems, Vol. 3, No. 2, May 1985, reduces messages to O(.check mark.N) with an elegant multidimensional geometry. Sanders' generalized MutEx algorithm in more convoluted terms in her article, "The Information Structure of Distributed Mutual Exclusion Algorithms," Beverly A. Sanders, ACM Transactions on Computer Systems, Vol. 5, No. 1, August 1987. She considers deadlock freedom (absence of possibility that computation ceases because every process is waiting to acquire a lock held by a different process), but the basic model which she proposes needs further augmentation to detect and recover from deadlock. Of course, only a wordy explanation substitutes for correctness proof. Therefore the desired interference freedom provided by MutEx is not achieved. Raymond in "A Tree-Based Algorithm for Distributed Mutual Exclusion," ACM Transactions on Computer Systems, Vol. 7, No. 1, February 1989, further reduced the number of messages to O(logN) by increasing the latency, which could easily become a significant burden if processes need the lock briefly.

Controlling access to shared variables is so difficult that designers have resorted to the message-passing model in which all state variables are local to some process and processes interact exclusively, via exchange of messages. Inevitably, two different kinds of system-wide operations barrier synchronization and remote state acquisition are required by the algorithm or application. They are notoriously inefficient. Barrier synchronization ensures that all the parallel computations in one phase are completed before the next phase is begun. Remote state acquisition occurs when the application's data cannot be partitioned disjointly so that all processes use values exclusively from its partition held in local memory. For example, advanced magnetic resonance imaging (MRI) research seeks to analyze brain scans and call to the attention of the radiologist any anomalies. Three-dimensional fast Fourier transform (FFT) is necessary for filtering and extraction of interesting features. The data consists of a 1 k by 1 k by 1 k array of floating point intensities. The first phase of the algorithm maps nicely by putting a 1 k by 1 k plane in each of 1 k processors. However integrating the third dimension finds the data in the worst possible distribution--every computation needs values that are possessed by each of the 1 k processors|

Barriers are an artifact of the popular yet problematic process paradigm. A static collection of sequential processes alternately work independently and wait for laggards. FIG. 3 depicts four processes (201) synchronized with barriers (204). Each process is actively computing (202) and then wait idly (203) until all other processes reach the barrier. The best known algorithms for barrier-synchronization take O(logP) time (with a large constant) for both message-passing and shared-memory models. This corresponds with the depth of a tree using either messages or flags.

A large body of academic work has been produced for both message-passing and parallel, random-access machine (PRAM) models, i.e. shared memory. Barrier synchronization in a message-passing model requires O(logP) time where P is number of processors, with a big constant proportional to message transmission latency. See "Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors," John M. Mellor-Crummey and Michael L. Scott, ACM Transactions on Computer Systems, Vol. 9, No. 1, February 1991.

Consider an ideally parallelizable application that requires synchronization. A single processor would require N steps (time, T.sub.8) to perform the application with L iterations of a single loop each requiring N/L steps. When parallelized, barrier synchronization is required at the end of each iteration taking logP steps. Real systems take many steps for each level of the synchronization tree. So the parallel time for each iteration, T.sub.i is N/(L*P)+logP and the whole application T.sub.p, is T.sub.i *L or N/P+L*logP steps. Speedup, S, is the ratio of sequential time to parallel time, T.sub.s /T.sub.p. Efficiency, E, is the speedup obtained divided by the number of processors used, S/P. In this case, E=T.sub.s /T.sub.p *P=N/(N+P*L*logP).

    ______________________________________
    Total work to be done
                    T.sub.s = N steps
    (sequential)
    Number of loop iterations
                    L
    Sequential time per iteration
                    N/L
    Parallel time per iteration
                    T.sub.j = N/(L*P) + logP
    Total parallel time
                    T.sub.p = T.sub.i *L = N/P + L*logP
    Speedup         S = T.sub.s /T.sub.p = N*P/(N + P*L*logP)
                    S = P/(1 + (L/N)*P*logP)
    Efficiency      E = S/P = N/(N + P*L*logP)
                    E = 1/(1 + (L/N)*P*logP) - O(1/logP)
    ______________________________________


When the number of processors is small, N>>P*L*logP, the efficiency can be quite good, close to 1. But as more processors are used (N.apprxeq.P) the barrier synchronization overhead begins to dominate even when only a single barrier is needed.

Therefore, an application needing barrier synchronization on any conceivable message-passing architecture (N.apprxeq.P), the minimum time of execution will be O(logP), the best speedup possible is O(P/logP), the efficiency can be no better than O(1/logP). Little wonder that efficiencies of even 10% are hard to achieve on parallel processors like CM-2, CM-5, NCube, Paragon, and T3D. They can perform no better than their theoretical model.

    ______________________________________
    Order parallel time  T.sub.p = O(logP)
    Order speedup        S = O(P/logP)
    Order efficiency     E = O(1/logP)
    ______________________________________


Parallel Random-Access Machines (PRAMs) suffer similar fate. When most people think of "shared-memory" they imagine a PRAM--a bunch of sequential processors that can read and write the same memory locations. Many PRAM variations exist. Whether the memory locations can be accessed concurrently or exclusively and what happens for concurrent accesses constitute most of the PRAM model differences. Even unrealistic operations like storing the median or geometric mean of all concurrent writes still inflicts an O(logP) penalty for barrier synchronization same as message-passing.

Sequential algorithms have been traditionally analyzed using Turing machines. Both the time (number of steps) and the space (number of tape squares) are used to measure the complexity of a computation. Additionally whether the Turing machine has a unique next action (deterministic) or a set of possible next actions (nondeterministic) substantially affects the complexity measure of a computation.

The following paragraphs summarize three chapters of a recent book on parallel algorithms entitled "Synthesis of Parallel Algorithms", John H. Reif, editor, Morgan Kaufmann, 1993, that I believe is state-of-the-art computer science theory for shared-memory computers. The following sections quote heavily those authors (Faith E. Fich, Raymond Greenlaw, and Phillip B. Gibbons) and will be indented to indicate quotation.

PRAMs attempt to model the computational behavior of a shared-memory parallel processor. The alternative is shared-nothing, message-passing parallel processors, which are popular these days. PRAMs have many variations and have a large body of published scholarship. Fich, Greenlaw, and Gibbons cover the main kinds of PRAMs, which I deem represent the conventional wisdom about how to think about parallel computation. This paradigm has, so far, not yielded efficient, general-purpose, parallel processing. Thus the motive for the present invention.

Fich's chapter begins with a concise definition of a "synchronous" PRAM.

"The PRAM is a synchronous model of parallel computation in which processors communicate via shared memory. It consists of m shared memory cells M.sub.1 . . . , M.sub.m and p processors P.sub.1, . . . , P.sub.p. Each processor is a random access machine (RAM) with a private local memory. During every step of a computation a processor may read from one shared memory cell perform a local operation, and write to one memory cell. Reads, local operations, and writes are viewed as occurring during three separate phases. This simplifies analysis and only changes the running time of an algorithm by a constant factor." (Underline mine.)

Throughout Fich's paper she assumes a synchronous PRAM: single-instruction, multiple-datastream (SIMD). This allows her to ignore synchronization costs and hassles in her analyses. The synchronous PRAM cannot model MIMD machines with their independent, data-dependent branching is and uncertain timing. Having all processors read-execute-write each step sidesteps sticky issues like what happens when a memory cell is read and written simultaneously. Furthermore the execute part of a step can perform any calculation whatsoever| So the lower bounds derived for various algorithms concentrate solely on shared-memory accesses while totally ignoring the real computation--data transformation in the execute part of the step.

Fich describes "forking" PRAM in which new processors are created with a fork operation (as in Unix). A forking PRAM can dynamically describe parallel activities at run-time. (Regular PRAMs have a fixed set of processes/processors.) But a forking PRAM still is a collection of sequential execution streams, not a single, parallel program.

"A forking PRAM is a PRAM in which a new processor is created when an existing processor executes a fork operation. In addition to creating the new processor, this operation specifies the task that the new processor is to perform (starting at the next time step)."

After defining a forking PRAM Fich never mentions it again. But she does make claims for PRAMs about "programmer's view," "ignoring lower level architectural details such as . . . synchronization," and "performance of algorithms on the PRAM can be a good predictor of their relative performance on real machines," that I completely disagree with. Since PRAMs exemplify the process paradigm (collection of sequential executions) they may adequately predict performance of algorithms and architectures similarly afflicted (all machines built so far), but certainly not what's possible.

Most distinctions between PRAMs involve memory access restrictions. The most restrictive is called exclusive-read exclusive-write (EREW). Ensuring that two processors don't write to the same memory cell simultaneously is trivial with a synchronous PRAM since all operations are under central control. The other four memory access restrictions are: concurrent-read exclusive write (CREW), COMMON, ARBITRARY, and PRIORITY. The last three restrictions differ in how concurrent writes are handled. With COMMON, values written at the same time to the same cell must have the same value. With ARBITRARY, any value written in the same step may be stored in the memory cell. With PRIORITY, the value written by the highest priority process (lowest process number) is stored in the memory cell.

Obviously an EREW program would run just fine on a CREW machine; the concurrent read hardware would not be used. The five memory restrictions Fich uses nest nicely in a power hierarchy:

PRIORITY>ARBITRARY>COMMON>CREW>EREW

Fich then spends much of her chapter showing how many steps it takes one PRAM to simulate a single step of another.

Proving that EREW can simulate a step of PRIORITY using O(log p) steps and p*m memory cells takes Fich two pages of wordy prose. Essentially the simulation takes three phases:

Phase 1: Each processor that wants to write marks its dedicated cell from the p EREW cells used for each PRIORITY cell.

Phase 2: By evaluating (in parallel) a binary tree of all processors that want to write the highest priority processor is found.

Phase 3: The winning processor writes the "real" memory cell.

Since the height of the binary tree is O(log p) it takes O(log p) steps for an EREW PRAM to emulate a PRIORITY PRAM. Much of Fich's chapter is devoted to convoluted schemes by which one model simulates another without mentioning any applications that benefit from a more powerful and expensive model.

Gibbons finally provides a reasonable PRAM model in the very last chapter of this 1000 page book. None of the other authors constrain themselves to reasonable models, thus limiting the value of their work. In Gibbons' PRAM model:

"Existing MIMD machines are asynchronous, i.e., the processors are not constrained to operate in lock-step. Each processor can proceed through its program at its own speed, constrained by the progress of other processors only at explicit synchronization points. However in the (synchronous) PRAM model, all processors execute a PRAM program in lock-step with one another. Thus in order to safely execute a PRAM program on an asynchronous machine, there must be a synchronization point after each PRAM instruction. Synchronizing after each instruction is inherently inefficient since the ability of the machine to run asynchronously is not fully exploited and there is a (potentially large) overhead in performing the synchronization. Therefore our first modification to the PRAM model will be to permit the processors to run asynchronously and then charge for any needed synchronization." (underline mine)

Additionally shared memory accesses are charged more than local accesses: "To keep the model simple we will use a single parameter, d, to quantify the communication delay to memory. This parameter is intended to capture the ratio of the median time for global memory accesses to the median time for a local operation."

I invented the Layered class of multistage interconnection networks ›U.S. Pat. No. 4,833,468, LAYERED NETWORK, Larson et al., May 23, 1989! specifically to keep d small. For the Multitude architecture (interconnected by a Layered network) the ratio of time for global memory accesses to local memory accesses is about 3 to 4. This compares with a ratio of local memory access time to cache hit time of 20 to 30.

Gibbons' asynchronous PRAM model uses four types of instructions: global read, global write, local operation, and synchronization step. The cost measures are:

    ______________________________________
    Instruction            Cost
    ______________________________________
    local operation        1
    global read or write   d
    k global reads or writes
                           d + k - 1
    synchronization barrier
                           B
    ______________________________________


The parameter B=B(p), the time to synchronize all the processors, is a nondecreasing function of the number of processors, p, used by the program. In the Asynchronous PRAM model, the parameters are assumed to obey the following constraints: 2.ltoreq.d.ltoreq.B.ltoreq.p. However a reasonable assumption for modeling most machines is that B(p).di-elect cons.O(d/log p) or B(p).di-elect cons.O(d/log p/log d).

Here again is evidence of a logP performance hit for barrier synchronization.

A synchronous EREW PRAM can be directly adapted to run on an asynchronous PRAM by inserting two synchronization barriers, one after the read phase and write phase of the synchronous PRAM time step. Thus a single step of a synchronous PRAM can be simulated in 2B+2d+1 time on an asynchronous PRAM. Therefore a synchronous EREW PRAM algorithm running in t time using p processors will take t(2B+2d+1) time on an asynchronous PRAM--a significant penalty. Gibbons shows how to maintain the O(Bt) running time but use only p/B processors by bunching the operations of B synchronous processors into a single asynchronous processor thus getting more work done for each barrier synchronization.

Gibbons transforms an EREW algorithm for the venerable "all prefix sums" operation for an asynchronous PRAM yielding an O(B log n/log B) time algorithm which isn't too bad. Gibbons presents asynchronous versions of descaling lemmas (performing the same application with fewer processors) which tack on a term for the barrier synchronizations. Still he uses many barriers in order to obtain behavior similar to synchronous PRAMs. Only when applications are synchronized just when they need it, and those needed synchronizations are really fast, B.apprxeq.d.apprxeq.1, will scaled-up parallel processing be possible, which is the contribution of this invention and Layered networks.

Gibbons considers several algorithms that run in O(B log n/log B) time. In particular, he proves an interesting theorem about summing numbers:

Given n numbers stored one per global memory location, and the following four types of instructions: L:=G, L:=L+L, G:=L and "barrier," where L is a local cell and G is a global cell, then the sum of n numbers on an Asynchronous PRAM with this instructions set requires .OMEGA.(B log n/log B) regardless of the number of processors. This is interesting because Multitude can sum n integers in O(1) time| Gibbons concludes with "subset synchronization" in which only a subset of processors in the machine need synchronize. Since the barriers involve fewer processors, and thus are shorter, summing numbers runs marginally faster.

So the best PRAM model I could find charges PRAM algorithms for their "cheating," but still offers no insights about how to efficiently compute on an MIMD machine.

Temporal Logic

Moszkowski in his Executing Temporal Logic Programs ›Cambridge University Press, 1986! presented a temporal-logic programming language. His language, Tempura, allows description of parallel execution (unsatisfactorily) within a single program expression. The foundation for Tempura is interval temporal logic. A Tempura interval is a non-empty sequence of states. A state is determined by the values of program variables at a particular instant of time. The sequence of states that form an interval provide a discrete notion of time.

The semantics of Tempura were much different than any I had previously encountered. Instead of programs being a step-by-step recipe for how to do the computation, they were logical formulas that are "satisfied" by constructing a "model" that makes the formula "true." Unfortunately the models constructed are simply sequences of states--the process model underlying PRAMs.

The semantics of what is referred to herein as DANCE programs, although also temporal logic formula based, are satisfied by constructing models with a different and superior mathematical structure that allows efficient execution on a plurality of computational engines. It is this mathematical structure that forms the core of this invention.

BACKGROUND--MATHEMATICAL FOUNDATIONS

In this section of the specification, we explain the notation used throughout this document and some necessary elementary notions from mathematical logic. Although these notations are fairly standard, this presentation is adapted from Apt and Olderog's book ›Verification of Sequential and Concurrent Programs, Krzysztof R. Apt, and Ernst-Ruidiger Olderog, Springer-Verlag, 1991!.

Sets

It is assumed that the reader is familiar with the notion of a set, a collection of elements. Finite sets may be specified by enumerating their elements between curly brackets. For example {T, F} denotes the set consisting of the Boolean constants T (true) and F (false). When enumerating the elements of a set, I sometimes use ". . . " as a notation. For example, {1, . . . , n} denotes the set consisting of the natural numbers from 1 to n where the upper bound, n, is a natural number that is not further specified.

More generally, sets are specified by referring to some property of their elements:

{x.vertline.P}

denotes the set consisting of all elements x that satisfy the property P. The bar, .vertline., can be read "such that." For example,

{x.vertline.x is an integer and x is divisible by 2}

denotes the infinite set of all even integers.

We write a A to denote that a is an element of the set A, and b A to denote that b is not an element of A. Often it is convenient to refer to a given set when defining a new set. We write

{x.ANG. A.vertline.P}

as an abbreviation for {x.vertline.x.di-elect cons.A and P}.

Some sets have standard names .o slashed. denotes the empty set, Z denotes the set of all integers, and N.sub.0 denotes the set of all natural numbers including zero.

In a set one does not distinguish repetitions of elements. Thus {T, F} and {T, T, F} are the same set. Similarly, the order of elements is irrelevant. In general, two sets A and B are equal if-and-only-if the have the same elements; in symbols: A=B.

Let A and B be sets. Then A.OR right.B (and B.OR left.A) denotes that A is a subset of B, A.andgate.B denotes the intersection of A and B, A.orgate.B denotes the union of A and B, and A-B denotes the set difference of A and B. Symbolically,

A.OR right.B if a.di-elect cons.B for every a.di-elect cons.A,

A.andgate.B={a.vertline.a.di-elect cons.A and a.di-elect cons.B},

A.orgate.B={a.vertline.a.di-elect cons.A or a.di-elect cons.B},

A-B={a.vertline.a.di-elect cons.A and a.epsilon slash.B}.

Note that A=B if both A.OR right.B and B.OR right.A. A and B are disjoint if they have no element in common that is A.andgate.B=.o slashed..

The definitions of intersection and union can be generalized to more than two sets. Let A.sub.i be a set for every element i of some other set J. Then

.andgate..sub.i.di-elect cons.J A.sub.i ={a.vertline.a.di-elect cons.A.sub.i for all i.di-elect cons.J}

.andgate..sub.i.di-elect cons.J A.sub.i ={a.vertline.a.di-elect cons.A.sub.i for some i.di-elect cons.J}

For a finite set A, card(A) denotes the cardinality, or the number of elements of A. For a non-empty, finite set A.OR right.Z, min(A) denotes the minimum of all integers in A.

Tuples

In sets the repetition of elements and their order is irrelevant. If ordering matters, we use ordered pairs and tuples. For elements a and b, not necessarily distinct, (a,b) is an ordered pair or simply pair. Then a and b are called components of (a,b). Two pairs (a,b) and (c,d) are identical (a,b)=(c,d), if-and-only-if a=c and b=d.

More generally, let n be any natural number, n.di-elect cons.N.sub.0. Then if a1, . . . , an are any n elements then (a1, . . . , an) is an n-tuple. The element ai, where i {1, . . . , n} is called the i-th component of (a1, . . . , an). An n-tuple (a1, . . . , an) is equal to an m-tuple (b1, . . . , bm) if-and-only-if n=m and ai=bi for all i {1, . . ., n}. More generally, let n be any natural number, n.di-elect cons.N.sub.0. Then if a.sub.1, . . . , a.sub.n are any n elements, then (a.sub.1, . . . , a.sub.n) is an n-tuple. The element a.sub.i, where i.di-elect cons.{1, . . . , n} is called the i-th component of (a.sub.1, . . . , a.sub.n). An n-tuple (a.sub.1, . . . , a.sub.n) is equal to an m-tuple (b.sub.1, . . . , b.sub.m) if-and-only-if n=m and a.sub.i =b.sub.i for all i.di-elect cons.{1, . . . , n}. Note that 2-tuples are pairs. Additionally, a 0-tuple is written as (), and a 1-tuple as (a) for any element a.

The Cartesian product A.times.B of sets A and B consists of all pairs (a,b) with a.di-elect cons.A and b.di-elect cons.B. The n-fold Cartesian product A.sub.1 x . . . xA.sub.n of sets A.sub.1, . . . , A.sub.n consists of all n-tuples (a.sub.1, . . . , a.sub.n) with a.sub.i .di-elect cons.A.sub.i for i.di-elect cons.{1, . . . , n}. If all A.sub.i are the same set A, then the n-fold Cartesian product Ax . . . xA is also written A.sup.n.

Relations

A binary relation R between sets A and B is a subset of the Cartesian product A.times.B, that is, R.OR right.A.times.B. If A=B then R is called a relation on A. For example,

{(a,1),(b,2),(c,2)}

is a binary relation between {a,b,c} and {1,2}. More generally, for any natural number n an n-ary relation R between A.sub.1, . . . , A.sub.n is a subset of the n-fold Cartesian product A.sub.1 x . . . xA.sub.n, that is, R.OR right.A.sub.1 x . . . xA.sub.n. Note that 2-ary relations are the same as binary relations. Instead of 1-ary and 3-ary relations one usually uses unary and ternary instead.

Consider a relation R on a set A. R is called reflexive if (a,a).di-elect cons.R for all a.di-elect cons.A; it is called irreflexive if (a,a).epsilon slash.R for all a.di-elect cons.A. R is called symmetric if for all a,b.di-elect cons.A whenever (a,b).di-elect cons.R then also (b,a).di-elect cons.R; it is called antisymmetric if for all a,b.di-elect cons.A whenever (a,b).di-elect cons.R and (b,a).di-elect cons.R, then a=b. R is called transitive if for all a,b,c.ANG.A whenever (a,b).di-elect cons.R and (b,c).di-elect cons.R, then also (a,c).di-elect cons.R.

The transitive, reflexive closure R* of a relation R on a set A is the smallest transitive and reflexive relation on A that contains R as a subset. The transitive, irreflexive closure R.sup.+ of a relation R on a set A is the smallest transitive and irreflexive relation on A that contains R as a subset. The transitive, irreflexive closure is necessary because the graphs considered later do not have self-loops. The "plus" superscript may not be standard.

The relational composition R.sub.1 .smallcircle.R.sub.2 of relations R.sub.1 and R.sub.2 on a set A is defined as follows:

R.sub.1 .smallcircle.R.sub.2 ={(a,c).vertline.there exists b.di-elect cons.A with (a,b).di-elect cons.R.sub.1 and (b,c).di-elect cons.R.sub.2 }.

For any natural number n the n-fold relational composition R.sup.n of of a relation R on a set A is defined inductively as follows:

R.sup.0 ={(a,a).vertline.a.di-elect cons.A},

R.sup.n+1 R=R.sub.n .smallcircle.R.

Note that

R*=.orgate..sub.n.di-elect cons.No R.sup.n

and that

R*=.orgate..sub.n.di-elect cons.N0-{0} R.sup.n =R*-R.sup.0

where R.sup.1 =R.

Membership of pairs in a binary relation R is mostly written in infix notation, so instead of (a,b).di-elect cons.R one usually writes aRb.

Any binary relation R.OR right.A.times.B has an inverse R.sup.-1 .OR right.B.times.A defined as:

bR.sup.-1 a iff aRb.

Functions

Let A and B be sets. A function or mapping from A to B is a binary relation .function. between A and B with the following special property: for each element a.di-elect cons.A there is exactly one element b.di-elect cons.B with a.function.b. Mostly we use prefix notation for function application and write .function.(a)=b instead of a.function.b. For some functions, however, we use postfix notation and write a.function.=b. In both cases b is called the value of .function. applied to the argument a. To indicate that .function. is a function from A to B we write

.function.:A.fwdarw.B.

The set A is called the domain of .function. and the set B is the co-domain of .function..

Consider a function .function.:A.fwdarw.B and some set X.OR right.A. Then the restriction of .function. to X is denoted by .function.›X! and defined as the Intersection of .function. (which is a subset of A.times.B) with X.times.B:

.function.›X!=.function..andgate.(X.times.B).

We are sometimes interested in functions with special properties. A function .function.:A.fwdarw.B is called one-to-one or injective if .function.(a.sub.1)=.function.(a.sub.2) for any two distinct elements a.sub.1, a.sub.2 .di-elect cons.A; it is called onto or subjective if for every element b.di-elect cons.B there exits an element a.di-elect cons.A with .function.(a)=b; it is called bijective if it is both injective and surjective.

Consider a function whose domain is a Cartesian product, say .function.:A.sub.1 x . . . xA.sub.n .fwdarw.B. Then it is customary to drop one pair of parentheses when applying .function. to an element (a.sub.1, . . . , a.sub.n).di-elect cons.A.sub.1 x . . . xA.sub.n. We also say .function. if an n-ary function.

Consider a function whose domain and co-domain coincide, say .function.:A.fwdarw.A. An element a.di-elect cons.A is called a fixed point of .function. if .function.(a)=a.

Sequences

In the following let A be a set. A sequence of elements from A of length n.gtoreq.0 is a function .function.:{1, . . . , n}.fwdarw.A. We write a sequence .function. by listing the values of .function. without punctuation in the order of ascending arguments, that is, as

a.sub.1 . . . a.sub.n

where a.sub.1 =.function.(1), . . . , a.sub.n =.function.(n). Then a.sub.1, i.di-elect cons.{1, . . . , n} is referred to as the i-th element in the sequence a.sub.1 . . . a.sub.n. A finite sequence is a sequence of any length n.gtoreq.0. A sequence of length 0 is called the empty sequence and is denoted by .epsilon..

We also allow countably infinite sequences. An infinite sequence of elements from A is a function .xi.: N.sub.0 .fwdarw.A. To exhibit the general form of an infinite sequence .xi. we typically write

.xi.:a.sub.0 a.sub.1 a.sub.2 . . .

if a.sub.i =.xi.(i) for all i.di-elect cons.N.sub.0. Then i is called the index of the element a.sub.i.

Consider now relations R.sub.1, R.sub.2, . . . on a set A. For any finite sequence a.sub.0 . . . a.sub.n of elements from A with

a.sub.0 R.sub.1 a.sub.i, a.sub.1 R.sub.2 a.sub.2, . . . , a.sub.n-1 R.sub.n a.sub.n

we write a finite chain

a.sub.0 R.sub.1 a.sub.1 R.sub.2 a.sub.2 . . . a.sub.n-1 R.sub.n a.sub.n.

For example, using the relations = and < on Z, we may write

a.sub.0 =a.sub.1

We apply this notion also to infinite sequences. Thus for any infinite sequence a.sub.0 a.sub.1 a.sub.2 . . . of elements of A with

a.sub.0 R.sub.1 a.sub.1, a.sub.1 R.sub.2 a.sub.2, a.sub.2 R.sub.3 a.sub.3, . . .

we write an infinite chain

a.sub.0 R.sub.1 a.sub.1 R.sub.2 a.sub.2 R.sub.3 a.sub.3 . . .

Such a chain stabilizes if from some index n onwards all elements a.sub.i with i.gtoreq.n are identical

a.sub.0 R.sub.1 a.sub.1 R.sub.2 a.sub.2 . . . a.sub.n-1 R.sub.n a.sub.n =a.sub.n+1 =a.sub.n+2 =. . .

Apt and Olderog use chains to describe the computations of programs. This invention does not. Chains represent the sequential paradigm of computation that permeates computer science. It is the fundamental limiting factor for both the kind of programs Apt and Olderog can handle, and the efficiency with which those programs can be executed.

Strings

A set of symbols is often called an alphabet. A string over an alphabet A is a finite sequence of symbols from A. For example 1+2 is a string over the alphabet {1,2,+}. The syntactic objects considered in this treatise are strings. We shall introduce several classes of strings: expressions, assertions, temporal predicates, programs, and correctness formulas.

We use .tbd. for the syntactic identity of strings. For example 1+2=1+2 but not 1+2=2+1. The symbol = will be used for the semantic equality of objects. Thus 1+2=2+1.

The concatenation of strings s.sub.1 and s.sub.2 yields the string s.sub.1 s.sub.2 formed by first writing s.sub.1 and then s.sub.2 without intervening spaces. A string t is called a substring of a string s if there exist strings s.sub.1 and s.sub.2 such that s=s.sub.1 ts.sub.2. Since s.sub.1 and s.sub.2 may be empty, s is always a substring of itself.

Partial Orders

A partial order is a pair (A,) consisting of a set A and reflexive, antisymmetric, and transitive relation on A. If xy for some x,y.di-elect cons.A, we say that x is less than or equal to y or y is greater than or equal to x. Since is reflexive, xx. Sometimes we consider irreflexive partial orders. These are pairs (A, ) consisting of a set A and an irreflexive and transitive relation on A.

Consider now a partial order (A, ). Let a.di-elect cons.A and X.OR right.A. Then a is called the least element of X if a.di-elect cons.X and ax for all x.di-elect cons.X. The element a is called an upper bound of X if xa for all x.di-elect cons.X. Note that upper bound of X need not be elements of X. Let U be the set of all upper bounds of X. Then a is called the least upper bound of X if a is the least element of U. Similarly for greatest lower bound.

A partial order is called complete if A contains a least element and if for every ascending chain

a.sub.0 a.sub.1 a.sub.2 . . .

of elements from A the set

{a.sub.0, a.sub.1, a.sub.2, . . . }

has a least upper bound.

Graphs

A graph is a pair (V, E) where V is a finite set of vertices

{v.sub.1, v.sub.2, . . . , v.sub.n }

and E is a finite set of edges where each edge is a pair of vertices

{(v.sub.1, v.sub.2), (v.sub.1, v.sub.3), . . . , (b.sub.i, v.sub.j)}.

All graphs we consider are directed in that the order of vertices within the pair describing an edge is significant.

The set of edges E forms a relation on the set of vertices:

E.OR right.V.times.V.

The transitive, irreflexive closure of E, namely E.sup.+, is especially important. In the temporal logic created for DANCE (i.e. this invention) E.sup.+ is a partial order that corresponds to the concept of precedes.

A graph (V, E) is a lattice if E.sup.+ is an irreflexive partial order, it has a least element 1.di-elect cons.V, and an upper bound u.di-elect cons.V. All executions of DANCE programs will create a lattice of states.

Lattices may be combined to form new lattices in two ways. Consider two lattices (V.sub.1, E.sub.1) and (V.sub.2, E.sub.2) that have no vertices in common, V.sub.1 .andgate.V.sub.2 =.o slashed., least elements l.sub.1 .di-elect cons.V.sub.1 and l.sub.2 .di-elect cons.V.sub.2, and upper bounds u.sub.1 .di-elect cons.V.sub.1 and u.sub.2 .di-elect cons.V.sub.2. Their sequential combination may be performed as follows: substitute u.sub.1 for l.sub.2 in V.sub.2 and E.sub.2 and form the union of the vertices and edges, (V.sub.1 .orgate.V.sub.2, E.sub.1 .orgate.E.sub.2). Their insertion combination may be performed as follows: choose an edge e.di-elect cons.E.sub.1, e=(v.sub.i, v.sub.j), remove it from E.sub.1, substitute v.sub.i for l.sub.2 and v.sub.j for u.sub.2 in V.sub.2 and E.sub.2 and form the union of the vertices and edges, (V.sub.1 .orgate.V.sub.2, E.sub.1 .orgate.E.sub.2). We assert, without proof, that sequential and insertion combination of lattices results in a lattice.

Proofs

A mathematical proof is a sequence of statements formulas or expressions, each of which is an axion of the logical system used, a given assumption, or a theorem derived by an inference rule from statements earlier in the sequence. The last theorem is usually the desired truth to be established such as the program meets its specifications and therefore is correct. We will exhibit proofs in the following form:

    ______________________________________
    (1) theorem 1
     ›explanation why theorem 2 is derived from theorem 1!
    (2) theorem 2
    .
    .
    (n-1) theorem n-1
    ›explanation why theorem n is derived from theorems 1 to n-1!
    (n) theorem n
    ______________________________________


Sometimes the parenthetical theorem numbers or obvious explanations may be omitted. The symbol iff will be used for if-and-only-if. The symbol .E-backward. will be used for there exists. The symbol .A-inverted. will be used for for all.

Induction

Proofs frequently rely on induction. Usually this involves proving some property P that has a natural number parameter, say n. To prove that P holds for all n.di-elect cons.N.sub.0, it suffices to proceed by induction on n, organizing the proof as follows:

induction basis: Prove that P holds for n=0.

Induction step: Prove that P holds for n+1 from the induction hypothesis that P holds for n.

The induction principle for natural numbers is based on the fact that the natural numbers can be constructed by beginning with the number 0 and repeatedly adding 1. By allowing more general construction methods, one obtains the principle of structural induction, enabling the use of more than one case at the induction basis and the induction step.

Grammars

To facilitate parsing as well as defining attributes to be synthesized or inherited, the grammar for DANCE has been written in Larson normal form or LNF.

LNF production rules:

Non-terminals are enclosed in angle brackets: .

A production consists of a single non-terminal followed by ::= and then an expression in LNF describing what can be substituted for the non-terminal. Production expressions have have only three forms:

1) A single terminal.

::="skip"

2) A pair of non-terminals.

::=

3) A single nonterminal.

::=

Literals are strings enclosed by quotation marks e.g. "MODULE"; the quote character itself is represented by """.

For convenience, the language subset that is used for explaining this invention (simple DANCE) will use a more liberal notation.

Productions within <>

Literals within ""

One or More .sup.+

Zero or more *

Choice I

Grouping ›!

Zero or One ()

::="(".sup.+ ")"

is just an identifier grammatically, but is helpful to know when a "function" name is expected, that must have a function of that name delcared somewhere.

Data, States, and Meaning

A model is a triple (T, S, ) containing a set of data types T, a set of domains containing states S, and an interpretation giving meaning to every symbol used in the language. For now, consider the data type domain, T, to be the integers. A state is a function mapping variables to values in T. For n state s in S and a variable X, let s ››X!! denote X's value in state s. Each k-place function symbol f has an interpretation s››f!! which is a function mapping k elements in T to a single value:

s››f!! ›(T.sup.k .fwdarw.T)!

What this means is that every function symbol (like "plus") has a "meaning" (like ››plus!!) that is a mapping from zero or more numbers (in this case two) to a single number.

s››plus!! ›(T.sup.2 .fwdarw.T)!

The line above doesn't say how "plus" works, just that it takes two numbers and returns one. Meanings aren't rigorously defined in this application for common arithmetic functions. Rather than develop axioms for number theory, the meanings of function symbols rely on intuition.

Moszkowski uses temporal predicates to affect side effects (cause state change). This may cause confusion since for correctness proofs we need assertions made with first-order predicate calculus. Keep the distinction clear between the text that represents executable program (temporal predicates) and text used to assert some property of state variables (assertions). Interpretations of predicate symbols are similar but map to truth values:

s››p!! ›(T.sup.k .fwdarw.{true, false})

Commands in the DANCE language are temporal predicates. Program execution tries to construct a model that makes all the temporal predicates "true." A model is a lattice or states that satisfies the temporal formula represented by program text.

OBJECTS AND ADVANTAGES

Several objects and advantages of my invention, the model of computation devised for the Definitive, Axiomatic Notation for Concurrent Execution (DANCE) computer programming language are:

Execution of many parts of the same program simultaneously, such that,

hundreds, thousands, or more processors may be efficiently used;

program correctness may be proved;

special data structures may be created that allow many elements within the data structure to be simultaneously accessed without locking, blocking, or semaphores;

the meaning of every language construct is precisely and mathematically defined; and

during program execution, processors may simultaneously self-schedule or post work for other processors to do.

Expanding and explaining the objects and advantages:

"Execution of many parts of the same program simultaneously"

The fundamental concern of this invention is reducing the time of program execution by using may processors (i.e. speedup). Existing computer architectures and computational models underlying parallel programming languages encounter fundamental, theoretical limitations of speedup. I invented the "Layered" class of multistage interconnection networks to permit construction of parallel computers comprising thousands of processors connected with a Layered network to create a fundamentally different computer architecture I called "Multitude." Unfortunately, the models of computation embodied explicitly or implicitly in programming languages adapted for parallel computers (e.g. Ada) would not allow full exploitation of the Multitude architecture computer's power and efficiency. Therefore, I invented a new computational model that allows a programmer to describe, and the machine to utilize, maximal opportunities for simultaneous execution of many parts of the same program. I endeavored to create an embodiment, the DANCE language, whose form is familiar (patterned after Ada) but whose semantics (i.e. meaning, or computational model) are radically different.

"hundreds, thousands, or more processors may be efficiently used"

Techniques that allow a dozen or two processors to cooperate effectively generally do not work for hundreds or thousands of processors. That is, the techniques don't "scale." Efficiency is of utmost importance; keeping the processors busy (almost) always doing something useful economically employs expensive equipment. Machines that use 512 processors to deliver speedup around 25 have not been, and will not be, purchased by savvy, private-sector users. Speedup greater than one thousand for a broad range of applications requires all aspects of an application to scale from description in an inherently-parallel language down to the topology of the network physically interconnecting the processors.

"program correctness may be proved"

Poor quality software is endemic today. No other product's flaws are so tolerated by society. Buggy programs inevitably result from the "group hack" methodology: code-test-fix-test-fix-release-get bug reports from users-fix-test-release-get more bug reports-etc.-etc.-etc. Testing can only show the presence of errors, not their absence. Only when a proof that a program meets its specifications has been exhibited and confirmed can we be sure that a program does what we say it does and nothing more. Correctness proofs are even more important for parallel programs; debuggers that rely on a single locus of control to step through program executions cannot be extended for debugging of parallel programs. Furthermore, parallel programs must assure noininterference--one part of the execution doesn't adversely modify the values of variables needed by another part of the execution. This invention is precise, mathematical semantics for parallel programming languages which necessarily permits formal correctness proofs.

"special data structures may be created that allow many elements within the data structure to be simultaneously accessed without locking, blocking, or semaphores"

Achieving efficiency requires existing applications to be reprogrammed using data structures that can be accessed in parallel. Constructs that ensure exclusive use of some object by forcing other users to wait inevitably impede execution thus destroying efficiency. Many such constructs have been proposed or tried: critical sections, monitors, semaphores, locks, spin waiting, snoopy caches, message passing, etc. This invention incorporates semantics for combinable operations in which many or all of the processors can access the same variable at the same clock cycle yielding results as if the accesses occurred one-at-a-time. Provision by Layered networks of combinable operations for Multitude architecture computers is crucial for efficient program execution. Automatic conversion of existing, sequential programs for parallel execution is futile. Efficient parallel execution demands data structures, and their method of access, be radically different from their sequential counterparts.

"the meaning of every language construct is precisely and mathematically defined"

The semantics of most programming languages are defined in a "natural" language, usually English. Therefore a program compiled by different compilers for the same machine may not behave the same, though both compilers are "correct." In reality the compiler defines language semantics; compiler bugs are language features. Only when language and program semantics are precisely (formally, mathematically) defined can an implementation be deemed "correct." Furthermore, proof of program correctness demands formal language semantics.

"during program execution, processors may simultaneously self-schedule or post work for other processors to do"

When an idle processor can fetch its own work to do simultaneously with many other processors processors, can be kept busy doing useful work. Self-scheduling is essential for good efficiency. In addition anonymous processor scheduling greatly facilitates fault-tolerance; permanent failure of a processor, switch, or memory requires checkpoint/restart of the program and on-the-fly hardware reconfiguration to allow graceful degradation. Clever, but brittle, mapping schemes of applications onto architectures generally implode with the loss of a single processor or link.

Additional objects and advantages such as elimination of cache coherency, concurrent memory allocation cache write-back and others will become apparent from the specification and drawings.

DESCRIPTION OF DRAWINGS

FIG. 1 is a diagram showing convention sequential execution and has one edge from every node except the final state and one edge to every node except the initial state; the final state is reachable from any state in the graph and any node is reachable from the initial state;

FIG. 2 is a diagram showing convention parallel execution as several (or many) sequential executions;

FIG. 3 is a diagram illustrating barrier synchronization which ensures that every process has completed a particular computation before any process may start its next computation;

FIG. 4 represents a simple execution in which the values of b and a may be computed concurrently;

FIG. 5 diagrammatically show sequential intervals. Sequential intervals may be represented with either text or a graph;

FIG. 6 illustrates a simple state lattice in accordance with the invention.

FIG. 7 diagrammatically illustrates an arbitrary interval. The interval, i, has multiple, concurrent computations;

FIG. 8 represents a sequential composition in which two intervals are composed sequentially by fusing the "end" of one interval with the "start" of another;

FIG. 9 illustrates a simple parallel composition in which two intervals may be composed in parallel by fusing their "start" states together and fusing their "end" states together;

FIG. 10 illustrates universal quantification: Universal, temporal quantification creates many concurrent computations each with a different value for the quantified variable, in this case X;

FIG. 11 illustrates diagrammatically an assignmentiIterval which functions to evaluate an expression using the values contained in its "start" node; the end node has the new value as the state of the assigned variable, in this case X;

FIG. 12 represents a conventional concurrently accessible queue: Using fetchadd to determine the array index for either get or put allows many intervals to access the queue concurrently;

FIG. 13 illustrates the inserting of records into a linked list: Two records can be inserted into the same place at the same time using a swap operation;

FIG. 14 illustrates a conventional linked-list queue: Using swap to manipulate pointers allows concurrent access to a dynamically-sized queue;

FIG. 15 is a graph of a single node showing an initial state condition. The "first" node contains the values for variables at the beginning of the program, before any computation occurs;

FIG. 16 illustrates the initial state of procedure "sort". Because the actual parameter names (what the caller calls the variables) and the formal parameter names (used within a procedure) usually differ we show two nodes rather than just one. Since there is no state change the two nodes can be collapsed into one;

FIG. 17 graphically shows an interval to satisfy "sort": Four subintervals are inserted into the lattice for "sort" to first partition the data (51), followed by sorts on the upper (54) and lower (57) halves, concurrent with storing the pivot (60);

FIG. 18 illustrates an interval to satisfy "partition";

FIG. 19 shows concurrent intervals satisfying the alternative command within "partition";

FIGS. 20, 21 and 22 when arranged as shown in FIG. 22A illustrate as an examplee a lattice satisfying a temporal logic formula for sorting members of the Swedish Bikini Team;

FIG. 23 shows schematically the Multitude architecture which connects processors with a layered network;

FIG. 24 shows an exemplary processing node in a Multitude architecture computer;

FIG. 25 represents diagrammatically a physical address comprised of node number and memory address;

FIG. 26 illustrates the memory of each node partitioned into global, local, and spread;

FIG. 27 shows a main memory word as comprising are 84 bits wide;

FIG. 28 represents the translation of segmented address to an effective virtual address;

FIG. 29 represents a translation of an effective virtual address to a physical address;

FIG. 30 shows a free-space list for blocks of 2 n bytes;

FIG. 31 shows the segment capability bit assignment;

FIG. 32 illustrates page descriptor fields;

FIG. 33 represents the static linking with link record;

FIG. 34 shows a lattice of states satisfying a DANCE temporal formula;

FIG. 35 illustrates hidden registers;

FIG. 36 shows that MMU translates a segmented address to a physical address;

FIG. 37 shows how a User Sergment Table defines Segmented-to-EVA translation;

FIG. 38 illustrates TLB hit address translation;

FIG. 39 illustrates TLB hit address translation with page descriptor;

FIG. 40 illustrates that a cache line holding four words needs 335 bits;

FIG. 41 shows that cache hits are determined by encoding virtual address to select cache lines, and then compare that user, segment, and offset match;

FIG. 42 is a diagram showing nesting of domains in proof of noninterference;

FIG. 43 shows count faces; and

FIG. 44 is an instruction-set architecture table for Multitude architecture processors. It describes the functionality of every instruction and defines it in terms of temporal logic.

SUMMARY

Efficient, massively-parallel computing requires work scheduling that keeps every processor busy almost all the time. The dismal 3% to 5% efficiency found by the Army High-Performance Computing Research Center at the University of Minnesota (AHPCRC) across a broad range of applications results from inherent limitations in currently-available massively-parallel computers.

This is not for lack of trying.

A new class of multistage interconnection networks to provide rapid, voluminous communication between RISC processors is disclosed in my earlier U.S. Pat. No. 4,833,468 entitled "LAYERED NETWORK". The shared-memory, multiprocessor architecture based on my "Layered" class of networks has been dubbed "Multitude". Although Multitude nicely supports Ada ›ANSI/MIL-STD-1815A! tasking and rendezvous are both difficult to use and retain an inherent sequential paradigm. Therefore, I tried to define a new language, "Samantha", that would allow programmers to define applications without circumscribing opportunities for parallel execution.

The objective of efficient computing with 1000s of processors has daunted computer architects for two decades. Achieving the objective requires major changes in the operation of virtually every aspect of hardware and software, with the important exception of fabrication technology. Sub-micron CMOS from any of a dozen silicon foundries will do fine. The trick to making a faster machine with slower parts is to buy more of those parts for less money, and then actually use a much bigger fraction of the parts doing something useful. Additionally, as fabrication technology improves, it can readily be incorporated without changing the architecture or the software.

The sequential paradigm is so powerful that it is difficult to think otherwise. It is natural to think of parallel computing as a plurality of sequential execution streams. Sometimes they cooperate by exchanging messages; sometimes they cooperate by sharing state variables; sometimes they wait until all other processors have finished a phase before proceeding to the next phase. Sequential languages have been augmented with means to fork off nearly identical copies of themselves, with hints how to use the data in parallel, and with libraries of pre-compiled procedures for message passing or shared-memory access. In all of these cases the paradigm is still replicated sequential control flow.

Virtually all of today's computer science theory deals with sequentiality or replicated sequentiality. Huge tomes of algorithms have been published and studied that are either sequential control flows or replicated sequential control flows. Regular expressions are recognized by finite-state machines; programs are parsed using a push-down automation coupled with a handle-recognizing finite-state machine. The compiler-construction tool, yacc, known and generally loathed by computer science students can only parse a subset of context-free grammars: LALR(1). The very definition of the LALR(1) class of languages has sequentiality built into it--those languages which can be parsed by push-down automata coupled with a handle-recognizing finite-state machines. The pull of sequentiality is so strong many people will not be able set it aside to consider a different way of computing.

DESCRIPTION OF A PREFERRED EMBODIMENT OF THE INVENTION

A mathematical model of computation is presented with the salient advantage that the computation may be performed by hundreds or thousands of microprocessors simultaneously and efficiently. Five mathematical systems interrelate in special ways. The five systems are: temporal logic, domains, type theory, assertions and proofs. Representing programs as special temporal logic formulae that are satisfied (executed) by lattices of state transitions form the core of this invention and provide the simultaneity and efficiency desired. Domains contain variable-value bindings that restrict visibility of variables allowing necessary proofs of non-interference, but without the locks, semaphores, or messages that hobble the prior art attempts. The type theory adapts an elegant, lambda-calculus-based, polymorphic, object-oriented type system for the temporal logic execution espoused here. Assertions in first-order predicate calculus define specifications and allowable intermediate states for programs. Since only proofs can assure program correctness, a proof system called BPL associates inference rules and/or weakest-precondition predicate transformers with each language production. A proof of soundness for BPL is presented; a logic is sound if it infers true fact from true facts.

Computational Model

A model of computation provides a mathematical foundation for the meaning of programs and their execution. Hopcraft and Ullman ›Introduction to Automata Theory, Languages, and Computation, Addison-Wesley, 1979! present a series of increasingly complex sequential machines and their models. Hopcraft and Ullman show correspondences between the machines and classes of languages recognized by those machines. State transitions are defined for an input character, the current state, and perhaps some internal data structure such as a stack. All the machines are sequential. Hoare ›"Communicating Sequential Processes," C. A. R. Hoare, Communications of the ACM, Vol. 21, No. 8, August 1978! expands the sequential model by allowing multiple sequential machines. Others use models such as dataflow (Petri nets), logic programming (Horn clauses), or functional (lambda calculus) without much greater utility.

None of these models of computation adequately expresses parallel execution. Therefore I invented the state-lattice temporal logic model for DANCE.

Temporal Logic

Moszkowski in his Executing Temporal Logic Programs treats programs as temporal logic formulae. His language, "Tempura", allows description of traditional, process-based, parallel execution within a single program expression. The foundation for Tempura is interval temporal logic. A Tempura interval is a non-empty sequence of states. A state is determined by the values of program variables at a particular instant of time. The sequence of states that form an interval provide a discrete notion of time.

The formal model for DANCE presented here differs significantly from the formal model for Tempura:

States form a lattice not a sequence.

The domains visible at a particular state are both restricted and explicit.

Later a sophisticated, polymorphic type system will be substituted for the simple integer types presented here. Again later a first-order predicate calculus for describing sets of states very similar to those used by Apt/Dijkstra/Gries/Hoare for states in a guarded command language program is presented. A logical system for reasoning about the behavior of programs in terms of interval temporal logic lattices forms another significant, original contribution of this exposition.

Each DANCE interval may be thought of as a non-empty lattice of states partially ordered by their occurrence in time. A state is the values of a set of variables at an instant in time. The set of variables visible by a particular state is its domain.

An interval can have many intervening states between any pair of states. An interval may also be split but the subintervals must have a starting and ending state in common (i.e. every subinterval is itself a lattice, combinations of lattice must form lattices). Intervals are strung end-to-end by sequential composition invoked by the semicolons in the above program. Intervals are strung side-by-side by parallel composition invoked the ampersand.

Consider the following DANCE program fragment: ##EQU1##

First `a` is assigned the value ten. Then the values of `b` and `a` are computed in parallel. Finally `a` is assigned the sum of `a` and `b`.

In this paper, interval lattices will be depicted as a set of circles (or stretched circles) connected by undirected edges. However those edges do have direction in that time flows down the page. FIG. 4 illustrates the lattice corresponding to the previous example.

In state 301 both a and b have no value. State 302 has a assigned 10. States 303 and 304 are independently determined and may occur simultaneously. State 305 is the synchronization point for the independent computations of 303 and 304. Finally state 306 is computed. The heavy lines between states like 307 are state transitions. The actual computation occurs during state transitions like 307 and are shown as arcs between the nodes holding state.

Please note that this model explicitly rejects the notion of some "universal" sequence of states comprised of states of individual processes in "some" sequential order. Such models although popular provide poor paradigms for programming parallel processors.

Commands in the DANCE language are really temporal predicates. Program execution tries to construct a model that makes all the temporal predicates true. A model is a lattice of states that satisfies the temporal formula represented by program text. The use of the word "model" here refers to creation of a lattice of states that satisfies a particular temporal logic formula embodied in a DANCE program. The "model" of computation for DANCE refers to the general method of constructing lattices of states concurrently.

Expressions and Predicates

This section starts to use all the mathematical machinery in the "Background--Mathematical Foundations" section hereof to define the meaning of DANCE programs, which is the model of execution that forms the core of this invention. Obviously, one could merely change the grammar and symbols used in the DANCE language, but if the underlying meaning of such languages and programs is based on temporal logic formulae satisfied by lattices of states, it would not depart from the spirit and scope of the invention.

Expressions in DANCE are built inductively as follows:

Variable names or constants X, Y, Z, 5, 1509, 0

Functions: f(e.sub.1, . . . , e.sub.k), where f is a function symbol defined by M,

k.gtoreq.0, and e.sub.1, . . . , e.sub.k are expressions.

DANCE grammar for expressions:

::="("+")"

::=

::="("")"

::=

::=

The following are expressions:

X+Y

(X+Y)/Z

somefunction(X Y Z)

Formulas (constructions of temporal predicates) are built inductively as follows:

Predicates p(e.sub.1, . . . , e.sub.k), where p is a predicate symbol defined by M, k.gtoreq.0, and e.sub.1, . . . , e.sub.k are expressions. Like functions, we assume a basic set of relations complete with syntax.

Logical connectives: w.sub.1 & w.sub.2 and w.sub.1, w.sub.2 where w.sub.1 and w.sub.2 are predicates. The ampersand corresponds to "and." The semicolon corresponds to "after." One more logical connective (bounded universal quantification) will be added in a following section.

Parentheses may be used for grouping expressions. Predicates may be grouped with BEGIN-END.

DANCE programs are temporal logic formulas. Execution of a program seeks to construct an interval (e.g. model) that makes the formula true.

DANCE grammar for temporal logic formulas:

::="("+")"

::=";"

::="&"

::="BEGIN" "END"

The following are examples of formulas:

someprocedure(X Y Z)

dothis ; beforethat

dothis & together

BEGIN dothis ; beforethat END & someprocedure(X Y Z)

Intervals

From the states in S, intervals are constructed from S+, the set of all nonempty, finite lattices of states. A state is the values assigned to variables at a particular point in time. FIG. 4 depicts a lattice of states of which 305 is a state.

For example, suppose the state variables are X and Y and states s, t, and u are:

s=(X=1, Y=2) t=(X=0, Y=3) u=(X=0, Y=6)

Sequential intervals can be represented as a string of state names enclosed in <>. Some examples are

::=<(X=1, Y=2)>

::=<(X=1, Y=2) (X=0, Y=3) (X=0, Y=3) (X=0, Y=6) (X=1, Y=2)>

::=<(X=0, Y=3) (X=0, Y=3) (X=0, Y=3)>

FIG. 5 shows these sequential intervals in graph form.

DANCE expands the definition of an interval to a finite lattice instead of a sequence of states. This provides the mathematical machinery upon which correct, efficient, parallel execution will be based.

FIG. 6 depicts a simple lattice interval containing states s, t, u, v, w, and x. A textual representation could be: ##STR1## but notation good for sequential intervals becomes cumbersome for DANCE lattices so a graphical representation will be used instead.

A lattice of states, L, is a graph whose nodes are states, S.sub.L, and arcs are state transitions, T.sub.L, that must have the following properties:

An irreflexive partial order relation, .sub.L, such that

a .sub.L b=.E-backward.n>0: .E-backward.t.sub.1, . . . , t.sub.n .di-elect cons.T.sub.L : ›t.sub.1 =(a,s.sub.1) and t.sub.n =(s.sub.n-1, b) and

.A-inverted.1

A least upper bound (lub) and greatest lower bound (glb) functions are defined for any pair of states in S.sub.L as

      ______________________________________ lub(a,b)=c .congruent. a .sub.L c 
      and b .congruent..sub.L c and not .E-backward. d .di-elect cons. S.sub.L 
      : ›a .sub.L d and b .sub.L d and d .sub.L c!. glb(a,b)=c .congruent. c .sub.L 
      a and c .sub.L b and not .E-backward. d .di-elect cons. S.sub.L : ›d .sub.L 
      a and d .sub.L b and c .sub.L d! ______________________________________ 
      

There exists a distinguished states "start" and "end" defined as

start(L)=.A-inverted. s .di-elect cons.L:glb(s,start)=start,

end(L)=.A-inverted. s .di-elect cons.L:lub(s,end)=end.

Since time flows downward in interval graphs, the start state (which is the greatest lower bound of every state in the interval) is actually at the top of the graph. Conversely, the end state is at the bottom of the graph.

Construction of a lattice state graph is the act of computing. At the end, the values bound to variables are the results. State lattices may be hierarchically structured and composed.

FIG. 6 is a simple temporal lattice. In this case the lattice, L, is composed of states S.sub.L ={s,t,u,v,w,x}, (reference numbers 310, 311, 312, 313, 314, and 315 respectively) and state transitions T.sub.L ={(s,t), (s,v), (t,u), (v,w), (u,x), (w,x)}, (reference numbers 316, 317, 318, 319, 320, and 321). The partial order relation L corresponds to the notion of before. State s, 310, occurs before state w, 314, sL w, because state transitions t.sub.1 =(s,v) 317 and t.sub.2 =(v,w) 319 are in T.sub.L. The transitions form a chain from s 310 to w 314. Least upper bound (lub) and greatest lower bound (glb) are defined for all pairs of nodes. For example, the lub of u 312, and v 313 is node x 315. Similarly the glb of u 312, and v 313 is node s 310. The distinguished states start(L) and end(L) are s 310 and x 315.

Being a "partial" order provides opportunities for parallelism. Since neither u.sub.L w nor W.sub.L u, the transitions (t,u) 318, and (v,w) 319 may occur independently, even simultaneously. Good DANCE programs create execution lattices that are short and broad.

Temporal Meaning

Let I denote the set of all possible intervals. Consider an arbitrary interval iI, composed of a lattice of states with distinguished states start(i) 101 and end(i) 102 with intermediate states 103 and state transitions 307 as depicted in FIG. 7.

In defining meaning for DANCE language constructs over interval will be denoted by a capital M subscripted with an interval or state designator, the language construct in double brackets, ›› !!, the equality-by-definition symbol, .tbd., followed by some expression as to the meaning of the construct:

M.sub.interval ››construct!!.tbd.definition

Meaning of Variables

The meaning of variable names is defined by:

.sub.j ››X!!=.sub.s ››X!! where X is a variable name, X.di-elect cons.s, and s.di-elect cons.S.sub.1.

This says "the meaning of X for the interval, i, is defined as the value of X in each state, s, in the interval." If X is not a state variable of s, or s is not a state in the interval i, then X's meaning is undefined.

Meaning of Functions

Where f is a function name and e.sub.1, . . . , e.sub.k are expressions or variable names, the meaning of function symbol, f, is defined by:

.sub.i ››f(e.sub.1, . . . , e.sub.k)!!=.sub.i ››f!! (.sub.i ››e.sub.1 !!, . . . , .sub.i ››e.sub.k !!)

This says "the meaning of an expression in the interval (i) is the meaning of the function symbol (f) applied to the meaning of each of the subexpressions f uses as its operands."

DANCE grammar for functions:

::="("+")"

::=

::="FUNCTION" "("›.sup.+ ":"!.sup.+ ")" "RETURN" ("IS" )";"

Meaning of Temporal Predicates

Where p is a temporal predicate name (i.e. a procedure name) and e.sub.1, . . . , e.sub.k are expressions:

.sub.i ››p(e.sub.1, . . . , e.sub.k)!!=.sub.i ››p!!(.sub.i ››e.sub.1 !!, . . . , .sub.i ››e.sub.k !!)

This says "the meaning of a temporal predicate in the interval (i) is the meaning of the predicate symbol (p) in the interval applied to the meaning of each of the subexpressions p uses as its operands."

Simple DANCE grammar for procedures:

::="(" .sup.+ ")"

::="PROCEDURE" "(" ›.sup.+ ":" !.sup.+ ")" ("IS" )";"

Temporal Formulas

Formulas are compositions of temporal predicates. Such compositions will be covered in the later sections. For now consider that a formula is a simple temporal predicate.

A formula w is satisfied by an interval i iff the meaning of w on i is true :

:››w!!=true.

"iff" means "if and only if".

Execution of a program constructs an interval so that the interval satisfies the formula that expresses the program. If all intervals satisfy a formula that formula is deemed valid. However, valid formulas (tautologies) are rather uninteresting programs. Constructing an interval, a lattice of states, is the execution of the program. The predicates in interval temporal logic become commands in DANCE; they affect side effects.

The simplest temporal formula (and the first one to be defined) is "skip."The predicate "skip" is always true.

.sub.1 ››SKIP!!=true.

Simple DANCE grammar for skip temporal formula:

::="SKIP"

Sequential Composition

Moszkowski emphasizes about his "chop" operator. It is simply a sequential composition indicated by semicolon:

w.sub.1 ; w.sub.2

where w.sub.1 and w.sub.2 are formulas means

.sub.i ››w.sub.1 ; w.sub.2 !!=true iff

.E-backward.j,k:j.OR right.l and k.OR right.i: (.sub.j ››w.sub.1 !!=true) and (.sub.k ››w.sub.2 !!=true) and (end(j)=start(k))

This says "the meaning of w.sub.1 ; w.sub.2 is true in the interval i if-and-only-if there exists some subintervals of i, namely j and k, such that the meaning of w.sub.1 is true during j and the meaning of w.sub.2 is true during k, and k starts at the end of j." Fig. 8 depicts sequential composition of intervals.

Note that the two subintervals of i share the state end(j)=start(k), and w.sub.1 is true on the first part of the interval and w.sub.2 is true on the last part.

Simple DANCE grammar for sequential composition:

::=";"

Parallel Composition

Conjunction of formulae provides the parallelism of DANCE:

w.sub.1 & w.sub.2

where w.sub.1 and w.sub.2 are formulas means

.sub.i ››w.sub.1 & w.sub.2 !!=true iff

there exist two subintervals of i, j.OR right.i and k.OR right.i, such that

.sub.j ››w.sub.1 !!=true, .sub.k ››w.sub.2 !!=true,

start(j)=start(k)=start(i), and end(j)=end(k)=end(i).

This says "the meaning of w.sub.1 & w.sub.2 over the interval i is true if-and-only-if the meaning of w.sub.1 is true over some path of states, j, from start(i) to end(i) in the interval i and the meaning of w.sub.2 is concurrently true over some different path, k, of states from start(i) to end(i) in the interval i."

Where w.sub.1 is true on ad w.sub.2 is true on as in FIG. 9. Therefore w.sub.1 & w.sub.2 is true over the whole interval.

Moszkowski doesn't recognize such bifurcation. Rather he uses the more traditional arbitrary interleaving of states. Though he makes reference to parallel computing, his intervals are always sequences, it's just that several predicates must be true over the same sequence of states. In this example the states v, and w contain variables seen only by w.sub.1. Similarly the states t and u contain variables seen only by w.sub.2. This is the essential, paradigmatic difference in DANCE from other models. Thus the importance of local variables and limited scoping in DANCE provided by domains.

Simple DANCE grammar for parallel composition:

::="&"

Temporal Existential Quantification

For many computations additional, local variables are needed. Temporal Existential quantification allows introduction of such local variables whose scope is limited. The keyword DECLARE (instead of .E-backward.) introduces such variables as in:

DECLARE X=c begin w end

where X is any variable, c is a constant, and w is a formula in which X occurs free.

.sub.i ››DECLARE X=c begin w end!!=true iff

.sub.start(1) ››X!!=c, and .sub.1 ››w!!=true.

The definition of "free variable" is formally quite complex. Suffice to say that variable X is free in formula w iff X occurs in w and nowhere else. The scoping rules of DANCE are more permissive however. Even if X is used outside of the existential quantifier, within w all uses of X refer to the local variable. DANCE uses a syntax similar to Ada; the predicate above comes from the Ada production . Existential quantification provides scoping within the model of computation. Variables introduced in the declaration part of blocks, procedures, and packages (implicitly) invoke existential quantification over their scope. Existential quantification readily generalizes to many variables. Later, we will require such existentially quantified variables to have type declarations.

Simple DANCE grammar for existential quantification:

::="DECLARE" ›":" ("=" )!.sup.+ "BEGIN" "END"

Temporal Bounded Universal Quantification

Bounded universal quantification allows a group of formulae to be executed together, but with different parameters. The keyword FORALL is used instead of .A-inverted..

FORALL X IN b DO w

where X is any variable, b is a boolean expression or range, and w is a formula means:

(DECLARE X=x.sub.1 begin w end) & (DECLARE X=x.sub.2 begin w end) & . . . & (DECLARE X=x.sub.n begin w end)

where x.sub.1, x.sub.2, . . . , x.sub.n are values for X that make b true. For now consider b to be a subrange of integers such as 0.ltoreq.X.ltoreq.12, which would initiate twelve copies of w each having a local constant X ranging from 0 to 11.

.sub.i ››FORALL X IN b DO w!!=true iff

.E-backward.n:n.gtoreq.0:

(.A-inverted. x.sub.k :1.gtoreq.k.gtoreq.n:b.sup.x.sub.xk) and (.A-inverted.y :: b.sup.x.sub.y y .di-elect cons.{x.sub.1, x.sub.2, . . . , x.sub.n }) and

(.E-backward.j1 . . . jn:j1.OR right.i . . . jn.OR right.i:

.sub.1 ››DECLARE X=x.sub.1 begin w end!!=true, and

.sub.2 ››DECLARE X=x.sub.2 begin w end!!=true, and . . . and

.sub.jn ››DECLARE X=x.sub.n begin w end!!=true)

Bounded universal quantification (FORALL) is the source in DANCE for massively parallel processing opportunities; each subinterval can be executed concurrently. FIG. 10 depicts universal quantification.

All program variables must be introduced with either existential or universal quantification. A range such as ›1 . . . 10! may substitute for the boolean expression, b, with the understanding that ›j . . . k! means j.ltoreq.X.ltoreq.k; X takes on all the discrete values from j to k inclusive.

Simple DANCE grammar for universal quantification:

::="FORALL" ":" "IN" ›.vertline.!"DO"

Assignment

So far we have no way to modify the values of state variables. Such a language has limited utility| Therefore temporal assignment is needed:

X:=e

where X is a variable name and e is an expression. The value of X at the end of the subinterval becomes the value of e evaluated in the state at the beginning of the subinterval. Formally:

.sub.i ››X:=e!!=.sub.end(i) ››X!!=.sub.start(i) ››e!!.

This says "the meaning of X (its value) at the end of the interval i is the meaning of e evaluated at the beginning of the interval i." FIG. 11 shows an interval satisfying an assignment formula.

Simple DANCE grammar for assignment:

::=":="

Alternative Command

The alternative command provides data dependent branching.

.sub.i ››1F B1.gtoreq.S1 ›!B2.gtoreq.S2 ›!. . . ›!Bn.gtoreq.Sn F1!!=true if f

(.sub.start(i) ››B1!!or .sub.start(i) ››B2!! or . . . or .sub.start(i) ››Bn!!) and

(.sub.start(i) ››B1!!.sub.i ››S1!!) and (.sub.start(i) ››B2!!.sub.i ››S2!!) and . . . and

(.sub.start(i) ››Bn!!.sub.i ››Sn!!)

Simple DANCE grammar for alternative command:

::="IF" ".gtoreq."

›"›!" ".gtoreq." !*"FI"

Iterative command:

The iterative command provides data dependent looping:

    ______________________________________
    .sub.i ››DO B1 => S1  B2 => S2 .sub.1... Bn => Sn OD!! = true iff
    .E-backward.j1...jc ; j1.di-elect cons.i ... jc.OR right.i:
     (.orgate..sub.k.epsilon.›l..c! jk = i) and
     (start(i)=start(j1) and end(j1)=start(j2) and ...and
     end(jc)=end(i) ) and
     .A-inverted.k : 0<=n:
      ( (.sub.start(jk) ››B1!! .sub.jk ››S1!!) and (.sub.start(jk) ››B2!!
    .sub.jk ››S2!!)
      and ... and (.sub.start(jk) ››Bn!! .sub.jk ››Sn!!) )
    ______________________________________


Simple DANCE grammar for iterative command:

::="DO" ".gtoreq."

›"›!" ".gtoreq." !* "OD"

Combinable Operations

Combinable operations are crucial to obtaining high efficiency in parallel processing. Combinable operations allow access to a broad range of data structures in parallel without mutual exclusion. The meaning of a fetch-add (other combinable operations are similar) is that the value of a shared variable is fetched (read) and its retained value is indivisibly changed to the sum of its previous value and the parameter provided in the fetch-add. If two (or more) processors fetch-add the same variable at the same time, they receive values as if the fetch-adds occur in some sequential order, even though the fetch-adds really occur at the same instant.

Previous language constructs were defined over an interval with some unspecified number of intervening states between the initial state and the final state (e.g. assignment). Originally I had not used the "next" (O) temporal operator, but definition of combinable operations demands it. The meaning of the "next" operator over an interval i is for some state s.di-elect cons.i, O(expression) is true if and only if the expression is true in state t.di-elect cons.i and no other state occurs between s and t.

.sub.s ››O(e)!!=true

iff

.sub.t ››e!!=true and glb(s,t)=s and lub(s,t)=t and

not (.E-backward. u.di-elect cons.i:glb(s,u)=s and lub(u,t)=t).

A single fetch-add where X is the shared variable e is the expression to be added (usually an integer constant) and Y is the variable to receive the fetched value is defined as follows:

.sub.s ››FETCH.sub.-- ADD(X,e,Y)!!=true

iff

.sub.s ››O(Y=X)!!=true and .sub.s ››O(X=X+e(!!=true.

Multiple combinable operations are more difficult to define. The new value held by the shared variable is straightforward the new value is the sum (and, or, xor) of the old value and the sum (and, or, xor) of the values contained in the combinable operations. The returned values are more problematic; they are the result of some, nondeterministic ordering as if they occurred one at a time. The Layered network combines requests and decombines responses for combinable operations. Since which processor executes a part of an interval is unspecified, the apparent sequential order cannot be known. Fortunately it doesn't need to be.

Let ndperm(e.sub.1, . . . , e.sub.k) be a nondeterministic permutation of expressions resulting in some scrambling of their order:

.sub.i ››ndperm(e.sub.1, . . . , e.sub.k)!!=(e.sub.1s, . . . , e.sub.ks)

where

.A-inverted.j:1.ltoreq.j.ltoreq.k:

.E-backward.m:e.sub.m =e.sub.js and

.A-inverted.n:1.ltoreq.n.ltoreq.k, n=j:not(e.sub.m =e.sub.n).

Now with ndperm, fetch-add can be defined for simultaneous execution:

.sub.k ››FETCH.sub.-- ADD(X,e.sub.1,Y.sub.1) & FETCH.sub.-- ADD(X,e.sub.2,Y.sub.2) & . . . & FETCH.sub.-- ADD(X,e.sub.k, Y.sub.k)!!=true

iff

there exists some ndperm(e.sub.1, . . . , e.sub.k)=(e.sub.1s, . . . , e.sub.ks) such that

.sub.ik ››O(X=X+e.sub.1 +e.sub.2 +. . . +e.sub.k !!=true and .sub.ik ››O(Y.sub.1s =X)!!=true and

.sub.ik ››O(Y.sub.2s =X+e.sub.1s)!!=true and .sub.ik ››O(Y.sub.3s =X+e.sub.1s +e.sub.2s)!!=true and . . . and

.sub.ik ››O(Y.sub.ks =X+e.sub.1s +e.sub.2s +. . . +e.sub.k-1s)!!=true.

What the above expressions say is that the shared variable X gets all the values contained in the fetch-ads added to it and the values returned are the original X value plus some subset of other values from other fetch-adds. The reason why this is important is that the behavior is as if the fetch-adds occurred in some sequential order when in fact they occurred at the same instant. Combinable operations are why "interleaving" of states are inappropriate. Any languages or architectures for which interleaving provides an adequate model are doomed to poor efficiency.

The fetch-and, fetch-or, and fetch-xor are defined similarly:

.sub.ik ››FETCH.sub.-- AND(X,e,Y)!!=true iff

.sub.ik ››O(Y=X)!!=true and .sub.ik ››O(X=X AND e)!!=true.

.sub.ik ››FETCH.sub.-- AND(X,e.sub.1,Y.sub.1) & FETCH.sub.-- AND(X,e.sub.2,Y.sub.2) & . . . & FETCH.sub.-- AND(X,e.sub.k,Y.sub.k)!!=tru e

iff

there exists some ndperm(e.sub.1, . . . , e.sub.k)=(e.sub.1s, . . . , e.sub.ks) such that

.sub.ik ››O(X=X AND e.sub.1 AND e.sub.2 AND . . . AND e.sub.k !!=true and

.sub.ik ››O(Y.sub.1s =X)!!=true and .sub.ik ››O(Y.sub.2s =X AND e.sub.1s)!!=true and

.sub.ik ››O(Y.sub.3s =X AND e.sub.1s AND e.sub.2s)!!=true and . . . and

.sub.ik ››O(Y.sub.ks =X AND e.sub.1s AND e.sub.2s AND . . . AND e.sub.k-1s)!!.

.sub.ik ››FETCH.sub.-- OR(X,e,Y)!!=true iff

.sub.ik ››O(Y=X)!!=true and .sub.ik ››O(X=X OR e)!!=true.

.sub.ik ››FETCH.sub.-- AND(X,e.sub.1,Y.sub.1) & FETCH.sub.-- OR(X,e.sub.2,Y.sub.2) & . . . & FETCH.sub.-- AND(X,e.sub.k,Y.sub.k)!!=true

iff

there exists some ndperm(e.sub.1, . . . , e.sub.k)=(e.sub.1s, . . . , e.sub.ks) such that

.sub.ik ››O(X=X OR e.sub.1 OR e.sub.2 OR . . . OR e.sub.k !!=true and

.sub.ik ››O(Y.sub.1s =X)!!=true and .sub.ik ››O(Y.sub.2s =X OR e.sub.1s)!!=true and

.sub.ik ››O(Y.sub.3s =X OR e.sub.1s OR e.sub.2s)!!=true and . . . and

.sub.ik ››O(Y.sub.ks =X OR e.sub.1s OR e.sub.2s OR . . . OR e.sub.k-1s)!!.

.sub.ik ››FETCH.sub.-- XOR(X,e,Y)!!=true iff

.sub.ik ››O(Y=X)!!=true and .sub.ik ››O(X=X XOR e)!!=true.

.sub.ik ››FETCH.sub.-- XOR(X,e.sub.1,Y.sub.1) & FETCH.sub.-- XOR(X,e.sub.2,Y.sub.2) & . . . & FETCH.sub.--XOR(X,e.sub.k,Y.sub.k)!!=true

iff

there exists some ndperm(e.sub.1, . . . , e.sub.k)=(e.sub.1s, . . . , e.sub.ks) such that

.sub.ik ››O(X=X XOR e.sub.1 XOR e.sub.2 XOR . . . XOR e.sub.k !!=true and

.sub.ik ››O(Y.sub.1s =X)!!=true and .sub.ik ››O(Y.sub.2s =X XOR e.sub.1s)!!=true and

.sub.ik ››O(Y.sub.3s =X XOR e.sub.1s XOR e.sub.2s)!!=true and . . . and

.sub.ik ››O(Y.sub.ks =X XOR e.sub.1s XOR e.sub.2s XOR . . . XOR e.sub.k-1s)!!=true.

The swap operation behaves differently. When a single swap occurs at a time, the value of the shared variable becomes the parameter of the swap and the value returned is the previous shared variable value.

.sub.ik ››SWAP(X,e,Y)!!=true iff

.sub.ik ››O(Y=X)!!=true and .sub.ik ››O(X=e)!!=true.

When many swaps occur simultaneously one processor will get the value of the shared variable the others will get values from some other swap, and the shared variable, the others will get values from some other swap, and the shared variable will get the swap value no processor receives.

.sub.ik ››SWAP(X,e.sub.1,Y.sub.1) & SWAP(X,e.sub.2,Y.sub.2) & . . . & SWAP(X,e.sub.k,Y.sub.k)!!=true

iff

there exists some ndperm(e.sub.1 ... e.sub.k)=(e.sub.1s, . . . , e.sub.ks) such that

.sub.ik ››O(X=e.sub.ks !!=true and .sub.ik ››O(Y.sub.1s =X)!!=true and

.sub.ik ››O(Y.sub.2s =e.sub.1s)!!=true and .sub.ik ››O(Y.sub.3s =e.sub.2s)!!=true

and . . . and .sub.ik ››O(Y.sub.ks =e.sub.k-1s)!!=true.

The swap combinable operation is good for manipulating pointers in dynamic data structures. The fetch-add is good for indexing into a shared array. The fetch-and fetch-or and fetch-xor are good for clearing, setting, and toggling flags respectively. Virtually all of the DANCE programs use combinable operations to manipulate shared variables to access data structures in parallel without mutual exclusion. Hardware support for combinable operations is necessary for high efficiency. In addition to combining in the Layered network, I'd like combinable operations in the processor's ISA and memory management unit. Although combinable operations are expected to represent a small percentage of all instructions executed, they (can) eliminate parallelism-destroying mutual exclusion and blocking.

Temporal Domains

Temporal domains hold bindings between identifiers (variables) and values. All state variables occur in some domain and at most one domain. Domains hold not only changeable states, but complete packages which contain procedures and functions as well as state variables.

Domains are essential for proving noninterference--that no other processor is modifying the variables any particular processor is currently using. Noninterference is crucial to showing that parallel activities can occur together correctly ›"An Axiomatic Proof Technique for Parallel Programs," Susan Owicki and David Gries, Acta Informatica 6, Springer-Verlag 1976!.

Domain Trees

All domains for a single parallel program running on a single Multitude computer (which is comprised of many processors) form a rooted, directed tree. Intervals being executed (causing state change) exist in some leaf domain. The temporal predicates (commands) that control the interval can use any variable-value bindings in any domain below it in the tree. The domain tree will dynamically expand an contract as the program is executed. A new domain is created when a block is entered that declares local variables and is deleted when the block is exited.

Domain names are formed by a string of identifiers separated by periods representing the path in the domain tree from the root to the leaf. The local variables immediately declared within a subprogram are contained in a domain whose name is the subprograms identifier prepended (with a period) with the domain name in which it was invoked. Within subprograms, blocks that declare new variables may be encountered; a new domain leaf is grown. Non-nested blocks are assigned numbers starting with 1, in the order in which they are encountered. Nested blocks grow new domain leaves. Automatically numbered blocks cannot be confused with either named domains or subprogram domains since DANCE identifiers cannot start with digits. However domain identifiers and subprogram identifiers may be the same. However, an identifier's position in the name make clear whether it's a domain name or subprogram name.

In the following, d, d1, d2, etc. stand for domains and N(d) stands for the name of domain d. Define the domain operator , by:

››d!!={(n,v).vertline.n.di-elect cons.identifier, v.di-elect cons.Type(n)}

which means "the domain of d is a set of pairs in which the first element, n, is a program identifier and the second element, v, is a value in the type of n." Types are sets of values. A domain contains the domains of all its ancestors in the domain tree. Ancestor domain names are formed by truncating their descendent's names:

››d1!!.OR right.››d2!! iff .E-backward. s:string.vertline.N(d1) && "." SSs=N(d2)

where && is string concatenation. That is that domain d2 call "see" all the values in every domain d1 whose name is a shortened version of d2. All the values visible in domain d2 are either declared in the DANCE program text corresponding to d2, or are declared in some ancestor of d2.

Domains are neighbors if they have some variables they both can see:

neighbors(d1,d2)=››d1!!.andgate.››d2!!=.o slashed..

Remote domains have no common variables:

remote(d1,d2)=››d1!!.andgate.››d2!!=.o slashed..

Remote domains are needed to represent programs that span machine boundaries, i.e. distributed processing.

Simple DANCE program fragment:

    ______________________________________
    package body ex:example is
    declare
    A:float;
    procedure p is
    declare
    W:integer;
    begin
    command1
    end;
    procedure h is
    declare
    X:float;
    begin
    declare
            Y:bool;
    begin
            command2
    end
    command5
    ;
    declare
            Z:char;
    begin
            command3
            ;
            declare
              V:string;
            begin
              command4
            end
    end
    end;
    end package
    ______________________________________


In the above program fragment:

command 1 sees W and A;

command 2 sees A, X, and Y;

command 3 sees A, X, and Z;

command 4 sees A, X, Z, and V;

command 5 sees X and A.

FIG. 42 shows the domains for the example above. FIG. 42 is not a set diagram in the usual sense of a subset drawn inside its superset. The set of nameable and therefore usable variables for a particular domain are all the variables declared within the domain, union variables declared in surrounding domains. Domains are necessary for proof of noninterference. The easiest way to assure non-interference is by restricting access to critical variables. Semaphores protect variables by serializing access to critical sections. DANCE protects variables with domains. By relying on the fact that no other interval can "see" critical, local variables, noninterference is assured.

Remote Procedure Call

Distributed systems have domains that do not form a tree--no single domain exists within which all other domains are contained. No shared address space, no combinable operations, just messages, usually point-to-point sans confirmation. Remote procedure call (RPC) works just fine for much of what one would like to do at disparate geographical locations. (Note: I explicitly reject message passing paradigms for parallel processing that have one machine out of many microprocessors, in a single location. Only geography demands distribution.)

However there are worthwhile applications that demand geographical distribution that cannot utilize nested RPCs such as databases. RPCs are great for computation where the duration or lifetime of the domain which contains all accessed variables coincides with the duration of the invoked procedure. Such applications abound, i.e. photorealistic, 3D, animated rendering.

Sometimes the lifetime of state variables must exceed the lifetime of any procedure that uses it. Databases are like this. The traditional means of ensuring noninterference (and thus partial correctness) is with locks, semaphores, monitors or some other form of mutual exclusion which may lead to deadlock. This has caused a colorful consideration of resource allocation and deadlock detection starting with Dijkstra's dining philosophers and ending with global snapshots. Total correctness proofs cannot tolerate even the possibility of a deadlock occurring--the program must terminate| How to ensure noninterference between transactions that may wish to exclusively use the same values without risking deadlock?

Must have noninterference cannot permit deadlock. Noninterference without deadlock. But how?

1) Divide the database into a large number of domains which may reside on machines across the country.

2) Impose a total ordering on the domains.

3) Allow only one RPC to be active in a domain--queue other received RLPC invocations first-in-first-out.

4) Make database accesses that need to maintain consistency between values in different domains use an RPC chain in the order of the domains used.

The proof by induction of totally correct programs that invoke only other (already proven) totally correct subprograms with RPC is direct. Anything that is totally correct and is invoked with RPC will eventually terminate and return with the right value, and the caller will then terminate (the inductive hypothesis). The chains of RPCs that may occur must be finite so that the deepest invocation will terminate without making further invocations (the basis).

An RPC chain is a finite number of nested RPC calls each of which can be on a different machine. During the invocation phase, values of the database domains of interest are "read." During the return phase, values of the database are "written." Therefore any mapping from previous database values to new values is possible, and provide some initiating transaction the answer to, and confirmation of, its query.

The total ordering of domains makes it work. Suppose a single transaction occurs. It will create a chain through the database complete its last RPC and eventually terminate. If a transaction is stalled because its deepest RPC is stalled in a queue at a domain already in use, it will eventually be given access since the currently active RPC (and all RPCs ahead in the queue) will eventually terminate and release the domain (the inductive hypothesis). There is some "deepest" chain that occupies (has an active RPC chain) in the highest numbered domain. It will eventually terminate and release all its domains therefore possibly unblocking other RPC chains (the basis). Therefore all RPC chains will eventually terminate and total correctness can be proven.

This method only works efficiently when the transactions require domains evenly--each processor must know in which domain the value of interest lies. Tree-structured directories save much space and may work well for sequential accesses, but they cause "hot spots" that would substantially degrade the performance when distributed. On the other hand, if the number of domains were large enough so that the probability of any two transactions needing the same domain were small, the system would exhibit latency largely independent of the number of transactions--a blessing to travel agents worldwide.

Type Theory

Type systems are important. Most languages such as Pascal and C only allow types that are basic and constructed. Basic types generally correspond to types supported directly by the hardware like integers, floating point number and characters. Constructed types are built from the basic types to form arrays, lists, and records. Some languages such as Ada allow forms of polymorphism (many types) through operator overloading. Abstract data typing seemed to answer a desire for powerful, reusable, and encapsulated types. A paper by Cardelli and Wegner ›"On Understanding Types, Data Abstraction, and Polymorphism," Luca Cardelli and Peter Wegner, Computing Surveys, Vol. 17, No. 4, December 1985.! gave new vistas to my thinking about types. The type system of DANCE has been adapted from Cardelli and Wegner's type expression language Fun. Existentially and universally quantified types are realized through "packages" described in a following section.

Basic Types

The basic types of DANCE are:

Integer, Bool, Rat, Float, and Char

short for integer, boolean, rational, floating point, and character. Rational is included because exactness cannot be obtained with floating point numbers. A special value, NIL, is a member of each of the basic types representing no value. Uninitialized variables have the value NIL. Attempts to use variables whose values are NIL in expressions will result in error messages. However NIL can be assigned to variables.

Simple DANCE grammar for simple types:

;;="INTEGER".vertline."BOOL".vertline."RAT".vertline."FLOAT".vertline ."CHAR"

Constructed Types

Types may be constructed from basic types by the use of type expressions. Arrays, function spaces, records, and variant types may be constructed. Arrays are integer indexed and behave like functions as in ›The Science of Programming, David Gries, Springer-Verlag, 1981!. Accessing an array with an index out of bounds results in NIL. Functions are first-class and so may be passed as parameters. Record and variant types have standard usages.

Simple DANCE grammar for constructed types:

    ______________________________________
     ::= "(" ›+ .vertline. +!")"
     ::= "ARRAY" "›" + "!" "OF" 
     ::=  "=>" 
     ::= "RECORD" 


Parametric Types

Parametric types are type constructors that take type parameters to build a new type.

For example, a generic list could be defined and then specialized as with:

    ______________________________________
    TYPE List›Item! IS --declare a generic list
    VARIANT c: empty, cons;
    empty: NIL,
    cons: RECORD
    head: Item,
    tail : List›Item!
    END RECORD
    END VARIANT;
    TYPE IntList : List›Integer!;
    bind Integer to item to get IntLst
    ______________________________________


Simple DANCE grammar for parametric types:

    ______________________________________
     ::= "TYPE"  ( "›" .sup.+
     "!" )
    "IS"  ";"
     ::=  "›" .sup.+ "!"
    ______________________________________


Packages are Abstract Data Types

Packages provide program structuring and information hiding. The packages in DANCE are adapted from those in ›Algebraic Specifications in Software Engineering. Ivo Van Horebeek & Johan Lewi, Springer-Verlag, 1989! and inspired by packages in Ada ›ANSI/MIL-STD-1815A-1983 Reference Manula for the Ada Programming Language. American National Standard Institute, 1983.!.

Packages have two parts: a visible, declaration section, and a hidden body. The body is lot really hidden or secret, just that it is not within the scope that can be seen by other packages. All that a programmer should need to know about the services of a package should be visible in the declaration section.

Those familiar with object oriented languages like Smalltalk will recognize that packages define classes and that these classes are instantiated as objects. Such considerations will be finessed for the time being. The conception of executing a package body is adequate.

Packages have names. The items within a package can be named from other packages similar to records by prefacing the name of the item with the package's name and a period. Packages group together objects and operations that have useful purpose. Packages require lots of declarations. C hackers don't want to be bothered with scoping more sophisticated than "local" and "global." DANCE adds some powerful, yet esoteric declarations to Ada's. These declarations allow generic (polymorphic) programs that are parameterized with type (universal type quantification) and certain requirements for the awaited type (inheritance via bounded universal quantification). Ada's private type declarations serve as existential type quantification. DANCE's type system (including pointers or access types) allows definition of every type in either Ada or Fun, plus exotic types not found in either.

A requirement clause demands passing of a type to tie package. Often this type will be restricted to some form. Cardelli and Wegner's bounded universal quantification is realized through the use of requirement clauses.

An import clause specifies the use of another package's services. Such packages must be visible at compile time.

The private type declaration declares types that are defined within the package but may be used by other packages. The implementation of the types (and operations using such types) are hidden within the body of the package. This corresponds to Cadlelli & Wegner's existential quantification. The private type states that a type exists but does not tell how it works. Other packages may only manipulate existentially quantified types with operations declared by the package.

The package body contains the implementations of the operations. The relationships between each section will become clearer with later examples.

Simple DANCE grammar for packages:

    __________________________________________________________________________
     ::=  .vertline. 
     ::= "PACKAGE"  ( "›" + "!" ) "IS"
    *
    *
    .sup.+  ( "PRIVATE" .sup.+) "END" "PACKAGE"
     ::= "PACKAGE BODY"  ":"  "IS"
    .sup.+  "END" "PACKAGE"
     ::= "PROCEDURE"  "(" ›.sup.+
    ":" !