The present invention relates generally to simulation systems, and particularly to methods and systems for simulation using co-simulators.BACKGROUND OF THE INVENTION
Computerized simulation techniques are used for analyzing and solving complex computational problems in various fields, such as verifying the performance of complex electronic hardware designs. Many simulation techniques are known in the art. Some techniques use parallel processing in order to reduce simulation time.
For example, PCT International Publication WO 2009/118731, whose disclosure is incorporated herein by reference, describes a method for design simulation that includes partitioning a verification task of a design into a first plurality of atomic Processing Elements (PE) having execution dependencies. The method further includes computing an order for executing the PEs on a multiprocessor device, which includes a second plurality of processors operating in parallel and schedules the PEs for execution by the processors according to a built-in scheduling policy. The order induces concurrent execution of the PEs by different ones of the processors without violating the execution dependencies, irrespective of the scheduling policy. The PEs are executed on the processors in accordance with the computed order and the scheduling policy, to produce a simulation result. A performance of the design is verified responsively to the simulation result.
As another example, PCT International Publication WO 2010/004474, whose disclosure is incorporated herein by reference, describes a computing method that includes accepting a definition of a computing task, which includes multiple atomic Processing Elements (PEs) having execution dependencies. The computing task is compiled for concurrent execution on a multiprocessor device, which includes multiple processors that are capable of executing a first number of the PEs simultaneously, by arranging the PEs, without violating the execution dependencies, in an invocation data structure including a second number of execution sequences that is greater than one but does not exceed the first number. The multiprocessor device is invoked to run software code that executes the execution sequences in parallel responsively to the invocation data structure, so as to produce a result of the computing task.SUMMARY OF THE INVENTION
An embodiment that is described hereinbelow provides a method, which includes accepting a simulation task for simulation by a simulator that controls multiple co-simulators. Each of the multiple co-simulators is assigned to execute one or more respective sub-tasks of the simulation task. The simulation task is executed by invoking each co-simulator to execute the respective assigned sub-tasks.
In some embodiments, the simulation task is defined at any given time by a simulation state that is maintained by the simulator, and each sub-task is defined at any given time by a respective co-simulation state that is maintained independently of the simulator by the co-simulator assigned to execute the sub-task.
In some embodiments, executing the simulation task includes executing a given sub-task, which includes a value assignment statement, by a given co-simulator, returning a result of the value assignment statement from the given co-simulator to the simulator before execution of any non-blocking assignment if the value assignment statement is a blocking assignment, and returning the result after executing all blocking assignments if the value assignment statement is a non-blocking assignment. In an embodiment, returning the result includes scheduling the simulator to call the given co-simulator at a postponed time in order to accept the result of the value assignment statement.
In some embodiments, the method includes returning control of the simulation task from a given co-simulator to the simulator before completing execution of a given sub-task executed by the given co-simulator, in order to enable at least one other co-simulator to operate concurrently with the given co-simulator. Returning the control may include scheduling the simulator to call the given co-simulator at a postponed time in order to accept a result of the given sub-task.
In a disclosed embodiment, executing the simulation task includes executing at least one sub-task of the simulation task by the simulator. Executing the at least one sub-task may include synchronizing two or more sub-tasks, which are executed by two or more respective co-simulators, using the at least one sub-task executed by the simulator.
In some embodiments, the method includes representing the simulation task as a hierarchical tree structure, and partitioning the simulation task into the sub-tasks by recursively traversing the hierarchical tree structure. The method may include compiling the sub-tasks using one or more compilers. In an embodiment, compiling the sub-tasks includes compiling two or more of the sub-tasks using two or more respective, separate compilers.
In a disclosed embodiment, accepting the simulation task includes accepting a definition of a hardware design, and executing the simulation task comprises verifying the hardware design. The method may include executing at least part of given sub-task in a given co-simulator using a Graphics Processing Unit (GPU).
There is additionally provided, in accordance with an embodiment of the present invention, a method that includes invoking a co-simulator by a simulator to execute a sub-task of a simulation task, wherein the sub-task includes at least a value assignment statement. If the value assignment statement is a blocking assignment, a result of the of the value assignment statement is returned from the co-simulator to the simulator before execution of any non-blocking assignment. If the value assignment statement is a non-blocking assignment, the result is returned after executing all blocking assignments. The simulation task is executed by the simulator using the result of the sub-task.
There is also provided, in accordance with an embodiment of the present invention, a system that includes a simulator and multiple co-simulators. The simulator is configured to accept a simulation task, to assign each of the multiple co-simulators to execute one or more respective sub-tasks of the simulation task, and to execute the simulation task by invoking each co-simulator to execute the respective assigned sub-tasks.
There is further provided, in accordance with an embodiment of the present invention, a system that includes a simulator and a co-simulator. The co-simulator is configured to accept and execute a sub-task of a simulation task, wherein the sub-task includes at least a value assignment statement, to return a result of the of the value assignment statement before execution of any non-blocking assignment if the value assignment statement is a blocking assignment, and to return the result after executing all blocking assignments if the value assignment statement is a non-blocking assignment. The simulator is configured to invoke the co-simulator to execute the sub-task, and to execute the simulation task using a result of the sub-task returned from the co-simulator.
There is also provided, in accordance with an embodiment of the present invention, a computer software product. The product includes a tangible non-transitory computer-readable medium, in which program instructions are stored, which instructions, when read by a computer, cause the computer to accept a simulation task, to assign each of multiple co-simulators to execute one or more respective sub-tasks of the simulation task, and to execute the simulation task by invoking each co-simulator to execute the respective assigned sub-tasks.
There is additionally provided, in accordance with an embodiment of the present invention, a computer software product. The product includes a tangible non-transitory computer-readable medium, in which program instructions are stored, which instructions, when read by a computer, cause the computer to accept and execute a sub-task of a simulation task, wherein the sub-task includes at least a value assignment statement, to return a result of the of the value assignment statement before execution of any non-blocking assignment if the value assignment statement is a blocking assignment, and to return the result after executing all blocking assignments if the value assignment statement is a non-blocking assignment.
The present invention will be more fully understood from the following detailed description of the embodiments thereof, taken together with the drawings in which:BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION OF EMBODIMENTSOverview
Embodiments of the present invention that are described hereinbelow provide improved methods and systems for executing simulation tasks. Applications of the disclosed techniques may comprise, for example, verification of Very Large Scale Integration (VLSI), analog or control hardware designs. Other applications may be in the field of search engines.
In some embodiments, a simulation system comprises a simulator that controls multiple co-simulators. In order to execute a certain simulation task, each co-simulator is assigned to execute one or more sub-tasks of the simulation task. The simulator executes the simulation task by invoking each co-simulator to execute its assigned sub-task(s), as well as synchronizing and coordinating the operation of the multiple co-simulators.
Unlike some known parallel simulation schemes in which a simulation task is executed by multiple parallel processors, in the disclosed techniques each sub-task is defined by a respective state, and this state is maintained by the executing co-simulator independently of the simulator. Some disclosed embodiments provide interfaces and interaction mechanisms between the simulator and the co-simulators, which achieve data synchronization and concurrent execution across the system. An example process of partitioning a simulation task into a well-balanced set of sub-tasks is also described. The partitioning into multiple sub-tasks lends itself to parallel compilation. In some embodiments, different sub-tasks may be compiled by different, possibly parallel, compilers.
The use of multiple co-simulators provides considerable advantages. For example, the disclosed simulation systems are able to carry out large simulation tasks, which are beyond the memory and computational complexity capabilities of a single simulator or simulator/co-simulator pair. Moreover, partitioning a simulation task among multiple parallel co-simulators enables substantial reduction in simulation time. Since each co-simulator maintains the states of its sub-tasks independently, the disclosed schemes provide true offloading of the simulator and are therefore highly scalable.System Description
System 20 comprises a simulator 24 and one or more co-simulators 28. In the present example the system comprises N co-simulators denoted #1 . . . #N. Although the embodiments described herein refer mainly to configurations comprising multiple co-simulators, the disclosed techniques are similarly applicable to configurations having only a single co-simulator.
Simulator 24 is provided with a simulation task to be performed, e.g., a hardware design to be verified. In an example embodiment, the hardware design is represented using Verilog. Alternatively, the design may be represented using any other suitable Hardware Description Language (HDL), such as VHDL, or other Register Transfer Level (RTL) abstraction.
Simulator 24 is configured to run the simulation task, which is partitioned into sub-tasks that are to be executed by the simulator and by co-simulators 28. This kind of execution typically involves a preparatory processing stage in which the simulation task is partitioned into the sub-tasks, as will be described below.
Typically, each sub-task comprises a shared object having a respective database. After partitioning, the simulator provides at least some of the sub-tasks to the co-simulators, and operates the co-simulators to execute their designated sub-tasks. Note that some sub-tasks may be executed internally in the simulator and not delegated to the co-simulators. For example, test-bench sub-tasks (i.e., simulation control) and sub-tasks that synchronize the co-simulators are typically performed by the simulator. (Note that test-bench tasks may be executed by the co-simulators.) In some embodiments, the simulation task is partitioned by another processor or system, and then provided to simulator 24 in partitioned form.
In the context of the present patent application and in the claims, the term “co-simulator” refers to a processing unit that carries out a part of the simulation task having a respective internal state. Such internal states are not fully available to the simulator, which is only aware of the overall state that defines the overall simulation task. In other words, the co-simulators are stateful processing units, each maintaining the internal state of its designated part of the simulation task independently of the simulator. The simulator may synchronize and control the different co-simulators, but it is typically aware only of the overall state of the simulation and unaware of the internal states of the co-simulators.
This definition of co-simulator is in contrast to some known parallel processing schemes in which a simulation task is partitioned into parallel tasks for execution by multiple parallel stateless processors or processing cores. Co-simulators 28 are sometimes referred to herein as accelerators, and the two terms are used interchangeably herein. Delegating a certain simulation sub-task to a co-simulator is sometimes referred to as accelerating that part of the simulation task.
In the example embodiment of
The use of multiple co-simulators is advantageous for several reasons. For example, the size of the design that can be simulated is typically limited by the finite memory size of the co-simulator (e.g., GPU). Partitioning the design simulation task among multiple co-simulators enables the system to simulate large designs that are impossible to simulate with a single co-simulator. Moreover, the operating environment of the GPU (e.g., CUDA for NVIDIA GPUs) sometimes limits the memory size that is usable by a single simulation instance to less than the total memory size of the GPU. Proper partitioning into sub-tasks enables the system to exploit the entire memory space of the co-simulator. Furthermore, operating multiple co-simulators in parallel provides a considerable increase in simulation bandwidth and considerable reduction in simulation time.
The system configuration shown in
Simulator 24 and co-simulators 28 may be implemented in hardware, in software, or using a combination of hardware and software elements. In one example embodiment, the entire system 20 is implemented in a single server platform, with each co-simulator implemented using a respective server blade. Alternatively, however, any other suitable configuration can be used. Typically, each of CPU 32, CPU 44 and GPU 48 may comprise a general-purpose processor that is programmed in software to carry out the functions described herein. The software may be downloaded to the processor in electronic form, over a network, for example, or it may, alternatively or additionally, be provided and/or stored on non-transitory tangible media, such as magnetic, optical, or electronic memory.
In the present example, the simulation task comprises an electronic hardware design that is to be verified. The design is modeled using Verilog. Verilog is specified, for example, in IEEE standard 1364-2001, entitled “IEEE Standard Verilog Hardware Description Language,” 2001, and in IEEE standard 1800-2009, entitled “IEEE Standard for System Verilog—Unified Hardware Design, Specification, and Verification Language,” 2009, which are incorporated herein by reference. In this embodiment, simulator 24 and co-simulators 28 communicate using Verilog Procedural Interface (VPI). VPI is described in chapters 26-27 of the IEEE 1364-2001 standard, cited above. Alternatively, any other suitable interface or communication protocol, such as Programming Language Interface (PLI), can be used for communicating between simulator 24 and co-simulators 28. PLI is described in chapters 20-25 of the IEEE 1364-2001 standard, cited above. Other examples of interfaces are DirectC and Direct Kernel Interface.Hierarchical Simulation Task Partitioning
In the present example, the simulation task is partitioned into three non-accelerated sub-tasks (represented by NACC nodes 56 and 66, as well as TOP node 54 that is also non-accelerated) and six accelerated sub-tasks (represented by ACC nodes 58, 60, 62, 64, 68 and 70). Dashed frames 74, 76, 78, 80, 82 and 84 represent partitions that are assigned for execution in respective different co-simulators. In this example each co-simulator executes the sub-task represented by a single accelerated node. Alternatively, however, a given co-simulator may execute multiple sub-tasks corresponding to multiple accelerated nodes.
Typically, the root of any sub-tree that comprises accelerated nodes is a non-accelerated node. For example, non-accelerated node 56 is the root of the sub-tree comprising accelerated nodes 60, 62 and 64. Non-accelerated node 66 is the root of the sub-tree comprising accelerated nodes 68 and 70. Non-accelerated node 54 is the root of the sub-tree comprising node 58. In other words, any two accelerated nodes (i.e., any two sub-tasks assigned to co-simulators) synchronize via a non-accelerated node (i.e., synchronize via the simulator). In some embodiments, communication between accelerated nodes is carried out by a non-accelerated node through which they synchronize. Alternatively, accelerated nodes may communicate directly with one another, even though they are synchronized by a non-accelerated node. Simulator 24 (or any other processor or system) may partition the simulation task into hierarchical structures of this sort in any suitable manner. An example partitioning method is described in detail further below.Interaction Between Simulator and Co-Simulator(s)
In some embodiments, simulator 24 and co-simulators communicate using a mechanism that ensures synchronization and data integrity between the various signals and databases in the simulator and co-simulators. The following description refers to a system configuration in which the simulator and co-simulators communicate using VPI. The disclosed technique can be used, mutatis mutandis, with various other suitable interfaces or communication protocols.
Simulator 24 typically simulates a Verilog design in a sequence of simulation time slots. The processing of a given co-simulator 28 in a given time slot is divided into an ingress phase, an execution phase and an egress phase. In the ingress phase the co-simulator reads the values of one or more signals from the simulator. In the execution phase the co-simulator applies the appropriate processing to the signals. In the egress phase the co-simulator provides the processing results, e.g., updated values of the signals, back to the simulator.
In some embodiments, the co-simulator reads a given simulator signal by registering, or subscribing, to the signal and requesting the simulator to provide a trigger when the signal value changes. This feature is referred to as “callback on value change” in VPI terminology. The registration process may cause a value change on a given signal to trigger reading (“ingress processing”) one or more other signals. Thus, a co-simulator need not necessarily register all ingress signals.
The processing performed by the co-simulator in the execution phase may comprise any suitable Verilog statements. In particular, the processing may comprise blocking and/or non-blocking assignments (NBA). In the disclosed interaction mechanism between the simulator and the co-simulators, the handling of the two types of assignments is different, as will be explained further below.
In a blocking assignment (denoted “=” in Verilog, e.g., A=B) the Right-Hand-Side (RHS) of the assignment is evaluated and substituted in the Left-Hand-Side (LHS) of the assignment immediately. Other computations in the given simulation time slot are blocked until the assignment is complete. In a non-blocking assignment (denoted “<=”, e.g., A<=B), the RHS of the assignment is evaluated immediately, but the assignment of the result into the LHS is postponed until the other assignments and their resulting assignments in the same time slot are complete. (In the state model of
In the example flow of
Note that since the following state carries out the ingress phase, it should be verified that all active and in-active signals were updated before proceeding. The above-described delay mechanism ensures that egress processing is performed only after no assignments are waiting for execution.
In response to triggers 94, each triggered co-simulator carries out the ingress phase, i.e., reads the signal values from the simulator. Each co-simulator then carries out the execution phase, and in particular updates its internal state.
At this point, blocking and non-blocking assignments are handled differently. For each blocking (non-NBA) assignment, the co-simulator carries out the egress phase, i.e., returns the updated signal values to the simulator. For each non-blocking assignment (NBA), the co-simulator schedules the egress phase for a later time following the NBA state of the simulator. The postponed callback cbReadOnlySync is marked by an arrow 102.
Typically, the co-simulator should not perform the egress phase on the cbReadOnlySync callback, but only the ingress phase, i.e., read signals. Another option is to register on cbReadWriteSync. In practice, however, some simulators do not call this callback on Post-NBA (arrow 98). In some embodiments, the egress phase is postponed by causing the simulator to call the co-simulator on “value change event” after NBA state processing is completed. This mechanism is implemented using the signals signal_updated_by_simulator and signal_driven_by_co-simulator. Since the signal signal_updated_by_simulator is updated using “<=”, it is updated during the NBA state processing. However, since the co-simulator is registered on this signal (with value_change reason), after NBA state processing the simulator will not continue to the POSTPONED state, but instead go again to the ACTIVE state and will call the co-simulator in the following INACTIVE state.
In some embodiments, for a blocking assignment, the egress phase (returning the result of the assignment from the co-simulator to the simulator) is performed before execution of any non-blocking assignment in the system (including, for example, blocking assignments performed by the simulator, the given co-simulator and the other co-simulators). For a non-blocking assignment, on the other hand, the egress phase is performed after execution of all blocking assignments in the system.
Alternatively to using pre-NBA callback triggers 94, a triggered co-simulator may register on post-NBA callback, i.e., requests the simulator to trigger it again following the NBA state. The simulator in this implementation responds by re-triggering the co-simulators with post-NBA callback triggers 98.
When multiple co-simulators 28 (or other units) communicate with simulator 24 using the above-described mechanism, the signals and databases of all these units will remain synchronized. Each co-simulator carries out its ingress, execution and egress phases. In case of non-blocking assignments, the egress phase has to be performed only after all the co-simulators that were triggered in the same time slot have completed their respective ingress phases. A similar synchronization is also needed between a co-simulator and the simulator (even in the case of a single co-simulator). Although this mechanism was described in the context of multiple co-simulators, it can be used to operate a single co-simulator, as well.Operating Multiple Co-Simulators without Mutual Blocking
When simulator 24 interacts with multiple co-simulators using the mechanism of
The mechanism described below enables concurrent running of the execution phase. When using this mechanism, each co-simulator returns control to the simulator after it completes its ingress phase and initiates the execution phase (execution is assumed to be non-blocking).
In some embodiments, a given co-simulator returns control to the simulator even though it has not yet completed its execution and egress phases, in order to allow other co-simulators to proceed. If the co-simulator did not complete its execution and egress phases by the time control is returned to the simulator, the co-simulator schedules the simulator to call it again at a later time to update the egress phase results. An example process of this sort is shown in
When pre-NBA trigger callback trigger 94 arrives, the co-simulator performs the ingress phase, invokes execution of non-blocking assignments, and schedules postponed egress processing, at a processing step 110.
In the next value change callback trigger 90, the co-simulator carries out a completion checking step 114. In this step, the co-simulator checks whether the execution phase is completed. If completed, the co-simulator obtains the execution results and performs the egress phase. Otherwise, the co-simulator re-schedules egress for blocking assignments.
Using this technique, execution in one co-simulator does not block other co-simulators. As a result, the co-simulators can carry out their respective sub-tasks concurrently and thus reduce simulation time. When co-simulation execution is performed by CPU 44 and not GPU 48, concurrency can be supported using multiple concurrent threads or processes.
In some embodiments, system 20 schedules blocking and non-blocking execution using the following scheme: Two Verilog signals denoted “signal_driven by_co_simulator” and “signal_updated_by_simulator” are defined (typically in one of the Verilog files that are compiled by the simulator). The co-simulator registers for callback on value change for the signal signal_updated_by_simulator. The following code is added (typically in one of Verilog files that are compiled by the simulator): a. For immediate (blocking) update: always@(signal_driven_by_co_simulator) signal_updated_by_simulator=signal_driven_by co_simulator b. For NBA update: always@(signal_driven_by_co_simulator) signal_updated_by_simulator <=signal_driven_by_co_simulator Whenever the co-simulator prepares to schedule a callback, it reads the signal_driven_by_co_simulator signal from the simulator via VPI, increments the signal value by one, and writes back the incremented value to the simulator via VPI. Since the simulator implements the update of signal_updated_by_simulator as described above, the simulator will call the co-simulator (with VPI value change reason).Partitioning of Simulation Tasks into Sub-Tasks
A given simulation task (e.g., VLSI design to be verified) can be partitioned into sub-tasks in various ways. In some embodiments, partitioning is carried out by a process that precedes simulator 24. This process generates products that are later used by the simulator. The partitioning may be performed by the same computing platform that runs simulator 24. Alternatively, partitioning may be carried out by another processor or system external to system 20, in which case the design may be provided to simulator 24 in partitioned form. The following description provides an example partitioning process. The description refers to partitioning that is performed by a processor, regardless of the actual computing platform that runs it.
In this example embodiment, the processor splits a simulation task that is represented by a hierarchical tree structure (e.g., tree 50 of
The user provides two parameters denoted MIN_SPLIT_WEIGHT_SIZE and MAX_SPLIT_WEIGHT_SIZE. MIN_SPLIT_WEIGHT_SIZE specifies the minimum permitted weight of a partition. MAX_SPLIT_WEIGHT_SIZE specifies the maximum weight of a partition. In some cases a partition larger than MAX_SPLIT_WEIGHT_SIZE may be created, but only if it is not possible to split this partition into splits larger than MIN_SPLIT_WEIGHT_SIZE, such that they cover at least a certain percentage (denoted P1) of that split weight.
The weight of a partition can be defined in various ways. In some embodiments, the weight of a tree node is defined as the number of tree nodes at and below the node in question. The weight of a partition in these embodiments is defined as the weight of the node that serves as the root of the partition. In other embodiments, e.g., in RTL designs, the weight of a node may also depend on the number of statements and/or constructs in the node and the nodes below it. Again, the weight of a partition is defined as the weight of the node that serves as the partition root. In some embodiments, the processor runs a recursive process that calculates the weight of each tree node.
In an example embodiment, the partitioning is performed by the following recursive function that takes the current node (cur_node) as argument:
Cur_weight = weight of the cur_nodeIs_nacc = this module is non-acceleratedIs_top = this node was specified by the user as a Device-Under-Test (DUT) top levelsum the weights that are > MIN_SPLIT_WEIGHT_SIZE of the children of cur_node into sum_weights_in_rangeif (cur_weight > MAX_SPLIT_WEIGHT_SIZE and sum_weights_in_range > P1*cur_weight) or is_nacc: call auto-split for all of the children, and OR together the returned Booleans into OR_resultif (cur_weight < MAX_SPLIT_WEIGHT_SIZE or ~OR_result) and !is_nacc and (cur_weight>SPLIT_MIN_WEIGHT_SIZE or is_top): add this node as a split-root return Truereturn OR_result
When using this recursive function, the next level of hierarchy is inspected if: The current node is “too heavy,” i.e., its weight is too high; and The weights of the children of the current node (that are >MIN_SPLIT_WEIGHT_SIZE) sum into a value that covers at least P1% of the current node weight.
Therefore, it is permitted to exceed MAX_SPLIT_WEIGHT_SIZE in case that the children (that can become partitions) cannot cover the weight. Parameter MAX_SPLIT_WEIGHT_SIZE is therefore a “soft” constraint that is exceeded in some cases, while MIN_SPLIT_WEIGHT_SIZE is a hard constraint. In the above description, P1 is a user-definable parameter. In one example embodiment P1=90%, although any other suitable parameter values can be used.
In alternative embodiments, any other suitable partitioning process can be used. In some embodiments, the partitioned simulation task is compiled using multiple compilers, such that each compiler compiles a respective subset comprising one or more of the sub-tasks. Different compilers may run on different processors or other computing platforms, which in some embodiments operate in parallel to one another.
Although the embodiments described herein mainly address design verification applications, the methods and systems described herein can also be used in other applications. For example, the above-described partitioning process can be used in other Electronic Design Automation (EDA) applications such as Place&Route applications. Another example application is in concurrent searching using multiple concurrent search engines.
It will thus be appreciated that the embodiments described above are cited by way of example, and that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes both combinations and sub-combinations of the various features described hereinabove, as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not disclosed in the prior art. Documents incorporated by reference in the present patent application are to be considered an integral part of the application except that to the extent any terms are defined in these incorporated documents in a manner that conflicts with the definitions made explicitly or implicitly in the present specification, only the definitions in the present specification should be considered.