Multicore processors are the future of computing. With that in mind, we should consider how to create a multithreaded VVP simulator.

Core Observations

The core of Verilog simulation is the handling of events. In spite of modelling an inherently parallel device (the hardware), Verilog executes events serially, one at a time. Each event executes "atomically", running to completion. Even procedural blocks are atomic until they reach the end of an initial procedure, or have to wait for some event. We can thus model individual events as "transactions". (In fact, the ideas here are inspired by Software Transactional Memory.)

In practice, most Verilog code (such as RTL) has signal changes propagating to a limited set of driven signals. The signals written to by an event are generally limited; many events in typical, well-designed RTL will change only one signal (think of a well-designed always block; it should generally have just one output signal).

Thus, two events that do not "interfere" with one another can be executed in parallel. By "interfere" we mean that a signal written to by one event is read or written by the other event. Signals that are merely read by both events can be safely read in parallel.

Our core observation thus means that we simply need to check for interference prior to executing events. If two events do not interfere, then they can be run in parallel. Each event is treated as an atomic block, effectively equivalent to holding a lock for the entire event execution.
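The interference test just described can be sketched as follows. This is a minimal illustration, not VVP code: the names `event_sets`, `intersects`, and `interferes` are assumed, and access sets are represented here as ordinary pointer sets.

```cpp
#include <set>

// Hypothetical access sets; in VVP these would hold pointers to
// heap-lifetime objects (functors, signals), represented here as void*.
typedef std::set<const void*> access_set;

struct event_sets {
    access_set reads;   // objects read but not modified
    access_set writes;  // objects modified (and possibly read)
};

// True when the two sets share at least one member.
static bool intersects(const access_set& a, const access_set& b) {
    for (access_set::const_iterator i = a.begin(); i != a.end(); ++i)
        if (b.count(*i)) return true;
    return false;
}

// Two events interfere when either one writes an object the other
// reads or writes; read/read overlap is harmless.
bool interferes(const event_sets& a, const event_sets& b) {
    return intersects(a.writes, b.writes)
        || intersects(a.writes, b.reads)
        || intersects(a.reads, b.writes);
}
```

Only events for which `interferes` returns false may run concurrently.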

Re-modelling Scheduling

Let us consider the scheduler structure. The scheduler primarily consists of a list of time steps. Each time step item holds 4 lists: active, nbassign, rwsync, and rosync. If the time step has active items, the scheduler takes one of those and executes it, and so on. Once active is empty, the scheduler checks (in order) the nbassign, rwsync, and rosync lists. Once it detects a non-empty list among those, it empties that list into active (rosync is treated specially; while processing it, the scheduler checks that additional events are added only to rosync).

In general, the active list contains a list of events to process, so for an entire active list we can try to compute events in parallel as much as possible. This means that we acquire and empty the active list first and process it as described below; likewise, instead of being moved into the active list, the nbassign etc. lists are acquired and emptied for parallel processing. The parallel processing scheduler thus takes as input a list of events to run as much in parallel as possible.

Requirements

For one, we assume that we have a thread pool capable of executing arbitrary tasks. Optimizing and coding the thread pool is assumed to be similar to every other thread pool, so we will simply assume two methods from it: add task, which magically executes a task on an unspecified processor at an unspecified time in the future, and wait complete, which waits for the completion of all tasks.
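The assumed interface can be sketched as below. This is a degenerate, single-threaded stand-in for illustration only: `add_task` merely queues the task, and `wait_complete` drains the queue in the calling thread; a real pool would hand tasks to worker threads.

```cpp
#include <deque>

// Abstract unit of work for the pool.
struct task {
    virtual void run() = 0;
    virtual ~task() {}
};

// Degenerate single-threaded thread pool: the two assumed operations.
struct thread_pool {
    std::deque<task*> pending;

    // In a real pool this hands the task to some worker thread.
    void add_task(task* t) { pending.push_back(t); }

    // In a real pool this blocks until the workers are all idle.
    void wait_complete() {
        while (!pending.empty()) {
            task* t = pending.front();
            pending.pop_front();
            t->run();
        }
    }
};
```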

In the proposed scheme, we need to add the following member variables to each event:

  1. state lock, a mutex protecting these new member variables.
  2. state, an enum specifying the state of the event, with the possible values Idle, Running, and Finished.
  3. dependency, a list of events, specifying the events that must be completed first before this event can run.
    • This is a non-unique list; that is, an event may be a member of multiple dependency lists.
  4. todo, a list of events that are to be executed after this event completes.
    • This is a unique list; i.e. an event can be a member of at most one todo list.
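The added members can be sketched like this. The struct and its name `mt_event` are illustrative (in VVP this would extend the existing event structure), and a pthread mutex stands in for whatever lock primitive the port layer provides.

```cpp
#include <list>
#include <pthread.h>

// 2. state of an event within one parallel batch
enum event_state { Idle, Running, Finished };

// Sketch of the proposed additions to an event.
struct mt_event {
    pthread_mutex_t state_lock;       // 1. protects the members below
    event_state state;                // 2. Idle / Running / Finished
    std::list<mt_event*> dependency;  // 3. events that must finish first (non-unique)
    std::list<mt_event*> todo;        // 4. events to launch after we finish (unique)

    mt_event() : state(Idle) { pthread_mutex_init(&state_lock, 0); }
    ~mt_event() { pthread_mutex_destroy(&state_lock); }
};
```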

The operation of the algorithm proceeds as follows:

  1. Compute read and write sets (trivially parallel, each event can have this task done in parallel on the thread pool)
  2. Compute dependency list (trivially parallel, each pair of events can have this task done in parallel on the thread pool)
  3. Perform events (not trivially parallel, must respect dependency list)

Determining Interference

Each event must then add the new methods get read set and get write set. The read set is the set of objects that will be read but not modified, while the write set is the set of objects that are modified (and might also be read). Note that we use the term "objects" since we don't want to limit the sets to just signals (basically, any heap-lifetime C++ object that exists before the event starts and continues to exist afterwards is required to be reported in the set).

Getting the Sets

For functor events, the read set is its inputs, which are generally stored in the functor itself (and thus the functor itself is part of the read set). (Note that as of 0.9.x, practically all functors write to a net_ member variable which is modified during the execution of run_run(); this means that the functor itself should be in the write set. But the separation of nets and functors is reportedly to be fixed in 0.10.x, so the functor may indeed be read-set-only in 0.10.x; I haven't checked.)

For vthread execution events, we need to inspect the bytecodes starting from the label where execution would begin. Scan linearly. If a bytecode would read an object, add the object to the read set; if it would write, add the object to the write set. Conditional jumps fork our processing: we recurse on the branch target and continue with the next bytecode. We need to mark each bytecode as "done" so that loops will terminate once the scan reaches a marked bytecode (probably by using a thread-local set; we don't want to interfere with scans in other threads). Scanning ends when the end, wait, or join bytecodes occur. The completed sets are then used.
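The scan can be sketched over a toy bytecode model. Everything here is illustrative (real VVP opcodes differ, and the "done" set would be thread-local); the point is the worklist handling of conditional jumps and the loop termination via marking.

```cpp
#include <set>
#include <vector>
#include <cstddef>

// Toy bytecode model for illustration only.
enum opkind { OP_READ, OP_WRITE, OP_CJUMP, OP_END };
struct opcode {
    opkind kind;
    const void* object;  // object touched by OP_READ / OP_WRITE
    std::size_t target;  // branch target for OP_CJUMP
};

// Scan from `start`, accumulating read/write sets.  The `done` set
// (thread-local in the real design) terminates loops; a path stops
// at OP_END (standing in for end/wait/join).
void scan_sets(const std::vector<opcode>& code, std::size_t start,
               std::set<const void*>& reads, std::set<const void*>& writes)
{
    std::vector<std::size_t> work(1, start);
    std::set<std::size_t> done;
    while (!work.empty()) {
        std::size_t pc = work.back(); work.pop_back();
        // insert() fails once a bytecode is already marked, ending loops
        while (pc < code.size() && done.insert(pc).second) {
            const opcode& op = code[pc];
            if (op.kind == OP_END) break;
            if (op.kind == OP_READ)  reads.insert(op.object);
            if (op.kind == OP_WRITE) writes.insert(op.object);
            if (op.kind == OP_CJUMP) work.push_back(op.target); // fork: scan both paths
            ++pc;
        }
    }
}
```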

Computing Dependency

The active list serves as a queue. In non-parallel VVP we simply execute queue items one at a time (in most cases pushing onto the active list just pushes to the tail of the queue). This must be emulated as much as possible in multithreaded VVP.

Thus, we compute dependency by taking all pairs of events. We must also consider the ordering of the events in the queue. Earlier events might be dependencies of later events, but not vice versa. The first event in the queue thus has no dependencies and will have an empty dependency list.

At this point we expect the read and write sets of all events to have been precomputed, so that accessing them will not require significant processing.

Now given a pair of events a and b, where a is earlier in the queue than b, we compute dependency as follows:

  1. If a's write set intersects with b's write set, then put a in b's dependency list (dependency is protected by state lock, so make sure to acquire state lock before modifying dependency)
  2. Else if a's write set intersects with b's read set, then put a in b's dependency list
  3. Else if a's read set intersects with b's write set, then put a in b's dependency list
  4. Else do nothing, a and b are safely executable in parallel and are therefore independent, hurray!

For each pair of events in the queue, simply push the above as a task in the thread pool. By pair we mean, for example given events A, B, and C in the queue, we need A-B, B-C, and A-C pairs. Then wait for the thread pool to empty.
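The all-pairs rules above can be sketched as below. For brevity this runs sequentially and uses small integers as stand-ins for objects; in the real scheme each pair would be a separate thread-pool task and `dependency` would be guarded by the event's state lock. The names `ev`, `meets`, and `compute_dependencies` are illustrative.

```cpp
#include <vector>
#include <set>
#include <cstddef>

struct ev {
    std::set<int> reads, writes;          // access sets (ints stand for objects)
    std::vector<std::size_t> dependency;  // indices of earlier events we must wait for
};

static bool meets(const std::set<int>& a, const std::set<int>& b) {
    for (std::set<int>::const_iterator i = a.begin(); i != a.end(); ++i)
        if (b.count(*i)) return true;
    return false;
}

// Apply the three rules to every (earlier a, later b) pair in the queue.
void compute_dependencies(std::vector<ev>& queue) {
    for (std::size_t b = 0; b < queue.size(); ++b)
        for (std::size_t a = 0; a < b; ++a)
            if (meets(queue[a].writes, queue[b].writes)    // write/write
             || meets(queue[a].writes, queue[b].reads)     // write/read
             || meets(queue[a].reads,  queue[b].writes))   // read/write
                queue[b].dependency.push_back(a);
}
```

Note that the first event in the queue ends up with an empty dependency list, as the text requires.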

Performing Events

Once interference has been computed, we can then put each event in the queue as a task composed of the following steps:

The label top:

  1. Acquire the state lock for this event. Assert that state is Idle.
  2. Check dependency. If it is empty, continue to the label perform below. If non-empty, pop off one item; let this be parent. Acquire parent's state lock and if its state is Finished, release the lock and repeat this step.
    • This will not deadlock. Lock acquisition is always in the order dependent->dependee.
  3. Add this event to parent's todo list. Then release both locks and do nothing more (i.e. return).

The label perform (at this point, the event's state lock is acquired):

  1. Change state to Running and release state lock.
  2. run_run()
  3. Acquire state lock for the event and set state to Finished.
  4. Check the todo list. If empty, release the lock and do nothing more. Otherwise, acquire and empty the list, release the lock, and pop off one item; let the item be next and the rest of the list be rest.
  5. Push rest on the thread pool. Then let next be this event and go to label top.

To process a queue of events, simply push each event as the above task, then wait for the thread pool to empty.
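The top/perform task can be sketched in a single-threaded emulation, where the thread pool is a FIFO drained by one thread and the state locks are therefore elided. All names here are illustrative, not VVP's; the point is the deferral onto a parent's todo list and the hand-off at the end of perform.

```cpp
#include <deque>
#include <list>
#include <vector>

enum task_state { Idle, Running, Finished };

struct pevent {
    task_state state;
    std::list<pevent*> dependency; // unfinished parents block us
    std::list<pevent*> todo;       // events to launch when we finish
    int id;
    pevent(int i) : state(Idle), id(i) {}
};

std::deque<pevent*> pool;  // stand-in for the thread pool's task queue
std::vector<int> trace;    // order in which run_run() fired

void run_run(pevent* e) { trace.push_back(e->id); }

void event_task(pevent* e) {
top:
    // "top": defer ourselves onto an unfinished parent's todo list
    while (!e->dependency.empty()) {
        pevent* parent = e->dependency.front();
        e->dependency.pop_front();
        if (parent->state != Finished) {
            parent->todo.push_back(e);
            return; // parent will re-launch us from its todo list
        }
    }
    // "perform":
    e->state = Running;
    run_run(e);
    e->state = Finished;
    if (!e->todo.empty()) {
        pevent* next = e->todo.front();
        e->todo.pop_front();
        // push the rest onto the pool, continue with `next` ourselves
        while (!e->todo.empty()) {
            pool.push_back(e->todo.front());
            e->todo.pop_front();
        }
        e = next;
        goto top;
    }
}

void drain_pool() {
    while (!pool.empty()) {
        pevent* e = pool.front();
        pool.pop_front();
        event_task(e);
    }
}
```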

Locking

Additional locking is unnecessary, provided that the write set and read set of each event are accurate. This means that existing code will hardly be modified. However, each event does require a full write set and read set. Fortunately, it appears that most functors will not propagate immediately upon receiving recv_vec4() and friends, but will schedule their propagation; computing their write set thus simply requires inspecting the fanout of the functor.

Sticky Points

VPI

Events that must call VPI can be serialized by adding a dummy VPI object to the write set of the event. However, this is not sufficient. VPI designers can be expected to rely on the following assumptions:

  1. VPI modules are called from only one thread at a time.
  2. VPI modules are called from exactly one thread.
    • This is a stronger constraint than the above. This means that only one thread will ever call VPI. No other thread has the "right" to perform C calls to the VPI function.
    • This can occur for a "sufficiently complex" VPI module, where the VPI module itself multithreads. The VPI module might attempt to detect the "simulator thread" by using a thread-local variable. It then assumes that the thread which originally called its initializer functions is the simulator thread, and sets the thread-local variable for that thread and only that thread. Thus it is not safe to call the VPI module from any thread other than the thread that initialized it.
  3. Procedural code will always execute in sequence, with code in other processes not interfering. Thus a VPI designer might design a VPI module for a use like:
initial begin
   $vpi_process("blah blah");
   $vpi_close;
end
initial begin
   $vpi_open("something else");
   $vpi_process("woot woot");
   $vpi_close;
end

In the above case, we cannot process either procedural event in parallel with the other, since the VPI calls will get interleaved.

The simplest solution is to have, in addition to including a dummy VPI object in the write sets of events that do VPI, a VPI caller thread with a message queue. VPI call requests are routed to the message queue, together with a condition variable/semaphore. The VPI caller thread then performs the call, gets the results, and notifies the condvar/semaphore.

Note that the main thread can even be the VPI caller thread (which can at least reduce call overheads in schedule-related callbacks, i.e. time steps, which are still only executed in the main thread). One simplified way of telling the main thread to dismantle the VPI caller thread (i.e. exit that mode) is to add a dummy event that is dependent on all other events in the queue. This event's run_run() then simply sends a message to the VPI caller thread telling it to exit the mode and return to being the main, scheduler thread.
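The message-queue routing can be sketched with pthreads. This is an illustrative skeleton, not VVP code: a request carries a done flag and condition variable the caller thread signals once the call has been performed, and the string stands in for real VPI call arguments.

```cpp
#include <pthread.h>
#include <deque>
#include <string>

// One routed VPI call request, with its completion signal.
struct vpi_request {
    std::string text;   // stand-in for the real vpi_* call arguments
    pthread_mutex_t m;
    pthread_cond_t c;
    bool done;
    vpi_request(const std::string& t) : text(t), done(false) {
        pthread_mutex_init(&m, 0);
        pthread_cond_init(&c, 0);
    }
};

// The message queue owned by the VPI caller thread.
struct vpi_queue {
    pthread_mutex_t m;
    pthread_cond_t c;
    std::deque<vpi_request*> q;
    bool stop;
    vpi_queue() : stop(false) {
        pthread_mutex_init(&m, 0);
        pthread_cond_init(&c, 0);
    }
};

// Worker-thread side: enqueue the request and block until serviced.
void vpi_call(vpi_queue* vq, vpi_request* r) {
    pthread_mutex_lock(&vq->m);
    vq->q.push_back(r);
    pthread_cond_signal(&vq->c);
    pthread_mutex_unlock(&vq->m);
    pthread_mutex_lock(&r->m);
    while (!r->done) pthread_cond_wait(&r->c, &r->m);
    pthread_mutex_unlock(&r->m);
}

// Caller-thread side: perform queued calls, one thread only, until told to stop.
void* vpi_caller_thread(void* arg) {
    vpi_queue* vq = (vpi_queue*)arg;
    for (;;) {
        pthread_mutex_lock(&vq->m);
        while (vq->q.empty() && !vq->stop) pthread_cond_wait(&vq->c, &vq->m);
        if (vq->q.empty() && vq->stop) { pthread_mutex_unlock(&vq->m); return 0; }
        vpi_request* r = vq->q.front(); vq->q.pop_front();
        pthread_mutex_unlock(&vq->m);
        /* ... perform the real vpi_* call here, in this one thread ... */
        pthread_mutex_lock(&r->m);
        r->done = true;
        pthread_cond_signal(&r->c);
        pthread_mutex_unlock(&r->m);
    }
}
```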

Child vthread's

Forking a child vthread causes the child's event to be scheduled at the front of the active list. This is problematic, since we don't really know how it could change the dependencies (pushing to the back of the active queue is OK, since that event will get processed after all the currently parallel-processed events are done). A comment in the definition of schedule_vthread(), however, says "I can push this event in front of everything. This is used by the %fork statement, for example, to perform task calls." The key word here is "can", implying that it is allowed to do so, but not necessary. If pushing to the back of the queue is acceptable for newly-forked threads (as well as threads revived by child termination), then we can simply push to the back of the active list, to be processed in parallel later. But this changes the apparent order of execution.

If this is not acceptable, I don't really know a good answer; it seems a hard problem. We may need to speculatively perform the fork already during write set/read set computation, and then, at execution of the fork, set a flag specifying that the child will indeed be executed after the parent. We can't safely use the todo list, because other events may interfere with the child, and we don't know if we can execute them in parallel with the new child.

Not All Events

Possibly, one solution that could apply to both VPI calls and forkings of child vthread's would be to not include all events in parallel execution. Instead, we do the following:

  1. Initialize an empty list of events for parallel execution.
  2. For each event to be executed:
    • Check if it calls VPI. If it does, end.
    • Remove from active list.
    • Append to list of events for parallel execution.
    • Check if it forks. If it does, end.

Events that are not removed from the active list remain there, and will be executed in the main loop, at least until the next time we choose to perform parallel execution (which should be the next loop iteration).

The above means that an event with VPI will be executed "as normal", i.e. in the main thread, without going through the rigmarole of using multithreaded queues. Also, an event that forks a vthread will be the last event. This allows it to safely put the newly forked vthread into the front of the main scheduler queue, without conflicting with other forkers, and only requiring a relatively lightweight sequencing constraint.
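The selection loop above can be sketched as follows. The event descriptor and its `calls_vpi`/`forks` flags are illustrative; in VVP these facts would come out of inspecting the event itself.

```cpp
#include <deque>
#include <vector>

// Toy event descriptor: flags mark events that call VPI or fork.
struct sel_event { bool calls_vpi; bool forks; int id; };

// Pull a parallel-safe prefix off the active list: stop before any
// VPI-calling event (it stays for the main loop), and stop just after
// a forking event so the forker is last in the batch.
std::vector<sel_event> select_parallel(std::deque<sel_event>& active) {
    std::vector<sel_event> batch;
    while (!active.empty()) {
        if (active.front().calls_vpi) break;   // leave on the active list
        batch.push_back(active.front());
        active.pop_front();
        if (batch.back().forks) break;         // forker ends the batch
    }
    return batch;
}
```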

Self-Replacing Bytecodes

Several of the bytecodes end up replacing themselves with a more optimized version. However, this is problematic, since multiple threads may execute the code.

One way to fix this would be to also include the bytecode itself in the write set. This forces serialization, at least during the stage where bytecodes replace themselves. For future executions, the replaced bytecode is no longer part of the write set, and the serialization is lifted.

Weak Processor Memory Models

This is not really a problem: pthreads requires that locks perform the necessary memory barriers. Even though we do not execute while locks are held, it is sufficient to acquire and release a lock to obtain acquire and release memory barriers. Thus the act of acquiring and releasing state lock before and after executing run_run() is sufficient to ensure barriers. This is even less of a problem on Windows, which runs on IA32 and AMD64, both of which have comparatively strong memory ordering.

Low-Hanging Optimization Fruits

Read and Write Sets

Set Representation

Primarily, we need only the following operations on sets:

  1. Insert into set
  2. Check if intersects with another set

There is no real need to perform set lookup, or to iterate over set members, etc. One possible representation of a set is thus a Bloom filter. Insertion is O(1), and checking intersection requires simple binary AND operations: if the AND of the two filters' bits yields non-zero, there is a (possible) intersection; this check is also O(1).

Note however that Bloom filters are probabilistic, and there is a possibility (rather higher than the false-positive probability of a single membership query) of false intersections, which reduce our parallelization. A sufficiently large Bloom filter with sufficiently few hashes may mitigate this, however.

Since an efficient way of checking set intersection is necessary, this almost certainly rules out the use of binary search trees, for which intersection is at best O(N log M). Sorted arrays allow set intersection in O(N + M).
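A Bloom-filter set along these lines might look as follows. The size and the two hash functions are illustrative choices, not a tuned design; note that the AND test can report a false intersection but never misses a real one.

```cpp
#include <cstddef>

// Fixed-size Bloom filter over object pointers, with two cheap hashes.
struct bloom_set {
    static const std::size_t WORDS = 16;  // filter size in words
    unsigned long bits[WORDS];

    bloom_set() { for (std::size_t i = 0; i < WORDS; ++i) bits[i] = 0; }

    static std::size_t nbits() { return WORDS * 8 * sizeof(unsigned long); }
    // Two ad-hoc pointer hashes (illustrative only).
    static std::size_t h1(const void* p) { return ((std::size_t)p >> 3) % nbits(); }
    static std::size_t h2(const void* p) { return (((std::size_t)p >> 7) * 2654435761u) % nbits(); }

    void set_bit(std::size_t b) {
        bits[b / (8 * sizeof(unsigned long))] |= 1ul << (b % (8 * sizeof(unsigned long)));
    }
    void insert(const void* p) { set_bit(h1(p)); set_bit(h2(p)); }  // O(1)

    // Non-zero AND of the bit arrays means a (possibly false) intersection.
    bool intersects(const bloom_set& o) const {
        for (std::size_t i = 0; i < WORDS; ++i)
            if (bits[i] & o.bits[i]) return true;
        return false;
    }
};
```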

Caching of Read and Write Sets

Since we expect net fanouts to be fixed, we can probably pre-compute the write and read sets of functor events. In addition, we might also annotate bytecode addresses with the read and write sets that would be accessed if/when vthread execution resumes at that address.

Note however that caching the read and write sets for bytecode addresses is problematic for self-replacing bytecodes. We might set a flag specifying "do not cache sets" if a self-replacing bytecode is encountered.

Thread-local Scheduling

Given a centralized scheduler, each of the worker threads will compete for the central event scheduler. This can cause severe contention, especially since propagation events are likely to consist completely of scheduling a bunch of events.

One solution would be to use thread-local schedules. When the thread pool empties after executing all run_run() methods, the main thread can then acquire and empty each of the thread-local schedules into the central scheduler. This reduces serialization during execution of events.

The order problem

The risk here is nondeterministic ordering. It would be nice if we could have deterministic ordering when pushing events into the scheduler. Possibly we can use an event-local scheduler instead of a thread-local one. Since the main thread acquires the events on the queue in a particular order, it can simply retain that ordering internally. Then the main thread simply appends each event-local schedule onto the main schedule, in the correct order. (This is really an associative reduction operation, in the MapReduce sense, and is in fact partially parallelizable, or at least log2(N) parallelizable.)

Event-local Scheduler

As mentioned in the previous section "The order problem", an event-local scheduler would allow for the property of having consistent ordering of events. This can even be changed to allow parallelizability by allowing thread pool tasks to "wait on" other thread pool tasks to complete before they are started. An event that is "transformed" to a thread pool task would wait on events in its dependency list. In the meantime, the main thread creates tasks to merge the schedules of pairs of events in order. The main thread then creates tasks to merge schedules of pairs of those events, and so on, until it reaches a single event. The main thread can then wait for completion of that single event to wait for all events, and to acquire the final schedule, which it can then merge with the main schedule.

This allows us to let go of the requirement for the thread pool to provide a "wait until all tasks are completed", since we need only wait on one task, which is dependent on all the other tasks we are concerned about.

Further, not only must event-local schedules be used, but also event-local wait lists. Waitable event functors have a list of waiting vthread's, whose order we must also preserve. We can simply subsume both schedules and waitlists into a single, mergeable structure.

Continuous Parallelism

In the discussion so far above, we have synchronization points where the main scheduler thread waits for the thread pool to empty of tasks, then re-loads a bunch of tasks onto the thread pool. Specifically, between computing access sets and computing dependencies, the main scheduler waits for access set computation to complete before continuing with computing dependencies, and again the main scheduler waits for the thread pool to empty before it continues processing.

However it should be possible to load all these tasks into the thread pool at once. This allows some steps to overlap; for instance, the first scheduled event may be executing while some later scheduled event is still computing dependencies.

This can be facilitated by using a Promise object.

Details of Promise

A promise is simply composed of:

  1. A computation, which is simply an abstract object representing some function with its inputs
  2. Some storage for the result of computation
  3. A state, which is either Idle, Running, or Finished
  4. A semaphore
  5. An integer num_waiters, specifying the number of threads waiting for the promise to transition from Running to Finished
  6. A state lock mutex protecting the entire promise

Promises are constructed from a computation abstract object. Promises, once constructed, have the following operations:

  1. A try run operation
  2. A run operation
  3. An extract operation, which has the precondition that run has completed successfully

On construction, the constructor automatically adds the promise's try run operation to the thread pool. The try run operation then does the following:

The label try run:

  1. Acquire state lock
  2. If state is not Idle, release state lock and return, doing nothing
  3. Change state to Running, then acquire and clear computation; let this be f
  4. Release state lock
  5. Go to label perform run, giving f

The label perform run:

  1. Run f, storing the result into a temporary return value
  2. Acquire state lock
  3. Copy return value into storage
  4. Get the value of num_waiters; let this be N
  5. Release state lock
  6. Repeat N times: Post to semaphore

The actual run operation ensures that the Promise has completed computation in some thread, either the current thread or in some thread pool worker thread.

The label run:

  1. Acquire state lock
  2. If state is Idle:
    1. Change state to Running, then acquire and clear computation; let this be f
    2. Release state lock
    3. Go to label perform run, giving f
  3. If state is Running:
    1. Increment num_waiters
    2. Release state lock
    3. Wait on semaphore
  4. If state is Finished, do nothing

The extract operation simply locks the object and copies storage.
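The Promise can be sketched in a single-threaded reduction, where the state lock, semaphore, and num_waiters of the full design are elided, so run() degenerates to lazy evaluation in the calling thread. The names and the fixed int result type are illustrative.

```cpp
// The computation: an abstract object representing a function with its inputs.
struct computation {
    virtual int operator()() = 0;
    virtual ~computation() {}
};

enum promise_state { Idle, Running, Finished };

struct promise {
    computation* comp;    // acquired and cleared when the run starts
    int storage;          // storage for the result of computation
    promise_state state;

    promise(computation* c) : comp(c), storage(0), state(Idle) {}

    // run: ensure the computation has completed.  In the full design,
    // a caller finding state == Running would wait on the semaphore here,
    // and try run would simply return instead of computing.
    void run() {
        if (state == Idle) {
            state = Running;
            computation* f = comp;
            comp = 0;
            storage = (*f)();  // "perform run"
            state = Finished;
        }
    }

    // extract: precondition is that run has completed.
    int extract() const { return storage; }
};
```

In a single-threaded build this is exactly the lazy promise mentioned in the discussion further down: nothing happens until someone actually forces the value.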

Using Promise for Continuous Parallelism

After acquiring the active list into the parallel processing engine, we do the following:

  1. For each event, create a Promise that computes the read set and write set of that event. Store that promise with the event.
    • Promise creation implies adding to the thread pool, so this is inherently parallel.
  2. For each event, add the following task to the thread pool:
    1. For each earlier event, create a Promise that checks for interference between this event and the earlier event. Store those promises in the event.
    2. For this event, create a Promise that performs the event

The compute interference function simply ensures run of the read set and write set Promises, then uses the normal dependency computation.

The perform event function simply ensures run of all dependency Promises and then uses normal perform event computation.

Comments by CaryR

I wanted to let you know that we are watching this and thank you for your thoughts. I have been thinking about parallelism and was planning to see if we could add something simple to the expression rework. What you are describing is much more involved and would likely have a much bigger speedup. I'm still working on getting a grasp on all the subtleties of multi-threading, so I don't have any comments at the moment. Once again thank you for your contribution!

Cary 15:41, May 22, 2010 (UTC)

You're welcome. Looking through the 0.9.1 source, I'm currently having some difficulties with keeping modifications as separate as possible, so that we can keep as many source files as possible common between single-threaded and multi-threaded builds (no matter what, single-threaded code will run better on machines with a single processor).
One thing I mentioned in the discussion above is "access sets", so I was thinking that a bunch of functions such as recv_vec4() would have an equivalent recv_vec4_acs(). The "acs" variants would simply compute the access set without doing the "real" computation - basically just add pointers to an access set. I'm currently thinking of using a varargs macro of some description, approximately like the following:
#ifdef MULTITHREADED
#define ACCESSES(cls, fn, ...)\
    void cls :: fn ## _acs(vvp_access_set_t acs, __VA_ARGS__)
#define ENDACCESSES /*nothing*/
#else /* !MULTITHREADED */
#define ACCESSES(cls, fn, ...)\
    void cls :: fn ## _acs(vvp_access_set_t acs, __VA_ARGS__) { if(0)
#define ENDACCESSES }
#endif

/*... */

ACCESSES(vvp_not_fun_t, recv_vec4, vvp_net_port_t net, vvp_context_t context) {
    /* ... add accessed objects to acs ... */
} ENDACCESSES
void vvp_not_fun_t::recv_vec4(vvp_net_port_t net, vvp_vector4_t data, vvp_context_t context) {/*...*/}
The problem is that varargs macros are C99, and older GCC had a different, incompatible syntax (newer GCC supports both syntaxes). This reduces portability.
Another problem that's come up so far is how to handle value propagation in parallel. Normally propagation would touch a bunch of functors/nets. This greatly increases the access set of vthreads that write to functors, making it more likely to interfere with other vthreads/events, and thereby force a lot of sequentialization. What I want to do to "fix" this is to use promises, such that the send_vec* functions will make promises to propagate to the fanout, and then force those promises before returning. The use of promises means that in effect, the current thread, even though it might not be available on the thread pool, is still available to perform the core work that needs to be done.
In any case, whatever you are planning in the expression rework, you might (should?) consider the use of a thread pool, which simplifies tasks. You probably might (should?) use Promises/Futures, implemented via that same thread pool, in order to ensure that propagation completes before you return to the modifying code in
--The Horrible Idea Guy. I'll probably get around to registering "AmkG" when I have more free time at home.

The expression rework should help with this since expression evaluation will be controlled by a master object that pulls the calculation when a signal changes vs the diffuse push method we have now. I would also suggest you use the latest development code from git. V0.9.1 is quite old. We released V0.9.2 last year and need to find some free time to make a 0.9.3 release before too long. Development (0.10.0 devel) has some net rework and development work like this really should be done in that branch. The expression rework will only go in the development branch.

Cary 16:20, May 25, 2010 (UTC)

Actually I think diffuse push would be easier to parallelize, but really it depends on how you're structuring things; central objects tend to be anathema for multithreading, but not if, e.g., they're backed by a thread pool or some such (I am obviously obsessed with thread pools). I've got a git clone of the steveicarus github repository, but I don't see an expression-rework branch; is it directly on master? Or do you have a branch somewhere else? Rather unfortunately v0_9_1 is the easiest reference I can get at so far (I've been studying it on and off for about 3/4 of a year or so), but I'm starting to study github/steveicarus/iverilog master. A rough summary of the changes on master since v0_9_1 would be appreciated.
So far one stumbling block is how to organize the code so that as little as possible needs changing. I'm thinking of doing a lot of macro trickery so that e.g. calls to schedule_* functions are calls to the code in singlethread compilations, but are replaced with calls to mt_schedule_* functions in multithread compilations; the mt_schedule_* variants will attach an object to the event_s currently being run (if running an event_s, otherwise call directly - probably need a thread-local var to determine this). How comfortable would the dev team be with such tricks?
My current thinking is to have a separate mvvp/ directory for the multithreaded build, have a -I../vvp/ flag in AM_CFLAGS or similar private CFLAGS, and for source files which require little or no change, just have the mvvp/ contain #include"". Is this OK?
The count_* variables would become a class which increments thread-local variables, but which, when converted to an unsigned long, will access all those thread-local variables, sum them up, and return the sum. The expected use case is that they are incremented often but read only once (or at most very few times), at the end of simulation.
class count_variable {

    /*lock protected*/
    mutable mvvp_mutex_t M;
    std::vector<unsigned long*> tl_vars; /*pointers to all thread-local variables*/
    /*end lock protected*/

    mvvp_thread_local_t tlvar; /*actual thread-local*/

public:
    count_variable& operator+=(unsigned long i) {
        unsigned long* pvar = reinterpret_cast<unsigned long*>(tlvar.get_local());
        if(!pvar) {
            pvar = new unsigned long(0);
            tlvar.set_local((void*) pvar);
            {mvvp_lock_t L(M);
                tl_vars.push_back(pvar);
            }
        }
        *pvar += i;
        return *this;
    }
    operator unsigned long(void) const {
        unsigned long rv = 0;
        {mvvp_lock_t L(M);
            BOOST_FOREACH(unsigned long* pi, tl_vars) {
                rv += *pi;
            }
        }
        return rv;
    }
};
Also, I'm now on the iverilog-devel mailinglist, so we can discuss there (discussing on the wiki is OK though if you prefer it).
The Horrible Idea Guy 22:50, May 25, 2010 (UTC)

There is no expression rework branch yet. I have some C++ files where I'm working on the basic structure and the implementation for the various expression operations (binary operators, reduction operators, etc.). What I have appears to be both fast and functioning correctly. I just need some free time to finish the implementation of the various operators. Once all the operators are working correctly I can then start integrating this into vvp. My assumption is that each expression object would go into a thread pool when it needed to be evaluated. We have to correctly handle the case where the same variable is used multiple times in the same expression (it must have the same value in both places and must trigger only one update of the result).

Cary 00:06, May 26, 2010 (UTC)

Using a promise is ideal here in that case; in fact, in a singlethreaded implementation all you need to do is remove the backing thread pool and the lock, and the promise will still work, albeit very lazily (it won't trigger unless you actually get the value, unlike having a thread pool where it will get done either when you actually get the value, or when the thread pool gets around to doing it). However, you may need to have a "resettable" promise, i.e. one where we can reset the function it computes, and that implies visible external state, including races. Erk. And then there's %deassign, which changes the fanout of things...
The Horrible Idea Guy 02:43, May 26, 2010 (UTC)

You can use gitk to see what has changed. You can ignore things not in the vvp directory and there is supposed to be a good description of what each patch does. V0.9 was forked on March 20th, 2009 from the master branch. If you are looking at plain V0.9.1 source code then any change since then could be of interest. Later code in the V0.9 branch (0.9.2 and the V0.9 branch from git) has some of the fixes from development back ported.

Steve will need to decide on the code/architecture changes. Ideally we could use the same vvp source for either the single or multi-thread versions of vvp. Being able to run vvp <file> and mvvp <file> is very appealing from a usage point of view as well as a verification point of view. I realize this doesn't work for anything that is showing/depends on evaluation order.

Cary 00:06, May 26, 2010 (UTC)

Actually, we might be able to get eval order to work similarly. All we need to do is to ensure that the results of computation are committed in the correct order; the path from inputs through computations to outputs need not be done in order, just the combination of outputs into the final output. For example, consider how insertion into the scheduler can be implemented. I suggest in #The order problem that we redirect calls to schedule_* functions so that they insert into a scheduler that is local to the currently-executing event. Then, after all active events have completed, the main thread goes through them in order, inserting their scheduled events into the "main" scheduler. This means that the order in which the events are inserted into the "main" scheduler is the same order in which they would have been inserted into the only scheduler in singlethread.
But we have to treat schedule_push_ differently, since it does not append to the back, but to the front. Possibly we can treat this as adding to the access set for the event, putting such events in a stack, then using promises to execute them in any order while committing their results in the correct LIFO order. This implies that an event should be able to also call into the event-distributing function (implying everything like getting access sets, figuring out dependencies, etc.), not just the main schedule_simulate(). I think I need to clarify/expand what I just said about schedule_push_, possibly in an entire section by itself.
I would appreciate knowing just how much of the code can be changed, and how much preprocessor/template metaprogramming trickery that Steve and/or the other devs are willing to tolerate. My initial attempts suggest that code similarity and trickery reduction are mutually exclusive; we need nasty preprocessor tricks (e.g. #define schedule_assign mt_schedule_assign) to reduce changes in code between multithread and singlethread.
The Horrible Idea Guy 02:43, May 26, 2010 (UTC)

Discussing on iverilog-devel is fine, but having the conclusions and other details presented in an organized manner here would be very nice.

Cary 00:06, May 26, 2010 (UTC)

I'll possibly post these discussions on the iverilog-devel list then, remove them from here, and we can discuss what gets into this wiki and what doesn't on the list.
The Horrible Idea Guy 02:43, May 26, 2010 (UTC)