So far we have only exposed an extra layer of
parallelism in FLASH and have not yet focused on tuning the
multithreading. We do not have enough experience to suggest the most
efficient ways to run the multithreaded code; however, we can suggest
some things that may help the efficiency. In an application setup
with threadBlockList it makes sense to maintain at least 5
blocks per thread. This is because the relative computational load
imbalance when thread A is assigned 1 block and thread B is assigned 2
blocks is larger than when thread A is assigned 5 blocks and thread B
is assigned 6 blocks. In an application setup with
threadWithinBlock you should probably use larger blocks so
that each thread has more cells to work on.
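To make the block-count guidance for threadBlockList concrete, the following is a minimal sketch of the arithmetic, not FLASH code. It assumes that every block costs the same amount of work and that blocks are divided among threads as evenly as possible; the function name idle_fraction is hypothetical.

```python
def idle_fraction(n_blocks, n_threads):
    """Fraction of the threaded region that the least-loaded thread spends
    waiting for the most-loaded thread.

    Assumes equal cost per block and an as-even-as-possible split of
    blocks among threads (illustrative assumptions, not FLASH behavior).
    """
    if n_blocks < 1 or n_threads < 1:
        raise ValueError("n_blocks and n_threads must both be at least 1")
    least = n_blocks // n_threads            # blocks on the least-loaded thread
    most = -(-n_blocks // n_threads)         # ceiling division: most-loaded thread
    return 1.0 - least / most


if __name__ == "__main__":
    # 3 blocks on 2 threads is the "1 block vs 2 blocks" case;
    # 11 blocks on 2 threads is the "5 blocks vs 6 blocks" case.
    for n_blocks in (3, 11):
        print(n_blocks, "blocks on 2 threads:",
              f"{idle_fraction(n_blocks, 2):.0%} idle")
```

Under these assumptions the less-loaded thread is idle for half of the threaded region in the 1-versus-2 case, but only for about one sixth of it in the 5-versus-6 case.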
As a closing note, you should be aware of how much time is spent in the threaded FLASH units compared to the non-threaded FLASH units in your particular FLASH application: perfect speedup in a threaded unit may be insignificant if most of the time is spent in non-threaded FLASH units.
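This closing point is an instance of Amdahl's law. The sketch below estimates the overall speedup of a run from the fraction of run time spent in threaded units and the speedup achieved inside them. The function name overall_speedup and the fractions used are illustrative assumptions, not measurements of any FLASH application.

```python
def overall_speedup(threaded_fraction, unit_speedup):
    """Amdahl's law: whole-application speedup when only a fraction of the
    single-threaded run time benefits from threading.

    threaded_fraction: share of single-threaded run time spent in threaded
                       units, between 0 and 1 (assumed, to be measured).
    unit_speedup:      speedup achieved inside the threaded units.
    """
    if not 0.0 <= threaded_fraction <= 1.0:
        raise ValueError("threaded_fraction must lie in [0, 1]")
    if unit_speedup <= 0.0:
        raise ValueError("unit_speedup must be positive")
    serial_part = 1.0 - threaded_fraction
    return 1.0 / (serial_part + threaded_fraction / unit_speedup)


if __name__ == "__main__":
    # Hypothetical profiles: 20% and 90% of run time in threaded units,
    # each with perfect speedup on 4 threads.
    for fraction in (0.2, 0.9):
        print(f"{fraction:.0%} threaded:",
              f"{overall_speedup(fraction, 4.0):.2f}x overall")
```

With perfect 4-thread speedup, an application that spends only 20% of its time in threaded units speeds up by less than 1.2x overall, while one that spends 90% there speeds up by roughly 3x. Profiling your own application to find the threaded fraction is the only way to know which case applies.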