So far we have only exposed an extra layer of parallelism in FLASH and have not yet focused on tuning the multithreading. We do not have enough experience to recommend the most efficient ways to run the multithreaded code, but we can suggest some things that may help. In an application set up with threadBlockList, it makes sense to maintain at least 5 blocks per thread. This is because the relative computational load imbalance between thread A being assigned 1 block and thread B being assigned 2 blocks is larger than that between thread A being assigned 5 blocks and thread B being assigned 6 blocks. In an application set up with threadWithinBlock, you should probably use larger blocks so that each thread has more cells to work on.
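The block-count argument for threadBlockList can be checked with a little arithmetic. The following is a minimal sketch, not FLASH code: the function name `thread_efficiency` is hypothetical, and it assumes every block costs the same and that blocks are divided among threads as evenly as possible.

```python
import math

def thread_efficiency(blocks, threads):
    """Average per-thread block count divided by the busiest thread's count.

    Assumption (not from FLASH): all blocks cost the same and are split
    as evenly as possible. 1.0 means perfectly balanced threads.
    """
    busiest = math.ceil(blocks / threads)  # blocks on the most loaded thread
    average = blocks / threads             # ideal share per thread
    return average / busiest

# Two threads holding 1 and 2 blocks (3 total) versus 5 and 6 blocks (11 total).
for total_blocks in (3, 11):
    print(total_blocks, round(thread_efficiency(total_blocks, 2), 3))
```

Under these assumptions the 1-versus-2 split keeps the threads busy 75% of the time on average, while the 5-versus-6 split raises that to roughly 92%, which is why more blocks per thread reduces the relative imbalance.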
As a closing note, be aware of how much time is spent in the threaded FLASH units compared to the non-threaded FLASH units in your particular FLASH application. Perfect speedup in a threaded unit may be insignificant if most of the time is spent in a non-threaded unit.
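This closing point is an instance of Amdahl's law, and a rough estimate is easy to compute. The sketch below is illustrative only: the function `overall_speedup` and its parameters are hypothetical names, and the fraction of time spent in threaded units is something you would measure for your own application (for example from its timing output).

```python
def overall_speedup(threaded_fraction, unit_speedup):
    """Amdahl's law estimate of whole-application speedup.

    threaded_fraction: share of single-threaded run time spent in threaded
                       units (between 0 and 1); an assumed, measured input.
    unit_speedup:      speedup achieved inside those threaded units.
    """
    serial_part = 1.0 - threaded_fraction
    threaded_part = threaded_fraction / unit_speedup
    return 1.0 / (serial_part + threaded_part)

# Perfect 4x speedup in units that account for only 20% of the run time.
print(round(overall_speedup(0.2, 4.0), 2))
# The same 4x speedup when threaded units account for 90% of the run time.
print(round(overall_speedup(0.9, 4.0), 2))
```

With only 20% of the time in threaded units, a perfect 4x unit speedup yields about 1.18x overall; at 90% it yields about 3.08x.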