We are running a Monte Carlo model, usually with a large number of different cases (say 1500). Each of these cases requires a large number of iterations to account for randomness; we typically run about 1000 iterations per case.
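For concreteness, the shape of the workload is roughly the following (the function names and the model itself are illustrative placeholders, not our actual code):

```python
import random

N_CASES = 1500       # distinct model cases
N_ITERATIONS = 1000  # Monte Carlo iterations per case

def run_iteration(case_id: int, iteration: int) -> float:
    """Placeholder for one Monte Carlo iteration of one case."""
    rng = random.Random(case_id * N_ITERATIONS + iteration)  # independent draw
    return rng.random()  # stand-in for the real model output

def run_case(case_id: int) -> float:
    """Aggregate the random outcomes over all iterations of one case."""
    samples = [run_iteration(case_id, i) for i in range(N_ITERATIONS)]
    return sum(samples) / len(samples)
```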
We initially set this up to run each case as a separate job (1500 jobs, for example), with the 1000 iterations handled by one (or sometimes more) parametric task per job. However, this put a huge number of jobs in the queue, and it was extremely difficult to estimate overall progress. In addition, under balanced scheduling, all of these jobs competed with one another for resources, since the balancing occurs across every job in the queue.
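As a sketch of that first layout (assuming the HPC Pack `job` command line; `model.exe`, the job-ID parsing, and the exact flags are assumptions you would adapt to your cluster), each case got its own job containing a single parametric task that sweeps the iteration index:

```python
import re
import subprocess

N_ITERATIONS = 1000

def submit_case_as_job(case_id: int) -> int:
    """One job per case; one parametric task sweeps iterations 1..N."""
    out = subprocess.check_output(["job", "new", f"/jobname:case_{case_id}"],
                                  text=True)
    job_id = int(re.search(r"\d+", out).group())  # parse the new job's ID
    # '*' in the command line is replaced by the sweep index at each step
    subprocess.check_call(["job", "add", str(job_id),
                           f"/parametric:1-{N_ITERATIONS}",
                           "model.exe", str(case_id), "*"])
    subprocess.check_call(["job", "submit", f"/id:{job_id}"])
    return job_id
```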
We then shifted gears and tried putting all the cases under a single job, with one parametric task per case (1500 tasks in total). This worked perfectly in our tests: scheduling behaved as desired, and progress was clear at the job level. During testing we only tried a few cases (~10 or so), so all was fine and dandy. But when we scaled back up to our typical workload of 1500 cases, we quickly ran into the limit that each job can contain at most 100 parametric sweep tasks.
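The second layout, under the same assumptions as the sketch above, is the inverse: one job for the entire run, with one parametric task added per case. This is exactly what trips the cap once the case count grows:

```python
import re
import subprocess

def submit_run_as_one_job(case_ids: list[int], n_iterations: int = 1000) -> int:
    """One job for the whole run; one parametric task per case."""
    out = subprocess.check_output(["job", "new", "/jobname:mc_run"], text=True)
    job_id = int(re.search(r"\d+", out).group())
    for case_id in case_ids:  # fails once the job exceeds 100 parametric tasks
        subprocess.check_call(["job", "add", str(job_id),
                               f"/parametric:1-{n_iterations}",
                               "model.exe", str(case_id), "*"])
    subprocess.check_call(["job", "submit", f"/id:{job_id}"])
    return job_id
```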
Our first thought was to add 1000 separate tasks per case rather than one parametric task, but submitting that many individual tasks would take days or weeks. Our next thought was to group the cases so that we end up with roughly 15 jobs of 100 parametric tasks each, covering all 1500 cases. However, this gets ugly for us, because a case does not always know in advance how many parametric sweep tasks it will need, which makes it difficult to decide how many cases can fit under a single job (see the packing sketch below).
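Here is a sketch of the greedy packing we would need, assuming a hypothetical helper `tasks_needed(case_id)` that can only be answered at submit time; it shows why not knowing the sweep count up front makes the job boundaries unpredictable:

```python
from typing import Callable, Iterable, Iterator

MAX_PARAMETRIC_TASKS = 100  # the per-job cap we ran into

def batch_cases(case_ids: Iterable[int],
                tasks_needed: Callable[[int], int]) -> Iterator[list[int]]:
    """Greedily pack cases into jobs so no job exceeds the task cap.

    tasks_needed(case_id) is hypothetical: the number of parametric
    sweep tasks a case requires, known only when the case is prepared.
    A single case needing more than the cap would have to be split."""
    batch: list[int] = []
    used = 0
    for case_id in case_ids:
        n = tasks_needed(case_id)
        if used + n > MAX_PARAMETRIC_TASKS and batch:
            yield batch          # close out one job's worth of cases
            batch, used = [], 0
        batch.append(case_id)
        used += n
    if batch:
        yield batch
```

Each yielded batch would then become one job, submitted the same way as in the single-job sketch above.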
Why does this limit of 100 parametric tasks per job exist, and what is the best way to work around it?