Is an autoscaling, single shared cluster "HPC for Rent" service in Azure practical today?

  • Question

  • This is an HPC question so please bear with me through this first paragraph of background. My firm has had me working alone on an Azure program of utterly typical design for 4 months. It has nothing to do with HPC. It has gone reasonably well. The service leverages existing, single-threaded command-line programs to perform “file conversions” that take on the order of 1 to 10 minutes per submitted file. A typical conversion takes 2 or 3 minutes. The use of these existing command-line programs is mandated, and altering the design of that file conversion code is outside my scope. I have an AJAX-style, single-page HTML application that calls RESTful services implemented on MVC 4 Web API. Users can use a web browser to multi-upload input files to my web role, which queues the actual conversion work to the worker role, which runs the command-line conversion programs. Results are available through the HTML front-end and are also linked to within an optional job completion e-mail. This service is to be made available, for free, to our entire customer base. I’ve used all the cloud 1.0 stuff in the implementation: web role, worker role, queue storage, blob storage, table storage. It is my first Azure project. I’ve learned a lot. It is going pretty well.

    Suddenly I have been asked to put down my pencil and to instead begin to investigate reworking this simple little “file converter” service to run on top of another group’s effort to create an “HPC for rent” service by leveraging Microsoft HPC within Azure. None of the people working on that framework brought any previous HPC experience to that project and I don’t have any HPC background either. The notion of using an HPC cluster for single-threaded jobs that require a few minutes of compute time on a single core seems counterintuitive to me. The “HPC for rent” service is also aiming to support genuine supercomputing kinds of workloads: single jobs from single customers that would lend themselves to parametric sweep in a cluster and which could take hours to execute even when using dozens of nodes/cores. There would ultimately be a growing variety of unrelated applications supported by a single shared, dynamically scaled cluster, and our entire customer base would be unleashing jobs against this cluster. Obviously there would be zero coordination among the customers submitting jobs. Although my application is to be free, it is an exception. Most of the applications are expected to be paid, so they will have to figure out how to bill customers of this single, autoscaling, shared cluster.

    Is the group that is building this “HPC for rent on Azure” service justified in claiming that they can actually make life easier for me and that they can serve the customers of my application adequately? I’m afraid of these possibilities:

      • My customers’ little several-minute file conversion jobs may not get completed for far too long because they cannot get past a rush of genuine HPC workloads.
      • That whatever auto-scaling they come up with will not serve my customers nearly as well as what I could have done myself quite cheaply.
      • That my development life will become much harder. I find Azure development difficult enough. Combining it with HPC sounds even more challenging to me. They tell me I cannot use the Azure emulator any more, for example.
      • That ceding control over the architecture of my service to the architecture of their framework on top of HPC on top of Azure is going to be stifling and will hamper my ability to serve my customers.
      • That the advocates of this idea are vastly underestimating the effort required to create an “HPC for rent” service on top of Microsoft HPC on top of Azure. If it were practical for a tiny development team to do that (which is what they are) wouldn’t Microsoft already have it?
      • That Azure support from Microsoft HPC has not really matured enough yet to justify such grand plans by a team of a hundred developers, much less one of several developers with no prior HPC experience.
      • That when application or framework code fails on this shared cluster, figuring out who’s at fault and what the problem is will be 100 times more difficult. In fact, I’m afraid that even noticing that there is a problem will be much more difficult.

    I have no experience with HPC clusters at all so I ask members of the Microsoft HPC community, what do you think of this? Should I relax and embrace the “HPC for rent” redirection of my little Azure service project or should I resist it?  Is the notion of using a compute infrastructure designed for utility supercomputing to also do little, single core file conversion jobs just fine? Is the notion of building a service to allow all customers to share a single, autoscaling Microsoft HPC cluster running in Azure practical today?  Can it be created and sustained by a team whose maximum size is likely to be maybe 8 developers?  It is more like 3 developers right now. I have read about one company that was making a business of “HPC as a service in the cloud” and they were not aiming to have multiple customers share single clusters.  Rather they took an approach of a cluster for each customer. That probably makes billing a lot easier and probably eliminates a whole class of security concerns.

    I look forward to feedback from the Microsoft HPC community.

    Friday, September 21, 2012 3:45 PM

All replies

  • First of all, thank you for the story. It's a nice weekend reading. :)

    Secondly, your workload is a typical HPC workload that we call a parametric sweep job. In fact, a lot of our customers use HPC to do similar jobs on their on-premises HPC clusters. (A good example is DCC, where the customer runs a large HPC cluster, sends in some initial scheme design and texture, runs renderers to generate all the frames, and merges them into a video. Generating each frame means running the renderer with certain parameters.) What the HPC cluster will give you in this case is machine management, automatic error recovery, remote execution, monitoring, and billable reports. We do currently offer Windows Azure HPC Scheduler (WAHS) to enable you to do all of that purely on Azure.
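
    As a rough illustration of the parametric sweep pattern described above, the sketch below expands one command template into one independent, single-threaded task per sweep index and runs the tasks in parallel. It is not WAHS or HPC Pack code: the `{i}` placeholder, the function names, and the stand-in command are all assumptions made for this example, and a local thread pool plays the part that a scheduler and its compute nodes would play in a real cluster.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def build_commands(template, start, end, step=1):
    """Expand one command template into one command per sweep index.

    The "{i}" placeholder is a convention invented for this sketch.
    """
    return [
        [part.replace("{i}", str(i)) for part in template]
        for i in range(start, end + 1, step)
    ]


def run_one(command):
    """Run one independent task; return (command, exit code, stdout)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return command, result.returncode, result.stdout.strip()


def run_sweep(template, start, end, step=1, workers=4):
    """Run every task of the sweep, at most `workers` at a time.

    Results come back in sweep-index order because map() preserves order.
    """
    commands = build_commands(template, start, end, step)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, commands))


if __name__ == "__main__":
    # Stand-in for a real converter or renderer executable (hypothetical).
    template = [sys.executable, "-c", "print('converted input_{i}.dat')"]
    for command, code, output in run_sweep(template, 1, 5):
        print(code, output)
```

    Each task here is independent of the others, which is what makes a file conversion or per-frame rendering workload fit the sweep model; retries, node management, and reporting are the parts a scheduler would add on top.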

    Some answers are not appropriate for this forum (e.g., the Azure roadmap) and some discussion is more efficient by e-mail or conference call. So if you are still interested, please contact me through yidingz at microsoft dot com.

    Saturday, September 22, 2012 2:51 PM