HPC for processing 50 GB of input data per day

  • Question

  • Hi,

    I am about to begin work on a project which involves the processing of 50 GB of input data arriving in flat files every day. I need to read and perform certain calculations and/or transformations on the data in these input files and then output them into text files. Over the course of time, the size of input data is bound to grow many times.

    I was wondering if HPC Server could help me in my effort. I am planning to write my data extraction and transformation routines in C# classes. I am hoping that deploying them on Windows HPC Server would speed up transforming the data and writing it to the target text files, especially in view of the future growth of the input data, possibly up to 250-400 GB per day.

    Never having worked with HPC before, I would like to seek your opinions on whether my idea is feasible. Is this a typical use of HPC? All advice and suggestions are welcome!

    Ringoo

    Monday, October 14, 2013 6:49 PM

All replies

  • HPC is a good platform for what you describe and the scale you mention.

    In data-centric parallel applications, data locality is often an issue. You did not mention whether your data is in the cloud or on premises.

    As you traverse your datasets, you might review the job and task dependency features (i.e., a directed acyclic graph, or DAG).

    DAG needs often lead to custom code via IScheduler interfaces in C#.
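To illustrate the dependency idea above, here is a minimal sketch in plain Python. It is not the HPC Pack scheduler API; the task names and the `run_in_waves` helper are hypothetical. It only shows how a DAG of tasks can be grouped into "waves", where every task in a wave has all of its dependencies finished and could run in parallel.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task names; each key maps to the set of tasks it depends on.
dependencies = {
    "split": set(),
    "transform_a": {"split"},
    "transform_b": {"split"},
    "merge": {"transform_a", "transform_b"},
}

def run_in_waves(deps):
    """Return a list of waves; tasks within one wave are independent."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        # Tasks whose dependencies have all been marked done.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        # A real scheduler would run these in parallel and wait here.
        sorter.done(*ready)
    return waves

waves = run_in_waves(dependencies)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

In this sketch the two transform tasks land in the same wave, which is the part a cluster scheduler could spread across nodes.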

    There is also HDInsight for an all-cloud Big Data solution.

    d

    Wednesday, October 16, 2013 8:38 PM
  • The question is whether you can split your processing into independent tasks.

    An HPC solution is worth considering in a scenario like the following:

    1. The server gets 50 GB of data.

    2. Split the data into N independent chunks (to send to and process on N nodes).

    3. Send each chunk to a compute node, process the data, and store the result on the same node (you may reuse it later).

    4. Get the next 50 GB of data and repeat steps 1-3.
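The steps above can be sketched on a single machine as follows. This is only an illustration under stated assumptions: the input is a newline-delimited flat file, worker threads stand in for compute nodes, and `chunk_ranges`, `process_chunk` and `run_all` are hypothetical helpers, not part of HPC Pack. The "transformation" here just counts lines and sums one integer per line.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def chunk_ranges(path, n_chunks):
    """Split a file into at most n_chunks byte ranges ending on line boundaries."""
    size = os.path.getsize(path)
    ranges = []
    start = 0
    with open(path, "rb") as f:
        for i in range(1, n_chunks + 1):
            if start >= size:
                break
            if i == n_chunks:
                end = size  # last chunk takes whatever is left
            else:
                f.seek(max(start, size * i // n_chunks))
                f.readline()  # move forward to the end of the current line
                end = f.tell()
            if end > start:
                ranges.append((start, end))
                start = end
    return ranges

def process_chunk(path, start, end):
    """Hypothetical transformation: count lines and sum the integer on each."""
    count, total = 0, 0
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:  # stream line by line instead of loading the chunk
            line = f.readline().decode().strip()
            if line:
                count += 1
                total += int(line)
    return count, total

def run_all(path, n_chunks):
    """Process every chunk independently; threads stand in for compute nodes."""
    ranges = chunk_ranges(path, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        return list(pool.map(lambda r: process_chunk(path, *r), ranges))
```

Because each chunk ends on a line boundary, no record is split between workers, and the per-chunk results can be combined afterwards without any coordination between the workers.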

    Also consider your network bandwidth, and read about LINQ to HPC jobs.
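For the bandwidth point, a rough back-of-the-envelope estimate may help. The 1 Gbit/s link speed and the 0.7 efficiency factor below are assumptions for illustration only, not figures from this thread; `transfer_seconds` is a hypothetical helper.

```python
def transfer_seconds(data_gb, link_gbit_per_s, efficiency=0.7):
    """Estimate the seconds needed to move data_gb gigabytes over a link.

    efficiency is an assumed fraction of the nominal bandwidth actually achieved.
    """
    bits = data_gb * 1e9 * 8  # gigabytes -> bits (decimal units)
    return bits / (link_gbit_per_s * 1e9 * efficiency)

# Daily volumes mentioned in the question: today, and the projected growth.
for daily_gb in (50, 250, 400):
    minutes = transfer_seconds(daily_gb, link_gbit_per_s=1.0) / 60
    print(f"{daily_gb} GB over 1 Gbit/s: about {minutes:.0f} min")
```

Even at full nominal speed (efficiency 1.0), 50 GB over a 1 Gbit/s link takes 400 seconds, so the time spent distributing chunks to nodes can be a noticeable part of the daily run.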


    Daniel Drypczewski


    Friday, October 25, 2013 8:24 AM
  • This is an old thread, but I thought I would try my luck anyway. I have a similar problem: performing computations on a large data file, on premises. Using HPC Pack 2012, is sharding my data and storing it on individual nodes still an option? LINQ to HPC is out of the question. What happens to my data if a node shuts down? Or is it recommended to use Azure Batch and cloud burst my solution to tackle this availability issue? Thanks a lot for your response.
    Thursday, January 29, 2015 10:00 AM