In LINQ to HPC, how to split files which are large, i.e., more than 50 GB

  • Question

  • Hi,

    I was going through the documentation and understand that the DSC utility, or custom logic, should be used to create files and filesets. Most of the articles deal directly with groups of individual files. Suppose, however, I have a single file of 50 gigabytes or 1 terabyte that I want to split across the HPC cluster and then process. In that case, do we need to write custom logic to split this one big file so it can be distributed across the cluster? Are there any utilities that can be used to get this done? Please let me know.


    Phani Note: Please vote/mark the post as answered if it answers your question/helps to solve your problem.
    Monday, September 19, 2011 9:57 AM


  • Hi,

    There are two ways in which you can do this.

    1.) Create a fileset in DSC that holds only the single file you want to partition. Next, write an L2H query that partitions the data based on its contents. For example, say you have a 100-gigabyte weblog that is organized as lines of text, and you've added it to a fileset called "WeblogBigFile". You could then use the following query to partition it into 100 files of roughly one gigabyte each and store them in a fileset called "WeblogSmallFiles":

    // config is an HpcLinqConfiguration pointing at your cluster's head node.
    using (HpcLinqContext context = new HpcLinqContext(config))
    {
        context.FromDsc<LineRecord>("WeblogBigFile")
               .HashPartition(r => r, 100)
               .ToDsc("WeblogSmallFiles")
               .SubmitAndWait();
    }


    2.) Another approach is to use the .NET file I/O streaming APIs to divide the large file into smaller files, and then create a fileset out of the newly created files.
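    The second approach can be sketched as follows, assuming a line-oriented input file; the `FileSplitter` class name, paths, and chunk size here are illustrative, not part of any HPC API:

    ```csharp
    using System;
    using System.IO;

    // Sketch: split a large line-oriented file into chunks of roughly
    // chunkBytes each, breaking only on line boundaries so no record is
    // cut in half.
    static class FileSplitter
    {
        public static int Split(string inputPath, string outputDir, long chunkBytes)
        {
            Directory.CreateDirectory(outputDir);
            int part = 0;
            long written = 0;
            StreamWriter writer = null;
            try
            {
                using (var reader = new StreamReader(inputPath))
                {
                    string line;
                    while ((line = reader.ReadLine()) != null)
                    {
                        // Roll over to a new part file once the current one is full.
                        if (writer == null || written >= chunkBytes)
                        {
                            if (writer != null) writer.Dispose();
                            writer = new StreamWriter(
                                Path.Combine(outputDir, "part" + part + ".txt"));
                            part++;
                            written = 0;
                        }
                        writer.WriteLine(line);
                        written += line.Length + Environment.NewLine.Length;
                    }
                }
            }
            finally
            {
                if (writer != null) writer.Dispose();
            }
            return part; // number of part files created
        }
    }
    ```

    Once the part files exist, add them to a DSC fileset (via the DSC command-line tool or the DSC .NET API) so they can be distributed across the cluster.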

    In either case, once a fileset exists that contains the small files, these files will be spread across the cluster, replicated, and ready for access by further L2H queries.


    One other note is that the best forum for questions about L2H development is here: http://social.microsoft.com/Forums/en/windowshpcdevs/threads


    Hopefully that helps guide your efforts.


    [Edit: Apologies for the very long response delay. This forum is for prerelease versions of our V2 (HPC Pack 2008), which was released in 2008, so it's not closely monitored. Hence the alternative forum suggestion.]
    Saturday, October 15, 2011 12:08 AM