I was going through the documentation and understand that DSC utilities/custom logic should be used to create files and file sets. Most of the articles deal directly with groups of individual files. For example, suppose I have a single file of 50 GB
or 1 TB which I want to split across the HPC cluster and then do data processing on. In this case, do we need to write custom logic to split this one big file so it can be distributed across the cluster? Are there any utilities which can be used to get this done?
Please let me know.
Phani
Note: Please vote/mark the post as answered if it answers your question or helps to solve your problem.
1.) Create a fileset in DSC that holds only the single file you want to partition. Next, write an L2H query that partitions the data based on its contents. For example, say you have a 100 GB weblog that is organized as lines of text, and you've added
it to a fileset called "WeblogBigFile". You could then use a query along the following lines to partition it into 100 files of roughly one gigabyte each and store those in a fileset called "WeblogSmallFiles":
    using (HpcLinqContext context = new HpcLinqContext(config))
    {
        context.FromDsc<LineRecord>("WeblogBigFile")
               .HashPartition(r => r, 100)
               .ToDsc("WeblogSmallFiles")
               .SubmitAndWait();
    }
2.) Another approach is to use the .NET file I/O streaming APIs to divide the large file into smaller files, and then create a fileset out of the newly created files.
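For the second approach, a minimal sketch of the stream-based split is below. It reads the weblog line by line so that no record is cut in half; the paths, file-naming scheme, and 1 GB target size are illustrative assumptions, not part of any DSC API:

```csharp
using System.IO;

class Splitter
{
    static void Main()
    {
        // Illustrative values only: source path, part naming, ~1 GB per part.
        const long maxBytesPerPart = 1L << 30;
        int part = 0;
        long written = long.MaxValue;        // forces a new output file on the first line
        StreamWriter writer = null;

        using (StreamReader reader = new StreamReader(@"C:\data\weblog.txt"))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (written >= maxBytesPerPart)
                {
                    // Current part is full; start the next one.
                    if (writer != null) writer.Dispose();
                    writer = new StreamWriter(
                        string.Format(@"C:\data\weblog.part{0:D3}", part++));
                    written = 0;
                }
                writer.WriteLine(line);
                written += line.Length + 2;  // rough byte count, incl. CRLF
            }
        }
        if (writer != null) writer.Dispose();
    }
}
```

Because the split happens on line boundaries, each output file remains valid line-oriented input; you can then add the resulting files to a DSC fileset as usual.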
In either case, once a fileset exists that contains the small files, these files will be spread across the cluster, replicated, and ready for access by further L2H queries.
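As a sketch of that last step, a downstream L2H query can consume the small-file fileset directly (this assumes the "WeblogSmallFiles" fileset from the earlier example and the LineRecord.Line accessor; the "ERROR" filter is just an illustrative workload):

```csharp
using (HpcLinqContext context = new HpcLinqContext(config))
{
    // Each small file is processed in parallel, typically on a node
    // holding a replica of that partition.
    int errorCount = context.FromDsc<LineRecord>("WeblogSmallFiles")
                            .Where(r => r.Line.Contains("ERROR"))
                            .Count();
}
```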
[Edit: Apologies for the very long response delay. This forum is for prerelease versions of our V2 (HPC Pack 2008), which was released in 2008, so it's not closely monitored. Hence the alternative forum suggestion.]