Manipulating data within a DSC fileset

  • Question

  • Hi,

    I have a requirement to manipulate the data within a file that has been added to a DSC fileset, based on data available in another file (which can also be added to a DSC fileset). I know that we cannot modify the data once a file is added to a fileset and the fileset is sealed, so I would like to gather suggestions from experts on possible solutions. The key point is that the solution should be able to leverage the entire cluster, since the DSC fileset already lives on the cluster.

    Regards,


    Phani Note: Please vote/mark the post as answered if it answers your question/helps to solve your problem.
    Tuesday, October 18, 2011 1:24 PM

Answers

  • I just realized I had forgotten to also mention the Concat() operator that was enabled in the "HPC Pack 2008 R2 Service Pack 3" beta release.

    Concat effectively performs straightforward DSC fileset concatenation in the middle of a query.  This is particularly useful when building complex queries, as it saves you from doing query + DSC manipulations + query + .. and so on.

    var mainData = context.FromDsc("..");

    var newData = context.FromDsc("..");

    var q = mainData.<operators>.Concat(newData).<operators>.ToDsc().SubmitAndWait();

    Note, however, that using just data1.Concat(data2).ToDsc().Submit() will introduce some data copying: queries do not operate directly on the DSC store, so there is some movement from DSC to temporary storage and back again.  For this simple type of data concatenation, direct use of the DSC APIs will have better performance.
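    To make the placeholders above concrete, here is a rough sketch of a fuller query.  The record type, the fileset names, and the exact HpcLinqConfiguration/FromDsc/ToDsc signatures are assumptions for illustration only, so adjust to the actual LINQ to HPC API:

    using System;
    using System.Linq;
    using Microsoft.Hpc.Linq;   // assumed namespace for the HpcLinq client API

    [Serializable]
    public class SaleRecord
    {
        public string ItemId;
        public double Price;
    }

    public static class ConcatExample
    {
        public static void Main()
        {
            // Assumed setup: point the LINQ to HPC context at the cluster head node.
            var context = new HpcLinqContext(new HpcLinqConfiguration("myheadnode"));

            var mainData = context.FromDsc<SaleRecord>("sales/base");
            var newData  = context.FromDsc<SaleRecord>("sales/updates");

            // Filter the base data, splice the new records in mid-query,
            // aggregate per item, and write the result to a new fileset.
            mainData.Where(r => r.Price > 0)
                    .Concat(newData)
                    .GroupBy(r => r.ItemId,
                             (key, g) => new SaleRecord { ItemId = key, Price = g.Sum(x => x.Price) })
                    .ToDsc("sales/combined")
                    .SubmitAndWait();
        }
    }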

     

    -mike.

     

    Monday, October 24, 2011 6:54 PM
  • Thanks for clarifying.  Your scenario should work out OK if the frequency of your updates and the ratio of updates to new data are not extreme.  However, if you are going to be processing your multi-GB updates frequently and/or the update volume is large, then L2H may simply not be the right approach (random-access writes to a dataset aren't one of the core scenarios (yet)).

     

    The general approaches I would consider:

    1. Design so that updates are add-only, and then use queries to resolve the multiple records that refer to a single logical data element.  This can be used where later records can supersede earlier ones (e.g. by tracking a timestamp/sequence ID) and where the store is, in general, just a long list of 'facts' such as 'item i cost x dollars on date d'.  Queries either take the latest data or aggregate everything together.  This works well for frequent small updates and when the queries can successfully combine all the records pertaining to one data element.  It starts to break down for large update volumes, but periodic compacting is a good workaround for intermediate cases.

    2. Create new datasets each time and retire the old.  This is appropriate if your update volumes are large.  You could perhaps use Join/GroupJoin to form groups of old data + new data for each data element, and then resolve each group to a single element.  Also consider Union(a,b).GroupBy().  The choice of operators will likely depend on whether your data and updates have many records per element, or strictly zero/one each.  (A query sketch illustrating approaches 1 and 2 follows this list.)

    3. Update the data in place via DSC.  For some data and updates it may be reasonable to use DSC operations only to perform the update: create a new fileset, use AddExistingFile where possible, and AddNewFile for the data that is being altered.  If your updates are typically localized and very simple, this might have lower overhead than using L2H to do the updates.
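    To make approaches 1 and 2 concrete, here is a hedged sketch of the "group per key and keep the newest record" pattern.  The record type, its SequenceId field, and the fileset names are invented for the example, the context is set up as in the Concat example above, and the exact HpcLinq signatures may differ:

    [Serializable]
    public class ItemRecord
    {
        public string Key;        // identifies the logical data element
        public long   SequenceId; // later records supersede earlier ones
        public string Payload;
    }

    // Base (TB-scale) and incremental (GB-scale) data.
    var baseData = context.FromDsc<ItemRecord>("items/base");
    var updates  = context.FromDsc<ItemRecord>("items/incremental");

    // Concatenate the two filesets, group by logical key, and keep only the
    // newest record in each group: an upsert expressed as a query.
    // (Substitute Union(...) if duplicate elimination is also wanted.)
    var merged = baseData.Concat(updates)
                         .GroupBy(r => r.Key,
                                  (key, g) => g.OrderByDescending(r => r.SequenceId).First());

    merged.ToDsc("items/merged").SubmitAndWait();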

     

    -mike.

    Thursday, October 20, 2011 6:23 PM

All replies

  • Hi Phani,

     

    I'm not quite sure what you need, so perhaps you can clarify.  However, a couple of pointers:

    The normal approaches to 'extending' a fileset with new data are:

    1. Use the DSC APIs to create a new fileset, call DscFileSet.AddExistingFile() to reference the existing files, and use DscFileSet.AddNewFile() followed by a file copy to the new file's write path to attach the new data (a rough sketch follows this list).

    2. Use an HpcLinq query that creates a new fileset incorporating both datasets.  You will likely use a binary operator such as Union(), Join(), or perhaps the binary-operator version of Apply if you need to do something more unusual.
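    For option 1, the DSC-level flow might look roughly like the sketch below.  The DscService, CreateFileSet, GetFiles, WritePath, and Seal names and signatures are assumptions pieced together from the description above, and the paths and fileset names are made up, so treat this as pseudocode rather than the exact API:

    using System.IO;
    using Microsoft.Hpc.Dsc;   // assumed namespace for the DSC client API

    var dsc = new DscService("myheadnode");                   // assumed constructor

    DscFileSet oldSet = dsc.GetFileSet("items/base");         // existing, sealed fileset
    DscFileSet newSet = dsc.CreateFileSet("items/base.v2");   // new fileset being built

    // Reference the unchanged files without copying any data.
    foreach (DscFile f in oldSet.GetFiles())
    {
        newSet.AddExistingFile(f);
    }

    // Attach the new data: create a file entry, then copy the local
    // incremental data to the write path DSC hands back.
    long length = new FileInfo(@"C:\data\incremental.dat").Length;
    DscFile added = newSet.AddNewFile(length);
    File.Copy(@"C:\data\incremental.dat", added.WritePath);

    // Seal the new fileset so queries can read it.
    newSet.Seal();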

     

    If you use binary Apply you will want to ensure that you get the right 'mode': fully merged, fully distributed, or left-distributive.  I won't go into a lot of detail here, but:

    Merged binary Apply.  Achieved without any attributes.  Pulls all the data to one vertex and runs your user delegate once over all the data from each source.

    Fully distributed binary Apply.  Achieved by adding the [DistributiveOverConcat] attribute to your user delegate.  The datasets must agree on partition count; n vertices will execute, each getting the ith pair of input partitions.

    Left-distributed binary Apply.  Achieved by adding the [LeftDistributiveOverConcat] attribute to your user delegate.  The left dataset remains partitioned, but the right dataset is merged and broadcast in full to each vertex.  If the datasets have nLeft and nRight partitions respectively, then nLeft vertices will run, and each receives one partition from the left dataset plus the entirety of the right dataset.
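    For reference, a left-distributive binary Apply might be wired up roughly as follows.  The delegate name is invented, the ItemRecord type (with its Key field) and the baseData/updates queries are reused from the merge sketch earlier in the thread, and the exact Apply overload may differ:

    // User delegate: because of the attribute, each vertex sees one partition
    // of the base data plus the complete (merged) incremental data.
    [LeftDistributiveOverConcat]
    public static IEnumerable<ItemRecord> MergePartition(
        IEnumerable<ItemRecord> basePartition,
        IEnumerable<ItemRecord> allUpdates)
    {
        // Index the (smaller) update set, then stream the base partition
        // through it, replacing any record that has an update.
        var updatesByKey = allUpdates.ToDictionary(u => u.Key);
        foreach (var r in basePartition)
        {
            ItemRecord u;
            yield return updatesByKey.TryGetValue(r.Key, out u) ? u : r;
        }
        // Note: records that exist only in the update set would still need to
        // be appended separately (e.g. with Concat) in this sketch.
    }

    // Query side: binary Apply over the two filesets.
    var result = baseData.Apply(updates, (b, u) => MergePartition(b, u));
    result.ToDsc("items/updated").SubmitAndWait();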

    As you can see, the binary-apply gives you various controls, but if at all possible you should use the rich binary functions such as Join().

    -mike.

    Wednesday, October 19, 2011 6:31 PM
  • Thanks Mike for your detailed response and pointers. Will try them out and get back if required.

    To clarify my requirement: I have one file with base data and another file with incremental data (both are flat files; the base data is TBs in size and the incremental data is GBs in size). We want to manipulate the data in the base file based on the incremental file: new records (those existing in the incremental file but not in the base file) are to be inserted, and existing records (those present in both files) are to be updated in the base file, i.e. the record in the base file is replaced by the one from the incremental file.

    Let me try to map my requirements with the pointers you gave and see if I can get something out of it.


    Phani Note: Please vote/mark the post as answered if it answers your question/helps to solve your problem.
    Thursday, October 20, 2011 3:31 AM
  • Thanks Mike for taking the time out and providing detailed direction on approaching the solution.
    Phani Note: Please vote/mark the post as answered if it answers your question/helps to solve your problem.
    Friday, October 21, 2011 9:30 AM