Problem with Batching on retransmission

  • Question

  • Our situation is as follows:

    SyncFx: 2.0, local: SQL Server 2005 Express, remote: SQL Server 2008, communications over WCF/Internet.

    Local provider: SQLSyncProvider, remote provider: RelationalSyncProvider.

    We have a scope, and one of its tables has approximately 1 million inserts to be synced from local to remote. The internet connection between these two sites is less than stellar and we get a break in the connection every time we sync. We have a batch size of 1 MB, and the resulting data is spooled into 175 batch files of approximately 1 MB each.
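    For reference, here is a minimal sketch of how we configure the batching (a sketch only: the scope name, connection, and WCF proxy plumbing are placeholders, and the real RelationalSyncProvider sits behind our WCF service):

        using Microsoft.Synchronization;
        using Microsoft.Synchronization.Data.SqlServer;

        // Local SQL Server 2005 Express provider; "ActScope" and localConnection are placeholders.
        var localProvider = new SqlSyncProvider("ActScope", localConnection);
        localProvider.MemoryDataCacheSize = 1024;             // batch size in KB, i.e. the 1 MB above
        localProvider.BatchingDirectory = @"C:\SyncBatches";  // where the ~175 spool files land

        var orchestrator = new SyncOrchestrator
        {
            LocalProvider = localProvider,
            RemoteProvider = remoteProxyProvider,  // our proxy to the RelationalSyncProvider over WCF
            Direction = SyncDirectionOrder.Upload
        };
        orchestrator.Synchronize();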

    The problem seems to be that, although SyncFx identifies that there are spooled batches already, during the restart of the sync it respools all of the data already spooled to disk, so we end up with 350 batch files of 1 MB each, with the data duplicated.

    Well, if we couldn't send 175 files without a disconnection, the chances of getting 350 through are nil.

    The next time, it now has 350 files, respools all of the data again, and we end up with three times the data in 525 files. Well, if we couldn't send 175 files... you get the idea.

    So what's going on? Turning on the VERBOSE level of sync tracing shows that the data in the batch files is in reverse order. We have a primary key which is an ascending bigint, effectively like an identity, and the records were inserted in ascending key order. But the order of records in the batch files is from highest to lowest. When SyncFx looks into the batch control file for the context of where it was, it processes the files and takes the value from the last batch as the sync_min_timestamp. It then uses this value to select any extra records that may be available.

    Well, since this value is the lowest primary key value, it gets all of the data again!

    I have posted an excerpt of the trace at https://docs.google.com/open?id=0B6w0EH1yiPYmOXBrb1ZaZ0dJSlk (link modified from first post) where you can see the details.

    It is clear to see that the first batch has the highest key value, that the last batch has the lowest, and that after re-enqueuing all of the existing batches, it does a select of more changes (calling Sync_ACTTRANS_selectchanges at 18:37:17.505) using that lowest key value. Hence it re-retrieves all of the changes that are already in those batches.

    My thoughts are that the answer lies here: http://msdn.microsoft.com/en-us/library/dd918908.aspx where the description of SelectIncrementalChangesCommand says that you want the command to order the rows by timestamp.

    Well, guess what: we have the local database provisioned by SyncFx and there is NO ORDER BY on the Sync_ACTTRANS_selectchanges command (or any of the others).

    I am going to hand-modify the SP to order the records and see whether the retry then works; the sketch below shows the kind of ordering I have in mind.
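    To illustrate, here is the kind of ordered enumeration command the dd918908 page describes. This is the adapter-level form used with a hand-configured DbSyncProvider; with our SqlSyncProvider-provisioned database the equivalent experiment is appending the same ORDER BY to the generated Sync_ACTTRANS_selectchanges procedure. The join and the tracking column names below are assumptions based on a standard provisioning, not our exact schema:

        using System.Data;
        using System.Data.SqlClient;
        using Microsoft.Synchronization.Data;

        // Change enumeration for the ACTTRANS table, explicitly ordered by the
        // tracking timestamp as the SelectIncrementalChangesCommand docs recommend.
        var adapter = new DbSyncAdapter("ACTTRANS");
        var selectChanges = new SqlCommand();
        selectChanges.CommandType = CommandType.Text;
        selectChanges.CommandText =
            @"SELECT t.*, tr.sync_row_is_tombstone, tr.local_update_peer_timestamp
              FROM dbo.ACTTRANS t
              RIGHT JOIN dbo.ACTTRANS_tracking tr ON t.Id = tr.Id
              WHERE tr.local_update_peer_timestamp > @sync_min_timestamp
              ORDER BY tr.local_update_peer_timestamp ASC"; // the ordering that is missing today
        selectChanges.Parameters.Add("@" + DbSyncSession.SyncMinTimestamp, SqlDbType.BigInt);
        adapter.SelectIncrementalChangesCommand = selectChanges;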

    Has anybody else had this sort of problem and what did you do about it?

    Steve


    • Edited by Speedware Sunday, June 3, 2012 9:29 PM Fixed hyperlink to trace
    Sunday, June 3, 2012 9:14 PM

All replies

  • steve,

    have you checked the conditions under which sync fx would actually resume from previously spooled files?

    these links might help clarify how the batching works:

    http://blogs.msdn.com/b/mahjayar/archive/2009/09/16/msf-v2-ctp2-deep-dive-memory-based-batching.aspx

    http://blogs.msdn.com/b/mahjayar/archive/2009/11/16/msf-v2-deepdive-batching-directory-and-batch-files.aspx

    Tuesday, June 5, 2012 2:21 PM
  • Hi June,

    I have seen the first link before but not the second. The second contains more detail about the batching, and specifically about the resending.

    Nevertheless, it does not change anything with regard to my understanding of how it works and what I see in the specific case that I had.

    The trace file that I have, and that I excerpted in the link, demonstrates that all of the batches that had been spooled qualified for retransmission, as per the second of your links.

    To me, the issue still seems to be the determination of where the batching had got to. In approximate terms, the situation is as follows:

    1. There are 2.5 million records in the base table.
    2. There are 2.5 million tracking records with timestamps 10k through 2.5 million + 10k.
    3. There are 175 batches.
    4. The first batch has records for timestamps x through 2.5 million + 10k, that is, the newest of the records.
    5. The 2nd batch has records for timestamps y through x-1, that is, the 2nd-to-last set of records.
    6. The last batch has records for timestamps 10k through n, that is, the oldest set of records.
    7. No records were added or modified in the base table at any time after the start of the first batching.
    8. Upon resumption, SyncFx takes the details from the last batch as its starting point for getting any new records that have timestamps after the batches were created. This should yield no records since there were no modifications. However, it selects 10k as its starting timestamp (see the sketch after this list).
    9. This selection of 10k as the starting timestamp yields all the records again.
    10. Last, but most importantly IMO, the stock SP ObjectPrefix_BaseTable_selectchanges has no ORDER BY clause, and since SQL Server is not in the business of guaranteeing any order unless one is explicitly requested, that seems to be a fundamental flaw.
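    To make the flaw concrete, here is the arithmetic in miniature (the numbers come from the list above; the variable names are mine, not SyncFx's):

        // Timestamps are spooled newest-first across the 175 batch files.
        const ulong oldestTimestamp = 10000;    // ends up in the LAST batch file on disk
        const ulong newestTimestamp = 2510000;  // ends up in the FIRST batch file

        // What resumption should use as sync_min_timestamp: the newest enumerated
        // change, so that re-selecting finds nothing new.
        ulong expectedWatermark = newestTimestamp;  // select yields 0 extra rows

        // What the trace shows it actually uses: the watermark read from the last
        // file on disk, which here holds the OLDEST records.
        ulong actualWatermark = oldestTimestamp;    // select yields all ~2.5 million rows again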

    So, on the last point, consider the situation where SQL Server coughs up the records in a totally random order - not likely, but theoretically possible. Then how does SyncFx in its batching get the latest timestamp?

    In the situation I have, it is quite clear that the latest timestamp selected to identify new records since the spooling (as described in the second of your links) is wrong when it starts at the 10k timestamp.

    And all of this is clearly identifiable in the excerpt of the trace that I posted.

    June, what are your thoughts, especially on the lack of any ORDER in the selectchanges SP?

    Tuesday, June 5, 2012 3:18 PM
  • must be a bug...

    looking at your trace, it is re-enqueuing the existing files...

    however, when it tries to get where to resume, it calls this function:

    internal virtual ulong GetNewResumptionTimestampFromWatermark(ulong waterMark)
    {
        return waterMark;
    }
    

    so it's not actually computing a new resumption watermark; it simply returns whatever watermark is passed to it.

    Tuesday, June 5, 2012 5:13 PM
  • Hi June,

    I see that you've just popped over the 10k point mark. So THANKS for all of your contributions and well done!

    From Reflector, there is also this code:

        private void ReEnqueueAlreadyEnumeratedBatches()
        {
          string batchFileName = null;
          for (int i = 0; i < this._batchDirectoryHeader._batchFileNames.Count; i++)
          {
            batchFileName = Path.Combine(this._currentSessionSpoolDirectory, this._batchDirectoryHeader._batchFileNames[i]);
            DbSyncBatchInfo batchInfo = DbSyncBatchInfoFactory.Deserialize(batchFileName);
            batchInfo._isLastBatch = false;
            if (i == (this._batchDirectoryHeader._batchFileNames.Count - 1))
            {
              SyncTracer.Verbose(2, "SyncBatchProducer: Resuming from a previous enumeration. Reading table watermarks as {0}", new object[] { batchInfo.ConvertTimestampDictionaryToString() });
              this._resumingTableWatermarks = batchInfo._maxEnumeratedTimestamps;
              this._tableWatermarks = batchInfo._maxEnumeratedTimestamps;
            }
            this.ReEnqueueBatchAndRaiseEvent(batchInfo, batchFileName);
            this._batchNumber++;
          }
        }
     

    So you can see that it picks the watermark out of the last batch file. This is the one that has the first records in it, since the records in my trace run from newest to oldest. This is why I believe that the lack of an ORDER BY in the query has a profound impact on the results. If the data is randomly ordered, this calculation is useless. If it's in reverse order, like my data sample, this calculation is useless. If the data is in the correct order, then this calculation could have some merit.
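    If the enumeration order can't be trusted, the resumption logic shouldn't trust the last file on disk either. Here is a sketch of what an order-insensitive computation might look like, reusing the decompiled names above (the dictionary shape of _maxEnumeratedTimestamps is my assumption):

        // Take the per-table MAXIMUM watermark across ALL spooled batch files,
        // instead of whatever happens to be in the last file on disk.
        var resumeWatermarks = new Dictionary<string, ulong>();
        foreach (string name in this._batchDirectoryHeader._batchFileNames)
        {
            string path = Path.Combine(this._currentSessionSpoolDirectory, name);
            DbSyncBatchInfo info = DbSyncBatchInfoFactory.Deserialize(path);
            foreach (KeyValuePair<string, ulong> kv in info._maxEnumeratedTimestamps)
            {
                ulong seen;
                if (!resumeWatermarks.TryGetValue(kv.Key, out seen) || kv.Value > seen)
                    resumeWatermarks[kv.Key] = kv.Value;
            }
        }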

    For us, the batching serves two purposes: breaking the data up into manageable pieces from the perspective of memory usage, and allowing for efficient retransmission. For most sites the former is the key criterion. But for some, where we have serious internet connectivity quality issues, the latter is paramount.

    This represents a serious issue for us.

    Tuesday, June 5, 2012 6:59 PM
  • I must say that there has been a deafening silence from all of the Microsoft people on this forum about this.

    One of the most respected contributors here, June, identifies this as likely a bug. I provide what I believe is ample evidence of it and...

    Nothing!

    I'd have thought that a bug in the batch handling would be a major problem but apparently not.

    It sure is important to us.

    If it's not a bug, help us understand what we're doing wrong. If it is, let's get it identified and a plan in place to get it fixed.

    What should my next step be?

    Sunday, June 17, 2012 2:53 PM