Sql Sync Provider - Knowledge Size Problem

  • Question

  • We are using SyncFX 2.1 SqlSyncProvider over WCF to sync our SQL 2008 client and server custom application peers.  Typically there are 10-20 peers in our testing environment, plus a Central peer.  We have had reasonable success getting transactional data synced among all peers with good performance.  Each peer is a Windows WPF application capable of hosting many incoming sync requests via WCF at the same time (concurrent sync sessions are supported by the underlying SyncFX 2.1), and while idle, each end-to-end sync with a minimal data delta takes only a few seconds of overhead.  It gets incrementally slower, however, as a client hosts more requests at the same time.

    The problem occurs when scope_info.sync_scope_knowledge (the Sync Knowledge) grows past a certain size, in our case about 100k: a sync session then gets *exponentially* slower.

    For example, a sync that took a few seconds now takes 30 seconds and up end to end.  We have timed the different steps in a sync session and found that the increased overhead spills into many of them, such as EnumerateChanges() and ApplyChanges(), so it is hard to pinpoint one specific area of slowdown.  Running the Visual Studio profiler does not help either, because we cannot time SyncFX internal APIs.

    The knowledge grows quickly because of failed inserts and deletes: every time an "exception" happens, it gets stored as a clock vector inside the aforementioned knowledge column, and even a small exception requires filling out a complete clock vector (with a dimension of 10-20 peers), which takes up a lot of storage.  Once this app goes into production there will be 100 or so peers in total, so it is evident that knowledge size poses a serious problem.

    1) Is there an explanation for why SyncFX scales so badly with increased knowledge size?

    2) We could make our app clean up the sync knowledge and remove outdated exception RangeSets regularly, but there is no guidance on how to accomplish this.  We looked into the SqlSyncStoreMetadataCleanup class and it is not the solution we want, as it only cleans up "cleanup knowledge" past a certain number of retention days, not the main knowledge.

    Monday, January 28, 2013 5:26 PM

All replies

  • If you enable Sync Framework tracing, you should be able to extract the timings for each operation during synchronization.
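    For reference, tracing can be turned on from the application's app.config via the documented SyncTracer switch; this is a minimal sketch (the listener name and log path are illustrative):

    ```xml
    <configuration>
      <system.diagnostics>
        <switches>
          <!-- SyncTracer levels: 0=off, 1=error, 2=warning, 3=info, 4=verbose -->
          <add name="SyncTracer" value="3" />
        </switches>
        <trace autoflush="true">
          <listeners>
            <!-- Write trace output to a file so per-operation timings can be inspected -->
            <add name="SyncListener"
                 type="System.Diagnostics.TextWriterTraceListener"
                 initializeData="c:\logs\sync-trace.log" />
          </listeners>
        </trace>
      </system.diagnostics>
    </configuration>
    ```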

    I am not sure how you came to the conclusion that exceptions get stored in the sync knowledge.

    The more peers you have, the more replica IDs and clock vectors there are to store in the sync knowledge.  The more deletes you do, the more fragmented it gets.  The metadata cleanup should help compact the sync knowledge.

    Tuesday, January 29, 2013 1:23 AM
  • Thanks for your reply.  

    "if you enable Sync Framework tracing, you should be able to extract the timings for each operation during synchronization."

    Yes, we do have sync tracing on all the time.  It does appear, though, as I mentioned, that the overhead with a large knowledge size is across the board, e.g. in EnumerateInternal() and ApplyChangesInternal().  Note that a larger knowledge object increases wire-to-wire WCF transport time, but at 100K the transport time is still negligible, so I imagine most of the overhead occurs in the actual processing of the bigger clock-vector set.  We did time SyncKnowledge.Serialize() and .Deserialize(), which are public APIs, and found them not noticeably slower.  Many other lower-level APIs are internal, so we can neither profile them nor run and clock them ourselves.
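    For what it's worth, our timing of the public serialization APIs looked roughly like this sketch (how the SyncKnowledge instance is obtained from the provider is elided; the method name is illustrative):

    ```csharp
    using System;
    using System.Diagnostics;
    using Microsoft.Synchronization;

    static class KnowledgeTimer
    {
        // Sketch: time a SyncKnowledge serialization round-trip.
        static void TimeRoundTrip(SyncKnowledge knowledge)
        {
            var sw = Stopwatch.StartNew();
            byte[] blob = knowledge.Serialize();
            long serializeMs = sw.ElapsedMilliseconds;

            sw.Restart();
            // Deserialize needs the same id formats the knowledge was built with.
            SyncKnowledge copy = SyncKnowledge.Deserialize(knowledge.IdFormats, blob);
            long deserializeMs = sw.ElapsedMilliseconds;

            Console.WriteLine("size={0} bytes, serialize={1} ms, deserialize={2} ms",
                blob.Length, serializeMs, deserializeMs);
        }
    }
    ```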

    "am not sure how you got into conclusion that exceptions gets stored in the sync knowledge."

    A typical exception is when a row change unit from another peer fails an insert on the local peer due to a uniqueness constraint, resulting in an ADO.NET exception.  In our controlled test, SyncFX recognized it in the source knowledge as a "conflict and retry" and created two new clock vectors after the exception: one for the existing row, the other for the to-be-synced row.  This extra information is then duplicated to every peer participating in a sync with the said local peer, and soon everybody's knowledge bloats.  The extra clock vectors will remain in the knowledge forever if the conflict is not resolved (realistically, it never will be).  We want to proactively clean up these kinds of unresolved conflicts to keep the knowledge at a decent size over time, and we are looking for guidance on doing that.

    This finding is also backed by studying the source code of the ChangeHandlerBase.ApplyChange() method: the line "this.MappedSourceKnowledge.ExcludeItem(this.SourceRowMetadata.Id);" does an exclude (which actually adds a clock vector) on the knowledge for "will retry next sync" row scenarios.

    " the more peers you have the more replica ID's and clock vectors to store in the sync knowledge. the more deletes you do, the more it gets fragmented. the metadata cleanup should help compact the sync knowledge."
    Yes, but the SqlSyncStoreMetadataCleanup.PerformCleanup() method only cleans up "Tombstone Knowledge", not the main knowledge, which is our primary concern.
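    For context, this is roughly how we invoke that cleanup today, a minimal sketch with an illustrative connection string and retention window; note that it compacts only the tombstone/cleanup metadata older than the retention period, not the main knowledge column:

    ```csharp
    using System.Data.SqlClient;
    using Microsoft.Synchronization.Data.SqlServer;

    using (var conn = new SqlConnection(
        @"Data Source=.;Initial Catalog=PeerDb;Integrated Security=True"))
    {
        var cleanup = new SqlSyncStoreMetadataCleanup(conn)
        {
            RetentionInDays = 7   // drop sync metadata older than a week
        };

        // Returns false if some metadata could not be removed
        // (e.g. rows still needed by an in-progress sync).
        bool cleanedAll = cleanup.PerformCleanup();
    }
    ```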

    Tuesday, January 29, 2013 5:28 PM
  • ok, got it.

    so you're not resolving conflicts? 

    If you're not, then that would actually result in more time spent on change enumeration, as it has to include in the enumeration those rows marked as Retry on Next Sync.  It's essentially, "select all changes that happened since the last sync and include all rows that were previously marked as retry on next sync".

    I may be wrong, but I don't think it's the actual processing of the Sync Knowledge that is slowing it down; rather, it's the change enumeration.  (The actual Sync Knowledge processing is done in unmanaged code.)

    Wednesday, January 30, 2013 1:34 AM
  • June

    We're revisiting this issue again, as any knowledge over 100k (scope_sync_knowledge) slows things down tremendously.

    In our app, we do handle the ApplyChangeFailed event but our handling is limited to conflicts only.   Conflicts such as LocalDeleteRemoteUpdate and LocalUpdateRemoteUpdate are handled based on context of the application.   We use a range of actions such as ApplyAction.Continue, ApplyAction.RetryWithForceWrite, ApplyAction.RetryApplyingRow, ApplyAction.RetryNextSync.

    It is not clear, however, how SyncFX 2.1 determines what to add to the knowledge based on these actions.  Articles such as http://msdn.microsoft.com/en-us/library/cc761628.aspx really do not tell us that.

    Also, we are not doing anything with the DbApplyChangeFailedEventArgs.Error property.  I believe most of the exceptions (not just conflicts) can be tracked here.  What we really need is the following: since most of the time an error is not fixable, in such cases we would just tell SyncFX to "forget about it" and not bother storing these exceptions in the knowledge and sending them down over and over again.  As you mentioned, the extra enumeration needs to go through these exception rows every time, and that slows down end-to-end time dramatically, as we have observed.  (It is still a theory that enumeration is causing it, but we do know that the bigger the knowledge, the MUCH slower it is.)

    We are also thinking about ways to periodically clean up those clock vectors that we deem not useful.  If an exception/conflict has been passed around many times and the app still cannot or will not handle it, then it's an easy decision to get rid of it.  We are setting our own custom expiration policy on them so they get deleted after a certain number of days.

    So we really have two questions.

    1) How do we best use ApplyAction.xxxx as well as DbApplyChangeFailedEventArgs.Error intelligently so that we can tell SyncFX to "forget about it" and not bother storing them for retry?  This would cure the root cause of knowledge growth.

    2) If we are to manually clean up the knowledge on a regular basis, we need to know if our expiration-based strategy is sound.  If not, what is a good way to clean it up?  We know that the SqlSyncStoreMetadataCleanup.PerformCleanup() method only cleans up "Tombstone Knowledge".


    • Edited by PUDONG001 Wednesday, February 13, 2013 10:55 PM
    Wednesday, February 13, 2013 6:31 PM
  • For #1, try setting the action to ApplyAction.Continue.  AFAIK, it's RetryNextSync that stores the extra metadata.
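    A minimal sketch of what that could look like in an ApplyChangeFailed handler; the IsUnrecoverable() classification helper, the scope name, and the connection are assumptions for illustration, not part of the framework:

    ```csharp
    using Microsoft.Synchronization.Data;
    using Microsoft.Synchronization.Data.SqlServer;

    var provider = new SqlSyncProvider("MyScope", connection);

    provider.ApplyChangeFailed += (sender, e) =>
    {
        if (e.Conflict.Type == DbConflictType.ErrorsOccurred)
        {
            // e.Error carries the underlying ADO.NET exception, e.g. a unique
            // constraint violation that will never succeed on retry.
            if (IsUnrecoverable(e.Error))          // your own classification logic
            {
                // Skip the row: no "retry next sync" metadata is recorded.
                e.Action = ApplyAction.Continue;
                return;
            }
        }

        // Real conflicts (LocalUpdateRemoteUpdate, etc.) are still resolved
        // per application context, as described earlier in the thread.
        e.Action = ApplyAction.RetryWithForceWrite;
    };
    ```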

    Thursday, February 14, 2013 1:35 AM