SqlSyncProvider appears to miss rows on occasion

  • Question

  • I'm investigating an issue with n-tier syncing.  The current code uses WCF and SqlSyncProvider across about 20 replicas.  There is a central replica that communicates with every other replica, and there are groups of replicas that can communicate with each other and with the central peer.  I'm trying to run down a problem with missing rows that appears to be related either to concurrent sync sessions or to load.

    I can replicate the issue fairly frequently with a small number of nodes if I apply changes to all the nodes periodically during the test.

    To start the test, I provision one database, back up the database, distribute the backup to multiple clients, and run PerformPostRestoreFixup on each client.  Initially, one table (Table A) is populated with about 300,000 rows and the other (Table B) is empty.

    The test writes a new row to Table B and updates a field in Table A.  The two tables are logically related, but no relationships are defined between them in the database.  The test script runs at each client node; it adds about 30,000 rows to Table B and updates the corresponding rows in Table A.  There are no overlaps between clients: each client updates its own set of 30,000 rows, and each add in Table B is a new unique row.  The updates are done in batches, usually of about 30 updates, with a pause of about 15 seconds between batches.

    The clients sync with each other about once every 15 seconds to one minute (via a WCF proxy and service endpoint).  SqlSyncProvider is the local provider, and the WCF endpoint uses SqlSyncProvider on the other end.  Each node runs SQL Server 2008 R2 Express.

    Typically the test chugs along and each client syncs the changes from its peers.  However, when the test is over, it's not uncommon to have a couple of rows out of sync.  One client may have a couple more changes than another, or be missing a couple of changes.

    Looking at the actual missing rows on a given client, with SyncTracer set to 4, I've seen cases where a change apparently wasn't picked up by selectchanges.  From what I've seen, it is a change that would have come immediately after the last row of the batch (assuming I'm reading the trace log correctly).

    For example, the change I'm missing on the other node corresponds to the entry in TableA_tracking with scope_update_peer_timestamp of 369597.

      SyncBatchProducer: Read last row's Sync_row_timestamp value for table TableA as 369596.
    VERBOSE, PeerManagerService, 5, 09/15/2014 18:58:10:531,       Checking to see if batch producer has table watermarks available from previous sync.
    VERBOSE, PeerManagerService, 5, 09/15/2014 18:58:10:531,       RelationalSyncProvider.BatchedEnum
    VERBOSE, PeerManagerService, 5, 09/15/2014 18:58:10:531,       Executing Command: [TableA_selectchanges]
    VERBOSE, PeerManagerService, 5, 09/15/2014 18:58:10:531,          Parameter: @sync_min_timestamp Value: 368633
    VERBOSE, PeerManagerService, 5, 09/15/2014 18:58:10:531,          Parameter: @sync_scope_local_id Value: 1
    VERBOSE, PeerManagerService, 5, 09/15/2014 18:58:10:531,          Parameter: @sync_scope_restore_count Value: 1
    VERBOSE, PeerManagerService, 5, 09/15/2014 18:58:10:531,          Parameter: @sync_update_peer_key Value: 4
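    To compare enumeration windows across many sync sessions, trace lines like the ones above can be grouped by command.  The following is only a minimal sketch, assuming every trace line follows the layout shown in this excerpt; `parse_commands` and both regular expressions are hypothetical helpers, not part of Sync Framework.

```python
import re

# Patterns are assumptions based on the trace layout shown above.
COMMAND_RE = re.compile(r"Executing Command:\s+\[(\w+)\]")
PARAM_RE = re.compile(r"Parameter:\s+(@\w+)\s+Value:\s+(\S+)")


def parse_commands(lines):
    """Group each 'Parameter:' line under the preceding 'Executing Command:' line."""
    commands = []
    for line in lines:
        match = COMMAND_RE.search(line)
        if match:
            commands.append({"command": match.group(1), "params": {}})
            continue
        match = PARAM_RE.search(line)
        if match and commands:
            # Values are kept as strings; convert to int where needed.
            commands[-1]["params"][match.group(1)] = match.group(2)
    return commands


# Sample lines copied from the trace excerpt above.
sample = [
    "VERBOSE, PeerManagerService, 5, 09/15/2014 18:58:10:531,       Executing Command: [TableA_selectchanges]",
    "VERBOSE, PeerManagerService, 5, 09/15/2014 18:58:10:531,          Parameter: @sync_min_timestamp Value: 368633",
    "VERBOSE, PeerManagerService, 5, 09/15/2014 18:58:10:531,          Parameter: @sync_update_peer_key Value: 4",
]

for entry in parse_commands(sample):
    print(entry["command"], entry["params"])
```

    Running this over a full trace file would give one entry per selectchanges call, which makes it easier to see how @sync_min_timestamp moves from one session to the next for a given peer key.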

    On the next go-round for peer id 4, the @sync_min_timestamp value is higher than 369597 and the change is never processed.  The key value for the row is nowhere in the log.
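    The symptom can be restated as a coverage check over enumeration windows.  This is only a sketch of my reading of the log: it assumes each selectchanges call covers timestamps strictly greater than @sync_min_timestamp, up to the last row timestamp the batch producer read.  The second window's values are hypothetical, since the log only tells me the next minimum was higher than 369597, and `is_skipped` is an illustrative helper.

```python
def is_skipped(row_timestamp, windows):
    """Return True if no enumeration window covers the row.

    windows is a list of (min_timestamp, last_row_timestamp) pairs.
    Assumption: a window covers min_timestamp < t <= last_row_timestamp.
    """
    return not any(low < row_timestamp <= high for low, high in windows)


windows = [
    (368633, 369596),  # from the trace: min timestamp and last row read
    (369600, 370500),  # hypothetical next session; real values not shown
]

# The missing change has timestamp 369597 and falls between the two windows.
print(is_skipped(369597, windows))
# A row inside the first window is covered.
print(is_skipped(369000, windows))
```

    If the next window had started at 369596 instead, the row would have been covered, so the question is why the minimum advanced past a timestamp that was never enumerated.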

    This doesn't happen every time.  Usually, with 80,000-90,000 updates, a few rows here and there are missed.  In the above case, 4 of the 5 clients picked up the change correctly but the fifth did not.  Could it cause an issue that a peer pulls changes from another peer (local provider = new SqlSyncProvider, remote = proxy) while also receiving an inbound request for changes via its WCF service endpoint?

    Thanks,

    Rob

    Tuesday, September 16, 2014 1:55 AM

All replies