Repairing Knowledge

  • Question

  • We have a SyncFx 2.1 hub-and-spoke style app.  Some of our replicas have ended up with a minimum tick count for remote replicas that is higher than the current timestamp at that remote replica.  We are thinking about writing a tool to instantiate a SyncKnowledge object from the knowledge stored in scope_info, adjust the minimum tick count, and then save it back to scope_info.  It sounds draconian, but we're not sure what else we can do to get this data flowing again.

    Can anyone comment on whether they have done something like this before or things we should beware of as we try?  I understand it's risky.
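    In case it helps anyone reading along, here is a rough sketch of the kind of tool we have in mind: deserialize the knowledge blob from scope_info and dump each replica's tick count.  The connection string, scope name, and knowledge column name are placeholders for illustration (the exact scope_info column name can differ by provisioning), and I'm going from the managed SyncFx 2.1 API docs from memory, so treat this as a starting point rather than tested code:

```csharp
using System;
using System.Data.SqlClient;
using Microsoft.Synchronization;

class KnowledgeDump
{
    static void Main()
    {
        // Placeholder connection string and scope name -- substitute your own.
        using (var conn = new SqlConnection(
            @"Data Source=.;Initial Catalog=SyncDb;Integrated Security=True"))
        {
            conn.Open();
            var cmd = new SqlCommand(
                "SELECT sync_scope_knowledge FROM dbo.scope_info " +
                "WHERE sync_scope_name = @scope", conn);
            cmd.Parameters.AddWithValue("@scope", "MyScope");
            byte[] raw = (byte[])cmd.ExecuteScalar();

            // Rehydrate the knowledge object from the serialized blob.
            SyncKnowledge knowledge = SyncKnowledge.Deserialize(raw);

            // KnowledgeInspector exposes the scope (clock) vector: one
            // replica-key / tick-count pair per replica the store knows about.
            var inspector = new KnowledgeInspector(
                KnowledgeInspectionKind.ScopeVector, knowledge);
            foreach (IClockVectorElement e in inspector.ScopeVector)
            {
                SyncId replicaId =
                    knowledge.ReplicaKeyMap.LookupReplicaId(e.ReplicaKey);
                Console.WriteLine("{0}: tick count {1}",
                    replicaId, e.TickCount);
            }
        }
    }
}
```

    Viewing the vector this way is straightforward; as noted below, there is no equally easy supported way to edit a single tick count and write it back.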

    Chris W.

    More details:

    The server database is SQL Server 2008.  The client databases are SQL Server 2008 Express.
    We use .NET/WCF/HTTPS to connect clients to the server.

    Our problems are in some of our bidirectional scopes.
    The symptom is users reporting that they see rows in the online (server) database but not in the offline (client) database, and vice versa.

    There was a snafu during initial deployment: one client database was used to clone all clients, without running PostRestoreFixup after each clone.  This was remedied later by redeploying a server backup to the clients and running PostRestoreFixup on each, then running local inserts to copy over any records that had been added only to the old client database and were missing from the new one.

    We've run SQL Profiler and can see that the calls to _selectchanges for the problem scopes are using timestamp values above the database's current timestamp.  (That is, clients that can't see server data pass a known timestamp to the server that is forwarded to _selectchanges and is above the server's min_active_rowversion.)  The same is true in reverse when the server database doesn't have the client's data.
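    For anyone wanting to check for the same symptom without running Profiler: you can compare each scope's recorded watermark against MIN_ACTIVE_ROWVERSION() directly.  This is a minimal sketch; the connection string is a placeholder and the scope_timestamp column name matches the default SyncFx 2.1 scope_info layout but may differ in your provisioning:

```csharp
using System;
using System.Data.SqlClient;

class TimestampCheck
{
    static void Main()
    {
        // Placeholder connection string -- point this at the database to check.
        using (var conn = new SqlConnection(
            @"Data Source=.;Initial Catalog=SyncDb;Integrated Security=True"))
        {
            conn.Open();
            var cmd = new SqlCommand(@"
                SELECT s.sync_scope_name,
                       CAST(s.scope_timestamp AS bigint)       AS scope_ts,
                       CAST(MIN_ACTIVE_ROWVERSION() AS bigint) AS min_active
                FROM   dbo.scope_info s", conn);
            using (SqlDataReader r = cmd.ExecuteReader())
            {
                while (r.Read())
                {
                    long scopeTs = r.GetInt64(1);
                    long minActive = r.GetInt64(2);
                    // A scope watermark at or above min_active_rowversion is the
                    // symptom described above: _selectchanges gets called with a
                    // timestamp the database has not actually reached yet.
                    Console.WriteLine("{0}: scope_ts={1} min_active={2}{3}",
                        r.GetString(0), scopeTs, minActive,
                        scopeTs >= minActive ? "  <-- suspect" : "");
                }
            }
        }
    }
}
```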

    We have several scopes, with some cross-scope table dependencies.  When records fail to copy in the "bad" scopes, the other scopes accumulate large numbers of conflicts.  This is leading to a whole host of other problems due to large SOAP payloads (30-80 MB, some as high as 100 MB).  We feel that fixing the timestamp/knowledge issue is the best way to solve these other problems, since it would get rid of the conflicts.

    We're not sure how this data is getting corrupted.  In some cases it seems users were able to restore our new database, run PostRestoreFixup, run one sync...and still end up with corrupted knowledge.  We're trying to confirm this now.  We also have other users who followed the restore process and are working just fine.  Any other thoughts on how knowledge can get corrupted?

    Tuesday, January 25, 2011 9:26 PM

All replies

  • We've found that you can use the SyncKnowledge class to view the tick count for each replica, but there doesn't seem to be an easy way to adjust it.  In the end we're going to try to just start over.  (It turns out our server had been restored from an empty test database without PostRestoreFixup being run.)

    So, here is our plan to recover/reset.  Is this an ok approach to basically start over with a system that is already in production?

    1) Deprovision the server database and reprovision it.
        a. I'd like to do this instead of just running PostRestoreFixup because there is so much bad knowledge in place at the moment.
    2) Back up the newly cleaned server database and copy it to the clients.
    3) Run PostRestoreFixup on each client.
    4) Run data recovery scripts to insert into the new client database any data that exists only in the old client database.
    5) Disable the current sync WCF endpoints and deploy a new set of endpoints.  (Thus ensuring sync will only work with the new local database in place.)

    Does anyone have any tips to watch for as we basically start over?
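    For step 1, this is roughly what we have in mind, using the SqlServer provisioning classes.  The connection string, table name, and scope name are placeholders, and this is a sketch from the SyncFx 2.1 managed API rather than our production code:

```csharp
using System.Data.SqlClient;
using Microsoft.Synchronization.Data;
using Microsoft.Synchronization.Data.SqlServer;

class ResetServerScope
{
    static void Main()
    {
        using (var serverConn = new SqlConnection(
            @"Data Source=.;Initial Catalog=ServerDb;Integrated Security=True"))
        {
            // Drop every scope plus all sync metadata (tracking tables,
            // triggers, stored procedures) in one pass, rather than
            // deprovisioning scope by scope.
            var deprov = new SqlSyncScopeDeprovisioning(serverConn);
            deprov.DeprovisionStore();

            // Reprovision from scratch.  "Orders" and "MyScope" stand in
            // for the real tables and scope names.
            var scopeDesc = new DbSyncScopeDescription("MyScope");
            scopeDesc.Tables.Add(
                SqlSyncDescriptionBuilder.GetDescriptionForTable(
                    "Orders", serverConn));

            var prov = new SqlSyncScopeProvisioning(serverConn, scopeDesc);
            prov.Apply();
        }
    }
}
```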

    Chris W.


    Friday, January 28, 2011 8:07 PM
  • Chris,

    Is the server database very large? You could start out with clean, empty client databases and use Sync Framework right from the start to do the initial sync between clients and server to avoid corrupting sync knowledge.
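    The suggestion above would look something like this in the 2-tier case (ignoring the WCF tier for brevity): provision each empty client from the server's scope description, then let the first Synchronize() do the full initial download.  Connection strings and the scope name are placeholders:

```csharp
using System;
using System.Data.SqlClient;
using Microsoft.Synchronization;
using Microsoft.Synchronization.Data.SqlServer;

class InitialClientSync
{
    static void Main()
    {
        // Placeholder connection strings -- substitute your own.
        using (var serverConn = new SqlConnection(
            @"Data Source=server;Initial Catalog=ServerDb;Integrated Security=True"))
        using (var clientConn = new SqlConnection(
            @"Data Source=.\SQLEXPRESS;Initial Catalog=ClientDb;Integrated Security=True"))
        {
            // Provision the (empty) client from the server's scope description
            // so the first sync downloads everything with clean knowledge.
            var scopeDesc = SqlSyncDescriptionBuilder.GetDescriptionForScope(
                "MyScope", serverConn);
            var clientProv = new SqlSyncScopeProvisioning(clientConn, scopeDesc);
            if (!clientProv.ScopeExists("MyScope"))
                clientProv.Apply();

            var orch = new SyncOrchestrator
            {
                LocalProvider  = new SqlSyncProvider("MyScope", clientConn),
                RemoteProvider = new SqlSyncProvider("MyScope", serverConn),
                Direction      = SyncDirectionOrder.DownloadAndUpload
            };
            SyncOperationStatistics stats = orch.Synchronize();
            Console.WriteLine("Downloaded {0}, uploaded {1}",
                stats.DownloadChangesApplied, stats.UploadChangesApplied);
        }
    }
}
```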

    Thursday, February 3, 2011 10:11 PM
    The database is just under 3 GB.  Apparently they had tried that approach before and the time to download the database from scratch was too long.  In the end, the reason our knowledge was so confused was a series of missteps, including failing to deprovision and reprovision the production server (it was a restored backup from another server) and then failing to run PostRestoreFixup on each client individually.  Those issues have now been fixed by following my last posted steps (so anyone facing this can follow those steps; they did work).

    We are now being asked to implement row filtering.  (Previously we just synced the entire database.)  Given that requirement, I think we will have to go the route you suggest, because we can no longer deploy entire databases to clients via backup/restore.
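    For the record, row filtering in SyncFx 2.1 is defined at provisioning time, so it fits naturally with reprovisioning each client from scratch.  A sketch of a filtered scope definition follows; the table name, filter column, and filter value are invented for illustration, and note that 2.1 filters are static per scope, so you typically provision one scope per filter value:

```csharp
using System.Data.SqlClient;
using Microsoft.Synchronization.Data;
using Microsoft.Synchronization.Data.SqlServer;

class FilteredScope
{
    static void Main()
    {
        using (var serverConn = new SqlConnection(
            @"Data Source=.;Initial Catalog=ServerDb;Integrated Security=True"))
        {
            // Placeholder scope and table names; one scope per filter value.
            var scopeDesc = new DbSyncScopeDescription("Orders-Region1");
            scopeDesc.Tables.Add(
                SqlSyncDescriptionBuilder.GetDescriptionForTable(
                    "Orders", serverConn));

            var prov = new SqlSyncScopeProvisioning(serverConn, scopeDesc);
            // Track the filter column in the tracking table, then restrict
            // change enumeration to the matching rows.
            prov.Tables["Orders"].AddFilterColumn("RegionId");
            prov.Tables["Orders"].FilterClause = "[side].[RegionId] = 1";
            prov.Apply();
        }
    }
}
```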

    Chris W.

    Friday, February 4, 2011 1:19 PM
  • Hi Chris,

    If you still want to distribute an image of the server database to client machines, make sure to sync the server database at least once with any client.  Then back up the server database, distribute it to the clients, restore it, and run PostRestoreFixup on each client machine.  That should avoid corrupting knowledge when the clients sync back to the server.
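    The post-restore fixup step is exposed in the managed API, so it can be scripted into the client deployment rather than run by hand.  A minimal sketch, with a placeholder connection string:

```csharp
using System.Data.SqlClient;
using Microsoft.Synchronization.Data.SqlServer;

class FixupAfterRestore
{
    static void Main()
    {
        // Run once on each client immediately after restoring the server
        // image, before the first sync.
        using (var clientConn = new SqlConnection(
            @"Data Source=.\SQLEXPRESS;Initial Catalog=ClientDb;Integrated Security=True"))
        {
            var restore = new SqlSyncStoreRestore(clientConn);
            restore.PerformPostRestoreFixup();
        }
    }
}
```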



    Saturday, February 5, 2011 1:39 AM