Monday, August 16, 2010

DFSR R2 and event id 2104

Today my DFS-R volume started logging error 2104 every hour in the DFS Replication event log.

Here's the content of the error event:

The DFS Replication service failed to recover from an internal database error on volume F:. Replication has been stopped for all replicated folders on this volume.

Additional Information:
Error: 9214 (Internal database error (-1414))
Volume: A9REC15F-ED9F-11DB-A78E-0019B44441DC
Database: F:\System Volume Information\DFSR

For more information, see Help and Support Center at
http://go.microsoft.com/fwlink/events.asp.



My configuration is Windows Server 2003 R2 with a DFSR full mesh topology on two nodes. The replicated folder is f:\data_to_replicate. Shadow Copies are enabled for volume F:, and the Shadow Copies storage area is on volume G:. The replicated folder is very heavily accessed, and many very small files are modified continuously.

I ran a DFS Replication Health Report, and here is what it reported for the problematic DFS member:

  • A database problem is blocking replication on volume F:.
  • DFS Replication unable to replicate files for replicated folder data_to_replicate due to insufficient disk space.
  • Cannot access DFS Replication performance counters.
  • Cannot access the local WMI repository.
  • One or more replicated folders have sharing violations.
The detailed error description is: “The DFS Replication service was unable to recover from an internal database error on volume F:. Replication has stopped for all replicated folders on this volume until the database is automatically rebuilt. If the database is rebuilt successfully, replication will resume after the rebuilding is complete. If the database cannot be rebuilt, a separate event is generated. If you are seeing this error more than two times in seven days, we recommend that you run Chkdsk on the volume that contains the database. Event ID: 2104”

I had a look at F:\System Volume Information\DFSR and found that the SimilarityTable_1 file had grown to 8 GB and taken all the available space on our data drive.
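
To find out what is eating a volume, as SimilarityTable_1 was here, you can list its largest files. The Python sketch below is my own illustration, not a DFSR tool: `largest_files` is a hypothetical helper, and it assumes you run it elevated, because System Volume Information is normally readable only by the SYSTEM account. Directories it cannot read are skipped silently, so an empty result usually points to a permissions problem.

```python
import heapq
import os


def largest_files(root, count=10):
    """Return up to `count` (size_in_bytes, path) pairs under `root`, largest first."""
    sizes = []
    # os.walk silently skips directories it cannot list (e.g. access denied).
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                # A file can vanish or be locked while we walk; skip it.
                continue
    return heapq.nlargest(count, sizes)


if __name__ == "__main__":
    # Assumption: run elevated; System Volume Information is normally SYSTEM-only.
    for size, path in largest_files(r"F:\System Volume Information\DFSR"):
        print(f"{size / 2**30:8.2f} GB  {path}")
```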



To sum up, the disk space situation is as follows:
  • Server001: disk F: is full (the SimilarityTable_1 file alone takes 8 GB).
  • Server002: disk F: is OK, with more than 1 GB available.
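
A simple way to keep an eye on both members is to compare each volume's free space against a minimum. This is a minimal Python sketch built on the standard `shutil.disk_usage`; the helper name `low_space_volumes`, the volume list and the 2 GB threshold are hypothetical values chosen for illustration.

```python
import os
import shutil


def low_space_volumes(volumes, min_free_bytes):
    """Return (volume, free_bytes, total_bytes) for each volume below min_free_bytes."""
    low = []
    for volume in volumes:
        usage = shutil.disk_usage(volume)  # named tuple (total, used, free), in bytes
        if usage.free < min_free_bytes:
            low.append((volume, usage.free, usage.total))
    return low


if __name__ == "__main__":
    # Hypothetical volume list and threshold; only existing volumes are checked.
    volumes = [v for v in ("F:\\", "G:\\") if os.path.exists(v)]
    for volume, free, total in low_space_volumes(volumes, min_free_bytes=2 * 2**30):
        print(f"WARNING {volume}: {free / 2**30:.2f} GB free of {total / 2**30:.2f} GB")
```

Running it from a scheduled task on each member is one way to get an early warning before a volume fills up.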
Searching http://www.microsoft.com/technet, I found that the user action Microsoft recommends is: “The system will attempt to rebuild the database automatically. However, you should ensure there is sufficient disk space on the volume for database maintenance and check the NTFS log for volume errors, which can help you troubleshoot possible hardware failures. If the database cannot be rebuilt, a separate event is generated. If you see this error frequently, you should run Chkdsk on the volume that contains the database to verify that the problem is not disk-related.”

In short, the solution is simple: wait for the temporary SimilarityTable to be emptied and, if you can, free up some space on the full volume to speed up the process. In my case, I had a few large files I could delete from the F: volume, and after two hours everything was back to normal.

If in the meantime your Conflict and Deleted folder has grown, as it did in my case, clean it up manually. A manual clean-up lets you choose which files to keep. Delete the rest once you are sure that each member has the latest version of the files you need.
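
One way to be sure each member holds the same version of a file before deleting anything is to compare content hashes of the copies. The Python sketch below assumes you can reach each member's copy through a path such as a UNC share; `sha256_of` and `copies_match` are hypothetical helper names.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(paths):
    """True if all the given copies have the same SHA-256 hash (needs at least one path)."""
    return len({sha256_of(p) for p in paths}) == 1
```

For example, `copies_match([r"\\Server001\data\report.doc", r"\\Server002\data\report.doc"])` returns True only if both copies have the same hash (the share and file names here are made up).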

As Microsoft states, DFS Replication uses a "last-writer wins" method for determining which version of a file to keep when a file is modified on two or more members. The losing file is stored in the Conflict and Deleted folder on the member that resolves the conflict. This member might not be the member where the changes originated.
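
The rule can be modeled in a few lines. This is a deliberately simplified Python model, not DFSR's implementation: each copy is reduced to the member holding it and its last-write time, the newest write wins, and the other copies are the ones that would end up in ConflictAndDeleted. The tie-break on member name is an assumption added only to make the result deterministic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileVersion:
    member: str      # server holding this copy, e.g. "Server001"
    modified: float  # last-write time, seconds since the epoch


def resolve_conflict(versions):
    """Simplified last-writer-wins: return (winner, losers).

    The copy with the newest last-write time wins; every other copy is a loser,
    i.e. the kind of copy DFSR moves to ConflictAndDeleted. Ties are broken by
    member name only to keep this sketch deterministic (an assumption).
    """
    ordered = sorted(versions, key=lambda v: (v.modified, v.member), reverse=True)
    return ordered[0], ordered[1:]


if __name__ == "__main__":
    winner, losers = resolve_conflict([
        FileVersion("Server001", 1_281_950_000.0),
        FileVersion("Server002", 1_281_953_600.0),  # written one hour later
    ])
    print("kept:", winner.member, "| to ConflictAndDeleted:", [v.member for v in losers])
```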

This link leads to a good post explaining how to purge the Conflict and Deleted folder. When DFSR is in an error state, go straight to the second scenario:
  • Stop the DFSR service on every member.
  • Delete the contents of the ConflictAndDeleted folder manually (with explorer.exe or DEL) on every member.
  • Delete the ConflictAndDeletedManifest.xml file on every member.
  • Start the DFSR service again on every member.
  • Wait a few minutes to be sure that replication starts correctly.
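
If you have to purge several members, the two deletion steps can be scripted. This Python sketch only performs those deletions on the member where it runs; stop the service on every member first (`net stop DFSR`) and start it again afterwards (`net start DFSR`). The `F:\data_to_replicate\DfsrPrivate` path assumes the default location of DfsrPrivate under the replicated folder described above, so check it on your servers before running with `--yes`.

```python
import shutil
import sys
from pathlib import Path


def purge_conflict_and_deleted(dfsr_private):
    """Empty DfsrPrivate\\ConflictAndDeleted and delete its manifest.

    Returns the number of top-level entries removed. Run it only while the
    DFSR service is stopped on every member.
    """
    dfsr_private = Path(dfsr_private)
    removed = 0
    conflict_dir = dfsr_private / "ConflictAndDeleted"
    if conflict_dir.is_dir():
        for entry in conflict_dir.iterdir():
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)
            else:
                entry.unlink()
            removed += 1
    manifest = dfsr_private / "ConflictAndDeletedManifest.xml"
    if manifest.exists():
        manifest.unlink()
        removed += 1
    return removed


if __name__ == "__main__":
    # Assumed location: DfsrPrivate under the replicated folder from this post.
    target = Path(r"F:\data_to_replicate\DfsrPrivate")
    if "--yes" in sys.argv:
        print(f"Removed {purge_conflict_and_deleted(target)} entries under {target}")
    else:
        print("Stop DFSR on every member first (net stop DFSR), then rerun with --yes.")
```

Run it on each member in turn, then restart the service everywhere and watch the event log to confirm that replication resumes.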
As a final note, remember to size the staging folders appropriately for the replication load. Disks hosting DFSR folders must never fill up!

For tips on configuring and optimizing the staging quota size, and on the consequences of staging folders that are too small, refer to this link.
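
Staging sizing guidance is usually expressed as the combined size of the N largest files in the replicated folder. The Python sketch below computes that sum; the value of N is an assumption you should take from Microsoft's guidance for your Windows version (32 is the figure commonly quoted for Windows Server 2008 and later), and `staging_quota_estimate` is a hypothetical helper name. It skips DfsrPrivate so that staging and conflict files are not counted.

```python
import heapq
import os


def staging_quota_estimate(replicated_folder, n_largest):
    """Sum the sizes (bytes) of the n_largest files under replicated_folder."""
    sizes = []
    for dirpath, dirnames, filenames in os.walk(replicated_folder):
        # Prune DFSR's private folder (staging, ConflictAndDeleted, ...).
        dirnames[:] = [d for d in dirnames if d.lower() != "dfsrprivate"]
        for name in filenames:
            try:
                sizes.append(os.path.getsize(os.path.join(dirpath, name)))
            except OSError:
                continue  # file vanished or is inaccessible; skip it
    return sum(heapq.nlargest(n_largest, sizes))


if __name__ == "__main__":
    N_LARGEST = 32  # assumption: verify the right N for your Windows version
    quota = staging_quota_estimate(r"F:\data_to_replicate", N_LARGEST)
    print(f"Suggested minimum staging quota: {quota / 2**20:.0f} MB")
```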
