Thursday, September 18, 2014

Windows Server 2012 R2 Deduplication

In this article, you will learn how to manage the deduplication feature in Windows Server 2012 R2.

Introduction

Even as the cost per GB of storage continues to drop with vendors releasing hard drives of massive size, customers still seek ways to maximize their investment in what remains an expensive part of the data center. One of the most common ways organizations drive down the cost of their storage is by implementing deduplication in their storage environments.
If you pick just one new feature to turn on in Windows Server 2012 R2, you’ll probably want it to be deduplication. Primary storage deduplication is typically left to the hardware layer and may require expensive shared storage (SAN or NAS) with that capability. With Windows Server 2012 R2, you can implement this space- and money-saving feature using native controls at the file system level.
The feature originally debuted with Windows Server 2012, but the R2 iteration added some extras, including the ability to extend deduplication to CSVs (Cluster Shared Volumes) and VHDs (Hyper-V virtual hard disks), plus the new Expand-DedupFile PowerShell cmdlet in case you need to expand previously deduplicated files back to their full size.

Installing and Configuring Deduplication on Windows Server 2012 R2

The steps to deploy the feature are simple. First, enable it through the Add Roles and Features wizard; you will find it under File and Storage Services | File and iSCSI Services | Data Deduplication:
Figure 1
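If you prefer the command line, the same feature can be installed with a quick PowerShell one-liner; a minimal sketch using the standard feature name:

    # Install the Data Deduplication feature (no reboot required)
    Install-WindowsFeature -Name FS-Data-Deduplication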
Once the installation completes (no reboot required), open the File and Storage Services section in Server Manager, then right-click your volume and select Configure Data Deduplication:
Figure 2
The Deduplication Settings page has a simple checkbox to enable the feature, along with the option to set the file age threshold for deduplication (the default is 5 days) and to add folder exclusions for workloads that are not compatible with deduplicated data.
Figure 3
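The equivalent configuration can also be done in PowerShell. A minimal sketch, assuming your data volume is D::

    # Enable deduplication on the volume (-UsageType is new in R2; Default = general purpose file server)
    Enable-DedupVolume -Volume "D:" -UsageType Default

    # Match the GUI setting: only deduplicate files older than 5 days
    Set-DedupVolume -Volume "D:" -MinimumFileAgeDays 5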
If you click the Set Deduplication Schedule button, you also have the option to throttle deduplication task priority during specific times, so that other tasks such as backups, or day-to-day production usage during business hours, can take priority over the deduplication jobs.
Figure 4
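Schedules can also be managed from PowerShell with the DedupSchedule cmdlets. A hedged sketch; the schedule name and time window below are assumptions for illustration:

    # Create an off-hours optimization window on weeknights
    New-DedupSchedule -Name "OffHoursOptimization" -Type Optimization `
        -Start (Get-Date "22:00") -DurationHours 7 `
        -Days Monday,Tuesday,Wednesday,Thursday,Friday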
Now that you have enabled deduplication on the volume, you can kick off the task manually to get things started. This is easily done using the PowerShell one-liner Start-DedupJob -Type Optimization -Volume D:
Figure 5
Now the deduplication job is running in the background, and you can easily monitor its progress using the Get-DedupStatus and Get-DedupJob cmdlets:
Figure 6
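As a quick sketch of what that monitoring looks like at the prompt (volume letter assumed):

    # Show running dedup jobs and their progress
    Get-DedupJob -Volume "D:"

    # Show overall savings and file counts for the volume
    Get-DedupStatus -Volume "D:" | Format-List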
Depending on the size of your volume, the process may take quite a while. Once the initial task has completed, the scheduled job will re-run daily, evaluating files on the volume against the file age threshold.

Performance Concerns

One of the top concerns for systems administrators is server performance during deduplication, along with application performance when accessing data that lives on a deduplicated volume.
On the sample volume used for the images above, you can see that the CPU and memory overhead is nominal, but there will be higher-than-normal utilization of the disk and disk controller as the first pass runs:
Figure 7
Beyond server performance, you will have to evaluate whether any issues occur when applications access data on a deduplicated volume. In those cases, you can simply add a folder exclusion to your volume deduplication settings, as we saw in the earlier screenshot during the initial setup.
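This can also be done in PowerShell; a minimal sketch, where the database path is a hypothetical example:

    # Exclude a folder whose workload does not tolerate deduplication
    # Note: -ExcludeFolder sets the full exclusion list for the volume
    Set-DedupVolume -Volume "D:" -ExcludeFolder "D:\Databases"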

After Deduplication

On the example volume, there was a final savings of 740 GB after the deduplication job completed. The final status screen shows how many files were optimized and how many are considered to be “in policy”, meaning they meet the file age criteria to be evaluated for deduplication.
Figure 8
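The same headline figures can be pulled from the Get-DedupStatus output; a quick sketch of the properties involved:

    # Report savings and file counts for each deduplicated volume
    Get-DedupStatus |
        Select-Object Volume, SavedSpace, OptimizedFilesCount, InPolicyFilesCount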
As you can see from this example, there was a significant savings in disk usage, which translates directly into a more efficient storage footprint.

Conflicts and Challenges with Deduplication

One particular issue with deduplication on Windows Server 2012 and Windows Server 2012 R2 is its interaction with FSRM (File Server Resource Manager) quotas. Unfortunately, hard quotas are not supported on a volume that is running data deduplication.
This is an issue because the quota is measured against actual used space on the volume, which is no longer represented correctly once files are optimized. Since that measurement cannot be trusted for enforcement, we have to rely on soft quotas only on deduplicated volumes.
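For monitoring purposes, a soft quota still works; a minimal sketch assuming FSRM is installed and using a hypothetical share path and size:

    # Create a 500 GB soft quota (reporting only, no enforcement)
    New-FsrmQuota -Path "D:\Shares\Users" -Size 500GB -SoftLimit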
Another service that cannot coexist with data deduplication is SIS (Single Instance Store), a predecessor available on Windows Storage Server. If you had that feature in place in the past, the migration to Windows Server 2012 will take some care and feeding.
One other thing to think about is your backup process. During deduplication, the archive bit is set on files that are optimized, which triggers those files to be included in incremental or differential backups. In the case of my sample volume, the incremental backup for that night included the large number of files touched during the deduplication pass. Your results may vary depending on whether your backup software uses the archive bit or file checksums to detect changes.

Realized Savings with Deduplication

The level of space savings will vary depending on a number of factors, such as file type, file age, and frequency of change. Even with large binary files, there will be some space savings from this feature. Given the low impact of running the service and the limited risk to other Windows services, there really is no reason not to run data deduplication, at least to see what the potential win could be.
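In fact, you can estimate the win before enabling anything. The Data Deduplication feature ships with an evaluation tool, DDPEval.exe, which reports expected savings for a volume, folder, or share; the path below is just an example:

    # Estimate potential deduplication savings without enabling the feature
    C:\Windows\System32\DDPEval.exe E:\Shares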
It is important to note that deduplication should not be used to overcommit storage on a volume. This is a common trap, where operational efficiency is misinterpreted as a way to save money on hardware growth. The feature is meant to reduce utilization, but you do have to be wary of letting deduplicated volumes run at extremely high usage levels: a sudden spike in usage and change on the file system could trigger inflation of the real utilized space and risk hitting the actual storage limit of the disks.

What about my Storage Hardware Deduplication?

This is a great question that comes up when we approach the idea of OS-based storage optimization. Luckily, there are no known conflicts between the Windows Server 2012 deduplication feature and the widely available hardware-level deduplication features in shared storage environments. The obvious win is for customers who do not have hardware-based capability in their data center, and this is a particularly exciting feature for ROBO (Remote Office / Branch Office) deployments.
All in all, Microsoft has delivered a strong product with Windows Server 2012 and Server 2012 R2 deduplication, and we look forward to watching more features as they come out in future releases.