Tuesday, July 9, 2013

Testing the new Get-FileHash PowerShell cmdlet

I have been playing with PowerShell v4 on Windows Server 2012 R2 for a week now. My last post was about some of the new workflow functionality, which you can read here.

Today I'm going to focus on a new cmdlet which has been added to the latest release of PowerShell: Get-FileHash. This cmdlet takes any file and generates its hash, which is a kind of "signature" for the stream of data that represents the contents of the file itself. To better understand this concept, let's see some hashes at work. Here we have two files, named smallfile.txt and bigfile.txt:
PS C:\> Get-Content .\smallfile.txt
this is a small file

PS C:\> Get-Content .\bigfile.txt
this is a big file with more content
Now let's generate their signatures:
PS C:\> Get-FileHash .\smallfile.txt | fl

Path : C:\smallfile.txt
Type : System.Security.Cryptography.SHA256Managed
Hash : Wbfe8JVPZEEgYw/FL8nKz597pWmdkA3UDuDnqbvZ9tE=

PS C:\> Get-FileHash .\bigfile.txt | fl

Path : C:\bigfile.txt
Type : System.Security.Cryptography.SHA256Managed
Hash : FshDknouCGcNbDCA0AtYEtRC0hmPblSgP2L5PRliD4c=
The cmdlet returns a Hash property which the algorithm tries to keep unique for each distinct input. Even a single change to the source stream of data yields sweeping changes in the value of the hash; this is known as the 'avalanche effect', and it is best demonstrated by hashing two files with nearly identical content. For instance, I've changed just the first character of bigfile.txt from 't' to 'T', and the whole hash changes:
PS C:\> Get-FileHash .\bigfile.txt | fl

Path : C:\bigfile.txt
Type : System.Security.Cryptography.SHA256Managed
Hash : /iloMfyOMQsRsv/rjMTU9hck7hYeHK9atGen4pm5yWE=
Let's now move to another aspect of this subject. As you can see in the Type property above, the default algorithm used by Get-FileHash is SHA256 (Secure Hash Algorithm, 256 bit), which is one of a number of cryptographic hash functions. SHA-256 generates an almost-unique, fixed-size 256-bit (32-byte) hash.
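That Type property also reveals the .NET class doing the work under the hood. Roughly, the cmdlet does something like the following sketch (equivalent in result, though of course not the cmdlet's actual source code):
$stream = [System.IO.File]::OpenRead('C:\bigfile.txt')           # open the file as a stream of bytes
$sha256 = New-Object System.Security.Cryptography.SHA256Managed  # the class shown in the Type property
$bytes  = $sha256.ComputeHash($stream)                           # 32 raw bytes
$stream.Close()
[Convert]::ToBase64String($bytes)                                # Base64, like the Hash property above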
Other possible hashing algorithms for Get-FileHash are SHA1, SHA384, SHA512, MACTripleDES, MD5 and RIPEMD160. Microsoft probably chose SHA256 as the default because its 256-bit hash is a good compromise between speed and security: the chance of a collision (two different files having the same hash value) is vastly smaller than for MD5, which is another common choice for hashing.
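To request one of the other algorithms, you just pass its name to the -Algorithm parameter. A quick example on the same test file (output omitted):
PS C:\> Get-FileHash .\bigfile.txt -Algorithm MD5 | fl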


Now let's have a look at the help for Get-FileHash as it appears in the R2 Preview:
Synopsis
    
    Get-FileHash [-FilePath] <String[]> [[-Algorithm] <String>] [<CommonParameters>]
    
Syntax
    Get-FileHash [-FilePath] <String[]> [[-Algorithm] <String>] [<CommonParameters>]

Parameters
    -Algorithm <String>

        Required?                    false
        Position?                    1
        Default value                
        Accept pipeline input?       false
        Accept wildcard characters?  

    -FilePath <String[]>

        Required?                    true
        Position?                    0
        Default value                
        Accept pipeline input?       false
        Accept wildcard characters?  

Inputs
    None
    
Outputs
    System.Object
It's a pretty basic help file. Anyhow, it gives us the syntax to use, and that's all we need. The biggest disappointment for me here was to see that the FilePath parameter does not accept pipeline input, so it is not possible, for instance, to execute 'dir c:\ | Get-FileHash': the cmdlet will stop and ask for a file to process:
PS C:\> dir c:\ | Get-FileHash

cmdlet Get-FileHash at command pipeline position 2
Supply values for the following parameters:
FilePath:
I suppose this is just a beta version of the cmdlet and that the final release will accept pipeline input; at least I hope so.
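In the meantime, a simple workaround is to let ForEach-Object pass each file name explicitly (a quick sketch; the -File switch keeps folders out of the pipeline):
PS C:\> dir c:\ -File | % { Get-FileHash $_.FullName }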

Let's move now to some practical aspects of this cmdlet. There are quite a few uses for hashes; the most obvious is to verify the integrity of a file you have downloaded, to check that it hasn't been altered in transit.
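For example, if the publisher of a download gives you the expected hash, the check is just a string comparison. A minimal sketch, where download.zip and the expected value are hypothetical (note the case-sensitive -ceq, since Base64 strings are case-sensitive):
$expected = 'FshDknouCGcNbDCA0AtYEtRC0hmPblSgP2L5PRliD4c='   # value published by the vendor (hypothetical)
$actual = (Get-FileHash .\download.zip).Hash                 # hash of the local copy
if ($actual -ceq $expected) { 'File is intact' } else { 'File has been altered!' }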

In my case, I have decided to use Get-FileHash in a script which compares two directory trees, finds the files that exist in both locations but have different hashes (and whose content therefore differs), then shows the location of the most recent copy, so that it can be copied to the other location.

Such a script can be useful for instance to compare the contents of two DFS-R nodes to check for mismatches.

Here's the script (note the first line: this script won't work on PowerShell v3 or earlier!):
#Requires -Version 4

$referencefolder = '\\dfsserver1\dfs\data'
$differencefolder = '\\dfsserver2\dfs\data'
$hashalgorithm = 'md5'

# Finding all the files in the reference folder and adding Hash and RelativePath properties
$referencefiles = gci $referencefolder -Recurse -File |
    select Length, FullName, LastWriteTime,
        @{Label='Hash'; Expression={(Get-FileHash -Algorithm $hashalgorithm $psitem.FullName).Hash}},
        @{Label='RelativePath'; Expression={$_.FullName -replace [regex]::Escape($referencefolder)}}

# Finding all the files in the difference folder and adding Hash and RelativePath properties
$differencefiles = gci $differencefolder -Recurse -File |
    select Length, FullName, LastWriteTime,
        @{Label='Hash'; Expression={(Get-FileHash -Algorithm $hashalgorithm $psitem.FullName).Hash}},
        @{Label='RelativePath'; Expression={$_.FullName -replace [regex]::Escape($differencefolder)}}

# Finding all files with different hashes between source and dest...
$diffhash = compare $referencefiles $differencefiles -property hash -PassThru

# Keeping only the files which appear in both trees (same relative path, different hash)
$samefiledifferenthash = $diffhash | Group-Object -Property relativepath | ? count -gt 1

# Filtering out older versions
$mostrecent = $samefiledifferenthash | % { $_ | select -ExpandProperty group | sort lastwritetime -desc | select -first 1 }

# Showing recent items which should be copied to other destinations
"Recent items which should be copied to other destinations:"
$mostrecent | ft -Property fullname,lastwritetime -AutoSize
Now let's run it:
PS C:\> C:\hash_compare.ps1

Recent items which should be copied to other destinations:

FullName                         LastWriteTime      
--------                         -------------      
\\dfsserver1...\newfile.txt      05/07/2013 15:30:12
\\dfsserver2...\modifiedfile.txt 03/07/2013 15:30:47
The output of the script allows for easy identification of the files to copy: in this example, newfile.txt should be copied to dfsserver2 and modifiedfile.txt should be copied to dfsserver1. As I said, I would use this script to make sure that two directory trees hold the same version of each file.

Be aware, though, that files appearing on only one host are filtered out: I am not interested in files that could have been deleted on one side and not on the other, nor in files which may simply not have been synchronised yet.
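If you do care about those one-sided files, the $referencefiles and $differencefiles collections from the script above can be reused. A possible sketch, this time comparing on the RelativePath property:
# Files whose relative path exists in only one of the two trees
# ('<=' means only in the reference tree, '=>' only in the difference tree)
compare $referencefiles $differencefiles -Property relativepath -PassThru |
    ft -Property fullname,sideindicator -AutoSize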

Let's now dig a bit into these hashing algorithms. As I said before, the possible values for Get-FileHash are SHA1, SHA256, SHA384, SHA512, MACTripleDES, MD5 and RIPEMD160.

Among them, SHA256 and MD5 are two of the most common. They take our input data, in the case of Get-FileHash the contents of a file, and output a 256-bit or 128-bit number respectively. This number is called a checksum.
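You can verify this fixed output size yourself: as the trailing '=' in the outputs above suggests, the Hash property in this preview is Base64-encoded, so decoding it returns the raw bytes of the checksum. A quick check, reusing bigfile.txt from before:
[Convert]::FromBase64String((Get-FileHash .\bigfile.txt).Hash).Length                 # 32 bytes = 256 bits
[Convert]::FromBase64String((Get-FileHash .\bigfile.txt -Algorithm MD5).Hash).Length  # 16 bytes = 128 bits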

In theory, SHA256 should take more time to calculate than MD5, since it is a more complex algorithm producing a bigger checksum. I wanted to check this, so I wrote the following script to determine which algorithm is the fastest at generating the hash of two files:
#Requires -Version 4

$container = @()

# Hash two sample files with each algorithm, 3000 times, and collect the timings
1..3000 | % {
    "SHA1","SHA256","SHA384","SHA512","MACTripleDES","MD5","RIPEMD160" | % {
        $container += [PSCustomObject]@{
            Algorithm    = $_
            Milliseconds = (Measure-Command -Expression {
                    Get-FileHash C:\bootmgr -Algorithm $_
                    Get-FileHash C:\test_hash01.ps1 -Algorithm $_
                }).TotalMilliseconds
        }
    }
}

# Average the timings per algorithm, from fastest to slowest
$container | group Algorithm | % {
    [PSCustomObject]@{
        Algorithm = $_.Name
        Average   = ($_.Group | select -ExpandProperty Milliseconds | Measure-Object -Average).Average
    }
} | sort Average
Here's the result of the execution of this script on a virtual machine running Windows 2012 R2 Preview:
Algorithm        Average
---------        -------
MD5              15,8038968333333
SHA1             20,5171854
SHA384           23,6268977666666
SHA512           25,3000916666667
RIPEMD160        25,3952806
SHA256           29,2931979333333
MACTripleDES     69,5396810666666
This confirms that MD5 is by far the fastest algorithm, with an average processing time of just under 16 milliseconds. That's probably the reason why @makovec tweeted a few days ago about changing the default algorithm to MD5:

"My new addition to #PSDefaultParameterValues on #psv4: $PSDefaultParameterValues.Add('Get-FileHash:Algorithm', 'MD5') #PowerShell"

Needless to say I did the same in my profile.
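If you want to do the same, the setting just needs to go into your profile so it survives across sessions. A minimal sketch (the index-assignment syntax is safer than .Add(), because it doesn't throw if the key already exists):
# Add to your profile (open it with: notepad $PROFILE)
$PSDefaultParameterValues['Get-FileHash:Algorithm'] = 'MD5'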

That's all for this new cmdlet. I hope you found this description interesting; if so, do not hesitate to share it. Please note also that the two scripts I wrote are just drafts meant to show the ins and outs of Get-FileHash: I am sure they can be improved, and I am open to comments.

Stay tuned for more on PowerShell v4!
