Sunday, January 24, 2010

Remove-DuplicateFile the ISE Way

Over the last few days I have been checking some PowerShell scripts for finding and deleting duplicate files.

First we need to define which files count as equal.
For most big files the criterion is simple: they are equal when they are exact copies of each other.
The second group contains text files you have edited. Here you often want to ignore some insignificant differences like trailing blanks etc.; take a look at the options of the old fc command.
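
For example, fc can ignore white-space differences. A minimal sketch (the file names are made up; from within PowerShell you have to write fc.exe, since fc is an alias for Format-Custom):

fc.exe /W old-notes.txt new-notes.txt

The /W switch compresses tabs and spaces for the comparison, and /C additionally ignores case.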

Let us focus today on the first approach. The best solution I found on the web is from Jason Stangroome. From him I copied the idea of using a Get-MD5 function which closes its streams.

Without it you have some fun googling for 'The process cannot access the file because it is being used by another process'. Yes, there seem to be ways to close files even in PowerShell, but the PowerShell documentation is somewhat biased here.
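
The point is simply to make sure the stream gets closed no matter what happens. Besides the trap used below, PowerShell V2 also offers try/finally; here is a minimal sketch (the file name is made up):

$stream = [System.IO.File]::OpenRead('C:\temp\sample.dat')
try
{
    $md5 = [System.Security.Cryptography.MD5]::Create()
    $hash = $md5.ComputeHash($stream)
}
finally
{
    # runs even when ComputeHash throws, so the file never stays locked
    $stream.Close()
}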

Using this Get-MD5 function, there is no problem invoking Remove-Item on the file you found, if you decide to delete it.

I would prefer to use the Delete method of [System.IO.FileInfo], but I get 'Exception calling "Delete" with "0" argument(s): "Access to the path 'xxx.ps1' is denied."'
On the other hand

Remove-Item $($file.Fullname) -force

works fine.
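
My assumption (I have not chased it down in every case) is the read-only attribute: Remove-Item -force also removes read-only files, while FileInfo.Delete refuses them with exactly this 'Access ... denied' exception. If you want to stay with the .NET method, you would have to clear the flag yourself, roughly like this:

if ($file.IsReadOnly)
{
    # Remove-Item -force does this implicitly
    $file.IsReadOnly = $false
}
$file.Delete()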

I'm not following Jason in sorting the files according to their length, but keep to a simpler design of comparing the files in the order they are output by Get-ChildItem.
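
For comparison, here is roughly what the length-first idea could look like; this is only my sketch, not Jason's code, and it reuses the $path and $extension parameters from the function below. Since two files of different length cannot be copies of each other, only groups with more than one file of the same length need to be hashed at all:

get-childitem -path $path $extension -recurse |
    where-object { ! $_.PSIsContainer } |
    group-object -Property Length |
    where-object { $_.Count -gt 1 } |
    foreach-object {
        # hash only files that share their length; hash groups with Count -gt 1 are true duplicates
        $_.Group | group-object -Property { Get-MD5 $_ } | where-object { $_.Count -gt 1 }
    }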

Instead of using -WhatIf I chose a different way: I just output the commands and some comments to a new ISE file.

There I can check whether I want to perform the delete or not, and at the same time I have a fine log of the files I deleted.

The following code assumes you use the PowerShell V2 ISE:

function Get-MD5([System.IO.FileInfo] $file = $(throw 'Usage: Get-MD5 [System.IO.FileInfo]'))
{
    # This Get-MD5 function sourced from:
    # http://blogs.msdn.com/powershell/archive/2006/04/25/583225.aspx
    $stream = $null;
    $cryptoServiceProvider = [System.Security.Cryptography.MD5CryptoServiceProvider];
    $hashAlgorithm = new-object $cryptoServiceProvider
    $stream = $file.OpenRead();
    $hashByteArray = $hashAlgorithm.ComputeHash($stream);
    $stream.Close();

    ## We have to be sure that we close the file stream if any exceptions are thrown.
    trap
    {
        if ($stream -ne $null) { $stream.Close(); }
        break;
    }

    return [string]$hashByteArray;
}
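
Called on its own, the function is used like this (the file name is made up); the final cast to [string] turns the byte array into one space-separated string, which is what later lets it serve as a hashtable key:

$hash = Get-MD5 (Get-Item 'C:\temp\report.pdf')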

function New-IseFile ($path)
{
    # Add a new editor tab to the current PowerShell tab, save it under $path
    # with the default encoding and return the ISEFile object
    $count = $psise.CurrentPowerShellTab.Files.count
    $null = $psIse.CurrentPowerShellTab.Files.Add()
    $Newfile = $psIse.CurrentPowerShellTab.Files[$count]
    $NewFile.SaveAs($path)
    $NewFile.Save([Text.Encoding]::default)
    $Newfile
}

function Write-IseFile($file, $msg)
{
    # Append $msg as a new line at the end of the given ISE editor file
    $Editor = $file.Editor
    $Editor.SetCaretPosition($Editor.LineCount, 1)
    $Editor.InsertText(($msg + "`r`n"))
}


function Remove-DuplicateFile
{
    [CmdletBinding()]
    param (
        $path = ".\",
        $extension = "*.*",
        [switch]$delete
    )

    $Newfile = New-IseFile "$(get-location)\Delete-DuplicateFiles_$(get-date -f "yyyy-MM-dd-HH").ps1"

    $hashtable = new-object system.collections.hashtable

    # $global:filesToDelete = @()
    $global:totalLength = 0

    get-childitem -path $path $extension -recurse | where-object { ! $_.PSIsContainer } |
    % {
        $file = $_
        $hashvalue = Get-MD5 $file
        $length = $_.Length
        # The Add call below throws when the hash is already in the table,
        # i.e. when $file is a duplicate; the trap handles exactly that case.
        trap {
            $global:totalLength += $length
            $msg = @"

# current: {0} bytes   total: {1,8:f3} MB
# Remove-Item '$($hashtable[$hashvalue])' -force
Remove-Item '$($file.Fullname)' -force
"@ -f $length, ($global:totalLength / 1MB)
            $msg
            Write-IseFile $Newfile $msg
            # $global:filesToDelete += $file.Fullname
            if ($delete)
            {
                Remove-Item $($file.Fullname) -force
                #$file.delete()
            }
            continue
        }
        $hashTable.Add($hashvalue, $file.FullName)
    }

    # Write-Host "`r`nFiles to delete`r`n"
    # foreach ($f in $filesToDelete) {
    #     Write-Host "Remove-Item '$f' -force"
    # }
    $NewFile.Save()
}



Now I call it with something like

Remove-DuplicateFile D:\myFiles,D:\mybackup

and get the new file in the ISE editor.

Attention: when using the -delete parameter, the deletions are executed immediately.
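
For example, to restrict the run to one extension and delete the duplicates right away, the call could look like this (the paths are made up):

Remove-DuplicateFile -path D:\myFiles,D:\mybackup -extension *.jpg -delete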

1 comment:

  1. Hi,

    I'm glad you found my post useful. The reason I group by length first is because checking length is cheap and two files with different lengths cannot be duplicates of each other so I don't waste time calculating their MD5 hash.

    Regards,

    Jason
