Saturday, April 30, 2016

New PowerShell cmdlets in Windows 2016 TP5

I have just installed the latest technical preview of Windows 2016 and couldn't stop myself from having a look at the list of new PowerShell cmdlets.

Here's how I do it (I am logged on Win2016tp5):
Get-Command | Export-Clixml c:\temp\2016tp5.xml
icm -ComputerName win2016tp4 {Get-Command | Export-Clixml C:\temp\2016tp4.xml}
$newcmdlet = diff (Import-Clixml C:\temp\2016tp5.xml) (Import-Clixml \\win2016tp4\c$\temp\2016tp4.xml) -Property Name
Here's what I get:
$newcmdlet | Sort Name
Name                               SideIndicator
----                               -------------
Add-LocalGroupMember               <=
Add-NetEventVFPProvider            <=
Add-NetEventVmSwitchProvider       <=
Backup-AuditPolicy                 <=
Backup-SecurityPolicy              <=
Debug-VirtualMachineQueueOperation <=
Disable-LocalUser                  <=
Disable-StorageMaintenanceMode     <=
Disable-TlsEccCurve                <=
Enable-LocalUser                   <=
Enable-StorageMaintenanceMode      <=
Enable-TlsEccCurve                 <=
Find-Command                       <=
Find-RoleCapability                <=
Get-CustomerRoute                  <=
Get-LocalGroup                     <=
Get-LocalGroupMember               <=
Get-LocalUser                      <=
Get-NetEventVFPProvider            <=
Get-NetEventVmSwitchProvider       <=
Get-PACAMapping                    <=
Get-ProviderAddress                <=
Get-TlsEccCurve                    <=
Invoke-AppxPackageCommand          <=
New-LocalGroup                     <=
New-LocalUser                      <=
Remove-LocalGroup                  <=
Remove-LocalGroupMember            <=
Remove-LocalUser                   <=
Remove-NetEventVFPProvider         <=
Remove-NetEventVmSwitchProvider    <=
Remove-RDDatabaseConnectionString  <=
Rename-LocalGroup                  <=
Rename-LocalUser                   <=
Restore-AuditPolicy                <=
Restore-SecurityPolicy             <=
Set-LocalGroup                     <=
Set-LocalUser                      <=
Set-NetEventVFPProvider            <=
Set-NetEventVmSwitchProvider       <=
Test-EncapOverheadValue            <=
Test-LogicalNetworkConnection      <=
Test-VirtualNetworkConnection      <=
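A side note on reading this output: diff (Compare-Object) reports differences in both directions, and '<=' marks the names found only in the first (reference) list, which here is TP5. As a minimal sketch, assuming two small hand-made lists in place of the real exports (the $tp5 and $tp4 variables and their contents are made up for illustration), this is how the new names can be isolated:

```powershell
# Hypothetical stand-ins for the two Get-Command exports
$tp5 = 'Get-LocalUser', 'Get-Process', 'New-LocalUser' |
    ForEach-Object { [pscustomobject]@{ Name = $_ } }
$tp4 = 'Get-Process', 'Get-OldThing' |
    ForEach-Object { [pscustomobject]@{ Name = $_ } }

# Keep only the names that exist in the reference list (TP5) and not in the other one
$onlyInTp5 = Compare-Object -ReferenceObject $tp5 -DifferenceObject $tp4 -Property Name |
    Where-Object SideIndicator -eq '<=' |
    Sort-Object Name |
    Select-Object -ExpandProperty Name

$onlyInTp5
```

Names that exist only in the second list would show up with '=>' instead, so filtering on SideIndicator is what makes the result read as "new in TP5".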
As you can see, there are a bunch of new cmdlets for local account management:
Get-Command | ? Source -eq 'Microsoft.PowerShell.LocalAccounts'

CommandType Name                    Version Source
----------- ----                    ------- ------
Cmdlet      Add-LocalGroupMember    1.0.0.0 Microsoft.PowerShell.LocalAccounts
Cmdlet      Disable-LocalUser       1.0.0.0 Microsoft.PowerShell.LocalAccounts
Cmdlet      Enable-LocalUser        1.0.0.0 Microsoft.PowerShell.LocalAccounts
Cmdlet      Get-LocalGroup          1.0.0.0 Microsoft.PowerShell.LocalAccounts
Cmdlet      Get-LocalGroupMember    1.0.0.0 Microsoft.PowerShell.LocalAccounts
Cmdlet      Get-LocalUser           1.0.0.0 Microsoft.PowerShell.LocalAccounts
Cmdlet      New-LocalGroup          1.0.0.0 Microsoft.PowerShell.LocalAccounts
Cmdlet      New-LocalUser           1.0.0.0 Microsoft.PowerShell.LocalAccounts
Cmdlet      Remove-LocalGroup       1.0.0.0 Microsoft.PowerShell.LocalAccounts
Cmdlet      Remove-LocalGroupMember 1.0.0.0 Microsoft.PowerShell.LocalAccounts
Cmdlet      Remove-LocalUser        1.0.0.0 Microsoft.PowerShell.LocalAccounts
Cmdlet      Rename-LocalGroup       1.0.0.0 Microsoft.PowerShell.LocalAccounts
Cmdlet      Rename-LocalUser        1.0.0.0 Microsoft.PowerShell.LocalAccounts
Cmdlet      Set-LocalGroup          1.0.0.0 Microsoft.PowerShell.LocalAccounts
Cmdlet      Set-LocalUser           1.0.0.0 Microsoft.PowerShell.LocalAccounts
Though these cmdlets are pretty self-explanatory (remember the concept of discoverability?), they add nicely to the current set of cmdlets and make your servers even more manageable.

PowerShell, an ever-evolving language!

Wednesday, April 27, 2016

First steps with Microsoft Desired State Configuration

It's been a long time since Microsoft paved the way for Desired State Configuration, and this technology is spreading pretty fast among system admins. Few people have not heard of it: whether you are a PowerShell expert or just someone taking your first steps with it, DSC has been one of the best features in the language since Windows 2012 R2 and PowerShell 4.0, and a lot of us are already implementing it.

But this is only partially true. There is a clear difference in the speed of adoption between the US, which always moves pretty fast toward everything new (fellow MVP Mike F. Robbins has almost half of his audience at the PowerShell & DevOps Summit using DSC!), and good old Europe (where I live and work): here most of the system admins I spend a lot of time with are still a bit lost when it comes to using a shell language to administer their systems (luckily with exceptions, such as fellow MVP Fabien Dibot, who is doing a great job as an evangelist for everything Cloud in France).

For sure, the rise of Windows PowerShell has been incredibly fast, with five major versions in nine years, and most of us weren't ready for the change, but, hey, the change came, so why keep hesitating and risk being left behind for good?

So now the question is how you get started with DSC. Well, the answer is not so easy. There are for sure a lot of resources out there, but, hey, it's complicated to find a good starting point. When you start looking at it, a lot of terms gravitate around DSC and make it harder to understand: you have Pester, GitHub, modules and resources; you have the PowerShell Gallery and a lot of stuff starting with x's; and you have Pull and Push configurations and DSC resources to configure DSC itself. And finally you have DSC for Azure. Feeling left behind can definitely happen here.

So, thinking of all DSC newbies, I decided to write a basic blog post to introduce DSC in a simple way. I won't be talking about Push and Pull models, nor about Partial Configurations or Cross-Computer synchronization. And I won't try to explain to you the difference between GPOs, SCCM and DSC (fellow MVP Stephen Owen does a good job of explaining it all in his blog post 'DSC vs. GPO vs. SCCM, the case for each.').

Everything starts with a keyword: Configuration.
Get-Command Configuration

CommandType     Name                           Version    Source
-----------     ----                           -------    ------
Function        Configuration                  1.1        PSDesiredStateConfiguration
Configurations are special types of functions which, at their simplest, are composed of a main block:
Configuration NameOfTheConfiguration {
   }
Inside this block comes one Node block for each target computer to configure:
Configuration NameOfTheConfiguration {
   Node 'SRV1' {
       }
   Node 'SRV2' {
       }
   }
Each Node block contains one or more Resource blocks:
Configuration NameOfTheConfiguration {
   Node 'SRV1' {
      WindowsFeature FeatureName {
         Ensure = 'Present'
         Name = 'Name'
         }
       }
   }
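To make the skeleton above concrete, here is a minimal sketch of a complete configuration. It assumes Windows PowerShell 4.0 or later on a Windows machine; the configuration name DemoConfig, the folder C:\Temp\DscDemo and the output path are placeholders I made up for illustration. It uses the built-in File resource and then calls the configuration like a function, which compiles it into a MOF file:

```powershell
# DemoConfig and all the paths below are hypothetical placeholders
Configuration DemoConfig {

    # Explicitly load the built-in resources (File lives in this module)
    Import-DscResource -ModuleName PSDesiredStateConfiguration

    Node 'localhost' {
        # Declare that a folder must exist on the node
        File DemoFolder {
            Ensure          = 'Present'
            Type            = 'Directory'
            DestinationPath = 'C:\Temp\DscDemo'
        }
    }
}

# Calling the configuration by name compiles it: one .mof file per Node block
DemoConfig -OutputPath 'C:\Temp\DemoConfig'
```

Nothing is changed on the machine at this stage: the call only produces C:\Temp\DemoConfig\localhost.mof, which describes the desired state.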
The list of resources you can declare is easily obtained using the Get-DscResource cmdlet:
Get-DscResource -Module PSDesiredStateConfiguration | select name
Name
----
File
Archive
Environment
Group
GroupSet
Log
Package
ProcessSet
Registry
Script
Service
ServiceSet
User
WaitForAll
WaitForAny
WaitForSome
WindowsFeature
WindowsFeatureSet
WindowsOptionalFeature
WindowsOptionalFeatureSet
WindowsProcess
This cmdlet is pretty powerful and is not limited to showing you the list of resources: it can also be used to get the syntax of a specific resource:
Get-DscResource -Name Service -Syntax
Service [String] #ResourceName
{
    Name = [string]
    [BuiltInAccount = [string]{ LocalService | LocalSystem | NetworkService }]
    [Credential = [PSCredential]]
    [Dependencies = [string[]]]
    [DependsOn = [string[]]]
    [Description = [string]]
    [DisplayName = [string]]
    [Ensure = [string]{ Absent | Present }]
    [Path = [string]]
    [PsDscRunAsCredential = [PSCredential]]
    [StartupType = [string]{ Automatic | Disabled | Manual }]
    [State = [string]{ Running | Stopped }]
}
Here you go, you have the basics: you know that you can write a script which contains a Configuration function which declaratively configures nodes with desired well-known resources. That's all there is to know about it to start with DSC and see if you can get any benefit from it.

Now you are NOT supposed to grab your keyboard and start writing your resources: most of what is used today on servers is already available for you out there, on GitHub and on the PowerShell Gallery: it's been developed by Microsoft, it's been improved by the community, and even if it is experimental (remember the x's?), you can already take advantage of it. But how?

That's the second step in learning DSC and there is a cmdlet for it: Install-Module.

Install-Module (alias inmo) is a cmdlet available in PowerShell 5.0 which does all the work for you: once you know you are interested in a module from the online Gallery, just ask this cmdlet to fetch it for you and store it under the %systemdrive%\Program Files\WindowsPowerShell\Modules folder, hence making it available to all the local users.

Quick tip: to get a list of all your module paths, just query the right variable:
$env:PSModulePath -split ';'
C:\Users\Carlo\Documents\WindowsPowerShell\Modules
C:\WINDOWS\system32\WindowsPowerShell\v1.0\Modules\
C:\Program Files\WindowsPowerShell\Modules\
It's interesting to note that Install-Module accepts pipeline input, so if you do not know the exact name of a module, just use Find-Module:
Find-Module -Name "xSys*Sec*" | Format-List

Name                       : xSystemSecurity
Version                    : 1.1.0.0
Type                       : Module
Description                : Handles Windows related security settings like UAC and IE ESC.
Author                     : Arun Chandrasekhar
CompanyName                : PowerShellTeam
Copyright                  : (c) 2014 Microsoft Corporation. All rights reserved.
PublishedDate              : 11/09/2015 23:28:31
LicenseUri                 : https://github.com/PowerShell/xSystemSecurity/blob/master/LICEN
ProjectUri                 : https://github.com/PowerShell/xSystemSecurity
IconUri                    :
Tags                       : {DesiredStateConfiguration, DSC, DSCResourceKit, PSModule}
Includes                   : {Function, DscResource, Cmdlet, Workflow...}
PowerShellGetFormatVersion :
ReleaseNotes               :
Dependencies               : {}
RepositorySourceLocation   : https://www.powershellgallery.com/api/v2/
Repository                 : PSGallery
PackageManagementProvider  : NuGet
AdditionalMetadata         : {versionDownloadCount, summary, ItemType, copyright...}
Then pass the output down to Install-Module:
Find-Module -Name "xSys*Sec*" | Install-Module -Verbose

VERBOSE: The installation scope is specified to be 'AllUsers'.
VERBOSE: The specified module will be installed in 'C:\Program Files\WindowsPowerShell\Modules'.
VERBOSE: The specified Location is 'NuGet' and PackageManagementProvider is 'NuGet'.
VERBOSE: Downloading module 'xSystemSecurity' with version '1.1.0.0' from the repository
'https://www.powershellgallery.com/api/v2/'.
VERBOSE: Searching repository 'https://www.powershellgallery.com/api/v2/FindPackagesById()?id='xSystemSecurity'' for
''.
...
VERBOSE: Downloading 'https://www.powershellgallery.com/api/v2/package/xSystemSecurity/1.1.0'.
VERBOSE: Completed downloading 'https://www.powershellgallery.com/api/v2/package/xSystemSecurity/1.1.0'.
VERBOSE: Completed downloading 'xSystemSecurity'.
VERBOSE: InstallPackageLocal' - name='xSystemSecurity',
version='1.1.0.0',destination='C:\Users\Carlo\AppData\Local\Temp\820819460'
VERBOSE: Module 'xSystemSecurity' was installed successfully.
Concerning these steps, I have seen system administrators trying to manually download DSC resources from GitHub, unzipping them to C:\Program Files\WindowsPowerShell\Modules\ but being unable to list their content with Get-DscResource:
Get-DscResource xSystemSecurity-dev

CheckResourceFound : The term 'xSystemSecurity-dev' is not recognized as the name of a Resource.
At C:\windows\system32\windowspowershell\v1.0\Modules\PSDesiredStateConfiguration\PSDesiredStateConfiguration.psm1:3983 char:13
+             CheckResourceFound $Name $Resources
+             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [Write-Error], WriteErrorException
    + FullyQualifiedErrorId : Microsoft.PowerShell.Commands.WriteErrorException,CheckResourceFound
It's not well documented, but when you download the zip file for a module, the zipped folder can have a '-dev' suffix: in this case you have to remove it, otherwise Get-DscResource won't be able to discover the resources.
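The renaming itself can be scripted. Here is a minimal sketch that works in a sandbox folder under the temp directory instead of the real C:\Program Files\WindowsPowerShell\Modules folder (the $modulesRoot variable and the DscDevSuffixDemo folder name are my own placeholders):

```powershell
# Hypothetical sandbox standing in for C:\Program Files\WindowsPowerShell\Modules
$modulesRoot = Join-Path ([IO.Path]::GetTempPath()) 'DscDevSuffixDemo'
if (Test-Path $modulesRoot) { Remove-Item $modulesRoot -Recurse -Force }

# Simulate a module folder as it comes out of a GitHub zip
New-Item -ItemType Directory -Path (Join-Path $modulesRoot 'xSystemSecurity-dev') -Force | Out-Null

# Strip the '-dev' suffix from every module folder that carries it
Get-ChildItem -Path $modulesRoot -Directory -Filter '*-dev' | ForEach-Object {
    Rename-Item -Path $_.FullName -NewName ($_.Name -replace '-dev$', '')
}

(Get-ChildItem -Path $modulesRoot -Directory).Name
```

Pointing $modulesRoot at the real modules folder requires an elevated session, and the rename fails if a folder with the clean name already exists.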

That's all for today. Stay tuned for more DSC. Do not hesitate to share!

Monday, April 4, 2016

Working with Unicode scripts, blocks and categories in PowerShell

In March 2016 I was honored to be the author of the monthly scripting competition at powershell.org. For the contest, I came up with a scenario where the system administrator was tasked with using PowerShell to check a given path and identify all the files whose names had letters (not symbols nor numbers) in the Latin-1 Supplement character block.




This scenario came in two versions: one for beginners, where competitors were allowed to write a oneliner, and one for experts, where I expected people to write a tool (in the form of an advanced function) to do the job.

In both cases I expected people to focus on understanding how regular expression engines use the Unicode character set and to use the best possible syntax to solve the puzzle. That's why I explicitly asked competitors to work with the Latin-1 Supplement character block. That was the key clue that should have pushed people to learn that Unicode is such a large character set that it has been split up into categories: using these categories in your regular expressions makes them more robust.

1 - OF IMPRACTICAL SOLUTIONS

Let's start by looking at some sample answers we got, which are not exactly what I expected:

where {$_.Name -match '[\u00C0-\u00D6]' -or $_.Name -match '[\u00D8-\u00F6]' -or $_.Name -match '[\u00F8-\u00FF]'}

Where {$_.Name -match "[\u00C0-\u00FF]"}

Where {$_.Name -match '[\u00C0-\u00FF]' -and $_.Name -notmatch '[\u00D7]|[\u00F7]'}

Where-Object{[int[]][char[]]$_.name -gt 192}

.Where({$_.Name -match "[\u00C0-\u00FF]"

if (($LetterNumber -in 192..214) -or ($LetterNumber -in 216..246) -or ($LetterNumber -in 248..255))

where name -Match '[\u0080-\u00ff]'

$_.Name -match '[\u00C0-\u00FF -[\u00D7\u00F7]]'

$_.Name -match '[\u0083\u008A\u008C\u008E\u009A\u009C\u009E\u009F\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]'

if ($char -notmatch '[a-z]' -and [Globalization.CharUnicodeInfo]::GetUnicodeCategory($char) -match 'caseLetter$')

Where-Object -FilterScript { @('©','¼','½','÷') -notcontains $_.Name -and [regex]::IsMatch($_.Name,"\p{IsLatin-1Supplement}")

($_.Name -match '[\p{IsLatin-1Supplement}-[\x80-\xbf\xd7\xf7]]+')

It's fair to say that all these answers are, to varying degrees, impractical to maintain and error-prone for a simple reason: the code points used in the code are not human-readable, and a simple typo can break the code without raising alerts.

There is also a problem of subjectivity, since each competitor decided to use different code points: 00D6 or 00FF or 00F7, just to give a few examples.

So the question is how you decide which code points to use and how you could have taken advantage of Unicode categories in your regular expression to write a solid answer for this puzzle.

To answer this question I will first walk you through the Unicode model and see how it is structured.
You can think of Unicode as a database maintained by an international consortium which stores all the characters in all the existing languages.

New versions are released to reflect major changes, since new writing systems are discovered periodically and new glyphs (which are graphical representations of characters) have to be added: look for instance at those found on the 4,000-year-old Phaistos Disc:


2 - UNICODE VERSIONS, PLANES AND CODE POINTS

The first version of Unicode dates back to 1991 and since then a bunch of versions have followed:
  • 1.0.0 1991 October
  • 2.0.0 1996 July
  • 3.0.0 1999 September
  • 4.0.0 2003 April
  • 5.0.0 2006 July
  • 6.0.0 2010 October
  • 7.0.0 2014 June
  • 8.0.0 2015 June
The latest version is 8.0 and defines a code space of 1,114,112 code points in the range 0 hex to 10FFFF hex:
(10FFFF)base16 = (1114111)base10
Concerning the Windows world, the .NET Framework 4.5 conforms to the Unicode 6.0 standard, which dates from 2010, while previous versions conform to the Unicode 5.0 standard, as you can read here.
Each code point is referred to by writing "U+" followed by its hexadecimal number, where U stands for Unicode. So U+10FFFF is the last code point in the code space.

All these code points are divided into seventeen planes, each with 65,536 elements. The first three planes are named respectively:

  • Basic Multilingual Plane, or BMP
  • Supplementary Multilingual Plane, or SMP
  • Supplementary Ideographic Plane, or SIP
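The arithmetic behind these numbers is easy to check from the prompt. A quick sketch (the variable names are mine):

```powershell
$planes        = 17
$pointsInPlane = 0x10000                     # 65,536 code points per plane
$codeSpace     = $planes * $pointsInPlane    # 1,114,112 code points in total
$lastCodePoint = $codeSpace - 1              # counting starts at zero

$codeSpace
'U+{0:X}' -f $lastCodePoint                  # formats the last code point in hex

# The first plane ends exactly where an unsigned 16-bit integer ends
($pointsInPlane - 1) -eq [uint16]::MaxValue
```

So the seventeen planes of 65,536 code points account for the whole code space, and the last value, written in hex, is the U+10FFFF seen above.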

The BMP, whose extent corresponds exactly to an unsigned 16-bit integer ([uint16]::MaxValue = 65535), covers Latin, African and Asian languages as well as a good number of symbols. So languages like English, Spanish, Italian, Russian, Greek, Ethiopic, Arabic and CJK (which stands for Chinese, Japanese and Korean languages) have code points assigned in this plane.

These code points are expressed as four-digit hexadecimal values, from 0000 to FFFF. So, for instance:

  • U+0058 is the code point for the Latin capital X
  • U+0389 is the code point for the Greek capital letter Eta with tonos
  • U+221A is the code point for the square root symbol
  • U+0040 is the code point for the Commercial At symbol
  • U+9999 is the code point for the Han character meaning 'fragrant, sweet smelling, incense'
  • U+0033 is the code point for the digit three
So the letters, digits and symbols we widely use all have their code point in the Unicode database.

The .NET Framework uses the System.Char structure to represent a Unicode character.

3.1 - HOW TO CONVERT A GLYPH TO A UNICODE CODE POINT

There is a simple way in PowerShell to find the code point of a given glyph.

First you have to take the given character and find its numeric value, using typecasting on the fly:

$char = 'X'

[int][char]$char
This is the equivalent of the ORD function you have in many other languages (Delphi, PHP, Perl, etc).
Then, using the Format operator with the X format string, convert it to hexadecimal:

'{0:X4}' -f [int][char]$char
Since each Unicode code point is referred to with a U+, we just have to add it to our string through concatenation:

'U+{0:X4}' -f [int][char]$char
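If you need this often, the three steps can be wrapped in a small function. This is just a sketch: Get-CodePoint is a name I made up, not a built-in cmdlet, and because it works on a single [char] it only covers the Basic Multilingual Plane:

```powershell
# Get-CodePoint is a hypothetical helper name, not a built-in cmdlet
function Get-CodePoint {
    param(
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
        [char]$Char
    )
    process {
        # Cast the char to its numeric value, then format it as four hex digits
        'U+{0:X4}' -f [int]$Char
    }
}

$codePoints = 'X', '@', '3' | Get-CodePoint
$codePoints
```

For these three characters it returns U+0058, U+0040 and U+0033, the same code points listed earlier for the Latin capital X, the Commercial At and the digit three.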
3.2 - HOW TO CONVERT A UNICODE CODE POINT TO A GLYPH

Now, if you want to get the glyph of a given code point, you have to reverse your code:

First you have to ask PowerShell to call ToInt32 to convert the hex value (base-16) to a decimal:

[int][Convert]::ToInt32('0058', 16)
Then a step is required to cast the decimal to a char:

[Convert]::ToChar([int][Convert]::ToInt32('0058', 16))
So, if we go back to the examples we saw before, we can use a loop to convert all those four-digit hex values of the Basic Multilingual Plane to their corresponding glyphs.

'0058','0389','221A','0040','9999','0033' | % { [Convert]::ToChar([int][Convert]::ToInt32($_, 16)) }
X
Ή
√
@
香
3
Actually, you have a simpler way to get the same result, which relies on the implicit conversion performed by PowerShell when numbers are prefixed by '0x':

0x0058, 0x389, 0x221a, 0x0040, 0x9999, 0x0033 | % { [char]$_ }
X
Ή
√
@
香
3
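The reverse conversion can be wrapped in a function as well. In this sketch, ConvertFrom-CodePoint is a made-up name, and I am assuming the input is either a bare hex value or one written in the 'U+' notation; like the casts above, it is limited to the Basic Multilingual Plane because the result is a single [char]:

```powershell
# ConvertFrom-CodePoint is a hypothetical helper name, not a built-in cmdlet
function ConvertFrom-CodePoint {
    param(
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
        [string]$CodePoint
    )
    process {
        # Drop the optional 'U+' prefix, parse the rest as base-16, cast to char
        $hex = $CodePoint -replace '^U\+', ''
        [char][Convert]::ToInt32($hex, 16)
    }
}

$glyphs = 'U+0058', 'U+0040', '0033' | ConvertFrom-CodePoint
-join $glyphs
```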
3.3 - OF CODE POINTS BEYOND THE BASIC MULTILINGUAL PLANE

At this point it is interesting to know that .NET, like Windows itself, uses UTF-16 as its standard encoding: everything inside the Basic Multilingual Plane fits in a single 16-bit value, and, as we have seen, most living languages have all (or at least most) of their glyphs within the range 0 - 65535.
For characters beyond the first Unicode plane, that is, whose code point is greater than 65535 and hence can't fit in a 16-bit integer (a Word), we can use two encodings: UTF-32 or UTF-16 surrogate pairs.
The latter is a method where a glyph is represented by a first (high) surrogate (16 bits long) code value in the range U+D800 to U+DBFF and a second (low) surrogate (16 bits long as well) code value in the range U+DC00 to U+DFFF. Using this mechanism, UTF-16 can support all 1,114,112 potential Unicode characters (2^16 * 17 Planes).
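To see a surrogate pair with your own eyes, here is a short sketch that takes a code point outside the BMP (U+1D11E, the musical G clef), lets .NET encode it, and then recomputes the two surrogates by hand. The variable names are mine; the formula is the standard UTF-16 one (subtract 0x10000, then split the remaining 20 bits in two halves of 10 bits):

```powershell
$codePoint = 0x1D11E

# .NET returns a string made of two chars: the high and the low surrogate
$pair = [char]::ConvertFromUtf32($codePoint)
$pair.Length
'{0:X4} {1:X4}' -f [int]$pair[0], [int]$pair[1]

# The same values computed by hand
$offset = $codePoint - 0x10000
$high   = 0xD800 + ($offset -shr 10)      # top 10 bits
$low    = 0xDC00 + ($offset -band 0x3FF)  # bottom 10 bits
'{0:X4} {1:X4}' -f $high, $low

# And back again from the pair to the original code point
[char]::ConvertToUtf32($pair, 0) -eq $codePoint
```

Both format lines give D834 DD1E, which fall in the high and low surrogate ranges described above.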


In my tests, however, Windows was not capable of showing non-BMP glyphs, even with a font like Code2001 installed. Let's see this in practice.
In the example below I am outputting the glyph for the commercial at sign (which is in the BMP) starting from its UTF-32 serialization using the ConvertFromUtf32 method:
[char]::ConvertFromUtf32(0x00000040)
@
In this other example I am trying hard to display the glyph for the MUSICAL SYMBOL G CLEF, which was added in Unicode 3.1 and belongs to the Supplementary Multilingual Plane, but I am only able to get a square box (which is used for all characters for which the font does not have a glyph):
[char]::ConvertFromUtf32(0x0001D11E)
𝄞
Now that you are confident with code points, it is time to step up your game and get an understanding of some Unicode properties which are useful to solve our puzzle: General Category, Script and Block.

4.1 - UNICODE PROPERTIES: GENERAL CATEGORY

Each code point is kind of an object that has a property named General Category. The major categories are: Letter, Mark, Separator, Symbol, Number, Punctuation, and Other.

Within these seven categories, there are the following subdivisions:

  • {L} or {Letter}
  • {Ll} or {Lowercase_Letter}
  • {Lu} or {Uppercase_Letter}
  • {Lt} or {Titlecase_Letter}
  • {L&} or {Cased_Letter}
  • {Lm} or {Modifier_Letter}
  • {Lo} or {Other_Letter}
  • {M} or {Mark}
  • {Mn} or {Non_Spacing_Mark}
  • {Mc} or {Spacing_Combining_Mark}
  • {Me} or {Enclosing_Mark}
  • {Z} or {Separator}
  • {Zs} or {Space_Separator}
  • {Zl} or {Line_Separator}
  • {Zp} or {Paragraph_Separator}
  • {S} or {Symbol}
  • {Sm} or {Math_Symbol}
  • {Sc} or {Currency_Symbol}
  • {Sk} or {Modifier_Symbol}
  • {So} or {Other_Symbol}
  • {N} or {Number}
  • {Nd} or {Decimal_Digit_Number}
  • {Nl} or {Letter_Number}
  • {No} or {Other_Number}
  • {P} or {Punctuation}
  • {Pd} or {Dash_Punctuation}
  • {Ps} or {Open_Punctuation}
  • {Pe} or {Close_Punctuation}
  • {Pi} or {Initial_Punctuation}
  • {Pf} or {Final_Punctuation}
  • {Pc} or {Connector_Punctuation}
  • {Po} or {Other_Punctuation}
  • {C} or {Other}
  • {Cc} or {Control}
  • {Cf} or {Format}
  • {Co} or {Private_Use}
  • {Cs} or {Surrogate}
  • {Cn} or {Unassigned}

The Char.GetUnicodeCategory and CharUnicodeInfo.GetUnicodeCategory methods are used to return the General Category property of a char.

'X','Ή','√','@','香','3' | % { [System.Globalization.CharUnicodeInfo]::GetUnicodeCategory($_) }
UppercaseLetter
UppercaseLetter
MathSymbol
OtherPunctuation
OtherLetter
DecimalDigitNumber
As you can see, Unicode also brings interesting possibilities. Once you know that each Unicode character belongs to a certain category, you can try to match a single character to a category with \p (in lowercase) in your regular expression:

#A is a letter
'A' -match "(\p{L})"
True

#3 is a digit
3 -match "(\p{N})"
True
You can also match a single character not belonging to a category with \P (uppercase):

#X is not a digit
'X' -match "(\P{N})"
True

#3 is not a letter
3 -match "(\P{L})"
True
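The same \p syntax works on whole strings, not only on single characters. As a small sketch (the sample text and variable names are mine), here is how to count the characters of a string by category with [regex]::Matches:

```powershell
$text = 'PowerShell 5.0'

$letters   = [regex]::Matches($text, '\p{L}').Count              # any kind of letter
$uppercase = [regex]::Matches($text, '\p{Lu}').Count             # uppercase letters only
$digits    = [regex]::Matches($text, '\p{Nd}').Count             # decimal digits
$others    = [regex]::Matches($text, '[^\p{L}\p{N}]').Count      # neither letter nor number

'Letters: {0}, uppercase: {1}, digits: {2}, others: {3}' -f $letters, $uppercase, $digits, $others
```

The string has ten letters (two of them uppercase), two digits and two other characters (the space and the dot). The last pattern shows that categories can be combined inside a character class, which is handy when you want "everything except letters and numbers".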
4.2 - UNICODE PROPERTIES: SCRIPT AND BLOCK

Other useful properties of a character are Script and Block: each character belongs to a Script and to a Block.

A Script is a group of code points defining a given human writing system, so we can generally think of a script as a language. Though many scripts (like Cherokee, Lao or Thai) correspond to a single natural language, others (like Latin) are common to multiple languages (Italian, French, English...). Code points in a Script are scattered and don't form a contiguous range.

The list of the existing Scripts is kept by the Unicode Consortium in the Unicode Character Database (UCD), which consists of a number of textual data files listing Unicode character properties and related data.

The UCD file for Scripts is here.
A Block, on the other hand, is a contiguous range of code points.

The UCD file for Blocks is here.

5.1 - HOW TO GET THE UNICODE SCRIPT FOR A CHARACTER

At this point it can be interesting to see how you can use PowerShell to find out which Script a given char belongs to.

This is a tough task since \p in .NET is not aware of script names, so there's no straightforward way to match a char to a Script, meaning the following code won't work:
'X' -match "(\p{Anatolian_Hieroglyphs})"

parsing "(\p{Anatolian_Hieroglyphs})" - Unknown property 'Anatolian_Hieroglyphs'.
At line:1 char:1
+ 'X' -match "(\p{Anatolian_Hieroglyphs})"
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : OperationStopped: (:) [], ArgumentException
    + FullyQualifiedErrorId : System.ArgumentException

But, since we know now that the UCD contains a list of all the Scripts in a file on the Unicode.org website, we just have to retrieve it via Invoke-WebRequest:

$sourcescripts = "http://www.unicode.org/Public/UNIDATA/Scripts.txt"

$scriptsweb = Invoke-WebRequest $sourcescripts
Then a bit of manipulation is required to translate this text file to an object:

$scriptsinfo = ($scriptsweb.content.split("`n").trim() -ne "") |
                sls "^#" -NotMatch |
                convertfrom-csv -Delimiter ';' -header "range","scriptname"
Basically, I am

  • splitting the content of the web page so that I have one range per line: split("`n")
  • removing empty lines: -ne ""
  • suppressing comments (indicated by hash marks): sls "^#" -NotMatch
  • converting the semicolon-delimited data to objects with two properties named Range and ScriptName: -header "range","scriptname"

That makes for a pretty nice oneliner: I had a text file on a web server and in three lines of code I have an object containing all the possible Scripts and their code point ranges:

AB60..AB64    Latin # L&   [5] LATIN SMALL LETTER SAKHA YAT......
FB00..FB06    Latin # L&   [7] LATIN SMALL LIGATURE FF..LATIN....
FF21..FF3A    Latin # L&  [26] FULLWIDTH LATIN CAPITAL LETTER....
FF41..FF5A    Latin # L&  [26] FULLWIDTH LATIN SMALL LETTER A....
0370..0373    Greek # L&   [4] GREEK CAPITAL LETTER HETA..GRE....
0375          Greek # Sk       GREEK LOWER NUMERAL SIGN          ....
0376..0377    Greek # L&   [2] GREEK CAPITAL LETTER PAMPHYLIA....
037A          Greek # Lm       GREEK YPOGEGRAMMENI               ....
037B..037D    Greek # L&   [3] GREEK SMALL REVERSED LUNATE SI....
037F          Greek # L&       GREEK CAPITAL LETTER YOT      ....
0384          Greek # Sk       GREEK TONOS                       ....
0386          Greek # L&       GREEK CAPITAL LETTER ALPHA WIT....
0388..038A    Greek # L&   [3] GREEK CAPITAL LETTER EPSILON W....
038C          Greek # L&       GREEK CAPITAL LETTER OMICRON W....
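If you want to try the parsing logic without hitting the web, here is a sketch that applies the same steps to a small hand-made sample. The sample lines only imitate the 'range ; script # comment' layout and are not a real excerpt of the UCD file; I am also using Where-Object instead of sls to drop the comments, which does the same filtering on plain strings:

```powershell
# Hand-made sample in the 'range ; script # comment' layout (not real UCD content)
$sample = @'
# This comment line gets dropped

0370..0373    ; Greek # L&   [4]
0375          ; Greek # Sk
0041..005A    ; Latin # L&  [26]
'@

$rows = (($sample -split "`r?`n").Trim() -ne '') |
    Where-Object { $_ -notmatch '^#' } |
    ConvertFrom-Csv -Delimiter ';' -Header 'range', 'scriptname'

# Clean up each field: trim the range, cut the trailing comment from the name
$rows | ForEach-Object {
    '{0} -> {1}' -f $_.range.Trim(), ($_.scriptname -replace '#.*$').Trim()
}
```

Each surviving line becomes an object with a range and a scriptname property, which is the shape the lookup code relies on.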
Now, to see what Script a char belongs to, I simply have to find its numeric value, check whether it falls within one of the code point ranges (converted from hex to decimal) and return the Script name:

$char = 'Ή'

$decimal = [int][char]$char

foreach($line in $scriptsinfo){

    #Splitting each range on the double dots (..)
    $hexrange = $line.range.Trim() -split '\.\.'

    #Getting the start value of the range and converting it from hex to decimal for easier comparison
    $startvaluedec = [Convert]::ToInt32($hexrange[0], 16)

    if($hexrange.Count -gt 1){

        #Getting the end value of the range (it exists only for real ranges)
        $endvaluedec = [Convert]::ToInt32($hexrange[1], 16)

        #Checking existence in range
        if($decimal -ge $startvaluedec -and $decimal -le $endvaluedec){

            "$char (dec: $decimal) is in script $($line.scriptname -replace '\#.*$') between $startvaluedec and $endvaluedec"
            }
        }

    else{

        #Checking equality with single value (in case it is not a range)
        if($decimal -eq $startvaluedec){

            "$char (dec: $decimal) is in script $($line.scriptname -replace '\#.*$') because it's equal to $startvaluedec"
            }
        }
    }

Ή (dec: 905) is in script Greek  between 904 and 906
Nice, isn't it?

Another nicety is to use the same code we saw above to get the full list of all the existing Scripts:

((($scriptsweb.content.split("`n").trim() -ne "") | sls "^#" -NotMatch | convertfrom-csv -Delimiter ';' -header "range","scriptname").scriptname -replace '\#.*$').trim() | select -unique
Common, Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, Thaana, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Tibetan, Myanmar, Georgian, Hangul, Ethiopic, Cherokee, Canadian_Aboriginal, Ogham, Runic, Khmer, Mongolian, Hiragana, Katakana, Bopomofo, Han, Yi, Old_Italic, Gothic, Deseret, Inherited, Tagalog, Hanunoo, Buhid, Tagbanwa, Limbu, Tai_Le, Linear_B, Ugaritic, Shavian, Osmanya, Cypriot, Braille, Buginese, Coptic, New_Tai_Lue, Glagolitic, Tifinagh, Syloti_Nagri, Old_Persian, Kharoshthi, Balinese, Cuneiform, Phoenician, Phags_Pa, Nko, Sundanese, Lepcha, Ol_Chiki, Vai, Saurashtra, Kayah_Li, Rejang, Lycian, Carian, Lydian, Cham, Tai_Tham, Tai_Viet, Avestan, Egyptian_Hieroglyphs, Samaritan, Lisu, Bamum, Javanese, Meetei_Mayek, Imperial_Aramaic, Old_South_Arabian, Inscriptional_Parthian, Inscriptional_Pahlavi, Old_Turkic, Kaithi, Batak, Brahmi, Mandaic, Chakma, Meroitic_Cursive, Meroitic_Hieroglyphs, Miao, Sharada, Sora_Sompeng, Takri, Caucasian_Albanian, Bassa_Vah, Duployan, Elbasan, Grantha, Pahawh_Hmong, Khojki, Linear_A, Mahajani, Manichaean, Mende_Kikakui, Modi, Mro, Old_North_Arabian, Nabataean, Palmyrene, Pau_Cin_Hau, Old_Permic, Psalter_Pahlavi, Siddham, Khudawadi, Tirhuta, Warang_Citi, Ahom, Anatolian_Hieroglyphs, Hatran, Multani, Old_Hungarian, SignWriting
As you can see, in a few lines of code, we added to our code the ability to compare a character against a Unicode Script name, which is something that is not supported by the .NET regex engine out of the box.

5.2 - HOW TO GET THE UNICODE BLOCK FOR A CHARACTER

The next step is to see how we can find out which Block a given character belongs to. This is easier than getting the Script because, while .NET doesn't support regex matching against Script names, it natively supports matching against Block names.

Just remember to prepend 'Is' to the Block name: not all Unicode regex engines use the same syntax to match Unicode blocks and, while Perl uses the «\p{InBlock}» syntax, .NET uses «\p{IsBlock}» instead:

'Ω' -match "(\p{Greek})"
parsing "(\p{Greek})" - Unknown property 'Greek'.
At line:1 char:1
+ 'Ω' -match "(\p{Greek})"
+ ~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : OperationStopped: (:) [], ArgumentException
    + FullyQualifiedErrorId : System.ArgumentException

'Ω' -match "(\p{IsGreek})"
True
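Since an unknown name makes the regex engine throw, it is worth wrapping the match in a try/catch when the name comes from somewhere else. A small sketch (the list of names is mine, and the character is built from its code point, U+03A9, to avoid any file encoding surprise):

```powershell
# Greek capital letter Omega, built from its code point
$char = [string][char]0x03A9

$results = foreach ($name in 'IsGreek', 'IsCyrillic', 'Greek') {
    try {
        # A valid block name gives True or False
        '{0}: {1}' -f $name, ($char -match "\p{$name}")
    }
    catch {
        # A name the engine does not know raises an exception instead
        '{0}: unknown to the regex engine' -f $name
    }
}

$results
```

The first name matches, the second is a valid block that simply does not contain the character, and the third (the name without the 'Is' prefix) ends up in the catch block.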
If I want to check a character against all the existing Blocks, I just have to rely on the UCD and dynamically build all the possible regexes:

$sourceblocks = "http://www.unicode.org/Public/UNIDATA/Blocks.txt"

$blocksweb = Invoke-WebRequest $sourceblocks

$blocklist = (($blocksweb.content.split("`n").trim() -ne "") | 
                    
                sls "^#" -NotMatch |
                    
                convertfrom-csv -Delimiter ';' -header "range","blockname").blockname

$char = 'Ω'
    
foreach($block in $blocklist){

    $block = $block -replace ' ',''

    $regex = "(?=\p{Is$block})"

    try{

        if($char -match $regex)

            {"$char is in $block"}

        }

    catch{}

    }

Ω is in GreekandCoptic
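
The same loop can be wrapped in a small reusable function. What follows is only a sketch: the function name Get-UnicodeBlockName and its parameters are my own invention, and it takes a short hard-coded list of block names (one of them deliberately fake) instead of downloading Blocks.txt, so that it can run offline.

```powershell
function Get-UnicodeBlockName {
    param(
        [Parameter(Mandatory)][char]$Char,
        [Parameter(Mandatory)][string[]]$BlockName
    )
    foreach ($name in $BlockName) {
        # Same convention as above: strip the spaces and prepend 'Is'
        $token = $name -replace ' ', ''
        try {
            if ([regex]::IsMatch([string]$Char, "\p{Is$token}")) { $token }
        }
        catch {
            # The regex engine rejects block names it does not know: skip them
        }
    }
}

# Hypothetical sample list; the last entry is not a real block
$sample = 'Basic Latin', 'Latin-1 Supplement', 'Greek and Coptic', 'Cyrillic', 'Not A Real Block'

# U+03A9 is GREEK CAPITAL LETTER OMEGA
$result = Get-UnicodeBlockName -Char ([char]0x03A9) -BlockName $sample
$result   # GreekandCoptic
```

Passing the block list as a parameter keeps the network call separate from the matching logic, so the $blocklist built from Blocks.txt can be fed to the same function.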
Another fun exercise is to list all the characters in the Cherokee Block, just to see how that can be done:

0..65535 | % { if([char]$_ -match "(?=\p{IsCherokee})"){[char]$_} }
Ꭰ Ꭱ Ꭲ Ꭳ Ꭴ Ꭵ Ꭶ Ꭷ Ꭸ Ꭹ Ꭺ Ꭻ Ꭼ Ꭽ Ꭾ Ꭿ Ꮀ Ꮁ Ꮂ Ꮃ Ꮄ Ꮅ Ꮆ Ꮇ Ꮈ Ꮉ Ꮊ Ꮋ Ꮌ Ꮍ Ꮎ Ꮏ Ꮐ Ꮑ Ꮒ Ꮓ Ꮔ Ꮕ Ꮖ Ꮗ Ꮘ Ꮙ Ꮚ Ꮛ Ꮜ Ꮝ Ꮞ Ꮟ Ꮠ Ꮡ Ꮢ Ꮣ Ꮤ Ꮥ Ꮦ Ꮧ Ꮨ Ꮩ Ꮪ Ꮫ Ꮬ Ꮭ Ꮮ Ꮯ Ꮰ Ꮱ Ꮲ Ꮳ Ꮴ Ꮵ Ꮶ Ꮷ Ꮸ Ꮹ Ꮺ Ꮻ Ꮼ Ꮽ Ꮾ Ꮿ Ᏸ Ᏹ Ᏺ Ᏻ Ᏼ Ᏽ ᏶ ᏷ ᏸ ᏹ ᏺ ᏻ ᏼ ᏽ ᏾ ᏿
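
As a sanity check, the code points matched this way can be turned back into a range and compared with the one that Blocks.txt gives for the block. A minimal sketch, assuming a precompiled [regex] object (the variable names are mine):

```powershell
# Build the block test once instead of re-parsing the pattern for every code point
$cherokee = [regex]'\p{IsCherokee}'

# Collect every code point of the Basic Multilingual Plane that falls in the block
$codePoints = @(0..0xFFFF | Where-Object { $cherokee.IsMatch([string][char]$_) })

$first = 'U+{0:X4}' -f $codePoints[0]
$last  = 'U+{0:X4}' -f $codePoints[-1]
"$first..$last, $($codePoints.Count) code points"   # U+13A0..U+13FF, 96 code points
```

The block test is purely range-based, which is why unassigned code points inside the block show up in the list as well.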

6 - BACK TO THE POWERSHELL PUZZLE

Now that we are proficient with Unicode in our regexes, let's see how we could have easily solved the puzzle.

I asked to detect all filenames that had letters (neither symbols nor numbers) in the Latin-1 Supplement character block.

The Latin-1 Supplement is the second Unicode block in the Basic Multilingual Plane. It ranges from U+0080 (decimal 128) to U+00FF (decimal 255), for a total of 128 code points: 64 in the Latin Script and 64 in the Common Script. Basically it contains some currency symbols (Yen, Pound), a few math signs (multiplication, division) and many lowercase and uppercase letters that have diacritics (such as À, é, ñ or ü).

What's a diacritic you ask? The answer comes from Wikipedia:

Diacritic /daɪ.əˈkrɪtɪk/ – also diacritical mark, diacritical point, or diacritical sign – is a glyph added to a letter, or basic glyph. The term derives from the Greek διακριτικός (diakritikós, "distinguishing"), which is composed of the ancient Greek διά (diá, through) and κρίνω (krínein or kríno, to separate). Diacritic is primarily an adjective, though sometimes used as a noun, whereas diacritical is only ever an adjective. Some diacritical marks, such as the acute ( ´ ) and grave ( ` ), are often called accents. Diacritical marks may appear above or below a letter, or in some other position such as within the letter or between two letters. The main use of diacritical marks in the Latin script is to change the sound-values of the letters to which they are added.

Since a dedicated Unicode Block exists for them (Combining Diacritical Marks), its content can be shown with a one-liner:
0..65535 | % { if([char]$_ -match "(?=\p{IsCombiningDiacriticalMarks})"){[char]$_} }
Since we have seen the syntax to check if a character has a specific Unicode Block property, and since Latin-1 Supplement IS a Block property, here's what to do:

'A' -match "(\p{IsLatin-1Supplement})"
False

'é' -match "(\p{IsLatin-1Supplement})"
True
Good! No hardcoded values here, meaning no stuff like:

where name -Match '[\u0080-\u00ff]'
or

if (($LetterNumber -in 192..214)
Subjectivity is gone!
At the same time, I asked to include only the filenames containing Letters from that Unicode Block, not Symbols or Digits. Here's where the General Category property we saw above comes to the rescue. I can force the regex engine to include all letters ( \p{L} ), and exclude numbers ( \P{N} ), punctuation ( \P{P} ), symbols ( \P{S} ) and separators ( \P{Z} ).

'A' -match "(?=\p{IsLatin-1Supplement})(?=\p{L})(?=\P{N})(?=\P{P})(?=\P{S})(?=\P{Z})"
False

'é' -match "(?=\p{IsLatin-1Supplement})(?=\p{L})(?=\P{N})(?=\P{P})(?=\P{S})(?=\P{Z})"
True
Concerning the expression, I am using here the positive lookahead assertion (?=...), which is zero-width: it checks the character at the current position without consuming it. I can chain as many lookaheads as I want, and together they act as a logical "and" between the different categories I am passing to \p or \P .
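
To see the "and" at work, it helps to split the test into its parts and run it against a few characters. The sketch below uses sample code points I picked for illustration (they are not part of the original puzzle) and builds each character from its code point, so the result does not depend on the script file encoding.

```powershell
$pattern = '(?=\p{IsLatin-1Supplement})(?=\p{L})(?=\P{N})(?=\P{P})(?=\P{S})(?=\P{Z})'

# A, é, ñ, £ (pound sign), ÷ (division sign), ½ (one half), Ω
$samples = 0x0041, 0x00E9, 0x00F1, 0x00A3, 0x00F7, 0x00BD, 0x03A9

$results = foreach ($cp in $samples) {
    $c = [string][char]$cp
    [PSCustomObject]@{
        CodePoint = 'U+{0:X4}' -f $cp
        Char      = $c
        InBlock   = $c -match '\p{IsLatin-1Supplement}'   # Block test alone
        IsLetter  = $c -match '\p{L}'                     # General Category test alone
        Both      = $c -match $pattern                    # all lookaheads combined
    }
}

$results | Format-Table -AutoSize
```

Only é and ñ pass the combined test: A and Ω are letters outside the block, while £, ÷ and ½ are inside the block but are a currency symbol, a math symbol and a number respectively.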

7 - THE SOLUTION IS...

Of course, this can be shortened to
'é' -match "(?=\p{IsLatin-1Supplement})(?=\p{L})"
since there are no code points which are at the same time letters and numbers or letters and symbols, etc.
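
This equivalence does not have to be taken on trust: each code point has exactly one General Category, so the long and the short pattern must agree everywhere. A quick sketch that checks it over the whole Basic Multilingual Plane (variable names are mine, and I use [regex] objects so that the comparison is case-sensitive and explicit):

```powershell
$long  = [regex]'(?=\p{IsLatin-1Supplement})(?=\p{L})(?=\P{N})(?=\P{P})(?=\P{S})(?=\P{Z})'
$short = [regex]'(?=\p{IsLatin-1Supplement})(?=\p{L})'

# Keep every code point on which the two patterns disagree
$disagreements = @(
    0..0xFFFF | Where-Object {
        $c = [string][char]$_
        $long.IsMatch($c) -ne $short.IsMatch($c)
    }
)

"Code points where the two patterns differ: $($disagreements.Count)"   # 0
```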

To sum it up, listing all the files whose names contain a letter from the Latin-1 Supplement block is as simple as typing the following command:

Get-ChildItem 'C:\FileShare' -Recurse -Force |
   Where { $_.Name -match "(?=\p{IsLatin-1Supplement})(?=\p{L})"} |
   ForEach-Object {[PSCustomObject]@{
    Name=$_.Name;Directory=$_.Directory;
    'Creation Date'=$_.CreationTime;
    'Last Modification Date'=$_.LastWriteTime;
    'File Size'=$_.Length} } |
        Format-Table -AutoSize
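
The filter can be tried without touching a real file share by running it against a handful of strings. The names below are hypothetical and are built from code points on purpose. The last one stores é as a plain e followed by a combining acute accent (U+0301), a form that the Block test alone does not catch; the standard String.Normalize method, which is not part of the original solution, can compose it first.

```powershell
$pattern = '(?=\p{IsLatin-1Supplement})(?=\p{L})'

# Hypothetical file names, not taken from a real share
$names = @(
    'report.txt'                          # plain ASCII
    ('caf' + [char]0x00E9 + '.txt')       # precomposed é (U+00E9)
    ('price_' + [char]0x00A3 + '.txt')    # pound sign: in the block, but a symbol
    ('cafe' + [char]0x0301 + '.txt')      # e followed by a combining acute accent
)

# Same test as in the Where clause above, applied to plain strings
$hits = @($names | Where-Object { $_ -match $pattern })

# Compose base letter + combining mark into a single code point (NFC) before testing
$normalizedHits = @(
    $names |
        ForEach-Object { $_.Normalize([System.Text.NormalizationForm]::FormC) } |
        Where-Object { $_ -match $pattern }
)

"{0} match as-is, {1} after NFC normalization" -f $hits.Count, $normalizedHits.Count
# 1 match as-is, 2 after NFC normalization
```

Whether normalization is wanted depends on how the puzzle is read: the decomposed name contains no code point from the Latin-1 Supplement block, even though it looks identical on screen.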


I hope you enjoyed this explanation. If you are a Unicode guru and you find something incorrect, do not hesitate to drop a comment and I'll update the post. Thanks again to PowerShell.org for giving me the opportunity to be part of a larger community.

Stay tuned for more PowerShell fun!