Monday, April 4, 2016

Working with Unicode scripts, blocks and categories in PowerShell

This March 2016 I was honored to be the author of the monthly scripting competition. For the contest, I came up with a scenario where the system administrator was tasked to use PowerShell to check a given path and identify all the files whose names had letters (not symbols nor numbers) in the Latin-1 Supplement character block.

This scenario came in two versions: one for beginners, where competitors were allowed to write a one-liner, and one for experts, where I expected people to write a tool (in the form of an advanced function) to do the job.

In both cases I expected people to focus on understanding how regular expression engines use the Unicode character set and to use the best possible syntax to solve the puzzle. That's why I explicitly asked competitors to work with the Latin-1 Supplement character block. That was the key clue that should have pushed people to learn that Unicode is such a large character set that it has been split up into categories: using these categories makes your regular expressions more robust.


Let's start by looking at some of the sample answers we got, which are not exactly what I expected:

where {$_.Name -match '[\u00C0-\u00D6]' -or $_.Name -match '[\u00D8-\u00F6]' -or $_.Name -match '[\u00F8-\u00FF]'}

Where {$_.Name -match "[\u00C0-\u00FF]"}

Where {$_.Name -match '[\u00C0-\u00FF]' -and $_.Name -notmatch '[\u00D7]|[\u00F7]'}

Where-Object{[int[]][char[]]$ -gt 192}

.Where({$_.Name -match "[\u00C0-\u00FF]"

if (($LetterNumber -in 192..214) -or ($LetterNumber -in 216..246) -or ($LetterNumber -in 248..255))

where name -Match '[\u0080-\u00ff]'

$_.Name -match '[\u00C0-\u00FF -[\u00D7\u00F7]]'

$_.Name -match '[\u0083\u008A\u008C\u008E\u009A\u009C\u009E\u009F\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]'

if ($char -notmatch '[a-z]' -and [Globalization.CharUnicodeInfo]::GetUnicodeCategory($char) -match 'caseLetter$')

Where-Object -FilterScript { @('©','¼','½','÷') -notcontains $_.Name -and [regex]::IsMatch($_.Name,"\p{IsLatin-1Supplement}")

($_.Name -match '[\p{IsLatin-1Supplement}-[\x80-\xbf\xd7\xf7]]+')

It's easy to see that all these answers are, to varying degrees, impractical to maintain and error-prone for a simple reason: the code points used in the code are not human-readable, and a simple typo can break the code without raising alerts.

There is also a problem of subjectivity, where each competitor decided to use different code points: 00D6 or 00FF or 00F7, to name a few examples.

So the question is how you decide which code points to use and how you could have taken advantage of Unicode categories in your regular expressions to write a solid answer to this puzzle.

To answer this question I will first walk you through the Unicode model to see how it is structured.
You can think of Unicode as a database, maintained by an international consortium, which stores all the characters of all the existing languages.

New versions are released to reflect major changes, since new writing systems are discovered periodically and new glyphs (which are graphical representations of characters) have to be added: look for instance at those found on the 4,000-year-old Phaistos Disc:


The first version of Unicode dates back to 1991 and since then a bunch of versions have followed:
  • 1.0.0 (October 1991)
  • 2.0.0 (July 1996)
  • 3.0.0 (September 1999)
  • 4.0.0 (April 2003)
  • 5.0.0 (July 2006)
  • 6.0.0 (October 2010)
  • 7.0.0 (June 2014)
  • 8.0.0 (June 2015)
The latest version is 8.0 and it defines a code space of 1,114,112 code points in the range 0 hex to 10FFFF hex:
(10FFFF)₁₆ = (1114111)₁₀
Concerning the Windows world, the .NET Framework 4.5 conforms to the Unicode 6.0 standard, which dates from 2010, while previous versions conform to the Unicode 5.0 standard, as you can read here.
Each code point is referred to by writing "U+" followed by its hexadecimal number, where U stands for Unicode. So U+10FFFF is the code point for the last code point in the database.
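You can double-check that arithmetic straight from the PowerShell prompt; this trivial snippet is just a sanity check:

0x10FFFF        # 1114111, the decimal value of the highest code point
0x10FFFF + 1    # 1114112, the size of the code space (code points are counted from 0)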

All these code points are divided into seventeen planes, each with 65,536 elements. The first three planes are named respectively:

  • Basic Multilingual Plane, or BMP
  • Supplementary Multilingual Plane, or SMP
  • Supplementary Ideographic Plane, or SIP
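Since each plane holds 0x10000 code points, finding the plane a code point lives in is a simple integer division; a quick sketch of the idea:

# Which plane does code point U+1D11E belong to?
[math]::Floor(0x1D11E / 0x10000)   # returns 1, the Supplementary Multilingual Plane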

The BMP, whose extent corresponds exactly to an unsigned 16-bit integer ([uint16]::MaxValue = 65535), covers Latin, African and Asian languages as well as a good amount of symbols. So languages like English, Spanish, Italian, Russian, Greek, Ethiopic, Arabic and CJK (which stands for the Chinese, Japanese and Korean languages) have code points assigned in this plane.

These code points are written as four-digit hexadecimal values, from 0000 to FFFF. So, for instance:

  • U+0058 is the code point for the Latin capital X
  • U+0389 is the code point for the Greek capital letter Eta with tonos (Ή)
  • U+221A is the code point for the square root symbol
  • U+0040 is the code point for the Commercial At symbol
  • U+9999 is the code point for the Han character meaning 'fragrant, sweet smelling, incense'
  • U+0033 is the code point for the digit three
So, letters, digits and symbols we widely use have all their code point in the Unicode database.

The .NET Framework uses the System.Char structure to represent a Unicode character.
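A quick way to verify from PowerShell that a System.Char is a 16-bit value:

[int][char]::MinValue   # 0
[int][char]::MaxValue   # 65535: a System.Char spans exactly the Basic Multilingual Plane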


There is a simple way in PowerShell to find the code point of a given glyph.

First you have to take the given character and find its numeric value, using typecasting on the fly:

$char = 'X'

[int][char]$char   # 88

This is the equivalent of the ORD function you have in many other languages (Delphi, PHP, Perl, etc).
Then, using the Format operator with the X format string, convert it to hexadecimal:

'{0:X4}' -f [int][char]$char
Since each Unicode code point is referred to with a U+, we just have to add it to our string through concatenation:

'U+{0:X4}' -f [int][char]$char
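If you find yourself doing this often, it is trivial to wrap in a small helper function (the name Get-CodePoint is just an example of mine):

function Get-CodePoint([char]$Character){
    # Cast the char to its numeric value and format it the Unicode way
    'U+{0:X4}' -f [int]$Character
}

Get-CodePoint 'X'   # U+0058
Get-CodePoint 'é'   # U+00E9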

Now, if you want to get the glyph of a given code point, you have to reverse your code:

First you have to ask PowerShell to call ToInt32 to convert the hex value (base-16) to a decimal:

[int][Convert]::ToInt32('0058', 16)
Then a step is required to cast the decimal to a char:

[Convert]::ToChar([int][Convert]::ToInt32('0058', 16))
So, if we go back to the examples we saw before, we can use a loop to convert their four-digit hex values to the corresponding glyphs.

'0058','0389','221A','0040','9999','0033' | % { [Convert]::ToChar([int][Convert]::ToInt32($_, 16)) }
Actually, there is a simpler way to get the same result, which relies on the implicit conversion performed by the PowerShell parser when numbers are prefixed with '0x':

0x0058, 0x389, 0x221a, 0x0040, 0x9999, 0x0033 | % { [char]$_ }

At this point it is interesting to know that Unicode adopts UTF-16 as the standard encoding for everything inside the Basic Multilingual Plane, since, as we have seen, most living languages have all (or at least most) of their glyphs within the range 0 - 65535.
For characters beyond the first Unicode plane, that is, those whose code point is greater than 65535 and hence can't fit in a 16-bit integer (a Word), we can use two encodings: UTF-32 or 16-bit surrogate pairs.
The latter is a method where a glyph is represented by a first (high) surrogate 16-bit code value in the range U+D800 to U+DBFF and a second (low) surrogate 16-bit code value in the range U+DC00 to U+DFFF. Using this mechanism, UTF-16 can address all 1,114,112 potential Unicode code points (2^16 * 17 planes).
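To make the mechanism concrete, here is a small sketch that follows the standard UTF-16 algorithm to decompose a supplementary code point into its surrogate pair:

# Decompose U+1D11E (MUSICAL SYMBOL G CLEF) into high and low surrogates
$codepoint = 0x1D11E
$offset = $codepoint - 0x10000
$high = 0xD800 + [math]::Floor($offset / 0x400)
$low  = 0xDC00 + ($offset % 0x400)
'{0:X4} {1:X4}' -f $high, $low   # D834 DD1E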

In any case the Windows console is not capable of showing non-BMP glyphs, even when a font like Code2001 is installed. Let's see this in practice.
In the example below I am outputting the glyph for the commercial at (which is in the BMP) starting from its UTF-32 representation, using the ConvertFromUtf32 method:
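Something along these lines:

[char]::ConvertFromUtf32(0x0040)   # returns '@'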
In this other example below I am trying hard to show on screen the glyph for the MUSICAL SYMBOL G CLEF, which was added in Unicode 3.1 and belongs to the Supplementary Multilingual Plane, but I am only able to get a square box (which is used for all characters for which the font does not have a glyph):
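Again, roughly like this; U+1D11E sits beyond the BMP, so the console falls back to the replacement box:

[char]::ConvertFromUtf32(0x1D11E)   # prints a square box in the console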
Now that you are confident with code points, it is time to step up your game and get an understanding of some Unicode properties which are useful to solve our puzzle: General Category, Script and Block.


Each code point is kind of an object that has a property named General Category. The major categories are: Letter, Number, Mark, Punctuation, Symbol, Separator, and Other.

Within these 7 categories, there are the following subdivisions:

  • {L} or {Letter}
  • {Ll} or {Lowercase_Letter}
  • {Lu} or {Uppercase_Letter}
  • {Lt} or {Titlecase_Letter}
  • {L&} or {Cased_Letter}
  • {Lm} or {Modifier_Letter}
  • {Lo} or {Other_Letter}
  • {M} or {Mark}
  • {Mn} or {Non_Spacing_Mark}
  • {Mc} or {Spacing_Combining_Mark}
  • {Me} or {Enclosing_Mark}
  • {Z} or {Separator}
  • {Zs} or {Space_Separator}
  • {Zl} or {Line_Separator}
  • {Zp} or {Paragraph_Separator}
  • {S} or {Symbol}
  • {Sm} or {Math_Symbol}
  • {Sc} or {Currency_Symbol}
  • {Sk} or {Modifier_Symbol}
  • {So} or {Other_Symbol}
  • {N} or {Number}
  • {Nd} or {Decimal_Digit_Number}
  • {Nl} or {Letter_Number}
  • {No} or {Other_Number}
  • {P} or {Punctuation}
  • {Pd} or {Dash_Punctuation}
  • {Ps} or {Open_Punctuation}
  • {Pe} or {Close_Punctuation}
  • {Pi} or {Initial_Punctuation}
  • {Pf} or {Final_Punctuation}
  • {Pc} or {Connector_Punctuation}
  • {Po} or {Other_Punctuation}
  • {C} or {Other}
  • {Cc} or {Control}
  • {Cf} or {Format}
  • {Co} or {Private_Use}
  • {Cs} or {Surrogate}
  • {Cn} or {Unassigned}

The Char.GetUnicodeCategory and the CharUnicodeInfo.GetUnicodeCategory methods are used to return the General Category property of a char.

'X','Ή','√','@','香','3' | % { [System.Globalization.CharUnicodeInfo]::GetUnicodeCategory($_) }
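The expected output is one General Category per input character:

UppercaseLetter
UppercaseLetter
MathSymbol
OtherPunctuation
OtherLetter
DecimalDigitNumber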
As you can see, Unicode also brings interesting possibilities. Once you know that each Unicode character belongs to a certain category, you can try to match a single character to a category with \p (in lowercase) in your regular expression:

#A is a letter
'A' -match "(\p{L})"

#3 is a digit
3 -match "(\p{N})"
You can also match a single character not belonging to a category with \P (uppercase):

#X is not a digit
'X' -match "(\P{N})"

#3 is not a letter
3 -match "(\P{L})"

Other useful properties of a character are Script and Block: each character belongs to a Script and to a Block.

A Script is a group of code points defining a given human writing system, so we can generally think of a script as a language. Though many scripts (like Cherokee, Lao or Thai) correspond to a single natural language, others (like Latin) are common to multiple languages (Italian, French, English...). Code points in a Script are scattered and don't necessarily form a contiguous range.

The list of the existing Scripts is kept by the Unicode Consortium in the Unicode Character Database (UCD), which consists of a number of textual data files listing Unicode character properties and related data.

The UCD file for Scripts is here.
A Block, on the other hand, is a contiguous range of code points.

The UCD file for Blocks is here.


At this point it can be interesting to see how you can use PowerShell to determine which Script a given char belongs to.

This is a tough task since \p in .NET is not aware of script names, so there's no straightforward way to match a char to a Script, meaning the following code won't work:
'X' -match "(\p{Anatolian_Hieroglyphs})"

parsing "(\p{Anatolian_Hieroglyphs})" - Unknown property 'Anatolian_Hieroglyphs'.
At line:1 char:1
+ 'X' -match "(\P{Anatolian_Hieroglyphs})"
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : OperationStopped: (:) [], ArgumentException
    + FullyQualifiedErrorId : System.ArgumentException

But, since we know now that the UCD contains a list of all the Scripts in a file on the website, we just have to retrieve it via Invoke-WebRequest:

$sourcescripts = ""

$scriptsweb = Invoke-WebRequest $sourcescripts
Then a bit of manipulation is required to translate this text file into objects:

$scriptsinfo = ($scriptsweb.content.split("`n").trim() -ne "") |
                sls "^#" -n |
                convertfrom-csv -Delimiter ';' -header "range","scriptname"
Basically, I am:

  • splitting the content of the web page into lines: split("`n")
  • removing empty lines: -ne ""
  • suppressing comments (indicated by hash marks): sls "^#" -n
  • converting the data to CSV objects with two columns named Range and ScriptName: -header "range","scriptname"

That makes for a pretty compact solution: there was a text file on a web server and in three lines of code I have an object containing all the possible Scripts and their code point ranges:

AB60..AB64    Latin # L&   [5] LATIN SMALL LETTER SAKHA YAT......
FB00..FB06    Latin # L&   [7] LATIN SMALL LIGATURE FF..LATIN....
0370..0373    Greek # L&   [4] GREEK CAPITAL LETTER HETA..GRE....
0375          Greek # Sk       GREEK LOWER NUMERAL SIGN          ....
0376..0377    Greek # L&   [2] GREEK CAPITAL LETTER PAMPHYLIA....
037A          Greek # Lm       GREEK YPOGEGRAMMENI               ....
037B..037D    Greek # L&   [3] GREEK SMALL REVERSED LUNATE SI....
037F          Greek # L&       GREEK CAPITAL LETTER YOT      ....
0384          Greek # Sk       GREEK TONOS                       ....
0386          Greek # L&       GREEK CAPITAL LETTER ALPHA WIT....
0388..038A    Greek # L&   [3] GREEK CAPITAL LETTER EPSILON W....
038C          Greek # L&       GREEK CAPITAL LETTER OMICRON W....
Now, to see what Script a char belongs to, I simply have to find its numeric value, then check whether it falls within one of the ranges (converted from hex to decimal) and return the corresponding Script name:

$char = 'Ή'

$decimal = [int][char]$char

foreach($line in $scriptsinfo){

    #Splitting each range at the dots: split('..') splits on each '.', leaving an empty element at index 1
    $hexrange = $line.range.split('..')

    #Getting the start value of the range
    $hexstartvalue = $hexrange[0].trim()

    #Getting the end value of the range (if it exists, hence the try/catch)
    try{
        $hexendvalue = $hexrange[2].trim()
    }
    catch{
        $hexendvalue = $null
    }

    #Converting the start value from hex to decimal for easier comparison
    $startvaluedec = [Convert]::ToInt32($hexstartvalue, 16)

    if($hexendvalue){
        $endvaluedec = [Convert]::ToInt32($hexendvalue, 16)
        #Checking existence in range
        if($decimal -in ($startvaluedec..$endvaluedec)){
            "$char (dec: $decimal) is in script $($line.scriptname -replace '\#.*$') between $startvaluedec and $endvaluedec"
        }
    }
    #Checking equality with single value (in case it is not a range)
    elseif($decimal -like $startvaluedec){
        "$char (dec: $decimal) is in script $($line.scriptname -replace '\#.*$') because it's equal to $startvaluedec"
    }
}
Ή (dec: 905) is in script Greek  between 904 and 906
Nice, isn't it?

Another nicety is to use the same code we saw above to get the full list of all the existing Scripts:

((($scriptsweb.content.split("`n").trim() -ne "") | sls "^#" -n | convertfrom-csv -Delimiter ';' -header "range","scriptname").scriptname -replace '\#.*$').trim() | select -unique
Common, Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, Thaana, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Tibetan, Myanmar, Georgian, Hangul, Ethiopic, Cherokee, Canadian_Aboriginal, Ogham, Runic, Khmer, Mongolian, Hiragana, Katakana, Bopomofo, Han, Yi, Old_Italic, Gothic, Deseret, Inherited, Tagalog, Hanunoo, Buhid, Tagbanwa, Limbu, Tai_Le, Linear_B, Ugaritic, Shavian, Osmanya, Cypriot, Braille, Buginese, Coptic, New_Tai_Lue, Glagolitic, Tifinagh, Syloti_Nagri, Old_Persian, Kharoshthi, Balinese, Cuneiform, Phoenician, Phags_Pa, Nko, Sundanese, Lepcha, Ol_Chiki, Vai, Saurashtra, Kayah_Li, Rejang, Lycian, Carian, Lydian, Cham, Tai_Tham, Tai_Viet, Avestan, Egyptian_Hieroglyphs, Samaritan, Lisu, Bamum, Javanese, Meetei_Mayek, Imperial_Aramaic, Old_South_Arabian, Inscriptional_Parthian, Inscriptional_Pahlavi, Old_Turkic, Kaithi, Batak, Brahmi, Mandaic, Chakma, Meroitic_Cursive, Meroitic_Hieroglyphs, Miao, Sharada, Sora_Sompeng, Takri, Caucasian_Albanian, Bassa_Vah, Duployan, Elbasan, Grantha, Pahawh_Hmong, Khojki, Linear_A, Mahajani, Manichaean, Mende_Kikakui, Modi, Mro, Old_North_Arabian, Nabataean, Palmyrene, Pau_Cin_Hau, Old_Permic, Psalter_Pahlavi, Siddham, Khudawadi, Tirhuta, Warang_Citi, Ahom, Anatolian_Hieroglyphs, Hatran, Multani, Old_Hungarian, SignWriting
As you can see, in a few lines of code we added the ability to compare a character against a Unicode Script name, which is something the .NET regex engine doesn't support out of the box.


The next step is to see how we can find which Block a given character belongs to. This is easier than getting the Script because, while .NET doesn't support regexes against Script names, it natively supports running matches against Block names.

Just remember to prepend 'Is' to the Block name: not all Unicode regex engines use the same syntax to match Unicode blocks and, while Perl uses the «\p{InBlock}» syntax, .NET uses «\p{IsBlock}» instead:

'Ω' -match "(\p{Greek})"
parsing "(\p{Greek})" - Unknown property 'Greek'.
At line:1 char:1
+ 'Ω' -match "(\p{Greek})"
+ ~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : OperationStopped: (:) [], ArgumentException
    + FullyQualifiedErrorId : System.ArgumentException

'Ω' -match "(\p{IsGreek})"
If I want to check a character against all the existing Blocks, I just have to rely on the UCD and dynamically build all the possible regexes:

$sourceblocks = ""

$blocksweb = Invoke-WebRequest $sourceblocks

$blocklist = (($blocksweb.content.split("`n").trim() -ne "") | 
                sls "^#" -n |
                convertfrom-csv -Delimiter ';' -header "range","blockname").blockname

$char = 'Ω'
foreach($block in $blocklist){

    #.NET block names contain no spaces
    $block = $block -replace ' ',''

    $regex = "(?=\p{Is$block})"

    try{
        if($char -match $regex)

            {"$char is in $block"}
    }
    catch{
        #Skipping block names the .NET regex engine doesn't know about
    }
}
Ω is in GreekandCoptic
Another fun exercise is to try to get all the characters in the Cherokee Block, just to see how that can be done:

0..65535 | % { if([char]$_ -match "(?=\p{IsCherokee})"){[char]$_} }
Ꭰ Ꭱ Ꭲ Ꭳ Ꭴ Ꭵ Ꭶ Ꭷ Ꭸ Ꭹ Ꭺ Ꭻ Ꭼ Ꭽ Ꭾ Ꭿ Ꮀ Ꮁ Ꮂ Ꮃ Ꮄ Ꮅ Ꮆ Ꮇ Ꮈ Ꮉ Ꮊ Ꮋ Ꮌ Ꮍ Ꮎ Ꮏ Ꮐ Ꮑ Ꮒ Ꮓ Ꮔ Ꮕ Ꮖ Ꮗ Ꮘ Ꮙ Ꮚ Ꮛ Ꮜ Ꮝ Ꮞ Ꮟ Ꮠ Ꮡ Ꮢ Ꮣ Ꮤ Ꮥ Ꮦ Ꮧ Ꮨ Ꮩ Ꮪ Ꮫ Ꮬ Ꮭ Ꮮ Ꮯ Ꮰ Ꮱ Ꮲ Ꮳ Ꮴ Ꮵ Ꮶ Ꮷ Ꮸ Ꮹ Ꮺ Ꮻ Ꮼ Ꮽ Ꮾ Ꮿ Ᏸ Ᏹ Ᏺ Ᏻ Ᏼ Ᏽ ᏶ ᏷ ᏸ ᏹ ᏺ ᏻ ᏼ ᏽ ᏾ ᏿


Now that we are proficient with Unicode in our regexes, let's see how we could have easily solved the puzzle.

I asked to detect all filenames that had letters (not symbols nor numbers) in the Latin-1 Supplement character block.

The Latin-1 Supplement is the second Unicode block in the Basic Multilingual Plane. It ranges from U+0080 (decimal 128) to U+00FF (decimal 255) and contains 64 code points in the Latin Script and 64 code points in the Common Script. Basically it contains some currency symbols (Yen, Pound), a few math signs (multiplication, division) and many of the lowercase and uppercase letters that have diacritics.
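To see that mix for yourself, here is a small one-liner that reuses the CharUnicodeInfo class we met earlier to dump each code point of the block with its General Category:

128..255 | % { '{0}  U+{1:X4}  {2}' -f [char]$_, $_, [Globalization.CharUnicodeInfo]::GetUnicodeCategory([char]$_) }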

What's a diacritic you ask? The answer comes from Wikipedia:

A diacritic /daɪ.əˈkrɪtɪk/ – also diacritical mark, diacritical point, or diacritical sign – is a glyph added to a letter, or basic glyph. The term derives from the Greek διακριτικός (diakritikós, "distinguishing"), which is composed of the ancient Greek διά (diá, through) and κρίνω (krínein or kríno, to separate). Diacritic is primarily an adjective, though sometimes used as a noun, whereas diacritical is only ever an adjective. Some diacritical marks, such as the acute ( ´ ) and grave ( ` ), are often called accents. Diacritical marks may appear above or below a letter, or in some other position such as within the letter or between two letters. The main use of diacritical marks in the Latin script is to change the sound-values of the letters to which they are added.

Since a Unicode Block exists listing all of the combining diacritical marks, they can be shown with a one-liner:
0..65535 | % { if([char]$_ -match "(?=\p{IsCombiningDiacriticalMarks})"){[char]$_} }
Since we have seen the syntax to check if a character has a specific Unicode Block property, and since Latin-1 Supplement IS a Block property, here's what to do:

'A' -match "(\p{IsLatin-1Supplement})"

'é' -match "(\p{IsLatin-1Supplement})"
Good! No hardcoded values here, meaning no stuff like:

where name -Match '[\u0080-\u00ff]'

if (($LetterNumber -in 192..214)
Subjectivity is gone!
At the same time I did ask to include only the filenames containing Letters from that Unicode Block, not Symbols, nor Digits. Here's where the General Category property we saw above comes to the rescue. I can force the regex engine to include all letters ( \p{L} ), and exclude digits ( \P{N} ), punctuation ( \P{P} ), symbols ( \P{S} ) and separators ( \P{Z} ).

'A' -match "(?=\p{IsLatin-1Supplement})(?=\p{L})(?=\P{N})(?=\P{P})(?=\P{S})(?=\P{Z})"

'é' -match "(?=\p{IsLatin-1Supplement})(?=\p{L})(?=\P{N})(?=\P{P})(?=\P{S})(?=\P{Z})"
Concerning the expression, I am using here a positive lookahead assertion, (?=), which is a non-consuming regular expression construct. I can chain as many of these as I want, and they will act as a logical AND between the different categories I am passing to \p or \P.


For sure this can be shortened to

'é' -match "(?=\p{IsLatin-1Supplement})(?=\p{L})"

since there are no code points which are at the same time letters and numbers, or letters and symbols, etc.
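If you want to convince yourself, a quick brute-force check over the whole BMP should return nothing, since the General Categories are mutually exclusive:

$full  = "(?=\p{IsLatin-1Supplement})(?=\p{L})(?=\P{N})(?=\P{P})(?=\P{S})(?=\P{Z})"
$short = "(?=\p{IsLatin-1Supplement})(?=\p{L})"
0..65535 | ? { ([char]$_ -match $full) -ne ([char]$_ -match $short) }   # no output means the two regexes agree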

To sum it up, getting the list of all the files whose names contain Latin letters with diacritics is as simple as typing the following lines:

Get-ChildItem 'C:\FileShare' -Recurse -Force |
   Where { $_.Name -match "(?=\p{IsLatin-1Supplement})(?=\p{L})"} |
   ForEach-Object {[PSCustomObject]@{
    'File Name'=$_.Name;
    'Creation Date'=$_.CreationTime;
    'Last Modification Date'=$_.LastWriteTime;
    'File Size'=$_.Length} } |
        Format-Table -AutoSize
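For the experts' version of the competition, the same regex drops straight into an advanced function. Here is a minimal sketch (the function and parameter names are my own choice, not an official solution):

function Get-Latin1SupplementFile {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        [string]$Path
    )
    process {
        # The two non-consuming lookaheads require a Latin-1 Supplement letter in the name
        Get-ChildItem -Path $Path -Recurse -Force |
            Where-Object { $_.Name -match "(?=\p{IsLatin-1Supplement})(?=\p{L})" }
    }
}

Get-Latin1SupplementFile -Path 'C:\FileShare'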

I hope you enjoyed this explanation. If you are a Unicode guru and you find something incorrect, do not hesitate to drop a comment and I'll update. Thanks again for giving me the occasion of being part of a larger community.

Stay tuned for more PowerShell fun!
