## Friday, October 9, 2015

### Powershell gymnastics - Prime numbers

Some days have passed since the last time we did gym and by now you should be properly rested and ready for another Powershell workout. Today we will start using modulo as a way to find out prime numbers, and then we will move to some kind of pretty obfuscated code to get the very same piece of information.

THE THEORY OF PRIME NUMBERS

A prime number is a number greater than 1 that has no positive divisors other than 1 and itself. For example, 13 is prime, as only 1 and 13 divide it, whereas 14 is not, since it has the divisors 2 and 7 in addition to 1. Nothing new here since those days at school.

A lot of brilliant people have been involved in projects around the world to find the largest known prime number, since they are infinite. Just out of curiosity, you may want to know that the largest known prime number to date is a number with more than 17 million digits and can only be written with this notation:

For sure you can’t use Powershell nor any standard CPU architecture to calculate that, but we can all the same have some fun with the code without going out hunting big numbers.

USING MODULO TO FIND PRIMES

Let’s see how that modulo operator in Powershell can be used to find primes.

Different solutions exist.

The first method, which comes out straight from prime definition, is based on trial division:

```\$Numbers = 1..13
foreach (\$Number in \$Numbers)
{
\$i = 2
\$Prime = \$true
if (\$Number -lt 2)
{
\$Prime = \$false
}
if (\$Number -eq 2)
{
\$Prime = \$true
}
while (\$i -lt \$Number)
{
if (!(\$Number % \$i))
{
\$Prime = \$false
}
\$i++
}
if(\$Prime){\$Number}
}```
Basically, when \$Number is less than 2, that’s not a prime, by definition. When \$Number is 2, that’s a prime, also by definition. And then we proceed with some kind of brute force method, that is, we check for divisibility of \$Number against any number between 2 and \$Number-1 until we find one. If we find at least a divisor, then \$Number is not prime. We say in that case that \$Number is composite.

Here you should notice the line:

`if (!(\$Number % \$i))`
which returns 0 when \$Number can be divided by \$i, because the remainder is 0, which the if statement smartly interprets as a [boolean]\$False.

The previous algorithm is simple but inefficient because it uses trial division up to \$Number-1, which is far from being an optimal method if finding primes.

One way to improve this algorithm is to test only up to sqrt(\$Number).

Now the question is: why do we check up to the square root of a prime number to determine if it is prime? As explained out on StackOverflow"If a number n is not a prime, it can be factored into two factors a and b: n = a*b. If both a and b were greater than the square root of n, a*b would be greater than n. So at least one of those factors must be less or equal to the square root of n, and to check if n is prime, we only need to test for factors less than or equal to the square root."
This translates to Powershell like this:
```\$Numbers = 1..13
foreach (\$Number in \$Numbers)
{
\$i = 2
\$SqrtNumber = [math]::sqrt(\$Number)
\$Prime = \$true
if (\$Number -lt 2)
{
\$Prime = \$false
}
if (\$Number -eq 2)
{
\$Prime = \$true
}
while (\$i -le \$SqrtNumber)
{
if (!(\$Number % \$i))
{
\$Prime = \$false
}
\$i++
}
if(\$Prime){\$Number}
}```
Notice here I added the calculation of the square root of \$Number:
`\$SqrtNumber = [math]::sqrt(\$Number)`
which is exactly what we need to reduce the duration of the while loop. Let’s compare the speed of both these scripts:
```Measure-Command -Expression {
\$Numbers = 1..2000
foreach (\$Number in \$Numbers)
{
\$i = 2
\$Prime = \$true
if (\$Number -lt 2)
{
\$Prime = \$false
}
if (\$n -eq 2)
{
\$Prime = \$true
}
while (\$i -lt \$Number)
{
if (!(\$Number % \$i))
{
\$Prime = \$false
}
\$i++
}
if(\$Prime){\$Number}
}
}
Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 7
Milliseconds      : 751
Ticks             : 77512749
TotalDays         : 8,97138298611111E-05
TotalHours        : 0,00215313191666667
TotalMinutes      : 0,129187915
TotalSeconds      : 7,7512749
TotalMilliseconds : 7751,2749```
```Measure-Command -Expression {
\$Numbers = 1..2000
foreach (\$Number in \$Numbers)
{
\$i = 2
\$SqrtNumber = [math]::sqrt(\$Number)
\$Prime = \$true
if (\$Number -lt 2)
{
\$Prime = \$false
}
if (\$Number -eq 2)
{
\$Prime = \$true
}
while (\$i -le \$SqrtNumber)
{
if (!(\$Number % \$i))
{
\$Prime = \$false
}
\$i++
}
if(\$Prime){\$Number}
}
}
Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 0
Milliseconds      : 553
Ticks             : 5537351
TotalDays         : 6,40897106481481E-06
TotalHours        : 0,000153815305555556
TotalMinutes      : 0,00922891833333333
TotalSeconds      : 0,5537351
TotalMilliseconds : 553,7351 ```
The difference is simply huge: from 7 seconds to half a second.

THE GENIUS FROM CYRENE

To get better performance, we have to go back in time and take advantage of a pretty ancient method to find out if a number is a prime. This method is called by the name of its inventor: the sieve of Eratosthenes (to give you an idea, this man calculated the Earth's circumference around 240 BC just looking down a deep well in the middle of the Egyptian desert near Aswan). In his method you take a list of all the integers less than or equal to n (and greater than one), then you filter out the multiples of all primes less than n. The numbers that are left are all primes.
To make it simple, suppose you have the numbers from 1 to 13. Write them down:
1 2 3 4 5 6 7 8 9 10 11 12 13

The number 1 is not a prime by definition, so knock it out. Then you filter out all the multiples of 2, apart from 2 itself, which is a prime:

2 3 5 7 9 11 13

Then you keep filtering out every multiple of 3, 4, 5 and so on. The numbers that are left are, you name it, primes:

2 3 5 7 11 13

To translate this to Powershell we need to build a sorted hashtable using a SortedDictionary:
`\$Numbers = New-Object 'System.Collections.Generic.SortedDictionary[int, bool]'`
The first step is to set fill the keys of the dictionary with numbers and set the value part to a boolean \$true:
```\$List = New-Object 'System.Collections.Generic.SortedDictionary[int, bool]'
for (\$i=2;\$i -lt \$Numbers[-1]; \$i++)
{
}```

Then to recursively write \$False beside the numbers (or keys) that are multiples of 3, 4, 5, etc
Here’s the full code for the sieve of Eratosthenes:
```\$Numbers = 1..13
\$List = New-Object 'System.Collections.Generic.SortedDictionary[int, bool]'
for (\$i=2;\$i -le \$Numbers[-1]; \$i++)
{
}
foreach (\$i in @(\$List.GetEnumerator()))
{
\$c = 2
while ((\$i.Key * \$c) -le \$Numbers[-1])
{
if (\$List.ContainsKey(\$i.Key * \$c))
{
\$List[(\$i.Key * \$c)] = \$false
}
\$c++
}
}
\$List.GetEnumerator() | Where-Object value | Select-Object @{Name='Primes';Expression={(\$_.key)}}```
Notice that the elements inside a SortedDictionary can be iterated through with the GetEnumerator() method, as in the last line. This must to be done before filtering with Where-Object.
Also notice that, just for fun, the last line can be shortened to:
`\$List.GetEnumerator()|? value|select @{N='Primes';E={(\$_.key)}}`
When I pass the previous code (the one based on sieving primes the Eratosthenes' way) to Measure-Command I get pretty good results:
```Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 0
Milliseconds      : 174
Ticks             : 1742899
TotalDays         : 2,01724421296296E-06
TotalHours        : 4,84138611111111E-05
TotalMinutes      : 0,00290483166666667
TotalSeconds      : 0,1742899
TotalMilliseconds : 174,2899```
This was my way (as my friend Emin always says) to translate this kind of sieve to Powershell code.

A LOOK AT THE 2008 WINTER SCRIPTING GAMES

I remember that in 2008, during the Winter Scripting Games, in the Advanced section, people were challenged to write a code that would output prime numbers. This was the description of the event:
"In Prime Time, competitors must write a script that calculates – and displays – all the prime numbers between 1 and 200, inclusive."
The official answer, which I kept in my archives, reads like this:
```\$arrNumbers = @()

for (\$i = 1; \$i -le 200; \$i++)
{\$arrNumbers += \$i}

\$arrNumbers[0] = "Not prime"

\$intDivisor = 2

do
{
for (\$j = 0; \$j -le \$arrNumbers.length - 1; \$j++)
{
if (\$arrNumbers[\$j] -ne "Not prime")
{
\$intResult = \$arrNumbers[\$j] % \$intDivisor
if (\$intResult -eq 0)
{\$arrNumbers[\$j] = "Not prime"}
}
}

for (\$i = 2; \$i -le \$arrNumbers.length; \$i++)
{
if (\$arrNumbers[\$i] -ne "Not prime")
{
\$intDivisor = \$arrNumbers[\$i]
break
}
}
}
until (\$intDivisor -ge \$arrNumbers.length - 1)

In addition to this official answer (which is so Powershell 1.0), there were some interesting solutions, with some people going brute force while other went with a sieve. Shay Levy and Chris Warwick had interesting approaches. Shay went on with trial division using modulo (as we explained) up to ([math]::Sqrt(\$num)))
Chris used the sieve of Eratosthenes and added a minor but important optimization: he shaved off all the elements less than the square of the current prime, because they have already been set by previous factors:
`for (\$j=\$i*\$i; \$j -le \$max; \$j+=\$i) {\$sieve[\$j]=0}`
There was also an interesting entry by Marc van Orsouw, who was able to solve this event with a slow but short one-liner that I kept for future reference:
`2;2..200 |? {\$i=\$_;(2..(\$i -1) |% {\$i%\$_}) -notcontains 0}`
Well, this is not a pure oneliner, because of the semi-colon. Nonetheless the approach is interesting and therefore I mentioned it. Based on what we said before, we know now that this oneliner can be improved by topping our tests against the square root of the number to test:
`2..200|?{\$i=\$_;(2..[math]::sqrt(\$i)|%{\$i%\$_})-notcontains0}`
The performance improvement is, again, impressive, moving the slider from 47 seconds to a bit more than 1 second:
```Measure-Command {2;2..2000 |? {\$i=\$_;(2..(\$i -1) |% {\$i%\$_}) -notcontains 0}}
ays              : 0
Hours             : 0
Minutes           : 0
Seconds           : 47
Milliseconds      : 125
Ticks             : 471250436
TotalDays         : 0,00054542874537037
TotalHours        : 0,0130902898888889
TotalMinutes      : 0,785417393333333
TotalSeconds      : 47,1250436
TotalMilliseconds : 47125,0436```

```Measure-Command {2..2000|?{\$i=\$_;(2..[math]::sqrt(\$i)|%{\$i%\$_})-notcontains0}}
Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 1
Milliseconds      : 697
Ticks             : 16972911
TotalDays         : 1,96445729166667E-05
TotalHours        : 0,00047146975
TotalMinutes      : 0,028288185
TotalSeconds      : 1,6972911
TotalMilliseconds : 1697,2911```

This is starting to become juicy. Oneliners are a strange world, where you push the capabilities of a language to its further extent, you bend it, and sometimes you make it do things it was not conceived for. In these cases the code starts looking messy, obfuscated and cryptic. No fear to have tough. Keep thinking to it as to a game.

JUST ANOTHER PERL HACKER,

In the early 1980s, following the boom of personal computing, some people started to write oneliners in BASIC, opening the way to surprisingly obfuscated programming. Back in 1984, the American computer scientists Landon Curt Noll and Larry Bassel started the International Obfuscated C Code Contest (IOCCC), as a competition to write the most opaque C code.

The goals of the contest, as it is explained on the main page of the competition, are:
• To write the most Obscure/Obfuscated C program within the rules.
• To show the importance of programming style, in an ironic way.
• To stress C compilers with unusual code.
• To illustrate some of the subtleties of the C language.
• To provide a safe forum for poor C code. :-)
Taking its inspiration from the International Obfuscated C Contest, the Obfuscated Perl Contest was run by The Perl Journal from 1996 until 2000. From that very moment Perl started achieving a pretty good reputation for recreational activities such as code obfuscation.

As far as I was able to understand, this obfuscated Perl coding competition spread out in a community (whose members are known as Perl Monks) where programmers signed their messages with JAPHs.

A JAPH is a Perl program that outputs the sentence “Just another Perl hacker," which is used as a signature.

The first JAPH is one by Randal Schwartz (aka Merlyn) who in 1988 signed his message with:
`print "Just another Perl hacker,"`
Building on top of that, he started inventing new obfuscated JAPHs for his signatures:
`\$_=",rekcah lreP rehtona tsuJ";s/.\$/eval 'print \$&',""/e while length`
Starting from this kind of obfuscated signature, Perl Monks have started to write their JAPHs at full speed, many of which can still be found at the JAPH archive I linked above.

ABIGAIL, REGEX AND PRIMES, ALL IN A JAPH

One of the most famous JAPH is one from the mind twister Perl programmer known as Abigail, who wrote a well-known signature which, using a regular expression, checks for primality of a given number.
Actually I found two different JAPH-style signatures by Abigail containing similar regexes, or regexen (which is the Anglo-Saxon plural of regex, as Larry Wall, the author or Perl, wrote).
The first signature, which appeared on October, 22 1997 in an answer by Abigail to the aforementioned inventor of JAPHs Randal L. Schwartz is:
`perl -e 'print "PRIME" if (1 x shift) !~ /^(11+)\1+\$/'`

The second JAPH, which dates back to September, 16 1998, is an improvement of the other, is:

`perl -wle 'print "Prime" if (1 x shift) !~ /^1?\$|^(11+?)\1+\$/'`

Now if you're not a Perl Monk (Abigail is, and I'm certainly not), you're probably quite confused as to how on earth this works.
There are at least a couple of good articles out there on the subject, but I'll try my hand at it, and I’ll add a conversion to Powershell code as well.

This JAPH uses the regex engine to do the math of checking primality on the number supplied as an argument. The number is retrieved via the shift function and translated into a sequence of 1s.
This means that if we supply 13 as input, it will translate to:
```PS C:\> '1' * 13
1111111111111```
Beware here on the order of the multiplicands, since Powershell will choose the data type on the first element of the operation. In this case we need the first element to be a string, so that the output is unary, hence the single quotes around the number 1:
```PS C:\> ('1' * 13).GetType()

IsPublic IsSerial Name                                     BaseType
-------- -------- ----                                     --------
True     True     String                                   System.Object```
If we switch the elements of the multiplication, the first element being an integer, we will get an integer as output:
```PS C:\> 13 * '1'
13
PS C:\> (13 * '1').GetType()

IsPublic IsSerial Name                                     BaseType
-------- -------- ----                                     --------
True     True     Int32                                    System.ValueType```
Now the pattern matching starts and it is exactly here that the regular expression engine hidden inside our .NET framework kicks in.
The expression is made up of two parts, separated by a pipe, which is an alternation operator. It tells the regex engine to match either everything to the left of the pipe, or everything to the right.
Since, as we saw above, 0 and 1 are not primes by definition, the first part of the regex will match 0 and 1 as non-primes. That’s obtained through the use of a question mark, which is a greedy quantifier that makes the preceding token optional.
So 1? matches the number 1 zero or one time. Let's test it:
```PS C:\> 1 -match '^1?\$'
True
PS C:\> 11 -match '^1?\$'
False```
If it’s not exactly one character, as in the second example above which returns \$False, the regex tries the right side of the alternation, which is where the calculation for numbers greater the 1 is performed in a very smart way. And that’s what made Abigail enter history of Perl (beside the fact of being one of the best known Perl 5 Porters):
`^(11+?)\1+\$`
There are two parts to this: the capture (between parentheses) and the backreference. The backreference can match one or more times (\1+). So, the engine looks for complete groups, and nothing more, of the same thing. The actual thing we match hardly matter. It comes down being able to break up the string evenly into groups.

For instance, let's consider the number 12. The string for that is a sequence of twelve 1s:
111111111111

Can this be broken up into equal groups with nothing left over? Sure; there are six groups of twos:
11 11 11 11 11 11

The number of groups represents one factor (6), while the length of any one of the groups represents another factor. Since it finds two factors, the number must be composite, so not prime.
If we take the number 13:
1111111111111

And split it up in groups of two, we get:
11 11 11 11 11 11 1

There is a remainig 1, so the number is not composite, for the moment. The check keeps trying with all the numbers until it can get en even result. Split in groups of threes, we get:

111 111 111 111 1

Still a one left alone, so it’s not composite. Splitting in four:

1111 1111 1111 1

Nope. So the engine keeps splitting in increasing length groups until no option is left. This process is analogous to trying to divide a number by successively larger divisors, leaving no remainder. For n = 13 we tried 2, and that failed, and then 3, and then 4, and so on. No divisor succeeded, so that’s not composite number.
But, wait, weren’t we checking for primes, not composites? Yes. That’s where the initial inversion performed with !~ comes in. The !~ is a negated match, so the condition is true if the string does not match the regular expression.

Some people will complain that this algorithm is computationally expensive, or memory devastating. Some will say it is inefficient (why bothering checking for numbers greater than the square root of the length?). Some will say it looks so much like as the brute-forcing methods we saw up above in the beginning of this post.
Personally, I don’t mind. Clever is the word here, and the word fully applies to Abigail, whose algorithm is so powerful that it can find primes only by matching strings whose length is NOT prime.

BEYOND REGULAR EXPRESSIONS AND FINITE STATE MACHINES

I feel like here I must say a few words about Finite State Machines (FSM) and how they relate to Abigail's regex. I'll try to make it as simple as I can, and if after reading the whole section you find yourself left open-mouthed and saying "Huh?" to yourself... well, that's pretty normal. Just allow for some time for the concepts to sink in before saying that this is way over your head and that this blog post might as well be in Ancient Greek for the amount of sense it makes to you.
Regular expressions map directly to Finite State Machines (FSM). FSM, also known as Finite State Automata (FSA), are models of the behaviors of a system or a complex object, with a limited number of defined conditions or states, where state transitions change with circumstance.

In the particular regex found in Abigail’s JAPH, the left part of the expression fits perfectly in a Deterministic Finite State Machines model and can be represented with a State Diagram:

On the contrary, the brute-force happening on the right side can’t fit into a FSM model since it uses a backreference:

`\1`

The rule is that you cannot write pure regular expressions with backreferences, since mathematically-regular languages simply cannot include them: a backreference makes the language non-regular. The inclusion of backreferences moves us out of the regular languages and into what it’s called the "language of squares".

So, how is it that the expressions used to find composites number works? Well, the .NET regex engine (which is compatible with the Perl 5 machine used by Abigail), extends the basic regular expression language and it can be defined as Turing machine compatible. In this context, backreferences can be seen as a feature that allows the following diagram to be working successfully while it should not:

FROM PERL TO POWERSHELL

Being the .NET regex engine compatible with the Perl 5 regex engine, this expression can be ported into Powershell code as it is.

The negated match !~ translates into the –notmatch operator.

Here’s a first version of the translated code:

`13|?{'1'*\$_-notmatch"^1?\$|^(11+?)\1+\$"}`
Since 13 is not composite, it will show up in the output of the oneliner:

```PS C:\> 13|?{'1'*\$_-notmatch"^1?\$|^(11+?)\1+\$"}
13```
Using the range operator, we can even output the list of primes in a shot:

```PS C:\> 1..13|?{'1'*\$_-notmatch"^1?\$|^(11+?)\1+\$"}
2
3
5
7
11
13```
For the purpose of code golfing, we can shave off a few more chars, using a syntax I like a lot, based on the Select-String cmdlets with the –notmatch parameter (-n):

```PS C:\> 1..13|?{'1'*\$_|sls "^1?\$|^(11+?)\1+\$"-n}
2
3
5
7
11
13```

How about getting the same result using the [regex] class when we know that no negative matching method is offered?

Well, we know that an IsMatch method is available, which takes two parameters: a input in the form of a string, and a regular expression pattern in the form of a string as well.

So, to find composite numbers we just have to put up the following oneliner:

```PS C:\> [regex]::IsMatch('1'*12,'^1?\$|^(11+?)\1+\$')
True
PS C:\> [regex]::IsMatch('1'*13,'^1?\$|^(11+?)\1+\$')
False```
At this point, the equivalent of the negative match performed by Abigail in Perl with !~ is obtained through the use of the exclamation mark in Powershell. Here we go:

```PS C:\> 1..13|?{!([regex]::IsMatch('1'*\$_,'^1?\$|^(11+?)\1+\$'))}
2
3
5
7
11
13```

Simple and powerful.
Let’s see now how the regex proposed by Abigail performs compared to the algorithm we wrote above:

```PS C:\> Measure-Command {1..2000|?{'1'*\$_-notmatch"^1?\$|^(11+?)\1+\$"}}

Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 5
Milliseconds      : 260
Ticks             : 52602583
TotalDays         : 6,0882619212963E-05
TotalHours        : 0,00146118286111111
TotalMinutes      : 0,0876709716666667
TotalSeconds      : 5,2602583
TotalMilliseconds : 5260,2583```

Five seconds. Not so good, as we could expect. Let’s give a shot to the version based on Select-String:

```PS C:\> Measure-Command {1..2000|?{'1'*\$_|sls "^1?\$|^(11+?)\1+\$"-n}}

Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 5
Milliseconds      : 654
Ticks             : 56546428
TotalDays         : 6,54472546296296E-05
TotalHours        : 0,00157073411111111
TotalMinutes      : 0,0942440466666667
TotalSeconds      : 5,6546428
TotalMilliseconds : 5654,6428```

Still very slow, but, hey, this depends on the efficiency of a regex that was meant to be magic rather than quick.

#### 1 comment:

1. For the purpose of code golfing, we can shave off a few more chars, using a syntax I like a lot, based on the Select-String cmdlets with the –notmatch