Wednesday, October 22, 2014

Reading large text files with Powershell

Any sysadmin out there knows that log files are an invaluable asset for troubleshooting issues on their servers. Sometimes though, these kinds of files can be very large and become difficult to handle, as I had occasion to notice in a Powershell forum discussion a week ago.

The thing to understand is that the right tool to use in Powershell is determined by the kind of file you are dealing with.

Here are the options you have for reading text files:

  • Get-Content
  • Get-Content -Raw
  • Get-Content -ReadCount
  • Switch -File
  • [System.IO.File]::ReadAllText()
  • [System.IO.File]::ReadAllLines()
  • [System.IO.File]::ReadLines()
  • [System.IO.File]::OpenText().ReadToEnd()
 
Get-Content is your out-of-the-box option. It works well with small log files and its use is pretty straightforward, but on large log files it can be very slow.

Get-Content .\small.csv
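
Each line comes back as a separate string, so you can pipe the output or index into it directly. A couple of trivial examples:

(Get-Content .\small.csv).Count    # number of lines in the file
Get-Content .\small.csv | Select-Object -First 2    # peek at the first two lines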

Let's suppose we have four logfiles of different sizes and line counts:

- logfile0.log, a small logfile: 4 lines and 1KB size
- logfile1.log, a medium size logfile: 700K lines and 160MB size
- logfile2.log, a large size logfile: 300K lines and 1.2GB size
- logfile3.log, a logfile containing more than a million lines: 1.4M lines and 650MB size
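
If you want to reproduce these tests, a file roughly the size of logfile3.log can be generated with a few lines of Powershell. Here's a minimal sketch; the line length and line count are rough assumptions chosen to approximate that file:

$line = 'x' * 450    # ~450 bytes per record, times 1.4M lines, is roughly 650MB
$writer = New-Object System.IO.StreamWriter 'C:\temp\get-content\logfile3.log'
try {
    for ($i = 0; $i -lt 1400000; $i++) { $writer.WriteLine($line) }
}
finally {
    $writer.Dispose()    # make sure the file handle is released
}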

Let's see what Get-Content can do against these files:

(Measure-Command {Get-Content .\logfile0.log}).TotalSeconds
0,0013753

(Measure-Command {Get-Content .\logfile1.log}).TotalSeconds
12,7561849

(Measure-Command {Get-Content .\logfile2.log}).TotalSeconds
11,9634649

(Measure-Command {Get-Content .\logfile3.log}).TotalSeconds
229,0962224

Pretty swift on the smaller ones, but way too slow on that last large file, which, with its more than one million lines, becomes difficult to deal with.

Let's see then how the other methods I listed above perform against it. Doing some performance testing is not only fun, it also helps you better understand how your systems work.

As we just saw, with its basic syntax Get-Content performs poorly:

(Measure-Command {Get-Content .\logfile3.log}).TotalSeconds
229,0962224

Since Powershell v3, Get-Content has a -Raw parameter, which makes it read the text file as a single string, keeping newline characters intact. As you can see below, it is way faster than plain Get-Content:

(Measure-Command {Get-Content .\logfile3.log -Raw}).TotalSeconds
46,4201292
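
Keep in mind that -Raw returns the whole file as one string, not an array of lines, so if you need lines you have to split it yourself, at an additional memory cost. A quick sketch:

$text = Get-Content .\logfile3.log -Raw
($text -split "`r?`n").Count    # -split takes a regex, so this handles both CRLF and LF endings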

Nonetheless it has a huge impact on memory use, since the whole file comes back as one single string.

Happily enough, we can release that data from memory by forcing the Garbage Collector to do its job before continuing our tests:

[GC]::Collect()
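
If you want to see the effect on the managed heap for yourself, here's a quick sketch using the same GC class:

[GC]::Collect()
[GC]::WaitForPendingFinalizers()    # wait for finalizers to run before measuring
[GC]::GetTotalMemory($true) / 1MB    # approximate managed heap size, in MB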

Let's move on to the third option: the -ReadCount parameter. As was shown in the forum discussion I mentioned, this parameter can speed up Get-Content beyond what one might expect, provided you find the right value for it:

(Measure-Command {Get-Content .\logfile3.log -ReadCount 10}).TotalSeconds
33,9024254

(Measure-Command {Get-Content .\logfile3.log -ReadCount 100}).TotalSeconds
11,1052492

(Measure-Command {Get-Content .\logfile3.log -ReadCount 1000}).TotalSeconds
7,9866107

(Measure-Command {Get-Content .\logfile3.log -ReadCount 10000}).TotalSeconds
10,4790464

The -ReadCount parameter speeds things up by controlling the number of records Get-Content writes into the pipeline at a time. In my experience, reading the file in chunks of 1000 or 10000 lines usually gives the best results and has little to no impact on CPU and memory, even for such a large logfile.

The problem with -ReadCount is that it takes a bit of guessing to find out the right value for a specific file, since performance varies not only with the size of the file but also with the size of each record.
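
One more thing to keep in mind: with -ReadCount greater than 1, Get-Content sends arrays of lines down the pipeline instead of single lines, so any downstream filtering has to account for that. A minimal sketch, where 'ERROR' is just an assumed pattern:

Get-Content .\logfile3.log -ReadCount 1000 |
    ForEach-Object { $_ -match 'ERROR' }    # -match against an array filters it, returning only matching lines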

An interesting alternative to Get-Content, and one worth knowing, is the Switch keyword with its -File parameter:

(Measure-Command {Switch -File .\logfile3.log {default {$PSItem}}}).TotalSeconds
10,0711553
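
Switch also accepts a -Regex parameter, which makes it handy for extracting matching lines while the file is being read. A sketch, again with an assumed pattern:

Switch -Regex -File .\logfile3.log {
    'ERROR' { $PSItem }    # emit only the lines matching the pattern
}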

Not bad, is it? Note that the impact of Switch on CPU usage is slightly higher than that of Get-Content -ReadCount.

Let's see now if we can get better results by crossing the unwelcoming bridge to .NET.

First option is the ReadAllText() method:

(Measure-Command {[System.IO.File]::ReadAllText('C:\temp\get-content\logfile3.log')}).TotalSeconds
13,5002564

ReadAllText() is very fast but has a temporary impact on system memory.

Second option is ReadAllLines():

(Measure-Command {[System.IO.File]::ReadAllLines('C:\temp\get-content\logfile3.log')}).TotalSeconds
16,3589887
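
Unlike ReadAllText(), which returns one big string, ReadAllLines() returns a string array, so once the file is loaded you can index into it directly. A quick sketch:

$lines = [System.IO.File]::ReadAllLines('C:\temp\get-content\logfile3.log')
$lines.Count    # total number of lines
$lines[-1]      # last line of the file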

Not as good as ReadAllText(), but it has a slightly less aggressive impact on memory allocation.

Third option is ReadLines(), which differs from ReadAllLines() in that it lets you start enumerating the collection of strings before the whole collection is returned:

(Measure-Command {[System.IO.File]::ReadLines('C:\temp\get-content\logfile3.log')}).TotalSeconds
6,5551735

Impressive result here.
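
Since ReadLines() hands you an enumerator rather than a fully populated array, you can start processing immediately and even stop early without ever reading the rest of the file. A sketch, with an assumed pattern:

foreach ($line in [System.IO.File]::ReadLines('C:\temp\get-content\logfile3.log')) {
    if ($line -match 'ERROR') { $line; break }    # stop at the first matching line
}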

Let's move on to the fourth and last option, OpenText().ReadToEnd():

(Measure-Command {[System.IO.File]::OpenText('C:\temp\get-content\logfile3.log').ReadToEnd()}).TotalSeconds
10,8803161

Great performance but, as with ReadAllText() and ReadAllLines(), a lot of RAM has to be allocated, since these methods read the entire file into memory, and in some cases your code could throw an exception of type System.OutOfMemoryException. Also, as you have probably noticed, when using .NET the file path must be fully qualified: relative paths are resolved against the process working directory rather than your current Powershell location, so the file won't be found.
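
An easy way to avoid the path issue is to let Powershell resolve the full path before handing it over to .NET, as in this sketch:

$fullPath = (Resolve-Path .\logfile3.log).Path
[System.IO.File]::ReadLines($fullPath)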

In conclusion, reading through the file with Get-Content -ReadCount turns out to be the second fastest way to work through a large text file. The fastest option for large files is [System.IO.File]::ReadLines():

  1. [System.IO.File]::ReadLines(): 6 seconds
  2. Get-Content -ReadCount 1000: 7 seconds
  3. Switch -File: 10 seconds
  4. [System.IO.File]::OpenText().ReadToEnd(): 10 seconds
  5. [System.IO.File]::ReadAllText(): 13 seconds
  6. [System.IO.File]::ReadAllLines(): 16 seconds
  7. Get-Content -Raw: 46 seconds
  8. Get-Content: 229 seconds
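
To wrap up, here's a minimal sketch of a reusable helper built on the fastest method; the function name and parameters are my own invention, not a standard cmdlet:

function Read-LargeFile {
    param(
        [Parameter(Mandatory=$true)][string]$Path,
        [string]$Pattern = '.'    # by default, match every line
    )
    $fullPath = (Resolve-Path $Path).Path    # .NET needs a fully qualified path
    foreach ($line in [System.IO.File]::ReadLines($fullPath)) {
        if ($line -match $Pattern) { $line }
    }
}

Read-LargeFile -Path .\logfile3.log -Pattern 'ERROR'
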
If you liked this post, please share it, or leave a comment if your tests gave you different results.

Wednesday, October 1, 2014

Powershell MVP, once again!

I am so honored to say that I have received the Microsoft MVP award for the second year in a row, for my contributions to the Powershell Community and on this very blog.
 
 
It's definitely been a long year, full of intense moments, like in February when I spoke about Powershell in front of several hundred people at the Techdays in Paris with my friend Fabien Dibot, and met the MVP project coordinators Martine and Marjorie for the first time.
 
Or like when I took part in coaching the competing teams during the 2014 Powershell Winter Scripting Games and got to work with brilliant people like Mike F Robbins, Boe Prox, Emin Atac, Jeff Wouters and Jan Egil Ring, just to mention a few.
 
It's always nice to see that the efforts I make to publish good content have once again been noticed and that the Community appreciates them. I will do my best to keep content quality up to the highest standards, so stay tuned for more on Powershell and System Administration!