Wednesday, October 22, 2014

Reading large text files with Powershell

Any sysadmin out there knows that log files are an invaluable asset for troubleshooting issues on their servers. Sometimes, though, these kinds of files can be very large and become difficult to handle, as I had occasion to notice in a Powershell forum discussion a week ago.

The thing to understand is that the right tool to use in Powershell is determined by the kind of file you are dealing with.

Here are the options you have for reading text files:

  • Get-Content
  • Get-Content -Raw
  • Get-Content -ReadCount
  • Switch -File
  • [System.IO.File]::ReadAllText()
  • [System.IO.File]::ReadAllLines()
  • [System.IO.File]::ReadLines()
  • [System.IO.File]::OpenText().ReadToEnd()
 
Get-Content is your out-of-the-box option. It works well with small log files and its use is pretty straightforward, but on large log files it can be very slow.

Get-Content .\small.csv
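
Since Get-Content sends each line down the pipeline as a separate string, you can filter and count on the fly. Here's a minimal sketch (the file name and the 'error' pattern are just examples):

# Each line travels down the pipeline as its own string object,
# so standard cmdlets can filter and count them on the fly.
Get-Content .\small.csv |
    Where-Object { $_ -match 'error' } |
    Measure-Object |
    Select-Object -ExpandProperty Count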

Let's suppose we have four logfiles of different sizes and line counts:

- logfile0.log, a small logfile: 4 lines, 1KB
- logfile1.log, a medium-sized logfile: 700K lines, 160MB
- logfile2.log, a large logfile: 300K lines, 1.2GB
- logfile3.log, a logfile with more than a million lines: 1.4M lines, 650MB

Let's see what Get-Content can do against these files:

(Measure-Command {Get-Content .\logfile0.log}).TotalSeconds
0,0013753

(Measure-Command {Get-Content .\logfile1.log}).TotalSeconds
12,7561849

(Measure-Command {Get-Content .\logfile2.log}).TotalSeconds
11,9634649

(Measure-Command {Get-Content .\logfile3.log}).TotalSeconds
229,0962224

Pretty swift on the smaller ones, but way too slow on that last file, which, with its more than one million lines, becomes difficult to deal with.

Let's see then how the other methods I listed above perform against it. Doing some performance testing is not only fun, it also helps you better understand how your systems work.

As we just saw, with its basic syntax Get-Content performs poorly:

(Measure-Command {Get-Content .\logfile3.log}).TotalSeconds
229,0962224

Since Powershell v3, Get-Content has a -Raw parameter, which makes it read the whole file as a single string, keeping the newline characters intact. As you can see below, it is way faster than plain Get-Content:

(Measure-Command {Get-Content .\logfile3.log -Raw}).TotalSeconds
46,4201292
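
The difference is easy to see: without -Raw you get back an array of strings, one per line, while with -Raw you get one single string. A quick check on the small logfile (just a sketch):

# Without -Raw: an array of strings, one element per line.
(Get-Content .\logfile0.log).GetType().Name        # Object[]
(Get-Content .\logfile0.log).Count                 # number of lines

# With -Raw: a single string with the newlines kept inside it.
(Get-Content .\logfile0.log -Raw).GetType().Name   # String
(Get-Content .\logfile0.log -Raw).Length           # number of characters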

Nonetheless it has a huge impact on your memory use.

Happily enough, we can release that memory by forcing the Garbage Collector to do its job before continuing our tests:

[GC]::Collect()
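
If you want to see the effect, you can ask the Garbage Collector how much managed memory the session is holding before and after the collection. This is only a rough indication, since the figure covers the whole managed heap of the Powershell process, but it is enough for our tests:

# Managed memory currently held by this Powershell session, in MB.
'{0:N0} MB' -f ([System.GC]::GetTotalMemory($false) / 1MB)

# Force a full collection, then look again.
[GC]::Collect()
'{0:N0} MB' -f ([System.GC]::GetTotalMemory($false) / 1MB)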

Let's move on to the third option: the -ReadCount parameter. As was shown in the forum discussion I mentioned, this parameter can speed up Get-Content beyond what one could expect, provided you find the right value for it:

(Measure-Command {Get-Content .\logfile3.log -ReadCount 10}).TotalSeconds
33,9024254

(Measure-Command {Get-Content .\logfile3.log -ReadCount 100}).TotalSeconds
11,1052492

(Measure-Command {Get-Content .\logfile3.log -ReadCount 1000}).TotalSeconds
7,9866107

(Measure-Command {Get-Content .\logfile3.log -ReadCount 10000}).TotalSeconds
10,4790464

The -ReadCount parameter speeds up the process by controlling the number of lines Get-Content writes into the pipeline at a time. In my experience, reading the file in chunks of 1000 or 10000 lines usually gives the best results and has little to no impact on CPU and memory, even for such a large logfile.

The problem with -ReadCount is that it takes a bit of guessing to find out the right value for a specific file, since performance varies not only with the size of the file but also with the size of each record.
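
Also keep in mind that with -ReadCount each object coming down the pipeline is an array of lines, not a single line, so your processing code has to treat it as a collection. Here's a minimal sketch (the 'error' pattern is just an example):

# With -ReadCount 1000 each pipeline object is an array of up to 1000 lines.
$errors = 0
Get-Content .\logfile3.log -ReadCount 1000 | ForEach-Object {
    $errors += ($_ -match 'error').Count   # -match on an array returns the matching elements
}
$errors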

An interesting alternative to Get-Content that you should know about is the Switch statement with its -File parameter:

(Measure-Command {Switch -File .\logfile3.log {default {$PSItem}}}).TotalSeconds
10,0711553

Not bad, is it? Note that the impact of Switch on CPU usage is slightly higher than that of Get-Content -ReadCount.
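
What makes Switch interesting is that you can filter the lines while you read them, for instance with the -Regex parameter. A minimal sketch (the 'error' pattern is just an example):

# Collect only the lines matching a pattern while streaming through the file.
$errors = Switch -Regex -File .\logfile3.log {
    'error' { $PSItem }
}
$errors.Count
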
Let's see now if we can get better results by crossing the unwelcoming bridge to .NET.

The first option is the ReadAllText() method:

(Measure-Command {[System.IO.File]::ReadAllText('C:\temp\get-content\logfile3.log')}).TotalSeconds
13,5002564

ReadAllText() is very fast but has a temporary impact on system memory.
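
Keep in mind that ReadAllText() hands you the whole file back as one single string, so if you need lines you have to split it yourself. A minimal sketch:

# Read the whole file as one string, then split it into lines on CR/LF.
$text  = [System.IO.File]::ReadAllText('C:\temp\get-content\logfile3.log')
$lines = $text -split "`r?`n"
$lines.Count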

The second option is ReadAllLines():

(Measure-Command {[System.IO.File]::ReadAllLines('C:\temp\get-content\logfile3.log')}).TotalSeconds
16,3589887

Not as good as ReadAllText(), but it has a slightly less aggressive impact on memory allocation.
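
Since ReadAllLines() returns a plain string array, once the file is loaded you can count and index the lines directly. A minimal sketch:

# The whole file lands in memory as a string array you can index at will.
$lines = [System.IO.File]::ReadAllLines('C:\temp\get-content\logfile3.log')
$lines.Count     # total number of lines
$lines[0]        # first line
$lines[-1]       # last line
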
The third option is ReadLines(), which differs from ReadAllLines() in that it allows you to start enumerating the collection of strings before the whole collection is returned:

(Measure-Command {[System.IO.File]::ReadLines('C:\temp\get-content\logfile3.log')}).TotalSeconds
6,5551735

Impressive result here.
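
Because ReadLines() streams the file, you can start working on the first lines immediately and even stop before reaching the end. A minimal sketch (the 'error' pattern and the limit of 10 matches are just examples):

# Stream the file and stop as soon as the first 10 matching lines are found.
$hits = 0
foreach ($line in [System.IO.File]::ReadLines('C:\temp\get-content\logfile3.log')) {
    if ($line -match 'error') {
        $line      # output the matching line
        $hits++
        if ($hits -ge 10) { break }
    }
}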

Let's move to the fourth and last option, which is OpenText().ReadToEnd():

(Measure-Command {[System.IO.File]::OpenText('C:\temp\get-content\logfile3.log').ReadToEnd()}).TotalSeconds
10,8803161

Great performance, but, just like ReadAllText() and ReadAllLines(), this method has to allocate a lot of RAM, since it reads the entire file into memory, and in some cases your code could throw a System.OutOfMemoryException. Also, as you have probably noticed, when using these .NET methods the file path must be fully qualified, otherwise Powershell won't be able to find it: .NET resolves relative paths against the process working directory, which is not necessarily your current Powershell location.
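
A simple way to stay on the safe side is to resolve the path yourself before handing it to .NET, for example with Convert-Path (just a sketch):

# Resolve the relative path to a full filesystem path before calling .NET.
$fullPath = Convert-Path .\logfile3.log
[System.IO.File]::ReadLines($fullPath) | Select-Object -First 5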

In conclusion, reading the file in chunks with Get-Content -ReadCount turns out to be the second fastest way to go through a large text file, while the fastest option for large files is [System.IO.File]::ReadLines(). Here's the final ranking, followed by a small sketch for benchmarking the candidates on your own files:

  1. [System.IO.File]::ReadLines(): 6 seconds
  2. Get-Content -ReadCount 1000: 7 seconds
  3. Switch -File: 10 seconds
  4. [System.IO.File]::OpenText().ReadToEnd(): 10 seconds
  5. [System.IO.File]::ReadAllText(): 13 seconds
  6. [System.IO.File]::ReadAllLines(): 16 seconds
  7. Get-Content -Raw: 46 seconds
  8. Get-Content: 229 seconds
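
And here is the benchmarking sketch I promised, which you can adapt to compare the top candidates on your own files (the file name is just an example):

# Compare the two fastest approaches on your own file before picking one.
$file  = Convert-Path .\logfile3.log
$tests = [ordered]@{
    '[System.IO.File]::ReadLines()' = { [System.IO.File]::ReadLines($file) | Out-Null }
    'Get-Content -ReadCount 1000'   = { Get-Content $file -ReadCount 1000 | Out-Null }
}
foreach ($name in $tests.Keys) {
    '{0}: {1:N1} seconds' -f $name, (Measure-Command $tests[$name]).TotalSeconds
}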

If you liked this post, please share, or leave a comment if your tests gave you different results.

8 comments:

  1. How about Get-Content -ReadCount 0? It always seems to work faster for me. Interested in seeing your comparison.

    1. Hi DJ,

      (Measure-Command {Get-Content .\logfile3.log -ReadCount 0}).TotalSeconds
      20,0709778

      ReadCount set to 0 makes Get-Content send all the lines down the pipeline as a single array, so by design it is too memory expensive against such a large file compared to ReadCount values between 1000 and 10000.

      Carlo

  2. Carlo, have you tried reading from a stream in dotnet? Rob

    1. Nice suggestion, Rob, but the time it takes is still well beyond what I could achieve with Get-Content -ReadCount:

      PS C:\> Measure-Command {
      >>
      >> $i = 0
      >> $streamreader = New-Object IO.StreamReader 'C:\logfile3.log'
      >> while($streamreader.ReadLine() -ne $null){ $i++ }
      >> $streamreader.Close()
      >> $i
      >>
      >> }
      >>


      Days : 0
      Hours : 0
      Minutes : 0
      Seconds : 46
      Milliseconds : 812
      Ticks : 468120065
      TotalDays : 0,000541805630787037
      TotalHours : 0,0130033351388889
      TotalMinutes : 0,780200108333333
      TotalSeconds : 46,8120065
      TotalMilliseconds : 46812,0065

  3. Nice analysis Carlo, I'm trying to create a PS script to back up a folder to the correct USB memory sticks by checking their volume label, but I'm having difficulty figuring it out.

  4. Hi Carlo,

    Great post here! I have a question: what if I have a very large log file that grows continually and I want to do a point-check every hour, reading only a section of the log? In other words, break the log file into n sections, and at Hour 0 read section 0; at Hour 1 read section 1.
    Do you have any suggestion on how to achieve this without running into the performance issue? From what I have searched so far, I can use Get-Content -Path $path -Tail 10000, but I have to take a guess and test.

    Thank you for any help!

