Wednesday, October 22, 2014

Reading large text files with Powershell

Any sysadmin out there knows that log files are an invaluable asset for troubleshooting issues on their servers. Sometimes, though, these kinds of files can be very large and become difficult to handle, as I had occasion to notice in a Powershell forum discussion a week ago.

The thing to understand is that the right tool to use in Powershell depends on the kind of file you are dealing with.

Here's the bunch of options you have for reading text files:

  • Get-Content
  • Get-Content -Raw
  • Get-Content -ReadCount
  • Switch -File
  • [System.IO.File]::ReadAllText()
  • [System.IO.File]::ReadAllLines()
  • [System.IO.File]::ReadLines()
  • [System.IO.File]::OpenText().ReadToEnd()
Get-Content is your out-of-the-box option. It works well with small log files and its use is pretty straightforward, but on large log files it can be very slow.

Get-Content .\small.csv

Let's suppose we have four logfiles of different size and number of lines:

- logfile0.log, a small logfile: 4 lines and 1KB size
- logfile1.log, a medium size logfile: 700K lines and 160MB size
- logfile2.log, a large size logfile: 300K lines and 1.2GB size
- logfile3.log, a logfile containing more than a million lines: 1.4M lines and 650MB size

Let's see what Get-Content can do against these files:

(Measure-Command {Get-Content .\logfile0.log}).TotalSeconds

(Measure-Command {Get-Content .\logfile1.log}).TotalSeconds

(Measure-Command {Get-Content .\logfile2.log}).TotalSeconds

(Measure-Command {Get-Content .\logfile3.log}).TotalSeconds

Pretty swift on the smaller ones, but way too slow on that last large file, which, with its over one million lines, becomes difficult to deal with.

Let's see then how the other methods I listed above perform against it. Doing some performance testing is not only fun, it also helps you better understand how your systems work.

As we just saw, with its basic syntax Get-Content performs poorly:

(Measure-Command {Get-Content .\logfile3.log}).TotalSeconds

Since Powershell v3, Get-Content has a -Raw parameter, which makes it read the file as a single string, keeping the newline characters intact. As you can see below, it is way faster than plain Get-Content:

(Measure-Command {Get-Content .\logfile3.log -Raw}).TotalSeconds

Nonetheless it has a huge impact on your memory use:

Happily enough, we can release that data from memory by forcing the garbage collector to do its job before continuing our tests:
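A minimal sketch of that cleanup, using an in-memory byte array as a stand-in for a file read into a variable (the variable names are illustrative):

```powershell
# Allocate ~100MB as a stand-in for a large file read into a variable.
$data = New-Object byte[] (100MB)
$before = [System.GC]::GetTotalMemory($false)

# Drop the only reference, then force a collection and wait for finalizers.
Remove-Variable data
[System.GC]::Collect()
[System.GC]::WaitForPendingFinalizers()

$after = [System.GC]::GetTotalMemory($true)
"Managed heap went from {0:N0} to {1:N0} bytes" -f $before, $after
```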


Let's move on to the third option: the -ReadCount parameter. As was shown in the forum discussion I mentioned, this parameter can speed up Get-Content beyond what one might expect, provided you find the right value for it:

(Measure-Command {Get-Content .\logfile3.log -ReadCount 10}).TotalSeconds

(Measure-Command {Get-Content .\logfile3.log -ReadCount 100}).TotalSeconds

(Measure-Command {Get-Content .\logfile3.log -ReadCount 1000}).TotalSeconds

(Measure-Command {Get-Content .\logfile3.log -ReadCount 10000}).TotalSeconds

The -ReadCount parameter speeds up the process by controlling the number of records Get-Content writes into the pipeline at a time. In my experience, reading the file in chunks of 1000 or 10000 lines usually gives the best results and has little to no impact on CPU and memory, even for such a large logfile.
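To actually consume those chunks, remember that each pipeline object is now an array of lines, so you enumerate it inside ForEach-Object. A sketch against a generated sample file (the file contents are made up for the example):

```powershell
# Build a 5000-line sample log; every 100th line is an ERROR.
$log = [System.IO.Path]::GetTempFileName()
1..5000 | ForEach-Object {
    "line $_ " + $(if ($_ % 100 -eq 0) { 'ERROR' } else { 'INFO' })
} | Set-Content $log

# With -ReadCount 1000, each pipeline object is an array of up to 1000 lines.
$errors = Get-Content $log -ReadCount 1000 | ForEach-Object {
    $_ | Where-Object { $_ -match 'ERROR' }
}
$errors.Count   # 50
```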

The problem with -ReadCount is that it takes some guessing to find the right value for a specific file, since performance varies not only with the size of the file but also with the size of each record.
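One way to take the guesswork out is to time a few candidate values in a loop against the file at hand. A sketch, where a generated sample file stands in for your own large log:

```powershell
# Generate a modest sample file; point $log at your own large log instead.
$log = [System.IO.Path]::GetTempFileName()
1..20000 | ForEach-Object { "2014-10-22 12:00:00 INFO message $_" } | Set-Content $log

# Time each candidate -ReadCount value.
foreach ($rc in 10, 100, 1000, 10000) {
    $seconds = (Measure-Command { Get-Content $log -ReadCount $rc | Out-Null }).TotalSeconds
    "ReadCount {0,5}: {1:N2} seconds" -f $rc, $seconds
}
```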

An interesting alternative to Get-Content, and one you should know about, is the Switch keyword with its -File parameter:

(Measure-Command {Switch -File .\logfile3.log {default {$PSItem}}}).TotalSeconds

Not bad, is it? Note that the impact of Switch on CPU usage is slightly higher than that of Get-Content -ReadCount:
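Switch becomes even more interesting when you need to filter while reading, because it also accepts -Regex, which makes each case label a regular expression tested against the current line. A sketch with a small generated file:

```powershell
# Build a tiny sample log.
$log = [System.IO.Path]::GetTempFileName()
@('INFO start', 'ERROR disk full', 'INFO done', 'ERROR timeout') | Set-Content $log

# With -Regex, '^ERROR' matches any line starting with ERROR; $_ is the line.
$errors = Switch -Regex -File $log {
    '^ERROR' { $_ }
}
$errors.Count   # 2
```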

Let's see now whether we can get better results by crossing the unwelcoming bridge to .NET.

First option is the ReadAllText() method:

(Measure-Command {[System.IO.File]::ReadAllText('C:\temp\get-content\logfile3.log')}).TotalSeconds

ReadAllText() is very fast but has a temporary impact on system memory.

Second option is ReadAllLines():

(Measure-Command {[System.IO.File]::ReadAllLines('C:\temp\get-content\logfile3.log')}).TotalSeconds

Not as good as ReadAllText(), but it has a slightly less aggressive impact on memory allocation:

Third option is ReadLines(), which differs from ReadAllLines() in that it lets you start enumerating the collection of strings before the whole collection is returned:

(Measure-Command {[System.IO.File]::ReadLines('C:\temp\get-content\logfile3.log')}).TotalSeconds

Impressive result here.
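That laziness also means you can stop as soon as you have found what you were looking for, without paying for the rest of the file. A sketch with a generated sample:

```powershell
# Generate a 100000-line sample file.
$log = [System.IO.Path]::GetTempFileName()
1..100000 | ForEach-Object { "line $_" } | Set-Content $log

# ReadLines() streams, so we can break on the first match and skip the rest.
$found = $null
foreach ($line in [System.IO.File]::ReadLines($log)) {
    if ($line -eq 'line 5') { $found = $line; break }
}
$found   # line 5
```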

Let's move to the fourth and last option, which is OpenText().ReadToEnd():

(Measure-Command {[System.IO.File]::OpenText('C:\temp\get-content\logfile3.log').ReadToEnd()}).TotalSeconds

Great performance but, as with the other two .NET methods, a lot of RAM has to be allocated, since they read the entire file into memory; in some cases your code could throw an exception of type System.OutOfMemoryException. Also, as you will have noticed, when using .NET the file path must be fully specified, or the method won't be able to find the file.
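The reason is that .NET resolves relative paths against the process working directory, which is not necessarily PowerShell's current location. If you prefer relative paths, Convert-Path can expand them first; a sketch with a generated sample file:

```powershell
# Create a sample file and move to its folder so a relative path makes sense.
$log = [System.IO.Path]::GetTempFileName()
'hello' | Set-Content $log
Set-Location (Split-Path $log)
$relative = Join-Path '.' (Split-Path $log -Leaf)

# Convert-Path turns the PowerShell-relative path into the full path .NET needs.
$text = [System.IO.File]::ReadAllText((Convert-Path $relative))
$text.TrimEnd()   # hello
```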

In conclusion, moving through the file with Get-Content -ReadCount is the second fastest way to work through a large text file. The fastest option for large files is [System.IO.File]::ReadLines():

  1. [System.IO.File]::ReadLines(): 6 seconds
  2. Get-Content -ReadCount 1000: 7 seconds
  3. Switch -File: 10 seconds
  4. [System.IO.File]::OpenText().ReadToEnd(): 10 seconds
  5. [System.IO.File]::ReadAllText(): 13 seconds
  6. [System.IO.File]::ReadAllLines(): 16 seconds
  7. Get-Content -Raw: 46 seconds
  8. Get-Content: 229 seconds
If you liked this post, please share, or leave a comment if your tests gave you different results.


  1. How about Get-Content -ReadCount 0? It always seems to work faster for me. Interested in seeing your comparison.

    1. Hi DJ,

      (Measure-Command {Get-Content .\logfile3.log -ReadCount 0}).TotalSeconds

      ReadCount set to 0 is by design too memory-expensive against such a large file when compared to ReadCount values between 1000 and 10000.
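For the record, a sketch of why -ReadCount 0 behaves that way: it writes the whole file to the pipeline as one single array, so everything must be held in memory at once (shown here against a small generated file):

```powershell
# Build a 100-line sample file.
$log = [System.IO.Path]::GetTempFileName()
1..100 | ForEach-Object { "line $_" } | Set-Content $log

# -ReadCount 0 emits a single pipeline object: one array holding every line.
$objects = Get-Content $log -ReadCount 0 | Measure-Object
$objects.Count                          # 1
(Get-Content $log -ReadCount 0).Count   # 100
```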


  2. Carlo, have you tried reading from a stream in dotnet? Rob

    1. Nice suggestion, Rob, but the elapsed time still goes well beyond what I could achieve with Get-Content -ReadCount:

      PS C:\> Measure-Command {
      >> $i = 0
      >> $streamreader = New-Object IO.StreamReader 'C:\logfile3.log'
      >> while($streamreader.ReadLine() -ne $null){ $i++ }
      >> $streamreader.Close()
      >> $i
      >> }

      Days : 0
      Hours : 0
      Minutes : 0
      Seconds : 46
      Milliseconds : 812
      Ticks : 468120065
      TotalDays : 0,000541805630787037
      TotalHours : 0,0130033351388889
      TotalMinutes : 0,780200108333333
      TotalSeconds : 46,8120065
      TotalMilliseconds : 46812,0065

  3. Nice analysis Carlo, I'm trying to create a PS script to back up a folder to the correct USB memory sticks by checking their volume label, but I'm having difficulty figuring this out.

  4. Hi Carlo,

    Great post here! I have a question: what if I have a very large log file that grows continually, and I want to do a point-check every hour and only read a section of the log? In other words, break the log file into n sections, and at Hour 0, read section 0; at Hour 1, read section 1.
    Do you have any suggestions for achieving this without running into the performance issue? From what I have found so far, I can use Get-Content -Path $path -Tail 10000, but I have to take a guess and test.

    Thank you for any help!

  5. Nice thorough job on the different options and timing it, but I miss the string search bit.
    Seems a bit theoretical merely looking at how fast you can get through a logfile without actually looking for anything.

    What you usually need is something like
    Get-Content C:\temp\logfile.log | Where-Object{$_ -match 'ERROR'} | ForEach-Object{($_ -split "\s+")[7]}|Out-File c:\temp\errorlines.log

  6. This comment has been removed by the author.

  7. Great job, that's what I need: using Powershell to compare files of more than 5GB line by line.

  8. Gosh, great job. Exactly what I need to compare 6gb files line by line through powershell.

  9. How do you sort a huge 30-million-line CSV file with Powershell?

  10. Great post! Thanks a lot, I converted huge (1.3 GB) UCS-2 encoded text file to ANSI on the fly, using your method:

    Get-Content .\file.log -Raw -ReadCount 10000 | Out-File -Encoding ascii .\file-ansi.log

    It took less than 10 seconds!

