Why does a PowerShell regexp match filter run at least 400 times slower than an EndsWith filter?

  • Question

  • Anybody can repeat my test, as it is based on reading the system log:

    I decided to improve the script's run time by replacing the filter if ( $s.EndsWith( "MicrosoftAccount" ) -and $s.Contains( "@" ) ) with a regexp filter: if ( $s -match '.+@.+MicrosoftAccount' ). The run time turned out to be more than 400 times worse. Then I tried two more cases, a multi-string regexp and Select-String with a regexp, and both turned out to be even worse. ALL 4 cases work and find the string I search for, but I cannot understand WHY all 3 regexp cases run so slowly. The duration of each case in milliseconds is given at the end of the code. There is also the $PSVersionTable info.

    I know for sure that the line with the MS account e-mail is present in every Setupact.log only once. I spent some time detecting the [Log.Length-16000, Log.Length-10000] region in which it always resides; nearly 250 Setupact.log files were used to detect this region.

    OK, Cases 3 and 4 can be excluded, as they do not break when they find a match and thus evaluate all of the lines. But Case 2 DOES break.

    I replaced the regexp in Case 2 with '\S+@\S+\\MicrosoftAccount' and the search duration decreased to 1750 milliseconds. But that is still far too slow compared with Case 1.

    $buff = Get-Content c:\Windows\Panther\Setupact.log -Tail 16000;
    $sw   = [System.Diagnostics.Stopwatch]::StartNew();
    $email = "";
    # Case 1
    foreach( $s in $buff )
    {
        if ( $s.EndsWith( "MicrosoftAccount" ) -and $s.Contains( "@" ) )
        {
            $email = $s.Split("\")[-2];
            break; 
        }
    }
    $milli = [int]$sw.elapsedmilliseconds;
    Write-Host "Case1, email=$email, continued=$milli";
    $email = "";
    $sw.Stop();
    $sw.Reset();
    $sw   = [System.Diagnostics.Stopwatch]::StartNew();
    # Case 2
    foreach( $s in $buff )
    {
        if ( $s -match '.+@.+MicrosoftAccount' )
        {
            $email = $s.Split("\")[-2];
            break; 
        }
    }
    $milli = [int]$sw.elapsedmilliseconds;
    Write-Host "Case2, email=$email, continued=$milli";
    $email = "";
    $sw.Stop();
    $sw.Reset();
    $sw   = [System.Diagnostics.Stopwatch]::StartNew();
    # Case 3
    $em = $buff -match '(?m).+@.+MicrosoftAccount';
    if ( $em )
    {
       $email = $em[0].Split("\")[-2];   # use the matched line, not the stale loop variable $s
    }
    $milli = [int]$sw.elapsedmilliseconds;
    Write-Host "Case3, email=$email, continued=$milli";

    $email = "";
    $sw.Stop();
    $sw.Reset();
    $sw   = [System.Diagnostics.Stopwatch]::StartNew();
    # Case 4
    $em = $buff | Select-String -Pattern '.+@.+MicrosoftAccount';
    if ( $em.Matches.Success )
    {
        $email = $em.Matches.Value.Split("\")[-2];
    }
    $milli = [int]$sw.elapsedmilliseconds;
    Write-Host "Case4, email=$email, continued=$milli";
    $sw.Stop();
    $sw.Reset();
    <#
    PS E:\History> . .\Test-RegExp
    Case1, email=sysprg@live.ru, continued=6
    Case2, email=sysprg@live.ru, continued=2605
    Case3, email=sysprg@live.ru, continued=9716
    Case4, email=sysprg@live.ru, continued=9708
    PS E:\History>

    PS E:\History> $PSVersionTable
    Name                           Value
    ----                           -----
    PSVersion                      5.1.17760.1
    PSEdition                      Desktop
    PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0...}
    BuildVersion                   10.0.17760.1
    CLRVersion                     4.0.30319.42000
    WSManStackVersion              3.0
    PSRemotingProtocolVersion      2.3
    SerializationVersion           1.1.0.1


    #>







    • Edited by Oleg Kulikov Tuesday, September 18, 2018 9:06 PM
    Tuesday, September 18, 2018 2:51 PM

Answers

All replies

  • Yes, regex is slow. Internally, it uses a fairly complex algorithm to determine how a regular expression has to be evaluated. There are things such as the + sign, which means "one or more of this", and the regex engine has to try the various possible ways of fitting the input to the expression. This requires much more processing than EndsWith, which can be done by a trivial comparison of the characters at the end of the string. It is especially bad if you do it in a loop, because the expression may have to be re-evaluated on each iteration to establish the algorithm to resolve it.
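
As a rough illustration of the "fitting" work described above, the sketch below times the pattern from the question against a variant without the leading .+ on synthetic lines that contain no "@". The synthetic data, the sample line, and the variable names are assumptions made for this example, and the shorter pattern is only assumed to be roughly equivalent to the EndsWith/Contains filter (note that -match is case-insensitive). On the Windows PowerShell 5.1 engine from the question, the leading .+ is likely to force a retry with backtracking from every start position; newer .NET versions optimize this differently, so the gap may be smaller there.

```powershell
# Synthetic stand-in for log lines: fairly long, no "@", so every match attempt fails.
$line  = 'x' * 200
$lines = @($line) * 300

$slowPattern = '.+@.+MicrosoftAccount'   # pattern from the question
$fastPattern = '@.*MicrosoftAccount$'    # assumed rough equivalent of EndsWith + Contains("@")

# Time both patterns over the same lines.
$slowMs = (Measure-Command {
    foreach ($s in $lines) { $null = $s -match $slowPattern }
}).TotalMilliseconds

$fastMs = (Measure-Command {
    foreach ($s in $lines) { $null = $s -match $fastPattern }
}).TotalMilliseconds

# Hypothetical line shaped like the one the question's Split("\")[-2] expects.
$sample    = 'C:\hypothetical\path\user@example.com\MicrosoftAccount'
$slowFinds = $sample -match $slowPattern
$fastFinds = $sample -match $fastPattern

Write-Host "leading .+ : $slowMs ms, without: $fastMs ms"
Write-Host "both match the sample line: $($slowFinds -and $fastFinds)"
```

The absolute numbers depend on the machine and the engine version, so the sketch prints them rather than assuming a particular ratio.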

    One way to improve it is to use the .NET Regex class, i.e., [System.Text.RegularExpressions.Regex]. You can create an instance of a Regex object before starting the foreach loop and set its option for compiling the expression (sorry, you'll have to look it up in the docs; I don't remember the exact syntax right now). This will be done only once, and then the loop iterations will be much faster. But I'm pretty sure that it will still be significantly slower than EndsWith.
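
A minimal sketch of that suggestion, assuming the option meant is RegexOptions.Compiled. The contents of $buff are hypothetical stand-ins for the log lines, and IgnoreCase is added only to mirror the case-insensitive default of -match. Whether this helps in practice would have to be measured.

```powershell
# Hypothetical stand-in for the lines read from Setupact.log.
$buff = @(
    'some unrelated line',
    'another line with user@example.com in the middle',
    'C:\hypothetical\path\user@example.com\MicrosoftAccount',
    'trailing line'
)

# Build the Regex object once, before the loop.
# Compiled asks .NET to compile the pattern; IgnoreCase mirrors the -match default.
$options = [System.Text.RegularExpressions.RegexOptions]'Compiled, IgnoreCase'
$regex   = [System.Text.RegularExpressions.Regex]::new('.+@.+MicrosoftAccount', $options)

$email = ''
foreach ($s in $buff) {
    # IsMatch only tests for a match; no $Matches table is populated.
    if ($regex.IsMatch($s)) {
        $email = $s.Split('\')[-2]
        break
    }
}
Write-Host "email=$email"
```

The ::new() syntax needs PowerShell 5.0 or later; on older versions New-Object with the same two arguments would be the equivalent.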

    Saturday, September 22, 2018 7:57 AM
  • Many thanks. I had thought about using the .NET Regex, but the example I demonstrated is only one of the TENS of comparisons I use in the real code. It would be better to rewrite all of the code in C#. OK, got it: PowerShell StartsWith and EndsWith are really very efficient, and all of my attempts to improve the real code by introducing regexps have failed. The same goes for "val1,val2,..valn".Contains(input), which is very efficient if I need to compare a fixed-length substring of an input line with a given set of values.
    Saturday, September 22, 2018 8:07 PM
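
The "val1,val2,..valn".Contains(input) lookup mentioned above could look roughly like the sketch below. The function name Test-KnownCode, the three-letter codes, and the sample lines are all hypothetical. Because Contains is a plain substring test, a candidate that spans the separator (such as "R,W") would also be found, so the sketch rejects candidates containing a comma.

```powershell
function Test-KnownCode {
    param(
        [string]$Line,
        [string]$Known  = 'ERR,WRN,INF',   # hypothetical values, all the same length
        [int]   $Length = 3
    )
    # Lines shorter than the fixed length cannot carry a code.
    if ($Line.Length -lt $Length) { return $false }

    # Take the fixed-length prefix of the input line.
    $code = $Line.Substring(0, $Length)

    # Contains is an ordinal, case-sensitive substring test;
    # the comma check guards against matches that straddle two values.
    return (-not $code.Contains(',')) -and $Known.Contains($code)
}

Write-Host (Test-KnownCode -Line 'WRN disk almost full')
Write-Host (Test-KnownCode -Line 'DBG verbose detail')
```

This only stays reliable while every value has the same length as the extracted substring; with values of mixed length, a partial value would also be reported as known.
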
  • Off topic for this forum.

    Danny

    Wednesday, October 31, 2018 1:00 PM