
Performance Issues with Code

Posted: 04 Aug 2020 07:41
by SIMMS7400
Hi Folks -

A few months back, Antonio helped me with a chunk of code that determines the "min" and "max" month in a data file. I expanded it a bit to accept months in the following formats:
1, 01, Jan and January
It's working fine. However, performance is not so great: 1600 lines take a few minutes to run, and I have data files that are 40k rows at times. Is there any way to speed up this code? I assume the call to the string length function slows it down quite a bit?

Sample file:
"Years"|"Period"|"Scenario"|"Version"|"Plan Element"|"Account"|"Entity"|"Funding"|"State"|"Segment"|"Department"|"Product"|"Amount"
"2020"|"Jan"|"Actual"|"PreAlloc"|"TLoad"|"7020"|"150"|"No Fund"|"No State"|"No Segment"|"150ADCO152"|"No Product"|"100"
"2020"|"Jan"|"Actual"|"PreAlloc"|"TLoad"|"7140"|"150"|"No Fund"|"No State"|"No Segment"|"150ADCO152"|"No Product"|"100"
"2020"|"Jan"|"Actual"|"PreAlloc"|"TLoad"|"7750"|"150"|"No Fund"|"No State"|"No Segment"|"150ADCO152"|"No Product"|"100"
"2020"|"Jan"|"Actual"|"PreAlloc"|"TLoad"|"6010"|"100"|"No Fund"|"No State"|"No Segment"|"100HCCN550"|"No Product"|"100"
"2020"|"Jan"|"Actual"|"PreAlloc"|"TLoad"|"6010"|"100"|"No Fund"|"No State"|"No Segment"|"100HCCO100"|"No Product"|"100"
"2020"|"Jan"|"Actual"|"PreAlloc"|"TLoad"|"6010"|"100"|"No Fund"|"No State"|"No Segment"|"100HCCO101"|"No Product"|"100"
"2020"|"Jan"|"Actual"|"PreAlloc"|"TLoad"|"6010"|"100"|"No Fund"|"No State"|"No Segment"|"100HCCO105"|"No Product"|"100"
"2020"|"Jan"|"Actual"|"PreAlloc"|"TLoad"|"6010"|"100"|"No Fund"|"No State"|"No Segment"|"100HCCO150"|"No Product"|"100"
"2020"|"Feb"|"Actual"|"PreAlloc"|"TLoad"|"6010"|"100"|"No Fund"|"No State"|"No Segment"|"100HCCO154"|"No Product"|"100"
Code:

Code: Select all

@ECHO OFF
SETLOCAL ENABLEDELAYEDEXPANSION
		SET /A "MAXM=-99999", "MINM=99999"
		FOR /F "skip=1 USEBACKQ tokens=1-2 delims=|" %%a IN ("test.txt") DO ( 
            SET "MONTH=%%~b"
            ECHO "!MONTH!"| FINDSTR /r "^[1-9][0-9]*$">NUL || (
                SET "MONTH=!MONTH:~0,3!"
                IF "!MONTH!"=="Jan" SET "MONTH=01"
                IF "!MONTH!"=="Feb" SET "MONTH=02"
                IF "!MONTH!"=="Mar" SET "MONTH=03"
                IF "!MONTH!"=="Apr" SET "MONTH=04"
                IF "!MONTH!"=="May" SET "MONTH=05"
                IF "!MONTH!"=="Jun" SET "MONTH=06"
                IF "!MONTH!"=="Jul" SET "MONTH=07"
                IF "!MONTH!"=="Aug" SET "MONTH=08"
                IF "!MONTH!"=="Sep" SET "MONTH=09"
                IF "!MONTH!"=="Oct" SET "MONTH=10"
                IF "!MONTH!"=="Nov" SET "MONTH=11"
                IF "!MONTH!"=="Dec" SET "MONTH=12"
            ) && (
                CALL :STRLEN RESULT MONTH
                IF "!RESULT!"=="1" SET "MONTH=0%%~b"
            )

            IF !MONTH! GTR !MAXM! SET "MAXM=!MONTH!"
            IF !MONTH! LSS !MINM! SET "MINM=!MONTH!"
		)
        
        echo %MINM%
        echo %MAXM%
        pause
        
        
 :STRLEN <resultVar> <stringVar>
(   
    SET "S=!%~2!#"
    SET "LEN=0"
    FOR %%P IN (4096 2048 1024 512 256 128 64 32 16 8 4 2 1) DO (
        IF "!S:~%%P,1!" NEQ "" ( 
            SET /a "LEN+=%%P"
            SET "S=!S:~%%P!"
        )
    )
)
( 
    ENDLOCAL
    SET "%~1=%LEN%"
    EXIT /B
)

Re: Performance Issues with Code

Posted: 04 Aug 2020 08:14
by ShadowThief
Since you're only checking to see if the string has one character, you can replace

Code: Select all

CALL :STRLEN RESULT MONTH
IF "!RESULT!"=="1" SET "MONTH=0%%~b"
with

Code: Select all

if "!MONTH:~1,1!"=="" set "MONTH=0%%~b"
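
For anyone wondering why this works: with delayed expansion, !MONTH:~1,1! expands to the second character of the value, and to an empty string when there is no second character. A minimal sketch, using made-up sample values (7, 07 and 12 are only for illustration):

```batch
@echo off
setlocal EnableDelayedExpansion

rem Hypothetical sample values. Only the first one is a single character.
rem The substring starting at offset 1 is empty for a one-character value,
rem so the comparison with an empty string detects it without a length call.
for %%M in (7 07 12) do (
    set "MONTH=%%M"
    if "!MONTH:~1,1!"=="" set "MONTH=0!MONTH!"
    echo %%M becomes !MONTH!
)
```

Only the single-character value gets a leading zero; the two-character values pass through unchanged.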

Re: Performance Issues with Code

Posted: 04 Aug 2020 08:51
by ShadowThief
Or, if you don't mind adding a lookup table with a lot of entries, you can manually set the value for each possible month string (on my machine, this can process half a million lines in 26 seconds).

Code: Select all

@echo off
SETLOCAL ENABLEDELAYEDEXPANSION
echo %TIME%

set /a "month_val[January]=1",   "month_val[Jan]=1",  "month_val[1]=1", "month_val[01]=1"
set /a "month_val[February]=2",  "month_val[Feb]=2",  "month_val[2]=2", "month_val[02]=2"
set /a "month_val[March]=3",     "month_val[Mar]=3",  "month_val[3]=3", "month_val[03]=3"
set /a "month_val[April]=4",     "month_val[Apr]=4",  "month_val[4]=4", "month_val[04]=4"
set /a "month_val[May]=5",                            "month_val[5]=5", "month_val[05]=5"
set /a "month_val[June]=6",      "month_val[Jun]=6",  "month_val[6]=6", "month_val[06]=6"
set /a "month_val[July]=7",      "month_val[Jul]=7",  "month_val[7]=7", "month_val[07]=7"
set /a "month_val[August]=8",    "month_val[Aug]=8",  "month_val[8]=8", "month_val[08]=8"
set /a "month_val[September]=9", "month_val[Sep]=9",  "month_val[9]=9", "month_val[09]=9"
set /a "month_val[October]=10",  "month_val[Oct]=10",                   "month_val[10]=10"
set /a "month_val[November]=11", "month_val[Nov]=11",                   "month_val[11]=11"
set /a "month_val[December]=12", "month_val[Dec]=12",                   "month_val[12]=12"

SET /A "MAXM=-99999", "MINM=99999"
FOR /F "skip=1 USEBACKQ tokens=1-2 delims=|" %%a IN ("test.txt") DO ( 
    SET "MONTH=!month_val[%%~b]!"
    IF !MONTH! GTR !MAXM! SET "MAXM=!MONTH!"
    IF !MONTH! LSS !MINM! SET "MINM=!MONTH!"
)

echo %MINM%
echo %MAXM%
echo %TIME%
exit /b

Re: Performance Issues with Code

Posted: 04 Aug 2020 10:13
by SIMMS7400
Shadow -

Holy Shit** - this is an incredible performance increase!!! It just took a process from ~2 hours down to 8 seconds, wow! Thank you!

One thing is, I copied your code before you made your second edit, adding the SET /A logic. Prior to your edit, "Jan" in the file was returning as "01" which is correct. However now it's returning as "1", which poses an issue for me.

Is there a way to account for that in your new version, or should I use the logic you provided earlier to ensure the values are two digits:

Code: Select all

if "!MONTH:~1,1!"=="" set "MONTH=0%%~b"
Thank you again Shadow, this is awesome!!

Re: Performance Issues with Code

Posted: 04 Aug 2020 10:35
by ShadowThief
I had to do that because SET /A doesn't allow 08 or 09: a number prefixed with 0 is interpreted as octal, and octal numbers can't contain the digits 8 or 9.
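
A quick way to see the octal behaviour for yourself. This is only a sketch; the variable names n and raw are made up for the demonstration, and the workaround at the end assumes the input is always exactly two digits:

```batch
@echo off
setlocal

rem A leading zero means octal, so 010 is eight, not ten.
set /a "n=010"
echo 010 evaluates to %n%

rem 08 is not a valid octal literal, so SET /A prints an error message here.
set /a "n=08"

rem One common workaround: put a 1 in front to force decimal, then subtract 100.
set "raw=08"
set /a "n=1%raw% - 100"
echo Decimal value of %raw% is %n%
```

Storing plain 1 to 12 in the lookup table, as in the script above, avoids the problem altogether.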

If you need MINM and MAXM to be zero-padded, you can just add

Code: Select all

if !MINM! LSS 10 set "MINM=0!MINM!"
if !MAXM! LSS 10 set "MAXM=0!MAXM!"
after the for loop.

Re: Performance Issues with Code

Posted: 04 Aug 2020 11:54
by Eureka!
Purely out of curiosity ...

Would replacing this:

Code: Select all

FOR /F "skip=1 USEBACKQ tokens=1-2 delims=|" %%a IN ("test.txt") DO ( 
    SET "MONTH=!month_val[%%~b]!"
with:

Code: Select all

FOR /F "skip=1 USEBACKQ tokens=2 delims=|" %%a IN ("test.txt") DO ( 
    SET "MONTH=!month_val[%%~a]!"
make the code any faster? (it would save on setting / un-setting an extra variable every round/line)

Re: Performance Issues with Code

Posted: 04 Aug 2020 12:03
by SIMMS7400
Shadow -

Awesome, that's exactly what I did and it's working great. Thank you! Just for my own curiosity, what's the advantage of the SET /A logic vs the previous way of just listing out the different arrays other than a smaller block of code? Any performance increases?

Re: Performance Issues with Code

Posted: 04 Aug 2020 12:04
by ShadowThief
No performance increase whatsoever; it just saved some lines by being able to set multiple values on the same line. It's purely for aesthetic reasons.

Re: Performance Issues with Code

Posted: 04 Aug 2020 12:06
by ShadowThief
Eureka! wrote:
04 Aug 2020 11:54
Purely out of curiosity ...

Would replacing this:

Code: Select all

FOR /F "skip=1 USEBACKQ tokens=1-2 delims=|" %%a IN ("test.txt") DO ( 
    SET "MONTH=!month_val[%%~b]!"
with:

Code: Select all

FOR /F "skip=1 USEBACKQ tokens=2 delims=|" %%a IN ("test.txt") DO ( 
    SET "MONTH=!month_val[%%~a]!"
make the code any faster? (it would save on setting / un-setting an extra variable every round/line)
In my tests, it saved about 2-3 seconds on a million-line file, so I don't think they'd see any benefits on their side, unfortunately.

Re: Performance Issues with Code

Posted: 04 Aug 2020 12:12
by ShadowThief
However, "not setting a variable" gave me an idea to not set the !MONTH! variable at all, and now the script runs in about half the time.

Code: Select all

@echo off
SETLOCAL ENABLEDELAYEDEXPANSION
echo %TIME%

:: Month hashes
set /a "month_val[January]=1",   "month_val[Jan]=1",  "month_val[1]=1", "month_val[01]=1"
set /a "month_val[February]=2",  "month_val[Feb]=2",  "month_val[2]=2", "month_val[02]=2"
set /a "month_val[March]=3",     "month_val[Mar]=3",  "month_val[3]=3", "month_val[03]=3"
set /a "month_val[April]=4",     "month_val[Apr]=4",  "month_val[4]=4", "month_val[04]=4"
set /a "month_val[May]=5",                            "month_val[5]=5", "month_val[05]=5"
set /a "month_val[June]=6",      "month_val[Jun]=6",  "month_val[6]=6", "month_val[06]=6"
set /a "month_val[July]=7",      "month_val[Jul]=7",  "month_val[7]=7", "month_val[07]=7"
set /a "month_val[August]=8",    "month_val[Aug]=8",  "month_val[8]=8", "month_val[08]=8"
set /a "month_val[September]=9", "month_val[Sep]=9",  "month_val[9]=9", "month_val[09]=9"
set /a "month_val[October]=10",  "month_val[Oct]=10",                   "month_val[10]=10"
set /a "month_val[November]=11", "month_val[Nov]=11",                   "month_val[11]=11"
set /a "month_val[December]=12", "month_val[Dec]=12",                   "month_val[12]=12"

SET /A "MAXM=-99999", "MINM=99999"
FOR /F "skip=1 usebackq tokens=2 delims=|" %%a IN ("test.txt") DO ( 
    IF !month_val[%%~a]! GTR !MAXM! SET "MAXM=!month_val[%%~a]!"
    IF !month_val[%%~a]! LSS !MINM! SET "MINM=!month_val[%%~a]!"
)

if !MINM! LSS 10 set "MINM=0!MINM!"
if !MAXM! LSS 10 set "MAXM=0!MAXM!"

echo %MINM%
echo %MAXM%
echo %TIME%
exit /b

Re: Performance Issues with Code

Posted: 04 Aug 2020 12:45
by SIMMS7400
Hi both -

I do need token 1, as I want to extract the year, but this runs so quickly now that keeping it in makes little difference.

Absolutely incredible performance gains on this, thank you Shadow!!! Very much appreciated!
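
For reference, a rough sketch of how the year could be kept alongside the faster loop. It assumes test.txt from the first post is in the current directory and that every row carries the same year; the YEAR variable is just a name picked for this example, and the lookup table is abbreviated to the months that appear in the sample file:

```batch
@echo off
setlocal EnableDelayedExpansion

rem Abbreviated lookup table. Extend it with the other spellings as needed.
set /a "month_val[Jan]=1, month_val[Feb]=2, month_val[Mar]=3"

set /a "MAXM=0, MINM=13"
set "YEAR="

rem Token 1 is the year and token 2 is the period.
rem The year is taken from the first data row only.
for /f "skip=1 usebackq tokens=1-2 delims=|" %%a in ("test.txt") do (
    if not defined YEAR set "YEAR=%%~a"
    if !month_val[%%~b]! gtr !MAXM! set "MAXM=!month_val[%%~b]!"
    if !month_val[%%~b]! lss !MINM! set "MINM=!month_val[%%~b]!"
)

if !MINM! lss 10 set "MINM=0!MINM!"
if !MAXM! lss 10 set "MAXM=0!MAXM!"
echo Year: !YEAR!  Min: !MINM!  Max: !MAXM!
```

If the file can span several years, the year would need its own min/max tracking rather than a single variable.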

Re: Performance Issues with Code

Posted: 04 Aug 2020 14:41
by Aacini
Perhaps this would run a little faster...

Code: Select all

@echo off
setlocal EnableDelayedExpansion

rem Empty environment
(
   for /F "delims==" %%a in ('set') do set "%%a="
   set "ComSpec=%ComSpec%"
)

set /A "i=0, j=100"
for %%a in (January February March April May June July August September October November December) do (
   set /A i+=1, j+=1
   set "month=%%a"
   set /A "m%%a=i, m!month:~0,3!=i, m!i!=i, m!j:~1!=i"
)

SET /A "zMAXM=1, zMINM=12"
FOR /F "skip=1 USEBACKQ tokens=2 delims=|" %%a in ("test.txt") DO (
   set /A "zDiff=zMAXM-m%%~a, zMAXM+=(zDiff>>31)*zDiff, zDiff=m%%~a-zMINM, zMINM-=(zDiff>>31)*zDiff"
)

if %zMAXM% lss 10 set "zMAXM=0%zMAXM%"
if %zMINM% lss 10 set "zMINM=0%zMINM%"

echo Min: %zMINM%
echo Max: %zMAXM%
Antonio

PS - Please, post the timing... Thanks
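
For readers puzzling over the arithmetic inside the loop above, here is the "max" half pulled apart into separate steps. This is only an illustration with made-up values (current maximum 5, new month 9); the names max, m, diff and mask are not from the script:

```batch
@echo off
setlocal

rem Illustrative values only: current maximum 5, new candidate 9.
set /a "max=5, m=9"

rem diff is negative exactly when m is larger than max.
set /a "diff=max-m"

rem SET /A works on signed 32-bit integers. Shifting right by 31 turns
rem any negative value into -1 and any non-negative value into 0.
set /a "mask=diff>>31"

rem With mask -1 this adds m minus max, so max becomes m.
rem With mask 0 nothing changes.
set /a "max+=mask*diff"

echo diff=%diff% mask=%mask% max=%max%
```

The "min" half works the same way with the operands swapped, which lets a single SET /A update both values without any IF.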

Re: Performance Issues with Code

Posted: 04 Aug 2020 15:35
by ShadowThief
In my tests, your code processes one million lines in an average of 60 seconds.

Re: Performance Issues with Code

Posted: 04 Aug 2020 16:25
by Eureka!
ShadowThief wrote:
04 Aug 2020 12:06
In my tests, it saved about 2-3 seconds on a million-line file, so I don't think they'd see any benefits on their side, unfortunately.
Thanks, @ShadowThief!

Inspired by @Aacini's solution, some pseudo-code as I don't have the time and experience to convert this to proper code:

Code: Select all

Instead of month n=1..12, set month= 2^^n -1 (1.. 4095)
set /a min=4095, max=0
For loop:
  set /a min="min & month", max="max | month"

After the for-loop, convert 2^^n - 1 back to n.
 
Might be faster ..
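
One possible way to turn the pseudo-code above into a batch script, offered as an untimed sketch rather than a tested solution. It assumes test.txt from the first post, handles only the three-letter month names, and uses made-up names (mask[...], v). Since ^ is XOR in SET /A, the power of two is written as a left shift:

```batch
@echo off
setlocal EnableDelayedExpansion

rem Month n maps to a mask with the n lowest bits set.
rem Only abbreviated names are listed. Other spellings would be added the same way.
set "n=0"
for %%M in (Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec) do (
    set /a "n+=1, mask[%%M]=(1<<n)-1"
)

rem AND keeps the smallest mask seen so far and OR keeps the largest.
set /a "min=4095, max=0"
for /f "skip=1 usebackq tokens=2 delims=|" %%a in ("test.txt") do (
    set /a "min&=mask[%%~a], max|=mask[%%~a]"
)

rem Convert each mask back to a month number by counting its set bits.
for %%V in (min max) do (
    set /a "v=%%V, %%V=0"
    for /l %%i in (1,1,12) do set /a "%%V+=v&1, v>>=1"
)

echo Min: !min!  Max: !max!
```

One caveat: a period value that is not in the table evaluates to 0 inside SET /A, which would silently reset min to 0, so this sketch relies on every row holding a known month name.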

Re: Performance Issues with Code

Posted: 04 Aug 2020 17:32
by ShadowThief
Eureka! wrote:
04 Aug 2020 16:25
ShadowThief wrote:
04 Aug 2020 12:06
In my tests, it saved about 2-3 seconds on a million-line file, so I don't think they'd see any benefits on their side, unfortunately.
Thanks, @ShadowThief!

Inspired by @Aacini's solution, some pseudo-code as I don't have the time and experience to convert this to proper code:

Code: Select all

Instead of month n=1..12, set month= 2^^n -1 (1.. 4095)
set /a min=4095, max=0
For loop:
  set /a min="min & month", max="max | month"

After the for-loop, convert 2^^n - 1 back to n.
 
Might be faster ..
I'm getting between 60 and 80 seconds for a million rows for this, likely because two set statements are being run every single iteration.