Bug/Mystery in the phase parsing rules 1.5 and 2 CR vs redirect

Posted: 29 Jan 2020 14:57
by jeb
Hi,

While building some magic macro methods, I found a strange behavior:

Code: Select all

@echo off

for /F "usebackq delims= " %%C in (`copy /z "%~f0" nul`) do set "$CR=%%C"

set "PROMPT=$LPHASE3$G "
echo on
set ^"var1=BEGIN-%===%3>NUL END"
set ^"var2=BEGIN-%$CR%3>NUL END"

set var
Output:
$ ./remoteExec.sh "phase_error_CR.bat"

<PHASE3> set "var1=BEGIN-3 END" 1>NUL

<PHASE3> set "var2=BEGIN- END" 3>NUL

<PHASE3> set var
var1=BEGIN-3 END
var2=BEGIN- END
:shock:

Until now, I had never seen a difference between no text at all, expanding an empty variable, and expanding a CR variable :!:

Therefore I had postulated:
Phase 1.5) Remove <CR>: Remove all Carriage Return (0x0D) characters
Phase 2) Process special characters, tokenize, and build a cached command block:

But now, how is this possible :?:
Phase 1.5 removes the CR
Phase 2 handles redirects, it shouldn't know that there was a CR at all
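
To make the contradiction concrete, here is a minimal Python sketch (a model, not cmd.exe code). The function name `phase_1_5` is hypothetical and only stands for the postulated rule; the variable name in the two lines is kept identical so that the CR is the only difference.

```python
CR = "\r"

def phase_1_5(line: str) -> str:
    """Postulated phase 1.5: drop every carriage return (0x0D)."""
    return line.replace(CR, "")

# The two SET lines as they would look after percent expansion in phase 1.
without_cr = 'set ^"var=BEGIN-3>NUL END"'
with_cr = 'set ^"var=BEGIN-' + CR + '3>NUL END"'

# If CR removal were a separate pass, phase 2 would receive identical input
# for both lines and could not treat the redirection differently.
print(phase_1_5(with_cr) == phase_1_5(without_cr))
```

Under this model both lines reach phase 2 as the same string, so the differing outputs above cannot come from a pass that removes CR before phase 2 starts.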

jeb

Re: Bug/Mystery in the phase parsing rules 1.5 and 2 CR vs redirect

Posted: 29 Jan 2020 18:33
by penpen
What proof do we have that removing <CR> characters is performed in its own phase, rather than being handled as a special character in Phase 2, where tokenization might cause the difference shown above?
I tried to search for it, but my google-fu is not strong enough tonight... :( .

penpen

Re: Bug/Mystery in the phase parsing rules 1.5 and 2 CR vs redirect

Posted: 29 Jan 2020 21:13
by dbenham
jeb wrote: But now, how is this possible :?:
Phase 1.5 removes the CR
Phase 2 handles redirects, it shouldn't know that there was a CR at all
Arggghhhhh :shock: :!: :!: :!: :!: :twisted:
penpen wrote: What proof do we have that removing <CR> characters is performed in its own phase, rather than being handled as a special character in Phase 2, where tokenization might cause the difference shown above?
Evidence can be found at viewtopic.php?t=6369

Read all of the first 3 posts.

I am convinced that CR is stripped after phase 1.

But I'm not convinced the evidence is definitive that CR is stripped before phase 2. Phase 2 is certainly easier to explain if the CRs are stripped first. But I can sort of see how they might be stripped as part of phase 2, though not well enough to propose a modified Phase 2 that can account for all the behaviors.

And of course now jeb's new evidence proves that CR must not be removed prior to phase 2 :(


Dave Benham

Re: Bug/Mystery in the phase parsing rules 1.5 and 2 CR vs redirect

Posted: 29 Jan 2020 22:14
by dbenham
Processing redirection special characters in phase 2 is special in that I think it is the only case where the parser must look backward one or two characters - it has to determine if there is a file handle associated with the redirection. I'm thinking that may be an important factor in why the CR only seems to affect redirection in phase 2.

I have a few additional tests:

Code: Select all

@echo off
setlocal
for /F "usebackq delims= " %%C in (`copy /z "%~f0" nul`) do set "$CR=%%C"

set "PROMPT=$G "
echo on
set ^"var1=BEGIN-%===%3>NUL END"
set ^"var2=BEGIN-%$CR%3>NUL END"
set ^"var3=BEGIN-%$CR%3%$CR%>NUL END"
set ^"var4=BEGIN 3>NUL END"
set ^"var5=BEGIN ^3>NUL END"
set ^"var6=BEGIN ^^3>NUL END"
set ^"var7=BEGIN^%$CR%3>NUL END"

set var
--OUTPUT--

Code: Select all

> set "var1=BEGIN-3 END" 1>NUL

> set "var2=BEGIN- END" 3>NUL

> set "var3=BEGIN-3 END" 1>NUL

> set "var4=BEGIN  END" 3>NUL

> set "var5=BEGIN 3 END" 1>NUL

> set "var6=BEGIN ^3 END" 1>NUL

> set "var7=BEGIN3 END" 1>NUL

> set var
var1=BEGIN-3 END
var2=BEGIN- END
var3=BEGIN-3 END
var4=BEGIN  END
var5=BEGIN 3 END
var6=BEGIN ^3 END
var7=BEGIN3 END
So suppose phase 2 immediately discards every CR that it finds, as if the character were never there. If the prior character was an escape ^ that has been removed, then the flag stating the next character is escaped would not be cleared by CR.
I don't see how that could be distinguished from the existing phase 1.5 rule.

Also, suppose that the output of Phase 1 is in some buffer, and phase 2 processes the characters in that buffer and loads them into a new parsed command structure, one character at a time.

Now if redirection is discovered, the parser could look backward in the original phase 1 output buffer to determine if the prior character is a digit (valid handle identifier). If it is a digit, then it may represent a file handle, in which case it must look at the character before that. If it sees a CR, ignore it and keep looking at the prior character until it is not CR or there are no more characters. Now if that prior character is an appropriate stop character, then the digit is indeed a file handle that must be removed from the prior parsed token and moved to the redirection token. If the prior character is ^ then it could be an escape or literal; either way it doesn't matter as long as ^ is not considered a stop character - the digit is a literal and not a file handle.

It's a bit complicated, but I think it could explain all the test cases.


Dave Benham

Re: Bug/Mystery in the phase parsing rules 1.5 and 2 CR vs redirect

Posted: 30 Jan 2020 00:40
by jeb
dbenham wrote:
29 Jan 2020 22:14
Now if redirection is discovered, the parser could look backward in the original phase 1 output buffer to determine if the prior character is a digit (valid handle identifier). If it is a digit, then it may represent a file handle, in which case it must look at the character before that. If it sees a CR, ignore it and keep looking at the prior character until it is not CR or there are no more characters. Now if that prior character is an appropriate stop character, then the digit is indeed a file handle that must be removed from the prior parsed token and moved to the redirection token. If the prior character is ^ then it could be an escape or literal; either way it doesn't matter as long as ^ is not considered a stop character - the digit is a literal and not a file handle.

This explains the behavior, but it is much too complex. I can't believe that someone built a reverse parser that way.
Especially #5 can't be detected by a simple backward scan.

See how complex that would be:

Code: Select all

echo #1: 3 is ... part of redirect ^^^^^^%$CR%3>&2 END
echo #2: 3 is not part of redirect ^^^^^^^%$CR%3>&2 END

echo #3: 3 is not part of redirect -^
3>&2 END

echo #4: 3 is not part of redirect -^
%$CR%3>&2 END

echo #5: 3 is ... part of redirect 4>NUL -^
%$CR%3>&2 END
I know there is a general tokenizer in phase 2, but it doesn't seem that CR breaks these tokens.
Otherwise, the second line in REM test #2 below wouldn't be part of the comment.

Code: Select all

REM -----------^
#1 Line is part of the comment

REM -%$CR%-----^
#2 Line is part of the comment

REM - ---------^
#3 Line is NOT part of the comment

But probably there could be another smaller tokenizer for redirects with a size of only one character,
or even only a flag that indicates when the tokenizer container contains only one unescaped character.

But the second speculation does not sound very convincing in light of the "REM #2" test;
I would expect more side effects on tokenizing in that case.

An additional mini tokenizer could be implemented really easily.

var minibufLen=0
var miniBuf
if escape or quote is active then
    ignore char
    minibufLen=0
else if char is whitespace then
    ignore char
    minibufLen=0
else
    miniBuf=char
    minibufLen++
end

Then it's only necessary to check if minibufLen==2 and miniBuf contains a digit.

jeb

Re: Bug/Mystery in the phase parsing rules 1.5 and 2 CR vs redirect

Posted: 30 Jan 2020 07:18
by dbenham
Thanks for the minibuf idea. It always bothered me how redirection file handle parsing could be handled without going backward.

But now I am really shocked. :shock:

How can echo #1 and #2 in your most recent test give different behavior, yet var 5 and 6 give the same result in my test :?: :?: :?: :?:
Never mind. My var 5 and 6 don't have the CR, so they are not comparable.

Also, the difference in your echo #4 and #5 makes no sense to me.

I can't detect any pattern, let alone imagine how the parser would work. :(

Re: Bug/Mystery in the phase parsing rules 1.5 and 2 CR vs redirect

Posted: 30 Jan 2020 11:24
by penpen
My hypothesis for that behaviour was that CR could be a non-escapable character that just ends the current token (and starts a new one).
Then I saw the "echo #5" example... .

Code: Select all

echo #5: 3 is ... part of redirect 4>NUL -^
%$CR%3>&2 END
The minus character is ignored, but I think that only has to do with its placement between the two redirections, so I tested that and some other characters:

Code: Select all

echo on
echo #5.1: 3 is ... part of redirect 4>NUL ,,,,,,,,,,,,,,,,,,,------------^
3>&2 END
echo #5.2: 3 is ... part of redirect 4>NUL ------------^
3>&2 END
echo #5.3: 3 is ... part of redirect 4>NUL ,,,,,,,,,,,,,,,,,,,^
3>&2 END
echo #5.4: 3 is not part of redirect 4>NUL ------------,,,,,,,,,,,,,,,,,,,^
3>&2 END
echo #5.5: 3 is ... part of redirect 4>NUL ===;;;,,,...!!!(((aaaZZZ)))---[[[]]]{{{}}}'''+++```~~~^
3>&2 END
Result:

Code: Select all

#echo #5.1: 3 is ... part of redirect  END 4>NUL 3>&2
#5.1: 3 is ... part of redirect  END

#echo #5.2: 3 is ... part of redirect  END 4>NUL 3>&2
#5.2: 3 is ... part of redirect  END

#echo #5.3: 3 is ... part of redirect  END 4>NUL 3>&2
#5.3: 3 is ... part of redirect  END

#echo #5.4: 3 is not part of redirect  ------------,,,,,,,,,,,,,,,,,,,3 END 4>NUL 1>&2
#5.4: 3 is not part of redirect  ------------,,,,,,,,,,,,,,,,,,,3 END

#echo #5.5: 3 is ... part of redirect  END 4>NUL 3>&2
#5.5: 3 is ... part of redirect  END
It's kind of funny that the order of the special characters matters... .
I'm unsure whether a single additional (mini) tokenizer could cause something like that... .
(I have the strange feeling that I've seen that behaviour before... somewhere.)

penpen

Edit: Added as many (representative) characters to echo #5.5 as possible.

Re: Bug/Mystery in the phase parsing rules 1.5 and 2 CR vs redirect

Posted: 30 Jan 2020 12:11
by dbenham
OMG :shock:

So the parsing rules for redirection need to be established before we can ever hope to figure out what is going on with CR.

Right now it looks like complete chaos.

Re: Bug/Mystery in the phase parsing rules 1.5 and 2 CR vs redirect

Posted: 30 Jan 2020 15:44
by penpen
I finally remembered where I saw that:
In the topic "Comments without increasing macro size", post #1:
jeb wrote:
13 Feb 2014 03:22
To avoid this you can simply use the obvious :wink: dummy redirect trick.

Code: Select all

echo Word1<nul ^
%= REM this is a comment line%^
Word2
In post #10 jeb noticed that this technique removes a token:
jeb wrote:
19 Feb 2014 12:30
First a sample

Code: Select all

echo Hello <nul the^
only^
world
Output:
Hello world

Side notes (maybe useful):
1) In post #14 jeb created an unexpected "<<".
2) Another version of the same unexpected result (slightly less confusing) is in post #18.
3) My final approach in post #26 uses that behaviour to remove comments (single token) - somehow I forgot to keep investigating why that worked... :oops: .

Re: Bug/Mystery in the phase parsing rules 1.5 and 2 CR vs redirect

Posted: 30 Jan 2020 16:33
by dbenham
Whew. I think I have detailed rules for the redirection parser, disregarding Carriage Returns.
I've since updated this post to show proposed rules for how CR behaves in phase 2, especially as it relates to redirection detection.
Phase 1.5 is now null and void. Instead, each Carriage Return (CR) is immediately stripped during the phase 2 scan process, such that the rest of the phase 2 parser acts as if the CR was never there. There is one exception - the CR functions as a temporary whitespace in step 1 of the redirection parser before it is stripped. This exception is probably an artifact of how the parser keeps track of the last two characters that preceded a redirection operator. Perhaps the parser maintains a special rolling two character buffer of the 2 most recent characters. The special buffer might only be used for redirection parsing. Perhaps the CR is placed in the 2 character buffer before it is stripped.


This deserves examples to show how I derived the following, but I'm exhausted with testing, and want to put these rules out for review ASAP.

Pseudo regular expression to identify redirection:

Code: Select all

1                    2        3                                    4
^                    ^        ^                                    ^
((^|[\s=,;"()&|])\d)?(>>?|<)(&\d|[\s=,;]*(\^&\d|FilePath|device))(([\s=,;]+(\^[\s=,;]|[^\s=,;])+)\^\n)*
penpen effectively rediscovered and posted my rule 4 already while I was slavishly trying to come up with these rules and compose this post
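
Parts 1 to 3 of the pseudo regular expression can be approximated with Python's `re` module. This is only a sketch under loud assumptions: `FilePath` and `device` are stood in for by "any run of non-delimiter characters", part 4 (line continuation) is left out, and escapes, quotes and CR are ignored entirely. The names `REDIRECT` and `first_redirect` are made up for this illustration.

```python
import re

# Verbose-mode translation of parts 1-3 of the pseudo regex.
REDIRECT = re.compile(r'''
    (?:(?:^|(?<=[\s=,;"()&|]))(?P<handle>\d))?     # 1: optional file handle digit
    (?P<op>>>?|<)                                  # 2: redirection operator
    (?:(?P<dup>&\d)                                # 3: &digit, or
       |[\s=,;]*(?P<target>\^&\d|[^\s=,;]+))       #    delimiters, then a target
''', re.VERBOSE)

def first_redirect(text):
    """Return (handle, operator, destination) for the first redirection, or None."""
    m = REDIRECT.search(text)
    if m is None:
        return None
    return m.group("handle"), m.group("op"), m.group("dup") or m.group("target")

for sample in ("BEGIN 3>NUL END", "BEGIN-3>NUL END", "echo x 2>&1"):
    print(repr(sample), "->", first_redirect(sample))
```

A handle of `None` means the digit (if any) stays literal text and the operator's implicit handle applies, as in the `BEGIN-3>NUL` case.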

In English:
  1. Optional file handle to redirect or define:
    • Must be a single digit.
    • There must not be a character before the digit (beginning of "line")
      or
      The preceding character must be whitespace , ; = ( ) " & or |
    • CR functions as a temporary whitespace in this step, even though it is subsequently removed.
      If before the digit, then it may allow the digit to be a file handle.
      If after the digit, then it may prevent the digit from being a file handle.
    • The digit must not be escaped (I did not explicitly put this in the regex)
      An intervening temporary CR whitespace does not prevent a caret from escaping the digit.
  2. Redirection operator:
    • < or > or >>
  3. Redirection destination
    • File Handle:
      • & followed by a digit
        All subsequent characters are preserved
      • or Optional unescaped token delimiters followed by ^& followed by a digit.
        Subsequent characters are ignored until the next unescaped/unquoted token delimiter
        meaning they are functionally stripped, but still appear in the phase 3 echo output
    • or File Path:
      • optional unescaped token delimiters followed by "normal" file specification (may include escaped/quoted token delimiters)
    • or Device:
      • optional unescaped token delimiters followed by "normal" device specification (I'm a bit fuzzy on these rules)
  4. Strip token delimiters followed by token if it ends with line continuation (recursive)
    • If after the redirection target there exists one or more unescaped token delimiters,
      followed by a single token (may include escaped/quoted token delimiters), followed by line continuation,
      then strip everything after the redirection destination before appending the next line.
      The first character of the next line is not escaped (unlike normal line continuation)
    • Repeat until no more lines or multiple tokens after destination.
    • I believe this is the same rule as is used for the REM parser.
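
Rule 1, together with the rolling two character buffer idea, can be sketched in Python as follows. This is a model of the proposed rules, not of cmd.exe itself. The names `redirect_handle`, `STOP_CHARS` and `window` are hypothetical, quote handling is left out, and whether an escaped delimiter still counts as a stop character is an open assumption. The expected handles in `OBSERVED` are the ones shown in the var1 to var7 output earlier in this thread.

```python
CR = "\r"
# Characters that may precede a file handle digit (rule 1).
STOP_CHARS = set(' \t,;=()"&|')

def redirect_handle(line):
    """Scan up to the first unescaped < or > and return (handle, text).

    handle is the digit taken as file handle, or None if the implicit one applies.
    text is what precedes the redirection after CR and caret processing.
    """
    out = []
    window = [None, None]   # two most recent (char, escaped) entries; None = line start
    escaped = False         # True while a caret is waiting to escape the next char

    def push(entry):
        window[0], window[1] = window[1], entry

    for ch in line:
        if ch == CR:
            # Stripped from the output but recorded as temporary whitespace;
            # the pending escape flag is deliberately not cleared.
            push((" ", False))
        elif escaped:
            out.append(ch)
            push((ch, True))
            escaped = False
        elif ch == "^":
            escaped = True
        elif ch in "<>":
            before, last = window
            is_digit = last is not None and last[0] in "0123456789" and not last[1]
            at_stop = before is None or before[0] in STOP_CHARS
            if is_digit and at_stop:
                out.pop()   # the digit moves from the text to the redirection
                return last[0], "".join(out)
            return None, "".join(out)
        else:
            out.append(ch)
            push((ch, False))
    return None, "".join(out)

# Lines after percent expansion, paired with the handle seen in the echoed output
# (None stands for the implicit handle, echoed as 1>NUL).
OBSERVED = [
    ('set ^"var1=BEGIN-3>NUL END"', None),
    ('set ^"var2=BEGIN-' + CR + '3>NUL END"', "3"),
    ('set ^"var3=BEGIN-' + CR + '3' + CR + '>NUL END"', None),
    ('set ^"var4=BEGIN 3>NUL END"', "3"),
    ('set ^"var5=BEGIN ^3>NUL END"', None),
    ('set ^"var6=BEGIN ^^3>NUL END"', None),
    ('set ^"var7=BEGIN^' + CR + '3>NUL END"', None),
]
for line, expected in OBSERVED:
    handle, text = redirect_handle(line)
    print(repr(text), handle, "matches" if handle == expected else "DIFFERS")
```

The sketch stops at the first redirection operator, so it says nothing about rules 3 and 4.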

Re: Bug/Mystery in the phase parsing rules 1.5 and 2 CR vs redirect

Posted: 30 Jan 2020 17:31
by Squashman
Seriously, why don't we send a nicely worded email to one of the people at the Microsoft Command Line Blog and see if they will chime in on this?

Re: Bug/Mystery in the phase parsing rules 1.5 and 2 CR vs redirect

Posted: 30 Jan 2020 23:09
by dbenham
I've updated my prior post with proposed rules for how CR is handled in phase 2.

Re: Bug/Mystery in the phase parsing rules 1.5 and 2 CR vs redirect

Posted: 31 Jan 2020 04:29
by penpen
I'm not sure whether I'm missing something (too little time at the moment, sorry), but I don't see why the "nul" in the next example is removed (the leading "<" is kept):

Code: Select all

@echo on
@set "prompt=#"
echo set macro=(^

>con ^
<nul ^
cmdToken and some params^

)
@set "prompt="
@goto :eof
Result:

Code: Select all

#echo set macro=(
 and some params
) 1>con 0<cmdToken
The system cannot find the file specified.

@Squashman: I like your idea (it seems like Dave has nearly solved it, but I am curious how they see it).


penpen

Re: Bug/Mystery in the phase parsing rules 1.5 and 2 CR vs redirect

Posted: 31 Jan 2020 14:03
by dbenham
Trouble maker :!:
Why do you insist on shooting down a perfectly good theory with facts :?: :lol:

Actually, that is an interesting find. This redirection parser is certainly a complicated beast.

I think I have a basic understanding as to how (sort of why) a single line continued token after redirection gets stripped.

When the phase 2 character scanner encounters < or >, it must establish 3 things

1) The operator. < is easy, but > may be > or >>. I don't think we have investigated whether >> also has complications

2) The file handle that is being redirected or created. It is either implicit, or determined by the preceding character. I think the proposed mini 2 character rolling buffer rule explains this behavior pretty well.

3) The redirection destination. Here is where the line continuation issues lie. The redirection target may be appended to the current token (as defined by the "standard" token delimiters) or it may be in the next token. I'm thinking the token containing the redirection has already been identified, and then the redirection parser blindly determines the next token, all before deciding where the actual destination info lies. But it is possible that the next token ends with line continuation. The redirection parser must suspend itself, and then do the line continuation, completing phases 0 and 1, and then pick back up with finding the following token in phase 2. I'm guessing there is a bug/design flaw/unanticipated consequence of the line continuation mechanism that drops the incomplete token from the prior line and substitutes the first token it finds on the next line. It also leads to the first character not being escaped in the case that the destination was appended to the redirection token.

This process continues as long as it takes until the redirection parser finally has the subsequent token that may be the redirection destination. Only then does it actually discover that the destination was actually appended to the redirection operator token, or else it actually uses the already parsed next token as the destination.

Now you (penpen) have discovered another complication with this redirection line continuation process. If the next token also has redirection, then sometimes the destination that is appended to this 2nd redirection operator is dropped. This also is related to line continuation, but it seems to be different than before. I haven't established a pattern yet, let alone come up with any predictive rules. I'm thinking this complication might even impact the preceding 2 character buffer.

There is still lots to investigate and explain, but somehow my current "understanding" feels right.

Also still unresolved is why REM would have a similar token dropping line continuation issue. It seems like REM and redirection must be sharing some specialized code, but I still don't see any rational reason why that would be so.

Re: Bug/Mystery in the phase parsing rules 1.5 and 2 CR vs redirect

Posted: 31 Jan 2020 18:45
by penpen
dbenham wrote:
31 Jan 2020 14:03
Now you (penpen) have discovered another complication with this redirection line continuation process.
To be honest, that is jeb's example in disguise (replaced all variables):
viewtopic.php?p=32687#p32687


I further minimized that example (and played around) - it's definitely caused by the '^' character, but it also works with only 1 line:

Code: Select all

@echo off
set "prompt=#" 
cls
echo on

echo >con ^<nul cmdToken and some params
echo >con  <nul cmdToken and some params

echo,>con,<!!!(((aaaZZZ)))---[[[]]]{{{}}}'''+++```~~~,^<nul,cmdToken,and,some,params
echo,>con,<!!!(((aaaZZZ)))---[[[]]]{{{}}}'''+++```~~~,<nul,cmdToken,and,some,params

@echo off
set "prompt="
goto :eof
The only (standard) way to remove some text from a statement using a redirection (without using any bugs) is to just overwrite it with another redirection:

Code: Select all

echo abc <this <that <nul
I suspect that the "^<nul cmdToken" part somehow recognizes the redirection "<nul" removing that part from the input buffer(/string) and then the '^'-character doubles the '<'-character, so that "< cmdToken" overwrites the previous result.

But I actually have no good idea why that should happen only after another redirection, because it doesn't happen if the escaped redirection comes first or sits in between other text:

Code: Select all

echo ^<This doesn't produce the issue.

penpen