How Set/p works

OJBakker · #1 Post by **OJBakker** » 26 Aug 2011 07:17

I have done many tests to better understand the way set/p operates, especially when used with input redirection.

set/p can add a variable and give it a value.
set/p can NOT remove a variable.
Trying to use set/p to set a variable to an empty string does not give a clear message; it only sets errorlevel to 1.
This can be interpreted as 'Error: no data/input'.

When reading from a stream the same thing happens when an empty line is encountered.
This is also the reason you have to reset the variable yourself inside the read-loop:
set/p will not reset the variable if the current input is empty, as with an empty line.
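A small sketch to check this behaviour, assuming a plain cmd.exe session. The variable name `var` and the use of `<nul` as an empty input stream are choices made for illustration only:

```batch
@echo off
setlocal
rem Give the variable a known value and make sure errorlevel starts at 0.
set "var=old value"
verify >nul
rem NUL supplies no data, so set /p has nothing to store.
set /p "var=" <nul
rem If the description above holds, var is unchanged and errorlevel is 1.
echo errorlevel=%errorlevel%  var=[%var%]
```

If the variable still shows its old value here, the same will happen for every empty line met inside a read-loop.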

Special note: set/p can set errorlevel but never resets the errorlevel to zero !!!
So if your batchfile checks errorlevel somewhere, be sure to reset errorlevel in the read-loop.
The fastest way to reset errorlevel seems to be with 'verify>nul'

Code: Select all

   set /p line=
   if errorlevel 1 set "line=" & verify>nul
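For context, here is a minimal sketch of a complete read-loop built around those two lines. The file name `input.txt` is hypothetical, and counting the lines first with `find /c /v ""` is just one assumed way of knowing when to stop, since set/p itself gives no end-of-file signal. Lines longer than the set/p buffer are not handled here:

```batch
@echo off
setlocal EnableDelayedExpansion
rem Hypothetical input file; adjust as needed.
set "file=input.txt"
if not exist "%file%" exit /b 1

rem Count the lines up front so the loop knows how many reads to do.
for /f %%N in ('find /c /v "" ^< "%file%"') do set "total=%%N"

rem Redirect the file once for the whole block, so each set /p continues where the last one stopped.
<"%file%" (
  for /l %%I in (1,1,%total%) do (
    rem Clear the variable first: set /p leaves it untouched on an empty line.
    set "line="
    set /p "line="
    rem set /p never resets errorlevel itself, so do it here.
    if errorlevel 1 verify >nul
    echo(%%I: !line!
  )
)
```

Clearing the variable before the read instead of after it means the loop body never sees a stale value, whatever set/p does with errorlevel.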

I have been experimenting a lot with the set /p command.
Apparently this is how 'set/p VAR=' works:

Reading characters:
Characters are read from the inputstream and put in a characterbuffer until one of three conditions is true:
Condition 1: The last 2 read char are CRLF. (a complete line)
Condition 2: There are 1024 characters in the charbuffer. (buffer full)
Condition 3: A timeout, usually caused by end-of-stream. (time-out error-condition)

Processing the charbuffer:
All control-characters from the end of the char-buffer are discarded (possible dataloss)
If there is a NUL character in the charbuffer all characters following the first NUL in the charbuffer
will be discarded (dataloss)

Moving from charbuffer to Var.
If charbuffer is empty report errorcondition:
...set errorlevel to 1 (meaning No value entered)
If charbuffer is not empty:
... Move the string from charbuffer to the variable named in the set/p command.

Set/p is done and returns control to the batchfile.

Some remarks regarding control-characters (ASCII 0-31)
All control characters except NUL can be read and put in a variable by using set/p to read from a stream.
The end-of-line combination CRLF will be discarded. However, a single CR or LF can be put in a variable as long
as they are not combined into a CRLF pair and there is at least one non-control character following the control character.
Remark: Because TAB is also a control character, TAB can not be used to 'protect' control characters from being skipped.
All normal characters like letters and numbers, even a space, can 'protect' control characters.

In pseudo-code

Code: Select all

initially charbuffer is empty
:innerloop
  REM reading characters
  repeat 
     get char from input (stream or manual input) and store it in charbuffer
  until - char buffer full (number of chars in charbuffer is 1024)
     OR - last two chars added to charbuffer are CRLF
     OR - a timeout occurred when requesting next char (usually meaning end-of-input-stream)

  REM Processing the charbuffer:
  loop charbuffer from first received to last received char:
     if char == NUL delete this char and all following chars in the charbuffer.
  loop charbuffer from last received to (first or to a non-controlchar character)
     if char is control-char remove this char from the charbuffer
     (REM this loop removes all trailing control chars including the CRLF)

  REM Moving from charbuffer to Var.
  If charbuffer is empty (REM report errorcondition. Note: Var value remains unchanged!)
     set errorlevel to 1 (meaning No value entered)
  else
     Move the string from charbuffer to the variable named in the set/p command.

  REM Set/p ends its inner loop.
  Control is returned to the batchfile or command prompt.

Of course, for manual input the entered values can be changed by using backspace etc.

  On return in the batchfile there are 2 possible states.
  1: the error-state: errorlevel is set, no input was found. Don't use the Var-value because its
     value is from a previous loop.
     The error-state can be caused by an empty line but also by a timeout (end-of-stream).
  2: no error-state: we got some value in our Var-variable.
     This value can be one of the following:
     - a complete non-empty line
     - an incomplete non-empty line caused by a time-out (end-of-input-stream)
     - a data-chunk of max 1024 characters, not line-oriented

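The "incomplete non-empty line" case can be reproduced with a file that has no terminating CRLF. This is only a sketch: the file name `partial.txt` and the variable `v` are made up for the example, and it relies on the common trick of using set/p with input from NUL to print text without a newline:

```batch
@echo off
setlocal
rem Write text WITHOUT a terminating CRLF: set /p prints its prompt and reads nothing from NUL.
<nul set /p "=no newline here" >partial.txt

rem Read it back: end-of-stream is reached before any CRLF is seen.
set "v="
set /p "v=" <partial.txt
echo [%v%]
```

The variable receives the text even though no complete line was ever read, which is state 2 with an incomplete line.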
TODO's
[1]
What is the length of the timeout used internally by set/p?
[2]
I'm not sure if a timeout can occur before the inputstream is empty.
It might be that the inner loop from the set/p command can be stopped by for example the pause-key.
If this can be done it will result in set/p returning 2 half lines instead of 1 line.
So far I have tested the time-out behavior only in the end-of-stream situation.
[3]
Test if time-polling can be used for better handling of the error-state.
That is to determine if the error was caused by an empty line/lines in the inputstream
or was caused by timeout on end-of-input-stream situation.
[4]
Test if special characters like ^ & ! % " ' = might cause problems when read using set/p.
I have not yet tested these but I don't expect these to have special handling in set/p.
[5]
Everything I might have missed or misinterpreted in my tests.

OJB

jeb · #2 Post by **jeb** » 26 Aug 2011 10:01

Respect

Perfect analysis and a good explanation.
The theory sounds very convincing.

Now I will do some tests with pipes and set /p.

jeb

jeb · #3 Post by **jeb** » 26 Aug 2011 13:53

My first tests with pipes ...

In all my previous tests with set /p and pipes (some months/years ago),
I simply used too little data, so it failed at the 1024 byte buffer limit.

The main problem with pipes is that both sides run (mostly) asynchronously in different processes.

And one interesting thing is that the PAUSE key only pauses the right-hand process.

For my tests I create a "num.txt" file with 500 lines, each with 32 bytes of data (counting also CR and LF).

CreateNum.bat wrote:
@echo off
setlocal EnableDelayedExpansion
(
for /L %%n in (1,1,500) DO (
set "num=1000%%n"
set "num=!num:~-4!"
echo a!num!,b!num!,c!num!,d!num!,e!num!#
)
) > num.txt

SlowType.bat wrote:
@echo off
setlocal EnableDelayedExpansion
set lineNr=0
for /F "delims=" %%A in (num.txt) DO (
call call call set wait=4
set /a lineNr+=1
title "SlowType !lineNr!"
echo(%%A
)

ReadPipe.bat wrote:
@echo off
cls
setlocal EnableDelayedExpansion
set empty=0
set /a loopCnt=0
for /L %%n in (1 0 1) do (
set /a loopCnt+=1
set line=
set /p line=
if defined line (
echo( !loopCnt!: !line!
) else (
set /a Empty+=1
if !empty! GTR 10 call :HALT
)
)
exit /b

:Halt
call :_halt 2> NUL
:_halt
()

And now tested with

Code: Select all

slowType.bat | ReadPipe.bat

On my system three calls are enough to slow down SlowType.bat,
so that ReadPipe.bat can read each line.

But if I remark out (REM) the call call call set wait=4 line, I get only 16 lines.

Output wrote: 1: a0001,b0001,c0001,d0001,e0001#
2:
a0033,b0033,c0033,d0033,e0033#
4: #
5: 8#
6: 60#
7: 192#
8: 0224#
9: e0256#
10: ,e0288#
11: 0,e0320#
12: 52,e0352#
13: 384,e0384#
14: 0416,e0416#
15: d0448,e0448#
16: ,d0480,e0480#

If you press the pause-key, slowType.bat doesn't stop; the title still counts up,
but only until the internal "Pipe-Buffer" is full, then slowType.bat stops as well.
And if the pause-key is pressed again both parts start again.

There seems to be a problem if the data creator is faster than the data consumer.
Then set /p can't read the correct data from the buffers (MORE or FINDSTR can read it).
Perhaps it's an effect of the internal timeout and/or buffer limit.
One of my theories is that in the buffer the CR/LF pairs are reduced to single LFs, and therefore set/p can't access the data any more.

As one result I got ... we need more investigation.

jeb

#4 Post by **dbenham** » 28 Dec 2011 12:12

OJBakker wrote: Reading characters:
Characters are read from the inputstream and put in a characterbuffer until one of three conditions is true:
Condition 1: The last 2 read char are CRLF. (a complete line)
Condition 2: There are 1024 characters in the charbuffer. (buffer full)
Condition 3: A timeout, usually caused by end-of-stream. (time-out error-condition)

I have slight corrections based on experiments on a Vista 64 machine.

Reading characters:
Characters are read from the inputstream and put in a characterbuffer until one of three conditions is true:
Condition 1: The last 2 read char are CRLF or LFCR. (a complete line)
Condition 2: There are 1023 characters in the charbuffer. (buffer full)
Condition 3: A timeout, usually caused by end-of-stream. (time-out error-condition)

The longest line that can be processed reliably is 1021 (not including the terminating CRLF or LFCR).

I suppose some moderately complex logic could be written to detect buffer full condition and concatenate lines appropriately. Special processing would be required to handle whenever CRLF (or LFCR) is split across the 1023 buffer length boundary. The process would have to assume that CR and LF are always paired in the source file.
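The buffer-full condition can be observed with a sketch like the following. The file name `long.txt`, the test length of 1100 characters and the `:strlen` helper are all assumptions made for this example, not something taken from the posts above:

```batch
@echo off
setlocal EnableDelayedExpansion
rem Build one 1100-character line (length chosen arbitrarily to exceed the buffer).
set "s="
for /l %%I in (1,1,1100) do set "s=!s!x"
>long.txt echo(!s!

rem Read the same stream three times and report the length of each chunk.
<long.txt (
  for /l %%I in (1,1,3) do (
    set "chunk="
    set /p "chunk="
    call :strlen chunk n
    echo read %%I: !n! characters
  )
)
exit /b

rem Hypothetical helper: stores the length of the variable named %1 in the variable named %2.
:strlen
setlocal EnableDelayedExpansion
set "str=!%~1!"
set "n=0"
if defined str (
  rem Binary search over the string, valid up to 8191 characters.
  for %%P in (4096 2048 1024 512 256 128 64 32 16 8 4 2 1) do (
    if not "!str:~%%P,1!"=="" (
      set /a "n+=%%P"
      set "str=!str:~%%P!"
    )
  )
  set /a "n+=1"
)
endlocal & set "%~2=%n%"
exit /b
```

If the 1023-character limit reported above holds on your system, the first read should report 1023 characters and the second the remaining 77; the third read reaches the end of the stream and should report 0.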

Dave Benham

#5 Post by **aGerman** » 28 Dec 2011 12:44

dbenham wrote: Condition 2: There are 1023 characters in the charbuffer. (buffer full)

I assume the 1024th character is NUL (string terminator).

Regards
aGerman

Liviu · #6 Post by **Liviu** » 09 Feb 2012 00:46

OJBakker wrote: I have done many tests to better understand the way set/p operates, especially when used with input redirection.

Neat! And confirmed under xp.sp3 (with dbenham's correction of 1,023 vs. 1,024).

dbenham wrote: The longest line that can be processed reliably is 1021 (not including the terminating CRLF or LFCR).

For a single line, 1023 looks OK (but if one continues reading off the same input stream then, yes, 1021 is the highest with "default" behavior).

From what I see here, a 1022 long line returns the expected 1022 string to 'set /p' but (discards the 1023rd CR character and) leaves the following LF in the stream, where it can be read by a subsequent 'set /p'. A 1023 long line also returns the full string to 'set /p' but leaves the CR/LF in the stream, which the next read will take as a blank line.
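A rough way to check the 1023 case is sketched below. The file name `edge.txt` and the second line `next` are invented for the test, and the result may differ between Windows versions:

```batch
@echo off
setlocal EnableDelayedExpansion
rem Build a line of exactly 1023 characters.
set "s="
for /l %%I in (1,1,1023) do set "s=!s!x"

rem Write the long line followed by a short marker line.
>edge.txt (
  echo(!s!
  echo next
)

rem Three reads from the same stream; show what each one returns.
<edge.txt (
  for /l %%I in (1,1,3) do (
    set "chunk="
    set /p "chunk="
    if defined chunk (echo read %%I: starts with "!chunk:~0,4!") else echo read %%I: empty
  )
)
```

If the behaviour described above holds, the first read returns the long line, the second comes back empty (the leftover CR/LF), and only the third returns `next`. Changing 1023 to 1022 in the loop lets you look at the leftover LF case in the same way.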

dbenham wrote: I suppose some moderately complex logic could be written to detect buffer full condition and concatenate lines appropriately. Special processing would be required to handle whenever CRLF (or LFCR) is split across the 1023 buffer length boundary. The process would have to assume that CR and LF are always paired in the source file.

Indeed. Below is a possible draft, assuming normal CRLF line endings. Save it as typefile.cmd and run it with a text file as first argument...

Code: Select all

:: 'for /f (.txt) do echo' mockup, but preserves empty lines
:: and is safe with un-escaped odd characters, except it
:: drops trailing control characters at the end of the line
:: due to 'set /p' quirk
:: and misses by one multiple empty lines at the end of file
:: due to 'find' quirk

@echo off
setlocal disableDelayedExpansion

@rem original call must reference existing file
if exist "%~1" (
  cmd /s /c ""%~f0" :loop "%~1""
  @rem --- reached after nested call ---
  endlocal
  exit /b %errorlevel%
)

@rem nested call expected to reference ':loop'
@rem can't match existing file because of leading colon
set "arg1=%~1" || set "arg1="
if not "%arg1%"==":loop" (
  echo.
  echo *** unrecognized target "%~1" 1>&2
  exit /b -1
)
shift /1

@rem label itself not used, but 'cmd /c' nesting needed
@rem in order to break out of infinite 'for' loop cleanly
:loop
set lf=^


@rem above 2 blank lines are required - do not remove
set "file=%~1"

@rem 'find' counts 2 empty lines at the very end of file as 1
for /f %%a in ('find /c /v "" ^<"%file%"') do (
  set /a lines = %%a
)

@rem loop 'set /p' until line count matches
<"%file%" (
  setlocal enableDelayedExpansion
  set "line="
  for /l %%a in () do (
    @rem loop break condition which requires 'cmd /c' nesting
    if "!lines!" leq "0" exit
    @rem read next chunk
    set "chunk=" & set /p "chunk="
    @rem process current chunk
    if not defined chunk (
      @rem either empty line, or leftover '\r\n' from previous 1,023 one
      call :line
      set "line="
    ) else (
      if "!chunk:~0,1!"=="!lf!" (
        @rem leftover '\n' from previous 1,022+'\r', flush preceding line
        call :line
        set "line=!chunk:~1!"
      ) else (
        @rem regular chunk, append to current line
        set "line=!line!!chunk!"
      )
      if "!chunk:~1021,1!"=="" (
        @rem proper ending chunk, flush line
        call :line
        set "line="
      )
    )
  )
  endlocal
)
echo --- never reached ---
endlocal
goto :eof

@rem process current line
:line
@rem '(' parenthesis guards against '', '/?', echo* matches
echo(!line!
set /a lines -= 1
goto :eof

Critique most welcome, of course... For one thing, it won't handle lines ending in control characters, nor lines wider than around 8K, nor multiple empty lines at the end of the file, but other than that it seems to generate identical copies.

Liviu

#7 Post by **penpen** » 01 Oct 2013 17:46

Good analysis!
But I assume that one thing is not correct (I cannot prove this, but I think it is improbable enough):

OJBakker wrote: Condition 3: A timeout, usually caused by end-of-stream. (time-out error-condition)

I think there is no timeout, simply because it is not necessary (using MS C or C++; I assume MS used MS C/C++ for programming every version of Dos/Win).

Code: Select all

// timeout version
void setSlashPImplTo (char* variable) {
   std::ifstream is (STDIN);
   is.read ((byte*) variable, 1023, timeout);   // last byte in buffer is always 0
}

// std version
void setSlashPImplStdUse (char* variable) {
   std::ifstream is (STDIN);
   if (is) {
      is.read ((byte*) variable, 1023);   // last byte in buffer is always 0
   }
}

This is C++ style pseudo-code.
The second version is faster and more secure, as it has no side effects, and I assume this is why MS recommends doing it this way.
Additionally, they would have had to program the timeout functionality, but no such functionality is found in MSVS (MS Visual Studio), and I doubt they have simply forgotten to add it in all versions since 1.0.
(The same applies if MS used their C language to program this.)

penpen

#8 Post by **dbenham** » 03 Oct 2013 13:50

It really is true - SET /P can read from the file while another process is writing to it. SET /P does not wait around for a newline or for the 1023 char buffer to get full. If it reaches the end of the available input stream, then it will return the partial line. I don't know how to test to see if there is a timeout period that must expire before continuing, or if it returns immediately after detecting the end of the data. After reading the partial line, SET /P can then try again and read subsequent data.

Testing the behavior can be tricky, because the writing process may be buffered, and it may wait until it writes a newline before it flushes the content to disk.

Dave Benham

#9 Post by **penpen** » 03 Oct 2013 15:05

There is a way to test this (and to test all other unclear behavior):
You may write a C++ program, using Microsoft Visual Studio, that performs:

Code: Select all

#include <process.h>
#include <stdlib.h>

int main (int argc, char** argv) {
   system("Z:\\test.bat");   // backslash must be escaped, otherwise "\t" is read as a TAB
}

And do what you like within the batch file.
You can debug using "trace into" and watch the source code, after you have downloaded all (needed) debug symbols (if you don't download them you will have fun with assembler):
http://msdn.microsoft.com/en-us/windows/hardware/gg463028.aspx

But the reason why I actually haven't done this is: at work I don't have the rights to install the debug symbols, I don't want to read assembler, and I think they wouldn't allow me to investigate this just for my curiosity.
And at home I do not have the full version (only the Express version), where this seems not to be possible. In addition, the needed packages seem to be really BIG, and I have only a 20GB hdd.

If anyone has access to these things and wants to find this out, AND this is allowed in your country, AND it isn't disallowed by Microsoft's EULA or whatever else they have: feel free to find it out.

penpen

#10 Post by **Squashman** » 20 Nov 2014 14:38

Well since this thread seems to get dug up every year I might as well unearth it before the year is up.

A long time ago in a Galaxy far far away, a newbie batch programmer (me) wrote the following batch file to read in the first line of all the CSV files in a directory and then write out the file name and first line to an output file. My old department that I worked for would use this to visually make sure all the header records of all the CSV files they were going to process were the same. Just kind of validating that the data layouts were the same.

Code: Select all

@echo off & setlocal enableextensions enabledelayedexpansion

FOR /F "tokens=*" %%A IN ('dir /b /a-d *.csv') DO (
   set /P _Header=<%%A
   set _Tfile=%%~nA                                                  .
   set _Filename=!_Tfile:~0,50!
   echo.!_Filename!!_Header!>>header.log
   )

I have learned a thing or maybe even two since then but never knew that SET /P was limited to 1023 characters.

Now I could change this to CALL to a subroutine that used the FOR command to read the files and then use a goto :EOF to exit the function but the FOR command will read in the entire file before it starts processing each line and breaking it up into tokens. I deal with very very very large files. On some of them we would just run out of memory. But this would also then run into the normal 8191 character limit for the SET command. God Forbid I would have clients sending in data files with the header record that long!

What are my options for replicating this logic and overcoming the 1023 character limit? I suppose I could just bite the bullet and use a third party HEAD command. Wish Microsoft would have baked that into the OS years ago. You find it on every BSD, Unix and Linux distribution.

#11 Post by **foxidrive** » 20 Nov 2014 17:38

This should handle long lines - not sure how fast it is, but findrepl is good for large files.

Use the version of findrepl before the current V2.0 release

Code: Select all

@echo off
(
 FOR /F "delims=" %%A IN ('dir /b /a-d *.csv') DO (
    set /P "=%%~nxA - " <nul
    call findrepl /o:1:1 <"%%A"
 )
)>header.log

#12 Post by **dbenham** » 20 Nov 2014 19:42

I believe findrepl.bat always reads the entire file content into a single string variable, which is limited to 2 gigabytes. If I am right, this would not work with many of the large files that Squashman deals with.

The JREPL.BAT utility reads line by line (unless /M is used), so it can work with mammoth files. It is inefficient in that it still reads the entire file (one line at a time).

head.bat

Code: Select all

::head.bat  count  [/F inFile] [/O outFile|-] [/N minWidth]
@echo off
jrepl "^.*" "ln<=%1?$0:false" /jmatch %2 %3 %4 %5 %6 %7

The head.bat script accepts the same options as JREPL.BAT (once you get past the line count)

Print the first 5 lines of test.txt:

Code: Select all

head 5 /f test.txt

List the 3 most recently updated files in the current directory:

Code: Select all

dir /b /a-d /o-d *|head 3

Dave Benham
