This was much more complicated than I thought it would be. Results are NOT as expected - in fact, I find some of them shocking!
1) This does NOT work:
<%file% (
for /f "skip=2" %%i in ('find /n /v "" %file%') do (
set "ln="
set /p "ln="
echo(!ln!
)
)>%out%
It reads and writes the correct number of lines - but the lines are empty
I don't understand the mechanism of failure.
Yet this works just fine:
<%file% (
for /f "skip=2" %%i in ('type %file%|find /n /v ""') do (
set "ln="
set /p "ln="
echo(!ln!
)
)>%out%
And so does aGerman's original suggestion with FINDSTR.
2) As often happens with batch programming, the methods that seem like they should be faster are actually slower as the size of the file increases.
Once I proved that each method was able to copy properly, I stripped out the copy portion and preserved only the read portion of the code. In this way I was able to time just the code that is necessary to do the read.
I tested 6 different methods for reading a text file: 4 using the new SET /P syntax, and two using the "traditional" FOR /F approach. I did not further test the SET /P approach using a GOTO loop because A) I've already shown it is slower, and B) it requires altering the text file with an appended STOP flag. The read should be non-destructive.
For test files to read, I started with one batch file that was approximately 1 kbyte in size and progressively doubled the size until I reached 32k. I did the same with a file that was approximately 50k and doubled it until I reached 1600k.
I tested each of the methods 10 times against the 1k derived test files, and 3 times against the 50k derived files, and averaged the results.
I also ran the tests on two different machines.
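A minimal sketch of how such a doubling series of test files could be generated. This is an illustration only, not the script used for the tests; the script name makeseries.bat and the _d1, _d2, ... suffixes are made up for the example.

```batch
@echo off
rem makeseries.bat, a hypothetical helper: build a doubling series from a seed file
rem   first argument  = seed file
rem   second argument = number of doublings
rem Assumption: the path contains no exclamation marks, since delayed expansion is on.
setlocal enableDelayedExpansion
set "prev=%~f1"
for /l %%i in (1 1 %~2) do (
  set "next=%~dpn1_d%%i%~x1"
  rem COPY /B joins the bytes as-is, so each new file is twice the previous one
  copy /b "!prev!"+"!prev!" "!next!" >nul
  set "prev=!next!"
)
```

Called as `makeseries.bat test1k.bat 5`, this should leave test1k_d1.bat (2x the seed) through test1k_d5.bat (32x the seed) next to the original.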
The test code takes 2 arguments:
%1 = the test file to read
%2 = the number of times to test each method
Here is the test code:
@echo off
setlocal enableDelayedExpansion
if not defined macro\load.macrolib_time call macrolib_time
set cnt=%2
set file="%~1"
for %%a in (%file%) do set size=%%~za
for /f %%a in ('type %file%^|find /c /v ""') do set lines=%%a
echo file=%file%, lines=%lines%, size=%size%
::read1 FOR /F ('FIND /C FILE') FOR /L () SET /P
for /l %%A in (1 1 %cnt%) do (
%macro_Call% ("t1") %macro.getTime%
<%file% (
for /f "delims=" %%n in ('find /c /v "" %file%') do set "len=%%n"&for /l %%l in (1 1 !len:*: ^=!) do (
set "ln="
set /p "ln="
)
)
%macro_Call% ("t2") %macro.getTime%
%macro_Call% ("t1 t2 read1") %macro.diffTime%
set read1
)
echo(
::read2 FOR /F ('TYPE FILE|FIND /C') FOR /L () SET /P
for /l %%A in (1 1 %cnt%) do (
%macro_Call% ("t1") %macro.getTime%
<%file% (
for /f %%n in ('type %file%^|find /c /v ""') do for /l %%l in (1 1 %%n) do (
set "ln="
set /p "ln="
)
)
%macro_Call% ("t2") %macro.getTime%
%macro_Call% ("t1 t2 read2") %macro.diffTime%
set read2
)
echo(
::read3 FOR /F ('FINDSTR FILE') SET /P
for /l %%A in (1 1 %cnt%) do (
%macro_Call% ("t1") %macro.getTime%
<%file% (
for /f %%a in ('findstr /n "^" %file%') do (
set "ln="
set /p "ln="
)
)
%macro_Call% ("t2") %macro.getTime%
%macro_Call% ("t1 t2 read3") %macro.diffTime%
set read3
)
echo(
::read4 FOR /F ('TYPE FILE|FIND') SET /P
for /l %%A in (1 1 %cnt%) do (
%macro_Call% ("t1") %macro.getTime%
<%file% (
for /f %%a in ('type %file%^|find /n /v ""') do (
set "ln="
set /p "ln="
)
)
%macro_Call% ("t2") %macro.getTime%
%macro_Call% ("t1 t2 read4") %macro.diffTime%
set read4
)
echo(
setlocal DisableDelayedExpansion
::read5 "Traditional" FOR /F ('FINDSTR')
for /l %%A in (1 1 %cnt%) do (
%macro_Call% ("t1") %macro.getTime%
(
for /f "tokens=*" %%a in ('findstr /n "^" %file%') do (
set "ln=%%a"
setlocal enableDelayedExpansion
set "ln=!ln:*:=!"
endlocal
)
)
%macro_Call% ("t2") %macro.getTime%
%macro_Call% ("t1 t2 read5") %macro.diffTime%
set read5
)
echo(
::read6 "Traditional" FOR /F ('FIND')
for /l %%A in (1 1 %cnt%) do (
%macro_Call% ("t1") %macro.getTime%
(
for /f "skip=2 tokens=*" %%a in ('find /n /v "" %file%') do (
set "ln=%%a"
setlocal enableDelayedExpansion
set "ln=!ln:*]=!"
endlocal
)
)
%macro_Call% ("t2") %macro.getTime%
%macro_Call% ("t1 t2 read6") %macro.diffTime%
set read6
)
I've summarized the methods used above using an abbreviated syntax:
Read1 = FOR /F ('FIND /C FILE') FOR /L () SET /P
  This is a variation of my original Copy1 method where I eliminate the pipe while determining the line count.
Read2 = FOR /F ('TYPE FILE|FIND /C') FOR /L () SET /P
  This is my original Copy1 method.
Read3 = FOR /F ('FINDSTR /N FILE') SET /P
  This is aGerman's suggestion.
Read4 = FOR /F ('TYPE FILE|FIND /N') SET /P
  This is my variation of aGerman's suggestion, with the added pipe to get around the unexplained failure.
Read5 = "Traditional" FOR /F ('FINDSTR /N')
  This is the "traditional" method using FINDSTR /N to preserve the empty lines.
Read6 = "Traditional" FOR /F ('FIND /N')
  This is the "traditional" method using FIND /N to preserve the empty lines.
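The test code relies on the macrolib_time macros, which are not shown here. For anyone without that library, here is a rough sketch of timing a single Read2 pass using nothing but the TIME variable. The :now helper is invented for this example and is not part of macrolib_time, and it assumes TIME is formatted as H:MM:SS.cc (or with a comma before the fraction).

```batch
@echo off
rem Sketch: time one pass of the Read2 method without any macro library.
setlocal enableDelayedExpansion
set "file=%~1"

call :now t1
<"%file%" (
  for /f %%n in ('type "%file%" ^| find /c /v ""') do for /l %%l in (1 1 %%n) do (
    set "ln="
    set /p "ln="
  )
)
call :now t2

set /a "cs=t2-t1"
rem If the run crossed midnight, add one day's worth of centiseconds
if %cs% lss 0 set /a "cs+=8640000"
echo elapsed: %cs% centiseconds
exit /b

:now
rem Hypothetical helper: store the time of day, in centiseconds,
rem in the variable named by the first argument.
rem The 100x modulo 100 trick keeps values like 08 from being read as octal.
for /f "tokens=1-4 delims=:., " %%a in ("%time%") do set /a "%~1=(((100%%a %% 100)*60+(100%%b %% 100))*60+(100%%c %% 100))*100+(100%%d %% 100)"
exit /b
```

The clock behind TIME is fairly coarse, so a single pass over a small file is mostly noise; that is one reason to repeat each test and average.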
Results on a Vista64 Gateway Intel Quad Core2:
                  AVERAGE TIME (seconds)
Size    Lines Runs   Read1   Read2   Read3   Read4   Read5   Read6
~1k        24   10    0.49    0.15    0.53    0.15    0.60    0.15
~2k        48   10    0.39    0.20    0.59    0.16    0.47    0.12
~4k        96   10    0.50    0.12    0.60    0.22    0.66    0.16
~8k       192   10    0.42    0.19    0.53    0.24    0.58    0.24
~16k      384   10    0.45    0.18    0.17    0.19    0.39    0.39
~32k      768   10    0.47    0.32    0.27    0.29    0.71    0.71
~50k     1685    3    0.56    0.65    0.54    0.56    1.47    1.48
~100k    3370    3    0.63    0.81    1.16    1.19    3.04    3.06
~200k    6740    3    1.32    1.66    2.92    2.98    6.66    6.71
~400k   13480    3    2.55    3.22    8.42    8.61   15.90   16.07
~800k   26960    3    5.00    6.37   27.54   28.17   42.44   43.16
~1600k  53920    3    9.90   12.62   98.73  101.32  129.13  131.55
Results on an Old Dell XP machine:
                  AVERAGE TIME (seconds)
Size    Lines Runs   Read1   Read2   Read3   Read4   Read5   Read6
~1k        24   10    0.34    0.49    0.34    0.34    0.37    0.36
~2k        48   10    0.35    0.49    0.34    0.51    0.42    0.39
~4k        96   10    0.35    0.51    0.37    0.52    0.47    0.45
~8k       192   10    0.38    0.55    0.39    0.56    0.60    0.62
~16k      384   10    0.45    0.64    0.45    0.68    0.90    0.94
~32k      768   10    0.56    0.77    0.62    1.12    1.48    1.58
~50k     1685    3    0.82    1.14    1.00    1.41    2.87    3.32
~100k    3370    3    1.51    1.96    1.96    2.77    5.63    6.31
~200k    6740    3    2.51    2.96    4.05    6.37   11.67   13.76
~400k   13480    3    4.31    5.49   10.93   17.58   28.25   32.44
~800k   26960    3    8.36   11.35   34.72   53.78   69.24   83.73
~1600k  53920    3   15.88   20.77  119.54  180.32  180.76  240.50
My Vista machine is faster, but it has a quirk: sometimes when the machine needs to invoke CMD.EXE, a consistent 0.5 second delay randomly creeps in. If you look at my times in my first post in this thread you can see what I am talking about. In these tests, Read1 for a 1k file took either ~0.10 or ~0.60 seconds. The variations are significant for small files, but virtually disappear for large files.
The timings for the XP machine are slower, but much better clustered.
All methods are virtually equivalent for small files, but as the files grow, the methods really begin to differentiate.
The results surprised me initially, but I think I understand somewhat why aGerman's suggestion is slower.
Windows command shell does not have true pipes between processes or threads like Unix.
[EDIT - Now I'm not so sure this is true. I'm pretty sure it was true with original DOS and COMMAND.COM. But some recent reading indicates CMD.EXE has true pipes after all. But there definitely seems to be some kind of buffering issue when piping large amounts of data.]
Instead there is only one active process within a given session. So whenever Windows needs to pipe explicitly, or implicitly like when FOR /F executes a command, the spawned command shell must complete its job entirely and cache the results before they are sent to the next "process" in line. I think for small files everything is cached in memory and we don't see much degradation. But as the output grows, it has to cache to disk and this is what slows it down. However, this "disk caching" performance hit seems to be much worse for the 'command' within a FOR /F than it is for an explicit pipe. Perhaps the mechanisms are different.
I'm sure I have some inaccuracies in my explanation, but I think there is at least a grain of truth to the above.
The Read1 method never caches more than one line of data while determining the line count, so it is by far the fastest.
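The single line in question is the summary that FIND /C prints for a named file, and the count can be pulled out of it with a substring substitution. A small sketch of just that step; the sample path is made up, and the FIND output is hard-coded here rather than captured from a real run:

```batch
@echo off
setlocal enableDelayedExpansion
rem Hypothetical sample of the one line FIND /C prints for a named file
set "len=---------- C:\TEMP\TEST.TXT: 24"
rem Strip everything up to and including the first colon followed by a space
set "count=!len:*: =!"
echo !count!
```

This should print 24. The colon after the drive letter is followed by a backslash rather than a space, so it is not matched. The test code writes the same substitution with a caret before the space, apparently so that the space survives inside the FOR /L IN clause.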
The Read2 method is nearly identical except it "caches" the entire file once for the left half of the pipe operation.
The Read3 method caches the entire file with line number prefixes, but it is now for the FOR /F command and not an explicit pipe. For some reason this becomes increasingly slow for large outputs.
The Read4 method does the same as Read3, plus it must cache the entire file an additional time for an explicit pipe.
Read5 and Read6 must cache the entire file plus the line number prefixes, plus they must invoke SETLOCAL/ENDLOCAL for each line. However I did some timings without SETLOCAL/ENDLOCAL (not shown) and there is some additional mechanism that makes this slower than Read3/Read4. I'm guessing it has something to do with the fact that Read3/Read4 only preserve the 1st token of each line within the FOR /F, whereas Read5/Read6 must preserve the entire line. Does this imply that the tokenised results are also "cached"?
The results seem to show that FIND is inherently slower than FINDSTR. It's too bad FINDSTR does not have the /C option that FIND has.
The methods that don't require caching of the FOR /F command are fairly linear. Each time the file size is doubled the timing is also doubled.
But methods that do require caching of a FOR /F command are worse than linear. Doubling the size of a large file increases the time by a factor of three or more.
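To put numbers on that, here is a small sketch that computes the doubling ratios for Read1 and Read3 from the Vista table, for the 50k through 1600k rows. The timings are entered as centiseconds because SET /A only does integer math, and the :ratios routine is made up for this illustration.

```batch
@echo off
setlocal enableDelayedExpansion
rem Vista timings for the 50k through 1600k rows, in centiseconds
call :ratios Read1 56 63 132 255 500 990
call :ratios Read3 54 116 292 842 2754 9873
exit /b

:ratios
rem Hypothetical helper: print each timing divided by the one before it
set "name=%~1"
set "prev="
set "out="
:ratioLoop
shift
if "%~1"=="" (
  echo %name%:%out%
  exit /b
)
if defined prev (
  rem Ratio scaled by 100 and truncated, then shown with two decimals
  set /a "r=%~1*100/prev"
  set "out=!out! !r:~0,-2!.!r:~-2!"
)
set "prev=%~1"
goto ratioLoop
```

By my arithmetic this gives roughly 1.12, 2.09, 1.93, 1.96, 1.98 for Read1 and 2.14, 2.51, 2.88, 3.27, 3.58 for Read3 (truncated, not rounded). Read1 settles at about 2 per doubling, while the Read3 ratio keeps climbing.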
So in the future I will be using either the Read1 or the Read2 method. Read1 is faster, but a bit more complex to write.
Dave Benham