Determining the number of lines in a file.

dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: Determining the number of lines in a file.

#16 Post by dbenham » 04 Jan 2012 09:42

Sorry - I had a stupid bug where I was interpreting the line offset as hexadecimal when in reality it is decimal. I was confused with another experiment where I was determining the actual EOL char(s) using FC /B. The code has been fixed and the post edited.

The code to determine EOL works by the way. I'm working on another version that uses this.

Dave Benham

Squashman
Expert
Posts: 4486
Joined: 23 Dec 2011 13:59

#17 Post by Squashman » 04 Jan 2012 11:57

Here is some output using the same input file, once with CRLF and once with just LF as the EOL. I am getting the output that I really need, but I'm not sure why the last two variables come back as undefined when the file uses LF for the EOL. Getting back the correct line count, line length and EOL size is really all I need.

Code: Select all

E:\batch files\HEAD>DaveLength.bat EST3_CRLF.txt
fSize=25712336
lnCnt=25012
lnLen=1026
EOLSize=2
finalEOL=1
EOFSize=0

E:\batch files\HEAD>DaveLength.bat EST3_LF.txt
fSize=25687324
lnCnt=25012
lnLen=1026
EOLSize=1
Environment variable finalEOL not defined
Environment variable EOFSize not defined

E:\batch files\HEAD>


Will start testing with a file with an EOF.

#18 Post by dbenham » 04 Jan 2012 12:19

If the EOL is LF, the algorithm can't differentiate between a last line that is missing its EOL but is followed by EOF and a last line that ends with EOL but has no EOF.

Dave Benham

#19 Post by Squashman » 04 Jan 2012 12:30

Some more output using EOF this time. This one seems to be OK with the LF and an EOF. My biggest concern was that it took close to 20 seconds for each of these two tests. But overall this should still be quicker than using FIND.

Code: Select all

E:\batch files\HEAD>DaveLength.bat EOF_CRLF.txt
fSize=63016885
lnCnt=250067
lnLen=250
EOLSize=2
finalEOL=1
EOFSize=1

E:\batch files\HEAD>DaveLength.bat EOF_LF.txt
fSize=62766818
lnCnt=250067
lnLen=250
EOLSize=1
finalEOL=1
EOFSize=1

E:\batch files\HEAD>

#20 Post by dbenham » 04 Jan 2012 12:36

OK - I've got the ultimate text file attribute probe :D

It measures everything directly - it does not do any math based on assumptions. It is able to process a 250MB file in 16 seconds.

Theoretically it should work on a file up to 2GB.

The line length limit is dependent on the capabilities of FIND and FINDSTR. I'm not sure what the line length limit is for those utilities. I'm pretty sure it is at least 8k. I'm hoping it is greater than that.
EDIT - added check for possible fixed line length

Code: Select all

@echo off
setlocal
call :FileAttributes %1 || exit /b
set /a "fSize2=(lnCnt*(lnLen+eolSize))-((1-eolFinal)*eolSize)+eof"
if %fSize%==%fSize2% (set "fixedLen=Probably") else set "fixedLen=No"
echo File size     = %fSize%
echo Line count    = %lnCnt%
echo Line 1 length = %lnLen%
echo Line 1 EOL    = %eol%
echo EOL size      = %eolSize%
echo Final EOL     = %eolFinal%
echo Ending EOF    = %eof%
echo(
echo Fixed line length = %fixedLen%
exit /b


:FileAttributes File
::
:: Determines attributes of a text file
::
:: Lines can be terminated with <LF> or <CR><LF>
::
:: returns
::   fSize = file size in bytes
::   lnCnt = line count (disregarding any extra EOF line)
::   lnLen = length of first line in bytes (disregarding EOL)
::   eol = character(s) used to signify End Of Line for 1st line (in hex)
::   eolSize = number of characters used to signify EOL of 1st line
::   eolFinal = 0 if last line is missing EOL
::              1 if last line has EOL
::   eof = 0 if file is NOT terminated by EOF (0x1A)
::         1 if file IS terminated by EOF
::
:: Sets ERRORLEVEL > 0 if error
::   1 File not found
::   2 Too few lines (need at least 2)
::
  setlocal enableDelayedExpansion
  set tmpFile="%temp%\FileAttributes%random%.tmp"
  set tmpFile2="%temp%\FileAttributes2%random%.tmp"
 
  ::Determine file size
  set "fSize=%~z1"
  if not defined fSize >&2 echo ERROR: File "%~1" not found&exit /b 1
 
  ::Determine length of first line, including EOL char(s)
  ::Use a temp file instead of pipe because pipes are slow with large files
  >%tmpFile% findstr /o "^" "%~1"
  set "lnLen="
  for /f "usebackq skip=1 delims=:" %%L in (%tmpFile%) do (
    set "lnLen=%%L"&goto :break
  )
  :break
  if not defined lnLen >&2 echo ERROR: Too few lines (need at least 2)&del %tmpFile%&exit /b 2
 
  ::Determine hex offset of potential EOL char(s) of first line
  set /a "eol1=lnLen-2, eol2=lnLen-1"
  cmd /c exit %eol1%
  set "eol1=%=exitcode%:"
  cmd /c exit %eol2%
  set "eol2=%=exitcode%:"
 
  ::Build a dummy file that has size >= length of line 1
  <nul >%tmpFile% set /p ".=A"
  set dummySize=1
  for /l %%n in (1,1,32) do if !dummySize! lss !lnLen! set /a "dummySize*=2" & type %tmpFile% >>%tmpFile%
 
  ::Use FC /B to determine the EOL char(s), determine EOL size, and adjust lnLen
  set "eol="
  set EOLSize=0
  for /f "tokens=2 delims=: " %%A in ('fc /b "%~1" %tmpFile% ^| findstr /b /l "%eol1% %eol2%"') do (
    if "%%A"=="0A" set "eol=!eol!%%A"&set /a "EOLSize+=1, lnLen-=1"
    if "%%A"=="0D" set "eol=!eol!%%A"&set /a "EOLSize+=1, lnLen-=1"
  )
 
  ::use FIND to get the line count
  for /f %%A in ('find /c /v "" ^<"%~1"') do set lnCnt=%%A
 
  ::use FINDSTR to get the last line in a temp file
  ::don't use a pipe because it is slow with large files
  >%tmpFile% findstr /n "^" "%~1"
  >%tmpFile2% findstr /b "%lnCnt%:" %tmpFile%
 
  ::determine length of last line file
  for %%a in (%tmpFile2%) do set len=%%~za
 
  ::determine hex offset of last character
  set /a "end=len-1"
  cmd /c exit %end%
  set "end=%=exitcode%:"

  ::Build a dummy file that has size >= length of last line file
  <nul >%tmpFile% set /p ".=A"
  set dummySize=1
  for /l %%n in (1,1,32) do if !dummySize! lss !len! set /a "dummySize*=2" & type %tmpFile% >>%tmpFile%

  ::Use FC /B to determine if file ends with EOL or EOF
  set "eolFinal=0"
  set "eof=0"
  for /f "tokens=2 delims=: " %%A in ('fc /b %tmpFile2% %tmpFile% ^| findstr /b /l "%end%"') do (
    if "%%A"=="0A" set "eolFinal=1"
    if "%%A"=="0D" set "eolFinal=1"
    if "%%A"=="1A" set "eof=1"
  )
 
  ::If file ends with EOF then last real line must end with EOL if final line only contains EOF
  if %eof%==1 (
    setlocal disableDelayedExpansion
    for /f "usebackq delims=" %%L in (%tmpFile2%) do set "ln=%%L"
    setlocal enableDelayedExpansion
    if "!ln:~0,-1!"=="%lnCnt%:" (
      endlocal&endlocal
      set eolFinal=1
      set /a lnCnt=%lnCnt%-1
    ) else endlocal&endlocal
  )

  del %tmpFile%
  del %tmpFile2%

  endlocal&(
    set fSize=%fSize%
    set lnCnt=%lnCnt%
    set lnLen=%lnLen%
    set eol=%eol%
    set eolSize=%eolSize%
    set eolFinal=%eolFinal%
    set eof=%eof%
  )
exit /b 0


Dave Benham
Last edited by dbenham on 04 Jan 2012 13:32, edited 1 time in total.

#21 Post by Squashman » 04 Jan 2012 12:51

dbenham wrote:OK - I've got the ultimate text file attribute probe :D

It measures everything directly - it does not do any math based on assumptions. It is able to process a 250MB file in 16 seconds.


Ok your computer must be exponentially faster than mine.

30 seconds each.

Code: Select all

E:\batch files\HEAD>FileAttributes.bat EOF_CRLF.txt
File size     = 63016885
Line count    = 250067
Line 1 length = 250
Line 1 EOL    = 0D0A
EOL size      = 2
Final EOL     = 1
Ending EOF    = 1

E:\batch files\HEAD>FileAttributes.bat EOF_LF.txt
File size     = 62766818
Line count    = 250067
Line 1 length = 250
Line 1 EOL    = 0A
EOL size      = 1
Final EOL     = 1
Ending EOF    = 1
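
As a cross-check, the values reported above for EOF_CRLF.txt can be plugged into the fixed line length formula from the edited script. This is only a sketch of the arithmetic: the numbers are hard-coded from the output above rather than measured, and the script name is irrelevant.

```batch
@echo off
setlocal
:: Values copied from the EOF_CRLF.txt output above (hard-coded for illustration)
set /a "fSize=63016885, lnCnt=250067, lnLen=250, eolSize=2, eolFinal=1, eof=1"

:: Same formula as the fixed line length check in the script
set /a "fSize2=(lnCnt*(lnLen+eolSize))-((1-eolFinal)*eolSize)+eof"
echo Expected size = %fSize2%

if %fSize%==%fSize2% (echo Fixed line length = Probably) else echo Fixed line length = No
```

250067 lines of 252 bytes give 63016884, plus 1 byte for the EOF, which matches the reported file size of 63016885, so the check reports "Probably". The same arithmetic with lnLen+eolSize = 251 reproduces 62766818 for EOF_LF.txt.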

#22 Post by dbenham » 04 Jan 2012 13:37

Assuming the performance is linear with file size (big assumption), that is a factor of ~8 difference between our machines.

I suspect disk access is the limiting factor.

All my processing is on a local hard drive (both the target file and the temp location). Performance absolutely dies if I try to access a network drive at work (I think we are talking at least a factor of 100 slower :!: ).

Note - I edited my previous code to include a check for fixed line length.

Dave Benham

#23 Post by Squashman » 04 Jan 2012 15:03

Something different.

Code: Select all

@echo off

::Determine file size
set "fSize=%~z1"

for /f "delims=" %%a in ('findstr /n . "%~1" ^|findstr "^1:"') do call :strlen "%%~a" len & goto :break

:break
for /f "tokens=2 delims=:" %%L in ('findstr /n /o "^" "%~1" ^|findstr "^2:"') do set "lnLen=%%L"&goto :break2

:break2
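::EOL size = full line length (offset of line 2) minus the data length of line 1
set /A "EOLsize=lnLen-len"
::Line count and leftover bytes, assuming every line has the same length
set /A "lines=fSize/lnLen"
set /A "EOF=fSize %% lnLen"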
echo %len%
echo %lnLen%
echo %EOLsize%
echo %lines%
echo %EOF%

exit /b

:strLen string len -- returns the length of a string
::                 -- string [in]  - variable name containing the string being measured for length
::                 -- len    [out] - variable to be used to return the string length
:: Many thanks to 'sowgtsoi', but also 'jeb' and 'amel27' dostips forum users helped making this short and efficient
:$created 20081122 :$changed 20101116 :$categories StringOperation
:$source http://www.dostips.com
(   SETLOCAL ENABLEDELAYEDEXPANSION
    set "str=A!%~1!"&rem keep the A up front to ensure we get the length and not the upper bound
                     rem it also avoids trouble in case of empty string
    set "len=0"
    for /L %%A in (12,-1,0) do (
        set /a "len|=1<<%%A"
        for %%B in (!len!) do if "!str:~%%B,1!"=="" set /a "len&=~1<<%%A"
    )
)
( ENDLOCAL & REM RETURN VALUES
    IF "%~2" NEQ "" SET /a %~2=%len%
)
EXIT /b


Output

Code: Select all

E:\batch files\HEAD>strlength.bat EST3_LF.txt
1026
1027
1
25012
0

E:\batch files\HEAD>strlength.bat EST3_CRLF.txt
1026
1028
2
25012
0

E:\batch files\HEAD>strlength.bat EOF_CRLF.txt
250
252
2
250067
1

E:\batch files\HEAD>strlength.bat EOF_LF.txt
250
251
1
250067
1

#24 Post by Squashman » 05 Jan 2012 15:04

I haven't run into a file yet that this hasn't worked on. The majority of the time our clients send in clean fixed length data, so it should work.
I should put in a check when doing the modulus for the EOF check. If it is greater than 1 then we know there is something wrong with the file.
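
A minimal sketch of that check, assuming fSize is the file size and lnLen is the full line length including EOL, as in the script above. The sample values are hard-coded from the EOF_CRLF.txt run, and the message text and exit code are made up for illustration.

```batch
@echo off
setlocal
:: Sample values from the EOF_CRLF.txt run above; in a real script these
:: would come from %~z1 and the FINDSTR /O offset of line 2.
set "fSize=63016885"
set "lnLen=252"

:: Bytes left over after dividing the file into whole fixed-length lines
set /a "remainder=fSize %% lnLen"

:: 0 = nothing after the last line, 1 = one extra byte (presumably EOF, 0x1A).
:: Anything larger means the file is not clean fixed-length data.
if %remainder% gtr 1 (
  >&2 echo WARNING: %remainder% leftover bytes - not clean fixed length data
  exit /b 1
)
echo Remainder = %remainder%
```

With these values the script prints "Remainder = 1". Note that this test alone treats a file whose last line is missing its EOL as bad data, because the remainder is then a whole record length.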

#25 Post by dbenham » 05 Jan 2012 22:56

I developed a very fast way to determine the record length and EOL characters for the 1st line. As written it supports a line length up to 32 kbytes. This limit can easily be made larger or smaller. The performance is largely independent of the file size :!: It works on a 1GB file in less than 1 second.

You should be able to incorporate this method into whatever math computations you want.

I still think you should allow for the last line to be missing EOL. Just because you haven't run into it yet, doesn't mean you won't see it in the future. And it doesn't take any longer to compute.

Code: Select all

@echo off
setlocal enableDelayedExpansion

::Build a dummy file with length 32kbytes to do a binary compare with
::This file could be made larger or smaller, depending on requirements
<nul set /p ".=A" >dummy.txt
for /l %%n in (1 1 15) do type dummy.txt >>dummy.txt

::Use FC /B to compare with dummy. Use FINDSTR to locate offset and hex representation of each CR or LF
::Use FOR /F to only look at the 1st two instances.
for /f "tokens=1,2 delims=: " %%A in ('fc /b "%~1" dummy.txt ^| findstr /r /c:": 0D 41$" /c:": 0A 41$"') do (
  if not defined eolOffset (
      set /a "eolOffset=0x%%A, next=eolOffset+1, eolSize=1"
      set "eol=%%B"
  ) else (
      set /a "eolOffset2=0x%%A"
      if "!eolOffset2!"=="!next!" if "!eol!" neq "%%B" (
        set "eol=!eol!%%B"
        set /a "eolSize+=1"
      )
      goto :break
  )
)
:break
set /a "recordLen=eolOffset, lnLen=recordLen+eolSize"
echo Record 1 length = %recordLen%
echo             eol = %eol%
echo         eolSize = %eolSize%
echo   Line 1 length = %lnLen%


Dave Benham

alan_b
Expert
Posts: 357
Joined: 04 Oct 2008 09:49

#26 Post by alan_b » 06 Jan 2012 01:58

A missing EOL is of no consequence - unless it is devastating :evil:

I inherited a project that used 'C' code and was being ported / adapted to various hardware platforms.

Compilation failed for one specific target and it took a lot of effort to track down.
The problem was that one particular 'C' code module needed various common Header declaration files,
and unfortunately the last header that was stipulated had no E.O.L on the last line.

That Header file never gave problems before because other header files were always included AFTER this one,
and the system had no trouble dealing with the last line of this header file being concatenated with the first line of the next header;
but the world came to an end when this was the last header and its final line was, in effect, concatenated with the first line of the 'C' file.

@Squashman
At the beginning you said :-
We have to send a file upload request to our Media Librarian to get a file onto the mainframe. When I send this request I need to tell them what the job name is, the job number, what mail copy the file will be going to, what the file name is, what the quantity is and what the record length is.


Apparently you may disregard the problem of a missing EOL because you never ran into it.
Perhaps your Media Librarian has also not YET run into it and that is why it is missing from his detailed specification requirements.

If the Media Librarian loses access to his library because your client omitted an E.O.L.,
how far and wide will the excrement fly, and will you be one of the recipients?

It may be worth getting written clarification of whether an E.O.L. may be omitted,
or a spare one appended,
or if there must be exactly one E.O.L. per record,
and also whether Clients may throw in any mixture of UNIX and/or DOS style E.O.L.'s.

A few simple precautions can avert some awful consequences :D

#27 Post by Squashman » 06 Jan 2012 05:47

Hi Alan, Thanks for your input but I got that covered as well.
A missing EOL on the last line does not affect a ftp upload to our mainframe. No big deal at all.

Now I can say that in 15 years of working in this job I have not seen mixed EOL but I have already seen it talked about on this forum. That is also an easily solved problem on the pc side or when uploading on the mainframe. Again this doesn't affect the FTP to the mainframe.

On the mainframe everything is fixed length fixed block files.
When you ftp a file to the mainframe you have to send a quote with the record length.
The record length is based on the actual data size which does not include the CR or LF. Now there are some other details that go into that quote command but I don't want to go into detail on that because it has no relevance to this script.

#28 Post by Squashman » 06 Jan 2012 06:04

Dave I do see your point though on the last line not having an EOL. My math will not work if that happens.
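
To make the failure concrete, here is a small sketch with made-up numbers: 250067 records of 250 bytes with CRLF, where only the final EOL is missing. The variable name recLen and the adjustment rule are illustrative assumptions, not part of either script above.

```batch
@echo off
setlocal
:: Hypothetical file: 250067 records of 250 bytes, CRLF, final EOL missing
set /a "recLen=250, eolSize=2, lnLen=recLen+eolSize"
set /a "fSize=250067*lnLen-eolSize"

:: Plain division drops the last, shorter line
set /a "lines=fSize/lnLen, remainder=fSize %% lnLen"
echo Naive count = %lines%, remainder = %remainder%

:: A remainder of exactly one bare record means the last line lacks its EOL
if %remainder% equ %recLen% set /a "lines+=1"
echo Adjusted count = %lines%
```

This prints a naive count of 250066 with a remainder of 250, then an adjusted count of 250067. A trailing EOF byte on top of the missing EOL would make the remainder recLen+1, which this sketch does not handle.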

#29 Post by Squashman » 06 Jan 2012 17:03

Wow Dave, you really outdid yourself. That is like instantaneous output on a 1.2GB file. Now I just need to Divide the Line Length into the File size and that should give the total lines.

I would have never thought of using FC and FINDSTR that way. In fact I just used FC for the first time just a few months ago.

#30 Post by dbenham » 06 Jan 2012 17:25

Yes - I'm happy with the performance. :D jeb introduced me to using FC in binary mode.

I see a potential problem if you ever get a file > 2GB because it exceeds the math capabilities of SET /A. If you ever get a file that large, I think your best bet (only native batch option?) will be the following to explicitly count the lines:

Code: Select all

for /f %%N in ('find /c /v "" ^<file') do set lnCnt=%%N

You definitely do not want to use any kind of pipe on a file that large.
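
The SET /A limit mentioned above can be seen directly. SET /A works with signed 32-bit integers, so the largest value it can hold is 2147483647 (one byte short of 2GB). A minimal sketch, with variable names chosen only for illustration:

```batch
@echo off
setlocal
:: Largest value SET /A can hold: 2^31 - 1
set /a "max=2147483647"

:: One more byte silently wraps around to a negative number
set /a "wrapped=max+1"

echo max     = %max%
echo wrapped = %wrapped%
```

The second line of output is -2147483648, which is why dividing the file size by the line length stops being trustworthy once the file size itself no longer fits.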

Dave Benham
