Determining the number of lines in a file.

Posted: 02 Jan 2012 01:37
by Squashman
Background:
I work with fixed-field, fixed-length text files. This basically means every single line in the text file is the same length, and each field in the line has predefined starting and ending positions.
Name 1-30
Street 31-60
City 61-80
etc....
Most of the files I work with have a record length (line length) of hundreds of bytes and can have millions of records (lines) in the file.

Goal:
Quickly determine what the record (line) length is and how many records are in the file without using the FIND or FOR /F commands to parse the entire file. Using the /V and /C switches with the find command can take a long time and using the FOR /F command with a counter takes forever as well.

So my thought process is if I can read in the first line of a file using SET /P LINE1=<"%~1" and pass that off to the String Length function I can easily determine what the current line length is minus the CR/LF (more about that later). Now I can also get the file size by using the Variable Modifiers (%~z1) which should give me the total bytes of the file. Now that I know the line length and the file size I should be able to do some simple math to determine the total number of records in the file.
set /a records=%FileSize% / (%LineLength%+2)
I am adding 2 bytes to the line length for the CR/LF.
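
The same arithmetic can be sketched outside of batch as a cross-check. The Python below is only an illustration under the assumptions stated above: every record has the same length and every line, including the last, ends in CR/LF. The function name count_records is made up for this example.

```python
import os

def count_records(path, eol_len=2):
    """Estimate the record count of a fixed-length text file from its size.

    Assumes every line has the same length and ends with eol_len
    terminator bytes (2 for CR/LF), including the last line.
    Returns (line_length, record_count).
    """
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        first = f.readline()                 # reads only the first line
    line_len = len(first.rstrip(b"\r\n"))    # data bytes without the terminator
    return line_len, file_size // (line_len + eol_len)
```

For a file of ten 116-byte records terminated by CR/LF (1180 bytes), this returns (116, 10). Only the first line is read, so the run time does not grow with the number of records.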

Issues:
1. SET /P as we all know doesn't play nice with files that are only LF terminated.
2. Is there a way to determine if the end of line is LF or CRLF? I will need to know this for the true line length: do I add 1 or 2 bytes? I know there is the trick to getting a CR into a variable, but is there a way to use that as a search string and test the errorlevel to see if it is at the end of the line?
3. Been testing out the String Length function and it seems to be shorting me by one character.

Code:

@echo off
setlocal enableDelayedExpansion
REM This string is 116 bytes
set "str=Mr. & Mrs. John & Nancy Thompson    123 Any Street     AnyWhere    IL60054-1234      Mr. & Mrs. Thompson!  KEY123   "
call :strlen str len
echo %len%
exit /b

:strLen string len -- returns the length of a string
::                 -- string [in]  - variable name containing the string being measured for length
::                 -- len    [out] - variable to be used to return the string length
:: Many thanks to 'sowgtsoi', but also 'jeb' and 'amel27' dostips forum users helped making this short and efficient
:$created 20081122 :$changed 20101116 :$categories StringOperation
:$source http://www.dostips.com
(   SETLOCAL ENABLEDELAYEDEXPANSION
    set "str=A!%~1!"&rem keep the A up front to ensure we get the length and not the upper bound
                     rem it also avoids trouble in case of empty string
    set "len=0"
    for /L %%A in (12,-1,0) do (
        set /a "len|=1<<%%A"
        for %%B in (!len!) do if "!str:~%%B,1!"=="" set /a "len&=~1<<%%A"
    )
)
( ENDLOCAL & REM RETURN VALUES
    IF "%~2" NEQ "" SET /a %~2=%len%
)
EXIT /b

Output

Code:

C:\Users\Squash\batch files\String_Length>StrLength.bat
115

Now I don't often get data with exclamation points in it. We can pretty much go with the Pareto principle, though my 80/20 rule is more like 99/1: exclamation points and most other special characters are few and far between. Most of the data I get is just names and addresses and a few other number codes. If I change that exclamation point after the salutation to a space, it correctly counts the string length as 116.

Figuring out the LF end of line problem would also help me with another batch file that I use but that will probably be its own thread.

Any help is greatly appreciated.

Re: Determining the number of lines in a file.

Posted: 02 Jan 2012 10:59
by dbenham
Squashman wrote: 3. Been testing out the String Length function and it seems to be shorting me by one character.

The problem isn't strLen - it works reliably with any string whether delayed expansion is enabled or disabled. The problem is your test STR variable never receives the ! in the value because delayed expansion is enabled and the ! is not escaped in your SET command.

--------

Don't forget to take into account that the last line may or may not have the ending <CR><LF>.
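
The effect of a missing final <CR><LF> on the size arithmetic can be sketched as follows. This is a Python illustration, not part of the batch solution; line_count is a hypothetical helper, and it assumes the line length (without terminator) is already known.

```python
def line_count(file_size, line_len, eol_len=2):
    """Return (count, final_eol) for a fixed-length text file.

    final_eol is True when the last line carries its terminator and
    False when the file is exactly one terminator short.
    Raises ValueError when the size fits neither layout.
    """
    rec = line_len + eol_len
    if file_size % rec == 0:
        return file_size // rec, True
    if (file_size + eol_len) % rec == 0:
        return (file_size + eol_len) // rec, False
    raise ValueError("file size does not fit a fixed line length")
```

With 116-byte lines and CR/LF, a 1180-byte file gives (10, True) and a 1178-byte file gives (10, False); any other size in between is rejected.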

--------

I think this bit of code will give you what you want.

Code:

@echo off
setlocal

call :fileAttributes %1 && (
  set fSize
  set lnCnt
  set lnLen
  set finalEOL
)
exit /b

:FileAttributes File
::
:: Determines attributes of a Windows text file with fixed line length
::
:: returns
::   fSize = file size in bytes
::   lnCnt = line count
::   lnLen = line length in bytes
::   finalEOL = 0 if last line is missing EOL
::              1 if last line has EOL
::
:: Sets ERRORLEVEL > 0 if error
::   1 File not found
::   2 File too small (must be > 1021 bytes long)
::   3 First line is empty
::   4 Line length too long or invalid line format
::   5 Invalid file format

setlocal disableDelayedExpansion
  set "fsize="
  for %%F in ("%~1") do set fsize=%%~zF
  if not defined fsize >&2 echo ERROR: File "%~1" not found&exit /b 1
  if %fsize% leq 1021 >&2 echo ERROR: File too small to test reliably&exit /b 2
  set ln=
  <"%~1" set /p "ln="
  if not defined ln >&2 echo ERROR: First line is empty&exit /b 3
  call :strLen ln lnLen
  if %lnLen% gtr 1021 >&2 echo ERROR: Line length >1021 or invalid line format&exit /b 4
  set /a "lnLen+=2, lnCnt=fSize/lnLen, fSize2=lnCnt*lnLen, fSize3=((lnCnt+1)*lnLen)-2"
  if %fSize2% neq %fSize% (
    if %fSize3% neq %fSize% >&2 echo ERROR: Invalid file format&exit /b 5
    set /a "lnCnt+=1, finalEOL=0"
  ) else set finalEOL=1
  endlocal & (
    set fSize=%fsize%
    set lnLen=%lnLen%
    set lnCnt=%lnCnt%
    set finalEOL=%finalEOL%
  )
exit /b 0

:strLen string len -- returns the length of a string
::                 -- string [in]  - variable name containing the string being measured for length
::                 -- len    [out] - variable to be used to return the string length
:: Many thanks to 'sowgtsoi', but also 'jeb' and 'amel27' dostips forum users helped making this short and efficient
:$created 20081122 :$changed 20101116 :$categories StringOperation
:$source http://www.dostips.com
(   SETLOCAL ENABLEDELAYEDEXPANSION
    set "str=A!%~1!"&rem keep the A up front to ensure we get the length and not the upper bound
                     rem it also avoids trouble in case of empty string
    set "len=0"
    for /L %%A in (12,-1,0) do (
        set /a "len|=1<<%%A"
        for %%B in (!len!) do if "!str:~%%B,1!"=="" set /a "len&=~1<<%%A"
    )
)
( ENDLOCAL & REM RETURN VALUES
    IF "%~2" NEQ "" SET /a %~2=%len%
)
EXIT /b


I'm curious what you are going to do with the files after you get this bit of info. If you need to read and/or manipulate such large files, I strongly recommend you use something other than batch. PowerShell, VB Script, 3rd party tools like GNU Utils for Win32... anything would be better than batch; especially if you are already uncomfortable with waiting for FIND to count the number of lines in the file.

Dave Benham

Re: Determining the number of lines in a file.

Posted: 02 Jan 2012 11:23
by Squashman
We actually have data processing software which all runs on our mainframe. But sometimes I like to do small tasks on the PC before sending the file to the mainframe. We have strict process control for getting files on and off the mainframe. We have to send a file upload request to our Media Librarian to get a file onto the mainframe. When I send this request I need to tell them what the job name is, the job number, what mail copy the file will be going to, what the file name is, what the quantity is and what the record length is. Just trying to automate this process a little more. Sometimes we can get up to 40+ input files for a job and sending out that file upload request can be a real pain in the keester.

Re: Determining the number of lines in a file.

Posted: 02 Jan 2012 11:28
by Squashman
Can't say I have ever had the issue with the last line not having the CR/LF, but I do have two clients who like to send in their data with the last line being an End of File character. Can't recall what the hex code is for that character. Not at work today.
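
The character in question is most likely Ctrl-Z (hex 1A), the old DOS end-of-file marker; that is an assumption about these clients' files, not something confirmed in this thread. A minimal Python sketch for checking whether a file ends with that byte, using a made-up helper name:

```python
import os

def ends_with_ctrl_z(path):
    """Return True if the last byte of the file is 0x1A (Ctrl-Z)."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)       # jump to the end to learn the size
        if f.tell() == 0:
            return False             # empty file, nothing to inspect
        f.seek(-1, os.SEEK_END)      # step back one byte
        return f.read(1) == b"\x1a"
```

Only the final byte is read, so the check costs the same for a small file as for one with millions of records.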

Re: Determining the number of lines in a file.

Posted: 02 Jan 2012 12:25
by Squashman
So if I change the line length comparison to a number larger than 1021, will this batch file not work? I probably should have been more specific in my initial post. The majority of my clients' data is usually well over 1021 bytes for the record length.

Re: Determining the number of lines in a file.

Posted: 02 Jan 2012 13:07
by dbenham
Damn.

SET /P can't read more than 1021 chars per line. So you are out of luck. You would have to use a FOR loop to read the 1st line, and it is fairly slow with large files. At this point you are probably better off using FIND to count the number of lines.

It's a shame, I just worked out a variation that will work with EOL=<LF> as long as line length <=1021.

It also handles EOF at end of file (though this hasn't been tested yet).

Code:

@echo off
setlocal

call :fileAttributes %1 && (
  set fSize
  set lnCnt
  set lnLen
  set EOLSize
  set finalEOL
  set EOFSize
)
exit /b

:FileAttributes File
::
:: Determines attributes of a text file with fixed line length
::
:: Lines can be terminated with <LF>, <CR><LF>, or <LF><CR>
::
:: This script can only handle lines up to length 1021
::
:: returns
::   fSize = file size in bytes
::   lnCnt = line count
::   lnLen = line length in bytes
::   EOLSize = number of characters used to signify EOL
::   finalEOL = 0 if last line is missing EOL
::              1 if last line has EOL
::              undefined if not able to determine
::   EOFSize = 0 if file is NOT terminated by EOF
::             1 if file IS terminated by EOF
::             undefined if not able to determine
::
:: Sets ERRORLEVEL > 0 if error
::   1 File not found
::   2 File too small to test reliably (must be >1021)
::   3 First line is empty
::   4 Line length > 1021
::   5 Invalid file format
  setlocal disableDelayedExpansion
  set lf=^


  ::above 2 blank lines are required - do not remove
  set xlf=^^^%lf%%lf%^%lf%%lf%
  set "fsize="
  for %%F in ("%~1") do set fsize=%%~zF
  if not defined fsize >&2 echo ERROR: File "%~1" not found&exit /b 1
  if %fsize% leq 1021 >&2 echo ERROR: File too small to test reliably&exit /b 2
  set "ln="
  <"%~1" set /p "ln="
  if not defined ln >&2 echo ERROR: First line is empty&exit /b 3
  set "ln2="
  setlocal enableDelayedExpansion
  for /f eol^=%xlf%%xlf%^ delims^= %%L in ("!ln!") do if "!!"=="" endlocal&set "ln2=%%L"
  if not defined ln2 >&2 echo ERROR: First line is empty&exit /b 3
  setlocal enableDelayedExpansion
  if "!ln!" neq "!ln2!" (
    set EOLSize=1
    set "ln=!ln2!"
  ) else set EOLSize=2
  call :strLen ln lnLen
  if %lnLen% gtr 1021 >&2 echo ERROR: Line length >1021&exit /b 4
  set /a "lnLen+=EOLSize, lnCnt=fSize/lnLen, fSize2=lnCnt*lnLen, fSize3=((lnCnt+1)*lnLen)-EOLSize, fSize2a=fSize2+1, fSize3a=fSize3+1"
  if %fSize2%==%fSize% (
      if %EOLSize%==2 (
          set finalEOL=1
          set EOFSize=0
      ) else (
          set "finalEOL="
          set "EOFSize="
      )
  ) else if %fSize2a%==%fSize% (
      set finalEOL=1
      set EOFSize=1
  ) else if %fSize3%==%fSize% (
      set /a "lnCnt+=1"
      set finalEOL=0
      set EOFSize=0
  ) else if %fSize3a%==%fSize% (
      set /a "lnCnt+=1"
      set finalEOL=0
      set EOFSize=1
  ) else >&2 echo ERROR: Invalid file format&exit /b 5
  endlocal & endlocal & (
    set fSize=%fsize%
    set lnLen=%lnLen%
    set lnCnt=%lnCnt%
    set EOLSize=%EOLSize%
    set finalEOL=%finalEOL%
    set EOFSize=%EOFSize%
  )
exit /b 0

:strLen string len -- returns the length of a string
::                 -- string [in]  - variable name containing the string being measured for length
::                 -- len    [out] - variable to be used to return the string length
:: Many thanks to 'sowgtsoi', but also 'jeb' and 'amel27' dostips forum users helped making this short and efficient
:$created 20081122 :$changed 20101116 :$categories StringOperation
:$source http://www.dostips.com
(   SETLOCAL ENABLEDELAYEDEXPANSION
    set "str=A!%~1!"&rem keep the A up front to ensure we get the length and not the upper bound
                     rem it also avoids trouble in case of empty string
    set "len=0"
    for /L %%A in (12,-1,0) do (
        set /a "len|=1<<%%A"
        for %%B in (!len!) do if "!str:~%%B,1!"=="" set /a "len&=~1<<%%A"
    )
)
( ENDLOCAL & REM RETURN VALUES
    IF "%~2" NEQ "" SET /a %~2=%len%
)
EXIT /b


Dave Benham

Re: Determining the number of lines in a file.

Posted: 02 Jan 2012 13:13
by Squashman
Dave, Thanks for your hard work! Sorry I wasn't up front about my requirements.

Re: Determining the number of lines in a file.

Posted: 02 Jan 2012 19:38
by Squashman
What if we did something like this in a function.

Code:

findstr /n . "testfile.txt" | findstr "^1:" &GOTO :EOF

When you execute this code it outputs the first line pretty quickly and then exits. Not sure if this can be implemented into a For Loop and still get the first line captured. I was testing with a 1025 byte record length and 100,000 lines in the file.

EDIT:
This seems to work. It pauses briefly but gets the 1st line. Have not tested with Unix text file yet though.

Code:

@echo off
for /f "delims=" %%a in ('findstr /n . "%~1" ^| findstr "^1:"') do set "record=%%a" & goto :break

:break
set record=%record:~2%
echo %record%

Re: Determining the number of lines in a file.

Posted: 03 Jan 2012 13:59
by Squashman
Getting a little further along now that I am at work.

I am now testing with a file of 1026 bytes and it seems to work regardless of the EOL being a CRLF or just a LF. Problem is I am never stripping the Number and Colon that the FINDSTR command adds yet my length count comes out to 1026. I have no clue why it does that. I would think it should come out to 1028?

I haven't really completely read thru Dave's code to see if he figured out how to tell if the EOL is CRLF or LF to figure out the true line length so that we can get the total number of lines in the file based on the files size in bytes.

Code:

@echo off

for /f "delims=" %%a in ('findstr /n . "%~1" ^|findstr "^1:"') do call :strlen "%%~a" len & goto :break

:break
echo %len%
exit /b

:strLen string len -- returns the length of a string
::                 -- string [in]  - variable name containing the string being measured for length
::                 -- len    [out] - variable to be used to return the string length
:: Many thanks to 'sowgtsoi', but also 'jeb' and 'amel27' dostips forum users helped making this short and efficient
:$created 20081122 :$changed 20101116 :$categories StringOperation
:$source http://www.dostips.com
(   SETLOCAL ENABLEDELAYEDEXPANSION
    set "str=A!%~1!"&rem keep the A up front to ensure we get the length and not the upper bound
                     rem it also avoids trouble in case of empty string
    set "len=0"
    for /L %%A in (12,-1,0) do (
        set /a "len|=1<<%%A"
        for %%B in (!len!) do if "!str:~%%B,1!"=="" set /a "len&=~1<<%%A"
    )
)
( ENDLOCAL & REM RETURN VALUES
    IF "%~2" NEQ "" SET /a %~2=%len%
)
EXIT /b

Re: Determining the number of lines in a file.

Posted: 03 Jan 2012 15:40
by alan_b
Squashman wrote: Can't say I have ever had the issue with the last line not having the CR/LF, but I do have two clients who like to send in their data with the last line being an End of File character. Can't recall what the hex code is for that character. Not at work today.


Data controlled by clients is dodgy.

I have been working on a multi-user collaboration with nearly 1000 entries. Most of the entries were CR/LF (DOS) terminated, but some were Unix, and those could not be processed.

My solution was to de-Unix the original with

Code:

REM Convert Unix (0x0A) contributions to DOS (0x0d,0x0A)
MORE %FILE% > #_DOS_%FILE%
MORE #_DOS_%FILE% > %FILE%


I do not know what MORE would do to E.O.F. (Ctrl Z).
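
For comparison, the same Unix-to-DOS conversion can be sketched in Python. This is an illustration only, not a description of what MORE does internally; unix_to_dos is a hypothetical name, and the function works on data that fits in memory.

```python
def unix_to_dos(data: bytes) -> bytes:
    """Convert bare LF terminators to CR/LF.

    Existing CR/LF pairs are first normalised to LF so they are not
    turned into CR/CR/LF by the second replacement.
    """
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")
```

A mixed input such as b"a\nb\r\nc" comes out as b"a\r\nb\r\nc". A trailing Ctrl-Z byte is passed through unchanged.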

I remember the horror of using EDLIN under DOS 3.3?
and the joy when I got Word Star.

Re: Determining the number of lines in a file.

Posted: 03 Jan 2012 15:50
by dbenham
MORE appends <CR><LF> after any terminating Ctrl-Z.

Dave Benham

Re: Determining the number of lines in a file.

Posted: 03 Jan 2012 15:53
by dbenham
If you don't know the length of EOL then you can't determine the line count when all you know is the total file size and the (line length - EOL length).

If you also use FIND /C to get the total number of lines then you should be able to determine if the EOL is LF or CRLF. But if EOL is LF then you can't tell the difference between a file that is missing the final LF but is terminated with EOF, vs a file that has the final LF and is not terminated with EOF.
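
The ambiguity can be shown with a little arithmetic. The Python sketch below computes the size of each layout; layout_size is a made-up helper, and the EOF marker is assumed to be a single byte.

```python
def layout_size(lines, line_len, eol_len, final_eol, eof):
    """Size in bytes of a fixed-length file with the given layout."""
    terminators = lines if final_eol else lines - 1
    return lines * line_len + terminators * eol_len + (1 if eof else 0)

# LF files: missing final LF plus EOF is the same size as final LF without EOF.
lf_missing_eol_with_eof = layout_size(2, 10, 1, final_eol=False, eof=True)
lf_final_eol_no_eof = layout_size(2, 10, 1, final_eol=True, eof=False)

# CR/LF files: the same two layouts differ by one byte.
crlf_missing_eol_with_eof = layout_size(2, 10, 2, final_eol=False, eof=True)
crlf_final_eol_no_eof = layout_size(2, 10, 2, final_eol=True, eof=False)

print(lf_missing_eol_with_eof, lf_final_eol_no_eof)      # 22 22
print(crlf_missing_eol_with_eof, crlf_final_eol_no_eof)  # 23 24
```

With a one-byte EOL the single EOF byte exactly replaces the missing LF, so the file size alone cannot separate the two cases.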

Dave Benham

Re: Determining the number of lines in a file.

Posted: 03 Jan 2012 18:35
by dbenham
OK - here is a version that uses FOR /F instead of SET /P. It supports a maximum line length of 8191 (as does any other file processing in Windows batch).

It could have been written without using a temp file, but that would require a pipe, and using a temp file is significantly faster than a pipe when working with large files.

Obviously it can give a false result if the line lengths are not constant and the math happens to balance. Other than that I think it should be reliable.

Edit - Fixed stupid bug (interpreting the line offset as hexadecimal when really decimal) and added an error check for line length < 4.

Code:

@echo off
setlocal

call :fileAttributes %1 && (
  set fSize
  set lnCnt
  set lnLen
  set EOLSize
  set finalEOL
  set EOFSize
)
exit /b

:FileAttributes File
::
:: Determines attributes of a text file with fixed line length
::
:: Lines can be terminated with <LF> or <CR><LF>
::
:: This script can only handle lines up to length 8191
::
:: returns
::   fSize = file size in bytes
::   lnCnt = line count
::   lnLen = line length in bytes
::   EOLSize = number of characters used to signify EOL
::   finalEOL = 0 if last line is missing EOL
::              1 if last line has EOL
::              undefined if not able to determine
::   EOFSize = 0 if file is NOT terminated by EOF
::             1 if file IS terminated by EOF
::             undefined if not able to determine
::
:: Sets ERRORLEVEL > 0 if error
::   1 File not found
::   2 First line is empty
::   3 Too few lines (need at least 2)
::   4 Line length too small (must be at least 4)
::   5 Invalid file format
::
  setlocal disableDelayedExpansion
  set "fsize="
  for %%F in ("%~1") do set fsize=%%~zF
  if not defined fsize >&2 echo ERROR: File "%~1" not found&exit /b 1
  set tmpFile="%temp%\FileAttributes%random%.tmp"
  >%tmpFile% findstr /o "^" "%~1"
  set "ln1="
  set "ln2="
  for /f "usebackq delims=" %%L in (%tmpFile%) do (
    if not defined ln1 (set "ln1=%%L") else set "ln2=%%L"&goto :break
  )
  :break
  del %tmpfile%
  setlocal enableDelayedExpansion
  set "ln1=!ln1:*:=!"
  if not defined ln1 >&2 echo ERROR: First line is empty&exit /b 2
  if not defined ln2 >&2 echo ERROR: Too few lines (need at least 2)&exit /b 3
  call :strLen ln1 lnLen
  if %lnLen% lss 4 >&2 echo ERROR: Line length too small (must be at least 4)&exit /b 4
  for /f "delims=:" %%A in ("!ln2!") do set /a "EOLSize=%%A-lnLen"
  set /a "lnLen2=lnLen+EOLSize, lnCnt=fSize/lnLen2, fSize2=lnCnt*lnLen2, fSize3=((lnCnt+1)*lnLen2)-EOLSize, fSize2a=fSize2+1, fSize3a=fSize3+1"
  if %fSize2%==%fSize% (
      if %EOLSize%==2 (
          set finalEOL=1
          set EOFSize=0
      ) else (
          set "finalEOL="
          set "EOFSize="
      )
  ) else if %fSize2a%==%fSize% (
      set finalEOL=1
      set EOFSize=1
  ) else if %fSize3%==%fSize% (
      set /a "lnCnt+=1"
      set finalEOL=0
      set EOFSize=0
  ) else if %fSize3a%==%fSize% (
      set /a "lnCnt+=1"
      set finalEOL=0
      set EOFSize=1
  ) else (
      >&2 echo ERROR: Invalid file format&exit /b 5
  )
  endlocal & endlocal & (
    set "fSize=%fsize%"
    set "lnLen=%lnLen%"
    set "lnCnt=%lnCnt%"
    set "EOLSize=%EOLSize%"
    set "finalEOL=%finalEOL%"
    set "EOFSize=%EOFSize%"
  )
exit /b 0

:strLen string len -- returns the length of a string
::                 -- string [in]  - variable name containing the string being measured for length
::                 -- len    [out] - variable to be used to return the string length
:: Many thanks to 'sowgtsoi', but also 'jeb' and 'amel27' dostips forum users helped making this short and efficient
:$created 20081122 :$changed 20101116 :$categories StringOperation
:$source http://www.dostips.com
(   SETLOCAL ENABLEDELAYEDEXPANSION
    set "str=A!%~1!"&rem keep the A up front to ensure we get the length and not the upper bound
                     rem it also avoids trouble in case of empty string
    set "len=0"
    for /L %%A in (12,-1,0) do (
        set /a "len|=1<<%%A"
        for %%B in (!len!) do if "!str:~%%B,1!"=="" set /a "len&=~1<<%%A"
    )
)
( ENDLOCAL & REM RETURN VALUES
    IF "%~2" NEQ "" SET /a %~2=%len%
)
EXIT /b
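
The core idea of the script above, deriving the EOL size from the byte offset of the second line, can be sketched in Python for comparison. This is an illustration under the same fixed-line-length assumption, not a port of the batch code; first_line_and_eol is a hypothetical name.

```python
def first_line_and_eol(path):
    """Return (line_len, eol_len) from the first line of a text file.

    The offset at which the second line starts plays the role of the
    offset that findstr /o prints in front of line 2.
    """
    with open(path, "rb") as f:
        first = f.readline()
        second_offset = f.tell()             # byte offset of line 2
    line_len = len(first.rstrip(b"\r\n"))    # data bytes only
    return line_len, second_offset - line_len
```

For 116-byte lines this gives (116, 2) on a CR/LF file and (116, 1) on an LF file.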


Dave Benham

Re: Determining the number of lines in a file.

Posted: 03 Jan 2012 22:03
by Squashman
Hi Dave,
I will give this a try at work tomorrow. Lots of code to look at and I want to make sure I understand it.

I was just thinking about using the modulus to figure out if the record length is correct.
set /A remainder=%filesize% %% (%linelength%+2)
(Inside a batch file the modulus operator has to be written as %%; at the command prompt a single % works, as in the examples below.)
example: set /A temp=100000%(998+2)
This would set the value of temp to 0.
If the value is zero then we would know the EOL was a CRLF.

example: set /A temp=100001%(998+2)
This would set the value of temp to 1.
If the value of temp is 1 then we would know that the EOL was a CRLF and there is an EOF character at the end of the file as well.

Just something that popped into my brain. Not even sure if this is fool proof or not.
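
The modulus idea can be sketched in Python as follows. It is only a consistency check: a remainder of 0 shows the size fits the assumed EOL length, but an LF file could produce the same remainder by coincidence. The name classify_remainder is made up, the EOF marker is assumed to be one byte, and line_len is assumed to be greater than 1.

```python
def classify_remainder(file_size, line_len, eol_len=2):
    """Interpret file_size modulo the full record length."""
    rec = line_len + eol_len
    r = file_size % rec
    cases = {
        0: "final EOL present, no EOF character",
        1: "final EOL present, EOF character present",
        rec - eol_len: "final EOL missing, no EOF character",
        rec - eol_len + 1: "final EOL missing, EOF character present",
    }
    return cases.get(r, "size does not fit this record length")

print(classify_remainder(100000, 998))  # final EOL present, no EOF character
print(classify_remainder(100001, 998))  # final EOL present, EOF character present
```

The last two dictionary entries cover a file whose final line has no terminator, which the two examples in the post do not consider.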

Re: Determining the number of lines in a file.

Posted: 04 Jan 2012 07:34
by Squashman
I gave it a try with a very small file (in my world) of 25,012 lines and a line length of 1026. I ran it with an EOL of LF and received the ERROR: Invalid file format. I then ran the file through an EOL Converter and changed the EOL to CRLF. Ran the batch file again and received the same ERROR: Invalid file format.

Should also note that this test file does not have an EOF which should be a good thing and the last line does have an EOL.

I will try and debug it later on today.