Processing text files with very large lines via FOR /F command
Posted: 27 Mar 2017 00:07
Operation rules of the FOR /F command regarding the "tokens=..." option:
- The maximum number of tokens in a FOR /F command is 32, including the final "rest of tokens" one: "tokens=1-31*".
- The maximum length of a token is limited by the maximum length of the command line that uses it, which is 8191 bytes. Shorter commands allow longer tokens.
- The maximum number of tokens in a line of a text file is 4126, when every token has just one character and the command that processes the final "rest of tokens" token is not too long. If the tokens after the 31st one are longer, their maximum number decreases accordingly, because the "rest of tokens" token must always fit in its 8191-byte command line.
- The maximum length of a line of a text file may be close to 261,000 bytes, when all 32 possible tokens have a length close to 8191 bytes.
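As a minimal illustration of the "rest of tokens" rule, the sketch below splits a short sample string. The sample data and the comma delimiter are assumptions made for this example only; they are not part of the program discussed in this topic. Only the first three tokens are numbered, and everything after them lands in the last one.

```batch
@echo off
rem Sample data and the comma delimiter are illustrative assumptions
for /F "tokens=1-3* delims=," %%a in ("one,two,three,four,five") do (
   echo Token 1: %%a
   echo Token 2: %%b
   echo Token 3: %%c
   echo Rest:    %%d
)
```

The last line shows "four,five": the delimiters inside the "rest of tokens" part are kept as they are. With "tokens=1-31*" the same happens from the 32nd token onwards.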
The method used to determine these rules was explained in detail in this post; you should read it before continuing with this topic.
The FOR /F operation rules indicate that it is possible to process a text file that contains very long lines, up to approximately 261,000 characters. However, to make good use of this ability the data file needs certain adjustments, because it is unlikely that the data already has the format that FOR /F requires. This topic shows an example of the management needed to store such an amount of data in each line of a text file. The same approach may be used in any other application, as long as the required modifications can be applied to the data file.
The example program works on a file called "books.txt" that stores a "book" of up to 260,800 bytes in each line. The file is organized as follows:
- The original lines of each "book" are grouped into "fields". Within a field, each line is separated from the next one by character 254 (þ). Any number of lines may be grouped in one field, up to a limit of 8150 bytes per field.
- The fields of a book are stored in one line of the books.txt file, separated by character 255 (ÿ). There may be up to 32 fields in each line of the file, so the maximum number of bytes per book is 260,800 (32 x 8150), including the separators between lines and fields.
- Each book (physical line) in the books.txt file is terminated by a <CR><LF> character pair, as usual. The maximum number of books in the file is limited by the maximum file size specified by the OS (2 GB).
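The arithmetic behind these limits can be checked with a couple of SET /A operations. This is only a sketch: the variable names are made up for the example, and the second figure is an upper bound that assumes every token could use the whole 8191 bytes, which the text of the command itself prevents.

```batch
@echo off
rem Limits taken from the text: 32 fields per line, 8150 bytes per field,
rem and 8191 bytes as the command-line limit that bounds a single token
set /A "maxBook=32*8150, ceiling=32*8191"
echo Bytes per book in this format: %maxBook%
echo Upper bound per line:          %ceiling%
```

The results are 260800 and 262112, so the 8150-byte field limit leaves a margin of 41 bytes per field for the command that processes each token.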
In schematic form:
Code:
Field 1:    Line one.þLine two.þLine three.þEt cetera.    Up to 8150 bytes
Book 1:     Field 1ÿField 2ÿField 3ÿEt cetera             Up to 32 fields
books.txt   Book 1<CR><LF>Book 2<CR><LF>Etc<CR><LF>       Up to 2 GB size
The program performs all the conversions required to manage this file format. For example, to consult a "book" the program reads one line from the books.txt file, splits it into the original book lines, writes them to an individual file and opens that file with Notepad, so the user can review and edit it. If the book was modified, the program regroups the lines of the individual file into fields and stores them back in one line of books.txt; the book is updated in place, so its original position is preserved. A new book may also be added.
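The step that splits one field back into its lines can be tried in isolation with the sketch below. It follows the same "tokens=1*" loop idea used by the program, but the sample text and the "#" separator are assumptions chosen so the example can be pasted regardless of the code page; the real file uses þ.

```batch
@echo off
setlocal EnableDelayedExpansion
rem Sample field; "#" stands in for the line separator (illustrative assumption)
set "field=Line one.#Line two.#Line three."
:nextLine
rem Take the first line in token 1 and keep the remainder for the next pass
for /F "tokens=1* delims=#" %%a in ("!field!") do (
   echo %%a
   set "field=%%b"
)
if defined field goto nextLine
```

Each pass shows one line and shortens the field until nothing is left. Note that FOR /F treats consecutive delimiters as a single one, so empty lines are not preserved by this method.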
Code:
@echo off
setlocal EnableDelayedExpansion

:nextBook

rem Show available books and let the user select one
cls
echo/
echo Available books:
echo/
set i=1
for /F "delims=þ" %%a in (books.txt) do (
   echo !i!- %%a
   set "book[!i!]=%%a"
   set /A i+=1
)
echo/
set /P "book=Enter book number (%i% to add a new book): "
if errorlevel 1 goto :EOF
if "%book%" equ "%i%" (
   set /P "book[%book%]=Enter name of new book: "
   echo !book[%book%]!þEnter new book contents here>> books.txt
)
if not defined book[%book%] echo No such book & goto endBook

rem Extract the book into an individual file and show it in Notepad
set "name=!book[%book%]!"
set /A skip=book-1
if %skip% gtr 0 (set "skip=skip=%skip%") else set "skip="
(for /F "%skip% tokens=1-31* delims=ÿ" %%@ in (books.txt) do call :ReadBook & goto continue) > "%name%.txt"
:continue
echo/
echo -^> Editing book: "%name%"
(
   copy "%name%.txt" "%name%.bak"
   notepad "%name%.txt" | pause
   fc "%name%.txt" "%name%.bak"
   set "errLevel=!errorlevel!"
   del "%name%.bak"
) > NUL
if %errLevel% equ 0 goto endBook

rem Copy all books in books.txt file, update this one
< NUL (
   rem Process all lines in books.txt file, with 32 possible tokens each
   set "i=0"
   for /F "tokens=1-31* delims=ÿ" %%@ in (books.txt) do (
      set /A i+=1
      if !i! neq %book% (
         rem Copy up to 32 fields of other books
         set "tokens=@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_"
         set "field=x"
         for /L %%i in (1,1,32) do if defined field (
            call :ReadField field="!tokens:~0,1!"
            if defined field set /P "=!field!ÿ"
            set "tokens=!tokens:~1!"
         )
         echo/
      ) else (
         rem Save the modified book in its original place
         ECHO Splitting file in fields . . . > CON
         set "field=þ!book[%book%]!"
         call :strLen field
         set /A "currentLen=len-1"
         SET "j=1"
         SET /P "=Field #!j!: !currentLen!, " > CON
         for /F "usebackq delims=" %%a in ("%name%.txt") do (
            set "fieldNew=þ%%a"
            call :strLen fieldNew
            set /A "newLen=currentLen+len"
            if !newLen! lss 8150 (
               set "field=!field!!fieldNew!"
               set /A "currentLen+=Len"
               SET /P "=!currentLen!, " > CON
            ) else (
               set /P "=!field:~1!ÿ"
               set "field=!fieldNew!"
               set /A "currentLen=len"
               ECHO/> CON
               ECHO/> CON
               SET /A j+=1
               SET /P "=Field #!j!: !currentLen!, " > CON
            )
         )
         echo(!field:~1!
         ECHO/> CON
      )
   )
) > books.tmp
(
   del "%name%.txt"
   move /Y books.tmp books.txt
) > NUL
echo/
echo Modified book stored in books.txt file

:endBook
echo/
pause
goto nextBook
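

rem Get the length of the variable named in the first parameter via binary search
rem The result is left in "len"; lengths up to 8191 are supported
:strLen strvar
set "str=0!%~1!"
set "len=0"
for /L %%a in (12,-1,0) do (
   set /A "newLen=len+(1<<%%a)"
   for %%b in (!newLen!) do if "!str:~%%b,1!" neq "" set "len=%%b"
)
exit /B


rem Store in the variable named in the first parameter the value of the FOR /F token
rem whose letter is given in the second one; the dummy FOR makes the caller's tokens accessible here
:ReadField field="token"
for %%. in (.) do set "%1=%%%~2"
exit /B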


rem Show the lines of the current book in stdout, one token (field) at a time
rem The first line is the book name, so it is not included in the output
:ReadBook
setlocal EnableDelayedExpansion
set "tokens=@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_"
set "firstLine=1"
SET /P "=Reading book fields: " < NUL > CON
:nextField
   SET /P "=%tokens:~0,1%, " < NUL > CON
   for %%. in (.) do set "field=%%%tokens:~0,1%"
   if not defined field goto endFields
   :nextLine
      for /F "tokens=1* delims=þ" %%a in ("!field!") do (
         if not defined firstLine echo(%%a
         set "firstLine="
         set "field=%%b"
      )
   if defined field goto nextLine
   set "tokens=%tokens:~1%"
if defined tokens goto nextField
:endFields
ECHO/> CON
ECHO/> CON
exit /B
To start using this program, create the books.txt file with just one simple book. The name of the book is stored as the first line of the book. For example:
books.txt
Code:
Name of bookþFirst line of book.þSecond line of book.
This program is just a proof of concept; it lacks many details that would be needed to turn it into a fully working and robust application.
Antonio