TruncateFile.exe is another one of my auxiliary programs of this type. When TruncateFile.exe is combined with FilePointer.exe, they may be used to solve a wider range of problems.
I wrote SplitFile.bat based on this method; the program splits a large text file into several smaller parts of a given number of lines each. SplitFile.bat is surprisingly fast because it uses the FilePointer.exe and TruncateFile.exe auxiliary programs only to delimit the data to be copied, and the FINDSTR command to perform the copy itself.
Code:
@echo off
rem SplitFile.bat: Split a large text file in parts of a given number of lines
rem Antonio Perez Ayala - 2015/01/31
rem This program requires FilePointer.exe and TruncateFile.exe auxiliary programs
rem Download them from: https://www.dropbox.com/sh/k7w69m4u8mhp3yg/AAAuluzR34AIpKA1rxXjNN8Sa?dl=0
if "%~2" neq "" goto begin
echo Split a large text file in parts of a given number of lines
echo/
echo SplitFile filename.ext numberOfLines
echo/
echo After the file has been split, it can be recovered with this command:
echo COPY filename_*.ext filename.ext /B
echo/
:rightNumber
echo The number of lines must have non-zero digits followed by zero digits,
echo like: 5000, 11000, 20000, etc.
echo/
echo/
echo ATTENTION! This program is *destructive*: it will remove the original file!
echo ========== You should copy the file before splitting it with this program.
goto :EOF
:begin
setlocal EnableDelayedExpansion
if not exist %1 echo File not found & goto :EOF
rem Decompose numOfLines: modBlock=leading non-zero digits, modLen=number of trailing zeros
rem Example: 20000 gives modBlock=2 and modLen=4 (modSize, set below, is then 0000)
set /A numOfLines=%2, modLen=0, continue=1
set "modBlock="
for /L %%i in (0,1,9) do if defined continue (
   set "digit=!numOfLines:~%%i,1!"
   if "!digit!" equ "" (
      set "continue="
   ) else if "!digit!" neq "0" (
      if !modLen! neq 0 goto badNumber
      set "modBlock=!modBlock!!digit!"
   ) else (
      set /A modLen+=1
   )
)
if "%modBlock%" equ "" goto badNumber
if %modLen% gtr 0 goto getModSize
:badNumber
echo Wrong number of lines
goto rightNumber
:getModSize
set "modSize=!numOfLines:~-%modLen%!"
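rem Get the offset of the line placed at the start of each part. FINDSTR /N /O prefixes every line
rem with lineNum:offset: and the second FINDSTR pre-filters the lines that contain modSize plus a colon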
set "start=%time%"
set /P "=Obtaining limits of all parts..." < NUL
set "lastOffset=1"
for /F "tokens=1,2 delims=:" %%a in ('findstr /N /O "^" %1 ^| findstr "%modSize%:"') do (
   set "lineNum=%%a"
   if "!lineNum:~-%modLen%!" equ "%modSize%" (
      set /A "mod=!lineNum:~0,-%modLen%! %% %modBlock%"
      if !mod! equ 0 (
         set /A lastOffset+=1
         set "offset[!lastOffset!]=%%b"
      )
   )
)
echo done.
set "split=%time%"
rem Open a code-block to process the file via redirected Stdin and Stdout
< %1 (
   rem Extract the parts in last-to-first order
   for /L %%i in (%lastOffset%,-1,2) do (
      rem Move Stdin file pointer to the start of this part
      FilePointer 0 !offset[%%i]!
      rem Copy from this point up to EOF to its own part file
      set "part=00%%i"
      set /P "=Creating part %~N1_!part:~-3!%~X1..." < NUL > CON
      findstr "^" > "%~N1_!part:~-3!%~X1"
      echo done.> CON
      rem Move Stdout file pointer to the start of this part
      FilePointer 1 !offset[%%i]!
      rem Truncate the file at this point: make it the new EOF
      TruncateFile 1
   )
) >> %1
rem Rename the last (first) part
set /P "=Creating part %~N1_001%~X1..." < NUL
ren %1 "%~N1_001%~X1"
echo done.
echo/
echo Start process at: %start%
echo Start splitting at: %split%
echo End splitting at: %time%
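For readers who want to see the last-to-first idea without the auxiliary programs, here is a minimal sketch in Python. It is not a translation of SplitFile.bat: the function name `split_in_place` is hypothetical, and the part boundaries are my own choice (a part ends after every `lines_per_part`-th line), so they are not guaranteed to match the Batch file line for line. Like the Batch file, it is destructive: the original file is cut down step by step and finally renamed to become part 001.

```python
import os
import shutil


def split_in_place(path, lines_per_part):
    """Split `path` into parts of `lines_per_part` lines, last part first.

    Destructive: the original file is truncated after each part is copied
    out, and what remains is renamed to <name>_001<ext>.
    """
    # Pass 1: record the byte offset where each part after the first begins.
    offsets = []
    position = 0
    with open(path, "rb") as src:
        for line_number, line in enumerate(src, start=1):
            position += len(line)
            if line_number % lines_per_part == 0:
                offsets.append(position)
    # A boundary exactly at end of file would only create an empty part.
    if offsets and offsets[-1] == position:
        offsets.pop()

    base, ext = os.path.splitext(path)
    names = []
    # Pass 2: copy the current tail to its own file, then cut it off.
    with open(path, "r+b") as f:
        for index in range(len(offsets), 0, -1):
            start = offsets[index - 1]
            f.seek(start)
            part_name = f"{base}_{index + 1:03d}{ext}"
            with open(part_name, "wb") as part:
                shutil.copyfileobj(f, part)   # copy from `start` up to EOF
            f.truncate(start)                 # `start` becomes the new EOF
            names.append(part_name)

    # Whatever is left of the original file is the first part.
    first_name = f"{base}_001{ext}"
    os.replace(path, first_name)
    names.append(first_name)
    return names[::-1]
```

Calling `split_in_place("data.txt", 20000)` would leave `data_001.txt`, `data_002.txt` and so on in place of `data.txt`. Because every part is copied from the current end of the file, each byte is copied only once and the original shrinks as the work proceeds.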
These are the timing results I obtained when I tested this program with some large files:
Code:
File size (lines)       Lines in part (parts)   Elapsed time in Min:Sec.cc
                                                (get limits + split = total)
100 MB  (3,003,000)      20,000 (151)           0:08.30 + 0:12.23 = 0:20.53
100 MB  (3,003,000)     100,000  (31)           0:07.95 + 0:04.09 = 0:12.04
100 MB  (3,003,000)     500,000   (7)           0:07.86 + 0:02.39 = 0:10.25
300 MB  (9,009,000)      20,000 (451)           0:27.22 + 0:43.66 = 1:10.88
300 MB  (9,009,000)     100,000  (91)           0:24.38 + 0:12.73 = 0:37.11
300 MB  (9,009,000)     500,000  (19)           0:24.32 + 0:08.08 = 0:32.40
600 MB (18,018,000)      20,000 (901)           0:56.59 + 1:30.43 = 2:27.02
600 MB (18,018,000)     100,000 (181)           0:54.00 + 0:36.71 = 1:30.71
600 MB (18,018,000)     500,000  (37)           0:54.31 + 0:19.21 = 1:13.52
1.2 GB (36,036,000)     500,000  (73)           2:10.31 + 0:54.61 = 3:04.92
Yes, the times above are correct. This Batch file takes a little more than 3 minutes to split a 1.2 GB file into 73 parts when run on my not-too-fast laptop!
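The part counts and totals in the table can be cross-checked with a little arithmetic. The sketch below is in Python and uses three of the rows; the helper names `to_seconds` and `part_count` are hypothetical, and the formula parts = ceil(lines / lines in part) is my assumption, chosen because it agrees with every row shown.

```python
import math


def to_seconds(stamp):
    """Convert a Min:Sec.cc string such as '3:04.92' to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


def part_count(total_lines, lines_per_part):
    """Assumed number of parts: the last part holds the remaining lines."""
    return math.ceil(total_lines / lines_per_part)


# (total lines, lines in part, parts, get limits, split, total), from the table
rows = [
    (3_003_000, 20_000, 151, "0:08.30", "0:12.23", "0:20.53"),
    (18_018_000, 100_000, 181, "0:54.00", "0:36.71", "1:30.71"),
    (36_036_000, 500_000, 73, "2:10.31", "0:54.61", "3:04.92"),
]

for total_lines, per_part, parts, limits, split, total in rows:
    parts_ok = part_count(total_lines, per_part) == parts
    # Compare with a small tolerance because the times are floating point.
    time_ok = abs(to_seconds(limits) + to_seconds(split) - to_seconds(total)) < 0.005
    print(total_lines, per_part, "parts ok:", parts_ok, "total ok:", time_ok)
```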
This solution has two limits: the file can be at most 2 GB, because that is the maximum offset allowed by the FilePointer.exe auxiliary program, and each line can be no longer than the maximum length allowed by the FINDSTR command (I don't know the specific value).
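Both limits can be checked before running the destructive split. The following is only a sketch in Python: `ASSUMED_MAX_OFFSET` reads "2 GB" as the largest signed 32-bit value, which is my assumption and not a documented property of FilePointer.exe, and the function names are hypothetical. Since the FINDSTR line limit is not known here, the second function only reports the longest line so it can be compared against whatever limit applies.

```python
import os

# Assumption: "2 GB" is taken to mean the largest signed 32-bit value.
ASSUMED_MAX_OFFSET = 2**31 - 1


def within_offset_limit(path, limit=ASSUMED_MAX_OFFSET):
    """Return True if the size of `path` in bytes does not exceed `limit`."""
    return os.path.getsize(path) <= limit


def longest_line_length(path):
    """Return the length in bytes of the longest line, terminator included."""
    longest = 0
    with open(path, "rb") as src:
        for line in src:
            longest = max(longest, len(line))
    return longest
```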
I would appreciate it if you could test this program and post the results, especially if you have large files that you currently split with another program, so we can compare the timings of both.
You may download FilePointer.exe and TruncateFile.exe auxiliary files from this site.
Antonio