SplitFile.bat: Split large text files in a very fast way

#1 Post by Aacini » 01 Feb 2015 14:16

My FilePointer.exe auxiliary program can be used to solve certain specific tasks in a simpler and faster way than a pure Batch solution. An example of this method is this solution posted on Stack Overflow.

TruncateFile.exe is another one of my auxiliary programs of this type. When TruncateFile.exe is combined with FilePointer.exe, they may be used to solve a wider range of problems.

I wrote SplitFile.bat based on this method; the program splits a large text file into several smaller parts of a given number of lines each. SplitFile.bat is surprisingly fast because it uses the FilePointer.exe and TruncateFile.exe auxiliary programs only to delimit the data to be copied, and the FINDSTR command to perform the copy itself.
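
The idea behind the method can be illustrated outside of Batch. The following is a minimal Python sketch, not the code of FilePointer.exe or TruncateFile.exe: it assumes the byte offsets where each part starts are already known, and the function name split_last_to_first is made up for this illustration. Working from the last part back to the first, it copies the tail of the file to a part file and then truncates the original at that offset, so no byte is copied more than once and the first part is never copied at all.

```python
import os
import shutil
import tempfile


def split_last_to_first(path, offsets):
    """Split `path` at the given byte offsets, working from the last part back.

    Each tail is copied to its own part file and then cut off the original,
    so the original shrinks until only the first part is left in it.
    Like the Batch file, this is destructive: `path` no longer exists afterwards.
    """
    root, ext = os.path.splitext(path)
    parts = []
    with open(path, "r+b") as f:
        # Part numbers start at 2 because part 1 is whatever remains in `path`.
        numbered = list(enumerate(sorted(offsets), start=2))
        for number, offset in reversed(numbered):
            f.seek(offset)                    # comparable to: FilePointer 0 offset
            part_name = f"{root}_{number:03d}{ext}"
            with open(part_name, "wb") as out:
                shutil.copyfileobj(f, out)    # copy from this point up to EOF
            f.truncate(offset)                # comparable to: TruncateFile 1
            parts.append(part_name)
    first = f"{root}_001{ext}"
    os.replace(path, first)                   # the remainder becomes part 1
    return [first] + parts[::-1]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo = os.path.join(tmp, "demo.txt")
        with open(demo, "wb") as f:
            f.write(b"line1\nline2\nline3\nline4\nline5\n")
        # Each line is 6 bytes, so line 3 starts at offset 12 and line 5 at 24.
        for name in split_last_to_first(demo, [12, 24]):
            with open(name, "rb") as f:
                print(os.path.basename(name), f.read())
```

Processing the parts in reverse order is what makes truncation possible: after a tail has been copied, everything from that offset onward is no longer needed in the original.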

Code:

@echo off

rem SplitFile.bat: Split a large text file in parts of a given number of lines
rem Antonio Perez Ayala - 2015/01/31

rem This program requires FilePointer.exe and TruncateFile.exe auxiliary programs
rem Download them from: https://www.dropbox.com/sh/k7w69m4u8mhp3yg/AAAuluzR34AIpKA1rxXjNN8Sa?dl=0

if "%~2" neq "" goto begin
echo Split a large text file in parts of a given number of lines
echo/
echo SplitFile filename.ext numberOfLines
echo/
echo After the file has been split, it can be recovered with this command:
echo     COPY filename_*.ext filename.ext /B
echo/
:rightNumber
echo The number of lines must have non-zero digits followed by zero digits,
echo like: 5000, 11000, 20000, etc.
echo/
echo/
echo ATTENTION!  This program is *destructive*: it will remove the original file!
echo ==========  You should copy the file before splitting it with this program.
goto :EOF

:begin
setlocal EnableDelayedExpansion

if not exist %1 echo File not found & goto :EOF

rem Get from numOfLines: modBlock=digits != 0 at left, modLen=number of digits == 0 at right
set /A numOfLines=%2, modLen=0, continue=1
set "modBlock="
for /L %%i in (0,1,9) do if defined continue (
   set "digit=!numOfLines:~%%i,1!"
   if "!digit!" equ "" (
      set "continue="
   ) else if "!digit!" neq "0" (
      if !modLen! neq 0 goto badNumber
      set "modBlock=!modBlock!!digit!"
   ) else (
      set /A modLen+=1
   )
)
if "%modBlock%" equ "" goto badNumber
if %modLen% gtr 0 goto getModSize
:badNumber
echo Wrong number of lines
goto rightNumber
:getModSize
set "modSize=!numOfLines:~-%modLen%!"

rem Get the offsets of the lines placed at start of each part
set "start=%time%"
set /P "=Obtaining limits of all parts..." < NUL
set "lastOffset=1"
for /F "tokens=1,2 delims=:" %%a in ('findstr /N /O "^" %1 ^| findstr "%modSize%:"') do (
   set "lineNum=%%a"
   if "!lineNum:~-%modLen%!" equ "%modSize%" (
      set /A "mod=!lineNum:~0,-%modLen%! %% %modBlock%"
      if !mod! equ 0 (
         set /A lastOffset+=1
         set "offset[!lastOffset!]=%%b"
      )
   )
)
echo   done.
set "split=%time%"

rem Open a code-block to process the file via redirected Stdin and Stdout
< %1 (

   rem Extract the parts in last-to-first order
   for /L %%i in (%lastOffset%,-1,2) do (

      rem Move Stdin file pointer to the start of this part
      FilePointer 0 !offset[%%i]!

      rem Copy from this point up to EOF to its own part file
      set "part=00%%i"
      set /P "=Creating part %~N1_!part:~-3!%~X1..." < NUL > CON
      findstr "^" > "%~N1_!part:~-3!%~X1"
      echo   done.> CON

      rem Move Stdout file pointer to the start of this part
      FilePointer 1 !offset[%%i]!

      rem Truncate the file at this point: make it the new EOF
      TruncateFile 1

   )

) >> %1

rem Rename the last (first) part
set /P "=Creating part %~N1_001%~X1..." < NUL
ren %1 "%~N1_001%~X1"
echo   done.

echo/
echo Start process   at: %start%
echo Start splitting at: %split%
echo End   splitting at: %time%
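
The restriction on the number of lines (non-zero digits followed by zeros) is what lets the Batch file pick the boundary lines with plain string operations. A minimal Python sketch of the same test follows; the function names are hypothetical and the code only mirrors the modBlock/modLen logic above, it is not part of SplitFile.bat.

```python
def parse_lines_per_part(text):
    """Return (block, zeros) for numbers such as '20000' -> (2, 4), else None.

    `block` corresponds to modBlock (the non-zero digits on the left) and
    `zeros` to modLen (how many zero digits follow them).
    """
    if not text.isdigit():
        return None
    stripped = text.rstrip("0")
    zeros = len(text) - len(stripped)
    # Reject: all zeros, no trailing zero at all, or a zero inside the block.
    if not stripped or zeros == 0 or "0" in stripped:
        return None
    return int(stripped), zeros


def is_boundary(line_number, block, zeros):
    """True when `line_number` is a multiple of the requested part length."""
    text = str(line_number)
    # First the cheap string test: the line number must end in the zeros.
    if len(text) <= zeros or not text.endswith("0" * zeros):
        return False
    # Then the arithmetic test on the remaining leading digits only.
    return int(text[:-zeros]) % block == 0


if __name__ == "__main__":
    block, zeros = parse_lines_per_part("20000")
    for n in (20000, 30000, 40000, 40001):
        print(n, is_boundary(n, block, zeros))
```

With "20000" the parser yields (2, 4); a value such as "1050" or "123" is rejected, just as the Batch file jumps to badNumber.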

These are the timing results I obtained when I tested this program with some large files:

Code:

File size (lines)       Lines in part (parts)   Elapsed time in Min:Sec.cc
                                                (get limits + split = total)

100 MB (3,003,000)       20,000 (151)           0:08.30 + 0:12.23 = 0:20.53
100 MB (3,003,000)      100,000 (31)            0:07.95 + 0:04.09 = 0:12.04
100 MB (3,003,000)      500,000 (7)             0:07.86 + 0:02.39 = 0:10.25

300 MB (9,009,000)       20,000 (451)           0:27.22 + 0:43.66 = 1:10.88
300 MB (9,009,000)      100,000 (91)            0:24.38 + 0:12.73 = 0:37.11
300 MB (9,009,000)      500,000 (19)            0:24.32 + 0:08.08 = 0:32.40

600 MB (18,018,000)      20,000 (901)           0:56.59 + 1:30.43 = 2:27.02
600 MB (18,018,000)     100,000 (181)           0:54.00 + 0:36.71 = 1:30.71
600 MB (18,018,000)     500,000 (37)            0:54.31 + 0:19.21 = 1:13.52

1.2 GB (36,036,000)     500,000 (73)            2:10.31 + 0:54.61 = 3:04.92

Yes, the times above are correct. This Batch file takes a little more than 3 minutes to split a 1.2 GB file into 73 parts when run on my not-too-fast laptop computer! 8)

This solution has two limits: the file can be at most 2 GB, because that is the maximum offset the FilePointer.exe auxiliary program allows, and lines cannot exceed the maximum length allowed by the FINDSTR command (I don't know the specific value).

I would appreciate it if you could test this program and post the results, especially if you have large files that you currently split with another program, so we can compare the timings of both.

You may download FilePointer.exe and TruncateFile.exe auxiliary files from this site.

Antonio
Last edited by Aacini on 02 Feb 2015 07:27, edited 1 time in total.

#2 Post by foxidrive » 02 Feb 2015 06:17

Aacini wrote:Antonio

PS - I am trying out my new Dropbox site, but I don't know if this link gives you full access to the 3 public folders I just created there. Could someone test this point and post the result?


It works fine Antonio.

#3 Post by Aacini » 02 Feb 2015 07:29

Thanks, foxi! :D

#4 Post by foxidrive » 02 Feb 2015 15:43

I am just commenting again - there are 4 files in the exe folder and the other two folders are empty.

It occurred to me that you may have batch files there but they do not have public permissions applied.

The download button works to zip up all folders into a single file - which is brilliant for people to get all your tools easily,
and the exe file can be downloaded separately.

#5 Post by Samir » 11 Feb 2015 22:06

Aacini wrote:Yes, the times above are correct. This Batch file takes a little more than 3 minutes to split a 1.2 GB file into 73 parts when run on my not-too-fast laptop computer! 8)
Fascinating. What's really interesting is that it seems the time is actually more dependent on the number of destination files being written vs the overall original file size.

The 1.2GB file into 73 pieces is split at ~6.66MB/sec whereas the 600MB file split in 37 pieces is at 8.16MB/sec. And the same 600MB file is split at only 4.08MB/sec when the number of pieces is increased to 901.
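
These rates can be recomputed from the total times in Aacini's table. The following is a minimal Python sketch under two stated assumptions: 1.2 GB is taken as 1.2 x 1024 MB, and the rate uses the total elapsed time (get limits + split) rather than the split phase alone. The helper names are hypothetical.

```python
def to_seconds(elapsed):
    """Convert 'M:SS.cc' (the table's Min:Sec.cc format) to seconds."""
    minutes, seconds = elapsed.split(":")
    return int(minutes) * 60 + float(seconds)


def throughput_mb_per_s(size_mb, elapsed):
    """Average rate in MB/s for a file of `size_mb` split in `elapsed`."""
    return size_mb / to_seconds(elapsed)


# (label, size in MB, total time) copied from the timing table.
rows = [
    ("1.2 GB into 73 parts", 1.2 * 1024, "3:04.92"),
    ("600 MB into 37 parts", 600, "1:13.52"),
    ("600 MB into 901 parts", 600, "2:27.02"),
]

for label, size_mb, total in rows:
    print(f"{label}: {throughput_mb_per_s(size_mb, total):.2f} MB/s")
```

Using only the split phase instead of the total would give higher rates, but the ordering between few parts and many parts stays the same.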

#6 Post by Squashman » 12 Feb 2015 07:59

Samir wrote:What's really interesting is that it seems the time is actually more dependent on the number of destination files being written vs the overall original file size.

The 1.2GB file into 73 pieces is split at ~6.66MB/sec whereas the 600MB file split in 37 pieces is at 8.16MB/sec. And the same 600MB file is split at only 4.08MB/sec when the number of pieces is increased to 901.


Yes I would agree with that. In my line of work we have tested several ways to split large data files. I work with huge data files. We normally split on predefined variables within a fixed field on the data. Our tests have shown that on large files it is quicker to sort the file by the field you are splitting on first and then split the file into the output files you need.

#7 Post by Samir » 12 Feb 2015 10:35

Squashman wrote:
Samir wrote:What's really interesting is that it seems the time is actually more dependent on the number of destination files being written vs the overall original file size.

The 1.2GB file into 73 pieces is split at ~6.66MB/sec whereas the 600MB file split in 37 pieces is at 8.16MB/sec. And the same 600MB file is split at only 4.08MB/sec when the number of pieces is increased to 901.


Yes I would agree with that. In my line of work we have tested several ways to split large data files. I work with huge data files. We normally split on predefined variables within a fixed field on the data. Our tests have shown that on large files it is quicker to sort the file by the field you are splitting on first and then split the file into the output files you need.
Interesting. What computer hardware have you found really improves split times? More CPU power? Faster HDs? Memory?

#8 Post by foxidrive » 12 Feb 2015 13:25

Samir wrote:What's really interesting is that it seems the time is actually more dependent on the number of destination files being written vs the overall original file size.


This can be shown in a copy process too as the time taken to create files is quite significant.

If you start a copy process to copy 10 files of 1 MB
and then do the same to copy 10000 files of 1 KB
The process is basically copying the same amount of data but the time taken to copy and create the small files will be enormously higher.

Here's a proof of concept using Dave's GetTimeStamp.bat to calculate the elapsed time
and the results here on a fairly capable machine:

0 days 00:00:00.059
0 days 00:00:05.782
Press any key to continue . . .


and you can see that the time for the small files is orders of magnitude higher.
Copy heaps of them over a LAN and watch the time blow out...

Code:

@echo off
for /l %%a in (1,1,10) do fsutil file createnew largefile%%a.txt 1000000
for /l %%a in (1,1,10000) do fsutil file createnew smallfile%%a.txt 1000

::batch get elapsed time

@echo off
call getTimestamp -f {ums} -r t1

md large
copy large* large >nul

call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u

::batch get elapsed time

@echo off
call getTimestamp -f {ums} -r t1

md small
copy small* small >nul

call getTimestamp -f {ums} -r t2
call getTimestamp -d %t2%-%t1% -f "{ud} days {hh}:{nn}:{ss}.{fff}" -u
pause

#9 Post by Squashman » 12 Feb 2015 13:32

Samir wrote:Interesting. What computer hardware have you found really improves split times? More CPU power? Faster HDs? Memory?


[eVil Laugh]
I will ask our system admins how many MIPS we are at right now.

#10 Post by Squashman » 12 Feb 2015 16:01

Didn't have to ask the sysadmin. Just ran a rexx script to get the information.
We are at 267.77 MIPS (Million Instructions Per Second). That is running on 3 cores of the processor. The other cores are reserved for system use.

#11 Post by Squashman » 12 Feb 2015 16:13

Foxidrive, you are forcing me to upgrade to Windows 8 so I can run fsutil without elevated privileges!!!!
Damn You Microsoft!!! :evil:

#12 Post by foxidrive » 12 Feb 2015 17:23

Squashman wrote:Foxidrive, you are forcing me to upgrade to Windows 8 so I can run fsutil without elevated privileges!!!!


oops!

Some of us run admin accounts, in violation of the limited-user-only treaty of 2010.
So I'm not sure I would have noticed...

#13 Post by Samir » 13 Feb 2015 11:49

foxidrive wrote:
Samir wrote:What's really interesting is that it seems the time is actually more dependent on the number of destination files being written vs the overall original file size.


This can be shown in a copy process too as the time taken to create files is quite significant.

Yep, this is where seek time and writing to a file allocation table kill drive speeds. This is also why a flash drive or SSD can be quicker than a traditional hard drive, since the seek time is in microseconds vs milliseconds. Ramdrives are also awesome for testing pure algorithm speed.

#14 Post by Samir » 13 Feb 2015 11:49

Squashman wrote:Didn't have to ask the sysadmin. Just ran a rexx script to get the information.
We are at 267.77 MIPS (Million Instructions Per Second). That is running on 3 cores of the processor. The other cores are reserved for system use.
That's some serious horsepower. 8)

Re: SplitFile.bat: Split large BINARY files in a very fast w

#15 Post by Aacini » 21 Jul 2015 11:16

The method presented in this thread cannot be used on binary files because the FINDSTR command fails with binary data. To solve this problem, I wrote a new auxiliary program named ReadFile.exe:

ReadFile.exe help wrote:Read bytes from a redirected input file handle and output they to StdOut.

ReadFile handle numOfBytes

If numOfBytes is zero, read all bytes from current FP position until EOF.

At end, the number of output bytes is returned in ERRORLEVEL.

I used this auxiliary program to write a new SplitFileBySizes.bat file. It lets you combine several files of any type into one large binary file via the "COPY /B FILE1+FILE2+... LARGE.BIN" command, and then extract the original parts. This program should run faster than the former one based on FINDSTR, but I don't have time to run timing tests right now.

Code:

@echo off
setlocal EnableDelayedExpansion

rem SplitFileBySizes.bat: Split a large file in parts of given sizes
rem Antonio Perez Ayala - 2015/07/21

rem This program requires FilePointer.exe, ReadFile.exe and TruncateFile.exe auxiliary programs
rem Download them from: https://www.dropbox.com/sh/k7w69m4u8mhp3yg/AAAuluzR34AIpKA1rxXjNN8Sa?dl=0

if "%~2" neq "" goto begin

echo Split a large file in parts of given sizes
echo/
echo SplitFileBySizes.bat filename.ext size2 size3 ...
echo/
echo Note that sizes are given from the *SECOND* part onwards; the size of the
echo first part is not given. The file is split in NumberOfSizes+1 parts.
goto :EOF

:begin
if not exist %1 echo File not found: %1 & goto :EOF

rem Store sizes in an array
set n=0
for %%a in (%*) do (
   set /A n+=1
   set "size[!n!]=%%~a"
)

rem Create a duplicate of input file, that will become the first part:
copy %1 "%~N1-1%~X1" > NUL

rem Open a code-block to process the file via redirected Stdin and Stdout
< "%~N1-1%~X1" (

   rem Extract the parts in last-to-first order
   for /L %%i in (%n%,-1,2) do (

      rem Move Stdin file pointer to the start of this part
      FilePointer 0 -!size[%%i]! /E

      rem Copy from this point up to EOF to its own part file
      ReadFile 0 0 > "%~N1-%%i%~X1"

      rem Move Stdout file pointer to the start of this part again
      FilePointer 1 -!size[%%i]! /E

      rem Truncate the file at this point: make it the new EOF
      TruncateFile 1

   )

) >> "%~N1-1%~X1"
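
The size-based variant differs from the line-based one in that positions are measured backwards from the end of the file. A minimal Python sketch of that idea follows. It is an illustration only, not the code of the auxiliary programs, and the name split_by_sizes is hypothetical; as in the Batch file, the sizes are given from the second part onwards and the work is done on a copy of the input.

```python
import os
import shutil
import tempfile


def split_by_sizes(path, sizes):
    """Split `path` into len(sizes) + 1 parts; `sizes` apply to parts 2, 3, ...

    The original file is left untouched. A copy named <name>-1<ext> is cut
    down from the end until it holds only the first part.
    """
    root, ext = os.path.splitext(path)
    first = f"{root}-1{ext}"
    shutil.copyfile(path, first)
    names = [first]
    with open(first, "r+b") as f:
        # Walk the sizes from the last part back to part 2.
        for number in range(len(sizes) + 1, 1, -1):
            size = sizes[number - 2]
            f.seek(-size, os.SEEK_END)      # comparable to: FilePointer 0 -size /E
            start = f.tell()
            name = f"{root}-{number}{ext}"
            with open(name, "wb") as out:
                shutil.copyfileobj(f, out)  # comparable to: ReadFile 0 0
            f.truncate(start)               # comparable to: TruncateFile 1
            names.insert(1, name)           # keep the list in part order
    return names


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        big = os.path.join(tmp, "AllProgs.bin")
        with open(big, "wb") as f:
            f.write(b"A" * 1536 + b"B" * 1536 + b"C" * 1536)
        for name in split_by_sizes(big, [1536, 1536]):
            print(os.path.basename(name), os.path.getsize(name))
```

Because each offset is relative to the current end of the shrinking copy, the size of the first part never has to be supplied: it is whatever remains after the other parts have been cut off.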

I just completed a very simple test of the previous program:

Code:

C:\Users\Antonio\Documents\ASMB\MASM32 Assembler for Windows\test
>dir
 Volume in drive C has no label.
 Volume Serial Number is 0895-160E

 Directory of C:\Users\Antonio\Documents\ASMB\MASM32 Assembler for Windows\test


21/07/2015  10:10 a. m.    <DIR>          .
21/07/2015  10:10 a. m.    <DIR>          ..
21/07/2015  09:59 a. m.             1,536 FilePointer.exe
21/07/2015  08:47 a. m.             1,536 ReadFile.exe
21/07/2015  10:08 a. m.             1,374 SplitFileBySizes.bat
09/11/2014  10:36 p. m.             1,536 TruncateFile.exe
               4 File(s)          5,982 bytes
               2 Dir(s)  422,138,630,144 bytes free

C:\Users\Antonio\Documents\ASMB\MASM32 Assembler for Windows\test
>copy /B *.exe AllProgs.bin
FilePointer.exe
ReadFile.exe
TruncateFile.exe
        1 file(s) copied.

C:\Users\Antonio\Documents\ASMB\MASM32 Assembler for Windows\test
>SplitFileBySizes.bat AllProgs.bin 1536 1536

C:\Users\Antonio\Documents\ASMB\MASM32 Assembler for Windows\test
>dir
 Volume in drive C has no label.
 Volume Serial Number is 0895-160E

 Directory of C:\Users\Antonio\Documents\ASMB\MASM32 Assembler for Windows\test


21/07/2015  10:11 a. m.    <DIR>          .
21/07/2015  10:11 a. m.    <DIR>          ..
21/07/2015  10:11 a. m.             1,536 AllProgs-1.bin
21/07/2015  10:11 a. m.             1,536 AllProgs-2.bin
21/07/2015  10:11 a. m.             1,536 AllProgs-3.bin
21/07/2015  10:10 a. m.             4,608 AllProgs.bin
21/07/2015  09:59 a. m.             1,536 FilePointer.exe
21/07/2015  08:47 a. m.             1,536 ReadFile.exe
21/07/2015  10:08 a. m.             1,374 SplitFileBySizes.bat
09/11/2014  10:36 p. m.             1,536 TruncateFile.exe
               8 File(s)         15,198 bytes
               2 Dir(s)  422,138,609,664 bytes free

C:\Users\Antonio\Documents\ASMB\MASM32 Assembler for Windows\test
>fc /B AllProgs-1.bin FilePointer.exe
Comparing files AllProgs-1.bin and FILEPOINTER.EXE
FC: no differences encountered


C:\Users\Antonio\Documents\ASMB\MASM32 Assembler for Windows\test
>fc /B AllProgs-2.bin ReadFile.exe
Comparing files AllProgs-2.bin and READFILE.EXE
FC: no differences encountered


C:\Users\Antonio\Documents\ASMB\MASM32 Assembler for Windows\test
>fc /B AllProgs-3.bin TruncateFile.exe
Comparing files AllProgs-3.bin and TRUNCATEFILE.EXE
FC: no differences encountered

The ReadFile program currently uses a 256 KB buffer, which is a multiple of both 4 KB and 32 KB (the most common cluster sizes). I wonder whether the program would be more efficient with a larger buffer. On the other hand, a very large buffer would cause problems on computers with little memory. I'd appreciate any advice on this point.
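
For readers who want to experiment with the buffer-size question, here is a minimal Python sketch of a chunked copy loop. It is not the implementation of ReadFile.exe, which is not shown in this thread; the name copy_bytes and its parameters are assumptions made for the illustration. It only mirrors the documented behaviour: zero bytes requested means "until EOF", and the number of bytes written is returned.

```python
import io

BUFFER_SIZE = 256 * 1024  # 256 KB, the buffer size mentioned above


def copy_bytes(src, dst, num_bytes=0, buffer_size=BUFFER_SIZE):
    """Copy from `src` to `dst` in fixed-size chunks and return the byte count.

    num_bytes == 0 means "copy until EOF". Memory use is bounded by
    `buffer_size`, regardless of how large the input is.
    """
    copied = 0
    while num_bytes == 0 or copied < num_bytes:
        if num_bytes == 0:
            want = buffer_size
        else:
            want = min(buffer_size, num_bytes - copied)
        chunk = src.read(want)
        if not chunk:          # EOF reached before num_bytes were available
            break
        dst.write(chunk)
        copied += len(chunk)
    return copied


if __name__ == "__main__":
    source = io.BytesIO(b"x" * 1_000_000)
    target = io.BytesIO()
    print(copy_bytes(source, target, buffer_size=4096))
```

Timing this loop on a real file with several values of buffer_size is one way to see where a larger buffer stops paying off on a given machine.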

You may download the new ReadFile.exe program from the same site that contains the FilePointer.exe and TruncateFile.exe auxiliary programs.

Enjoy! :D

Antonio
Last edited by Aacini on 25 Jul 2016 13:21, edited 1 time in total.
