Looking to improve batch file to count and grep the data to separate files.

rcmpayne · #1 Post by **rcmpayne** » 08 Jan 2018 10:43

Hello All,

I have a batch file that does two things. It works, but on a 9 GB file it takes forever. Does anyone have ideas on how I can speed this up from within a batch file? I am currently using cut, grep, awk, and sed from GNU.

%File_IN% is a 9 GB server log from which I want to produce two new files.

%Results_COUNT% is the first file. It holds a count of log lines per class name:

Code: Select all

     85  com.platform.mdm.core.identity.impl.EnterpriseIdentitySyncService 
      1  com.platform.mdm.tcp.client.InProcessMethodInvocationInitiatedTcpTunnelInstanceHandler 
2058567  com.platform.mdm.tcp.messaging.WorkflowBasedIncomingMessageRouterFactory 
    630  com.platform.transaction.TransactionImpl 
    630  com.platform.workflow.io.AbstractLinkOutputSource

%Results_ALL_New% is the second file (or set of files), containing all the raw log lines from %File_IN% that match a pattern such as com.platform.mdm.tcp.messaging.WorkflowBasedIncomingMessageRouterFactory

Batch file

Code: Select all

echo ********************************************************************************
echo ********************* Creating count summary of loglines   *********************
echo ********************************************************************************
echo.
echo Get a count of each log line and add results to file
echo.
cut -f6 -d"-" %File_IN% | sort | uniq -c > %Results_COUNT%
Echo Done!
echo.

echo ********************************************************************************
echo ********************* Creating separate file with raw loglines   *********************
echo ********************************************************************************
echo.
echo Scans Count file for logs above set value (default is 10000) and output the complete line to its own file
echo.
for /f "tokens=1 " %%d in ('cat %Results_COUNT% ^| awk "$1>10000" ^| sed -e "s/^[ \t]*//" ^| awk -F "( )" "{print $3}"') do (grep %%d %Results_ALL_New% > %%d.log)
Echo Done!
echo.

#2 Post by **penpen** » 08 Jan 2018 15:19

On files that big, it may be faster to avoid pipes and write intermediate results to temporary files instead.
In addition, you should write the output of the command processed by "for /f" to a result file and loop over that file instead of over the command output.

penpen
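A side note on the second step of the original script: the for /f loop runs one grep over the whole log for every class above the threshold, so the 9 GB file is read once per class. The following is only a rough sketch of a single-pass alternative, written in Python rather than batch and not taken from this thread. The function names `names_above`, `class_of`, and `split_by_class` are made up for illustration, and it assumes the class name is the 6th "-"-delimited field, as in `cut -f6 -d"-"`. Unlike grep, it compares the field exactly instead of treating the name as a regular expression matched anywhere in the line.

```python
import os

def names_above(count_lines, threshold=10000):
    """Collect class names whose count exceeds the threshold.

    Expects 'uniq -c' style lines ("  20000  some.Class"), like the
    count file in the thread; mirrors the awk "$1>10000" filter.
    """
    wanted = set()
    for line in count_lines:
        fields = line.split(None, 1)
        if len(fields) == 2 and int(fields[0]) > threshold:
            wanted.add(fields[1].strip())
    return wanted

def class_of(line, field_index=5, delimiter="-"):
    """Return the 6th '-'-separated field, trimmed (assumed log layout)."""
    parts = line.split(delimiter)
    return parts[field_index].strip() if len(parts) > field_index else ""

def split_by_class(lines, wanted, out_dir):
    """Write each raw line whose class is in `wanted` to <class>.log.

    The log is read once; one output file is kept open per wanted class.
    Assumes class names are safe to use as file names.
    """
    handles = {}
    try:
        for line in lines:
            name = class_of(line)
            if name in wanted:
                if name not in handles:
                    path = os.path.join(out_dir, name + ".log")
                    # newline="" keeps the original line endings untouched
                    handles[name] = open(path, "w", encoding="utf-8", newline="")
                handles[name].write(line)
    finally:
        for fh in handles.values():
            fh.close()
    return sorted(handles)
```

In practice the two inputs would be opened files, for example the count file for `names_above` and the big log for `split_by_class`, so that lines are streamed rather than loaded into memory.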

rcmpayne · #3 Post by **rcmpayne** » 09 Jan 2018 08:40

Ah, thanks for the help. The first part of the script is done, and it has dropped the run time from 24 hours to ~13 minutes. I am starting on the second part now, but let me know if you see any room for improvement. Increasing the sort buffer above -S 1G does not seem to make any difference, even though I have 8 GB of memory available. Also, --parallel=xx does not seem to be available in GnuWin.

Results:

Code: Select all

Get a count of each log line and add results to file
Main Start: 2018-01-09-10_53_50.541
Starting cut.exe 10:53:50.63
Starting sort.exe 10:55:02.78
Starting uniq.exe 11:02:41.29
Starting sort.exe 11:03:06.46
Deleting Temp files
Main End: 2018-01-09-11_03_07.026
Press any key to continue . . .

Batch File

Code: Select all

echo ********************************************************************************
echo ********************* Creating count summary of loglines   *********************
echo ********************************************************************************
echo.
echo Get a count of each log line and add results to file
echo.
echo Main Start: %DateTime%
echo.
echo.
set LC_ALL=C.UTF-8
echo Starting cut.exe %DateTime%
cut -f6 -d"-" %File% > tmpcut.log
echo Starting sort.exe %DateTime%
sort -S1G tmpcut.log -o tmpsort.log
echo Starting uniq.exe %DateTime%
uniq -c tmpsort.log tmpuniqsorted.log
echo Starting sort.exe %DateTime%
sort -r tmpuniqsorted.log -o Core_Logline_Count.log
echo deleting Temp files
del tmp*.log
echo.
echo.
echo Main End: %DateTime%
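For comparison with the cut / sort / uniq chain above: most of the 13 minutes goes into sorting the extracted column, and the counts can also be collected in a single pass with a hash table, with no sort of the large intermediate file. This is a minimal sketch in Python (not batch, and not something posted in this thread). The names `count_classes` and `format_counts` are illustrative, and the field position simply mirrors `cut -f6 -d"-"`.

```python
from collections import Counter

def count_classes(lines, field_index=5, delimiter="-"):
    """Count lines per 6th '-'-separated field, roughly like cut | sort | uniq -c."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\r\n")
        if delimiter not in line:
            # cut (without -s) passes lines lacking the delimiter through whole
            key = line
        else:
            parts = line.split(delimiter)
            # cut prints an empty field when the line has too few fields
            key = parts[field_index] if len(parts) > field_index else ""
        counts[key] += 1
    return counts

def format_counts(counts):
    """Render counts in 'uniq -c' layout, highest count first."""
    rows = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{n:7d} {name}" for name, n in rows]
```

Passing an open file object as `lines` streams the log line by line, so memory use depends on the number of distinct class names and not on the size of the log.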

#4 Post by **penpen** » 09 Jan 2018 13:28

If split.exe (with the -l option) is part of your GnuWin tools, you could split the big file into smaller parts and then merge the resulting count files with another for /f loop that adds the integers, so that sort runs faster.
Whether that actually speeds up your code depends on the algorithm used by the GnuWin sort.exe.

penpen
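The merge step described above (adding the integers from the per-chunk count files) could look roughly like this. It is a sketch in Python, not the for /f loop itself, and `merge_count_files` is a made-up name. It assumes each chunk has already been run through cut, sort, and `uniq -c`, so every line is a count followed by a space and the key.

```python
from collections import Counter

def merge_count_files(chunks):
    """Sum 'uniq -c' style counts across several chunk outputs.

    `chunks` is an iterable of line iterables, one per split piece.
    """
    total = Counter()
    for chunk in chunks:
        for line in chunk:
            line = line.rstrip("\r\n")
            if not line.strip():
                continue  # skip blank lines
            # "     85  some.Class " -> count "85", key " some.Class "
            count, _, key = line.lstrip().partition(" ")
            total[key] += int(count)
    return total
```

Each element of `chunks` can be an open file, for example the count file produced for each piece written by `split -l`.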

rcmpayne · #5 Post by **rcmpayne** » 10 Jan 2018 08:22

Thanks, it's available in GnuWin32:

Code: Select all

C:\Tools>split.exe --help
Usage: split.exe [OPTION] [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'.  With no INPUT, or when INPUT
is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -a, --suffix-length=N   use suffixes of length N (default 2)
  -b, --bytes=SIZE        put SIZE bytes per output file
  -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
  -d, --numeric-suffixes  use numeric suffixes instead of alphabetic
  -l, --lines=NUMBER      put NUMBER lines per output file
      --verbose           print a diagnostic to standard error just
                            before each output file is opened
      --help     display this help and exit
      --version  output version information and exit

SIZE may have a multiplier suffix: b for 512, k for 1K, m for 1 Meg.

Report bugs to <bug-coreutils@gnu.org>.

DosTips.com
