Parsing taking FOREVER - possible JSCRIPT solution?

SIMMS7400 · #1 Post by **SIMMS7400** » 02 Dec 2018 05:01

Hi Folks -

I have a text file that consists of 85k rows. I need to parse that file, extract from token 2 only the strings that begin with "C-", and spool those results to a file. From there, I need to remove all duplicates from that file. The end result should be a file containing NO duplicates.

I'm using the following solution, which is taking a considerable amount of time:

Code: Select all

@ECHO OFF

for /f "tokens=2 delims=|" %%A in (FDRII_outline.txt) do (
    ECHO "%%~A" | FINDSTR /C:"C-" >Nul 2>&1 && ECHO %%~A>>"out.txt"
)

jsort out.txt /u >out.txt.new
move /y out.txt.new out.txt >nul

Could the first portion of my code be replaced with another JSCRIPT solution? I could also leverage a VB script and use a dictionary, but I figured I'd ask first whether anyone has a more efficient way than my current solution.

Thank you!

EDIT:

What I mean when I say remove duplicates is that I need to remove the duplicate value AS WELL AS the original value. Essentially, the final file should be all instances that NEVER had a duplicate to begin with.
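
The extraction step described in this post could be sketched in JScript-compatible JavaScript roughly as follows. This is only an illustration under assumptions: the thread never shows a sample of the input, so the field layout is guessed from the `tokens=2 delims=|` option, and `extractCValues` is a hypothetical name. It tests for "C-" at the start of the token, as the description asks; the original `FINDSTR /C:"C-"` matches "C-" anywhere in the token. Reading the file itself (for example through `Scripting.FileSystemObject` under cscript) is left out.

```javascript
// Sketch only: approximates `for /f "tokens=2 delims=|"` plus the %%~A quote stripping.
// extractCValues is a hypothetical helper name, not something from the thread.
function extractCValues(lines) {
  var out = [];
  for (var i = 0; i < lines.length; i++) {
    // FOR /F treats consecutive delimiters as one, so empty fields are dropped.
    var raw = lines[i].split("|");
    var fields = [];
    for (var j = 0; j < raw.length; j++) {
      if (raw[j] !== "") fields.push(raw[j]);
    }
    if (fields.length < 2) continue; // this line has no second token
    // Roughly what %%~A does: remove one pair of surrounding quotes.
    var value = fields[1].replace(/^"(.*)"$/, "$1");
    // Keep only values that begin with "C-".
    if (value.substring(0, 2) === "C-") out.push(value);
  }
  return out;
}
```

The function works on an array of lines, so the whole file is processed in one script run instead of starting one `FINDSTR` process per row, which is likely where most of the time goes in the batch version.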

ShadowThief · #2 Post by **ShadowThief** » 02 Dec 2018 08:16

Does the order of the lines matter, or can I sort the strings alphabetically in order to make it easier for me to detect and remove duplicates?
Is the leftmost column a fixed length?
Any poison characters I need to look out for in the "C-" section?

SIMMS7400 · #3 Post by **SIMMS7400** » 02 Dec 2018 08:40

Hi Shadow -

Nope, sorting doesn't matter at all, as long as the final file has both the duplicate(s) and the original removed.

Example:

Before:

1
2
3
1
1
4
5

After:

2
3
4
5

Thanks!
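
The before/after example above amounts to "count every value, then keep the ones seen exactly once". A minimal sketch of that rule in JScript-compatible JavaScript, using a plain object as the dictionary mentioned in the first post; `keepSingletons` is a hypothetical name and the values are assumed to be plain strings:

```javascript
// Sketch only: keep the values that occur exactly once in the input.
// keepSingletons is a hypothetical helper name.
function keepSingletons(values) {
  var counts = {}; // plain object used as a dictionary
  var order = [];  // distinct values in first-seen order
  for (var i = 0; i < values.length; i++) {
    // The "k:" prefix avoids clashes with built-in property names.
    var key = "k:" + values[i];
    if (counts.hasOwnProperty(key)) {
      counts[key] += 1;
    } else {
      counts[key] = 1;
      order.push(values[i]);
    }
  }
  var result = [];
  for (var j = 0; j < order.length; j++) {
    if (counts["k:" + order[j]] === 1) result.push(order[j]);
  }
  return result;
}

// The example from the post above:
var after = keepSingletons(["1", "2", "3", "1", "1", "4", "5"]);
// after is ["2", "3", "4", "5"]
```

No sorting is needed with this approach, and the output keeps the order in which the values first appeared.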

#4 Post by **Aacini** » 02 Dec 2018 08:41

You have not given any description of the desired values: How long are they? Do they contain special characters? How many unique values could be expected in the 85K rows? All these points are needed in order to create an efficient solution. I invite you to carefully read the first post in this forum...

With no info about the values, I could only write the simplest solution that I think could run fast:

Code: Select all

@echo off
setlocal EnableDelayedExpansion

rem Count desired values
for /F "tokens=2 delims=|" %%a in (FDRII_outline.txt) do (
   set "value=%%~a"
   if "!value:~0,2!" equ "C-" set /A "row[!value:~2!]+=1"
)

rem Output unique values
(for /F "tokens=2,3 delims=[]=" %%a in ('set row[') do (
   if %%b equ 1 echo C-%%a
)) > out.txt

This method fails if the values contain special characters that are SET /A arithmetic operators (other than the minus sign at the second position).

This method runs progressively slower if the values are very long or if there is a large number of unique values. Anyway, I am pretty sure that this method will run much faster than the original code.

Ah! And this method outputs the values in sorted order.

Obviously, I could not test this code because you have not posted a segment of the input file...

Antonio

#5 Post by **Aacini** » 02 Dec 2018 09:04

Another method that does not have the restrictions of my previous one...

Code: Select all

@echo off
setlocal EnableDelayedExpansion

rem Extract desired values
(for /F "tokens=2 delims=|" %%a in (FDRII_outline.txt) do (
   set "value=%%~a"
   if "!value:~0,2!" equ "C-" echo !value!
)) > out1.txt

rem Sort desired values (this is faster than running SORT inside the FOR)
sort out1.txt > out2.txt

rem Output unique values
set "last="
set "count=0"
(
for /F "delims=" %%a in (out2.txt) do (
   if "%%a" equ "!last!" (
      set /A count+=1
   ) else (
      if !count! equ 1 echo !last!
      set count=1
   )
   set "last=%%a"
)
rem Emit the final value too if it occurred only once
if !count! equ 1 echo !last!
) > out.txt

This method only fails if the values contain an exclamation mark. This point can be fixed, if needed...

Antonio
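
For comparison, the sort-then-scan idea in the post above could look like this in JScript-compatible JavaScript. It is a sketch, not a port of the batch file: `singletonsBySorting` is a hypothetical name, and the default JavaScript sort orders strings by character code, which can differ from the ordering the Windows `sort` command produces.

```javascript
// Sketch only: sort the values, then keep each value whose run of equal
// neighbours has length one. singletonsBySorting is a hypothetical name.
function singletonsBySorting(values) {
  var sorted = values.slice().sort(); // copy first, then default string sort
  var result = [];
  var i = 0;
  while (i < sorted.length) {
    // Advance j to the first position that holds a different value.
    var j = i + 1;
    while (j < sorted.length && sorted[j] === sorted[i]) j++;
    if (j - i === 1) result.push(sorted[i]); // run of length one
    i = j;
  }
  return result;
}
```

Each run is measured in full before moving on, so the final value in the sorted list is handled by the same loop as every other one.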

SIMMS7400 · #6 Post by **SIMMS7400** » 02 Dec 2018 09:28

Antonio -

That worked like an absolute charm!!!!!

Thank you so much! Here is the end result:

C-12527631
C-12527651
C-12527800
C-12527825
C-12527844
C-12527883
C-12527892
C-12527902
C-12527904
C-12527906
C-12527907
C-12527908
C-12527911
C-12527913
C-12527914
C-12527920
C-12527925
C-12527926
C-12527930
C-12527934
C-12527945
C-12527947
C-12527950
C-12527955
C-12527956
C-12527960
C-12527961
C-12527965
C-12527977
C-12527978
C-12527980
C-12527983
C-12527987
C-12527989
C-12527990
C-12527991
C-12527992
C-12527993
C-12527996
C-12527997
C-12528000
C-12528004
C-12528006

#7 Post by **Aacini** » 02 Dec 2018 10:41

I suggest you also test my first code. You could measure the time that both methods take and post the results (and also the time that your original code takes).

IMHO this topic is about efficiency, isn't it? So this point deserves full attention (not just getting the correct result).

I am very interested in this type of tests!

Antonio

DosTips.com
