OMG
The standard Windows SORT command supports the /UNIQUE option, at least on Win 10, even though it is not documented. I had no idea!
Code:
D:\test>sort /?
SORT [/R] [/+n] [/M kilobytes] [/L locale] [/REC recordbytes]
  [[drive1:][path1]filename1] [/T [drive2:][path2]]
  [/O [drive3:][path3]filename3]
  /+n                         Specifies the character number, n, to
                              begin each comparison. /+3 indicates that
                              each comparison should begin at the 3rd
                              character in each line. Lines with fewer
                              than n characters collate before other lines.
                              By default comparisons start at the first
                              character in each line.
  /L[OCALE] locale            Overrides the system default locale with
                              the specified one. The "C" locale yields
                              the fastest collating sequence and is
                              currently the only alternative. The sort
                              is always case insensitive.
  /M[EMORY] kilobytes         Specifies amount of main memory to use for
                              the sort, in kilobytes. The memory size is
                              always constrained to be a minimum of 160
                              kilobytes. If the memory size is specified
                              the exact amount will be used for the sort,
                              regardless of how much main memory is
                              available.
                              The best performance is usually achieved by
                              not specifying a memory size. By default the
                              sort will be done with one pass (no temporary
                              file) if it fits in the default maximum
                              memory size, otherwise the sort will be done
                              in two passes (with the partially sorted data
                              being stored in a temporary file) such that
                              the amounts of memory used for both the sort
                              and merge passes are equal. The default
                              maximum memory size is 90% of available main
                              memory if both the input and output are
                              files, and 45% of main memory otherwise.
  /REC[ORD_MAXIMUM] characters Specifies the maximum number of characters
                              in a record (default 4096, maximum 65535).
  /R[EVERSE]                  Reverses the sort order; that is,
                              sorts Z to A, then 9 to 0.
  [drive1:][path1]filename1   Specifies the file to be sorted. If not
                              specified, the standard input is sorted.
                              Specifying the input file is faster than
                              redirecting the same file as standard input.
  /T[EMPORARY]
    [drive2:][path2]          Specifies the path of the directory to hold
                              the sort's working storage, in case the data
                              does not fit in main memory. The default is
                              to use the system temporary directory.
  /O[UTPUT]
    [drive3:][path3]filename3 Specifies the file where the sorted input is
                              to be stored. If not specified, the data is
                              written to the standard output. Specifying
                              the output file is faster than redirecting
                              standard output to the same file.
D:\test>
I'm glad you posted your question with your code.
I don't think it has much impact on performance, but there is no need to store %%G in a variable; you can echo %%G directly. That also eliminates the need for delayed expansion.
Most references to %%G should be quoted in case the file name contains spaces.
Also, the [^...] regex could give the wrong result because the batch CALL statement doubles all quoted ^ characters, so ^ becomes ^^. The first ^ is interpreted as negation, as you want, but the second is a literal ^ character that becomes part of the set. One solution is to use the /XSEQ option along with the non-standard \c escape sequence. Another option is to store the find and replace strings in environment variables and use the /V option.
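The effect of the doubled caret can be seen with an ordinary JavaScript regex. This is only a sketch of the regex that CALL ends up passing along; it does not run JREPL, and the sample string is made up:

```javascript
// What was intended: remove anything that is not a lowercase letter
var intended = /[^a-z]+/g;
// What the regex becomes after CALL doubles the quoted caret:
// the first ^ negates the set, the second is a literal ^ inside the set
var doubled = /[^^a-z]+/g;

var sample = 'ab^cd!ef';
var intendedResult = sample.replace(intended, '');
var doubledResult = sample.replace(doubled, '');

console.log(intendedResult); // abcdef
console.log(doubledResult);  // ab^cdef  (the ^ survives)
```

So the doubled form silently leaves any ^ characters in the text instead of stripping them.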
The speed of JREPL is relative. Compared to pure batch solutions, it is very fast, in addition to being much more powerful. But it is still using a script to do most of the work (JScript). Compared to a compiled program, it is very slow.
You could achieve much faster results with a compiled utility like the unix sed utility; Windows ports are available from any number of places.
But your JREPL solution could be optimized and made MUCH faster (more than 100 times faster).
Your first find/replace needs to stay pretty much the same, except for using \c with /XSEQ to prevent doubling the caret.
The /J option in your 3rd CALL must dynamically execute the toLowerCase() function via eval() for every replacement, which is very costly. Using /JQ is a bit more tedious to type, but much faster because it is able to dynamically create a replace function once via eval(), and then call it normally for all of the replacements.
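To illustrate the difference, here is a rough sketch in plain JavaScript. It is not JREPL's actual source: the helper names `replaceEvalEachTime` and `replaceCompiledOnce` are made up, and the `$txt=` convention is simply modeled on the form used with /JMATCHQ in the script in this post.

```javascript
// Hypothetical illustration only: these helpers are not part of JREPL.

// /J-style: the code string is parsed and evaluated again for every match.
function replaceEvalEachTime(text, regex, code) {
  return text.replace(regex, function ($0) {
    // direct eval can see $0 from the enclosing function scope
    return eval(code);
  });
}

// /JQ-style: the code string is turned into a function once, then reused.
function replaceCompiledOnce(text, regex, code) {
  // The code is expected to assign its result to $txt
  var fn = new Function('$0', 'var $txt; ' + code + '; return $txt;');
  return text.replace(regex, function ($0) {
    return fn($0); // ordinary function call, no re-parsing
  });
}

var sample = 'Hello World FOO';
var a = replaceEvalEachTime(sample, /\w+/g, '$0.toLowerCase()');
var b = replaceCompiledOnce(sample, /\w+/g, '$txt=$0.toLowerCase()');

console.log(a); // hello world foo
console.log(b); // hello world foo
```

Both produce the same text; the second just avoids re-parsing the code string on each match, which is where the savings come from on large files.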
But it is possible to use /JMATCHQ instead to reduce the three remaining calls into a single one. Simply search for each word of length 4 or longer and write the lowercase form on a new line via the /JMATCHQ option.
This has not been tested, but I believe it will work:
Code:
@echo off
for %%F in (*.txt) do (
  echo %%F
  rem Remove non-alphanumeric characters that aren't whitespace
  call jrepl "[\ca-zA-Z0-9\s]+" "" /xseq /f "%%F" /o -
  rem Write each remaining word >=4 characters as lowercase on a new line
  call jrepl "\S{4,}" "$txt=$0.toLowerCase()" /jmatchq /f "%%F" /o -
  rem Reduce to sorted list of unique words
  sort /unique "%%F" /o "%%F"
)
pause
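To make the intended result concrete, here is a plain JavaScript sketch of what the three steps should produce for one sample string. It only mimics the logic; it is not JREPL or SORT, the name `uniqueWords` is made up, and the sort here uses plain code-point order, which only approximates how SORT collates.

```javascript
// Hypothetical helper mimicking the batch script's three steps on a string.
function uniqueWords(text) {
  // Step 1: strip everything that is not alphanumeric or whitespace
  var cleaned = text.replace(/[^a-zA-Z0-9\s]+/g, '');
  // Step 2: take each run of 4 or more non-whitespace characters, lowercased
  var words = (cleaned.match(/\S{4,}/g) || []).map(function (w) {
    return w.toLowerCase();
  });
  // Step 3: drop duplicates, then sort
  return Array.from(new Set(words)).sort();
}

var result = uniqueWords('The quick, brown fox! Quick brown-fox jumps over lazy dogs.');
console.log(result.join('\n'));
// brown
// brownfox
// dogs
// jumps
// lazy
// over
// quick
```

Note that "brown-fox" collapses to "brownfox" because the hyphen is removed in step 1 before the words are split.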
Dave Benham