Sort tokens within a string & Disable FOR /F EOL option

Posted: 11 May 2011 15:06
by dbenham
2011-07-10: Changed title to reference discussion of FOR /F EOL option which begins in 3rd post on this thread

While working on a "universal" %DATE% parser it became necessary to sort tokens within a string. A few techniques were briefly bandied about that all relied on a fixed small number of tokens within the string. I thought it might be useful to have a generic function that can efficiently handle any number of tokens.

The first function I developed relies on a pipe to the SORT command, and is therefore case insensitive. I'm very happy that it does not require any explicit temporary files. (The SORT command can internally create a temporary file, but I doubt tokens within a single string could ever cause that to happen.) The performance is good, and it is virtually unaffected by the length of the string or the number of tokens.

Code: Select all

@echo off
setlocal
set "str=red blue yellow white black grey green purple orange"
echo:  unsorted str = %str%
call :sortStrTokensI str
echo: ascending str = %str%
call :sortStrTokensI str /r
echo:descending str = %str%
echo:
set str=I i e E A a À Á Â Ã Ä Å Æ È É Ê Ë Ì Í Î Ï à á â ã ä å æ è é ê ë ì í î ï
echo:         unsorted = %str%
call :sortStrTokensI str
echo:      native sort = %str%
set str=I i e E A a À Á Â Ã Ä Å Æ È É Ê Ë Ì Í Î Ï à á â ã ä å æ è é ê ë ì í î ï
call :sortStrTokensI str "/l C"
echo:^>=128 binary sort = %str%
exit /b

:sortStrTokensI StrVar ["sort options"]
::
:: Perform a case insensitive sort of tokens within the string contained
:: by variable StrVar.
::
:: By default the tokens are sorted using the local collating sequence
:: in ascending order. All sorts are case insensitive.
::
:: The following sort options can over-ride default behaviour
::
::   /R     Specifies a descending sort.
::
::   "/L C" Characters greater than ASCII 127 are sorted according to
::          their binary encoding.
::
:: Multiple options should be enclosed by a single pair of quotes
::
:: This function does not properly handle tokens containing * or ?
::
  setlocal enableDelayedExpansion
  set "str=!%~1!"
  set "sorted="
  for /f %%a in ('^(for %%t in ^(!str!^) do @echo %%t^)^|sort %~2') do set "sorted=!sorted! %%a"
  (endlocal
    set "%~1=%sorted:~1%"
  )
exit /b

Here are the sortStrTokensI test results:

Code: Select all

  unsorted str = red blue yellow white black grey green purple orange
 ascending str = black blue green grey orange purple red white yellow
descending str = yellow white red purple orange grey green blue black

         unsorted = I i e E A a À Á Â Ã Ä Å Æ È É Ê Ë Ì Í Î Ï à á â ã ä å æ è é ê ë ì í î ï
      native sort = ï Ä Í É À È Ã Æ Ì Â Ë Á Ï Ê Å Î æ ì a A E e I i á à â ë î é ã ä å è í ê
>=128 binary sort = A a E e i I À Á Â Ã Ä Å Æ È É Ê Ë Ì Í Î Ï à á â ã ä å æ è é ê ë ì í î ï


My next attempt is a case sensitive version that does not rely on SORT or pipes. It is roughly three times faster for a string with only 3 tokens, but performance degrades dramatically as the number of tokens grows. The function could be extended to support a case insensitive option, but I prefer the performance profile of the first function.

Code: Select all

@echo off
setlocal
set "str=red blue yellow white black grey green purple orange"
echo:  unsorted str = %str%
call :sortStrTokens str
echo: ascending str = %str%
call :sortStrTokens str /r
echo:descending str = %str%
echo:
set str=I i e E A a À Á Â Ã Ä Å Æ È É Ê Ë Ì Í Î Ï à á â ã ä å æ è é ê ë ì í î ï
echo:  unsorted str = %str%
call :sortStrTokens str
echo: ascending str = %str%
call :sortStrTokens str /r
echo:descending str = %str%
exit /b

:sortStrTokens StrVar [/R]
::
:: Perform a case sensitive sort of tokens within the string contained
:: by variable StrVar.
::
:: By default the tokens are sorted using the local collating sequence
:: in ascending order.
::
:: The case insensitive /R option specifies a descending sort
::
:: This function does not properly handle tokens containing * or ?
::
  setlocal enableDelayedExpansion
  set "str=!%~1!"
  set "sorted="
  if /i "%~2"=="/R" (set comp=geq) else set comp=leq
  for %%t in (!str!) do (
    if not defined sorted (set "sorted=%%t") else (
      set "sorted2="
      set placed=
      for %%a in (!sorted!) do (
        if not defined placed if %%t %comp% %%a (
          set "sorted2=!sorted2! %%t"
          set placed=true
        )
        set "sorted2=!sorted2! %%a"
      )
      if not defined placed set "sorted2=!sorted2! %%t"
      set "sorted=!sorted2:~1!"
    )
  )
  (endlocal
    set "%~1=%sorted%"
  )
exit /b

Here are the sortStrTokens test results:

Code: Select all

  unsorted str = red blue yellow white black grey green purple orange
 ascending str = black blue green grey orange purple red white yellow
descending str = yellow white red purple orange grey green blue black

  unsorted str = I i e E A a À Á Â Ã Ä Å Æ È É Ê Ë Ì Í Î Ï à á â ã ä å æ è é ê ë ì í î ï
 ascending str = ï Ä Í É À È Ã Æ Ì Â Ë Á Ï Ê Å Î æ ì a A e E i I á à â ë î é ã å ä í è ê
descending str = ê è í ä å ã é î ë â à á I i E e A a ì æ Î Å Ê Ï Á Ë Â Ì Æ Ã È À É Í Ä ï



Here is a comparison of the performance profiles of the two functions:

Code: Select all

Token   Seconds to Perform 100 Iterations
Count     sortStrTokensI  sortStrTokens
-----     --------------  -------------
   3           2.7             0.8
  10           2.9             1.4
  20           2.7             2.4
  30           2.8             4.1
  50           2.9             9.6


As always I'm interested if anyone can point out problems, optimizations, or entirely new solutions.

Dave Benham

Re: Sorting tokens within a string

Posted: 11 May 2011 23:11
by amel27
Hi, dbenham. Some thoughts on the subject.

A simple FOR command is not a good way to enumerate substrings if the string contains wildcards, because it is designed for enumerating files. A safer method is to insert <LF> characters (thanks jeb) and enumerate with the extended "FOR /F" command.

A slightly modified :sortStrTokens via SORT:

Code: Select all

:sortStrTokens StrVar ["sort options"]
  setlocal enableDelayedExpansion
  set "sorted="& set ^"str=!%~1: =^

!"
  for /f %%a in ('cmd/v:on/c "echo ^!str^!"^|sort %~2') do set "sorted=!sorted! %%a"
  endlocal& set "%~1=%sorted:~1%"
exit /b

Case insensitive, ascending order :sortStrTokens via the native SET command:
("=" and "!" chars are not supported)

Code: Select all

:sortStrTokens StrVar
  setlocal enableDelayedExpansion
  set "sorted="& set ^"str=!%~1: =^

!"
  for /f %%a in ("!str!") do set "$_%%a=="
  for /f "delims==" %%a in ('set $_') do (
    set "$a=%%a"
    set "sorted=!sorted! !$a:~2!"
  )
  endlocal& set "%~1=%sorted:~1%"
exit /b

Re: Sorting tokens within a string

Posted: 12 May 2011 22:37
by dbenham
amel27 wrote:A simple FOR command is not a good way to enumerate substrings if the string contains wildcards, because it is designed for enumerating files. A safer method is to insert <LF> characters (thanks jeb) and enumerate with the extended "FOR /F" command.

Brilliant! I was aware of all the components involved with this technique, but I never dreamed of applying them in this combination for this particular purpose. Thanks amel27 and Jeb. It opens up lots of possibilities. For example, we can add a function option to control the token delimiter(s).

We need to be careful about the implicit pesky "eol" option with the FOR /F loop. I've seen some claims that "eol=" will disable it, but it actually sets the eol character to a quote. However I discovered just today that you can set the eol character to the same as your delimiter, and the delimiter functionality takes precedence, which effectively disables the "eol" option. For example "eol= delims= " will use space as a delimiter and will never skip a line that starts with a space.
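
A minimal sketch of the difference, using a made-up test string that begins with the default eol character. The second loop uses the "tokens=1,2 eol= " form, so eol is a space, which is also one of the default delimiters:

Code: Select all

```batch
@echo off
rem The string starts with the default eol character (;), so this loop skips it.
for /f "tokens=1,2" %%a in (";one two") do echo default:   a=%%a  b=%%b
rem Here eol is set to <space>, which is also a default delimiter,
rem so the eol test is effectively disabled and the line is kept.
for /f "tokens=1,2 eol= " %%a in (";one two") do echo eol=space: a=%%a  b=%%b
```

Only the second loop should echo anything, and its first token should keep the leading semicolon.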

I think it should be more efficient to use

Code: Select all

for /f %%a in ('^(for /f %%t in ^(!str!^) do @echo %%t^)^|sort %~2') do set "sorted=!sorted! %%a"

instead of

Code: Select all

for /f %%a in ('cmd/v:on/c "echo ^!str^!"^|sort %~2') do set "sorted=!sorted! %%a"

On my Vista machine at home it takes nearly 0.5 seconds to launch cmd.exe. (I wish I knew why it is so slow. At work it is 10 times faster.) The pipe is already slowing the function enough!


amel27 wrote:Case insensitive, ascending order :sortStrTokens via the native SET command:
("=" and "!" chars are not supported)

Oh I like this! :D. It should be screaming fast, even on my home Vista machine with the cmd.exe slowness problem!

A few observations:
  • Just to be safe, we should initialize by clearing all existing variables starting with $_
  • As written, duplicate tokens will be stripped, but they can optionally be preserved in the variable values.
  • Tokens with ! could be supported if we pre- and post-process the tokens with appropriate substitutions. The same might even be done for = if we use one of the techniques proposed in How to replace "=", "*", ":" in a variable
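
A rough sketch of the first two observations, not a finished function. The name :sortStrTokensSet and the test string are made up for illustration. Each token is appended to the value of its own $_ variable, so duplicates survive, and any $_ variables that already exist are cleared first. Tokens containing = or ! are still not supported, and a token starting with ; would be skipped by the default eol.

Code: Select all

```batch
@echo off
setlocal
set "str=red blue Red yellow blue"
call :sortStrTokensSet str
echo sorted = %str%
exit /b

:sortStrTokensSet StrVar
:: Hypothetical variant of the SET-based sort; case insensitive, ascending.
  setlocal enableDelayedExpansion
  rem Clear any existing $_ variables so they cannot leak into the result
  for /f "delims==" %%v in ('set $_ 2^>nul') do set "%%v="
  rem Replace each space with <LF> so FOR /F sees one token per line
  set "sorted="& set ^"str=!%~1: =^

!"
  rem Append each token to the value of its own variable to keep duplicates
  for /f %%a in ("!str!") do set "$_%%a=!$_%%a! %%a"
  rem SET lists the variables in sorted order; collect the stored values
  for /f "tokens=1* delims==" %%a in ('set $_') do set "sorted=!sorted!%%b"
  endlocal& set "%~1=%sorted:~1%"
exit /b
```

Because variable names are case insensitive, "red" and "Red" land in the same variable, so they end up adjacent in the output in their original order of appearance.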

Thanks amel27 for your excellent ideas. I'll work with them over the next few days and see what I can come up with.

Dave Benham

Re: Sorting tokens within a string

Posted: 13 May 2011 01:34
by amel27
dbenham wrote:We need to be careful about the implicit pesky "eol" option with the FOR /F loop. I've seen some claims that "eol=" will disable it, but it actually sets the eol character to a quote. However I discovered just today that you can set the eol character to the same as your delimiter, and the delimiter functionality takes precedence, which effectively disables the "eol" option. For example "eol= delims= " will use space as a delimiter and will never skip a line that starts with a space.
Yes, "eol" is a very strange option; I can't understand when it works and when it doesn't. Any information would be useful.

dbenham wrote:I think it should be more efficient to use

Code: Select all

for /f %%a in ('^(for /f %%t in ^(!str!^) do @echo %%t^)^|sort %~2') do set "sorted=!sorted! %%a"
instead of

Code: Select all

for /f %%a in ('cmd/v:on/c "echo ^!str^!"^|sort %~2') do set "sorted=!sorted! %%a"
Your variant doesn't work with the embedded <LF> characters, because !STR! must be expanded in the nested CMD, not the current one... But since DelayedExpansion is disabled by default, we have to launch CMD.EXE twice.

dbenham wrote:On my Vista machine at home it takes nearly 0.5 seconds to launch cmd.exe. (I wish I knew why it is so slow. At work it is 10 times faster.) The pipe is already slowing the function enough!
I think you should try "%comspec%" (with the full path) instead of the short "CMD" command.

Re: Sorting tokens within a string

Posted: 13 May 2011 13:30
by dbenham
amel27 wrote:Your variant doesn't work with the embedded <LF> characters, because !STR! must be expanded in the nested CMD, not the current one... But since DelayedExpansion is disabled by default, we have to launch CMD.EXE twice.

I don't fully understand the explanation, but I agree my suggestion doesn't work. I tried lots of variations on the theme but to no avail. I'm a bit mystified why my original :sortStrTokensI works, but this suggestion doesn't.


amel27 wrote:I think you should try "%comspec%" (with the full path) instead of the short "CMD" command.

No difference in performance, unfortunately. :cry:


amel27 wrote:Yes, "eol" is a very strange option; I can't understand when it works and when it doesn't. Any information would be useful.

I think it is fairly straightforward, but terribly documented. See if the following examples help:

Code: Select all

@echo off
setlocal enableDelayedExpansion
cls
set lf=^


set "testLines=1)Hello;world!lf!;2)Hello;world!lf!  3)Hello;world!lf!"4)Hello" "world""
echo !lf!testLines are:!lf!!lf!!testLines!
echo !lf!============================!lf!
set /a n=0
call :test "tokens=1,2"
call :test "tokens=1,2 eol="
call :test "tokens=1,2 eol= delims=;"
call :test "tokens=1,2 eol= "
call :test "tokens=1,2 delims=;"
call :test "tokens=1,2 eol= delims=; "
exit /b

:test
set /a n+=1
echo Call %n% options: %1
echo:
for /f %1 %%a in ("!testLines!") do (
  echo a=%%a  b=%%b
)
echo !lf!============================!lf!
exit /b

results:

Code: Select all

testLines are:

1)Hello;world
;2)Hello;world
  3)Hello;world
"4)Hello" "world"

============================

Call 1 options: "tokens=1,2"

a=1)Hello;world  b=
a=3)Hello;world  b=
a="4)Hello"  b="world"

============================

Call 2 options: "tokens=1,2 eol="

a=1)Hello;world  b=
a=;2)Hello;world  b=
a=3)Hello;world  b=

============================

Call 3 options: "tokens=1,2 eol= delims=;"

a=1)Hello  b=world
a=2)Hello  b=world
a="4)Hello" "world"  b=

============================

Call 4 options: "tokens=1,2 eol= "

a=1)Hello;world  b=
a=;2)Hello;world  b=
a=3)Hello;world  b=
a="4)Hello"  b="world"

============================

Call 5 options: "tokens=1,2 delims=;"

a=1)Hello  b=world
a=2)Hello  b=world
a=  3)Hello  b=world
a="4)Hello" "world"  b=

============================

Call 6 options: "tokens=1,2 eol= delims=; "

a=1)Hello  b=world
a=2)Hello  b=world
a=3)Hello  b=world
a="4)Hello"  b="world"

============================

Notes:
Call 1: default eol=<semicolon>, default delims=<space><tab>
Line 2) starting with <semicolon> stripped

Call 2: eol=<quote>, default delims=<space><tab>
Very interesting that the <quote> both sets the eol option and terminates the argument string!
Line 4) starting with <quote> stripped

Call 3: eol=<space>, delims=<semicolon>
Line 3) starting with <space> stripped

Call 4: eol=<space>, default delims=<space><tab>
eol matches one of the delimiters, so it is disabled and all lines preserved!

Call 5: default eol=<semicolon>, delims=<semicolon>
eol matches one of the delimiters, so it is disabled and all lines preserved!

Call 6: eol=<space>, delims=<semicolon><space>
eol matches one of the delimiters, so it is disabled and all lines preserved!

I was not able to figure out how to specify <quote> as a delimiter, but I don't think I really care either.


Dave Benham

Re: Sorting tokens within a string

Posted: 14 May 2011 01:44
by orange_batch
Without reading the entire thread, my first thoughts were to sort via sort or set, but I see you guys have that covered. 8)

I programmed a nifty function that will sort an array of data based on a matching array containing numerical values (with which to sort by). So for example, you want to sort a log of paths based on their length, or file size, or whatever, it'll do that. It does not use sort either (sort doesn't work with variable-digit numbers anyways), so it's quite speedy.

Re: Sorting tokens within a string

Posted: 15 May 2011 07:35
by amel27
Many thanks, dbenham, for the exhaustive information. :)

Re: Sorting tokens within a string

Posted: 16 May 2011 15:41
by dbenham
orange_batch wrote:(sort doesn't work with variable-digit numbers anyways)

Sort works with numbers if you convert them to hex first using :num2Hex :wink:
It even works with a mixture of negative and positive numbers if you prefix the hex with the sign: - if <0, + if >=0

Re: Sorting tokens within a string

Posted: 17 May 2011 07:06
by jeb
Thanks to dbenham,

I didn't know that you can disable the eol-character, if it is one of the delims-characters.

But I suppose I found a way to disable the eol even if the delims is empty.

Code: Select all

setLocal EnableDelayedExpansion
set lf=^


for /F ^"eol^=^

delims^="" %%a in ("^^caret!lf!;semicolon!lf! space!lf!""quote") do echo '%%a'

Obviously eol is set to <LF> here, but since FOR /F splits the input at each <LF>, the effect is the same as if eol were empty.

jeb

Re: Sorting tokens within a string

Posted: 17 May 2011 13:06
by dbenham
Jeb wrote:But I suppose I found a way to disable the eol even if the delims is empty.


Code: Select all

setLocal EnableDelayedExpansion
set lf=^


for /F ^"eol^=^

delims^="" %%a in ("^^caret!lf!;semicolon!lf! space!lf!""quote") do echo '%%a'

Most excellent Jeb! :D I had thought of setting eol=<LF>, but I could not figure out a syntax that worked. I'm glad you figured it out.

I experimented with Jeb's syntax and discovered all <space> and <equal> must be escaped when adding additional options. The two quotes ("") at the end confuse me. I found that you can use ^" at the end as well. Finally, don't let Jeb's example fool you - you do not need delayed expansion to disable eol.

Code: Select all

@echo off
setlocal disableDelayedExpansion
set "str=skip 1st line,^caret,;semicolon, space,"quote,!exclamation"
setLocal EnableDelayedExpansion
set ^"str=!str:,=^

!"
echo str=!str!
echo:----------------
for /F ^"usebackq^ skip^=1^ eol^=^

delims^=^" %%a in ('!str!') do (
  setlocal disableDelayedExpansion
  echo '%%a'
  endlocal
)
echo:---------------
setlocal disableDelayedExpansion
for /f ^"eol^=^

delims^=^" %%a in (";eol disabled so line preserved") do echo %%a

Summary guidelines for simplest way to disable eol:

1) if delims is enabled then set eol to one of the delimiters

2) if delims is disabled then use Jeb's syntax

Dave Benham