Using many "tokens=..." in FOR /F command in a simple way
Moderator: DosItHelp
Re: Using many "tokens=..." in FOR /F command in a simple way
Based on this result I guess that you both use codepage 850 and that the order depends on UTF-32/UTF-16/UCS-2:
- CP_850(173) = CP_850(0xAD) = U+00A1
- CP_850(189) = CP_850(0xBD) = U+00A2
- CP_850(156) = CP_850(0x9C) = U+00A3
- CP_850(207) = CP_850(0xCF) = U+00A4
- (i haven't checked more)
penpen
Edit: Corrected some flaws.
Edit2: Added the other two possibilities.
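The four mappings listed in this post can be cross-checked outside of cmd. The following is only a sketch in Python, not part of the thread's batch code. It assumes that Python's built-in "cp850" codec agrees with the table Windows uses for OEM codepage 850; the names code_point and order are made up for this example.

```python
# Map every high byte (128..255) of codepage 850 to its Unicode code point.
# Assumption: Python's "cp850" codec agrees with the Windows OEM 850 table.
code_point = {b: ord(bytes([b]).decode("cp850")) for b in range(128, 256)}

# Byte values sorted by the code point they represent.
order = sorted(code_point, key=code_point.get)

for b in (173, 189, 156, 207):
    print(f"CP_850({b}) = CP_850(0x{b:02X}) = U+{code_point[b]:04X}")
```

In the sorted list, 173, 189, 156 and 207 come out next to each other in that order, which fits the guess that the order follows the Unicode values.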
Re: Using many "tokens=..." in FOR /F command in a simple way
Yes Antonio. It works for me, too.
Code: Select all
Thread 1:
173 189 156 207 190 221 245 249 184 166 174 170 240 169 238 248 241 253 252 239 230 244 250
247 251 167 175 172 171 243 168 183 181 182 199 142 143 146 128 212 144 210 211 222 214 215
216 209 165 227 224 226 229 153 158 157 235 233 234 154 237 232 225 133 160 131 198 132 134
145 135 138 130 136 137 141 161 140 139 208 164 149 162 147 228 148 246 155 151 163 150 129
236 231 152
Thread 2:
159
Thread 3:
176 177 178
Thread 4:
179
Thread 5:
180
Thread 6:
185
Thread 7:
186
Thread 8:
187
Thread 9:
188
Thread 10:
191
Thread 11:
192
Thread 12:
193
Thread 13:
194
Thread 14:
195
Thread 15:
196
Thread 16:
197
Thread 17:
200
Thread 18:
201
Thread 19:
202
Thread 20:
203
Thread 21:
204
Thread 22:
205
Thread 23:
206
Thread 24:
213
Thread 25:
217
Thread 26:
218
Thread 27:
219
Thread 28:
220
Thread 29:
223
Thread 30:
242
Thread 31:
254
Thread 32:
255
Code: Select all
tokens=1,20,45,75,120
A1 A20 A45 A75 A120
B1 B20 B45 B75 B120
C1 C20 C45 C75 C120
tokens=30,28-32,170-165
A30 A28 A29 A30 A31 A32 A170 A169 A168 A167 A166 A165
B30 B28 B29 B30 B31 B32 B170 B169 B168 B167 B166 B165
C30 C28 C29 C30 C31 C32 C170 C169 C168 C167 C166 C165
tokens=
@penpen
This sounds quite logical (even if it's hard to believe that cmd works with UTF-32 rather than UTF-16). What would be your suggestion then? Having a UTF-32-encoded file, somehow reading the characters out of it and converting them to the current code page?
Steffen
Re: Using many "tokens=..." in FOR /F command in a simple way
aGerman wrote:This sounds quite logical (even if it's hard to believe that the cmd works with UTF-32 rather than UTF-16 ).
Yes, UTF-32 is hard to believe... I've added UTF-16/UCS-2 above.
The most probable now is UCS-2.
aGerman wrote:What would be your suggestion then? Having a UTF-32-encoded file and somehow read the characters out of it and convert them to the current code page?
I'm unsure... it depends on the purpose, I think.
If you want to check this for all values, then I would probably use Java to create the source file, and would use codepage 65001 within the source to create all needed code points.
penpen
Re: Using many "tokens=..." in FOR /F command in a simple way
penpen wrote:i've added UTF-16/UCS-2 above.
The most probable now is UCS-2.
I stick with UTF-16. But that shouldn't make any difference here.
penpen wrote:depends on the purpose i think.
Well the purpose is to find the order of FOR variable names for the current OEM code page. My idea is to somehow work with TYPE and CMD /U.
Steffen
Re: Using many "tokens=..." in FOR /F command in a simple way
aGerman wrote:penpen wrote:i've added UTF-16/UCS-2 above.
The most probable now is UCS-2.
I stick with UTF-16. But that shouldn't make any difference here.
Theoretically there might be a difference:
The sort order of characters within UTF-16 depends on the UTF-32 code point value, while the sort order of UCS-2 depends on their (16 bit) index.
So for two indices c and (c+1), the characters UCS_2(c) and UCS_2(c+1) are successive in UCS-2, but in UTF-16 the characters UTF_16(c) and UTF_16(c+1) are not necessarily successive.
But i don't know, if this applies to any OEM codepage.
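To make the difference concrete, here is a small Python sketch (not part of the thread's batch code). It sorts two characters, picked only for illustration, once by code point and once by their raw 16-bit code units. The helper utf16_units is hypothetical. Which of the two orders cmd really uses is exactly the open question here, so this only shows that they can differ once a surrogate pair is involved.

```python
def utf16_units(ch):
    """Return the 16-bit code units of one character as a list of ints."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

bmp_char = "\uffe2"          # one code unit: FFE2
supplementary = "\U0001d11e"  # surrogate pair: D834 DD1E

# Order by code point (the "UTF-32 value").
by_code_point = sorted([supplementary, bmp_char], key=ord)
# Order by the sequence of 16-bit code units.
by_code_unit = sorted([bmp_char, supplementary], key=utf16_units)

print([f"U+{ord(c):04X}" for c in by_code_point])
print([f"U+{ord(c):04X}" for c in by_code_unit])
```

As long as a codepage contains only BMP characters outside the surrogate range, both orders are the same.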
aGerman wrote:Well the purpose is to find the order of FOR variable names for the current OEM code page. My idea is to somehow work with TYPE and CMD /U.
I have no idea how to deal with double/multibyte codepages.
If you use single-byte codepages, then I would create a table with ANSI values 01-255 (table.dat), and load it whenever you change the codepage using:
Code: Select all
set "table="
set /P "table=" < "table.dat"
Then just sort by index using "sort /+4".
(Somehow like that or similar.)
penpen
Re: Using many "tokens=..." in FOR /F command in a simple way
Hi Antonio,
I've used your latest "FOR-F with many tokens - SP.bat" file and
changed the codepage to 850, and it works flawlessly. Don't know why.
tokens=170-180
A170 A171 A172 A173 A174 A175 A176 A177
B170 B171 B172 B173 B174 B175 B176 B177
C170 C171 C172 C173 C174 C175 C176 C177
tokens=180-170
A177 A176 A175 A174 A173 A172 A171 A170
B177 B176 B175 B174 B173 B172 B171 B170
C177 C176 C175 C174 C173 C172 C171 C170
tokens=
Whereas with my default codepage 437, some weird characters appear.
tokens=170-180
A86 %∞ %τ
B86 %∞ %τ
C86 %∞ %τ
tokens=180-170
%τ %∞ A86
%τ %∞ B86
%τ %∞ C86
tokens=
I'm using Windows 8.1 Pro 64-bit US version
Re: Using many "tokens=..." in FOR /F command in a simple way
penpen wrote:But i don't know, if this applies to any OEM codepage.
Maybe for Chinese. I don't know. I'm quite familiar with UTF-16 and its surrogate concept. The reason why I assume that the cmd deals with UTF-16 is that 1) Windows deals with it 2) I found the references of MultiByteToWideChar and WideCharToMultiByte API functions in cmd.exe.
penpen wrote:If you use single byte codepages, then i would create a table with ansi values 01-255 (table.dat), and load it whenever you change the codepage using:
Code: Select all
set "table="
set /P "table=" < "table.dat"
Then i would use "cmd /u" to output all the characters to a file, and use "fc /b" against a file of null bytes to get their indices (storing each for example in environment variable 'set "c_<2 byte index in array>=<cmd /u index>"').
Then just sort by index using "sort /+4".
(Somehow like that or similar.)
Yes that's basically what I try to do. It's a bit tricky because of the Little Endianness of the wide characters.
Steffen
Re: Using many "tokens=..." in FOR /F command in a simple way
Is this something you could work with Antonio?
The code creates file "u-a.txt" with the unicode values (hex) and the related OEM char code (dec).
Steffen
Code: Select all
@echo off &setlocal
:: create base64 code
>"tmp1" echo(gIGCg4SFhoeIiYqLjI2Oj5CRkpOUlZaXmJmam5ydnp+goaKjpKWmp6ipqqusra6vsLGys7S1tre4ubq7vL2+v8
>>"tmp1" echo(DBwsPExcbHyMnKy8zNzs/Q0dLT1NXW19jZ2tvc3d7f4OHi4+Tl5ufo6err7O3u7/Dx8vP09fb3+Pn6+/z9/v8=
:: decode to bytes 128 to 255
>nul certutil.exe -f -decode "tmp1" "tmp2"
:: convert the characters represented by these bytes to unicode
>"tmp1" cmd /q /u /c "type "tmp2""
:: create a file with 256 'A's for comparisons using FC
>"tmp2" (for /l %%i in (1 1 4) do <nul set /p "=AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA")
:: create the HEX dump
setlocal EnableDelayedExpansion
set "X=1"
>"dmp" (
for /f "skip=1 tokens=1,2 delims=: " %%i in ('fc /b "tmp1" "tmp2"^|findstr /vbi "FC:"') do (
set /a "Y=0x%%i"
for /l %%k in (!X! 1 !Y!) do echo 41
set /a "X=Y+2"
echo %%j
)
)
del "tmp1"
:: combine hex values in BE order with char codes of the related character of the OEM code page
<"dmp" >"tmp2" (
for /l %%i in (128 1 255) do (
set /p "low=" &set /p "high="
echo !high!!low! %%i
)
)
del "dmp"
:: sort
sort "tmp2" /o "u-a.txt"
del "tmp2"
//EDIT Similar, prints the characters...
Code: Select all
@echo off &setlocal
:: create base64 code
>"tmp1" echo(gIGCg4SFhoeIiYqLjI2Oj5CRkpOUlZaXmJmam5ydnp+goaKjpKWmp6ipqqusra6vsLGys7S1tre4ubq7vL2+v8
>>"tmp1" echo(DBwsPExcbHyMnKy8zNzs/Q0dLT1NXW19jZ2tvc3d7f4OHi4+Tl5ufo6err7O3u7/Dx8vP09fb3+Pn6+/z9/v8=
:: decode to bytes 128 to 255
>nul certutil.exe -f -decode "tmp1" "tmp2"
:: save them in a variable
<"tmp2" set /p "chars="
:: convert the characters represented by these bytes to unicode
>"tmp1" cmd /q /u /c "type "tmp2""
:: create a file with 256 'A's for comparisons using FC
>"tmp2" (for /l %%i in (1 1 4) do <nul set /p "=AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA")
:: create the HEX dump
setlocal EnableDelayedExpansion
set "X=1"
>"dmp" (
for /f "skip=1 tokens=1,2 delims=: " %%i in ('fc /b "tmp1" "tmp2"^|findstr /vbi "FC:"') do (
set /a "Y=0x%%i"
for /l %%k in (!X! 1 !Y!) do echo 41
set /a "X=Y+2"
echo %%j
)
)
del "tmp1"
:: combine hex values in BE order with indexes of the related OEM characters in %chars%
<"dmp" >"tmp2" (
for /l %%i in (0 1 127) do (
set /p "low=" &set /p "high="
echo !high!!low! %%i
)
)
del "dmp"
:: sort
sort "tmp2" /o "index.txt"
del "tmp2"
:: print the characters
for /f "usebackq tokens=2" %%i in ("index.txt") do echo "!chars:~%%i,1!"
pause
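For comparison, the kind of list the first script writes to "u-a.txt" (Unicode value in hex, then the OEM char code in decimal) can be approximated without cmd. This is only a Python sketch under assumptions: it uses Python's "cp850" codec instead of the conversion done by TYPE under CMD /U, and a plain string sort instead of SORT, so the result on a real Windows machine may differ. The function name unicode_to_oem_lines is made up for this example.

```python
def unicode_to_oem_lines(codepage="cp850"):
    """Build sorted 'HHLL byte' lines for the bytes 128..255 of a codepage."""
    lines = []
    for byte in range(128, 256):
        char = bytes([byte]).decode(codepage)
        # UTF-16LE stores the low byte first; the batch code swaps the two
        # bytes as well, to get the value in big-endian reading order.
        low, high = char.encode("utf-16-le")[:2]
        lines.append(f"{high:02X}{low:02X} {byte}")
    return sorted(lines)

for line in unicode_to_oem_lines()[:5]:
    print(line)
```

Only the first two bytes of each encoded character are used, which is sufficient as long as the codepage maps to BMP characters only.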
Re: Using many "tokens=..." in FOR /F command in a simple way
@aGerman:
You shouldn't use "type" to convert those characters to UTF-16/UCS-2:
There may be undefined characters in a codepage that might (depending on the font used) be mapped unexpectedly to a surrogate pair.
Also, the file of 'A's might be too short (in the worst case 512 bytes are needed for the 128 characters, if all of them are surrogate pairs of 4 bytes each).
Although this is slower, you had better do it character by character (requires a table.dat file with the bytes [20, 01 : FF]):
Code: Select all
@echo off
cls
setlocal enableExtensions disableDelayedExpansion
if not "%~1" == "" ( set "codepage=%~1" ) else set "codepage=850"
for /f "tokens=2 delims=:." %%a in ('chcp') do set "cp=%%~a"
>nul fsutil file createnew "zero.txt" 4
>nul chcp %codepage%
set "table="
<"table.dat" set /P "table="
setlocal enableDelayedExpansion
for /l %%a in (0x20, 1, 0xFF) do (
set /a "index=1000+%%~a"
set "index=!index:~1!"
cmd /e:ON /v:ON /d /u /c">"dummy.txt" echo(^!table:~%%~a,1^!"
for /l %%b in (0, 1, 3) do set "b_0000000%%~b=00"
for %%b in ("dummy.txt") do set /a "bytes=%%~zb-4"
for /f "tokens=1,3 delims=: " %%b in ('fc /b "zero.txt" "dummy.txt" ^| findstr "0" ') do if %%~b lss !bytes! set "b_%%~b=%%~c"
set "char_!index!=!table:~%%~a,1!"
if !bytes! == 2 ( set "cp_!index!=0x!b_00000001!!b_00000000!"
) else set "cp_!index!=0x!b_00000001!!b_00000000!,0x!b_00000003!!b_00000002!"
)
:: Result:
:: =======
:: Basic Multilingual Plane characters; single code units:
set "cp_" | 2>nul findstr /V "\," | sort /+7
::
:: Supplementary characters; surrogate pairs:
set "cp_" | 2>nul findstr "\," | sort /+7
::
:: referenced characters
:: set "char_"
:: creating order (with "holes")
:: 32 spaces
set "order= "
for /f "tokens=2 delims=_=" %%a in ('^(set "cp_" ^| 2^>nul findstr /V "\," ^| sort /+7^)^&^(set "cp_" ^| 2^>nul findstr "\," ^| sort /+7^)') do (
set "order=!order!!char_%%~a!"
)
set order
endlocal
del "zero.txt", "dummy.txt"
>nul chcp %cp%
endlocal
penpen
Edit: Corrected some flaws.
Edit: Corrected the byte order of the surrogate pair: Thanks to aGerman for finding this bug.
Re: Using many "tokens=..." in FOR /F command in a simple way
So are you saying we should take care about surrogates?
I see your point though. I agree that we should rather compare 4 bytes each. Although I'm a little confused. A surrogate pair consists of two code units of two bytes each. Looking at your code it seems you reverse all 4 bytes. Shouldn't it read
set "cp_!index!=0x!b_00000001!!b_00000000!,0x!b_00000003!!b_00000002!"
Maybe I'm missing something...
Steffen
Re: Using many "tokens=..." in FOR /F command in a simple way
aGerman wrote:So are you saying we should take care about surrogates
Well, I don't know if one of the OEM codepages is using a surrogate pair, so one had better take care, just in case characters outside the BMP are in use.
Also the code could more easily be extended to any codepage (if ever needed).
aGerman wrote:Shouldn't it read
set "cp_!index!=0x!b_00000001!!b_00000000!,0x!b_00000003!!b_00000002!"
You're absolutely right!
Actually it is too late... tonight.
penpen
Re: Using many "tokens=..." in FOR /F command in a simple way
I tried to improve your code a little. (Without much success though.)
First of all I used CERTUTIL in order to create table.dat from scratch. The MAKECAB technique works great and is downward-compatible but it is terribly slow. The same for zero.txt because FSUTIL requires elevation on Win7 downwards.
I kept the leading 1 of the index in order to avoid unnecessary string manipulations.
I removed the FINDSTR filter for the output of FC.
Even if the bytes read don't represent a surrogate pair it would be okay to leave zero bytes at the end of the hex string. That way you don't need to distinguish between BMP characters and surrogates.
However, the repeated calls of CMD and FC still take a lot of time. I wonder if there is a real risk for an appearance of surrogates...
Steffen
Code: Select all
@echo off
cls
setlocal enableExtensions disableDelayedExpansion
>"dummy.txt" (
echo(IAECAwQFBgcICQoLDA0ODxAREhMUFRYXGBkaGxwdHh8gISIjJCUmJygpKissLS4vMDEyMzQ1Njc4OTo7PD0+P0
echo(BBQkNERUZHSElKS0xNTk9QUVJTVFVWV1hZWltcXV5fYGFiY2RlZmdoaWprbG1ub3BxcnN0dXZ3eHl6e3x9fn+A
echo(gYKDhIWGh4iJiouMjY6PkJGSk5SVlpeYmZqbnJ2en6ChoqOkpaanqKmqq6ytrq+wsbKztLW2t7i5uru8vb6/wM
echo(HCw8TFxsfIycrLzM3Oz9DR0tPU1dbX2Nna29zd3t/g4eLj5OXm5+jp6uvs7e7v8PHy8/T19vf4+fr7/P3+/w==
)
>nul certutil.exe -f -decode "dummy.txt" "table.dat"
>"dummy.txt" echo(AAAAAA==
>nul certutil.exe -f -decode "dummy.txt" "zero.txt"
if not "%~1" == "" ( set "codepage=%~1" ) else set "codepage=850"
for /f "tokens=2 delims=:." %%a in ('chcp') do set "cp=%%~a"
>nul chcp %codepage%
set "table="
<"table.dat" set /P "table="
setlocal enableDelayedExpansion
for /l %%a in (0x20, 1, 0xFF) do (
set /a "index=1000+%%~a"
cmd /e:ON /v:ON /d /u /c ">"dummy.txt" echo(^!table:~%%~a,1^!"
for /l %%b in (0, 1, 3) do set "b_0000000%%~b=00"
for %%b in ("dummy.txt") do set /a "bytes=%%~zb-4"
for /f "skip=1 tokens=1,3 delims=: " %%b in ('fc /b "zero.txt" "dummy.txt"') do if %%~b lss !bytes! set "b_%%~b=%%~c"
set "char_!index!=!table:~%%~a,1!"
set "cp_!index!=!b_00000001!!b_00000000!!b_00000003!!b_00000002!"
)
:: Result:
:: =======
:: set "cp_" | sort /+8
::
:: referenced characters
:: set "char_"
:: creating order (with "holes")
:: 32 spaces
set "order= "
for /f "tokens=2 delims=_=" %%a in ('set "cp_" ^| sort /+8') do (
set "order=!order!!char_%%~a!"
)
set order
endlocal
del "zero.txt", "dummy.txt", "table.dat"
>nul chcp %cp%
endlocal
pause
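The base64 payloads in the listing above can be cross-checked with a few lines of Python (a sketch, not part of the thread's batch code). It rebuilds the byte sequence that table.dat is supposed to contain, a placeholder 0x20 at index 0 followed by the bytes 0x01 to 0xFF, and encodes it again. The chunk width of 86 used for printing is an arbitrary choice for this example.

```python
import base64

# Placeholder byte 0x20 at index 0, then the bytes 0x01..0xFF, so that
# the byte with value n sits at string offset n.
table = bytes([0x20]) + bytes(range(1, 256))
encoded = base64.b64encode(table).decode("ascii")

# Print the text in chunks; certutil -decode accepts it split over lines.
for start in range(0, len(encoded), 86):
    print(encoded[start:start + 86])

# The second payload of the listing, "AAAAAA==", decodes to four zero bytes.
zero = base64.b64decode("AAAAAA==")
print(len(zero), zero == bytes(4))
```

The placeholder at index 0 is there because the batch code addresses the characters by offset, and a zero byte cannot be read into the variable with SET /P.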
Re: Using many "tokens=..." in FOR /F command in a simple way
aGerman wrote:I wonder if there is a real risk for an appearance of surrogates...
Yes, there is a risk, and indeed I have seen some custom codepages where someone used MUSICAL SYMBOL G CLEF (U+1D11E).
But to be honest, it is recommended to use the REPLACEMENT CHARACTER (which is in the BMP) for such cases, so this risk might not be that big for codepages created by Microsoft.
A higher risk should be that characters get lost if an undefined code unit is detected in a multibyte character set (which actually is not your goal, so this shouldn't happen).
I also rethought your usage of "type" to convert multiple characters at once to UCS-2/UTF-16LE, and read up on surrogate pairs again.
It is not as bad as I thought in the first place:
You could detect surrogate pairs (although I've always avoided assuming anything about UTF-16 characters, so I didn't remember it; sorry for that).
(Maybe it was also too late yesterday... same holds for now... so gn8 .)
Code: Select all
:isSurrogate
:: %~1 contains the code unit in hex (example "0xDF12")
:: @returns 0 if %~1 is no surrogate, 1 if it is low surrogate and 3 if it is a high surrogate code unit.
if 0xD800 leq %~1 (
if %~1 leq 0xDBFF ( exit /b 3
) else if %~1 leq 0xDFFF exit /b 1
)
exit /b 0
And you could use your above method (although still not recommended for multibyte character sets).
penpen
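The same classification, written as a Python sketch for cross-checking (not part of the thread's batch code). The function name is_surrogate and its return codes mirror the :isSurrogate routine of this post; the sample character is chosen only for illustration.

```python
def is_surrogate(unit):
    """Return 3 for a high surrogate, 1 for a low surrogate, else 0."""
    if 0xD800 <= unit <= 0xDBFF:
        return 3  # high (leading) surrogate code unit
    if 0xDC00 <= unit <= 0xDFFF:
        return 1  # low (trailing) surrogate code unit
    return 0      # ordinary BMP code unit

# Code units of U+1D11E, which needs a surrogate pair in UTF-16.
data = "\U0001d11e".encode("utf-16-be")
units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

for unit in units + [0x00A1, 0xDF12, 0xFFE2]:
    print(f"0x{unit:04X} -> {is_surrogate(unit)}")
```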
Re: Using many "tokens=..." in FOR /F command in a simple way
penpen wrote:You could detect surrogate pairs
I don't know if you read the source code of my CONVERTCP utility. There I have to detect surrogates as well in order to make sure the whole pair is in a chunk of data read.
penpen wrote:(Maybe it was also too late yesterday... same holds for now... so gn8 .)
Don't worry. That happens to me every day. I'm a night owl. You don't want to see me getting up in the morning
May you have a look at this code. I think a text comparison with DC for the high byte should be sufficient.
Steffen
Code: Select all
@echo off &setlocal enableExtensions disableDelayedExpansion
:: 0x20 ... 0xFF
>"dummy.txt" (
echo(ICEiIyQlJicoKSorLC0uLzAxMjM0NTY3ODk6Ozw9Pj9AQUJDREVGR0hJSktMTU5PUFFSU1RVVld
echo(YWVpbXF1eX2BhYmNkZWZnaGlqa2xtbm9wcXJzdHV2d3h5ent8fX5/gIGCg4SFhoeIiYqLjI2Oj5
echo(CRkpOUlZaXmJmam5ydnp+goaKjpKWmp6ipqqusra6vsLGys7S1tre4ubq7vL2+v8DBwsPExcbHy
echo(MnKy8zNzs/Q0dLT1NXW19jZ2tvc3d7f4OHi4+Tl5ufo6err7O3u7/Dx8vP09fb3+Pn6+/z9/v8=
)
>nul certutil -f -decode "dummy.txt" "table.dat"
:: 896 zero bytes
>"dummy.txt" (
for /l %%i in (1 1 18) do echo(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
echo(AAAAAAA=
)
>nul certutil -f -decode "dummy.txt" "zero.dat"
set "table="
<"table.dat" set /P "table="
>"dummy.txt" cmd /d /q /u /c "type "table.dat""
:: number of double bytes
for %%b in ("dummy.txt") do set /a "i=%%~zb>>1"
setlocal enableDelayedExpansion
>"dump.txt" (
for /f "skip=1 tokens=1,2 delims=: " %%i in ('fc /b "dummy.txt" "zero.dat"^|findstr /vbi "FC:"') do (
set /a "Y=0x%%i"
for /l %%k in (!X! 1 !Y!) do echo 00
set /a "X=Y+2"
echo %%j
)
echo 00
)
set /a "n=1032"
<"dump.txt" (
for /l %%i in (1 1 %i%) do (
set /p "low=" &set /p "high="
if !high! geq DC ( REM second double byte of a surrogate pair
for %%j in (!n!) do set "cp_%%j=!cp_%%j:~,4!!high!!low!"
) else ( REM BMP or first double byte of a surrogate pair
set "cp_!n!=!high!!low!0000"
set /a "idx=n-1032"
for %%j in (!idx!) do set "char_!n!=!table:~%%j,1!"
set /a "n+=1"
)
)
)
:: Result:
:: =======
set "cp_" | sort /+8
:: creating order (with "holes")
:: 32 spaces
set "order= "
for /f "tokens=2 delims=_=" %%a in ('set "cp_" ^| sort /+8') do (
set "order=!order!!char_%%~a!"
)
set order
del "dummy.txt" "table.dat" "zero.dat" "dump.txt"
pause
Re: Using many "tokens=..." in FOR /F command in a simple way
aGerman wrote:I don't know if you read the source code of my CONVERTCP utility. There I have to detect surrogates as well in order to make sure the whole pair is in a chunk of data read.
No, I didn't up to now.
aGerman wrote:May you have a look at this code. I think a text comparison with DC for the high byte should be sufficient.
No, definitely not; according to the Java documentation UTF-16 is a little bit "ugly" there.
Code: Select all
1.no surrogate in [0x0000 : 0xD7FF]
high surrogate in [0xD800 : 0xDBFF]
low surrogate in [0xDC00 : 0xDFFF]
2.no surrogate in [0xE000 : 0xFFFF]
Your assignment in code may fail on such code units (example: "FULLWIDTH NOT SIGN" U+FFE2).
It is also true that "non surrogates < surrogate pairs", so you should first list and sort all non-surrogates, and then all surrogate pairs (if the order depends on UTF-16; I don't know exactly how UCS-2 sorts surrogate pairs).
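The failure case can be shown in a few lines of Python (a sketch, not part of the thread's batch code). It compares a plain text comparison against "DC" with a check of the full range DC to DF for the high byte. Both function names are made up for this example, and the text comparison is assumed to behave like cmd's string comparison for two-digit upper-case hex values.

```python
def naive_is_low_surrogate(high_byte_hex):
    # Text comparison only: everything from "DC" upwards is accepted.
    return high_byte_hex >= "DC"

def is_low_surrogate(high_byte_hex):
    # Numeric range check: low surrogates have a high byte of DC..DF.
    return 0xDC <= int(high_byte_hex, 16) <= 0xDF

for high in ("00", "D8", "DC", "DF", "E0", "FF"):
    print(high, naive_is_low_surrogate(high), is_low_surrogate(high))
```

The two checks disagree for E0 to FF, which is the second "no surrogate" range of the table above and includes the high byte of U+FFE2.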
aGerman wrote:Code: Select all
>"dump.txt" (
for /f "skip=1 tokens=1,2 delims=: " %%i in ('fc /b "dummy.txt" "zero.dat"^|findstr /vbi "FC:"') do (
set /a "Y=0x%%i"
for /l %%k in (!X! 1 !Y!) do echo 00
set /a "X=Y+2"
echo %%j
)
echo 00
)
This part may be risky if the first hex value does not equal "00"; also the last "00" may be unneeded.
So my suggestion is something like this (hopefully I haven't messed anything up):
Code: Select all
@echo off
setlocal enableExtensions disableDelayedExpansion
:: 0x20 ... 0xFF
>"dummy.txt" (
echo(ICEiIyQlJicoKSorLC0uLzAxMjM0NTY3ODk6Ozw9Pj9AQUJDREVGR0hJSktMTU5PUFFSU1RVVld
echo(YWVpbXF1eX2BhYmNkZWZnaGlqa2xtbm9wcXJzdHV2d3h5ent8fX5/gIGCg4SFhoeIiYqLjI2Oj5
echo(CRkpOUlZaXmJmam5ydnp+goaKjpKWmp6ipqqusra6vsLGys7S1tre4ubq7vL2+v8DBwsPExcbHy
echo(MnKy8zNzs/Q0dLT1NXW19jZ2tvc3d7f4OHi4+Tl5ufo6err7O3u7/Dx8vP09fb3+Pn6+/z9/v8=
)
>nul certutil -f -decode "dummy.txt" "table.dat"
:: 896 zero bytes
>"dummy.txt" (
for /l %%i in (1 1 18) do echo(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
echo(AAAAAAA=
)
>nul certutil -f -decode "dummy.txt" "zero.dat"
set "table="
<"table.dat" set /P "table="
>"dummy.txt" cmd /d /q /u /c "type "table.dat""
:: number of double bytes
for %%b in ("dummy.txt") do set /a "i=%%~zb>>1"
setlocal enableDelayedExpansion
cls
>"dump.txt" (
set "X=1"
for /f "skip=1 tokens=1,2 delims=: " %%i in ('fc /b "dummy.txt" "zero.dat"^|findstr /vbi "FC:"') do (
set /a "Y=0x%%i"
for /l %%k in (!X! 1 !Y!) do echo 00
set /a "X=Y+2"
echo %%j
)
set /A "Y=i<<1"
for /l %%k in (!X! 1 !Y!) do echo 00
)
set /a "n=1032"
<"dump.txt" (
for /l %%i in (1 1 %i%) do (
set /p "low=" &set /p "high="
if 0xDC leq 0x!high! (
if 0x!high! leq 0xDF ( set "isLowSurogate=1"
) else set "isLowSurogate="
) else set "isLowSurogate="
if defined isLowSurogate ( REM second double byte of a surrogate pair
for %%j in (!n!) do set "cp_%%j=!cp_%%j!!high!!low!"
) else ( REM BMP or first double byte of a surrogate pair
set "cp_!n!=!high!!low!"
set /a "idx=n-1032"
for %%j in (!idx!) do set "char_!n!=!table:~%%j,1!"
set /a "n+=1"
)
)
)
:: Result:
:: =======
:: Basic Multilingual Plane characters; single code units:
set "cp_" | 2>nul findstr /V "\=........" | sort /+8
:: Supplementary characters; surrogate pairs:
set "cp_" | 2>nul findstr "\=........" | sort /+8
:: creating order (with "holes")
:: 32 spaces
set "order= "
for /f "tokens=2 delims=_=" %%a in ('^(set "cp_" ^| findstr /V "\=........" ^| sort /+8^)^&^(set "cp_" ^| 2^>nul findstr "\=........" ^| sort /+8^)') do (
set "order=!order!!char_%%~a!"
)
set order
rem del "dummy.txt" "table.dat" "zero.dat" "dump.txt"
pause
If you want to do that for any other codepage, too, then we need to find out how to list all code units in a codepage.
(Sad to say, I only have a rough idea, for old DOS codepages, of how one could get such information.)
penpen