Synthesizing Unicode strings in Windows 7 batch
Posted: 13 Apr 2014 23:31
With the UTF-8 codepage 65001 now usable from batch files in Windows 7, it has become possible to build arbitrary Unicode strings in batch code from the hex representation of their UTF-8 encoding. This did not work in XP (and, as far as I can tell, could not possibly have), so it's an interesting new ability with some potential. For example, one could remove dependencies on the active codepage for extended ASCII, display multi-language text at the same prompt, use the full range of box-drawing characters and symbols, and so on.
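As a cross-check outside of batch, the idea of writing a string as the hex bytes of its UTF-8 encoding can be sketched in a few lines of Python. This is only an illustration and not part of $chrU; the helper name `decode_utf8_hex` is made up for the example.

```python
def decode_utf8_hex(hex_bytes: str) -> str:
    """Decode space-separated hex bytes (e.g. 'CE B1') as UTF-8 text."""
    # bytes.fromhex() ignores the spaces between the byte values
    return bytes.fromhex(hex_bytes).decode("utf-8")


if __name__ == "__main__":
    # a quoted Greek alpha (u+03B1), written as its UTF-8 bytes
    print(decode_utf8_hex("22 CE B1 22"))
```

Any hex sequence that is not valid UTF-8 raises a `UnicodeDecodeError` here, which makes this a quick way to validate a sequence before using it in a batch file.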
Below is the $chrU.cmd that does the actual conversion and assignment. First, though, the prerequisites are (a) Windows 7 - this will not work in XP, (b) a cmd prompt set to Lucida Console or another Unicode/TrueType (not raster) font, and (c) my auxiliary $cpChars.cmd batch, which can be downloaded from https://db.tt/laqBZ7Dv. That is a shortened Dropbox link, and the batch just generates a 256-character variable where offset 0 is not used and offsets 1-255 hold the character with the respective code in the active codepage (the file is provided as a link to a .zip since it contains some control chars and extended ASCII that make it difficult to copy/paste directly). EDIT: Updated the $chrU code to fix the return of ^! to a disableDelayedExpansion context.
Code:
:: $chrU [out,ref] str = [in,val] hex#1 .. hex#N ______________ 14.04.17 __
::
:: decodes sequence of 'hex#' utf-8 bytes into variable 'str'
::
:: e.g. $chrU str = 22 CE B1 22 -- sets 'str' to a quoted greek alpha u+03B1
::
:: rem requires win7 or later, won't work under xp or earlier
:: rem control chars 0x00-1F not supported
:: rem undefined behavior if the input is not valid utf-8
:: rem the '=' equal sign is only used as a delimiter, any of '= ,;' work too
:: ____________________________________________________________________________
@echo off & if "%~1"=="/?" (goto :help) else if "%~1"=="" (goto :errs)
setlocal & if "!"=="" (set "isEdx=1") else (set "isEdx=")
:: save original codepage ('.' for some localized windows e.g. german)
for /f "tokens=2 delims=:." %%a in ('chcp') do set/a "cp=%%~a"
:: save controls + ascii + extended chars to 'chrA' table
call $cpChars chrA
:setU [out,ref] str [in,val] hex#1 .. hex#N --------------------------------
setlocal enableDelayedExpansion
set "u=" & for /f "tokens=1,* delims== " %%U in ("%*") do for %%W in (%%V) do (
set/a "x=0x%%W" & for %%X in (!x!) do set "u=!u!!chrA:~%%X,1!")
@rem escape '%^<>|&!' before final for/f-echo-set conversion
if defined isEdx (set "u=!u:^=^^^^^^^^!") else (set "u=!u:^=^^^^!")
set "u=!u:%%=%%chrA:~37,1%%!"
for %%Q in ("<" ">" "|" "&") do set "u=!u:%%~Q=^^%%~Q!"
set ^"u=!u:"=""!^"
if defined isEdx (set "u=%u:!=^^^^^!%"!) else (set "u=%u:!=^^^!%"!)
set ^"u=!u:""=^^"!^"
@rem the 2nd line below must stay on one line, and 'cp' must be the 'chrA' codepage [1]
chcp 65001 >nul
chcp %cp% >nul & for /f delims^=^ eol^= %%V in ('echo(!u!') do (
endlocal & endlocal & set "%~1=%%V"!)
exit /b 0
:help .........................................................................
echo(
@rem dump :: comment lines at the top of the file, skip ::: lines
for /f "usebackq delims=" %%a in ("%~f0") do (
set "z=%%~a" & setlocal enableDelayedExpansion
if not "!z:~0,1!"==":" endlocal & exit /b 0
(if not "!z:~2,1!"==":" echo !z!) & endlocal
)
exit /b 0
:errs
call :help
>&2 ((echo() & (echo ** error: bad syntax '%~0 %*'))
exit /b 1 .....................................................................
:: [1] '!u!' expands to a double-byte sequence equivalent to the utf-8 encoding
:: then 'echo' narrows it down to single-byte per inner 'cp' codepage (a)
:: then '%%V' expands it back to double-byte per outer codepage 65001 (b)
:: effectively decoding the utf-8 byte sequence into the unicode string (*)
:: (a) the 'in' command of a 'for/f' loop runs in the codepage which is active
:: at the time the nested 'cmd' executes the command - in this case 'cp'
:: (b) the 'for/f' evaluates the loop variables according to the original
:: codepage in effect at the time the loop was parsed - in this case 65001
:: (*) works in win7, but not xp - because chcp 65001 stops batch parsing in xp
:: ____________________________________________________________________________
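Footnote [1] can be modelled outside of batch. The following Python sketch is one reading of it, not a description of cmd internals: it assumes codepage 437 as the original 'cp' and uses Python's `cp437` codec in place of the 'chrA' table, and the function names are invented for the example. Each UTF-8 byte is first turned into the character with that code in 'cp', the nested 'echo' then narrows those characters back to the same bytes, and reading the bytes under codepage 65001 yields the intended text.

```python
def to_codepage_chars(hex_bytes: str, cp: str = "cp437") -> str:
    """Table lookup step: each byte becomes the char with that code in 'cp'."""
    return bytes.fromhex(hex_bytes).decode(cp)


def model_chru(hex_bytes: str, cp: str = "cp437") -> str:
    """Model of the round trip: cp chars -> narrow bytes -> UTF-8 text."""
    wide = to_codepage_chars(hex_bytes, cp)
    # nested 'echo' running under 'cp': characters narrow to single bytes
    narrow = wide.encode(cp)
    # loop variable evaluated under codepage 65001: bytes are read as UTF-8
    return narrow.decode("utf-8")


if __name__ == "__main__":
    print(to_codepage_chars("CE B1"))   # what the string looks like under cp 437
    print(model_chru("CE B1"))          # the Greek alpha it ends up as
```

The sketch only works for codecs that define all 256 byte values, which `cp437` does; some Python codecs (for example `cp1252`) leave a few bytes undefined and would raise an error in the first step.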
And this is a test batch file using $chrU, followed by its output as copied from a win7x64.sp1 cmd prompt using the Lucida Console font.
Code:
@echo off & setlocal enableDelayedExpansion
call $chrU u = E2 80 B9 CE B1 C3 9F C2 A9 E2 88 82 E2 82 AC E2 80 BA
echo blend !u!
call $chrU u = C3 A0 C3 A1 C3 A2 C4 81 C4 83 C4 85 C7 BB
echo latin !u!
call $chrU u = CE B1 CE B2 CE B3 CE B4 CE B5 CE B6 CE B7
echo greek !u!
call $chrU u = D0 B0 D0 B1 D0 B2 D0 B3 D0 B4 D0 B5 D0 B6
echo cyrillic !u!
call $chrU u = E2 86 90 E2 86 91 E2 86 92 E2 86 93 E2 86 94 E2 86 95 E2 86 A8
echo arrows !u!
call $chrU u = E2 96 8C E2 97 84 E2 96 B2 E2 97 8B E2 96 BC E2 96 BA E2 96 90
echo drawing !u!
call $chrU u = C2 A2 C2 A3 C2 A4 C2 A5 E2 82 A3 E2 82 A4 E2 82 AC
echo currency !u!
call $chrU u = C2 B1 C3 97 E2 88 82 E2 88 86 E2 88 8F E2 88 91 E2 88 92
echo math !u!
call $chrU u = C2 AB C2 A1 C2 BF C2 A9 C2 AE C2 A7 E2 80 A0
echo punct !u!
call $chrU u = C2 BC E2 85 9B C2 B9 E2 99 A0 E2 99 A3 E2 99 A5 E2 99 A6
echo misc !u!
call $chrU u = 3B 25 63 64 25 28 21 63 64 21 5E 22 5E 5E 22 22 21 3C 3F 26 5E
echo ascii !u!
endlocal & goto :eof
Code:
C:\tmp>$chrU.test
blend ‹αß©∂€›
latin àáâāăąǻ
greek αβγδεζη
cyrillic абвгдеж
arrows ←↑→↓↔↕↨
drawing ▌◄▲○▼►▐
currency ¢£¤¥₣₤€
math ±×∂∆∏∑−
punct «¡¿©®§†
misc ¼⅛¹♠♣♥♦
ascii ;%cd%(!cd!^"^^""!<?&^
Some closing notes:
- the two key tricks that make it work are the new Windows 7 support for parsing batch code under codepage 65001, and the codepage handling around for/f loops - which hasn't changed since XP, but could not be put to good UTF-8 use until now;
- anyone not fond of my helper batch $cpChars.cmd can use any other code to generate the same "character map" of the active codepage, instead, and there have been several ways to do it posted on dostips before;
- $chrU doesn't use temp files, and needs only one for/f-command loop running the internal 'echo';
- the code is not particularly optimized, and I tried to keep it reasonably clean - only concession being the extra lines dealing with the traditional '^!' problem characters.
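For reference, the layout of the "character map" mentioned in these notes (offset 0 unused, offsets 1-255 holding the character with that code) can be pictured with a short Python sketch. It assumes codepage 437 and Python's `cp437` codec, which may not match what a cmd prompt displays for the control codes below 0x20, and `codepage_chars` is a made-up name; it shows the shape of the table rather than replacing $cpChars.

```python
def codepage_chars(cp: str = "cp437") -> str:
    """Return a 256-char table where index N holds the char with code N."""
    # offset 0 is not used, so a space stands in as a placeholder
    return " " + bytes(range(1, 256)).decode(cp)


if __name__ == "__main__":
    chr_a = codepage_chars()
    x = int("CE", 16)        # same conversion as batch: set/a "x=0xCE"
    print(len(chr_a), x, chr_a[x])
    print(chr_a[37])         # offset 37 is the percent sign that $chrU escapes
```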
Liviu
P.S. As to the question of where to get the UTF-8 encoding of a given string... If the string comes from a UTF-8 encoded text file, then viewing the file in a hex viewer will show the bytes. If it comes from another document, or a web page, copying/pasting to any number of online tools (such as http://rishida.net/tools/conversion/) will show the corresponding UTF-8.
Or, save the following as $ascU.cmd (which FWIW works under XP too, not just Windows 7). EDIT #2: Updated $ascU, $ascX code below to call '%comspec%' instead of hardcoded 'cmd', plus minor/cosmetic changes.
Code:
:: $ascU [in,ref] str /U [out,ref,opt] utf-8-bytes-hex ________ 14.05.23 __
:: /W [out,ref,opt] utf16-words-hex
:: /A [out,ref,opt] ext-asc-bytes-hex [out,ref,opt] strA
::
:: returns encoding of 'str' as utf-8, utf16, or 8-bit active codepage
:: and optionally for '/A' the translation of 'str' in the given codepage
::
:: rem '/U' is assumed by default if no '/' specified
:: rem control chars not supported in the input string
:: rem '/A' with 'strA' must be called from disableDelayedExpansion context
:: in order for '^!' to be returned correctly in 'strA'
::
:: 14.05.23 replaced 'cmd' with '%comSpec%' in nested calls
:: 14.04.19 checked ok under xp.sp3, win7x64.sp1
:: ____________________________________________________________________________
@echo off & if "%~1"=="/?" (goto :help) else if "%~1"=="" (goto :errs)
setlocal enableDelayedExpansion
set "str=!%~1!" & shift
set "enc=U" & for %%E in (U W A) do if /i "%~1"=="/%%E" set "enc=%%E" & shift
set "hex=%~1" & set "asc=%~2"
:: quick exit for empty string
if not defined str (
if not defined hex (echo() else set "%hex%=" & if defined asc set "%asc%="
endlocal & exit /b 0
)
:: save original codepage ('.' for some localized windows e.g. german)
for /f "tokens=2 delims=:." %%a in ('chcp') do set/a "cp=%%~a"
:: define BS backspace
for /f %%b in ('"prompt $H & for %%b in (1) do rem"') do set "BS=%%b"
set "f0=%temp%\%time::=%%random%.tmp" & set "f1=%temp%\%time::=%%random%.tmp"
set ^"echoA="%comSpec%" /a/v/c echo^" & set ^"echoW="%comSpec%" /u/v/c echo^"
set "hX=" & if defined asc (set "aX=" & call :asc%enc% str hX aX
) else (call :asc%enc% str hX)
2>nul del "%f1%" "%f0%"
if not defined hex (endlocal & echo %hX% & exit /b 0)
if not defined asc (endlocal & set "%hex%=%hX%" & exit /b 0)
for /f delims^=^ eol^= %%X in ("!aX!") do (
endlocal & set "%hex%=%hX%" & set "%asc%=%%X")
exit /b 0
:ascU [in,ref] str [out,ref] utf-8-bytes-hex ...............................
chcp 65001 >nul & (>"%f1%" %echoA%(^^!%1^^!) & chcp %cp% >nul
call :hexA %2 & goto :eof
:ascW [in,ref] str [out,ref] utf16-words-hex ...............................
(>"%f1%" %echoW%(^^!%1^^!) & call :hexW %2 & goto :eof
:ascA [in,ref] str [out,ref] ext-asc-bytes-hex [out,ref,opt] ext-asc-str ..
(>"%f1%" %echoA%(^^!%1^^!) & call :hexA %2 & if "%3"=="" goto :eof
@rem escape '%^<>|&!' before final for/f-echo-set conversion
set "u=!%1!" & set "PCT=%%"
for %%Q in ("%%=%%PCT%%" "^=^^^^^^^^") do set "u=!u:%%~Q!"
for %%Q in ("<" ">" "|" "&") do set "u=!u:%%~Q=^^%%~Q!"
set ^"u=!u:"=""!^"
set "u=%u:!=^^^^^!%"!
set ^"u=!u:""=^^"!^"
for /f delims^=^ eol^= %%V in ('echo(!u!') do set "%3=%%V"!
goto :eof
:hexA [out,ref] hex-bytes -- 'f1' = narrow string + narrow <cr><lf> .......
set "%1=" & for %%X in ("%f1%") do set/a len=%%~zX-2
set "z=" & for /l %%N in (1 1 !len!) do set "z=!z!!BS!"
>"%f0%" %echoA%(!z!
for /f "skip=1 tokens=2 delims=: " %%U in ('fc /b "%f1%" "%f0%"') do (
(if defined %1 set "%1=!%1! ") & set "%1=!%1!%%U")
goto :eof
:hexW [out,ref] hex-bytes -- 'f1' = wide string + wide <cr><lf> ...........
set "%1=" & set "u=" & for %%X in ("%f1%") do set/a len=%%~zX-4
set "z=" & for /l %%N in (1 1 !len!) do set "z=!z!!BS!"
>"%f0%" "%comSpec%" /a/c ^<nul set/p "=!z!" & >>"%f0%" "%comSpec%" /u/c echo(
for /f "skip=1 tokens=2 delims=: " %%U in ('fc /b "%f1%" "%f0%"') do (
if not defined u (set "u=%%U") else (
(if defined %1 set "%1=!%1! ") & set "%1=!%1!%%U!u!" & set "u="))
goto :eof
:help .........................................................................
echo(
@rem dump :: comment lines at the top of the file, skip ::: lines
for /f "usebackq delims=" %%a in ("%~f0") do (
set "z=%%~a" & setlocal enableDelayedExpansion
if not "!z:~0,1!"==":" endlocal & goto :eof
(if not "!z:~2,1!"==":" echo !z!) & endlocal
)
exit /b 0
:errs
call :help
>&2 ((echo() & (echo ** error: bad syntax '%~0 %*'))
exit /b 1 .....................................................................
And this as $ascX.cmd.
Code:
:: $ascX [in,ref] var [in,val,opt] cp#1 .. cp#N ______________ 14.04.19 __
:: $ascX [in,val] "str" [in,val,opt] cp#1 .. cp#N
::
:: displays 'string' encoding as utf16, utf-8, and 8-bit codepage(s)
:: with 'string' either passed by reference in 'var'
:: or passed by value as '"str"' inside quotes
::
:: rem 'cp' lines show '==' if string converts losslessly, '->' otherwise
:: rem control chars not supported in the input string
:: rem active codepage is included by default, and displayed first
::
:: 14.04.19 checked ok under xp.sp3, win7x64.sp1
:: ____________________________________________________________________________
@echo off & if "%~1"=="/?" (goto :help) else if "%~1"=="" (goto :errs)
setlocal disableDelayedExpansion
:: save original codepage ('.' for some localized windows e.g. german)
for /f "tokens=2 delims=:." %%a in ('chcp') do set/a "cp=%%~a"
:: first argument is either quoted string value, or name of string variable
(if '%1'=='"%~1"' (set "sZ=%~1" & set "sX=sZ") else (set "sX=%~1")) & shift
set "xW=" & call $ascU %sX% /W xW & set "xU=" & call $ascU %sX% /U xU
set "cx= %~1 %~2 %~3 %~4 %~5 %~6 %~7 %~8 %~9 "
setlocal enableDelayedExpansion
echo '!%sX%!' = [u+] !xW! = [utf-8] !xU!
set "cx=!cx: %cp% = !"
endlocal & set "cx=%cx%"
for %%p in (%cp% %cx%) do (
chcp %%p >nul
set "xA=" & set "sA=" & call $ascU %sX% /A xA sA
setlocal enableDelayedExpansion
set "p=cp %%~p " & set "p=!p:~0,8!"
if "!sA!"=="!%sX%!" (echo !p! == '!sA!' = !xA!) else (
set "xW=" & call $ascU sA /W xW
echo !p! -^> '!sA!' = !xA! = [u+] !xW!
)
endlocal
)
chcp %cp% >nul & endlocal & exit /b 0
:help .........................................................................
echo(
@rem dump :: comment lines at the top of the file, skip ::: lines
for /f "usebackq delims=" %%a in ("%~f0") do (
set "z=%%~a" & setlocal enableDelayedExpansion
if not "!z:~0,1!"==":" endlocal & goto :eof
(if not "!z:~2,1!"==":" echo !z!) & endlocal
)
exit /b 0
:errs
call :help
>&2 ((echo() & (echo ** error: bad syntax '%~0 %*'))
exit /b 1 _____________________________________________________________________
Then enter or paste any string to see its encoding in UTF-16, UTF-8 and the given codepage(s) - note that '==' marks a codepage the string converts to losslessly, and '->' one where it does not.
Code:
C:\tmp>$ascX ‹.ß©.€› 850 437 1252 28591
'‹.ß©.€›' = [u+] 2039 002E 00DF 00A9 002E 20AC 203A = [utf-8] E2 80 B9 2E C3 9F C2 A9 2E E2 82 AC E2 80 BA
cp 437 -> '<.ßc.?>' = 3C 2E E1 63 2E 3F 3E = [u+] 003C 002E 00DF 0063 002E 003F 003E
cp 850 -> '<.ß©.?>' = 3C 2E E1 B8 2E 3F 3E = [u+] 003C 002E 00DF 00A9 002E 003F 003E
cp 1252 == '‹.ß©.€›' = 8B 2E DF A9 2E 80 9B
cp 28591 -> '<.ß©.?>' = 3C 2E DF A9 2E 3F 3E = [u+] 003C 002E 00DF 00A9 002E 003F 003E
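The figures above can be cross-checked outside of batch. This is a rough Python sketch rather than a port of $ascX: the function names are invented for the example, codepage 28591 is assumed to correspond to Python's `iso-8859-1` codec, and a string counts as lossless in a codepage only if every character encodes exactly. It does not reproduce the best-fit substitutions (such as '<' for '‹') shown on the '->' lines.

```python
def hex_utf8(s: str) -> str:
    """Hex bytes of the UTF-8 encoding, e.g. 'E2 82 AC'."""
    return " ".join(f"{b:02X}" for b in s.encode("utf-8"))


def hex_utf16(s: str) -> str:
    """Hex 16-bit words of the UTF-16 encoding, e.g. '20AC'."""
    data = s.encode("utf-16-be")
    return " ".join(data[i:i + 2].hex().upper() for i in range(0, len(data), 2))


def hex_codepage(s: str, codec: str):
    """Hex bytes of 's' in 'codec', or None if the conversion is not lossless."""
    try:
        return " ".join(f"{b:02X}" for b in s.encode(codec))
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    s = "\u2039.\u00df\u00a9.\u20ac\u203a"      # the sample string used above
    print("[u+]", hex_utf16(s))
    print("[utf-8]", hex_utf8(s))
    codecs = (("437", "cp437"), ("850", "cp850"),
              ("1252", "cp1252"), ("28591", "iso-8859-1"))
    for cp, codec in codecs:
        h = hex_codepage(s, codec)
        if h is not None:
            print("cp", cp, "==", h)
        else:
            print("cp", cp, "-> (not lossless)")
```

The UTF-8 line printed by this sketch is in the same space-separated form that $chrU takes as input.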