Below is the $chrU.cmd that does the actual conversion and assignment. First, though, the prerequisites: (a) Windows 7 or later (this will not work in XP), (b) a cmd prompt set to Lucida Console or another Unicode/TrueType (not raster) font, and (c) my auxiliary $cpChars.cmd batch, which can be downloaded from https://db.tt/laqBZ7Dv. That is a shortened Dropbox link, and the batch just generates a 256-character variable where offset 0 is not used and offsets 1-255 hold the character with the respective code in the active codepage (the file is provided as a link to a .zip, since it contains some control chars and extended ASCII that make it difficult to copy/paste directly). EDIT: Updated $chrU code to fix '^!' return to disableDelayedExpansion context.
:: $chrU [out,ref] str = [in,val] hex#1 .. hex#N ______________ 14.04.17 __
::
:: decodes sequence of 'hex#' utf-8 bytes into variable 'str'
::
:: e.g. $chrU str = 22 CE B1 22 -- sets 'str' to a quoted greek alpha u+03B1
::
:: rem requires win7 or later, won't work under xp or earlier
:: rem control chars 0x00-1F not supported
:: rem undefined behavior if the input is not valid utf-8
:: rem the '=' equal sign is only used as a delimiter, any of '= ,;' work too
:: ____________________________________________________________________________
@echo off & if "%~1"=="/?" (goto :help) else if "%~1"=="" (goto :errs)
setlocal & if "!"=="" (set "isEdx=1") else (set "isEdx=")
:: save original codepage ('.' for some localized windows e.g. german)
for /f "tokens=2 delims=:." %%a in ('chcp') do set/a "cp=%%~a"
:: save controls + ascii + extended chars to 'chrA' table
call $cpChars chrA
:setU [out,ref] str [in,val] hex#1 .. hex#N --------------------------------
setlocal enableDelayedExpansion
set "u=" & for /f "tokens=1,* delims== " %%U in ("%*") do for %%W in (%%V) do (
set/a "x=0x%%W" & for %%X in (!x!) do set "u=!u!!chrA:~%%X,1!")
@rem escape '%^<>|&!' before final for/f-echo-set conversion
if defined isEdx (set "u=!u:^=^^^^^^^^!") else (set "u=!u:^=^^^^!")
set "u=!u:%%=%%chrA:~37,1%%!"
for %%Q in ("<" ">" "|" "&") do set "u=!u:%%~Q=^^%%~Q!"
set ^"u=!u:"=""!^"
if defined isEdx (set "u=%u:!=^^^^^!%"!) else (set "u=%u:!=^^^!%"!)
set ^"u=!u:""=^^"!^"
@rem the line after next must stay on one line, and 'cp' must be the 'chrA' codepage [1]
chcp 65001 >nul
chcp %cp% >nul & for /f delims^=^ eol^= %%V in ('echo(!u!') do (
endlocal & endlocal & set "%~1=%%V"!)
exit /b 0
:help .........................................................................
echo(
@rem dump :: comment lines at the top of the file, skip ::: lines
for /f "usebackq delims=" %%a in ("%~f0") do (
set "z=%%~a" & setlocal enableDelayedExpansion
if not "!z:~0,1!"==":" endlocal & exit /b 0
(if not "!z:~2,1!"==":" echo !z!) & endlocal
)
exit /b 0
:errs
call :help
>&2 ((echo() & (echo ** error: bad syntax '%~0 %*'))
exit /b 1 .....................................................................
:: [1] '!u!' expands to a double-byte sequence equivalent to the utf-8 encoding
::     then 'echo' narrows it down to single-byte per inner 'cp' codepage (a)
::     then '%%V' expands it back to double-byte per outer codepage 65001 (b)
::     effectively decoding the utf-8 byte sequence into the unicode string (*)
:: (a) the 'in' command of a 'for/f' loop runs in the codepage which is active
::     at the time the nested 'cmd' executes the command - in this case 'cp'
:: (b) the 'for/f' evaluates the loop variables according to the original
::     codepage in effect at the time the loop was parsed - in this case 65001
:: (*) works in win7, but not xp - because chcp 65001 stops batch parsing in xp
:: ____________________________________________________________________________
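For anyone who wants to see the round-trip from footnote [1] outside of cmd, here is a minimal Python sketch of the same idea. It is an illustration only, not part of $chrU: the function name `decode_via_codepage` and the choice of cp437 as the 'chrA' codepage are assumptions, and Python's codec tables merely stand in for the conversions that the console does.

```python
def decode_via_codepage(hex_bytes: str, cp: str = "cp437") -> str:
    """Mimic the $chrU round-trip: hex -> chars in 'cp' -> bytes -> UTF-8 text."""
    raw = bytes.fromhex(hex_bytes)      # fromhex skips the spaces between pairs
    # Step 1: look each byte up in the codepage table, like the 'chrA' lookup.
    wide = raw.decode(cp)
    # Step 2: 'echo' under the inner codepage narrows each char back to one byte.
    narrow = wide.encode(cp)
    # Step 3: the loop variable is read back under codepage 65001, i.e. as UTF-8.
    return narrow.decode("utf-8")

if __name__ == "__main__":
    # Quoted greek alpha, as in the $chrU header example; prints True.
    print(decode_via_codepage("22 CE B1 22") == '"\u03b1"')
```

The sketch only works because the chosen codepage maps all 256 byte values to distinct characters, which is the same reason 'cp' has to match the codepage the 'chrA' table was built under.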
Here is a test batch file that uses $chrU, followed by its output as copied from a win7x64.sp1 cmd prompt using the Lucida Console font.
@echo off & setlocal enableDelayedExpansion
call $chrU u = E2 80 B9 CE B1 C3 9F C2 A9 E2 88 82 E2 82 AC E2 80 BA
echo blend !u!
call $chrU u = C3 A0 C3 A1 C3 A2 C4 81 C4 83 C4 85 C7 BB
echo latin !u!
call $chrU u = CE B1 CE B2 CE B3 CE B4 CE B5 CE B6 CE B7
echo greek !u!
call $chrU u = D0 B0 D0 B1 D0 B2 D0 B3 D0 B4 D0 B5 D0 B6
echo cyrillic !u!
call $chrU u = E2 86 90 E2 86 91 E2 86 92 E2 86 93 E2 86 94 E2 86 95 E2 86 A8
echo arrows !u!
call $chrU u = E2 96 8C E2 97 84 E2 96 B2 E2 97 8B E2 96 BC E2 96 BA E2 96 90
echo drawing !u!
call $chrU u = C2 A2 C2 A3 C2 A4 C2 A5 E2 82 A3 E2 82 A4 E2 82 AC
echo currency !u!
call $chrU u = C2 B1 C3 97 E2 88 82 E2 88 86 E2 88 8F E2 88 91 E2 88 92
echo math !u!
call $chrU u = C2 AB C2 A1 C2 BF C2 A9 C2 AE C2 A7 E2 80 A0
echo punct !u!
call $chrU u = C2 BC E2 85 9B C2 B9 E2 99 A0 E2 99 A3 E2 99 A5 E2 99 A6
echo misc !u!
call $chrU u = 3B 25 63 64 25 28 21 63 64 21 5E 22 5E 5E 22 22 21 3C 3F 26 5E
echo ascii !u!
endlocal & goto :eof
Output:
C:\tmp>$chrU.test
blend ‹αß©∂€›
latin àáâāăąǻ
greek αβγδεζη
cyrillic абвгдеж
arrows ←↑→↓↔↕↨
drawing ▌◄▲○▼►▐
currency ¢£¤¥₣₤€
math ±×∂∆∏∑−
punct «¡¿©®§†
misc ¼⅛¹♠♣♥♦
ascii ;%cd%(!cd!^"^^""!<?&^
Some closing notes:
- the two key tricks that make it work are the new Windows 7 support for parsing batch code under codepage 65001, and the codepage handling around for/f loops, which hasn't changed since XP but could not be put to good UTF-8 use until now;
- anyone not fond of my helper batch $cpChars.cmd can use any other code that generates the same "character map" of the active codepage instead; several ways to do it have been posted on dostips before;
- $chrU doesn't use temp files, and needs only one for/f command loop, running the internal 'echo';
- the code is not particularly optimized, but I tried to keep it reasonably clean, the only concession being the extra lines dealing with the traditional '^!' problem characters.
Liviu
P.S. As to the question of where to get the UTF-8 encoding of a given string... If the string comes from a UTF-8 encoded text file, then viewing the file in a hex viewer will show the bytes. If it comes from another document, or a web page, copying/pasting to any number of online tools (such as http://rishida.net/tools/conversion/) will show the corresponding UTF-8.
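If Python happens to be installed, a couple of lines will also do. This is just a minimal sketch and not part of the batch files here; the helper name `utf8_hex` is made up for illustration. Its output is in the same space-separated hex form that $chrU takes as input.

```python
def utf8_hex(text: str) -> str:
    """Return the UTF-8 encoding of 'text' as space-separated uppercase hex bytes."""
    return " ".join(f"{b:02X}" for b in text.encode("utf-8"))

if __name__ == "__main__":
    # Greek alpha (u+03B1) inside double quotes, as in the $chrU header example.
    print(utf8_hex('"\u03b1"'))   # 22 CE B1 22
```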
Or, save the following as $ascU.cmd (which FWIW works under XP too, not just Windows 7). EDIT #2: Updated $ascU, $ascX code below to call '%comspec%' instead of hardcoded 'cmd', plus minor/cosmetic changes.
:: $ascU [in,ref] str /U [out,ref,opt] utf-8-bytes-hex ________ 14.05.23 __
::                    /W [out,ref,opt] utf16-words-hex
::                    /A [out,ref,opt] ext-asc-bytes-hex [out,ref,opt] strA
::
:: returns encoding of 'str' as utf-8, utf16, or 8-bit active codepage
:: and optionally for '/A' the translation of 'str' in the given codepage
::
:: rem '/U' is assumed by default if no '/' specified
:: rem control chars not supported in the input string
:: rem '/A' with 'strA' must be called from disableDelayedExpansion context
:: in order for '^!' to be returned correctly in 'strA'
::
:: 14.05.23 replaced 'cmd' with '%comSpec%' in nested calls
:: 14.04.19 checked ok under xp.sp3, win7x64.sp1
:: ____________________________________________________________________________
@echo off & if "%~1"=="/?" (goto :help) else if "%~1"=="" (goto :errs)
setlocal enableDelayedExpansion
set "str=!%~1!" & shift
set "enc=U" & for %%E in (U W A) do if /i "%~1"=="/%%E" set "enc=%%E" & shift
set "hex=%~1" & set "asc=%~2"
:: quick exit for empty string
if not defined str (
if not defined hex (echo() else set "%hex%=" & if defined asc set "%asc%="
endlocal & exit /b 0
)
:: save original codepage ('.' for some localized windows e.g. german)
for /f "tokens=2 delims=:." %%a in ('chcp') do set/a "cp=%%~a"
:: define BS backspace
for /f %%b in ('"prompt $H & for %%b in (1) do rem"') do set "BS=%%b"
set "f0=%temp%\%time::=%%random%.tmp" & set "f1=%temp%\%time::=%%random%.tmp"
set ^"echoA="%comSpec%" /a/v/c echo^" & set ^"echoW="%comSpec%" /u/v/c echo^"
set "hX=" & if defined asc (set "aX=" & call :asc%enc% str hX aX
) else (call :asc%enc% str hX)
2>nul del "%f1%" "%f0%"
if not defined hex (endlocal & echo %hX% & exit /b 0)
if not defined asc (endlocal & set "%hex%=%hX%" & exit /b 0)
for /f delims^=^ eol^= %%X in ("!aX!") do (
endlocal & set "%hex%=%hX%" & set "%asc%=%%X")
exit /b 0
:ascU [in,ref] str [out,ref] utf-8-bytes-hex ...............................
chcp 65001 >nul & (>"%f1%" %echoA%(^^!%1^^!) & chcp %cp% >nul
call :hexA %2 & goto :eof
:ascW [in,ref] str [out,ref] utf16-words-hex ...............................
(>"%f1%" %echoW%(^^!%1^^!) & call :hexW %2 & goto :eof
:ascA [in,ref] str [out,ref] ext-asc-bytes-hex [out,ref,opt] ext-asc-str ..
(>"%f1%" %echoA%(^^!%1^^!) & call :hexA %2 & if "%3"=="" goto :eof
@rem escape '%^<>|&!' before final for/f-echo-set conversion
set "u=!%1!" & set "PCT=%%"
for %%Q in ("%%=%%PCT%%" "^=^^^^^^^^") do set "u=!u:%%~Q!"
for %%Q in ("<" ">" "|" "&") do set "u=!u:%%~Q=^^%%~Q!"
set ^"u=!u:"=""!^"
set "u=%u:!=^^^^^!%"!
set ^"u=!u:""=^^"!^"
for /f delims^=^ eol^= %%V in ('echo(!u!') do set "%3=%%V"!
goto :eof
:hexA [out,ref] hex-bytes -- 'f1' = narrow string + narrow <cr><lf> .......
set "%1=" & for %%X in ("%f1%") do set/a len=%%~zX-2
set "z=" & for /l %%N in (1 1 !len!) do set "z=!z!!BS!"
>"%f0%" %echoA%(!z!
for /f "skip=1 tokens=2 delims=: " %%U in ('fc /b "%f1%" "%f0%"') do (
(if defined %1 set "%1=!%1! ") & set "%1=!%1!%%U")
goto :eof
:hexW [out,ref] hex-bytes -- 'f1' = wide string + wide <cr><lf> ...........
set "%1=" & set "u=" & for %%X in ("%f1%") do set/a len=%%~zX-4
set "z=" & for /l %%N in (1 1 !len!) do set "z=!z!!BS!"
>"%f0%" "%comSpec%" /a/c ^<nul set/p "=!z!" & >>"%f0%" "%comSpec%" /u/c echo(
for /f "skip=1 tokens=2 delims=: " %%U in ('fc /b "%f1%" "%f0%"') do (
if not defined u (set "u=%%U") else (
(if defined %1 set "%1=!%1! ") & set "%1=!%1!%%U!u!" & set "u="))
goto :eof
:help .........................................................................
echo(
@rem dump :: comment lines at the top of the file, skip ::: lines
for /f "usebackq delims=" %%a in ("%~f0") do (
set "z=%%~a" & setlocal enableDelayedExpansion
if not "!z:~0,1!"==":" endlocal & goto :eof
(if not "!z:~2,1!"==":" echo !z!) & endlocal
)
exit /b 0
:errs
call :help
>&2 ((echo() & (echo ** error: bad syntax '%~0 %*'))
exit /b 1 .....................................................................
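The least obvious part of $ascU is probably how ':hexA' gets a hex dump without any external tool: it writes a second file of the same length filled with backspaces, then lets 'fc /b' list every offset where the two files differ. The Python sketch below only models that idea. The names `fc_b` and `hex_via_backspaces` are hypothetical, and `fc_b` imitates just the offset/byte pairs that 'fc /b' reports for equal-length files, not its exact output format.

```python
def fc_b(a: bytes, b: bytes):
    """Model 'fc /b' on equal-length files: yield (offset, byte_a, byte_b) for mismatches only."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            yield offset, x, y

def hex_via_backspaces(data: bytes) -> str:
    """Recover the hex of 'data' by comparing it against a filler of 0x08 bytes."""
    f1 = data + b"\r\n"                    # the string as written by 'echo'
    f0 = b"\x08" * len(data) + b"\r\n"     # same length, all backspaces
    # The trailing <cr><lf> is identical in both files, so it is never listed.
    return " ".join(f"{x:02X}" for _, x, _ in fc_b(f1, f0))

if __name__ == "__main__":
    print(hex_via_backspaces("\u00df\u00a9".encode("utf-8")))   # C3 9F C2 A9
```

In this model a 0x08 byte in the input compares equal to the filler and silently drops out of the listing, which fits with the header note that control chars are not supported in the input string.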
And save the following as $ascX.cmd, which calls $ascU to do the conversions:
:: $ascX [in,ref] var [in,val,opt] cp#1 .. cp#N ______________ 14.04.19 __
:: $ascX [in,val] "str" [in,val,opt] cp#1 .. cp#N
::
:: displays 'string' encoding as utf16, utf-8, and 8-bit codepage(s)
:: with 'string' either passed by reference in 'var'
:: or passed by value as '"str"' inside quotes
::
:: rem 'cp' lines show '==' if string converts losslessly, '->' otherwise
:: rem control chars not supported in the input string
:: rem active codepage is included by default, and displayed first
::
:: 14.04.19 checked ok under xp.sp3, win7x64.sp1
:: ____________________________________________________________________________
@echo off & if "%~1"=="/?" (goto :help) else if "%~1"=="" (goto :errs)
setlocal disableDelayedExpansion
:: save original codepage ('.' for some localized windows e.g. german)
for /f "tokens=2 delims=:." %%a in ('chcp') do set/a "cp=%%~a"
:: first argument is either quoted string value, or name of string variable
(if '%1'=='"%~1"' (set "sZ=%~1" & set "sX=sZ") else (set "sX=%~1")) & shift
set "xW=" & call $ascU %sX% /W xW & set "xU=" & call $ascU %sX% /U xU
set "cx= %~1 %~2 %~3 %~4 %~5 %~6 %~7 %~8 %~9 "
setlocal enableDelayedExpansion
echo '!%sX%!' = [u+] !xW! = [utf-8] !xU!
set "cx=!cx: %cp% = !"
endlocal & set "cx=%cx%"
for %%p in (%cp% %cx%) do (
chcp %%p >nul
set "xA=" & set "sA=" & call $ascU %sX% /A xA sA
setlocal enableDelayedExpansion
set "p=cp %%~p     " & set "p=!p:~0,8!"
if "!sA!"=="!%sX%!" (echo !p! == '!sA!' = !xA!) else (
set "xW=" & call $ascU sA /W xW
echo !p! -^> '!sA!' = !xA! = [u+] !xW!
)
endlocal
)
chcp %cp% >nul & endlocal & exit /b 0
:help .........................................................................
echo(
@rem dump :: comment lines at the top of the file, skip ::: lines
for /f "usebackq delims=" %%a in ("%~f0") do (
set "z=%%~a" & setlocal enableDelayedExpansion
if not "!z:~0,1!"==":" endlocal & goto :eof
(if not "!z:~2,1!"==":" echo !z!) & endlocal
)
exit /b 0
:errs
call :help
>&2 ((echo() & (echo ** error: bad syntax '%~0 %*'))
exit /b 1 _____________________________________________________________________
Then enter or paste any string in quotes (or pass the name of a variable that holds it) to see its encoding in UTF16, UTF-8 and codepage(s). Note that '==' indicates that the string converts losslessly to a given codepage, and '->' that it does not.
C:\tmp>$ascX "‹.ß©.€›" 850 437 1252 28591
'‹.ß©.€›' = [u+] 2039 002E 00DF 00A9 002E 20AC 203A = [utf-8] E2 80 B9 2E C3 9F C2 A9 2E E2 82 AC E2 80 BA
cp 437   -> '<.ßc.?>' = 3C 2E E1 63 2E 3F 3E = [u+] 003C 002E 00DF 0063 002E 003F 003E
cp 850   -> '<.ß©.?>' = 3C 2E E1 B8 2E 3F 3E = [u+] 003C 002E 00DF 00A9 002E 003F 003E
cp 1252  == '‹.ß©.€›' = 8B 2E DF A9 2E 80 9B
cp 28591 -> '<.ß©.?>' = 3C 2E DF A9 2E 3F 3E = [u+] 003C 002E 00DF 00A9 002E 003F 003E
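As a cross-check of the numbers above, the same kind of report can be roughly reproduced in Python. This is a sketch under stated assumptions, not a port of $ascX: the function names are made up, and Python's codecs replace unmappable characters with '?' instead of applying the Windows best-fit substitutions, so the lossy lines show 3F where the cmd output above has '<', 'c' or '>'. The UTF16, UTF-8 and lossless cp 1252 values should match exactly.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

def utf16_words(text: str) -> str:
    """UTF-16 code units as 4-digit hex words, like the [u+] column."""
    raw = text.encode("utf-16-be")
    return " ".join(raw[i:i + 2].hex().upper() for i in range(0, len(raw), 2))

def codepage_report(text: str, cp: str):
    """Return (lossless, hex) for 'text' encoded in codepage 'cp'."""
    narrow = text.encode(cp, errors="replace")   # unmappable chars become '?'
    lossless = narrow.decode(cp, errors="replace") == text
    return lossless, hex_bytes(narrow)

if __name__ == "__main__":
    s = "\u2039.\u00df\u00a9.\u20ac\u203a"       # the string from the sample run
    print("[u+]", utf16_words(s))
    print("[utf-8]", hex_bytes(s.encode("utf-8")))
    for cp in ("cp437", "cp850", "cp1252", "latin-1"):   # latin-1 is cp 28591
        lossless, hx = codepage_report(s, cp)
        print(f"{cp:8}", "==" if lossless else "->", hx)
```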