I've looked up prior art on this topic but haven't found much, except hints that it's not possible. Just thought I'd ask in case I missed something somewhere...
For a quick recap, cmd input is fully unicode, and both 'set' and 'set /p' will take a unicode string and assign it correctly, regardless of the active codepage. Once set, a variable can be safely copied with another 'set', again regardless of the chcp in effect. (Side note: I am aware that "unicode string" is the wrong term technically, and I am only using it here as a shortcut for "text containing characters outside the individual codepages used during the exercise".)
The open question, however, seems to be how to get such a unicode string into a variable to begin with, other than typing it in or pasting it interactively at the prompt. (Another side note: the question is about an arbitrary user-defined string, as opposed to an existing file/directory name, which can be retrieved with the appropriate 'for' loop.)
Assuming the text exists in an external file (say, UTF-8 or UTF-16), jeb's trick (last post at http://www.dostips.com/forum/viewtopic.php?f=3&t=1462&start=0) can read it correctly, but for redirection purposes only. As noted in his post, a simple echo fails, and so does any attempt to 'set' it to a variable.
I have tried a number of variations on the theme, with '<file set /p', 'type | set /p' and combinations of chcp and 'cmd /u' but haven't hit the right note yet. Any further pointers welcome.
Liviu
setting variable to value from utf-encoded file
Re: setting variable to value from utf-encoded file
Hi Liviu,
You can copy any content from one variable to another with delayed expansion; this is always safe.
You can get any content (without the <NUL> character) from a file with a FOR/F loop or the SET/p technique (which has some quirks with control characters at the line end).
You can echo any content with echo( and DelayedExpansion.
Code:
setlocal DisableDelayedExpansion
for /f "delims=" %%a in (myFile.txt) do (
  rem assign with delayed expansion off so exclamation marks and carets in the line survive
  set "line=%%a"
  setlocal EnableDelayedExpansion
  set "newVar=!line!"
  (echo(!newVar!)
  endlocal
)
This sample doesn't handle empty lines or lines beginning with ";" (EOL), but that can easily be solved.
jeb
Re: setting variable to value from utf-encoded file
jeb wrote: You can get any content (without the <NUL> character) from a file with a FOR/F loop or the SET/p technique
Thanks, jeb, but I don't think either works when the file is UTF-8 or UTF-16, which is the difficult point here.
Since my phrasing of the question was a bit elliptical, here it is in more detail... I can set a variable to unicode text interactively with no problems, and once set I can echo/copy/use it fine.
C:\tmp>chcp & set "ucs2=‹αß©∂€›" & set ucs2
Active code page: 437
ucs2=‹αß©∂€›
C:\tmp>set "ucs2=" & set /p "ucs2=" & set ucs2
‹αß©∂€›
ucs2=‹αß©∂€›
C:\tmp>echo %ucs2%
‹αß©∂€›
I can save the contents of the variable to a UTF-8 file...
C:\tmp>chcp 65001
Active code page: 65001
C:\tmp>echo %ucs2%>utf8.txt
C:\tmp>chcp 437
...or a UTF-16-LE (no BOM) file...
C:\tmp>cmd /u /c echo %ucs2%>utf16le.txt
...or a UTF-16-LE (with BOM) file.
C:\tmp>chcp 1252
Active code page: 1252
C:\tmp>(set /p =ÿþ) <nul >utf16.txt 2>nul
C:\tmp>cmd /u /c echo %ucs2%>>utf16.txt
C:\tmp>chcp 437
The byte-by-byte binary contents of the files are...
Code:
utf8.txt:
00000000 E2 80 B9 CE B1 C3 9F C2 A9 E2 88 82 E2 82 AC E2
00000010 80 BA 0D 0A
utf16le.txt:
00000000 39 20 B1 03 DF 00 A9 00 02 22 AC 20 3A 20 0D 00
00000010 0A 00
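As a cross-check outside cmd (not part of the original post), a short Python sketch can reproduce the two dumps above from the string itself. It assumes each file holds the text followed by CR LF, as 'echo' writes it; the helper name hex_dump is made up for this illustration. The last lines also check that the two characters "ÿþ" typed under codepage 1252 correspond to the UTF-16-LE BOM bytes FF FE.

```python
# Sketch only: recompute the byte dumps of utf8.txt and utf16le.txt
# from the sample text, assuming a trailing CR LF as written by 'echo'.
text = "‹αß©∂€›"


def hex_dump(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


utf8_line = (text + "\r\n").encode("utf-8")
utf16le_line = (text + "\r\n").encode("utf-16-le")  # no BOM is added

print(hex_dump(utf8_line))
print(hex_dump(utf16le_line))

# The BOM trick: 'ÿþ' in codepage 1252 is the byte pair FF FE.
bom = "ÿþ".encode("cp1252")
print(hex_dump(bom))
```

The BOM-prefixed utf16.txt should then simply be those two bytes followed by the same UTF-16-LE content.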
My question is: if the ucs2 variable did not exist, but I had one of those 3 text files (with the utf-encoded text), how could ucs2 be re-created from the file?
The technique you showed in the old post works for just _copying_ the UTF-8 file, but not setting a variable from it (well, you could for example 'set ucs2new' in the 'for' loop, but the resulting variable won't match the original i.e. "%ucs2%"=="%ucs2new%" fails).
Liviu
Re: setting variable to value from utf-encoded file
Just for the record, here is a tentative answer to the title question... The following demonstrates a way to convert hardcoded or read-from-file UTF-8 strings to UTF-16 and store them in a regular, usable variable. On one hand, the code is not pretty and the conversion is painfully slow. On the other hand, it does actually work (tried under xp.sp3 and win7.sp1), and only uses reg.exe and wmic.exe, which are builtins as of xp+. As far as I can tell, it's not been attempted this way before. Maybe this inspires someone to come up with a neater, faster, pure-batch solution.
Basic idea was fairly straightforward:
1. Get the string somehow merged into the registry under HKCU\Environment.
2. Pick up the newly registered environment variable from the registry, and use it happily ever after.
Difficulties along the way:
1.a. The UTF-8 string can be saved as either UTF-8 or UTF-16LE to an external file using known tricks, previously discussed. But the natural choice for registry manipulation "setx.exe -f" doesn't seem to do a proper codepage translation from UTF-8, nor take a UTF-16LE input file. Workaround was to manually build a UTF-16LE .reg file, then use reg.exe to merge it into the registry.
2.a. Once the new variable is added to the registry, Windows needs to be notified before it acknowledges it (http://support.microsoft.com/kb/104011 - How to propagate environment variables to the system). The batch code itself cannot send the expected WM_SETTINGCHANGE message. One would hope that setx.exe did that after effecting changes, but it doesn't appear to. Turns out that wmic.exe does it after environment changes, however.
2.b. Even once Windows is notified, the environment changes are only visible to future processes, since each current one maintains its own copy, initialized at the time it was started. So a new process is needed to pick up the changes. Unfortunately, any 'cmd' launched from the active console runs as a child process and inherits the environment of its parent (either the current one, or the original one for 'start /i'), i.e. it is oblivious to system-level environment changes. One way to start a new 'cmd' process not-as-a-child is to use 'wmic process call create'.
2.c. Once the new process is started and sees the just-added environment variable, the issue remains that it has no direct way to return the value to the caller. The workaround here is to have it create a temporary file named after the variable's value, whose name can then be read back in the original batch. Since wmic starts the secondary 'cmd' asynchronously, the caller needs to wait until the callee completes.
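The .reg file from step 1.a can also be sketched outside of batch. The Python snippet below only illustrates the file layout (BOM, fixed header, one "name"="value" line, all in UTF-16LE); it is not the method used by set-utf8.cmd. The function name build_reg_bytes and the value name demo_var are hypothetical, and the escaping of backslashes and double quotes is my assumption about what a robust version would need.

```python
import codecs


def build_reg_bytes(name: str, utf8_value: bytes) -> bytes:
    """Return the bytes of a UTF-16LE .reg file (with BOM) that defines
    one string value under HKEY_CURRENT_USER\\Environment."""
    value = utf8_value.decode("utf-8")
    # Assumption: backslashes and double quotes are escaped inside
    # quoted .reg string values.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    text = (
        "Windows Registry Editor Version 5.00\r\n"
        "\r\n"
        "[HKEY_CURRENT_USER\\Environment]\r\n"
        f'"{name}"="{escaped}"\r\n'
    )
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


# Same sample bytes as utf8.txt, without the trailing CR LF.
sample = bytes.fromhex("E280B9CEB1C39FC2A9E28882E282ACE280BA")
reg_bytes = build_reg_bytes("demo_var", sample)
print(len(reg_bytes), "bytes, starting with", reg_bytes[:2].hex())
```

Written to disk in binary mode, such a file would be the kind of input that 'reg import' is given in the batch code.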
That said, the sample set-utf8.cmd code is copied below.
Code:
:: set-utf8.cmd - convert utf-8 to utf-16 and store in an(other) variable
::
:: syntax: set-utf8 [out,ref] string-var, [in,ref] utf-8-string-var
::
:: - expected to fail on 'poison' (&%!) and illegal <:"\/|> path characters
:: which is fixable, but not relevant to the main point of this exercise
::
:: - otherwise checked ok under xp.sp3, win7.sp1.x64
@echo off & setLocal enableExtensions disableDelayedExpansion
if "%~2"=="" ( echo.
@rem dump :: comment lines at the top of the file
for /f "usebackq delims=" %%a in ("%~f0") do (
set "z=%%~a" & setlocal enableDelayedExpansion
if not "!z:~0,1!"==":" endlocal & goto :eof
echo !z! & endlocal
)
endLocal & goto :eof
)
@rem save original codepage ('.' for some localized windows e.g. german)
for /f "tokens=2 delims=:." %%a in ('chcp') do @set /a "cp=%%~a"
@rem set global variables
set "hkcu.env=HKEY_CURRENT_USER\Environment"
@rem utf-16le bom, hex 'FF FE' n.b. win7 requires chcp 1252, first
chcp 1252 >nul
set "bom16le=ÿþ"
chcp %cp% >nul
call :set.utf u16 "%~2"
endLocal & set "%~1=%u16%" & goto :eof
:set.utf
setLocal enableDelayedExpansion
set "var==%time::=.%.%random%"
set "tmp8=%temp%\%var%.tmp"
set "reg16=%temp%\%var%.reg"
:: build utf-16le .reg file including bom n.b. win7 requires chcp 1252, first
chcp 1252 >nul
cmd /d /a /c (set/p "=%bom16le%") <nul >"%reg16%" 2>nul
chcp %cp% >nul
@rem save fixed header
cmd /d /u /c ^
(echo Windows Registry Editor Version 5.00) ^& ^
(echo.) ^& ^
(echo [%hkcu.env%]) >>"%reg16%"
@rem save variable, separate echo> + cmd/u type>> required for utf-8 conversion
echo "%var%"="!%~2!" >"%tmp8%"
chcp 65001>nul & cmd /u /c type "%tmp8%" >>"%reg16%" & chcp %cp%>nul
del "%tmp8%"
:: set variable in user's environment
@rem n.b. win7 sends 'operation completed successfully' to &2, therefore 2>&1
reg import "%reg16%" >nul 2>&1
:: force an environment refresh for the next cmd to pick up the new variable
@rem create another dummy variable since under xp at least
@rem - setx doesn't broadcast the necessary wm_settingchange, and anyway
@rem it only comes with the resource kit, not in the default install
@rem - wmic 'environment create' does broadcast the wm_settingchange, but
@rem sometimes hangs at exit waiting for input, therefore the <nul
wmic environment create name="%var% ",variablevalue=" ",username="%username%" <nul >nul 2>&1
:: run an external (not child) cmd to create a temp file with the utf-16 name
md "%temp%\!var!"
wmic process call create '%comspec% /v /c copy nul "%temp%\!var!\^!%var%^!.tmp"' <nul >nul 2>&1
:: wait until the external cmd completes
set "u16="
:loop
for %%u in ("%temp%\!var!\*.tmp") do set "u16=%%~nu"
if not defined u16 goto :loop
:: cleanup
rd /s /q "%temp%\!var!"
reg delete "%hkcu.env%" /v "!var!" /f >nul 2>&1
@rem this removes the other dummy variable, also forces an environment refresh
wmic environment where(name="!var! ") delete <nul >nul 2>&1
del "%reg16%"
endLocal & set "%~1=%u16%" & goto :eof
Test case using the set-utf8-test.cmd copied below, and assuming the same utf8.txt file from the previous post
Code:
@echo off & setLocal disableDelayedExpansion & echo.
:: example of reading utf-8 from external file
@rem binary contents of 'utf8.txt' must be
@rem E2 80 B9 CE B1 C3 9F C2 A9 E2 88 82 E2 82 AC E2 80 BA 0D 0A
for /f %%s in (utf8.txt) do set "ucs2.utf8=%%s"
call set-utf8 "ucs2" "ucs2.utf8"
setLocal enableDelayedExpansion
echo "!ucs2.utf8!" [utf-8] = "!ucs2!" [utf-16]
endLocal
:: example of hardcoding utf-8 in batch itself
@rem binary contents of string below in the .cmd file must be
@rem E2 80 B9 CE B1 C3 9F C2 A9 E2 88 82 E2 82 AC E2 80 BA
set "ucs2.utf8=‹αß©∂€›"
call set-utf8 "ucs2" "ucs2.utf8"
setLocal enableDelayedExpansion
echo "!ucs2.utf8!" [utf-8] = "!ucs2!" [utf-16]
endLocal
endLocal & goto :eof
outputs
Code:
C:\tmp>set-utf8-test
"ΓÇ╣╬▒├ƒ┬⌐ΓêéΓé¼ΓÇ║" [utf-8] = "‹αß©∂€›" [utf-16]
"ΓÇ╣╬▒├ƒ┬⌐ΓêéΓé¼ΓÇ║" [utf-8] = "‹αß©∂€›" [utf-16]
C:\tmp>
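The garbled [utf-8] column above is consistent with the raw UTF-8 bytes being shown one byte per character in codepage 437. A small Python sketch (my own cross-check, assuming the active codepage is 437 as in the earlier transcripts) reproduces it:

```python
# Sketch only: show what the UTF-8 bytes of the sample text look like
# when each byte is interpreted as a codepage 437 character.
text = "‹αß©∂€›"

as_seen_in_cp437 = text.encode("utf-8").decode("cp437")
print(as_seen_in_cp437)

# Going the other way recovers the original text, which is in effect
# the conversion that set-utf8.cmd has to perform.
recovered = as_seen_in_cp437.encode("cp437").decode("utf-8")
print(recovered == text)
```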
Liviu