Code:
:: ‹αß©∂€›
::
:: first line in hex should be
:: 3A 3A 20 E2 80 B9 CE B1 C3 9F C2 A9 E2 88 82 E2
:: 82 AC E2 80 BA 0D 0A
@echo off & setlocal disableDelayedExpansion
for /f "tokens=2 delims=:" %%a in ('chcp') do @set /a "cp=%%~a"
@rem works in win7, fails in xp
chcp 65001 >nul
call :test
chcp %cp% >nul
call :dump
@rem works in win7, fails in xp
(chcp 65001 >nul) & call :test & (chcp %cp% >nul)
call :dump
@rem ...but this doesn't work !?
set "x="
(chcp 65001 >nul) & (<"%~f0" set /p "x=") & (chcp %cp% >nul)
setlocal enableDelayedExpansion
echo( & echo !x:~3! & endlocal
endlocal & goto :eof
:test --------------------------------
@rem inline assignment
set "a="
set "a=‹αß©∂€›"
@rem set/p from file
set "b="
<"%~f0" set /p "b="
set "b=%b:~3%"
@rem for/f from file
set "c="
for /f "usebackq delims=" %%c in ("%~f0") do (
  set "c=%%~c"
  goto :c
)
:c
set "c=%c:~3%"
@rem for/f from command output
@rem ...but 'more ^<"%~f0"' and 'type "%~f0" ^| more'
@rem both fail with 'not enough memory' !?
set "d="
for /f "delims=" %%d in ('type "%~f0"') do (
  set "d=%%~d"
  goto :d
)
:d
set "d=%d:~3%"
goto :eof
:dump --------------------------------
echo(
setlocal enableDelayedExpansion
echo !a!
echo !b!
echo !c!
echo !d!
endlocal & goto :eof
Code:
C:\tmp>chcp 65001
Active code page: 65001
C:\tmp>type w7-utf8.txt >w7-utf8.cmd
C:\tmp>
Running the batch file under XP (sp3) fails at the first 'chcp 65001' line, and outputs nothing at all. In Win7 (x64.sp1) however, it yields...
Code:
C:\tmp>ver
Microsoft Windows [Version 6.1.7601]
C:\tmp>chcp
Active code page: 437
C:\tmp>w7-utf8
‹αß©∂€›
‹αß©∂€›
‹αß©∂€›
‹αß©∂€›
‹αß©∂€›
‹αß©∂€›
‹αß©∂€›
‹αß©∂€›
‹αß©∂€›
C:\tmp>
Some notes...
This does not mean, nor is it meant to imply, that Win7 now runs UTF-8 batch files natively - it does not. The parts of the sample code which contain characters outside the default codepage (in my case 437) are only ever accessed while an explicit chcp 65001 is in effect. The reason this works is that UTF-8 is a byte-oriented encoding: its first 128 codepoints are encoded exactly as in ASCII, and every byte of a multi-byte sequence falls in the 128-255 range, so the values 0-127 never occur inside one. In particular, line breaks are the same in UTF-8 and ASCII. If a string contains no control characters, its UTF-8 encoding will contain no control characters, either. If a string does not contain quotes, neither will its UTF-8 encoding. In other words, the batch parser sees the .cmd file as a regular text file, with maybe some "odd" - but harmless - characters in the 128-255 range here and there. When, and only when, the codepage is explicitly switched to 65001, those "odd" characters are interpreted as UTF-8.
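For anyone who wants to double-check the byte-level claims outside of cmd, here is a minimal sketch in Python (not part of the batch sample; the variable names are made up for illustration). It encodes the first line of the batch file as UTF-8, prints the hex dump for comparison with the bytes quoted in the header comments, and checks that the non-ASCII characters never produce a byte in the 0-127 range.

```python
# Sanity check of the UTF-8 properties the batch trick relies on.
# Names below are illustrative only; nothing here comes from the batch file
# except the text of its first line.
first_line = ":: ‹αß©∂€›\r\n"
encoded = first_line.encode("utf-8")

# Hex dump, to compare against the bytes quoted in the batch file header.
hex_dump = " ".join(f"{b:02X}" for b in encoded)
print(hex_dump)

# ASCII characters encode to the very same single bytes.
ascii_part = ":: ".encode("utf-8")

# Every byte of a multi-byte sequence is in the 128-255 range...
non_ascii_bytes = "‹αß©∂€›".encode("utf-8")
all_high = all(b >= 0x80 for b in non_ascii_bytes)

# ...so none of them can be mistaken for CR, LF, a quote, or a character
# that is special to the batch parser.
special = b'\r\n"%^&|<>()'
no_special = not any(b in special for b in non_ascii_bytes)

print(all_high, no_special)
```

The same check should hold for any other non-ASCII text, since it follows from how UTF-8 is defined rather than from these particular characters.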
The 65001 codepage should be used sparingly, just for initializing the variables or reading the necessary data, then reverted to the default codepage. External programs may misbehave if launched under the 65001 codepage. Besides...
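Redirection and pipes are still broken under chcp 65001: 'more <file' and 'type file | more' both fail with 'not enough memory'. Also, the codepage associated with the parser and the input/output streams still seems to be decided in advance for a whole compound line or parenthesized block, rather than for each individual command as it runs, so switching to 65001 and reading with set/p on the same line does not work - see the "...but !?" comments in the code.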
The for/f loops and the set/p reads in the above code take the first line of the "%~f0" batch file itself. That's just so that the sample is self-contained, without requiring an auxiliary data file. In real life, the same code could work with an external UTF-8 encoded file, and would not need to stop at the first line, of course.
Liviu