Hi,
orange_batch wrote: As I just discovered, you can properly write and read a UTF-8 file under codepage 65001, but as usual, for is unable to see the Unicode from type. Strangely, you can paste Unicode into for just fine, though. Redirection doesn't work either... type must do some kind of look-up when handling Unicode that for doesn't wait for.
I see several questions.
1. Displaying Unicode
2. Working with fixed text inside your batch file
3. Working with Unicode in a for-loop
4. Redirecting Unicode to another (Unicode) text file
5. Comparing characters/internal representation
1. In my experience, type can display Unicode files regardless of cmd /u or cmd /a, and the codepage is also irrelevant. Only UTF-16 Little Endian files seem to work.
But the font of the cmd window is important: set it to Lucida Console and you get more characters, but not all. The rest are simply missing in Lucida Console.
So far I haven't been able to activate another font for my cmd window (I tried "Arial Unicode MS" and "Courier New" in the registry).
2. Not tested yet
3. Works for me with a Unicode Little Endian file, but it's necessary to set the right codepage before the FOR starts, and it only works with type, not directly with a file.
It works with cmd /a but not with cmd /u (which creates a file without BOM, perhaps in UTF-32 format; it is much longer than the other file).
Code:
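(
    rem The whole block is parsed before the codepage is switched
    del u16_65001.txt 2> nul
    chcp 65001 > nul
    rem Not necessary, but does no harm: start the output file with a UTF-8 BOM
    copy bom_utf8.txt u16_65001.txt
    rem Read the UTF-16 LE file through type and append each line to the output
    for /F "usebackq delims=" %%a in (`type unicode_L16.txt`) do (
        echo %%a >> u16_65001.txt
    )
    rem Restore the previous codepage
    chcp 1252 > nul
)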
The opening parenthesis has to come before the codepage is changed to 65001, otherwise my batch stops immediately after chcp 65001.
Only the redirected output worked; a plain echo to the console displays garbage.
4. Redirecting seems to work, see 3. Only UTF-8 files without BOM are created successfully. You can prefix them with your own BOM file, but it makes no difference to my text editor.
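To check which BOM (if any) an output file really starts with, a small script outside of cmd can help. This is only a sketch in Python, not something I ran as part of the batch tests. The names sniff_bom and sniff_file are made up for this example. A file without BOM gives None.

```python
import codecs

# Longest BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM bytes,
# so it has to be checked before UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return the encoding named by a leading BOM in data, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def sniff_file(path):
    """Read the first four bytes of the file and report its BOM, if any."""
    with open(path, "rb") as f:
        return sniff_bom(f.read(4))
```

For example, sniff_file("u16_65001.txt") could be called on the file written by the loop in 3. Keep in mind that a missing BOM only tells you there is no marker; it does not tell you which encoding the file uses.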
5. The Unicode characters seem to be represented as multiple bytes, not as a single character.
So a string like "AﮓBﮧC" has a length of 9, because both Unicode characters are represented by 3 bytes each (no, I can't read it).
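The length of 9 matches the UTF-8 byte count of that string. A quick check outside of cmd, again only as a Python sketch (utf8_len and per_char are names I made up here):

```python
# The sample string from above: three ASCII letters and two non-ASCII characters
s = "AﮓBﮧC"


def utf8_len(text):
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# Code point and UTF-8 byte count for every character of the string
per_char = [("U+%04X" % ord(c), utf8_len(c)) for c in s]

if __name__ == "__main__":
    print(len(s), "characters,", utf8_len(s), "bytes")
    for code_point, size in per_char:
        print(code_point, size)
```

The string has 5 characters but 1 + 3 + 1 + 3 + 1 = 9 bytes in UTF-8, so it looks like the length is counted in bytes under codepage 65001.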
My test file:
Code:
СРРР
Њ-Ћ-
абвгд
ежзийк-
ﭱﭲﭳ
ﭴﭵﭶﭷ
ﭸﭹﭺ
ﭻﭼﭽ
ﭾﭿﮓ
AﮧBﮦCﯛ
ﯚﱞﱟﮰ-
Waiting for more information.
jeb