
how to pass/read unicode char on command line/batch script

Posted: 13 Nov 2010 13:02
by vivek4075
Please help me.

I have a batch script that reads a properties file containing some special Unicode characters.
How can I read those characters correctly so that I can use them in other operations?

Thanks
vJoy

Re: how to pass/read unicode char on command line/batch scri

Posted: 14 Nov 2010 05:54
by amel27

Code: Select all

help cmd|find "/U"

Re: how to pass/read unicode char on command line/batch scri

Posted: 22 Nov 2010 23:11
by vivek4075
Hi,
Is a scenario like this possible: if the OS locale is English, can we pass Japanese characters on the command line, provided all the language fonts are installed on the system?
If yes, please let me know how to pass the correct value via cmd.

Please help.
Thanks in advance,
VJoy

Re: how to pass/read unicode char on command line/batch scri

Posted: 22 Nov 2010 23:53
by orange_batch
I did extensive research into unicode in relation to Command Prompt for my own purposes recently.

It can receive unicode arguments from explorer.exe/Windows OS.
It can write unicode to a text file in UTF-16LE format. Use "cmd /u" and redirect ">" / ">>". But...

It can't read any unicode file properly, even ones it encoded itself in UTF-16LE. Some latin-based languages might display ok, but extended unicode like Asian languages is impossible.

Command Prompt can work with unicode characters, representing them with an odd symbol, and it knows that two unicode characters are different. You can even retrieve unicode characters from folder/file paths using "for". (In fact I built an intuitive best-case-scenario script for retrieving the unicode from wildcard-containing paths without error.) ...but that's it.

It can only display the characters provided through the ANSI codepage. Yet this has no effect on its ability to handle unicode. It's really only meant for the basic Roman alphabet and command characters for dealing with file and folder paths, not text.

Code: Select all

chcp /?

That is to say, changing locale or anything on the OS will have no effect. As I explained, Command Prompt relies on ANSI codepages.

So what else can be done? I tried to get people who can program in C/C++ to write a tool that works like "type" but processes unicode as the Windows OS does when passing an argument, but failed due to a general lack of knowledge. This would at least allow users to work with unicode characters stored in text files.

I believe it's possible to fix this since Windows can pass unicode and users can paste unicode onto the command line, but it's still beyond my capabilities. I hope this helps you understand the problem.
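
The "type"-like tool described above could be sketched roughly as follows. This is only an illustration in Python, not an existing utility: `decode_text`, the `BOMS` table and the `cp437` fallback are assumptions made up for the example. It looks for a byte order mark and decodes accordingly, falling back to a single-byte codepage when no BOM is present.

```python
import codecs

# BOM signatures this sketch recognises. UTF-32 is deliberately left out
# to keep the example short.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_text(raw: bytes, fallback: str = "cp437") -> str:
    """Decode raw file bytes using the BOM if present (hypothetical helper).

    Without a BOM the bytes are decoded with the fallback codepage,
    which is an assumption of this sketch.
    """
    for bom, encoding in BOMS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(encoding)
    return raw.decode(fallback)


# A UTF-16LE file with BOM, like one produced by a Unicode-aware editor.
sample = codecs.BOM_UTF16_LE + "き".encode("utf-16-le")
print(decode_text(sample))
```

A real tool would still have to write the decoded text to the console through a Unicode-aware API, which is the part Command Prompt itself makes difficult.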

Re: how to pass/read unicode char on command line/batch scri

Posted: 23 Nov 2010 03:29
by amel27
orange_batch wrote:I did extensive research into unicode in relation to Command Prompt for my own purposes recently.

It can receive unicode arguments from explorer.exe/Windows OS.
It can write unicode to a text file in UTF-16LE format. Use "cmd /u" and redirect ">" / ">>". But...

It can't read any unicode file properly, even ones it encoded itself in UTF-16LE. Some latin-based languages might display ok, but extended unicode like Asian languages is impossible.

It can write an OEM string to a Unicode file in UTF-8 format like this
(866 is the default code page for my Russian locale; a BOM can be created via the Set/P command):

Code: Select all

set LINE=Some localized OEM text
set FILE=test.txt

CHCP 65001|>>%FILE% Echo %LINE%&CHCP 866


The TYPE command can be used to read UTF-16/UTF-8 files with a BOM and convert them to OEM/ANSI.

Re: how to pass/read unicode char on command line/batch scri

Posted: 23 Nov 2010 03:49
by orange_batch
Ah, so it will also output UTF-8.

amel27 wrote:The TYPE command can be used to read UTF-16/UTF-8 files with a BOM and convert them to OEM/ANSI.


Example please?

Re: how to pass/read unicode char on command line/batch scri

Posted: 23 Nov 2010 06:39
by amel27
orange_batch wrote:
amel27 wrote:The TYPE command can be used to read UTF-16/UTF-8 files with a BOM and convert them to OEM/ANSI.


Example please?

Of course, that was a mistake for UTF-8, but it works for UTF-16 with a BOM: :roll:

Code: Select all

(for /f "delims=" %%a in  ('type utf16.txt') do echo.%%a
)>oem.txt

Re: how to pass/read unicode char on command line/batch scri

Posted: 23 Nov 2010 15:08
by orange_batch
*scratches head* :?

So Command Prompt can write UTF-8 files, and type can read UTF-8 files when the codepage is 65001 (chcp 65001). I'm still unable to set the unicode output of type though... Is there any way?

Re: how to pass/read unicode char on command line/batch scri

Posted: 23 Nov 2010 15:42
by !k

Code: Select all

cmd /u /c type oem.txt > utf.txt

Re: how to pass/read unicode char on command line/batch scri

Posted: 23 Nov 2010 16:34
by orange_batch
That just converts certain OEM to UTF-8...

If you want to see what I mean, try experimenting with this Japanese character: き

Under any and all circumstances (codepages, unicode mode, whatever), paste it into Command Prompt, echo it to a text file, type it back, try to set it to a variable with:

Code: Select all

for /f "delims=" %x in ('type utf.txt') do @set myvar=%x


As I just discovered, you can properly write and read a UTF-8 file under codepage 65001, but as usual for is unable to see the unicode from type. Strangely, you can paste unicode into for fine though. Redirection doesn't work either... type must do some kind of look-up when doing unicode that for doesn't wait for. :(

Re: how to pass/read unicode char on command line/batch scri

Posted: 23 Nov 2010 18:53
by amel27
orange_batch wrote:That just converts certain OEM to UTF-8...

Hmm... interesting, for me it converts OEM to UTF-16, but without a BOM (byte order mark). The BOM can be written to the file before the text output.
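
To make the BOM point concrete, here is a small Python sketch (not batch, and only an illustration) of what "UTF-16 without BOM" versus "BOM written before the text" looks like at the byte level. The variable names are made up for the example.

```python
import codecs

text = "き"                               # the sample character from this thread
no_bom = text.encode("utf-16-le")         # raw UTF-16LE bytes, no BOM
with_bom = codecs.BOM_UTF16_LE + no_bom   # FF FE placed before the text

print(no_bom.hex(" "))
print(with_bom.hex(" "))

# With the BOM in front, a generic "utf-16" decoder can detect the byte order.
print(with_bom.decode("utf-16"))
```

The two leading bytes FF FE are the only difference between the two files; the text bytes themselves are identical.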

orange_batch wrote:If you want to see what I mean, try experimenting with this Japanese character
I think that before experimenting, the destination language must be set as the default locale in the OS, and this character must be present in the OEM charset for that language.
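
Whether a character is present in a given OEM charset can be checked outside of cmd. The following Python sketch is only an illustration: `fits_codepage` is a hypothetical helper, and the codepages tested (866 for Russian OEM, 1252 for Western ANSI, 932 for Japanese) are picked as examples.

```python
def fits_codepage(text: str, codepage: str) -> bool:
    """Return True if every character of text can be encoded in codepage."""
    try:
        text.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


# Try the Japanese sample character against three codepages.
for cp in ("cp866", "cp1252", "cp932"):
    print(cp, fits_codepage("き", cp))
```

Only the Japanese codepage can represent the character, which is consistent with the idea that the conversion to OEM can only succeed when the target charset contains it.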

Re: how to pass/read unicode char on command line/batch scri

Posted: 23 Nov 2010 19:04
by orange_batch
Er sorry, whatever UTF it converted to. It's irrelevant to the main problem. :wink: But now I'm convinced I'm right about that problem.

Re: how to pass/read unicode char on command line/batch scri

Posted: 24 Nov 2010 13:11
by jeb
Hi,

orange_batch wrote:As I just discovered, you can properly write and read a UTF-8 file under codepage 65001, but as usual for is unable to see the unicode from type. Strangely, you can paste unicode into for fine though. Redirection doesn't work either... type must do some kind of look-up when doing unicode that for doesn't wait for.


I see several questions.
1. Displaying unicode
2. Working with fixed text inside your batch file
3. Working with unicode in a for-loop
4. Redirecting unicode to another (unicode) text file
5. Comparing characters/internal representation

1. In my opinion, I can display unicode files with type regardless of cmd /u or cmd /a, and the codepage is also irrelevant. Only UTF-16 little-endian files seem to work.
But the font of the cmd window is important: set it to Lucida Console and you get more characters, but not all.
That's simply because they are missing in Lucida Console.
So far I have not been able to activate another font for my cmd window (I tried "Arial Unicode MS" and "Courier New" in the registry).

2. Not tested yet

3. Works for me with a Unicode little-endian file, but it's necessary to set the right codepage before the FOR starts, and it only works through type, not by reading the file directly.
Works with cmd /a but not with cmd /u (that creates a file without a BOM, perhaps in UTF-32 format, since it is much longer than the other file).

Code: Select all

(
   del  u16_65001.txt 2> nul
   chcp 65001 > nul
   rem Not necessary, but harmless: start the output file with a UTF-8 BOM
   copy bom_utf8.txt u16_65001.txt
   for /F "usebackq delims=" %%a in (`type unicode_L16.txt`) do (
         echo %%a >> u16_65001.txt
   )
   chcp 1252  > nul
)


The opening parenthesis has to come before the codepage is changed to 65001; otherwise my batch stops immediately after chcp 65001.
Only the redirected output worked; a plain echo displays garbage.

4. Redirecting seems to work, see 3. Only UTF-8 files without a BOM are created successfully. You can prefix the output with your own BOM file, but my text editor doesn't show any difference either way.

5. The unicode characters seem to be represented as multiple bytes, not as single characters.
So a string like "AﮓBﮧC" has a length of 9, because both unicode characters are represented by 3 bytes each (no, I can't read it).
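
The byte count in point 5 can be reproduced outside of cmd. This is a short Python sketch, assuming the string is held as UTF-8 (as it would be under codepage 65001); it is only meant to show where the 9 comes from.

```python
s = "AﮓBﮧC"

print(len(s))                    # number of characters (code points)
print(len(s.encode("utf-8")))    # number of bytes once encoded as UTF-8

# Per-character breakdown: code point and UTF-8 byte count.
for ch in s:
    print(f"U+{ord(ch):04X}", len(ch.encode("utf-8")))
```

The three ASCII letters take one byte each and the two Arabic presentation-form characters take three bytes each, so a byte-oriented length check sees 9 where a character-oriented one sees 5.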

My test file:

Code: Select all

СРРР
Њ-Ћ-
абвгд
ежзийк-
ﭱﭲﭳ
ﭴﭵﭶﭷ
ﭸﭹﭺ
ﭻﭼﭽ
ﭾﭿﮓ
AﮧBﮦCﯛ
ﯚﱞﱟﮰ-


Waiting for more information 8)

jeb