An alternative command prompt (Dos9)

Liviu · #31 Post by **Liviu** » 25 Apr 2014 22:10

Darkbatcher wrote:ANSI input can still be used as long as you continue to type Japanese on a Japanese codepage

Not sure what you mean here, but anyway Japanese codepages are not the best starting point to practice codepage-based i/o in Windows. For one thing, they are double-byte codepages, which introduces another whole layer of complications vs. the single-byte western codepages.

Darkbatcher wrote:But there's another problem: on MSDN, the fgetws function documentation says that the file is read as multibyte characters. But I was wondering which encoding it really refers to, because if it is codepage-independent, reading batch files would be quite inefficient, since batch files are written in plenty of codepages.

According to http://msdn.microsoft.com/en-us/library/c4cy2b8e.aspx you need to open the file in binary mode for fgetws to read it as Unicode. Otherwise (when opened as text) the file is assumed to be narrow text in the current codepage - as set by your code, or maybe the libraries you are using, or inherited from the parent console, or lastly the system default OEM codepage.
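To see why the distinction matters, here is a small sketch in Python (not the CRT itself, only the same bytes decoded two ways; `cp1252` is picked arbitrarily to play the role of the "current codepage", and the sample text is made up):

```python
# Sketch only: the same file bytes read as Unicode and as narrow text.
raw = "dir".encode("utf-16-le")        # bytes of a UTF-16 (little-endian) file

as_unicode = raw.decode("utf-16-le")   # interpreted as wide characters
as_narrow = raw.decode("cp1252")       # interpreted as narrow codepage text

# The narrow reading yields a NUL character after every letter.
print(repr(as_unicode), repr(as_narrow))
```

Read as narrow text, every second character is a NUL, which is roughly what a program sees when it treats a UTF-16 file as codepage text.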

Darkbatcher wrote:But I'm still not convinced by the translation of UTF-16, because usually, when I type "dir" in a folder where there are names like "trollé", the "é" is displayed as "ù" although there is an equivalent for "é" in the current codepage.

Make sure you are using a Unicode/TT (not raster) font in the console. Cmd's own echo/dir/for/set/etc have no problem with Unicode characters.

The above are more about general Windows/Unicode/C programming, rather than batch in particular. If you are serious about it, I suggest you start with the MSDN docs for reference, and find a better suited Windows/C/C++ forum/board for technicalities. Good luck with your project.

Liviu

#32 Post by **penpen** » 26 Apr 2014 04:26

@Liviu: I fear the Japanese example is one of mine above.

@Darkbatcher: If you could ensure that everything is read and written in one codepage, this would indeed be no problem,
but you can't, as the command shell is allowed to change the active codepage (using the "chcp" command).
You also can't ensure that every raw text file was created using this codepage, so you would run into problems
when trying to interpret, for example, the byte value 0x80, as it is not used by cp932.
See: http://msdn.microsoft.com/en-us/goglobal/cc305152
If you want to use a single codepage, just make sure it is a single-byte codepage that interprets all byte values,
for example the OEM codepage 850.
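As a rough illustration of the single-byte versus double-byte point, here is a sketch in Python rather than batch. It relies on Python's bundled `cp850` and `cp932` codecs, which are only stand-ins and may differ in details from the Windows conversion tables. Every byte value decodes to exactly one character under cp850, while under cp932 some bytes are only the first half of a two-byte character.

```python
# Sketch only: Python's codecs stand in for the Windows codepage tables.
all_bytes = bytes(range(256))

# cp850 is single-byte: each of the 256 byte values maps to one character,
# and encoding the result gives back the original bytes.
text_850 = all_bytes.decode("cp850")
roundtrip_ok = text_850.encode("cp850") == all_bytes

# cp932 is double-byte: the two bytes 0x82 0xA0 together form one character.
two_byte_char = b"\x82\xa0".decode("cp932")

# A lone lead byte cannot be decoded on its own.
try:
    b"\x82".decode("cp932")
    lone_lead_byte_ok = True
except UnicodeDecodeError:
    lone_lead_byte_ok = False

print(len(text_850), roundtrip_ok, hex(ord(two_byte_char)), lone_lead_byte_ok)
```

A file that was cut or read in the middle of such a two-byte pair is therefore not interpretable byte by byte, which is the complication a single-byte codepage avoids.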

Darkbatcher wrote:But there's another problem: on MSDN, the fgetws function documentation says that the file is read as multibyte characters. But I was wondering which encoding it really refers to, because if it is codepage-independent, reading batch files would be quite inefficient, since batch files are written in plenty of codepages.

As far as I know (warning: old Visual Studio 6.0 information), the codepage is set when the file is opened and is applied to the stream.
So changing the process's current codepage afterwards doesn't affect how the stream is read.
I assume this is the reason why the command shell opens the file every time it parses the next line, and closes it after that.
You can watch this behaviour using Process Monitor.

Liviu wrote:According to <link to msdn removed, as only 2 links allowed> you need to open the file in binary mode for fgetws to read it as Unicode. Otherwise (when opened as text) the file is assumed to be narrow text in the current codepage - as set by your code, or maybe the libraries you are using, or inherited from the parent console, or lastly the system default OEM codepage.

That's not completely true: you may use the ccs flag, see:
http://msdn.microsoft.com/en-us/library/yeby3zcb.aspx.

Darkbatcher wrote:But I'm still not convinced by the translation of UTF-16, because usually, when I type "dir" in a folder where there are names like "trollé", the "é" is displayed as "ù" although there is an equivalent for "é" in the current codepage.

In general this is caused by a mix of the mapping of the Unicode values to the font and something I would cautiously describe as "exotic" behaviour.
It is a mapping to default characters (and this mapping somehow depends on the font used).
To see this, just create a file called "table.dat" with the content (20, 00, 01, ..., FF) in hex notation, and then run this batch ("table.bat").

Code: Select all

@echo off
cls
setlocal enableDelayedExpansion
for /F "tokens=2 delims=:." %%c in ('chcp') do set "cp=%%c"

for %%t in (table1 table2) do set "%%t="
chcp 850 > nul
set /P "table1=" < "table.dat"
chcp 1252 > nul
set /P "table2=" < "table.dat"

rem ASCII range [0 : 127] always stays the same; remove the rem to see it
rem call :display   0  63
rem call :display  64 127

rem may be different on [128 : 255]; remove the rem to see the other chars, too
call :display 128 191
rem call :display 192 255

echo(not only glyphs but character replacement:
chcp 850 > nul
echo(!table2:~149,1!
chcp 1250 > nul
echo(!table2:~149,1!


chcp %cp% >nul
endlocal
goto :eof


:display
   echo(Hex values displayed [%~1 : %~2]
   set "line1= "
   set "line2= "
   for /L %%i in (%~1, 1, %~2) do (
      if "%%i" == "7" (
         rem replaced beep code by a
         set "line1=!line1!a"
         set "line2=!line2!a"
      ) else if "%%i" == "8" (
         rem replaced backspace code by b
         set "line1=!line1!b"
         set "line2=!line2!b"
      ) else if "%%i" == "9" (
         rem replaced horizontal tabulator code by t
         set "line1=!line1!t"
         set "line2=!line2!t"
      ) else if "%%i" == "10" (
         rem replaced newline code by n
         set "line1=!line1!n"
         set "line2=!line2!n"
      ) else if "%%i" == "13" (
         rem replaced carriage return code by c
         set "line1=!line1!c"
         set "line2=!line2!c"
      ) else (
         set "line1=!line1!!table1:~%%i,1!"
         set "line2=!line2!!table2:~%%i,1!"
      )
   )
   set "line1=!line1:~1!"
   set "line2=!line2:~1!"

   chcp 850 > nul
   echo(t1: !line1! ^<- cp  850

   chcp 1252 > nul
   echo(t2: !line2! ^<- cp 1252
   echo(t1: !line1!
   echo(t2: !line2!

   chcp 850 > nul
   echo(t1: !line1! ^<- cp  850
   echo(t2: !line2!
   echo(

   goto :eof

The output should be something like this (browser may corrupt the characters), if you use raster fonts:

Code: Select all

Hex values displayed [128 : 191]
ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø×ƒáíóúñÑªº¿®¬½¼¡«»░▒▓│┤ÁÂÀ©╣║╗╝¢¥┐ <-- cp  850
ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø×ƒáíóúñÑªº¿®¬½¼¡«»░▒▓│┤ÁÂÀ©╣║╗╝¢¥┐ <-- cp 1252
Ã³ÚÔõÓÕþÛÙÞ´¯ý─┼╔µã¶÷‗¹¨ Í▄°úÏÎâßÝ¾·±Ð¬║┐«¼¢╝í½╗ªªªªª┴┬└®ªª++óÑ+
ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø×ƒáíóúñÑªº¿®¬½¼¡«»░▒▓│┤ÁÂÀ©╣║╗╝¢¥┐
ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø×ƒáíóúñÑªº¿®¬½¼¡«»░▒▓│┤ÁÂÀ©╣║╗╝¢¥┐ <-- cp  850
??'ƒ".┼╬^%S<O?Z??''""--~Ts>o?zY ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿

not only glyphs but character replacement:

ò

The variable table1 contains the content of table.dat loaded using cp 850, and
the variable table2 contains the content of table.dat loaded using cp 1252.

What can be seen here:
If you load table.dat into a variable and display it in the same codepage it was loaded in, you always get the same characters displayed (raster fonts only, first two lines).
The third line is the result of codepage mapping of unsupported characters: I don't know the exact mapping, but as there are multiple different replacement characters, I assume it tries to map to the nearest character in some order (unknown to us).
The 4th line is just the same as the second, to make it easy to compare with the next line (the codepage has not changed between these lines).
The 5th line displays table1 on cp 850 again, and you can see that the content wasn't altered, although multiple characters (░▒▓) were mapped to 'ª' in line three.
So only the output data changed, not the internally stored data.
The 6th line is similar to the 4th line, but loaded in cp 1252 and displayed in cp 850 (not vice versa).
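The start of the third output line can be reproduced outside the console. The following Python sketch is only an assumption about what happens, inferred from the output above, and not a documented algorithm: characters loaded under cp 850 are converted to their cp 1252 byte values when the codepage changes, and the raster font then draws those bytes with its cp 850 glyphs. Python's `cp850` and `cp1252` codecs stand in for the console's tables.

```python
# Sketch only; the two-step conversion is an assumption inferred from the output.
loaded = bytes([0x80, 0x81, 0x82]).decode("cp850")  # first three bytes of the range

# Step 1 (assumed): the console converts the characters to cp1252 byte values.
as_1252_bytes = loaded.encode("cp1252")

# Step 2 (assumed): the raster font draws those bytes using cp850 glyphs.
shown = as_1252_bytes.decode("cp850")

print(ascii(loaded), as_1252_bytes.hex(), ascii(shown))
```

Under these assumptions "Çüé" comes out as "Ã³Ú", which matches the first three characters of the third line. The characters ░▒▓ have no cp 1252 equivalent, so `encode` would raise an error for them; the console substitutes a replacement character instead, which this sketch does not model.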

Now change the font to "Lucida Console" and you will see that the ░▒▓ replacement characters have changed to what this font interprets as 'ª' (in the command shell you should see a difference).
Rerun the batch and you will get output like this:

Code: Select all

Hex values displayed [128 : 191]
t1: ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø×ƒáíóúñÑªº¿®¬½¼¡«»░▒▓│┤ÁÂÀ©╣║╗╝¢¥┐ <- cp  850
t2: €‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ <- cp 1252
t1: ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø×ƒáíóúñÑªº¿®¬½¼¡«»░▒▓│┤ÁÂÀ©╣║╗╝¢¥┐
t2: €‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿
t1: ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø×ƒáíóúñÑªº¿®¬½¼¡«»░▒▓│┤ÁÂÀ©╣║╗╝¢¥┐ <- cp  850
t2: €‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿

not only glyphs but character replacement:
•
•

Here you can see that the ░▒▓ replacement characters are different ones.
Also, there is no beep, so it is not only a replacement of glyphs (caused by the change of the font).

Back to your character: cp1252(é) == 0xE9 == 233
What you observed is only an effect of the raster font glyph remapping.
Using the "Lucida Console" font should show you this:

Code: Select all

Z:\>chcp 1252
Aktive Codepage: 1252.

Z:\>set "table="

Z:\>set /P "table=" < "table.dat"

Z:\>echo %table:~233,1%
é

Now just change to "Raster" font and you'll see (without any additional typing):

Code: Select all

Z:\>chcp 1252
Aktive Codepage: 1252.

Z:\>set /P "table=" < "table.dat"

Z:\>echo %table:~233,1%
ù

If you copy and paste the command shell content, you can see that this is the same character (only the glyphs differ).
But, as described above, more than a change of glyph may happen to the characters.

I haven't checked it (and it could only be checked by decompiling the WHOLE cmd.exe, which is too much work just to find that out), but I doubt that this is a special "feature" of cmd.exe. So I think you will not have to find out the algorithm that produces this result: when using different fonts and the (default/Microsoft) display functions, this behaviour should automatically apply to your application, too.

If you want to create system-independent functions, just wrap the platform-specific functions in functions of your own.
The usual approach would be a namespace of your own in C++, or member functions of an object.
The only ANSI C namespaces (struct and typedef) are not really good for that, but you can encapsulate function pointers within a struct.

penpen

Edit: Changed the german msdn link to english (de-de --> en-us).
Edit2: Decompiling the whole cmd.exe is not needed, see next post: Thanks to Liviu.

Liviu · #33 Post by **Liviu** » 26 Apr 2014 16:24

@penpen, nice rundown. Just a couple of notes below.

penpen wrote:As far as i know (warning: old Visual Studio 6.0 information) the codepage is set when opening the file and applied to the stream.

penpen wrote:That's not completely true: you may use the ccs flag, see: <link to German MSDN help for fopen>

VC 6 is a bit dated, in particular about Unicode support in the CRT. For example, the ccs flag was only introduced in VS 2005 a.k.a. VC 8. You can check the English version of the fopen page and select from "other versions" .NET 2003 a.k.a. VC 7.1 - there was no ccs flag back then (for some odd reason, the German page cuts the list short at VS 2005, while the English one goes back to 2003).

In post-2005 VS versions it is also possible to change the mode after the stream is opened, helpful for stdin/stdout which the program doesn't fopen itself. There is a discussion about _setmode and _O_U16TEXT archived at https://web.archive.org/web/20100316160056/http://blogs.msdn.com/michkap/archive/2008/03/18/8306597.aspx (archived, since that entire blog seems to have been retired from its original msdn address, though it used to be a great resource on topics of "internationalization").

penpen wrote:When using different fonts and the (default/microsoft) display functions, this behaviour should automatically apply to your application, too.

Copy/pasting from another post, but 'when codepages and Unicode are involved, you should be using a Unicode/TT - not raster - font in the console. Besides aesthetics, using a raster font actually changes the automatic codepage conversions done by the console, for output in particular (see the MSDN SetConsoleOutputCP docs: "if the current font is a raster font, SetConsoleOutputCP does not affect how extended characters are displayed")'.

Liviu

Darkbatcher · #34 Post by **Darkbatcher** » 28 Apr 2014 10:09

OK!

That's really interesting. Thank you so much for the explanation of the font behaviour. If I understand correctly, a TrueType font should be used to make this work the way it is intended to. Do you know why raster fonts are used by default? It's always sad when not every feature is enabled by default.

Anyway, thank you for your help, guys, it has been really interesting. (And I now have work for days! :mrgreen:)

@+

#35 Post by **penpen** » 28 Apr 2014 10:50

The TrueType font "Lucida Console" isn't the default font because it was designed around 2004, while NT has existed since about 1994.

Side note: if you want to know why no other font can be selected as a cmd.exe shell font, see http://blogs.msdn.com/b/oldnewthing/archive/2007/05/16/2659903.aspx.

penpen

DosTips.com
