Question about code pages, characters and fonts

Aacini
Expert
Posts: 1914
Joined: 06 Dec 2011 22:15
Location: México City, México

#1 Post by Aacini » 07 Mar 2017 17:17

I have read so many different descriptions of this matter that I am now confused. Could someone help me clear up the points below in a simple and concise way? I am interested in the output generated by the standard cmd.exe commands we all know, like ECHO.

  • ASCII characters in the 32-127 range are displayed the same way on all computers, regardless of the locale. Right?
  • Characters in the 128-255 range are not displayed the same way; they depend on the locale of each computer.
  • If the same code page is set with the CHCP command on two computers and the same font is selected in the cmd.exe window, the characters in the 128-255 range are displayed the same way. Right?
  • Can code pages 437 and/or 850 be set with the CHCP command on all computers? If the answer is NO, please explain the reason and a possible solution, if one exists.
  • Are the "standard" Raster fonts with sizes 4x6, 6x8, 8x8, 16x8, 5x12, 7x12, 8x12, 16x12, 12x16 and 10x18 preinstalled on all computers?
  • More precisely: if all computers have these Raster fonts preinstalled, do they all have the same character definitions/pixel layouts/glyphs?
  • If the previous answer is NO: can the corresponding *.fon files be copied from one computer and installed on another, so that both computers display the same characters/glyphs?

Thanks!

Antonio

penpen
Expert
Posts: 2009
Joined: 23 Jun 2013 06:15
Location: Germany

Re: Question about code pages, characters and fonts

#2 Post by penpen » 08 Mar 2017 10:05

Aacini wrote:
  • ASCII characters in the 32-127 range are displayed the same way on all computers, regardless of the locale. Right?
  • Characters in the 128-255 range are not displayed the same way; they depend on the locale of each computer.
That depends on what you mean.
The above is true for all ASCII-based code pages (by definition).

But there are code pages that are not ASCII-based, for example the IBM EBCDIC-based code page 37:
'A' == U+0041 == codepage_437(0x41) == codepage_037(0xC1).
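
A minimal sketch to check this mapping, assuming Python is available (it is not used elsewhere in this thread) and assuming that Python's standard codecs cp437 and cp037 match the Windows code pages of the same numbers closely enough for this purpose. The helper name byte_for is made up for the illustration:

```python
# Encode the same character under an ASCII-based and an EBCDIC-based code page.
def byte_for(char, codec):
    """Return the single byte value that `codec` assigns to `char`."""
    encoded = char.encode(codec)
    assert len(encoded) == 1, "expected a single-byte code page"
    return encoded[0]

ascii_based = byte_for("A", "cp437")    # OEM United States
ebcdic_based = byte_for("A", "cp037")   # IBM EBCDIC US-Canada

print(f"U+{ord('A'):04X} cp437=0x{ascii_based:02X} cp037=0x{ebcdic_based:02X}")
```

The code point stays U+0041 in both cases; only the byte that represents it changes, which is why the "32-127 is the same everywhere" rule holds only for ASCII-based code pages.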


Aacini wrote:
  • If the same code page is set with the CHCP command on two computers and the same font is selected in the cmd.exe window, the characters in the 128-255 range are displayed the same way. Right?
Nearly:
There may be different versions of the font, or the text layout may be different (right to left/left to right), so the glyphs used could still differ.


Aacini wrote:
  • Can code pages 437 and/or 850 be set with the CHCP command on all computers? If the answer is NO, please explain the reason and a possible solution, if one exists.
I don't know whether a default Windows XP ... Windows 10 PC may come with only one of them.
It should be possible to have only one of the two code pages installed.
If a code page is missing, the only solution is to install it from the Windows setup CD or any other source (I don't know for sure, but I assume you can download it somewhere on the MS websites).


Aacini wrote:
  • Are the "standard" Raster fonts with sizes 4x6, 6x8, 8x8, 16x8, 5x12, 7x12, 8x12, 16x12, 12x16 and 10x18 preinstalled on all computers?
This is true for all XP setup disks published by MS.
I assume this is also true up to Windows 10, but I don't know for sure.
In addition, you can probably create a custom XP setup disk that excludes some of these bitmap fonts.


Aacini wrote:
  • More precisely: if all computers have these Raster fonts preinstalled, do they all have the same character definitions/pixel layouts/glyphs?
The main idea of these fonts is to look the same no matter which code page you use, but they might differ between Windows versions (possibly because of different font versions).


Aacini wrote:
  • If the previous answer is NO: can the corresponding *.fon files be copied from one computer and installed on another, so that both computers display the same characters/glyphs?
As far as I know, the bitmap fonts used as "Rasterfont" (at least under Windows XP) are separated into multiple font files.
The mapping used is probably buried somewhere deep within the registry.

It might be sufficient (today) to just install the missing fonts, but the results could easily be inconsistent in such a case;
for example, if you install "font.fon" (sizes 4x6 and 6x8) when it is already partially installed (only 4x6 is missing), then your Windows version:
- might install both, uninstalling the font used for 4x6, or
- might not install "font.fon" at all, or
- might do whatever else MS has implemented (if anything).


penpen

aGerman
Expert
Posts: 4678
Joined: 22 Jan 2010 18:01
Location: Germany

Re: Question about code pages, characters and fonts

#3 Post by aGerman » 08 Mar 2017 11:54

Antonio

To be honest, this is all largely guesswork because I didn't find any official documentation.
In addition to what penpen wrote:
Aacini wrote:
  • Characters in the 128-255 range are not displayed the same way; they depend on the locale of each computer.
It seems to depend on the registry values ACP and OEMCP under the key "HKLM\SYSTEM\CurrentControlSet\Control\Nls\CodePage".

Aacini wrote:
  • If the same code page is set with the CHCP command on two computers and the same font is selected in the cmd.exe window, the characters in the 128-255 range are displayed the same way. Right?
If we are talking about a Batch file, it also depends on the charset in which the file was saved. The same byte value should be displayed in the same way.
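
A rough sketch of why the saved charset matters, assuming Python and its standard codecs cp1252, cp437 and cp850 as stand-ins for the Windows code pages of the same numbers. The letter ø (U+00F8) saved by an ANSI (1252) editor becomes one byte, which a console then interprets under its own code page:

```python
# One byte, several interpretations: what an editor saved vs. what a console maps it to.
saved = "\u00f8".encode("cp1252")   # the letter o-with-stroke as written by an ANSI editor

shown = {cp: saved.decode(cp) for cp in ("cp1252", "cp437", "cp850")}
for cp, ch in shown.items():
    print(f"byte 0x{saved[0]:02X} under {cp}: U+{ord(ch):04X}")

# The two OEM code pages also disagree with each other on some bytes, e.g. 0x9B.
byte_9b = {cp: b"\x9b".decode(cp) for cp in ("cp437", "cp850")}
for cp, ch in byte_9b.items():
    print(f"byte 0x9B under {cp}: U+{ord(ch):04X}")
```

The output is printed as code points rather than glyphs so that it does not itself depend on the console's code page or font.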

Aacini wrote:
  • Can code pages 437 and/or 850 be set with the CHCP command on all computers? If the answer is NO, please explain the reason and a possible solution, if one exists.
I assume that at least 437 should be present.
You may check the related NLS files or registry entries. I took the list of CP IDs from https://msdn.microsoft.com/en-us/library/dd317756.aspx
in order to search for missing code pages. Note that Unicode CPs don't have separate files and that some CPs in the registry redirect to another file.

Code: Select all

@echo off &setlocal
for %%h in (
  "037|IBM EBCDIC US-Canada"
  "437|OEM United States"
  "500|IBM EBCDIC International"
  "708|Arabic (ASMO 708)"
  "709|Arabic (ASMO-449+, BCON V4)"
  "710|Arabic - Transparent Arabic"
  "720|Arabic (Transparent ASMO); Arabic (DOS)"
  "737|OEM Greek (formerly 437G); Greek (DOS)"
  "775|OEM Baltic; Baltic (DOS)"
  "850|OEM Multilingual Latin 1; Western European (DOS)"
  "852|OEM Latin 2; Central European (DOS)"
  "855|OEM Cyrillic (primarily Russian)"
  "857|OEM Turkish; Turkish (DOS)"
  "858|OEM Multilingual Latin 1 + Euro symbol"
  "860|OEM Portuguese; Portuguese (DOS)"
  "861|OEM Icelandic; Icelandic (DOS)"
  "862|OEM Hebrew; Hebrew (DOS)"
  "863|OEM French Canadian; French Canadian (DOS)"
  "864|OEM Arabic; Arabic (864)"
  "865|OEM Nordic; Nordic (DOS)"
  "866|OEM Russian; Cyrillic (DOS)"
  "869|OEM Modern Greek; Greek, Modern (DOS)"
  "870|IBM EBCDIC Multilingual/ROECE (Latin 2); IBM EBCDIC Multilingual Latin 2"
  "874|ANSI/OEM Thai (ISO 8859-11); Thai (Windows)"
  "875|IBM EBCDIC Greek Modern"
  "932|ANSI/OEM Japanese; Japanese (Shift-JIS)"
  "936|ANSI/OEM Simplified Chinese (PRC, Singapore); Chinese Simplified (GB2312)"
  "949|ANSI/OEM Korean (Unified Hangul Code)"
  "950|ANSI/OEM Traditional Chinese (Taiwan; Hong Kong SAR, PRC); Chinese Traditional (Big5)"
  "1026|IBM EBCDIC Turkish (Latin 5)"
  "1047|IBM EBCDIC Latin 1/Open System"
  "1140|IBM EBCDIC US-Canada (037 + Euro symbol); IBM EBCDIC (US-Canada-Euro)"
  "1141|IBM EBCDIC Germany (20273 + Euro symbol); IBM EBCDIC (Germany-Euro)"
  "1142|IBM EBCDIC Denmark-Norway (20277 + Euro symbol); IBM EBCDIC (Denmark-Norway-Euro)"
  "1143|IBM EBCDIC Finland-Sweden (20278 + Euro symbol); IBM EBCDIC (Finland-Sweden-Euro)"
  "1144|IBM EBCDIC Italy (20280 + Euro symbol); IBM EBCDIC (Italy-Euro)"
  "1145|IBM EBCDIC Latin America-Spain (20284 + Euro symbol); IBM EBCDIC (Spain-Euro)"
  "1146|IBM EBCDIC United Kingdom (20285 + Euro symbol); IBM EBCDIC (UK-Euro)"
  "1147|IBM EBCDIC France (20297 + Euro symbol); IBM EBCDIC (France-Euro)"
  "1148|IBM EBCDIC International (500 + Euro symbol); IBM EBCDIC (International-Euro)"
  "1149|IBM EBCDIC Icelandic (20871 + Euro symbol); IBM EBCDIC (Icelandic-Euro)"
  "1200|Unicode UTF-16, little endian byte order (BMP of ISO 10646); available only to managed applications"
  "1201|Unicode UTF-16, big endian byte order; available only to managed applications"
  "1250|ANSI Central European; Central European (Windows)"
  "1251|ANSI Cyrillic; Cyrillic (Windows)"
  "1252|ANSI Latin 1; Western European (Windows)"
  "1253|ANSI Greek; Greek (Windows)"
  "1254|ANSI Turkish; Turkish (Windows)"
  "1255|ANSI Hebrew; Hebrew (Windows)"
  "1256|ANSI Arabic; Arabic (Windows)"
  "1257|ANSI Baltic; Baltic (Windows)"
  "1258|ANSI/OEM Vietnamese; Vietnamese (Windows)"
  "1361|Korean (Johab)"
  "10000|MAC Roman; Western European (Mac)"
  "10001|Japanese (Mac)"
  "10002|MAC Traditional Chinese (Big5); Chinese Traditional (Mac)"
  "10003|Korean (Mac)"
  "10004|Arabic (Mac)"
  "10005|Hebrew (Mac)"
  "10006|Greek (Mac)"
  "10007|Cyrillic (Mac)"
  "10008|MAC Simplified Chinese (GB 2312); Chinese Simplified (Mac)"
  "10010|Romanian (Mac)"
  "10017|Ukrainian (Mac)"
  "10021|Thai (Mac)"
  "10029|MAC Latin 2; Central European (Mac)"
  "10079|Icelandic (Mac)"
  "10081|Turkish (Mac)"
  "10082|Croatian (Mac)"
  "12000|Unicode UTF-32, little endian byte order; available only to managed applications"
  "12001|Unicode UTF-32, big endian byte order; available only to managed applications"
  "20000|CNS Taiwan; Chinese Traditional (CNS)"
  "20001|TCA Taiwan"
  "20002|Eten Taiwan; Chinese Traditional (Eten)"
  "20003|IBM5550 Taiwan"
  "20004|TeleText Taiwan"
  "20005|Wang Taiwan"
  "20105|IA5 (IRV International Alphabet No. 5, 7-bit); Western European (IA5)"
  "20106|IA5 German (7-bit)"
  "20107|IA5 Swedish (7-bit)"
  "20108|IA5 Norwegian (7-bit)"
  "20127|US-ASCII (7-bit)"
  "20261|T.61"
  "20269|ISO 6937 Non-Spacing Accent"
  "20273|IBM EBCDIC Germany"
  "20277|IBM EBCDIC Denmark-Norway"
  "20278|IBM EBCDIC Finland-Sweden"
  "20280|IBM EBCDIC Italy"
  "20284|IBM EBCDIC Latin America-Spain"
  "20285|IBM EBCDIC United Kingdom"
  "20290|IBM EBCDIC Japanese Katakana Extended"
  "20297|IBM EBCDIC France"
  "20420|IBM EBCDIC Arabic"
  "20423|IBM EBCDIC Greek"
  "20424|IBM EBCDIC Hebrew"
  "20833|IBM EBCDIC Korean Extended"
  "20838|IBM EBCDIC Thai"
  "20866|Russian (KOI8-R); Cyrillic (KOI8-R)"
  "20871|IBM EBCDIC Icelandic"
  "20880|IBM EBCDIC Cyrillic Russian"
  "20905|IBM EBCDIC Turkish"
  "20924|IBM EBCDIC Latin 1/Open System (1047 + Euro symbol)"
  "20932|Japanese (JIS 0208-1990 and 0212-1990)"
  "20936|Simplified Chinese (GB2312); Chinese Simplified (GB2312-80)"
  "20949|Korean Wansung"
  "21025|IBM EBCDIC Cyrillic Serbian-Bulgarian"
  "21027|(deprecated)"
  "21866|Ukrainian (KOI8-U); Cyrillic (KOI8-U)"
  "28591|ISO 8859-1 Latin 1; Western European (ISO)"
  "28592|ISO 8859-2 Central European; Central European (ISO)"
  "28593|ISO 8859-3 Latin 3"
  "28594|ISO 8859-4 Baltic"
  "28595|ISO 8859-5 Cyrillic"
  "28596|ISO 8859-6 Arabic"
  "28597|ISO 8859-7 Greek"
  "28598|ISO 8859-8 Hebrew; Hebrew (ISO-Visual)"
  "28599|ISO 8859-9 Turkish"
  "28603|ISO 8859-13 Estonian"
  "28605|ISO 8859-15 Latin 9"
  "29001|Europa 3"
  "38598|ISO 8859-8 Hebrew; Hebrew (ISO-Logical)"
  "50220|ISO 2022 Japanese with no halfwidth Katakana; Japanese (JIS)"
  "50221|ISO 2022 Japanese with halfwidth Katakana; Japanese (JIS-Allow 1 byte Kana)"
  "50222|ISO 2022 Japanese JIS X 0201-1989; Japanese (JIS-Allow 1 byte Kana - SO/SI)"
  "50225|ISO 2022 Korean"
  "50227|ISO 2022 Simplified Chinese; Chinese Simplified (ISO 2022)"
  "50229|ISO 2022 Traditional Chinese"
  "50930|EBCDIC Japanese (Katakana) Extended"
  "50931|EBCDIC US-Canada and Japanese"
  "50933|EBCDIC Korean Extended and Korean"
  "50935|EBCDIC Simplified Chinese Extended and Simplified Chinese"
  "50936|EBCDIC Simplified Chinese"
  "50937|EBCDIC US-Canada and Traditional Chinese"
  "50939|EBCDIC Japanese (Latin) Extended and Japanese"
  "51932|EUC Japanese"
  "51936|EUC Simplified Chinese; Chinese Simplified (EUC)"
  "51949|EUC Korean"
  "51950|EUC Traditional Chinese"
  "52936|HZ-GB2312 Simplified Chinese; Chinese Simplified (HZ)"
  "54936|Windows XP and later: GB18030 Simplified Chinese (4 byte); Chinese Simplified (GB18030)"
  "57002|ISCII Devanagari"
  "57003|ISCII Bangla"
  "57004|ISCII Tamil"
  "57005|ISCII Telugu"
  "57006|ISCII Assamese"
  "57007|ISCII Odia"
  "57008|ISCII Kannada"
  "57009|ISCII Malayalam"
  "57010|ISCII Gujarati"
  "57011|ISCII Punjabi"
  "65000|Unicode (UTF-7)"
  "65001|Unicode (UTF-8)"
) do for /f "tokens=1* delims=|" %%i in (%%h) do (
  if not exist "%SystemRoot%\System32\C_%%i.NLS" echo file %%i %%j
  for /f "tokens=* delims=0" %%k in ("%%i") do (
    reg query "HKLM\SYSTEM\CurrentControlSet\Control\Nls\CodePage" /v %%k >nul 2>&1 || echo key %%i %%j
  )
)
pause
Maybe interesting for programmers: https://gist.github.com/ynkdir/b92727e2a52e55a4010f

Aacini wrote:
  • More precisely: if all computers have these Raster fonts preinstalled, do they all have the same character definitions/pixel layouts/glyphs?
I'm not sure. I can imagine that they are different in Chinese, Japanese, ... environments.

Steffen

Squashman
Expert
Posts: 4486
Joined: 23 Dec 2011 13:59

Re: Question about code pages, characters and fonts

#4 Post by Squashman » 08 Mar 2017 13:21

Interesting. I did not know there were that many EBCDIC code pages. FTP transfers to our mainframe default to 037, but if a file contains extended ASCII characters I use 1047.

Aacini
Expert
Posts: 1914
Joined: 06 Dec 2011 22:15
Location: México City, México

Re: Question about code pages, characters and fonts

#5 Post by Aacini » 08 Mar 2017 22:03

@penpen and @aGerman, thanks a lot for your answers! :D


I found this information in the Win32 API documentation:

IsValidCodePage function description wrote:Starting with Windows Vista, all code pages that can be installed are loaded by default.

This means that code pages 437 or 850 (or any other one, for that matter) can always be set with the CHCP command on all computers with Windows Vista or later.


SetConsoleOutputCP function description wrote:A code page maps 256 character codes to individual characters. Different code pages include different special characters, typically customized for a language or a group of languages.

If the current font is a fixed-pitch Unicode font, SetConsoleOutputCP changes the mapping of the character values into the glyph set of the font, rather than loading a separate font each time it is called. This affects how extended characters (ASCII value greater than 127) are displayed in a console window. However, if the current font is a raster font, SetConsoleOutputCP does not affect how extended characters are displayed.

This means that if the current font is a raster font and the code page is ASCII-based (like 437 or 850), then the characters/glyphs displayed depend exclusively on the installed raster font.

This also means that if I take the *.fon raster font files from one computer and install them on another one, both computers will show exactly the same characters/glyphs under these conditions.

Note that I have not mentioned the problem related to the installation of the copied *.fon font files in the other computer. I just assumed that if those font files are installed, then this mechanism should work as described.

Do you think this conclusion is correct?

Antonio

penpen
Expert
Posts: 2009
Joined: 23 Jun 2013 06:15
Location: Germany

Re: Question about code pages, characters and fonts

#6 Post by penpen » 09 Mar 2017 12:23

Aacini wrote:This means that code pages 437 or 850 (or any other one, for that matter) can always be set with the CHCP command on all computers with Windows Vista or later.
Well, I agree with your interpretation of the given sentence, but I'm still unsure whether it's right:
For example, I can't use code page 932 (shift_jis; ANSI/OEM Japanese; Japanese (Shift-JIS)), although it is fully installed under my Windows 10 (according to an XP-related source, but maybe something has changed):
https://support.microsoft.com/de-de/help/164948/how-to-install-a-code-page.
The same applies to code pages 709, 710, 936, 949, 950, 1200, 1201, 12000, 12001, 29001, 50930, 50931, 50933, 50935, 50936, 50937, 50939, 51932, 51936, and 51950.

Code: Select all

Z:\>chcp 932
Ungültige Codepage.
("Ungültige Codepage." is German for "Invalid code page.") It could also be that MS forgot to mention "available only to managed applications", as is noted for code page 1200, ... .


Aacini wrote:This means that if the current font is a raster font and the code page is ASCII-based (like 437 or 850), then the characters/glyphs displayed depend exclusively on the installed raster font.
No: if the current raster font is registered temporarily (for one session, like carlos' 1-pixel font), then the display doesn't depend on the installed raster fonts but on the raster font you are currently using.


Aacini wrote:This also means that if I take the *.fon raster font files from one computer and install them on another one, both computers will show exactly the same characters/glyphs under these conditions.

Note that I have not mentioned the problem related to the installation of the copied *.fon font files in the other computer. I just assumed that if those font files are installed, then this mechanism should work as described.

Do you think this conclusion is correct?
Honestly: I don't know.
The "Rasterfont" is not a single font file, and you may run into some issues:
Windows might "protect essential fonts", or choose another alternative, or ... (you have to try).

If I remember correctly, the related fonts are:
- "%SystemRoot%\Fonts\cga*.fon"
- "%SystemRoot%\Fonts\ega*.fon"
- "%SystemRoot%\Fonts\vga*.fon"
- "%SystemRoot%\Fonts\sma*.fon"
- ... (maybe more)

It would probably be easier to register your own font for a specific session, as carlos did.
You only need to access the special characters using UTF-8 (65001).
If you create such a font, you must also meet the requirements mentioned here.


penpen

Edit: Link now is placed in URL tag.

einstein1969
Expert
Posts: 960
Joined: 15 Jun 2012 13:16
Location: Italy, Rome

Re: Question about code pages, characters and fonts

#7 Post by einstein1969 » 10 Mar 2017 11:26

Hi Antonio,

+1 for this code page, fonts and characters questions :D . I'm interested too. 8)

einstein1969

jfl
Posts: 226
Joined: 26 Oct 2012 06:40
Location: Saint Hilaire du Touvet, France

Re: Question about code pages, characters and fonts

#8 Post by jfl » 15 Mar 2017 14:16

By a lucky coincidence, I’ve been working for the past couple of months on several tools that are very relevant to this discussion!

1) I've updated my Microsoft C library eXtensions Library (MsvcLibX),
so that all text written to stdout or stderr, and that goes to the console, is transparently written as UTF-16 Unicode.
This ensures that the expected characters are displayed correctly, whatever the current code page is,
and even if they're not part of that code page.

If stdout or stderr is redirected to a pipe or a file, then the output is converted to the current code page.
This is consistent with what cmd.exe itself does. For example:

Code: Select all

dir Non-ASCII-dir

displays Unicode file names, even with characters not in the current code page.

Code: Select all

dir Non-ASCII-dir | more

converts Unicode file names into the current code page, changing missing characters to '?'.

Give it a try after extracting this Non-ASCII.zip file.
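
The '?' substitution can be imitated outside of cmd.exe. The following is only a sketch, assuming Python and its cp437 codec; Windows performs the conversion with its own tables and may apply best-fit mappings that Python's errors="replace" does not reproduce. The function name to_code_page and the file name are made up for the illustration:

```python
def to_code_page(name, codec="cp437"):
    """Convert a Unicode name to `codec`, replacing unmappable characters with '?'."""
    return name.encode(codec, errors="replace").decode(codec)

# Four Cyrillic letters followed by "-é.txt": é exists in code page 437, Cyrillic does not.
unicode_name = "\u0444\u0430\u0439\u043b-\u00e9.txt"
converted = to_code_page(unicode_name)

# Print with escapes so the result does not depend on the console's own code page.
print(converted.encode("ascii", "backslashreplace").decode("ascii"))
```

Once a character has been replaced by '?', the original name cannot be recovered from the piped text, which is the practical difference between the two DIR commands above.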

2) I've written a new tool, called codepage.exe, that gives information
about the available code pages in your console.
- Without argument, it lists the current console and system code pages.
- Option -i lists all installed code pages that you can use. (Similar to the list that aGerman posted above)
- Option -s lists all supported code pages that you can install. (Same results as -i on Windows 7/8/10, but different in XP)
- Giving it a code page number displays a table of characters for that code page.
(And yes, thanks to the MsvcLibX update, they will be visible correctly whatever your current code page is :-) )

3) I've updated my conv.exe code page conversion tool with the following features:
- It now checks the input file BOM, and automatically selects the right input encoding (ANSI/UTF7/UTF8/UTF16).
- Again, thanks to the MsvcLibX update, by default it outputs Unicode to the console.
This allows typing a non-ASCII text file to the console, without having to know its encoding. Just run:

Code: Select all

conv My_File.txt
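
BOM-based detection of this kind can be sketched as follows. This is not the code of conv.exe; it is a minimal Python illustration that assumes only the UTF-8 and UTF-16 byte order marks are checked (UTF-7 is left out) and that a file without a BOM is treated as ANSI, for which cp1252 is a hypothetical default here:

```python
import codecs

# Byte order marks to look for, paired with the codec used for the remaining bytes.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_encoding(data, fallback="cp1252"):
    """Return (encoding, bom_length) guessed from the leading bytes of `data`."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return fallback, 0   # no BOM found: assume an ANSI code page

def decode_text(data):
    """Decode `data` with the detected encoding, skipping the BOM itself."""
    encoding, skip = detect_encoding(data)
    return data[skip:].decode(encoding)

print(detect_encoding(b"\xff\xfeA\x00"))
```

A UTF-32 LE file would be misdetected by this sketch, because its BOM begins with the same two bytes as the UTF-16 LE one; a fuller version would test the longer BOM first.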


4) I've rebuilt all my tool box with the new library.
So for example the dirc.exe directory comparison tool
will display Russian or Hebrew file names in the US OEM code page.

5) I've used MsvcLibX to build a Windows version of The Silver Searcher (ag.exe)
with the same multilingual capability.
This tool is a fast and powerful text search tool, originally built for Linux.
For information about this tool, see the ag home page or that old post.
The sources of my Windows port are there.

If you find bugs (which is likely, as all this is very new!), please preferably report them in their respective GitHub interfaces.

Any feedback welcome.

Enjoy!

Jean-François

misol101
Posts: 475
Joined: 02 May 2016 18:20

Re: Question about code pages, characters and fonts

#9 Post by misol101 » 15 Mar 2017 16:08

This all seems awfully complicated. Is it alright if I ask a related question here? Basically, the question is, what does it mean to "support Unicode" in a cmd window?

See attached image. I used Google Translate, and renamed one file into Chinese and one into Arabic (upper part of image).

Image

I then went into a cmd window and did a regular "dir" command, as shown in middle part. The files show up as ???.txt etc. At first I was using a bitmap font, so I figured ok I guess they are missing from the font. I then tried Consolas and Lucida Console fonts, that are selectable from my Win 7 cmd window, but again, these glyphs are all missing. If none of the actual fonts used by Microsoft for cmd even supports characters outside of the ASCII range, then what would be the purpose of supporting such input?

Granted, a command like "type" did accept ???.txt as input when tabbing to get the file name, but it all seems very awkward to use like this.

Am I missing something? I'm asking this because it was suggested to me that my external tools should support Unicode and seeing this I just don't really see the point for a cmd line tool.

aGerman
Expert
Posts: 4678
Joined: 22 Jan 2010 18:01
Location: Germany

Re: Question about code pages, characters and fonts

#10 Post by aGerman » 15 Mar 2017 17:12

misol101 wrote:Am I missing something? I'm asking this because it was suggested to me that my external tools should support Unicode and seeing this I just don't really see the point for a cmd line tool.

Internally cmd.exe can handle UTF-16. The restrictions are 1) the font of the console window and 2) the required single-byte charset of batch code.
To answer your question:
You can't process the output of DIR if the font doesn't support the characters of the file name. Try ...

Code: Select all

for /f "delims=" %%i in ('dir /a-d /b *.txt') do notepad "%%~i"
... vs. ...

Code: Select all

for %%i in (*.txt) do notepad "%%~i"
The latter should work without problems for your Arabic file name.
Furthermore you can process a dragged/dropped file using %1. Try

Code: Select all

notepad %1


Even though cmd.exe supports Unicode, other Microsoft command line utilities do not.

Code: Select all

findstr "a" %1
As you can see, while notepad.exe supports Unicode arguments, findstr.exe does not. This makes batch code inconsistent. Don't make the same mistake as Microsoft :!:

Steffen

penpen
Expert
Posts: 2009
Joined: 23 Jun 2013 06:15
Location: Germany

Re: Question about code pages, characters and fonts

#11 Post by penpen » 15 Mar 2017 20:52

aGerman wrote:You can't process the output of DIR if the font doesn't support the characters of the file name. Try ...

Code: Select all

for /f "delims=" %%i in ('dir /a-d /b *.txt') do notepad "%%~i"
No, that's wrong:
You can process Unicode data regardless of whether the font defines glyphs for the needed characters.

Definition ("application A supports Unicode", loosely):
A is able to work with Unicode characters properly (read, process, write).
It doesn't guarantee that the characters can be displayed.
It also doesn't guarantee that the main process or any subprocess handles Unicode properly; it just means that A can.

The reason why your example above does not work properly is that the Unicode characters are mapped to the current code page after the dir output is generated.

Working example (no matter which font, for example: "Rasterfont"):

Code: Select all

@echo off
setlocal enableExtensions enableDelayedExpansion
for /f "tokens=2 delims=:." %%a in ('chcp') do set "cp=%%~a"
set "cp=%cp: =%"
if "%cp%" == "65001" >nul chcp 850
>"dummy" echo(ÿþA4B4C4D4E4F4

>nul chcp 65001
>"dummy.txt" type "dummy"
<"dummy.txt" set /p "line="
set "line=%line:~0,-1%"

set "line"
>"%line%.txt" echo(test1
>"%line:~0,4%.txt" echo(test2

for /f "delims=" %%i in ('dir /a-d /b "%line:~0,1%*.txt"') do notepad "%%~i"


>nul chcp %cp%
endlocal
goto :eof
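
One possible reading of the byte trick in the script, sketched in Python (an assumption of this note, not part of the batch solution). It assumes the batch file is saved in an ANSI code page where ÿþ are the bytes FF FE, so that ECHO writes a UTF-16 LE byte order mark followed by ASCII bytes that pair up into UTF-16 code units. The CR LF appended by ECHO then forms one extra code unit, which is presumably why the script drops the last character of the line:

```python
# Bytes that `echo(ÿþA4B4C4D4E4F4` is assumed to write to the file "dummy".
payload = b"\xff\xfe" + b"A4B4C4D4E4F4" + b"\r\n"

# The BOM selects little endian and is consumed by the decoder.
text = payload.decode("utf-16")
code_units = [f"U+{ord(ch):04X}" for ch in text]
print(code_units)

# Mirrors `set "line=%line:~0,-1%"`: remove the code unit built from CR LF.
name = text[:-1]
print(len(name))
```

Each pair such as "A4" is the byte sequence 41 34, which little endian UTF-16 reads as the single code unit U+3441, a character that typical console fonts have no glyph for.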


penpen

misol101
Posts: 475
Joined: 02 May 2016 18:20

Re: Question about code pages, characters and fonts

#12 Post by misol101 » 16 Mar 2017 06:46

Thanks for the explanation you two, I'll keep it in mind!

aGerman
Expert
Posts: 4678
Joined: 22 Jan 2010 18:01
Location: Germany

Re: Question about code pages, characters and fonts

#13 Post by aGerman » 16 Mar 2017 11:30

penpen wrote:No, that's wrong:
You can process unicode data no matter if the font has defined glyphs for the needed characters, or not.

:shock: I always thought it was the font that restricts the character support in the console window. But you are absolutely right! I just copied and pasted Chinese glyphs into a cmd window where I had set CP 65001. Surprisingly, I was able to copy them back correctly even though they were displayed as question marks in a box. Thanks for pointing this out!

Steffen

penpen
Expert
Posts: 2009
Joined: 23 Jun 2013 06:15
Location: Germany

Re: Question about code pages, characters and fonts

#14 Post by penpen » 16 Mar 2017 19:18

aGerman wrote: :shock: I always thought it was the font that restricts the character support in the console window.
That was true in the (good) old (MS-DOS 6.22) days; because of that, *.FNT files were called CodePageInformation files:
http://www.seasip.info/DOS/CPI/cpi.html

But since XP (? I'm unsure, maybe Win 95/98/ME/NT), Windows treats these files only as a glyph database (ignoring mappings outside the "real" code page);
since then, the code pages have been located in *.nls files, which nowadays can also contain DLLs.


penpen
