ctrl-z blues

Sponge Belly
Posts: 231
Joined: 01 Oct 2012 13:32
Location: Ireland

Re: ctrl-z blues

#16 Post by Sponge Belly » 05 Nov 2013 16:34

Hi Squashman! :-)

Speaking of Line Feeds, I stumbled across some bizarre behaviour in my tests. :twisted:

Create a test file called hi.txt with your favourite text editor:

Code: Select all

<255><254>hi there<SUB><CR><LF>


The first two bytes in the file are 255 and 254 (hex FF FE). They are not ASCII characters; together they form the Byte Order Mark (BOM), which Windows normally writes at the start of Unicode (UTF-16 LE, to be precise) files.
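Most text editors will not let you type these bytes directly, so here is a minimal Python sketch that builds the same content. The names HI_BYTES and write_test_file are made up for illustration and are not part of the original test.

```python
import codecs

# Exact bytes of hi.txt as described above: BOM, "hi there", SUB, CR, LF.
HI_BYTES = b"\xff\xfe" + b"hi there" + b"\x1a\r\n"


def write_test_file(path="hi.txt"):
    """Write the test bytes verbatim; binary mode avoids newline translation."""
    with open(path, "wb") as f:
        f.write(HI_BYTES)


# The first two bytes match the UTF-16 little-endian byte order mark.
print(HI_BYTES[:2] == codecs.BOM_UTF16_LE)
# Total length in bytes; an odd number matters for what follows in the thread.
print(len(HI_BYTES))
```

Calling write_test_file() creates hi.txt in the current directory. Note that the file holds 13 bytes, so after the 2-byte BOM there are 11 bytes left, which is not a whole number of 16-bit units.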

Anyways, once you’ve created the file, enter the following command:

Code: Select all

cmd /d /u /c type hi.txt > ansihi.txt


Edit ansihi.txt and you will find:

Code: Select all

hi there<SUB><CR>


The BOM is gone and so is the Line Feed! :?:

But the weirdness doesn’t end there…

Code: Select all

fc /b ansihi.txt hi.txt
Comparing files ansihi.txt and HI.TXT
00000000: 68 FF
00000001: 69 FE
00000002: 20 68
00000003: 74 69
00000004: 68 20
00000005: 65 74
00000006: 72 68
00000008: 1A 72
00000009: 0D 65
FC: HI.TXT longer than ansihi.txt


Do you see that? The line for offset 00000007 is missing from the output! :shock:

Is this an fc bug? I wasn’t aware of it.

Well, I must go lie down in a darkened room with a damp cloth over my eyes now…

- SB

aGerman
Expert
Posts: 4678
Joined: 22 Jan 2010 18:01
Location: Germany

Re: ctrl-z blues

#17 Post by aGerman » 05 Nov 2013 17:04

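Sponge Belly wrote:Is this an fc bug? I wasn’t aware of it.

No, it's the normal behaviour. FC displays only the differences. Because the BOM was removed, everything in ansihi.txt sits two bytes earlier, so at offset 00000007 the last "e" of "there" in ansihi.txt lines up with the first "e" of "there" in hi.txt. The two bytes are equal, and FC prints nothing for that offset.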
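To see this without fc, here is a small Python sketch of a byte-by-byte comparison that, like fc /b, reports only the offsets where the two inputs differ. The function name fc_binary is hypothetical, and the two byte strings are the file contents as described in the posts above.

```python
def fc_binary(a, b):
    """Return (offset, byte_a, byte_b) for every offset where a and b differ.

    Only the length of the shorter input is compared.
    """
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


ansihi = b"hi there\x1a\r"          # ansihi.txt as found after the type command
hi = b"\xff\xfehi there\x1a\r\n"    # hi.txt with its two leading BOM bytes

for offset, x, y in fc_binary(ansihi, hi):
    print(f"{offset:08X}: {x:02X} {y:02X}")
```

Offset 7 never shows up in the list because both inputs hold 0x65 ("e") there.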

Regards
aGerman

Sponge Belly
Posts: 231
Joined: 01 Oct 2012 13:32
Location: Ireland

Re: ctrl-z blues

#18 Post by Sponge Belly » 06 Nov 2013 10:36

Hi aGerman! :-)

Thanks for spelling out that aspect of fc’s behaviour for me. I should’ve realised it myself, sorry. :oops:

Too many synapses lost to the hard stuff when I was younger… ;-)

- SB

PS: Would still like to know what happened to that Line Feed.

aGerman
Expert
Posts: 4678
Joined: 22 Jan 2010 18:01
Location: Germany

Re: ctrl-z blues

#19 Post by aGerman » 06 Nov 2013 12:52

To be honest I'm surprised that it works at all. You marked the file as UTF-16 LE encoded, but then you wrote the content in ASCII. I would be totally confused if I were CMD :wink:

Regards
aGerman

penpen
Expert
Posts: 2009
Joined: 23 Jun 2013 06:15
Location: Germany

Re: ctrl-z blues

#20 Post by penpen » 10 Nov 2013 06:48

Sponge Belly wrote:Would still like to know what happened to that Line Feed.

The reason is simple: a UTF-16 character consists of 2 or 4 bytes.
If the input ends with an unfinished UTF-16 character, the input stream is treated as broken and the last byte (or the last 3 bytes) is dropped:

Content of the file hi.txt as Unicode characters, taking each pair of bytes in file order:

Code: Select all

<255><254>   FFFE, Unknown (Unknown Script) and therefore used as the byte order mark
hi           CJK UNIFIED IDEOGRAPH-6869 (Han Script)
 t           SUPERSCRIPT FOUR (Other Number)
he           CJK UNIFIED IDEOGRAPH-6865 (Han Script)
re           CJK UNIFIED IDEOGRAPH-7265 (Han Script)
<SUB><CR>    BUGINESE LETTER JA (Buginese Script)
<LF>         <broken stream: only one byte>
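
The pairing can be checked with a short Python sketch. One caveat: the table above takes each byte pair in file order (high byte first), whereas a decoder that honours the FF FE mark would read the low byte first and arrive at different code values. Either way, the 11 bytes after the mark form five 16-bit units with a lone LF left over. The helper name split_code_units is made up for this example.

```python
def split_code_units(data, byteorder):
    """Split data into 16-bit units; return (units, leftover_bytes).

    byteorder is "big" or "little", as accepted by int.from_bytes.
    """
    usable = len(data) - len(data) % 2  # largest even prefix
    units = [int.from_bytes(data[i:i + 2], byteorder)
             for i in range(0, usable, 2)]
    return units, data[usable:]


body = b"hi there\x1a\r\n"  # hi.txt after the two BOM bytes

for order in ("big", "little"):
    units, leftover = split_code_units(body, order)
    print(order, [f"{u:04X}" for u in units], leftover)
```

The "big" row reproduces the hex values in the table (6869, 2074, 6865, 7265, 1A0D); the "little" row shows the same pairs with their bytes swapped. In both cases the leftover is the single LF byte.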

penpen

carlos
Expert
Posts: 503
Joined: 20 Aug 2010 13:57
Location: Chile

Re: ctrl-z blues

#21 Post by carlos » 10 Nov 2013 07:19

Type stops when it finds the SUB character (Ctrl+Z, ASCII 26) because it uses the C function fopen with "r" = "text mode", in which that character is interpreted as the end of the file. To avoid this, it should use "rb" = "binary mode". So the problem lies not in the type command itself but in the text mode of the C runtime library. For me "text mode" is really bad because it interprets characters; for example, it also translates \r\n to \n when reading. Type should read the file in binary mode to avoid this.

For /F reads the file in binary mode.
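
The difference between the two modes can be sketched in Python. This is only a rough emulation of how text-mode reading in Microsoft's C runtime is documented to behave (stop at the first SUB, turn CR LF into LF); whether type really goes through fopen is the claim made above and is not verified here. The function name read_like_text_mode is hypothetical.

```python
def read_like_text_mode(data):
    """Rough emulation of C text-mode reading on Windows.

    Everything from the first SUB (0x1A) onward is ignored,
    and each CR LF pair is returned as a single LF.
    """
    end = data.find(b"\x1a")
    if end != -1:
        data = data[:end]
    return data.replace(b"\r\n", b"\n")


raw = b"line one\r\nline two\x1aline three\r\n"

print(read_like_text_mode(raw))  # what a text-mode reader would see
print(raw)                       # what a binary-mode reader would see
```

A binary-mode reader gets every byte, including the SUB and the text after it, which matches the behaviour described for For /F.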

carlos
Expert
Posts: 503
Joined: 20 Aug 2010 13:57
Location: Chile

Re: ctrl-z blues

#22 Post by carlos » 10 Nov 2013 07:39

Try this on a file called foo.txt. It has Ctrl+Z support.

Code: Select all

cmd /a /c for /F "usebackq delims=" %%a in ("foo.txt") do cmd /u /c set /p "=%%a" < nul | find /v ""


The only problem is the limitations of set /p: it does not permit text beginning with the = character and, on Windows 7, it trims leading spaces.

Sponge Belly
Posts: 231
Joined: 01 Oct 2012 13:32
Location: Ireland

Re: ctrl-z blues

#23 Post by Sponge Belly » 10 Nov 2013 10:29

@Penpen:

Thanks for the explanation! 8)

What utility did you use to make the Unicode dump? I want it! And what on Earth is Buginese? :?:

@Carlos:

Thanks for the info on type and for /f loops. Fascinating! :-)

As near as I can tell, echo, set, and set /p can process UTF-16 LE strings correctly, just as type can. Your code snippet can be rewritten as:

Code: Select all

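rem Create a one-byte file holding SUB; /a on the destination appends the end-of-file character.
copy nul sub /a >nul
rem Read that SUB character into the variable sub.
for /f %%z in (sub) do set sub=%%z
rem Echo each line of foo.txt as UTF-16 LE and pipe it through find.
for /f usebackq^ delims^=^ eol^= %%a in ("foo.txt") ^
do cmd /d /u /c echo(%%a| find /v "%sub%"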


The above will strip all SUBs (and NULs) from the file. Of course, you could chain additional find commands to filter out other unwanted characters, but that would quickly become grossly inefficient.

Anyways, the only problem now is line length. I need some way to split up an extremely long line into 8191-character chunks so I can process them one character at a time and stick the line back together again.
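
For what it's worth, the splitting step itself is simple outside of Batch. Here is a minimal Python sketch of cutting a long line into pieces of at most 8191 characters and joining them again. The function name split_line and the 20000-character sample are invented for illustration; this is not a Batch solution.

```python
def split_line(line, size=8191):
    """Cut line into consecutive pieces of at most size characters."""
    return [line[i:i + size] for i in range(0, len(line), size)]


long_line = "x" * 20000          # stand-in for an extremely long line
parts = split_line(long_line)

print([len(p) for p in parts])       # lengths of the individual chunks
print("".join(parts) == long_line)   # the chunks rejoin without loss
```

Only the last chunk can be shorter than 8191 characters, so each piece stays within the limit mentioned above.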

I might have to cheat and use a hybrid to solve that problem, but it’s a slippery slope. If I’m going to use another language for part of my program, why not write the whole program in that language? It’s a powerful argument, except…

In the words of Samuel Beckett:

All of old. Nothing else ever. Ever tried. Ever failed. No matter. Try again. Fail again. Fail better.


BFN!

- SB
Last edited by Sponge Belly on 10 Nov 2013 14:18, edited 1 time in total.

penpen
Expert
Posts: 2009
Joined: 23 Jun 2013 06:15
Location: Germany

Re: ctrl-z blues

#24 Post by penpen » 10 Nov 2013 13:22

Sponge Belly wrote:What utility did you use to make the Unicode dump? I want it!
I used the online utility from the Unicode Consortium:
http://unicode.org/cldr/utility/character.jsp
Just type the hex value into the field (e.g. 1A0D) and click the Show button.

Sponge Belly wrote:And what on Earth is Buginese? :?:
It is the language of the Bugis:
http://en.wikipedia.org/wiki/Buginese_language

penpen

Sponge Belly
Posts: 231
Joined: 01 Oct 2012 13:32
Location: Ireland

Re: ctrl-z blues

#25 Post by Sponge Belly » 19 Nov 2013 17:22

Hi Again! :-)

Thanks to Penpen for the link to the Unicode website and for the information on the Buginese people. It just goes to show you come across the strangest things while trying to learn Batch! ;-)

But something Penpen said…

penpen wrote:A UTF-16 character consists of 2, or 4 bytes. If an unfinished UTF16 character at the end is read, the InputStream is handled as broken and the last 1 byte (or 3 bytes) is dropped.


gave me an idea. :idea:

All will be revealed shortly. ;-)

- SB
