ctrl-z blues

#16 Post by **Sponge Belly** » 05 Nov 2013 16:34

Hi Squashman! :-)

Speaking of Line Feeds, I stumbled across some bizarre behaviour in my tests. :twisted:

Create a test file called hi.txt with your favourite text editor:

Code:

<255><254>hi there<SUB><CR><LF>

The first two bytes in the file have the values 255 and 254 (0xFF 0xFE). Together they form the Byte Order Mark (BOM), which is usually found at the start of Windows Unicode (UTF-16 LE, to be precise) files.
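Not every text editor makes it easy to type these raw bytes, so the file can also be written programmatically. Here is a minimal Python sketch, assuming the byte layout shown above; the helper name `make_test_file` is made up for illustration.

```python
from pathlib import Path

# Byte layout from the post: BOM (255, 254), the ASCII text "hi there",
# then SUB (26), CR (13) and LF (10). The text is one byte per character,
# so the file only *claims* to be UTF-16.
HI_BYTES = bytes([255, 254]) + b"hi there" + bytes([26, 13, 10])


def make_test_file(path="hi.txt"):
    """Write the 13 test bytes to `path` and return it as a Path."""
    target = Path(path)
    target.write_bytes(HI_BYTES)
    return target


if __name__ == "__main__":
    created = make_test_file()
    print(created, created.stat().st_size, "bytes")
```

Writing in binary mode matters here: a text-mode write could translate the trailing line ending and change the byte count.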

Anyways, once you’ve created the file, enter the following command:

Code:

cmd /d /u /c type hi.txt > ansihi.txt

Edit ansihi.txt and you will find:

Code:

hi there<SUB><CR>

The BOM is gone and so is the Line Feed! :?:

But the weirdness doesn’t end there…

Code:

fc /b ansihi.txt hi.txt
Comparing files ansihi.txt and HI.TXT
00000000: 68 FF
00000001: 69 FE
00000002: 20 68
00000003: 74 69
00000004: 68 20
00000005: 65 74
00000006: 72 68
00000008: 1A 72
00000009: 0D 65
FC: HI.TXT longer than ansihi.txt

Do you see that? The line for offset 00000007 is missing from the output! :shock:

Is this an fc bug? I wasn’t aware of it.

Well, I must go lie down in a darkened room with a damp cloth over my eyes now…

- SB

#17 Post by **aGerman** » 05 Nov 2013 17:04

Sponge Belly wrote:Is this an fc bug? I wasn’t aware of it.

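No, it's the normal behaviour. FC displays only the differences. Because the BOM was removed, the content is shifted by two bytes, so the last "e" of "there" in ansihi.txt and the first "e" of "there" in hi.txt both sit at offset 7. The bytes are equal there, so FC prints nothing for that offset.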
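The effect is easy to reproduce outside of FC. The following Python sketch is only an imitation of a byte-by-byte comparison, not FC's actual implementation; `differing_offsets` is a hypothetical helper, and the two byte strings are taken from the listings earlier in the thread.

```python
def differing_offsets(a: bytes, b: bytes):
    """Return (offset, byte_a, byte_b) for each position where a and b differ.

    Only the overlapping part of the two inputs is compared.
    """
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


ansihi = b"hi there\x1a\r"            # what the type command produced
hi = b"\xff\xfehi there\x1a\r\n"      # original file with BOM and LF

if __name__ == "__main__":
    for offset, x, y in differing_offsets(ansihi, hi):
        print(f"{offset:08X}: {x:02X} {y:02X}")
    if len(hi) > len(ansihi):
        print("second input is longer than the first")
```

Offset 7 never shows up in the result because both inputs hold 0x65 ("e") at that position.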

Regards
aGerman

#18 Post by **Sponge Belly** » 06 Nov 2013 10:36

Hi aGerman! :-)

Thanks for spelling out that aspect of fc’s behaviour for me. I should’ve realised it myself, sorry. :oops:

Too many synapses lost to the hard stuff when I was younger… ;-)

- SB

PS: Would still like to know what happened to that Line Feed.

#19 Post by **aGerman** » 06 Nov 2013 12:52

To be honest I'm surprised that it works at all. You marked the file to be UTF-16 LE encoded, but then you wrote the content in ASCII. I would be totally confused if I were the CMD :wink:

Regards
aGerman

#20 Post by **penpen** » 10 Nov 2013 06:48

Sponge Belly wrote:Would still like to know what happened to that Line Feed.

The reason is simple: a UTF-16 character consists of 2 or 4 bytes.
If an unfinished UTF-16 character is read at the end of the file, the input stream is treated as broken and the last 1 byte (or 3 bytes) is dropped:

Content of the file hi.txt in Unicode characters:

Code:

<255><254>   FFFE, Unknown (Unknown Script) and therefore used as the byte order mark
hi           CJK UNIFIED IDEOGRAPH-6869 (Han Script)
 t           SUPERSCRIPT FOUR (Other Number)
he           CJK UNIFIED IDEOGRAPH-6865 (Han Script)
re           CJK UNIFIED IDEOGRAPH-7265 (Han Script)
<SUB><CR>    BUGINESE LETTER JA (Buginese Script)
<LF>         <broken stream: only one byte>
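The pairing in the table can be reproduced in a few lines of Python. This is only a sketch of the grouping, not of what cmd does internally, and `code_units` and `describe` are hypothetical helpers. One caveat: the table reads each pair in file order (big-endian), whereas an FF FE mark normally signals little-endian, which would give different code points (for example U+6968 rather than U+6869). Either way, the 13 bytes split into six 2-byte units with the Line Feed left over.

```python
import unicodedata

HI_BYTES = b"\xff\xfehi there\x1a\r\n"  # the 13 bytes of hi.txt


def code_units(raw: bytes, byteorder: str):
    """Split raw into 16-bit units; return (units, leftover_bytes)."""
    usable = len(raw) - len(raw) % 2      # largest even-length prefix
    units = [int.from_bytes(raw[i:i + 2], byteorder)
             for i in range(0, usable, 2)]
    return units, raw[usable:]


def describe(unit: int) -> str:
    """Unicode name of a code unit, or a placeholder if it has none."""
    return unicodedata.name(chr(unit), "<no name>")


if __name__ == "__main__":
    for order in ("big", "little"):
        units, leftover = code_units(HI_BYTES, order)
        print(order, "endian; leftover bytes:", leftover)
        for u in units:
            print(f"  U+{u:04X}  {describe(u)}")
```

The leftover is the single Line Feed byte in both byte orders, which matches the missing LF in ansihi.txt.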

penpen

#21 Post by **carlos** » 10 Nov 2013 07:19

Type stops when it finds the SUB character (Ctrl+Z, ASCII 26) because it uses the C function fopen with "r" (text mode), in which that character is interpreted as the end of the file. To avoid this, it should use "rb" (binary mode). But the problem is not with the type command itself; it lies in the C runtime library. For me, text mode is really bad because it interprets characters; for example, it also translates \n to \r\n when writing. Type should read the file in binary mode to avoid this.
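The text-mode behaviour described above can be imitated in Python. This is a sketch of the described semantics only (stop at the first SUB, fold CR LF into LF on read); it makes no claim about how type is really implemented, and both function names are made up for illustration.

```python
SUB = 0x1A  # Ctrl+Z


def read_like_text_mode(raw: bytes) -> bytes:
    """Imitate a text-mode read: cut at the first SUB, map CR LF to LF."""
    end = raw.find(bytes([SUB]))
    if end != -1:
        raw = raw[:end]          # everything from SUB onward is ignored
    return raw.replace(b"\r\n", b"\n")


def read_like_binary_mode(raw: bytes) -> bytes:
    """Binary mode hands the bytes back untouched."""
    return raw


if __name__ == "__main__":
    sample = b"line 1\r\nbefore\x1aafter\r\n"
    print(read_like_text_mode(sample))
    print(read_like_binary_mode(sample))
```

With the sample above, the text-mode imitation loses everything after "before", while the binary-mode version returns all of the input.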

For /F reads the file in binary mode.

#22 Post by **carlos** » 10 Nov 2013 07:39

Try this on a file called foo.txt. It supports Ctrl+Z:

Code:

cmd /a /c for /F "usebackq delims=" %%a in ("foo.txt") do cmd /u /c set /p "=%%a" < nul | find /v ""

The only problem is the limitations of set /p: it does not permit text beginning with the = character, and on Windows 7 it trims leading spaces.

#23 Post by **Sponge Belly** » 10 Nov 2013 10:29

@Penpen:

Thanks for the explanation!

What utility did you use to make the Unicode dump? I want it! And what on Earth is Buginese? :?:

@Carlos:

Thanks for the info on type and for /f loops. Fascinating! :-)

As near as I can tell, echo, set, and set /p can correctly process UTF-16 LE strings as well as type. Your code snippet can be rewritten as:

Code:

copy nul sub >nul
for /f %%z in (sub) do set sub=%%z
for /f usebackq^ delims^=^ eol^= %%a in ("foo.txt") ^
do cmd /d /u /c echo(%%a| find /v "%sub%"

The above will strip all SUBs (and NULs) from the file. Of course, you could chain additional find commands to filter out other unwanted characters, but that would quickly become grossly inefficient.

Anyways, the only problem now is line length. I need some way to split up an extremely long line into 8191-character chunks so I can process them one character at a time and stick the line back together again.
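The split, process and rejoin idea is shown here in Python rather than Batch, purely as an illustration. The 8191 chunk size comes from the post; `split_line`, `process_line` and the example `drop_sub_and_nul` transform are hypothetical names, and nothing here addresses how to do the same in pure Batch.

```python
MAX_CHUNK = 8191  # chunk size mentioned in the post


def split_line(line: str, size: int = MAX_CHUNK):
    """Cut a long line into consecutive pieces of at most `size` characters."""
    return [line[i:i + size] for i in range(0, len(line), size)]


def process_line(line: str, transform=lambda ch: ch) -> str:
    """Apply `transform` to every character, chunk by chunk, then rejoin."""
    pieces = []
    for chunk in split_line(line):
        pieces.append("".join(transform(ch) for ch in chunk))
    return "".join(pieces)


def drop_sub_and_nul(ch: str) -> str:
    """Example transform: remove SUB (0x1A) and NUL (0x00) characters."""
    return "" if ch in "\x1a\x00" else ch


if __name__ == "__main__":
    long_line = "a" * 20000 + "\x1a"
    print([len(c) for c in split_line(long_line)])
    print(len(process_line(long_line, drop_sub_and_nul)))
```

Because the chunks are consecutive slices, joining them in order always restores the original line when the transform leaves characters unchanged.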

I might have to cheat and use a hybrid to solve that problem, but it’s a slippery slope. If I’m going to use another language for part of my program, why not write the whole program in that language? It’s a powerful argument, except…

In the words of Samuel Beckett:

All of old. Nothing else ever. Ever tried. Ever failed. No matter. Try again. Fail again. Fail better.

BFN!

- SB

#24 Post by **penpen** » 10 Nov 2013 13:22

Sponge Belly wrote:What utility did you use to make the Unicode dump? I want it!

I used the online utility on the Unicode Consortium's website:
http://unicode.org/cldr/utility/character.jsp
Just type the hex value into the field (e.g. 1A0D) and click the Show button.

Sponge Belly wrote:And what on Earth is Buginese?

It is the language of the Bugis:
http://en.wikipedia.org/wiki/Buginese_language

penpen

#25 Post by **Sponge Belly** » 19 Nov 2013 17:22

Hi Again!

Thanks to Penpen for the link to the Unicode website and for the information on the Buginese people. It just goes to show you come across the strangest things while trying to learn Batch! ;-)

But something Penpen said…

A UTF-16 character consists of 2 or 4 bytes. If an unfinished UTF-16 character is read at the end of the file, the input stream is treated as broken and the last 1 byte (or 3 bytes) is dropped.

gave me an idea. :idea:

All will be revealed shortly. ;-)

- SB

DosTips.com
