findstr.bat and repl.bat and NULLS

#16 Post by **foxidrive** » 23 Jul 2014 00:09

Thanks aGerman,

This slightly changed snippet changes nulls to a pipe, which repl and findrepl seemed unable to do.

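// Read the file as text through ADODB.Stream so that NUL characters survive.
var objAdoS = WScript.CreateObject("ADODB.Stream");
objAdoS.Type = 2; // 2 = adTypeText
objAdoS.CharSet = "us-ascii";
objAdoS.Open();
objAdoS.LoadFromFile("file.j8i");
var strContent = objAdoS.ReadText();
objAdoS.Close();
// Replace every NUL character with a pipe.
var strFind = strContent.replace(/\x00/g,"|");

WScript.Echo(strFind);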

#17 Post by **carlos** » 23 Jul 2014 06:55

In my bhx program I used an ADODB object, and I remember that the charset for handling binary data is "windows-1252"; otherwise some characters are misinterpreted.

#18 Post by **aGerman** » 23 Jul 2014 13:50

Good point carlos.
As long as all characters are standard ASCII it doesn't matter, although you probably must not use a Unicode character set (even if UTF-8 could also be a possible encoding). Whether you have to use a different character set can only be discovered if one or more extended ASCII characters happen to be in the plain text phrase. I assume only foxidrive can answer that ...

@foxidrive
Basically it's not a question of the replace method but of how different string types or string streams are cast into a JScript string object. Printable ASCII characters always seem to be cast correctly. Since the NUL character is also used as a string terminator in some string types, an expression may be truncated at the first occurrence of a 0 byte. That's the behavior you will find if you use the ReadAll method instead of the ADO Stream workaround.

Regards
aGerman

#19 Post by **foxidrive** » 23 Jul 2014 18:22

aGerman wrote: Since the NUL character is also used as string terminator in some string types it may happen that a certain expression will be truncated at the first occurrence of a 0 byte. That's the behavior you will find if you use the ReadAll Method instead of the ADO Stream workaround.

Thanks aGerman, that is the behaviour of every tool I tried in plain batch too.
set /p also truncates at the first nul for instance.

What I was interested in (my first post isn't really clear on this point, but the thread title is - sorry) is that repl.bat and findrepl.bat don't correctly handle a replacement of \x00 with another character.

The character set in this example was plain ascii so that aspect was ok.

#20 Post by **dbenham** » 23 Jul 2014 22:18

There is something wonky going on with JScript. REPL.BAT is designed to support replacement of 0x00, and in all my prior tests it worked fine.

I created a file test.txt with the following hex values:

Code: Select all

C:\test\test>hexdump test.txt
00 41 00 42 00 43 00

I was able to replace the null bytes no problem:

Code: Select all

C:\test\test><test.txt repl \x00 _ >new.txt

C:\test\test>type new.txt
_A_B_C_

But if I try to replace the original file.j8i, without redirection, I get screwy results:

Code: Select all

C:\test\test><file.j8i repl \x00 _
JS-8_U__♦___?_?_?U__C___‼_?_____?9__?9__&___?_?_____?9__?9__(___·_?_____?9__?9__
C___?_?_____?9__?9__☺___?_?_?U__☺___?_?_?U__☺___?_?_?U__'___?_?_____?9__?9__♣___
?_7_?U__♦___8_1_?U__C___?_1_____?9__?9__&___?_2_____?9__?9__(___?_2_____?9__?9__
C___·_2_____?9__?9______________________________?}___H_♀_____________________♦__
___d__ô♥A_^_☺♦__☻ _%I↕♦IéA'<S('♦(S☻(¢ò(,♦äJ♥E¢,(`♣"*¢?I►ôîéA<______ ______ _____
________________________________________________________________________________
___________________________a???JS-8_U______?9__?9__(___?_?_____?9__?9__C___7_?__
___?9__?9__☺___?_?_?U__☺___?_?_?U__☺___?_?_?U__'___?_?_____?9__?9__♣___?_I_?U__♦
___I_Q_?U__C___?_Q_____?9__?9__&___?_R_____?9__?9__(___?_R_____?9__?9__C___?_R__
___?9__?9__☺___?_?_?U__☺___Z_e_?U__☺___E_?_?U__'___Z_?_____?9__?9__♣___?_?_?U__♦
___?_?_?U__C___?_?_____?9__?9__&___Z_?_____?9__?9__(___?_?_____?9__?9__C___?_?__
___?9__?9__☺___?_?_?U__☺___?_?_?U__☺___?_?_?U__'___?_?_____?9__?9__♣___?_?_?U__♦
___?_?_?U__C___?_?_____?9__?9__&___?_?_____?9__?9__(___?_?______________________
___

The real mystery occurs when I try to redirect the output to a file:

Code: Select all

C:\test\test><file.j8i repl \x00 _ >file.new
C:\utils\repl.bat(292, 39) Microsoft JScript runtime error: Invalid procedure call or argument

That should not happen!

It looks to me like a loose pointer. I don't see how my code could logically give the above results. I wonder if this is a JScript bug?

Dave Benham

#21 Post by **Aacini** » 23 Jul 2014 22:29

This seems to be related to the old problem that JScript always works with Unicode characters, so a problem arises when a character beyond ASCII 256 is sent to the screen.

Didn't you include the character mapping we talked about some time ago in repl.bat?

Antonio

#22 Post by **dbenham** » 24 Jul 2014 05:12

I don't think that is the issue. Both your FINDREPL.BAT and my REPL.BAT are choking on foxi's file, and both have code to compensate for the unicode.

I created a file named ASCII.TXT containing all byte codes, in order, from 0x00 - 0xFF. I am able to properly replace \x00 with _ using

Code: Select all

<ascii.txt repl \x00 _ m >ascii.new

The M option is used to prevent the addition of \x0D \x0A at the end.

I also replaced null with itself and verified the output matched the input using FC

Code: Select all

<ascii.txt repl \x00 \x00 mx >ascii.new
fc /b ascii.txt ascii.new

I don't think there is anything wrong with the logic in REPL.BAT, but instead there is something in foxi's file that is triggering a bug in JScript, causing it to run amok.

Or perhaps some byte sequence in the file is interpreted by JScript as an invalid variable length Unicode sequence?

Dave Benham

#23 Post by **penpen** » 24 Jul 2014 13:22

I just noticed that the result is different when repeating it.

Code: Select all

Z:\><file.j8i repl \x00 _
JS-8_2__♦___?_?_?2__C___?_?_____?6__?6__&___._?_____?6__?6__(___"_?_____?6__?6__
C___?_?_____?6__?6__☺___?_?_?2__☺___?_?_?2__☺___?_?_?2__'___?_?_____?6__?6__♣___
?_?_?2__♦___?_?_?2__C___?_?_____?6__?6__&___?_?_____?6__?6__(___?_?_____?6__?6__
C___"_?_____?6__?6______________________________?}___H_♀_____________________♦__
___ð__ô♥À_^_☺♦__☻ _%Ï↕♦ÎéÂ'<S('♦(S☻(¢ò('♦äJ♥È¢'('♣"*¢?Ï►ôîéÂ<______ ______ _____
________________________________________________________________________________
___________________________a???JS-8_2______?6__?6__(___?_?_____?6__?6__C___¹_?__
___?6__?6__☺___?_?_?2__☺___?_?_?2__☺___?_?_?2__'___?_?_____?6__?6__♣___?_g_?2__♦
___H_?_?2__C___?_?_____?6__?6__&___?_N_____?6__?6__(___?_N_____?6__?6__C___?_N__
___?6__?6__☺___Q_?_?2__☺___T_?_?2__☺___K_F_?2__'___T_F_____?6__?6__♣___?_?_?2__♦
___?_?_?2__C___?_?_____?6__?6__&___T_?_____?6__?6__(___Q_?_____?6__?6__C___?_?__
___?6__?6__☺___?_?_?2__☺___?_?_?2__☺___?_?_?2__'___?_?_____?6__?6__♣___?_?_?2__♦
___?_?_?2__C___?_?_____?6__?6__&___?_?_____?6__?6__(___?_?______________________
___

Z:\><file.j8i repl \x00 _
JS-8_n__♦___?_?_?n__C___?_?_____?☼__?☼__&___._?_____?☼__?☼__(___"_?_____?☼__?☼__
C___?_?_____?☼__?☼__☺___?_?_?n__☺___?_?_?n__☺___?_?_?n__'___?_?_____?☼__?☼__♣___
?_?_?n__♦___?_?_?n__C___?_?_____?☼__?☼__&___?_?_____?☼__?☼__(___?_?_____?☼__?☼__
C___"_?_____?☼__?☼______________________________?}___H_♀_____________________♦__
___ð__ô♥À_^_☺♦__☻ _%Ï↕♦ÎéÂ'<S('♦(S☻(¢ò('♦äJ♥È¢'('♣"*¢?Ï►ôîéÂ<______ ______ _____
________________________________________________________________________________
___________________________a???JS-8_n______?☼__?☼__(___?_?_____?☼__?☼__C___¹_?__
___?☼__?☼__☺___?_?_?n__☺___?_?_?n__☺___?_?_?n__'___?_?_____?☼__?☼__♣___?_g_?n__♦
___H_?_?n__C___?_?_____?☼__?☼__&___?_N_____?☼__?☼__(___?_N_____?☼__?☼__C___?_N__
___?☼__?☼__☺___Q_?_?n__☺___T_?_?n__☺___K_F_?n__'___T_F_____?☼__?☼__♣___?_?_?n__♦
___?_?_?n__C___?_?_____?☼__?☼__&___T_?_____?☼__?☼__(___Q_?_____?☼__?☼__C___?_?__
___?☼__?☼__☺___?_?_?n__☺___?_?_?n__☺___?_?_?n__'___?_?_____?☼__?☼__♣___?_?_?n__♦
___?_?_?n__C___?_?_____?☼__?☼__&___?_?_____?☼__?☼__(___?_?______________________
___

Z:\>

So the problem is not located within the JScript part, I think (otherwise it should always produce the same output).
I assume JScript has problems reading from the pipe (used for the redirected input stream).
If this is true, there should be a limit up to which the reading is OK (somewhere around 1 KB).

And indeed the result was OK after I shrank the file to a size of 260 (0x104) bytes (why not 1 KB or 512 bytes, I don't know).
I replaced all of the file content with NUL characters and later with random characters (containing at least one NUL character): no problems.
Errors always occur on files with sizes >= 261 and one or more NUL character(s) within this block.

So I assume it is a kind of pipe "corruption" (or better: JScript's inability to read from it correctly) when using more than 260 bytes.

penpen

Edit: I've tested the above on pipes, too ("type file.j8i | repl \x00 _"), with the same results. So I tricked myself a little with terms and conclusions.

#24 Post by **dbenham** » 24 Jul 2014 14:59

penpen wrote: I just noticed that the result is different when repeating it.
...
So the problem is not located within the JScript part, I think (otherwise it should always produce the same output).
I assume JScript has problems reading from the pipe (used for the redirected input stream).
If this is true, there should be a limit up to which the reading is OK (somewhere around 1 KB).

And indeed the result was OK after I shrank the file to a size of 260 (0x104) bytes (why not 1 KB or 512 bytes, I don't know).
I replaced all of the file content with NUL characters and later with random characters (containing at least one NUL character): no problems.
Errors always occur on files with sizes >= 261 and one or more NUL character(s) within this block.

So I assume it is a kind of pipe "corruption" (or better: JScript's inability to read from it correctly) when using more than 260 bytes.

I interpret the evidence very differently. First off, there is no pipe - JScript is reading redirected stdin - a very basic OS function. Second, the issue cannot be strictly size based, as I have successfully used REPL.BAT on many files that had hundreds of megabytes. Finally, the fact that it does not give consistent results makes me think that there is a bug in JScript itself - again, I think there is a loose pointer within the JScript engine (or within some library that it uses).

Dave Benham

#25 Post by **carlos** » 24 Jul 2014 22:29

In cmd, when a program prints the character 0xA, cmd converts it to 0xD 0xA. Maybe some similar interpretation occurs here.

#26 Post by **einstein1969** » 25 Jul 2014 01:23

My experiment:

I adapted a script and it seems to work, with no bug.

simple_repl.vbs

Code: Select all

set re=new regexp
re.global=true
re.pattern="[\x00]"
set stream2=createobject("adodb.stream")
set stream=createobject("adodb.stream")
stream.open
stream.type=1
stream.loadfromfile wscript.arguments.item(0)
for p=0 to stream.size-1 step 16
  buf=stream.read(16)
  txt=mid(hex(&H1000000+p),2)
  for k=0 to 15
    if p+k<stream.size then
      txt=txt & " " & mid(hex(&h100+ascb(midb(buf,k+1,1))),2)
    else
      txt=txt & "   "
    end if
  next
  stream2.open
  stream2.type=1
  stream2.write buf
  stream2.position=0
  stream2.type=2
  stream2.charset="iso-8859-1"
  wscript.echo txt,re.replace(stream2.readtext(-1),"|")
  stream2.close
next

output

Code: Select all

E:\x264\provini\tmp>cscript ..\simple_repl.vbs file.j8i
Microsoft (R) Windows Script Host Versione 5.8
Copyright (C) Microsoft Corporation 1996-2001. Tutti i diritti riservati.

000000 4A 53 2D 38 00 00 04 14 66 6D 74 20 00 00 00 04 JS-8||♦¶fmt |||♦
000010 00 00 00 01 4A 38 49 20 00 00 04 00 00 00 00 01 |||☺J8I ||♦||||☺
000020 50 6F 70 3A 43 6C 61 73 73 69 63 20 50 6F 70 00 Pop:Classic Pop|
000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
000040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
000060 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
000070 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
000080 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
0000A0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
0000B0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
0000C0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
0000D0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
0000E0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
0000F0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
000100 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
000110 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
000120 80 7D 00 00 00 48 00 0C 00 00 00 00 00 00 00 00 ?}|||H|♀||||||||
000130 00 00 00 00 00 00 00 00 00 00 00 00 00 04 00 00 |||||||||||||♦||
000140 00 00 00 F0 00 00 F4 00 08 03 C0 00 88 00 01 04 |||ð||ô♥À|^|☺♦
000150 00 00 02 20 00 89 CF 12 04 CE E9 C2 27 3C 8A 28 ||☻ |%Ï↕♦ÎéÂ'<S(
000160 92 04 28 8A 02 28 A2 F2 28 82 04 E4 4A 03 C8 A2 '♦(S☻(¢ò('♦äJ♥È¢
000170 82 28 91 05 22 2A 02 08 A2 81 CF 10 F4 EE E9 C2 '('♣"¢?Ï►ôîéÂ
000180 07 3C 00 08 00 00 00 00 00 00 20 00 08 00 00 00 <|||||| |||
000190 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 ||| ||||||||||||
0001A0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
0001B0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
0001C0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
0001D0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
0001E0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
0001F0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
000200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
000210 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
000220 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
000230 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
000240 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
000250 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
000260 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
000270 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
000280 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
000290 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
0002A0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
0002B0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
0002C0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
0002D0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
0002E0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
0002F0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
000300 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
000310 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
000320 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
000330 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
000340 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
000350 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
000360 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
000370 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
000380 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
000390 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
0003A0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
0003B0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
0003C0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
0003D0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
0003E0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
0003F0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
000400 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ||||||||||||||||
000410 00 00 00 00 00 00 00 00 00 00 00 00             ||||||||||||

einstein1969

#27 Post by **penpen** » 27 Jul 2014 10:53

I've done some more testing on this (and edited my post above).

dbenham wrote: First off, there is no pipe - JScript is reading redirected stdin - a very basic OS function

Right: I was wrong to call "redirected input" "piped input";
I've added an edit (note) to my post above explaining why I wrote about pipes.
My conclusions (too hasty; sorry, I didn't have much time that day) are wrong, too.
I must apologise for that.

dbenham wrote: Second, the issue cannot be strictly size based, as I have successfully used REPL.BAT on many files that had hundreds of megabytes.

It is not strictly size based: I only tested files containing at least one NUL character.
But if there is a NUL character, then this is a size-based issue.
It does not relate to one fixed size, though; 261 characters (file bytes) is only the minimum size at which the issue occurs.

All my tests give the same result (Windows XP Home):
The data is divided into parts P_1 ... P_n (|P_1| = 260, |P_i| = 256 for i > 1).
All data in P_n is as it should be, and
all data in parts P_1, ... P_n-1 is only OK up to the first NUL byte.
The content of my test files is (as a regular expression): 'x'* NUL 'x'* '@' (("123456789" NUL)^24) "12345".

I now think the following is happening.
JScript reads the input from the StdIn text stream into an internal buffer (B1).
If the input exceeds a given size (B.crit in {260 + 256*i | i = 0, 1, 2, ...}), then a new buffer (B2) is created to hold the data.
Then the (old) data (string) is copied from B1 to B2 on the assumption that it holds a NUL-terminated string, so the data between the first NUL character and the first character in P_n gets corrupted.
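
The hypothesis above can be sketched as a small toy model. This is only an illustration of the described behaviour, not the scripting host's real code: the function name simulateReadAll is made up, "?" stands in for whatever garbage ends up in the buffer, and it is an assumption that the first NUL itself survives.

```javascript
// Toy model of the hypothesised ReadAll corruption (assumption, not real host code).
// Parts: P_1 holds 260 characters, every later part holds 256 characters.
function simulateReadAll(data) {
  // Find the offset where the last part P_n starts.
  var lastStart = 0;
  var next = 260;
  while (next < data.length) {
    lastStart = next;
    next += 256;
  }

  // A NUL only matters if it lies before P_n.
  var firstNul = data.indexOf("\x00");
  if (firstNul === -1 || firstNul >= lastStart) {
    return data;
  }

  // Everything between the first NUL and the start of P_n is replaced by "?".
  var garbage = new Array(lastStart - firstNul).join("?");
  return data.substring(0, firstNul + 1) + garbage + data.substring(lastStart);
}
```

With this model a 260-character input comes back unchanged, while a 261-character input with a NUL in the first 260 characters keeps only the text up to that NUL plus the final character.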

penpen

Edit: It seems that instead of "str1 += WScript.StdIn.ReadLine();" (or ...ReadAll...), you could use:

Code: Select all

   var str1 = "";
   while (!WScript.StdIn.AtEndOfStream) {
      // read one character at a time so that NUL characters survive
      str1 += WScript.StdIn.Read(1);
   }

But this is very slow (especially for big files).

#28 Post by **dbenham** » 27 Jul 2014 23:14

Fantastic, penpen!

I optimized your work-around by using Read(260) instead of Read(1). I've updated REPL.BAT to accept the N option to enable proper reading of NULL bytes using the work-around. It only works when also using the M option.

I haven't tested it, and I imagine it is still slow with large files, but at least slow is better than broken.

Dave Benham

#29 Post by **dbenham** » 30 Jul 2014 18:26

It turns out it is only ReadAll() that suffers the bug. Read(n) works properly with any size n (within limits of string size).

I removed the N option from my new REPL.BAT version 4.1. The M option now always works properly with binary files. The initial binary read is 1024 bytes, and the size doubles each time there is more content to read. This is a major speed boost.
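
A minimal sketch of such a doubling read loop might look like the following. It is not the actual REPL.BAT code: the names readAllByChunks and makeFakeStream are hypothetical, and the fake stream exists only so the loop can be tried outside Windows Script Host. Under cscript the stream argument would be WScript.StdIn, which provides Read(n) and AtEndOfStream.

```javascript
// Read a whole text stream with Read(n), doubling n after every read.
// `stream` is any object with Read(n) and AtEndOfStream, such as WScript.StdIn.
function readAllByChunks(stream) {
  var chunkSize = 1024;          // initial read size mentioned in the post
  var chunks = [];
  while (!stream.AtEndOfStream) {
    chunks.push(stream.Read(chunkSize));
    chunkSize *= 2;              // ask for twice as much next time
  }
  return chunks.join("");
}

// Hypothetical stand-in for WScript.StdIn, used only for experimenting.
function makeFakeStream(text) {
  var pos = 0;
  return {
    AtEndOfStream: text.length === 0,
    Read: function (n) {
      var piece = text.substr(pos, n);
      pos += piece.length;
      this.AtEndOfStream = pos >= text.length;
      return piece;
    }
  };
}
```

Collecting the pieces in an array and joining once at the end avoids repeated string concatenation, which is one plausible reason this approach is much faster than reading a single character at a time.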

I successfully processed a 100 MB file containing NULL bytes in 8 seconds.

Note: I also tested ReadAll() using VBS instead of JScript, and it also failed with NULL bytes. So the bug is in the core of the scripting host; it is not specific to JScript.

Dave Benham

DosTips.com
