findstr with regex to find characters with MSB bit set

miskox
Posts: 630
Joined: 28 Jun 2010 03:46

findstr with regex to find characters with MSB bit set

#1 Post by miskox » 13 Nov 2022 11:52

Hello!

I am struggling with regex. Let's say I have a file test.txt containing single-byte characters (50 bytes; NOTEPAD shows ANSI at the bottom right):

Code: Select all

testline1
somethingelse
Šmoretext
anotherline
(Š above has the value 0x8A; it doesn't matter which character you insert, as long as bit 7 is set.)
I want to find lines containing characters that have the MSB (bit 7) set.
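
The check described here is a per-byte bit test rather than a text pattern. A minimal Python sketch of that idea, assuming the file is read as raw bytes; the helper name `lines_with_high_bytes` and the inline sample are hypothetical and only mirror the test.txt shown above:

```python
def lines_with_high_bytes(data: bytes) -> list[bytes]:
    """Return the lines of `data` that contain at least one byte with bit 7 set."""
    return [
        line
        for line in data.splitlines()
        if any(byte & 0x80 for byte in line)
    ]


# Sample content mirroring test.txt (CRLF line ends, 50 bytes);
# 0x8A is the byte for Š in Windows-1250.
sample = b"testline1\r\nsomethingelse\r\n\x8amoretext\r\nanotherline\r\n"
hits = lines_with_high_bytes(sample)
print(hits)
```

For a real file, `open("test.txt", "rb").read()` would supply the bytes instead of the inline sample.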

No luck:

Code: Select all

C:\>findstr /R "[^\x00-\x7F]" test.txt
returns no lines. I need this line:

Code: Select all

Šmoretext

Code: Select all

C:\>findstr /R "[\x8a]" test.txt
returns:

Code: Select all

Őmoretext
anotherline
instead of

Code: Select all

Őmoretext
This would be a start for my task: ideally I would like to find all lines that contain a character other than the following:

valid characters:
LF: 0x0A
CR: 0x0D
Č: 0xAC
Š: 0xE6
Ž: 0xA6
č: 0x9F
š: 0xE7
ž: 0xA7
and 0x20-0x7f

Any ideas?
Thanks.
Saso

(I really don't have much experience with regex)

aGerman
Expert
Posts: 4678
Joined: 22 Jan 2010 18:01
Location: Germany

Re: findstr with regex to find characters with MSB bit set

#2 Post by aGerman » 13 Nov 2022 16:00

"NOTEPAD shows ANSI"
My guess is that the default ANSI code page is Windows-1250 in your case.
However, these values from your list of valid characters:
Č: 0xAC
Š: 0xE6
Ž: 0xA6
č: 0x9F
š: 0xE7
ž: 0xA7
are character codes for CP 852.
I really doubt that your test.txt file is CP 852 encoded though. Could you shed some light on that?
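
To make the code page difference concrete, here is a small Python sketch that encodes the six letters in both code pages using Python's built-in `cp852` and `cp1250` codecs. The variable names are made up for illustration:

```python
letters = "ČŠŽčšž"

# Map each letter to its single-byte value in CP 852 and in Windows-1250.
table = {
    ch: (ch.encode("cp852")[0], ch.encode("cp1250")[0])
    for ch in letters
}

for ch, (cp852, cp1250) in table.items():
    print(f"{ch}: CP852=0x{cp852:02X}  CP1250=0x{cp1250:02X}")

# The same byte decodes to a different character depending on the code page.
as_cp1250 = b"\x8a".decode("cp1250")
as_cp852 = b"\x8a".decode("cp852")
print(as_cp1250, as_cp852)
```

The last line is consistent with the console output in post #1, where the byte 0x8A is displayed as Ő rather than Š.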

Besides that, don't expect FINDSTR to support regular expressions in the usual way. It doesn't know anything about hex escapes like \xHH.
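
One plausible reading of the output in post #1: if \x8a is not a hex escape, the class [\x8a] is just a set of literal characters. How FINDSTR treats the backslash inside a class is an assumption here, not something confirmed in this thread. The Python sketch below only shows that the literal-character reading reproduces the two lines that were returned:

```python
import re

# Literal-character reading of [\x8a]: any one of '\', 'x', '8', 'a'.
# (Assumption: this emulates FINDSTR's behaviour; it is not FINDSTR itself.)
literal_class = re.compile(r"[\\x8a]")

lines = ["testline1", "somethingelse", "Šmoretext", "anotherline"]
matched = [line for line in lines if literal_class.search(line)]
print(matched)
```

"Šmoretext" matches because of the x in "text", and "anotherline" because of the a, not because of the byte 0x8A.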

Steffen

miskox
Posts: 630
Joined: 28 Jun 2010 03:46

Re: findstr with regex to find characters with MSB bit set

#3 Post by miskox » 14 Nov 2022 00:44

@Steffen: you are correct about CP1250 and CP852, but it is not important for the test because bit 7 is set.

Looks like this is what causes the problem:
"Besides that, don't expect FINDSTR to support regular expressions in the usual way. It doesn't know anything about hex escapes like \xHH."
Is there a solution for that?

Thanks.
Saso

aGerman
Expert
Posts: 4678
Joined: 22 Jan 2010 18:01
Location: Germany

Re: findstr with regex to find characters with MSB bit set

#4 Post by aGerman » 14 Nov 2022 11:19

miskox wrote:
14 Nov 2022 00:44
but it is not important for the test because bit 7 is set.
Oh, I was under the impression that you wanted to exclude those characters, since you have them in the list of "valid characters". Hmm...
"Is there a solution for that?"
Yes and no.
Yes - You may use third-party utilities like SED or GREP. You could use an existing JScript hybrid like JREPL. Or you could write your own code in another language like VBScript, JScript, or PowerShell.
No - I don't see any possibility to use FINDSTR for this task.

Steffen

miskox
Posts: 630
Joined: 28 Jun 2010 03:46

Re: findstr with regex to find characters with MSB bit set

#5 Post by miskox » 14 Nov 2022 12:12

Thanks Steffen!

Currently I am dumping my .txt file with your fildump, and I will proceed from there (it is ~33,000 lines, 4+ MB in size). After that I will see what to do. This might be a one-time test only (or maybe twice a year), so I don't really need a quick solution. The next option is to write an .exe (I use PureBASIC now because my old DOS COBOL is no good anymore).

Thanks again.
Saso

aGerman
Expert
Posts: 4678
Joined: 22 Jan 2010 18:01
Location: Germany

Re: findstr with regex to find characters with MSB bit set

#6 Post by aGerman » 14 Nov 2022 13:46

It's not too complicated.

Code: Select all

cmd /c jrepl.bat "^[\x09\x20-\x7F]*$" "" /N 1 /R 0 /XSEQ /F "test.txt"
You can add more valid characters right after \x7F. Use the char codes of Windows-1250 in this case. (FWIW I started with \x09 because I guess you don't want tab characters reported.)
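
For comparison, roughly the same filter can be sketched in Python. This is not how JREPL works internally. The function name `report_invalid_lines`, the `number:line` output format, and the choice to decode the bytes as Windows-1250 are assumptions made for the example:

```python
import re

# Same character class as in the JREPL call: tab plus 0x20-0x7F.
# Further valid characters can be appended inside the brackets.
VALID_LINE = re.compile(r"[\x09\x20-\x7F]*")


def report_invalid_lines(data: bytes, encoding: str = "cp1250") -> list[tuple[int, str]]:
    """Return (line number, text) for every line with a character outside the class."""
    # Bytes that are undefined in the code page become U+FFFD, which is also reported.
    text = data.decode(encoding, errors="replace")
    result = []
    for number, line in enumerate(text.split("\n"), start=1):
        line = line.rstrip("\r")  # drop the CR of a CRLF line end
        if not VALID_LINE.fullmatch(line):
            result.append((number, line))
    return result


sample = b"testline1\r\nsomethingelse\r\n\x8amoretext\r\nanotherline\r\n"
for number, line in report_invalid_lines(sample):
    print(f"{number}:{line}")
```

Because the text is decoded first, extra valid characters can be written into the class as letters (for example Š) instead of as Windows-1250 byte codes.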

Steffen

miskox
Posts: 630
Joined: 28 Jun 2010 03:46

Re: findstr with regex to find characters with MSB bit set

#7 Post by miskox » 15 Nov 2022 04:39

Thanks. Works like a charm! It found some invalid characters, which I fixed. I also added some more valid characters, of course (for example, ü is valid because we have some people with the last name Müller).

Saso
