findstr with regex to find characters with MSB bit set

Posted: 13 Nov 2022 11:52
by miskox
Hello!

I am struggling with regex. Let's say I have a test.txt file (one-byte characters, 50 bytes, NOTEPAD shows ANSI bottom right):

Code: Select all

testline1
somethingelse
Šmoretext
anotherline
(Š above has a value of 0x8A - it doesn't matter which character you insert, as long as bit 7 is set)
I want to find lines with characters that have the MSB (bit 7) set.

No luck:

Code: Select all

C:\>findstr /R "[^\x00-\x7F]" test.txt
returns no lines. I need this line:

Code: Select all

Šmoretext

Code: Select all

C:\>findstr /R "[\x8a]" test.txt
returns:

Code: Select all

Őmoretext
anotherline
instead of

Code: Select all

Őmoretext
This would be a start for my task: ideally I would like to find all lines that have a character that is different from:

valid characters:
LF: 0x0A
CR: 0x0D
Č: 0xAC
Š: 0xE6
Ž: 0xA6
č: 0x9F
š: 0xE7
ž: 0xA7
and 0x20-0x7f

Any ideas?
Thanks.
Saso

(I really don't have much experience with regex)

Re: findstr with regex to find characters with MSB bit set

Posted: 13 Nov 2022 16:00
by aGerman
NOTEPAD shows ANSI
My guess is that the default ANSI code page is Windows-1250 in your case.
However, ...
Č: 0xAC
Š: 0xE6
Ž: 0xA6
č: 0x9F
š: 0xE7
ž: 0xA7
--- are character codes for CP 852.
I really doubt that your test.txt file is CP 852 encoded though. Could you shed some light on that?
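
To make the code page question concrete, here is a small Python sketch (Python is my choice for illustration, it is not used anywhere in this thread). It encodes the six letters from the list in both code pages and prints the byte values, then decodes the byte 0x8A both ways. The helper name `byte_values` is made up for this example.

```python
import unicodedata

# The six letters from the "valid characters" list in the first post.
LETTERS = "ČŠŽčšž"

def byte_values(text: str, codepage: str) -> list[str]:
    """Return the hex value of each byte produced by encoding text in codepage."""
    return [f"0x{b:02X}" for b in text.encode(codepage)]

# Print the byte values of the same letters in the OEM and the ANSI code page.
for cp in ("cp852", "cp1250"):
    print(cp, byte_values(LETTERS, cp))

# The same byte is a different letter depending on the code page used to read it.
for cp in ("cp1250", "cp852"):
    print(cp, unicodedata.name(b"\x8a".decode(cp)))
```

The cp852 row reproduces the values quoted above, while in Windows-1250 the byte 0x8A is Š. Read as CP 852, that same byte is Ő, which would explain the Ő in the console output of the first post.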

Besides that, don't expect FINDSTR to support RegEx in the usual way. It doesn't know anything about hex expressions like \xHH.

Steffen

Re: findstr with regex to find characters with MSB bit set

Posted: 14 Nov 2022 00:44
by miskox
@Steffen: you are correct about the CP1250 and CP852 - but it is not important for the test because bit 7 is set.

Looks like this is causing the problems:
Besides that, don't expect FINDSTR to support RegEx in the usual way. It doesn't know anything about hex expressions like \xHH.
Is there a solution for that?

Thanks.
Saso

Re: findstr with regex to find characters with MSB bit set

Posted: 14 Nov 2022 11:19
by aGerman
miskox wrote:
14 Nov 2022 00:44
but it is not important for the test because bit 7 is set.
Oh, I've been under the impression that you wanted to exclude those characters since you have them in the list of "valid characters". Hmm...
Is there a solution for that?
Yes and no.
Yes - You may use 3rd party utilities like SED or GREP. You could use an existing JScript hybrid like JREPL. And you could write your own code in another language like VBScript, JScript, or PowerShell.
No - I don't see any possibility to use FINDSTR for this task.
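
As a rough idea of what the "write your own code" route could look like, here is a minimal sketch in Python (an assumption on my side, it is not one of the languages listed above). It works on raw bytes, so no code page is involved, and reports every line that contains a byte with bit 7 set. The function name and the sample data are made up; the sample mimics the 50-byte test.txt from the first post, assuming CRLF line endings.

```python
def lines_with_high_bytes(data: bytes) -> list[tuple[int, bytes]]:
    """Return (line number, raw line) for each line containing a byte >= 0x80."""
    hits = []
    for number, line in enumerate(data.splitlines(), start=1):
        # b & 0x80 is non-zero exactly when bit 7 (the MSB) of the byte is set.
        if any(b & 0x80 for b in line):
            hits.append((number, line))
    return hits

# Stand-in for test.txt: four CRLF-terminated lines, 0x8A at the start of line 3.
sample = b"testline1\r\nsomethingelse\r\n\x8amoretext\r\nanotherline\r\n"

for number, line in lines_with_high_bytes(sample):
    print(number, line)
```

For a real file, the bytes would come from something like `open("test.txt", "rb").read()` instead of the inline sample.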

Steffen

Re: findstr with regex to find characters with MSB bit set

Posted: 14 Nov 2022 12:12
by miskox
Thanks Steffen!

Currently I am dumping my .txt file with your fildump - and I will proceed from there (it is ~33.000 lines, 4+MB in size). After that I will see what to do. The fact is that this might be a one-time test only (or maybe twice a year), so I don't really need a quick solution. Next option is: I can write an .exe (now I use PureBASIC because my old DOS COBOL is not good anymore).

Thanks again.
Saso

Re: findstr with regex to find characters with MSB bit set

Posted: 14 Nov 2022 13:46
by aGerman
It's not too complicated.

Code: Select all

cmd /c jrepl.bat "^[\x09\x20-\x7F]*$" "" /N 1 /R 0 /XSEQ /F "test.txt"
You can add more valid characters right after \x7F. Use the char codes of Windows-1250 in this case. (FWIW I started with \x09 because I guess you don't want tab characters reported.)
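
For readers without JREPL, the same whitelist idea can be sketched in Python (again only an illustration, not part of the command above). A line counts as valid when it consists solely of tab, 0x20-0x7F, and a few extra bytes. The set of extra letters (the six from the first post plus ü) and the names `EXTRA`, `VALID_LINE` and `invalid_lines` are assumptions for this example.

```python
import re

# Extra valid letters, converted to their Windows-1250 byte values.
EXTRA = "ČŠŽčšžü".encode("cp1250")

# Character class: tab, 0x20-0x7F, then each extra byte written as a \xHH escape.
extra_class = b"".join(b"\\x%02x" % b for b in EXTRA)
VALID_LINE = re.compile(rb"[\x09\x20-\x7F" + extra_class + rb"]*")

def invalid_lines(data: bytes) -> list[tuple[int, bytes]]:
    """Return (line number, raw line) for each line with a byte outside the whitelist."""
    hits = []
    for number, line in enumerate(data.splitlines(), start=1):
        # fullmatch plays the role of the ^...$ anchors in the JREPL pattern.
        if not VALID_LINE.fullmatch(line):
            hits.append((number, line))
    return hits

# Made-up sample: only the line containing byte 0x80 falls outside the whitelist.
sample = b"M\xfcller\r\n\x8amoretext\r\nbad\x80byte\r\ntab\there\r\n"
for number, line in invalid_lines(sample):
    print(number, line)
```

Like the JREPL call, this prints a line number next to each rejected line; more valid characters can be added by extending the string behind `EXTRA`.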

Steffen

Re: findstr with regex to find characters with MSB bit set

Posted: 15 Nov 2022 04:39
by miskox
Thanks. Works like a charm! Found some invalid characters. Fixed. I added some more valid characters of course (for example, ü is valid because we have some people with Müller as their last name).

Saso