[SOLVED] [regex] Multiline seeking with (s)sed?

Shohreh · #1 Post by **Shohreh** » 05 Mar 2020 10:28

Hello,

Using ssed ("Super Sed" 3.62, based on GNU sed 4.1), I need to remove <desc>…</desc> blocks from GPX files. The blocks can span multiple lines, and with the commands below nothing happens.

Does anyone know how to get this to work?

Thank you.

Code:

@echo off

REM Called with : mybatch.bat *.gpx

REM BAD for %%f in ("%1") DO echo Handling "%%f" & ssed.exe -R "s@(?s)<desc>.+?</desc>@@g" < "%%f" > "%%f.DESC.gpx"
REM BAD for %%f in ("%1") DO echo Handling "%%f" & ssed.exe -r "s@(?s)<desc>.+?</desc>@@g" < "%%f" > "%%f.DESC.gpx"

REM BAD for %%f in ("%1") DO echo Handling "%%f" & ssed.exe -R "s@<desc>.+?</desc>@@gs" < "%%f" > "%%f.DESC.gpx"
REM BAD for %%f in ("%1") DO echo Handling "%%f" & ssed.exe -r "s@<desc>.+?</desc>@@gs" < "%%f" > "%%f.DESC.gpx"

siberia-man · #2 Post by **siberia-man** » 05 Mar 2020 10:45

Assuming that <desc> and </desc> each sit alone on their own lines (never together on one line), you can use the following command:

Code:

sed "/<desc>/,/<\/desc>/d"

If <desc> and </desc> can appear on the same line, or share a line with other tags, you need something much more complicated. I hope that is not your case, so that the example above is enough under the stated restrictions.

Shohreh · #3 Post by **Shohreh** » 06 Mar 2020 06:48

Thanks for the help.

The line above seems to remove every line that contains <desc> or </desc>, even when the tag is in the middle of a line, instead of removing only what lies between the two tags when they are spread over multiple lines.
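To see why this happens, here is a rough Python emulation of what the range delete `/<desc>/,/<\/desc>/d` does. It is only a sketch: the function name and the sample GPX fragment are made up for illustration. sed's `d` always deletes whole lines, and with a regex end address it starts looking for </desc> on the line after the one that matched <desc>.

```python
def delete_line_range(lines):
    """Emulate sed '/<desc>/,/<\\/desc>/d' on a list of lines."""
    kept = []
    inside = False
    for line in lines:
        if inside:
            # The end address is only tested on lines after the start line.
            if "</desc>" in line:
                inside = False
            continue  # every line in the range is dropped whole
        if "<desc>" in line:
            inside = True
            continue  # the start line is dropped whole as well
        kept.append(line)
    return kept


# Made-up fragment where the tags share lines with other elements.
sample = [
    "<wpt><name>A</name><desc>first",
    "second</desc><sym>Flag</sym>",
    "</wpt>",
]
remaining = delete_line_range(sample)
```

Only `</wpt>` survives here: the <name> and <sym> elements are lost because they shared a line with one of the tags, which matches the behavior described above.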

More googling seems to show that sed simply isn't the right tool for this unless you're an expert at sed… which I am not.

As a work-around, I'll just install Perl and run a one-liner.

siberia-man · #4 Post by **siberia-man** » 06 Mar 2020 07:28

As I said earlier:

"If both <desc> and </desc> could be on the same line or on the same line with other tags you need something much more complicated."

There is a workaround:

Code:

sed "put each <desc> and </desc> on its own line" | sed "/<desc>/,/<\/desc>/d"

You only need to work out the first step: put each <desc> and </desc> on its own line. It's not too complicated. Most likely you'll also need to modify the second command.

siberia-man · #5 Post by **siberia-man** » 06 Mar 2020 08:25

It could be something like this:

Code:

sed "s/></>\n</g" | sed "/<desc>.*<\/desc>/d; /<desc>/,/<\/desc>/d"

Some explanation:
1. The first sed finds every >< pair and inserts a newline character (\n) between the two angle brackets, so adjacent tags end up on separate lines.
2. The second sed performs two actions:
2.1 delete every line that contains <desc> something </desc>
2.2 delete every line from one containing <desc> through the next one containing </desc>, inclusive
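The two steps can be tried out on a small sample with a Python emulation. This is a sketch for checking the logic, not a replacement for the sed pipeline; the function names and the sample fragment are made up.

```python
import re

SINGLE_LINE_DESC = re.compile(r"<desc>.*</desc>")


def split_tags(text):
    # Step 1: insert a newline between every '><' pair.
    return text.replace("><", ">\n<")


def drop_desc_lines(text):
    # Step 2: emulate the two delete commands of the second sed.
    kept = []
    inside = False
    for line in text.split("\n"):
        # 2.1: a complete block on one line is deleted first; a deleted
        # line never reaches the range command, as in sed.
        if SINGLE_LINE_DESC.search(line):
            continue
        # 2.2: range delete from a <desc> line through the next </desc> line.
        if inside:
            if "</desc>" in line:
                inside = False
            continue
        if "<desc>" in line:
            inside = True
            continue
        kept.append(line)
    return "\n".join(kept)


sample = "<wpt><name>A</name><desc>one\ntwo</desc><sym>Flag</sym><desc>x</desc></wpt>"
result = drop_desc_lines(split_tags(sample))
```

On this sample both the multi-line block and the single-line block disappear while <name> and <sym> are kept, each now on its own line.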

dbenham · #6 Post by **dbenham** » 06 Mar 2020 08:29

Shohreh wrote (06 Mar 2020 06:48):
As a work-around, I'll just install Perl and run a one-liner.

Another option is to use my JREPL.BAT regular expression file processing utility. It is pure script (hybrid JScript/batch) that runs natively on any Windows version from XP onward, without the need for any 3rd-party exe or dll file.

Code:

jrepl "<desc>[\s\S]*?</desc>" "" /m /f "input.xml|utf-8" /o -

The above relies on the /M option, which requires that the entire file be loaded into memory. This limits the size of the file that can be processed (I think the max size is some value that approaches 1 GB, but I'm not sure).
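For comparison, the same whole-file idea in Python might look like the sketch below (not part of JREPL; the function name and sample are made up). It works for the same reason /M does: the pattern is applied to the entire text at once. sed, by contrast, normally hands the regex one line at a time, so a pattern that needs both <desc> and </desc> never matches when they sit on different lines.

```python
import re

# Same pattern as in the JREPL call: [\s\S] matches any character,
# newlines included, and *? stops at the first closing tag.
DESC_BLOCK = re.compile(r"<desc>[\s\S]*?</desc>")


def strip_desc(text):
    """Remove every <desc>...</desc> block from a string holding the whole file."""
    return DESC_BLOCK.sub("", text)


sample = "<wpt><name>A</name><desc>one\ntwo</desc><sym>Flag</sym><desc>x</desc></wpt>"
cleaned = strip_desc(sample)
```

Like /M, this assumes the whole file fits in memory.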

With the |utf-8 specification, the output will include a UTF-8 BOM. If you don't want it, then use

Code:

jrepl "<desc>[\s\S]*?</desc>" "" /m /f "input.xml|utf-8|NB" /o -

If the command is included in a batch script, then you must use CALL JREPL, because JREPL is itself a batch script.

Since the find/replace operation does not need to interpret any multi-byte Unicode characters, the utf-8 specification can probably be dropped as follows. This would probably improve performance, and might increase the maximum file size limit.

Code:

jrepl "<desc>[\s\S]*?</desc>" "" /m /f "input.xml" /o -

As long as your machine's default character set is a single-byte character set, each byte of a multi-byte Unicode character is treated as its own character, which is either preserved if outside a <desc></desc> block, or dropped if within one. It will not work if your default character set uses a variable number of bytes per character.

This last version would neither remove any pre-existing BOM, nor would it add one.
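The byte-by-byte argument can be illustrated in Python. This is a sketch under one assumption: latin-1 stands in for "a single-byte default character set" because it maps every byte value to exactly one character; the sample text is made up.

```python
import re

# Made-up sample with multi-byte UTF-8 characters inside and outside the block.
raw = "<name>Café</name><desc>naïve\ntext</desc><sym>Flag</sym>".encode("utf-8")

# Treat each byte as one character, the way a single-byte code page would.
as_single_bytes = raw.decode("latin-1")

# The tags are pure ASCII, so the pattern matches regardless of how the
# bytes between them are interpreted.
cleaned = re.sub(r"<desc>[\s\S]*?</desc>", "", as_single_bytes)

# Turning the characters back into bytes restores the untouched UTF-8 data.
result = cleaned.encode("latin-1")
```

Decoding `result` as UTF-8 gives back "Café" intact, because the bytes outside the block were never altered.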

Dave Benham

dbenham · #7 Post by **dbenham** » 06 Mar 2020 09:07

Note - XML CDATA and/or comments containing <desc> or </desc> are likely to break any regular expression based solution. Regular expressions should generally not be used to parse or manipulate XML unless you are confident in the physical layout of the XML file.
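A small Python illustration of the CDATA problem, next to a parser-based removal for comparison. It is a sketch on a made-up fragment, not a full GPX tool: real GPX files usually declare a default namespace, so the plain "desc" tag comparison below would need adjusting for them.

```python
import re
import xml.etree.ElementTree as ET

# Made-up fragment: the first <desc> is literal text inside a CDATA section.
xml_text = (
    "<wpt><cmt><![CDATA[use <desc> for notes]]></cmt>"
    "<name>A</name><desc>real</desc></wpt>"
)

# The regex starts at the <desc> inside CDATA and runs to the real </desc>,
# taking the end of the CDATA section, </cmt> and <name> with it.
regex_result = re.sub(r"<desc>[\s\S]*?</desc>", "", xml_text)

# A parser treats the CDATA content as text, so only the real element goes.
root = ET.fromstring(xml_text)
for parent in list(root.iter()):
    for child in list(parent):
        if child.tag == "desc":
            parent.remove(child)
parser_result = ET.tostring(root, encoding="unicode")
```

Here `regex_result` is left with an unterminated CDATA section, while `parser_result` still parses and keeps the <cmt> text. Note that ElementTree writes the CDATA content back as escaped text rather than as a CDATA section.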

Shohreh · #8 Post by **Shohreh** » 06 Mar 2020 10:47

Thanks both for the idea of first rearranging the input data to make parsing easier!

dbenham · #9 Post by **dbenham** » 06 Mar 2020 11:07

A JREPL solution without the /M option (no size limit as long as no line approaches 1 GB) can be done in 3 steps (assuming all <desc> and </desc> are paired properly):

Code:

:: Remove all <desc>...</desc> blocks within one line
call jrepl "<desc>.*?</desc>" "" /f "input.xml|utf-8" /o -
:: Remove all lines between line containing <desc> and line containing </desc>
call jrepl "^" "" /k 0 /exc "'<desc>'+1:'</desc>'-1" /f "input.xml|utf-8" /o -
:: Remove all text on line from beginning through </desc>, and from <desc> through end
call jrepl ".*</desc>|<desc>.*" "|" /t "|" /f "input.xml|utf-8" /o -
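The same line-at-a-time idea can be sketched in Python as a single pass that keeps only one line in memory. This is not how JREPL works internally; the function name and sample are made up, and like the steps above it assumes all <desc> and </desc> tags are paired properly.

```python
import re

SINGLE_LINE_DESC = re.compile(r"<desc>.*?</desc>")
CLOSE_TAG = "</desc>"


def strip_desc_streaming(lines):
    """Yield lines (without terminators) with <desc> blocks removed."""
    inside = False
    for line in lines:
        line = line.rstrip("\r\n")
        if inside:
            end = line.find(CLOSE_TAG)
            if end == -1:
                continue  # line lies entirely inside a block
            # Keep only what follows the closing tag.
            line = line[end + len(CLOSE_TAG):]
            inside = False
        # Remove blocks that open and close on this line.
        line = SINGLE_LINE_DESC.sub("", line)
        start = line.find("<desc>")
        if start != -1:
            # A block opens here and continues on later lines.
            line = line[:start]
            inside = True
        yield line


sample = [
    "<wpt><name>A</name><desc>one\n",
    "two\n",
    "three</desc><sym>Flag</sym>\n",
    "<desc>x</desc><ele>1</ele>\n",
    "</wpt>\n",
]
cleaned = list(strip_desc_streaming(sample))
```

Passing an open file object as `lines` would process a file of any size, as long as no single line is too large for memory.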

Dave Benham

DosTips.com