Batchscript to extract texts from multiple lines

plasma33
Posts: 22
Joined: 26 Jul 2017 21:18

Batchscript to extract texts from multiple lines

#1 Post by plasma33 » 26 Jul 2017 21:34

Hi all,

I have a text file which contains the following texts:

Code: Select all

RGRGRKRGRHRGRGRGIGRMKHIGRMRRIGKMKMIGRHRLIGRIRNIGRLRGIGRKRGIGRGRGIGRQRHIGKLKHIGRGRIIGRGRGIGRGRGIGRGRGIGRRRRIGKKKKGDGARGRGRKRGRHRGRHRGIGRMKHIGRGRGIGKMKMIGRHRLIGRIRMIGRLRGIGRKRGIGRGRGIGRGRRIGKMKLIGRGRRIGRNRIIGR------KKHIGRGRGIGKMKMIGRHRLIGRLKLIGRLRGIGRK
|||||||||||||||||||||||||.|.||.||.|||...|||...|||...||....|||.|.|||.|.||....|||.|.|||||||||...|||...|||...||....|||||||||||||||||||||||||||||...|||||||||||||||.|.|||.|||||||.|||||||||...||....|||...|||.|.|||      |..||||||||.||.|||||.|||.|.|||.|.|||.
RGRGRKRGRHRGRGRGIGRMKHIGRGRKIGRMKHIGRLKHIGRMKHIGRHKLIGKMKMIGRHRLIGRGRGIGRQRGIGRKRNIGRGRGIGRMKHIGRNKMIGRMKHIGRRRQGDGARGRGRKRGRHRGRHRGIGRMKHIGRRKMIGKMKMIGRHRLIGRGRKIGRQRGIGRKRNIGRGRGIGRMKHIGRHRRIGRMKHIGRGRQIGRQRGIGRKRNIGRGRGIGRMKHIGRHRPIGRMKHIGRNRRIGRM


I would like to extract the aligned texts from the above in the format given below:

Code: Select all

RGRGRKRGRHRGRGRGIGRMKHIGR
|||||||||||||||||||||||||
RGRGRKRGRHRGRGRGIGRMKHIGR

R
|
R

IG
||
IG

MK
||
MK

IGR
|||
IGR

IGR
|||
IGR

IGR
|||
IGR

IG
||
IG

IGR
|||
IGR

R
|
R

IGR
|||
IGR

R
|
R

IG
||
IG

IGR
|||
IGR

R
|
R

IGRGRGIGR
|||||||||
IGRGRGIGR

IGR
|||
IGR

IGR
|||
IGR

IG
||
IG

GDGARGRGRKRGRHRGRHRGIGRMKHIGR
|||||||||||||||||||||||||||||
GDGARGRGRKRGRHRGRHRGIGRMKHIGR

IGKMKMIGRHRLIGR
|||||||||||||||
IGKMKMIGRHRLIGR

R
|
R

IGR
|||
IGR

RGIGRKR
|||||||
RGIGRKR

IGRGRGIGR
|||||||||
IGRGRGIGR

IG
||
IG

IGR
|||
IGR

IG
||
IG

R
|
R

R
|
R

IGR------K
|||      |
IGRQRGIGRK

IGRGRGIG
||||||||
IGRGRGIG

MK
||
MK

IGRHR
|||||
IGRHR

IGR
|||
IGR

K
|
K

IGR
|||
IGR

R
|
R

IGR
|||
IGR



I will have two separate text files. One will be the input file (which will contain the unstructured aligned texts) and the other will be the output file (the extracted pure aligned texts).

Thanks in advance guys.

Plasma33

penpen
Expert
Posts: 2009
Joined: 23 Jun 2013 06:15
Location: Germany

Re: Batchscript to extract texts from multiple lines

#2 Post by penpen » 26 Jul 2017 23:29

Is the line length different for multiple files?
If not, how long is the longest possible line?
How do you define/split aligned text? (Only at the dot character ('.')?)

penpen

elzooilogico
Posts: 128
Joined: 23 May 2016 15:39
Location: Spain

Re: Batchscript to extract texts from multiple lines

#3 Post by elzooilogico » 27 Jul 2017 02:51

This should work if yourfile.txt has three lines

Code: Select all

@echo off 
SetLocal DisableDelayedExpansion
set ^"LF=^

^" don't remove previous line     & rem line feed
set ^"\n=^^^%LF%%LF%^%LF%%LF%^^"  & rem newline with line continuation

:: get string length (macro definition)
:: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------
:: http://www.dostips.com/forum/viewtopic.php?f=3&t=2518

set STRLEN=for %%{ in (1 2) do if %%{==2 (%\n%
  for /F "tokens=1,2 delims=, " %%1 in ("!argv!") do (%\n%
    set "S=A!%%~2!"^&set "L=0"%\n%
    for /L %%A in (12,-1,0) do (set/a "L|=1<<%%A"^&for %%B in (!L!) do if "!S:~%%B,1!"=="" set/a "L&=~1<<%%A")%\n%
    for /F "delims=" %%} in ("!L!") do EndLocal^& set "%%1=%%~}"%\n%
  )%\n%
) else SetLocal EnableDelayedExpansion ^& set argv=,


set "inputFile=yourfile.txt"
set "outputFile=result.txt"

SetLocal EnableExtensions EnableDelayedExpansion

rem read three lines from file
<"%inputFile%" (
  set /p line1=
  set /p line2=
  set /p line3=
)

rem split string into substrings based on delimiter
rem http://www.dostips.com/forum/viewtopic.php?f=3&t=6429#p41035
set/a i=1
set "x!i!=%line2:.=" & set /A i+=1 & set "x!i!=%"

set/a next=0

rem parse the split string. a substring of length 0 means consecutive dots
>"%outputFile%" (
  for /L %%i in (1,1,!i!) do (

    set/a start%%i=!next!
   
    %STRLEN% len%%i,x%%i

    rem if len not 0 then
    if !len%%i! NEQ 0 (
      set/A next+=len%%i
      for %%a in (!start%%i!) do (
        for %%b in (!len%%i!) do (
          echo !line1:~%%a,%%b!
          echo !line2:~%%a,%%b!
          echo !line3:~%%a,%%b!
          echo/
        )
      )
    )
    rem advance start pointer
    set/A next+=1
  )
)
EndLocal
EndLocal
exit/B
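
The split-and-advance idea used in the script above can be sketched outside of batch. This is a minimal JavaScript sketch, not the poster's code: the function name `extractBySplit` and the sample strings are made up for illustration. It splits the marker line on '.', treats an empty piece as two dots in a row, and advances a start pointer by the piece length plus one for the dot that followed it.

```javascript
// Hypothetical helper: split the marker line on '.', then walk the pieces
// while tracking where each piece starts in the original lines.
function extractBySplit(line1, line2, line3) {
  var pieces = line2.split('.');      // "||..|" -> ["||", "", "|"]
  var out = [];
  var start = 0;
  for (var k = 0; k < pieces.length; k++) {
    var len = pieces[k].length;
    if (len !== 0) {                  // an empty piece means two dots in a row
      out.push([
        line1.substring(start, start + len),
        line2.substring(start, start + len),
        line3.substring(start, start + len)
      ]);
    }
    start += len + 1;                 // skip the piece and the dot after it
  }
  return out;
}
```

With the made-up input `extractBySplit('RGIGKMKM', '||..|.||', 'RGRRKHKM')` the sketch yields three triplets, starting at columns 0, 4 and 6.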

Aacini
Expert
Posts: 1913
Joined: 06 Dec 2011 22:15
Location: México City, México
Contact:

Re: Batchscript to extract texts from multiple lines

#4 Post by Aacini » 27 Jul 2017 07:11

Code: Select all

@echo off
setlocal EnableDelayedExpansion

rem Read the lines
set "i=0"
for /F "delims=" %%a in (input.txt) do (
   set /A i+=1
   set "line[!i!]=%%a"
)

rem Extract the lengths from second line
set "lens="
set "len=0"
set "lastChar=%line[2]:~0,1%"
for /L %%i in (0,1,8191) do if defined lastChar (
   set "newChar=!line[2]:~%%i,1!"
   if "!newChar!" equ "!lastChar!" (
      set /A len+=1
   ) else (
      set "lens=!lens! !len!"
      set "len=1"
      set "lastChar=!newChar!"
   )
)

rem Generate output over first three lines
set /A i=0, show=0
if "%line[2]:~0,1%" equ "|" set "show=1"
(for %%n in (%lens%) do (
   if !show! equ 1 (
      for %%i in (!i!) do for /L %%j in (1,1,3) do (
         echo !line[%%j]:~%%i,%%n!
      )
      echo/
   )
   set /A "i+=%%n, show=^!show"
)) > output.txt

EDIT: I simplified previous code a little:

Code: Select all

@echo off
setlocal EnableDelayedExpansion

rem Read the lines
set "i=0"
for /F "delims=" %%a in (input.txt) do (
   set /A i+=1
   set "line[!i!]=%%a"
)

rem Extract lengths from second line
set /A i=0, len=0, show=0
set "lastChar=%line[2]:~0,1%"
if "%lastChar%" equ "|" set "show=1"
(for /L %%i in (0,1,8191) do if defined lastChar (
   set "newChar=!line[2]:~%%i,1!"
   if "!newChar!" equ "!lastChar!" (
      set /A len+=1
   ) else (
      rem Generate output over first three lines
      if !show! equ 1 (
         for /F "tokens=1,2" %%i in ("!i! !len!") do for /L %%k in (1,1,3) do (
            echo !line[%%k]:~%%i,%%j!
         )
         echo/
      )
      set /A "i+=len, len=1, show=^!show"
      set "lastChar=!newChar!"
   )
)) > output.txt

Antonio

dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: Batchscript to extract texts from multiple lines

#5 Post by dbenham » 27 Jul 2017 12:04

@Aacini - Your last code fails if the last character(s) in the line are supposed to be printed. Easily fixed if you check show after the loop and print out the last string if true.

Here is my working variant of your general technique:

Code: Select all

@echo off
setlocal enableDelayedExpansion
set "input=test.txt"
set "output=output.txt"

for /f "delims=: tokens=1*" %%A in ('findstr /n "^" "%input%"') do set "ln%%A=%%B"

set "beg="
>"%output%" (
  for /l %%N in (0 1 8191) do (
    if "!ln2:~%%N,1!" equ "|" (
      if not defined beg set /a beg=%%N
    ) else if defined beg (
      set /a len=%%N-beg
      for %%A in ("!beg!,!len!") do for /l %%N in (1 1 3) do echo !ln%%N:~%%~A!
      echo(
      set "beg="
    )
  )
  if defined beg for %%B in (!beg!) do for /l %%N in (1 1 3) do echo !ln%%N:~%%B!
)

And here is a version that relies on JREPL.BAT

Code: Select all

@echo off
setlocal enableDelayedExpansion
set "input=test.txt"
set "output=output.txt"

for /f "delims=: tokens=1*" %%A in ('findstr /n "^" "%input%"') do set "ln%%A=%%B"

set /a pos=0
>"%output%" (
  for /f "tokens=1,2" %%A in (
    'jrepl "(.)\1*" "$txt=$0.length+' '+$1" /inc 2 /jmatchq /f "%input%"'
  ) do (
    if "%%B" == "|" (
      for %%P in (!pos!) do for /l %%N in (1 1 3) do echo !ln%%N:~%%P,%%A!
      echo(
    )
    set /a pos+=%%A
  )
 )


Dave Benham

elzooilogico
Posts: 128
Joined: 23 May 2016 15:39
Location: Spain

Re: Batchscript to extract texts from multiple lines

#6 Post by elzooilogico » 28 Jul 2017 04:55

@Antonio and @Dave, wow, your code is brilliant! But please let me raise a point. The sequence

Code: Select all

IGR------K
|||      |
IGRQRGIGRK

in your scripts, is parsed as (also using jrepl.bat)

Code: Select all

IGR
|||
IGR

K
|
K



Maybe I missed something?

Aacini
Expert
Posts: 1913
Joined: 06 Dec 2011 22:15
Location: México City, México
Contact:

Re: Batchscript to extract texts from multiple lines

#7 Post by Aacini » 28 Jul 2017 15:43

dbenham wrote:@Aacini - Your last code fails if the last character(s) in the line are supposed to be printed.

Err... No, it doesn't. Did you test it?

The condition to show characters is the ELSE part of this IF:

Code: Select all

   set "newChar=!line[2]:~%%i,1!"
   if "!newChar!" equ "!lastChar!" (

The ELSE part is normally executed when there is a character change between "|" and ".", so "newChar" is different than "lastChar", but when the line ends "newChar" is empty, so it is also different than "lastChar=|" and the last character(s) are also printed...
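
The end-of-line behaviour described here can be mimicked in a minimal JavaScript sketch. It is an illustration only: `runsByChange` is a hypothetical name, and JavaScript's `charAt` returning an empty string past the end stands in for the empty substring expansion in batch. The loop runs over a fixed range, as FOR /L does, and the final run is flushed because the empty character differs from the last real one.

```javascript
// Hypothetical helper: scan a fixed range of positions and flush a run
// whenever the character changes. Past the end of the string charAt()
// returns '', which also counts as a change and flushes the last run.
function runsByChange(marks, limit) {
  var runs = [];
  var last = marks.charAt(0);
  var start = 0;
  for (var i = 0; i <= limit && last !== ''; i++) {
    var ch = marks.charAt(i);
    if (ch !== last) {
      runs.push({ ch: last, start: start, len: i - start });
      start = i;
      last = ch;                      // becomes '' at end of line, ending the scan
    }
  }
  return runs;
}
```

For the marker string `'||..|||'` and a limit of 8191 this returns three runs, and the last one (three pipes starting at column 4) is included even though no dot follows it.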


elzooilogico wrote:@Antonio and @Dave, wow, your code is brilliant! But please let me raise a point. The sequence

Code: Select all

IGR------K
|||      |
IGRQRGIGRK

in your scripts, is parsed as (also using jrepl.bat)

Code: Select all

IGR
|||
IGR

K
|
K



Maybe I missed something?


Well, my code doesn't check for specific "|", "." or space characters, just for a change between them, so spaces placed between pipes are processed the same as if they were dots; the output example requires spaces to be processed as if they were pipes. This detail is fixed in the new code below.
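
The difference between the two readings can be sketched in JavaScript. This is only an illustration under stated assumptions: `shownRuns` and `spacesAsPipes` are hypothetical names, the first toggles between "show" and "hide" on every character change, and the second replaces spaces with pipes before the scan.

```javascript
// Hypothetical helper: alternate show/hide on every character change and
// return [start, length] for each shown run. The scan starts in "show"
// mode only if the marker line begins with a pipe.
function shownRuns(marks) {
  var out = [];
  var start = 0;
  var show = marks.charAt(0) === '|';
  for (var i = 1; i <= marks.length; i++) {
    // charAt(marks.length) is '', so the final run is flushed as well
    if (marks.charAt(i) !== marks.charAt(i - 1)) {
      if (show) out.push([start, i - start]);
      start = i;
      show = !show;
    }
  }
  return out;
}

// Hypothetical helper: treat gap spaces as if they were pipes.
function spacesAsPipes(marks) {
  return marks.split(' ').join('|');
}
```

For the marker string `'|||      |.||'` the plain scan gives the runs `[0,3]`, `[9,1]` and `[11,2]`, which splits the gapped sequence in two. After `spacesAsPipes` the same scan gives `[0,10]` and `[11,2]`, keeping the ten gapped columns together.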


I wrote a new solution for this problem that uses an entirely different approach. A quick test shows that this new method runs in less than 12% of the time of my first version, that is, it is approximately eight times faster...

Code: Select all

@echo off
setlocal EnableDelayedExpansion

rem Put here the maximum number of consecutive dots:
set "nDots=4"
set "dots=" & for /L %%i in (1,1,%nDots%) do set "dots=!dots!."

rem Read the lines
for /F "tokens=1* delims=:" %%a in ('findstr /N "^" input.txt') do set "ln%%a=%%b"

rem Extract lengths from second line
set "l2=%ln2: =|%."
for /L %%i in (%nDots%,-1,1) do for /F %%d in ("!dots:~0,%%i!") do set "l2=!l2:%%d="_"@"_/A"i+=%%i+j,j=!"
set "l2=%l2:@=n=^!n^! "^^^!i^^^!,^^^!j^^^!"%" & set "l2=!l2:|=+1!"
set "n=" & set "i=0" & 2> nul set /A "j=%l2:_= & set %0" & set "n=!n:"0,"=!"

rem Generate output over first three lines
(for %%n in (%n%) do (
   for /L %%j in (1,1,3) do echo !ln%%j:~%%~n!
   echo/
)) > output.txt

Antonio

plasma33
Posts: 22
Joined: 26 Jul 2017 21:18

Re: Batchscript to extract texts from multiple lines

#8 Post by plasma33 » 28 Jul 2017 20:48

Hello all,

First of all, thank you all for providing the code, and sorry for my delayed response. I tried all of your code, but it doesn't seem to give me the desired output. I am providing one of the input text files for your reference (download link below). The file is around 2.86 MB. I have around 15 MB worth of text files in the same format as demonstrated in my original post.

Download link (input.txt): https://www.mediafire.com/?23l3nni3kgtb7o7

Much appreciated for all your kind effort.

Plasma33

Aacini
Expert
Posts: 1913
Joined: 06 Dec 2011 22:15
Location: México City, México
Contact:

Re: Batchscript to extract texts from multiple lines

#9 Post by Aacini » 28 Jul 2017 22:03

The maximum line length that can be processed in a Batch file is 8191 characters... :cry:

Antonio

plasma33
Posts: 22
Joined: 26 Jul 2017 21:18

Re: Batchscript to extract texts from multiple lines

#10 Post by plasma33 » 29 Jul 2017 00:18

Aacini wrote:The maximum line length that can be processed in a Batch file is 8191 characters... :cry:

Antonio


All good :) I will make do with this. I forgot about the maximum line length of 8191 characters. Thanks again.

Plasma33

plasma33
Posts: 22
Joined: 26 Jul 2017 21:18

Re: Batchscript to extract texts from multiple lines

#11 Post by plasma33 » 29 Jul 2017 00:53

Hello again,

Is there a way to process the first set of 8191 characters, then move on to the next set of 8191 characters, and so on, saving all the results in a single output file, without losing any aligned texts along the way? For example, the end of the first set of 8191 characters might not contain a complete aligned text, which might continue in the next set of 8191 characters. A set doesn't have to be exactly 8191 characters every time, as long as it stays within the batch maximum line length limit. Below is a good example of what I am trying to convey:

Code: Select all

Demonstration purposes only:
Not suitable:
Last lines of first set of 8191 characters:
..........IGRKRRIGKRKMGDGARGRGRGRIRIRN
..........||..|.||....||||||||||||||||
..........IGKMRQIGRGRPGDGARGRGRGRIRIRN
First lines of second set of 8191 characters:
RRRGIGRI------RMIGKMK..........
|||||||.|||||||.|||||..........
RRRGIGRMKLIGRHRLIGKMK..........

Suitable:
Last lines of first set of 8191 characters:
..........IGRKRRIGKRKMGDGARGRGRGRIRIRNRRRGIGR
..........||..|.||....|||||||||||||||||||||||
..........IGKMRQIGRGRPGDGARGRGRGRIRIRNRRRGIGR
First lines of second set of 8191 characters:
I------RMIGKMK..........
.|||||||.|||||..........
MKLIGRHRLIGKMK..........


Fingers crossed if we can make it work. The best text file to test your code can be downloaded from the link below:
https://www.mediafire.com/?23l3nni3kgtb7o7
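
One way to picture the chunking asked for here is to choose every cut position so that it falls on a '.' in the marker line. The following JavaScript is a rough sketch under that assumption, not a tested solution for the linked file: `chunkStarts` is a hypothetical name, and the same cut positions would be applied to all three lines because their columns are aligned.

```javascript
// Hypothetical helper: return the start column of each chunk so that every
// chunk has at most `limit` characters and every cut falls on a '.' in the
// marker line, never inside an aligned run. Returns null if a single run
// is longer than the limit.
function chunkStarts(marks, limit) {
  var starts = [0];
  var pos = 0;
  while (marks.length - pos > limit) {
    var cut = pos + limit;            // first column that no longer fits
    // back up until the next chunk would begin with a dot
    while (cut > pos && marks.charAt(cut) !== '.') cut--;
    if (cut === pos) return null;     // no dot found within the limit
    starts.push(cut);
    pos = cut;
  }
  return starts;
}
```

With the made-up marker string `'||.|||.||||.|'` and a limit of 5, the sketch returns the start columns 0, 2, 6 and 11, so no run of pipes is split between two chunks.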

Plasma33

elzooilogico
Posts: 128
Joined: 23 May 2016 15:39
Location: Spain

Re: Batchscript to extract texts from multiple lines

#12 Post by elzooilogico » 29 Jul 2017 05:27

This is a batch and c# hybrid script.

If the c# exe is not found, the batch creates it. Else, simply call the c# exe with desired input and output filenames.

Once the exe is created, you can make a simple call to it and remove all the useless stuff.

The C# code is not the best approach, as it processes the input file char by char (as your file may be a large one).

On my laptop (i5 with an SSD) it swallows your 3 MB file in less than a second.

Hope it helps.

NOTE The .NET Framework must be installed on your system.

NOTE As the file you provide has three lines, this processes only three lines of any length (without newline concatenation nor further processing). For more lines (or very big files), code must be reworked.

Code: Select all

//>nul 2>nul||@goto :batch_code
/*
:batch_code
@echo off
setlocal
rem place desired exe name
set "theExeFile=myParser.exe"

rem input and output files. use the full path plus filename, e.g.
rem set "inputFile=C:\Users\Michael\Desktop\input.txt"
rem set "outputFile=C:\Users\Michael\Desktop\output.txt"
set "inputFile=.\input.txt"
set "outputFile=.\output2.txt"

if not exist "%theExeFile%" call :build_the_exe || exit/B

echo Processing, wait...
rem for /F "tokens=*" %%a in ('%theExeFile% "%inputFile%" "%outputFile%"') do echo %errorlevel% %%a
%theExeFile% "%inputFile%" "%outputFile%"
if %errorlevel% NEQ 0 (
  echo Error code %errorlevel%
) else ( 
  echo Done.
)
endlocal
exit /b 0



:build_the_exe
:: find csc.exe
set "frm=%SystemRoot%\Microsoft.NET\Framework\"
for /f "tokens=* delims=" %%v in ('dir /b /a:d  /o:-n "%SystemRoot%\Microsoft.NET\Framework\v*"') do (
   set netver=%%v
   goto :break_loop
)
:break_loop
set "csc=%frm%%netver%\csc.exe"
:: csc not found
if not exist "%csc%" echo/&echo/Warning: .NET Framework not found&exit/B 1
::csc found
call %csc% /nologo /out:"%theExeFile%" "%~dpsfnx0"
exit/B 0
*/


//begin c# code
using System;
using System.IO;
using System.Linq;
using System.Collections.Generic;

namespace ElZooilogico
{
  public class Parser
  {
    private static string[] readArray = new string[] {};

    private static void parse(string inputFile, string outputFile)
    {
      readArray = File.ReadLines(inputFile).Take(3).ToArray();

      using ( System.IO.StreamWriter output = new System.IO.StreamWriter(outputFile) )
      {
        char last = '\0';
        int start = 0, end = 0;

        foreach(char c in readArray[1])
        {
          if ( c == '.' )
          {
            if ( last == '.' )
              start++;
            else
            {
              // skip the empty run when the marker line starts with a dot
              if ( end > start )
              {
                for ( int j = 0; j < 3; j++ )
                  output.WriteLine(readArray[j].Substring(start, end-start));

                output.WriteLine();
              }
              start = end+1;
            }
          }
          last=c;
          end++;
        }
        if (end > start)
        {
          for ( int j = 0; j < 3; j++ )
            output.WriteLine(readArray[j].Substring(start, end-start));
          output.WriteLine();
        }
      } // output file is closed here

    }

    public static int Main(string[] args)
    {
      try {
        if ( args.Length != 2 )
          return 9;
        if ( !File.Exists(args[0]) )
          return 1;
        parse(args[0], args[1]);
      } catch { return 3; }
      return 0;
    }

  } // class Parser

} // namespace ElZooilogico

dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: Batchscript to extract texts from multiple lines

#13 Post by dbenham » 29 Jul 2017 08:31

Aacini wrote:
dbenham wrote:@Aacini - Your last code fails if the last character(s) in the line are supposed to be printed.

Err... No, it doesn't. Did you test it?
Obviously not, sorry :oops:
Half my brain knew that expanding past the end returned nothing, and that you were looping through the max possible string length.
Yet the other half my brain treated your loop as if it terminated at the end of the string - Doh :roll:

@plasma33 - Here is a batch / JScript hybrid that works with long lines.

JScript is available on all Windows machines, whereas .NET may not be available.

The solution could have been pure JScript. The hybrid batch portion only serves to make the script easier to call.

ExtractRuns.bat

Code: Select all

@if (@X)==(@Y) @end /* Harmless hybrid line that begins a JScript comment

:: Batch Code
@cscript.exe //E:JScript //nologo "%~f0"
@exit /b

:: Jscript Code */
var ln1 = WScript.StdIn.ReadLine(),
    ln2 = WScript.StdIn.ReadLine(),
    ln3 = WScript.StdIn.ReadLine();

for (var beg=false, i=0; i<ln2.length; i++) {
  if (ln2.charAt(i) !== '.') {
    if (beg === false) beg=i;
  } else {
    if (beg !== false) output(beg,i);
    beg = false;
  }
}
if (beg !== false) output(beg,ln2.length);

function output( beg, end ) {
  WScript.StdOut.WriteLine(
    ln1.substring(beg,end)+'\n'+
    ln2.substring(beg,end)+'\n'+
    ln3.substring(beg,end)+'\n'
  );
}


To use, simply use stdin and stdout redirection:

Code: Select all

ExtractRuns <input.txt >output.txt


If you have a program that generates the input, then you can use a pipe and avoid creation of the intermediate file:

Code: Select all

yourProgram | ExtractRuns >output.txt


Dave Benham

plasma33
Posts: 22
Joined: 26 Jul 2017 21:18

Re: Batchscript to extract texts from multiple lines

#14 Post by plasma33 » 29 Jul 2017 19:53

Hello gentlemen,

@elzooilogico and @Dave, both of your scripts work like a charm, and within a couple of seconds. Unbelievable :) Thanks for all your effort.

Plasma33

rojo
Posts: 26
Joined: 14 Jan 2015 13:51

Re: Batchscript to extract texts from multiple lines

#15 Post by rojo » 30 Jul 2017 00:22

I had another idea of using System.IO.StreamReader() to read the text file, to avoid loading the entire thing into active memory as the previous solutions all do. With plasma33's example text file, this method is slower than dbenham's JScript hybrid (around 33 seconds for my script, versus around 8 for his -- probably because seeks on disk are much slower than seeks in RAM). But if plasma33's data file is likely to grow to a size where loading the file's full contents into RAM is impractical, he might find this a useful alternative.

This is a hybrid Batch + PowerShell script that should be saved with a .bat extension.

Code: Select all

<# : begin Batch portion
@echo off & setlocal

set "infile=input.txt"
set "outfile=output.txt"

powershell -noprofile "iex (${%~f0} | out-string)"
echo Done.
exit /b

: end Batch / begin PowerShell hybrid code #>

$reader = new-object IO.StreamReader[] 3
0..2 | %{ $reader[$_] = new-object IO.StreamReader($env:infile) }
$writer = new-object IO.StreamWriter($env:outfile)
$reader[1,2,2].ReadLine() | out-null

write-host -n "Working... "

while (@(13,10) -notcontains $char) {
   [char[]]$chunk = @()
   $char = $reader[1].Read()

   while ($char -eq 46) {
      $reader[0,2].Read() | out-null
      $char = $reader[1].Read()
   }

   while (@(13,10,46,-1) -notcontains $char -and $chunk.length -lt 8191) {
      $chunk += ,$char
      $char = $reader[1].Read()
   }

   if ($chunk.Length) {

      $buffer = new-object char[] $chunk.Length
      [void]$reader[0].Read($buffer, 0, $buffer.length)
      $writer.WriteLine($buffer)
      $writer.WriteLine($chunk)
      [void]$reader[2].Read($buffer, 0, $buffer.length)
      $writer.WriteLine($buffer)
      $writer.WriteLine()
      $reader[0,2].Read() | out-null
   }
}

$reader | %{ $_.Close() }
$writer.Close()


This script opens three streams for reading to the data file, one for each line. At first I played with using a single stream for reading and manipulating the file pointer using $reader.BaseStream.Position, but having a single stream seeking back and forth in the file was much slower than maintaining a stream for each line and moving forward only. This is about as efficient as I can make it.
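
The lockstep between the three readers can be sketched in JavaScript. This is a simplified illustration, not a port of the script: the `Cursor` object is a hypothetical in-memory stand-in for a stream reader, the function names are made up, and the 8191-character cap on a chunk is left out.

```javascript
// Hypothetical forward-only cursor over one line; a stand-in for a stream reader.
function Cursor(text) { this.text = text; this.pos = 0; }
Cursor.prototype.read = function () {   // next character, or '' at the end
  return this.pos < this.text.length ? this.text.charAt(this.pos++) : '';
};
Cursor.prototype.take = function (n) {  // next n characters
  var s = this.text.substring(this.pos, this.pos + n);
  this.pos += s.length;
  return s;
};

// One cursor per line, all moving forward only. The marker cursor decides
// what happens; the other two either skip a column or copy a run.
function extractWithCursors(line1, line2, line3) {
  var top = new Cursor(line1), mid = new Cursor(line2), bot = new Cursor(line3);
  var out = [];
  var ch = mid.read();
  while (ch !== '') {
    if (ch === '.') {                   // gap column: keep the others in step
      top.read();
      bot.read();
      ch = mid.read();
      continue;
    }
    var chunk = '';
    while (ch !== '' && ch !== '.') {   // collect one run of markers
      chunk += ch;
      ch = mid.read();
    }
    out.push(top.take(chunk.length), chunk, bot.take(chunk.length), '');
  }
  return out;
}
```

Joining the result of `extractWithCursors('ABCDEFG', '||.|..|', 'abcdefg')` with newlines gives the triplets for columns 0 to 1, 3 and 6, each followed by an empty line.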
Last edited by rojo on 30 Jul 2017 17:35, edited 3 times in total.
