Alignment using batchscript

Discussion forum for all Windows batch related topics.

Moderator: DosItHelp

plasma33
Posts: 22
Joined: 26 Jul 2017 21:18

Alignment using batchscript

#1 Post by plasma33 » 27 Aug 2017 01:15

Hi guys,

I am trying to find the common substrings between two strings using a one-to-one (position-by-position) alignment. My input will be two strings made up of the characters A-Z. An example is shown below:
Input:

Code: Select all

String1
GIGRGRGIGRGRGGDGARGRHRGRGRHRGRQRGIGKMKMIGKMKMIGKMKMIGKMKMIGRPRGIGRHRIIGRGRGIGRGRHIGKHKIIGRHRIIGRGRGIGRGRHIGRKKLIGRKRIIGRRRHIGRRRGGDGARGRHRGRGRHRGRPRGIGRGRGIGRGRGIGRGRGIGRGRGIGRIKLIGRQRRIGRNKKIGRRKIIGRGRHIGRGRGIGRGRGIGRGRGIGRKRPIGRLRRIGRKRPIGRKRNGDGARGRHRGRGRHRGK
String2
RIGRKRKIGRLRRGDGARGRHRGRGRHRGRQRGIGRPKRIGKLRKIGKMKKIGKHKIIGRLRMIGKKKIIGKLRHIGRKRLIGKIKMIGRHRGIGRHKHIGRNRGIGRIRHIGKMKIIGKKRHIGRGRMGDGARGRHRGRGRHRGRPRGIGRGRHIGRGRGIGRGRGIGRGRGIGRNRRIGRNRPIGRNRRIGRNRKIGRNRIIGRMRHIGRNRPIGRIKLIGRNRGIGRMRKIGRMRIIGRGRGGDGARGRHRGRGRHRGK


Output:

Code: Select all

Common Substring
IGR
R
IGR
R
GDGARGRHRGRGRHRGRQRGIG
K
IGK
IGKMK
IGK
K
IGR
R
IG
IIG
R
IGR
R
IGK
K
IGRHR
IGR
IGR
R
IGR
IG
IIG
RHIGR
R
GDGARGRHRGRGRHRGRPRGIGRGR
IGRGRGIGRGRGIGRGRGIGR
IGR
R
IGRN
IGR
IGR
R
IGR
R
IGR
R
IGR
IGR
R
IGR
R
IGR
R
IGR
R
GDGARGRHRGRGRHRGK


I am attaching the full text file for your reference. Please see the following link:
https://www.mediafire.com/file/cfdk5j2j5vlwjnq/input.txt

Thanks, guys.

Plasma33
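
The expected output above is consistent with comparing the two strings position by position and printing each run of consecutive matching characters. A minimal sketch of that reading in plain JavaScript follows; the function name `commonRuns` is made up for illustration, and the two test strings are just the first 17 characters of String1 and String2.

```javascript
// Collect runs of consecutive positions where both strings hold the same character.
// Hypothetical helper, not taken from any post in this thread.
function commonRuns(s1, s2) {
  var len = Math.min(s1.length, s2.length); // compare only the overlapping part
  var runs = [];
  var start = -1; // index where the current run began, or -1 if no run is open
  for (var i = 0; i < len; i++) {
    if (s1.charAt(i) === s2.charAt(i)) {
      if (start < 0) start = i;          // a new run starts here
    } else if (start >= 0) {
      runs.push(s1.slice(start, i));     // a mismatch closes the open run
      start = -1;
    }
  }
  if (start >= 0) runs.push(s1.slice(start, len)); // run that reaches the end
  return runs;
}

console.log(commonRuns("GIGRGRGIGRGRGGDGA", "RIGRKRKIGRLRRGDGA").join("\n"));
```

The final `if` matters for this data: the sample output ends with a run that reaches the last character of both strings.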

DosItHelp
Expert
Posts: 239
Joined: 18 Feb 2006 19:54

Re: Alignment using batchscript

#2 Post by DosItHelp » 27 Aug 2017 02:18

plasma33,

Here is a longest-common-sequence implementation http://www.dostips.com/?t=Experimental.StringDiff as a starting point.
It could be improved and converted to a function.

aGerman
Expert
Posts: 4678
Joined: 22 Jan 2010 18:01
Location: Germany

Re: Alignment using batchscript

#3 Post by aGerman » 27 Aug 2017 05:49

The string limit in Batch is 8191 characters. The lines in the text file you uploaded are far longer than that.

Steffen

plasma33
Posts: 22
Joined: 26 Jul 2017 21:18

Re: Alignment using batchscript

#4 Post by plasma33 » 27 Aug 2017 19:35

Hello all,

@DosItHelp, thank you for the link. I will give that a try.

@aGerman, is there any way a hybrid (i.e. a combination of batch script with VBScript or JScript) could resolve the string limit issue? Or could it be done by splitting the string into several smaller strings?

@Aacini and @dbenham, I need your take on this, please.

Thanks, all.

Plasma33

Aacini
Expert
Posts: 1913
Joined: 06 Dec 2011 22:15
Location: México City, México

Re: Alignment using batchscript

#5 Post by Aacini » 27 Aug 2017 22:40

Test this code and report any problem:

Code: Select all

@echo off
setlocal EnableDelayedExpansion

if exist input1.txt if exist input2.txt goto SecondPart

rem First part: Split a file with 2 very long lines into 2 files with shorter lines
rem http://www.dostips.com/forum/viewtopic.php?f=3&t=4945
echo Processing big input file...
for /F %%a in ('copy /Z "%~F0" NUL') do set "CR=%%a"
del input1.txt input2.txt 2> nul
set "in=1"
call :SplitLines < input.txt
goto SecondPart


:SplitLines
echo/
echo Reading input line # %in%
set "lineNum=0"
:loopLine
   set /A lineNum+=2
   set /P "=Output line: %lineNum%!CR!" < NUL
   set /P "line="
   >> input%in%.txt echo %line:~0,512%
   >> input%in%.txt echo/%line:~512%
if "%line:~1022%" neq "" goto loopLine
echo/
set /A in+=1
if %in% leq 2 goto SplitLines
echo/
exit /B



:SecondPart
echo Processing input files...
set "last="
set "lineNum=0"
< input2.txt (
for /F "delims=" %%a in (input1.txt) do (
   set /A lineNum+=1
   set /P "=Input line: !lineNum!!CR!" < NUL > CON
   set "line1=%%a"
   set /P "line2="

   if defined last (
      set /P "=!last!" < NUL & set "last="
      if "!line1:~0,1!" neq "!line2:~0,1!" echo/
   )

   set "start="
   for /L %%i in (0,1,511) do (
      if "!line1:~%%i,1!" equ "!line2:~%%i,1!" (
         if not defined start (set /A "start=end=%%i") else set "end=%%i"
         if %%i equ 511 (
            for /F %%m in ("!start!") do set "last=!line1:~%%m!"
         )
      ) else if defined start (
         set /A "len=end-start+1"
         for /F "tokens=1,2" %%m in ("!start! !len!") do echo !line1:~%%m,%%n!
         set "start="
      )
   )
)
if defined last echo !last!
) > output.txt
echo/

Antonio
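
The batch script above works around the line-length limit by cutting each long line into 512-character pieces and keeping a match that reaches the end of a piece in `last`, so it can be joined with the start of the next piece. The same idea can be sketched in plain JavaScript, assuming both strings are already in memory; `commonRunsChunked`, `chunkSize` and `carry` are illustrative names, not part of the script.

```javascript
// Compare two strings piece by piece, carrying an open run across piece boundaries.
// Hypothetical helper that mirrors the role of the batch variable "last".
function commonRunsChunked(s1, s2, chunkSize) {
  var len = Math.min(s1.length, s2.length);
  var runs = [];
  var carry = ""; // run that is still open when a piece ends
  for (var offset = 0; offset < len; offset += chunkSize) {
    var end = Math.min(offset + chunkSize, len);
    var piece1 = s1.slice(offset, end);
    var piece2 = s2.slice(offset, end);
    for (var i = 0; i < piece1.length; i++) {
      if (piece1.charAt(i) === piece2.charAt(i)) {
        carry += piece1.charAt(i);   // extend the open run
      } else if (carry !== "") {
        runs.push(carry);            // a mismatch closes the run
        carry = "";
      }
    }
  }
  if (carry !== "") runs.push(carry); // run that reaches the very end
  return runs;
}

// A tiny chunk size forces several runs to span a piece boundary.
console.log(commonRunsChunked("GIGRGRGIGRGRGGDGA", "RIGRKRKIGRLRRGDGA", 4).join("\n"));
```

Because the open run is only flushed on a mismatch or at the very end, the result should not depend on the chunk size.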

aGerman
Expert
Posts: 4678
Joined: 22 Jan 2010 18:01
Location: Germany

Re: Alignment using batchscript

#6 Post by aGerman » 28 Aug 2017 01:48

plasma33 wrote: @aGerman, is there any way a hybrid

Yes, why not?

Code: Select all

@if (@a)==(@b) @end /*

@echo off &setlocal
set "filename=test.txt"
cscript //nologo //e:jscript "%~fs0" "%filename%"
pause


exit /b&::*/
var objFile = WScript.CreateObject('Scripting.FileSystemObject').OpenTextFile(WScript.Arguments(0)),
    str1 = objFile.ReadLine(), str2 = objFile.ReadLine(),
    len = Math.min(str1.length, str2.length),
    bSame = false, strMatch = '', i, chr;

objFile.Close();
for (i = 0; i < len; ++i) {
  chr = str1.charAt(i);
  if (chr == str2.charAt(i)) {
    strMatch += chr;
    if (!bSame) {
      bSame = true;
    }
  } else {
    if (bSame) {
      WScript.Echo(strMatch);
      bSame = false;
      strMatch = '';
    }
  }
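}
// Also print a match that runs up to the end of the compared strings.
if (bSame) {
  WScript.Echo(strMatch);
}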

Steffen

elzooilogico
Posts: 128
Joined: 23 May 2016 15:39
Location: Spain

Re: Alignment using batchscript

#7 Post by elzooilogico » 28 Aug 2017 05:58

For the sake of speed...

Code: Select all

//>nul 2>nul||@goto :batch_code
/*
:batch_code
@echo off

rem place desired exe name
set "theExeFile=myParser.exe"
set "inputFile=.\input.txt"
set "outputFile=.\output.txt"

if not exist "%theExeFile%" call :build_the_exe || exit/B

echo Processing, wait...
%theExeFile% "%inputFile%" "%outputFile%"
if %errorlevel% NEQ 0 (
  echo Error code %errorlevel%
) else (
  echo Done.
)
endlocal
exit /b 0



:build_the_exe
for /f "tokens=* delims=" %%v in ('dir /b /s /a:-d /o:-n "%SystemRoot%\Microsoft.NET\Framework\csc.exe"') do (
   set "csc=%%v"
)
if "%csc%" == "" echo/&echo/Warning: Net Framework Not Found&exit/B 1
call "%csc%" /nologo /out:"%theExeFile%" "%~dpsfnx0"
exit/B 0
*/


//begin c# code
using System;
using System.IO;

namespace ElZooilogico
{
  public class Parser
  {
    private static void parse(string inputFile, string outputFile)
    {
      string[] buffer = null;
      string match = string.Empty;

      using ( System.IO.StreamWriter output = new System.IO.StreamWriter(outputFile) )
      {
        buffer = File.ReadAllLines(inputFile);

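        // Compare only up to the shorter line to avoid an IndexOutOfRangeException.
        int len = Math.Min(buffer[0].Length, buffer[1].Length);
        for ( int i = 0; i < len; i++ )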
        {
          if ( buffer[0][i] == buffer[1][i] )
            match += buffer[0][i];
          else if ( match != string.Empty )
          {
            output.WriteLine(match);
            match = string.Empty;
          }
        }
        if ( match != string.Empty )
          output.WriteLine(match);
      }
      return;
    }

    public static int Main(string[] args)
    {
      try {
        if ( args.Length != 2 )
          return 9;
        if ( !File.Exists(args[0]) )
          return 1;
        parse(args[0], args[1]);
      } catch (Exception e) { System.Windows.Forms.MessageBox.Show(e.Message); return 3; }
      return 0;
    }

  } // class Parser

} // namespace ElZooilogico

Or, to process a huge file in chunks:

Code: Select all

//>nul 2>nul||@goto :batch_code
/*
:batch_code
@echo off

rem place desired exe name
set "theExeFile=myParser.exe"
set "inputFile=.\input.txt"
set "outputFile=.\output.txt"

if not exist "%theExeFile%" call :build_the_exe || exit/B

echo Processing, wait...
%theExeFile% "%inputFile%" "%outputFile%"
if %errorlevel% NEQ 0 (
  echo Error code %errorlevel%
) else (
  echo Done.
)
endlocal
exit /b 0



:build_the_exe
for /f "tokens=* delims=" %%v in ('dir /b /s /a:-d /o:-n "%SystemRoot%\Microsoft.NET\Framework\csc.exe"') do (
   set "csc=%%v"
)
if "%csc%" == "" echo/&echo/Warning: Net Framework Not Found&exit/B 1
call "%csc%" /nologo /out:"%theExeFile%" "%~dpsfnx0"
exit/B 0
*/


//begin c# code
using System;
using System.IO;

namespace ElZooilogico
{
  public class Parser
  {
    const int blockSize = 4096;
    private static int filePtr = 0;

    private static void parse(string inputFile, string outputFile)
    {
      long size=0;
      int  chunks=0, rest=0;
      char[] buffer1 = new char[blockSize], buffer2 = new char[blockSize];

      using (FileStream fs = File.OpenRead(inputFile))
      {
        chunks = Convert.ToInt32((fs.Length / blockSize));

        using (BinaryReader br = new BinaryReader(fs))
        {
          for ( int i = 0; i < chunks; i++ )
          {
            buffer1 = br.ReadChars(blockSize);

            // Stop one short of the end: the test below also reads buffer1[j+1].
            for ( int j = 0; j + 1 < buffer1.Length; j++ )
            {
              if ( (buffer1[j] == 13 && buffer1[j+1] == 10) || (buffer1[j] == 10 && buffer1[j+1] == 13) )
              {
                if ( filePtr == 0 )
                {
                  filePtr = Convert.ToInt32(size + j + 2);
                  break;
                }
              }
            }
            size += blockSize;
          }
          chunks = Convert.ToInt32((filePtr / blockSize));
          rest = Convert.ToInt32(filePtr - (chunks*blockSize));
        }
      }

      using ( System.IO.StreamReader  line1 = new System.IO.StreamReader(inputFile),
                                      line2 = new System.IO.StreamReader(inputFile) )
      {
        string match = string.Empty;

        line2.BaseStream.Seek(filePtr, SeekOrigin.Current);

        using ( System.IO.StreamWriter output = new System.IO.StreamWriter(outputFile) )
        {
          for ( int i = 0; i < chunks; i++ )
          {
            System.Console.Out.Write("\rBlock {0} of {1}", i.ToString(), chunks.ToString());

            line1.Read(buffer1, 0, blockSize);
            line2.Read(buffer2, 0, blockSize);

            for ( int j = 0; j < blockSize; j++ )
            {
              if ( buffer1[j] == buffer2[j] )
                match += buffer1[j];
              else if ( match != string.Empty )
              {
                output.WriteLine(match);
                match = string.Empty;
              }
            }
          }
          Console.WriteLine("\rBlock " + chunks.ToString()+ " of " + chunks.ToString());

          if ( rest > 0 )
          {
            line1.Read(buffer1, 0, rest);
            line2.Read(buffer2, 0, rest);

            for ( int j = 0; j < rest; j++ )
            {
              if ( buffer1[j] == buffer2[j] )
                match += buffer1[j];
              else if ( match != string.Empty )
              {
                output.WriteLine(match);
                match = string.Empty;
              }
            }
          }
          if ( match != string.Empty )
            output.WriteLine(match);
        } // StreamWriter closed;
      } // StreamReaders Closed
      return;
    }

    public static int Main(string[] args)
    {
      try {
        if ( args.Length != 2 )
          return 9;
        if ( !File.Exists(args[0]) )
          return 1;
        parse(args[0], args[1]);
      } catch (Exception e) { System.Windows.Forms.MessageBox.Show(e.Message); return 3; }
      return 0;
    }

  } // class Parser

} // namespace ElZooilogico

Aacini
Expert
Posts: 1913
Joined: 06 Dec 2011 22:15
Location: México City, México

Re: Alignment using batchscript

#8 Post by Aacini » 29 Aug 2017 07:17

This is another version of a Batch-JScript hybrid script that should run fast.

Code: Select all

@if (@CodeSection == @Batch) @then


@echo off
CScript //nologo //E:JScript "%~F0" < input.txt > output.txt
goto :EOF


@end


var line1 = "1"+WScript.Stdin.ReadLine()+"10",
    line2 = "2"+WScript.Stdin.ReadLine()+"20",
    len = line1.length-1, i, j;

for ( i = 0; line1.charAt(i) != line2.charAt(i); ++i );

while ( i < len ) {

   for ( j = i; line1.charAt(i) == line2.charAt(i); ++i );
   WScript.Stdout.WriteLine(line1.slice(j,i));
   while ( line1.charAt(i) != line2.charAt(i) ) ++i;

}

This program took 25 seconds to process the 1.52 MB input(2).txt file on my old and slow computer; the generated output has 150,164 lines. These are a few lines from the beginning and end of the output file:

Code: Select all

RGRHRGRGRHRGRGRGIGRFRFIGRFRFIGRFRFIGRFRFIGRFRFIGRFRFIGRFRFIGRFRFIGRFRFIGRFRFIGRFRFIGRFRFIGRFRFIGRFRFIGRFRFIGRFRFGDGARGRHRGRGRHRGRHRGIGRFRFIGRFRFIGRFRFIGRFRFIGRGRGIGRGRGIGRGRGIGRGRGIGRFRFIGRFRFIGRFRFIGRFRFIGR
R
IGR
R
IGR
R
IGR
R
GDGARGRHRGRGRHRGRIRGIGRFRFIGRFRFIGRFRFIGRFRFIGRFRFIGRFRFIGRFRFIGRFRFIGR
R
IGR
R
IGR
R
IGR
R
IGR
R
IGR
R
IGR
R
IGR
R
GDGARGRHRGRGRHRGRRRGIGR
R
IGR
R
IGR
R


. . .


R
IGR
R
GDGARGRHRGRHKIKMKMRGIGR
R
IGR
R
IGR
R
IGR
R
IGR
R
IGR
R
IGR
IGR
R
IGR
R
IGR
R
IGR
R
IGR
R
IGR
R
IGR
IGR
R
IGR
R
GDGA

Antonio
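
The JScript in post #8 avoids explicit end-of-string checks by wrapping both lines in sentinels: the leading "1"/"2" and the inner "1"/"2" of the tails always differ, and the final "0" always matches, so every inner loop is guaranteed to stop. The sketch below restates that trick as a plain JavaScript function so it can be tried outside Windows Script Host. It assumes both strings have the same length (otherwise the sentinels no longer line up), and `commonRunsSentinel` is a made-up name.

```javascript
// Sentinel-based variant: no bounds checks inside the scanning loops.
// Assumption: s1 and s2 have equal length, as in the thread's input file.
function commonRunsSentinel(s1, s2) {
  if (s1.length !== s2.length) {
    throw new Error("sentinel trick assumes strings of equal length");
  }
  var a = "1" + s1 + "10"; // "1" vs "2" always differ, the final "0" always matches
  var b = "2" + s2 + "20";
  var last = a.length - 1; // index of the matching "0" sentinel
  var runs = [];
  var i = 0, j;

  while (a.charAt(i) !== b.charAt(i)) i++;  // skip the leading mismatch
  while (i < last) {
    for (j = i; a.charAt(i) === b.charAt(i); i++) {} // scan to the end of a run
    runs.push(a.slice(j, i));
    while (a.charAt(i) !== b.charAt(i)) i++;         // skip to the next match
  }
  return runs;
}

console.log(commonRunsSentinel("GIGRGRGIGRGRGGDGA", "RIGRKRKIGRLRRGDGA").join("\n"));
```

The cost of the trick is two string concatenations up front; in exchange, each character comparison in the loops needs no extra length test.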

plasma33
Posts: 22
Joined: 26 Jul 2017 21:18

Re: Alignment using batchscript

#9 Post by plasma33 » 29 Aug 2017 19:55

Hello everyone,

Sorry for my delayed response.

All of your code worked perfectly, just as it should. I can't believe how fast all of your scripts run.

Just one question: is it possible to use a matrix to generate the common substrings? I have an IDENTITY (scoring) matrix which could be used to generate common substrings with a longest-common-substring implementation.

Please follow the link below for the ID matrix:
https://www.mediafire.com/file/ll6jjw3vxbibcag/IDENTITY

Thanks again, guys.

Plasma33
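
For the matrix question in post #9, one common way to find a longest common substring is a dynamic-programming table in which each cell extends the diagonal neighbour when the two letters score above zero. The sketch below is only an illustration under stated assumptions: the contents of the linked IDENTITY file are not shown in the thread, so `identityScore` simply returns 1 for equal letters and 0 otherwise, and `longestCommonSubstring` is a made-up name.

```javascript
// Hypothetical identity scoring: 1 for equal letters, 0 otherwise.
// The real IDENTITY matrix file is not shown in the thread.
function identityScore(x, y) {
  return x === y ? 1 : 0;
}

// Longest (highest-scoring) common substring via dynamic programming.
// Only two rows of the table are kept, to limit memory use.
function longestCommonSubstring(s1, s2, score) {
  var prevScore = [], prevLen = [], currScore, currLen;
  var best = 0, bestLen = 0, bestEnd = 0, i, j, s;
  for (j = 0; j <= s2.length; j++) { prevScore[j] = 0; prevLen[j] = 0; }
  for (i = 1; i <= s1.length; i++) {
    currScore = [0];
    currLen = [0];
    for (j = 1; j <= s2.length; j++) {
      s = score(s1.charAt(i - 1), s2.charAt(j - 1));
      if (s > 0) {
        currScore[j] = prevScore[j - 1] + s; // extend the diagonal run
        currLen[j] = prevLen[j - 1] + 1;
      } else {
        currScore[j] = 0;                    // a non-scoring pair ends the run
        currLen[j] = 0;
      }
      if (currScore[j] > best) {
        best = currScore[j];
        bestLen = currLen[j];
        bestEnd = i;                         // run ends at s1 index i - 1
      }
    }
    prevScore = currScore;
    prevLen = currLen;
  }
  return s1.slice(bestEnd - bestLen, bestEnd);
}

console.log(longestCommonSubstring("GIGRGRG", "RIGRKRK", identityScore));
```

Unlike the position-by-position comparison used elsewhere in this thread, this looks at every pair of offsets, so the work grows with the product of the two string lengths; for the megabyte-sized lines discussed here it would likely be far too slow without further work.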
