#1
by plasma33 » 27 Aug 2017 01:15
Hi guys,
I am trying to find common substrings between two strings using one-to-one alignment, i.e. comparing the characters at the same position in both strings. My input will be two strings made up of the characters A-Z. An example is shown below:
Input:
Code:
String1
GIGRGRGIGRGRGGDGARGRHRGRGRHRGRQRGIGKMKMIGKMKMIGKMKMIGKMKMIGRPRGIGRHRIIGRGRGIGRGRHIGKHKIIGRHRIIGRGRGIGRGRHIGRKKLIGRKRIIGRRRHIGRRRGGDGARGRHRGRGRHRGRPRGIGRGRGIGRGRGIGRGRGIGRGRGIGRIKLIGRQRRIGRNKKIGRRKIIGRGRHIGRGRGIGRGRGIGRGRGIGRKRPIGRLRRIGRKRPIGRKRNGDGARGRHRGRGRHRGK
String2
RIGRKRKIGRLRRGDGARGRHRGRGRHRGRQRGIGRPKRIGKLRKIGKMKKIGKHKIIGRLRMIGKKKIIGKLRHIGRKRLIGKIKMIGRHRGIGRHKHIGRNRGIGRIRHIGKMKIIGKKRHIGRGRMGDGARGRHRGRGRHRGRPRGIGRGRHIGRGRGIGRGRGIGRGRGIGRNRRIGRNRPIGRNRRIGRNRKIGRNRIIGRMRHIGRNRPIGRIKLIGRNRGIGRMRKIGRMRIIGRGRGGDGARGRHRGRGRHRGK
Output:
Code:
Common Substring
IGR
R
IGR
R
GDGARGRHRGRGRHRGRQRGIG
K
IGK
IGKMK
IGK
K
IGR
R
IG
IIG
R
IGR
R
IGK
K
IGRHR
IGR
IGR
R
IGR
IG
IIG
RHIGR
R
GDGARGRHRGRGRHRGRPRGIGRGR
IGRGRGIGRGRGIGRGRGIGR
IGR
R
IGRN
IGR
IGR
R
IGR
R
IGR
R
IGR
IGR
R
IGR
R
IGR
R
IGR
R
GDGARGRHRGRGRHRGK
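To make the one-to-one comparison concrete, here is a minimal JavaScript sketch. It is not from the thread, and the function name `alignedRuns` is made up for illustration. It walks both strings position by position and collects every run of identical characters, assuming that only the overlapping length of the two strings should be compared.

```javascript
// Hypothetical helper (not from the thread): collect the runs of positions
// at which both strings hold the same character.
function alignedRuns(a, b) {
  var len = Math.min(a.length, b.length); // assumption: compare only the overlap
  var runs = [];
  var start = -1; // index where the current run began, -1 while outside a run
  for (var i = 0; i < len; i++) {
    if (a.charAt(i) === b.charAt(i)) {
      if (start < 0) start = i;        // a new run begins here
    } else if (start >= 0) {
      runs.push(a.slice(start, i));    // a mismatch closes the open run
      start = -1;
    }
  }
  if (start >= 0) runs.push(a.slice(start, len)); // flush a run that reaches the end
  return runs;
}

// The first seven characters of String1 and String2 above:
console.log(alignedRuns("GIGRGRG", "RIGRKRK").join("\n")); // prints IGR, then R
```

The final flush matters for the sample data, because both strings end with the same characters.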
I am attaching the full text file for your reference. Please see the following link:
https://www.mediafire.com/file/cfdk5j2j5vlwjnq/input.txt
Thanks, guys.
Plasma33
#3
by aGerman » 27 Aug 2017 05:49
The string limit in Batch is 8191 characters. The lines in the text file you uploaded are far longer than that.
Steffen
#4
by plasma33 » 27 Aug 2017 19:35
Hello all,
@DosItHelp, thank you for the link. I will give that a try.
@aGerman, is there any way a hybrid (i.e. a combination of batch script with VBScript or JScript) could resolve the string limit issue? Or could it be done by splitting the string into several smaller strings?
@Aacini and @dbenham, I need your take on this, please.
Thanks, all.
Plasma33
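On the idea of splitting the strings into smaller pieces: the one subtlety is that a run of matching characters can straddle a split point, so the unfinished run has to be carried into the next piece. The following is a minimal JavaScript sketch of that idea, not code from the thread. The name `alignedRunsChunked` and its `chunkSize` parameter are hypothetical.

```javascript
// Hypothetical helper (not from the thread): compare two strings in pieces of
// chunkSize characters, carrying an unfinished run across chunk boundaries.
// chunkSize is assumed to be a positive integer.
function alignedRunsChunked(a, b, chunkSize) {
  var len = Math.min(a.length, b.length);
  var runs = [];
  var carry = ""; // run still open when the previous chunk ended
  for (var offset = 0; offset < len; offset += chunkSize) {
    var end = Math.min(offset + chunkSize, len);
    var partA = a.slice(offset, end);
    var partB = b.slice(offset, end);
    for (var i = 0; i < partA.length; i++) {
      if (partA.charAt(i) === partB.charAt(i)) {
        carry += partA.charAt(i);
      } else if (carry !== "") {
        runs.push(carry);
        carry = "";
      }
    }
    // No flush here: a run that reaches the chunk end may continue in the next chunk.
  }
  if (carry !== "") runs.push(carry); // flush whatever is still open at the very end
  return runs;
}

// "ABCDE" spans three 2-character chunks but is still reported as one run.
console.log(alignedRunsChunked("ABCDEF", "ABCDEX", 2).join("\n")); // prints ABCDE
```

Whatever the chunk size, the result should be the same as comparing the two strings in one pass.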
#5
by Aacini » 27 Aug 2017 22:40
Test this code and report any problem:
Code:
@echo off
setlocal EnableDelayedExpansion
if exist input1.txt if exist input2.txt goto SecondPart
rem First part: Split a file with 2 very long lines into 2 files with shorter lines
rem http://www.dostips.com/forum/viewtopic.php?f=3&t=4945
echo Processing big input file...
for /F %%a in ('copy /Z "%~F0" NUL') do set "CR=%%a"
del input1.txt input2.txt 2> nul
set "in=1"
call :SplitLines < input.txt
goto SecondPart
:SplitLines
   echo/
   echo Reading input line # %in%
   set "lineNum=0"
   :loopLine
      set /A lineNum+=2
      set /P "=Output line: %lineNum%!CR!" < NUL
      set /P "line="
      >> input%in%.txt echo %line:~0,512%
      >> input%in%.txt echo/%line:~512%
   if "%line:~1022%" neq "" goto loopLine
   echo/
   set /A in+=1
if %in% leq 2 goto SplitLines
echo/
exit /B
:SecondPart
echo Processing input files...
set "last="
set "lineNum=0"
< input2.txt (
   for /F "delims=" %%a in (input1.txt) do (
      set /A lineNum+=1
      set /P "=Input line: !lineNum!!CR!" < NUL > CON
      set "line1=%%a"
      set /P "line2="
      if defined last (
         set /P "=!last!" < NUL & set "last="
         if "!line1:~0,1!" neq "!line2:~0,1!" echo/
      )
      set "start="
      for /L %%i in (0,1,511) do (
         if "!line1:~%%i,1!" equ "!line2:~%%i,1!" (
            if not defined start (set /A "start=end=%%i") else set "end=%%i"
            if %%i equ 511 (
               for /F %%m in ("!start!") do set "last=!line1:~%%m!"
            )
         ) else if defined start (
            set /A "len=end-start+1"
            for /F "tokens=1,2" %%m in ("!start! !len!") do echo !line1:~%%m,%%n!
            set "start="
         )
      )
   )
   if defined last echo !last!
) > output.txt
echo/
Antonio
#6
by aGerman » 28 Aug 2017 01:48
plasma33 wrote:@aGerman, is there anyway a hybrid
Yes, why not.
Code:
@if (@a)==(@b) @end /*
@echo off &setlocal
set "filename=test.txt"
cscript //nologo //e:jscript "%~fs0" "%filename%"
pause
exit /b&::*/
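// Read the two strings from the first two lines of the file.
var objFile = WScript.CreateObject('Scripting.FileSystemObject').OpenTextFile(WScript.Arguments(0)),
    str1 = objFile.ReadLine(), str2 = objFile.ReadLine(),
    len = Math.min(str1.length, str2.length),
    bSame = false, strMatch = '', i, chr;
objFile.Close();
for (i = 0; i < len; ++i) {
  chr = str1.charAt(i);
  if (chr == str2.charAt(i)) {
    // Same character at the same position: extend the current match.
    strMatch += chr;
    if (!bSame) {
      bSame = true;
    }
  } else {
    // A mismatch ends the current match, if there is one.
    if (bSame) {
      WScript.Echo(strMatch);
      bSame = false;
      strMatch = '';
    }
  }
}
// Output a match that runs up to the end of the strings.
if (bSame) {
  WScript.Echo(strMatch);
}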
Steffen
#7
by elzooilogico » 28 Aug 2017 05:58
For the sake of speed...
Code:
//>nul 2>nul||@goto :batch_code
/*
:batch_code
@echo off
rem place desired exe name
set "theExeFile=myParser.exe"
set "inputFile=.\input.txt"
set "outputFile=.\output.txt"
if not exist "%theExeFile%" call :build_the_exe || exit/B
echo Processing, wait...
%theExeFile% "%inputFile%" "%outputFile%"
if %errorlevel% NEQ 0 (
echo Error code %errorlevel%
) else (
echo Done.
)
endlocal
exit /b 0
:build_the_exe
for /f "tokens=* delims=" %%v in ('dir /b /s /a:-d /o:-n "%SystemRoot%\Microsoft.NET\Framework\csc.exe"') do (
set "csc=%%v"
)
if "%csc%" == "" echo/&echo/Warning: Net Framework Not Found&exit/B 1
call "%csc%" /nologo /out:"%theExeFile%" "%~dpsfnx0"
exit/B 0
*/
//begin c# code
using System;
using System.IO;
namespace ElZooilogico
{
  public class Parser
  {
    private static void parse(string inputFile, string outputFile)
    {
      string[] buffer = null;
      string match = string.Empty;
      using ( System.IO.StreamWriter output = new System.IO.StreamWriter(outputFile) )
      {
        buffer = File.ReadAllLines(inputFile);
        // compare only the overlapping part, so a shorter second line cannot cause an out-of-range index
        int len = Math.Min(buffer[0].Length, buffer[1].Length);
        for ( int i = 0; i < len; i++ )
        {
          if ( buffer[0][i] == buffer[1][i] )
            match += buffer[0][i];
          else if ( match != string.Empty )
          {
            output.WriteLine(match);
            match = string.Empty;
          }
        }
        // write a match that runs up to the end of the lines
        if ( match != string.Empty )
          output.WriteLine(match);
      }
      return;
    }
    public static int Main(string[] args)
    {
      try {
        if ( args.Length != 2 )
          return 9;
        if ( !File.Exists(args[0]) )
          return 1;
        parse(args[0], args[1]);
      } catch (Exception e) { System.Windows.Forms.MessageBox.Show(e.Message); return 3; }
      return 0;
    }
  } // class Parser
} // namespace ElZooilogico
Or, to process a huge file in chunks:
Code:
//>nul 2>nul||@goto :batch_code
/*
:batch_code
@echo off
rem place desired exe name
set "theExeFile=myParser.exe"
set "inputFile=.\input.txt"
set "outputFile=.\output.txt"
if not exist "%theExeFile%" call :build_the_exe || exit/B
echo Processing, wait...
%theExeFile% "%inputFile%" "%outputFile%"
if %errorlevel% NEQ 0 (
echo Error code %errorlevel%
) else (
echo Done.
)
endlocal
exit /b 0
:build_the_exe
for /f "tokens=* delims=" %%v in ('dir /b /s /a:-d /o:-n "%SystemRoot%\Microsoft.NET\Framework\csc.exe"') do (
set "csc=%%v"
)
if "%csc%" == "" echo/&echo/Warning: Net Framework Not Found&exit/B 1
call "%csc%" /nologo /out:"%theExeFile%" "%~dpsfnx0"
exit/B 0
*/
//begin c# code
using System;
using System.IO;
namespace ElZooilogico
{
  public class Parser
  {
    const int blockSize = 4096;
    private static int filePtr = 0;
    private static void parse(string inputFile, string outputFile)
    {
      long size=0;
      int chunks=0, rest=0;
      char[] buffer1 = new char[blockSize], buffer2 = new char[blockSize];
      using (FileStream fs = File.OpenRead(inputFile))
      {
        chunks = Convert.ToInt32((fs.Length / blockSize));
        using (BinaryReader br = new BinaryReader(fs))
        {
          for ( int i = 0; i < chunks; i++ )
          {
            buffer1 = br.ReadChars(blockSize);
            for ( int j = 0; j < blockSize; j++ )
            {
              if ( (buffer1[j] == 13 && buffer1[j+1] == 10) || (buffer1[j] == 10 && buffer1[j+1] == 13) )
              {
                if ( filePtr == 0 )
                {
                  filePtr = Convert.ToInt32(size + j + 2);
                  break;
                }
              }
            }
            size += blockSize;
          }
          chunks = Convert.ToInt32((filePtr / blockSize));
          rest = Convert.ToInt32(filePtr - (chunks*blockSize));
        }
      }
      using ( System.IO.StreamReader line1 = new System.IO.StreamReader(inputFile),
                                     line2 = new System.IO.StreamReader(inputFile) )
      {
        string match = string.Empty;
        line2.BaseStream.Seek(filePtr, SeekOrigin.Current);
        using ( System.IO.StreamWriter output = new System.IO.StreamWriter(outputFile) )
        {
          for ( int i = 0; i < chunks; i++ )
          {
            System.Console.Out.Write("\rBlock {0} of {1}", i.ToString(), chunks.ToString());
            line1.Read(buffer1, 0, blockSize);
            line2.Read(buffer2, 0, blockSize);
            for ( int j = 0; j < blockSize; j++ )
            {
              if ( buffer1[j] == buffer2[j] )
                match += buffer1[j];
              else if ( match != string.Empty )
              {
                output.WriteLine(match);
                match = string.Empty;
              }
            }
          }
          Console.WriteLine("\rBlock " + chunks.ToString()+ " of " + chunks.ToString());
          if ( rest > 0 )
          {
            line1.Read(buffer1, 0, rest);
            line2.Read(buffer2, 0, rest);
            for ( int j = 0; j < rest; j++ )
            {
              if ( buffer1[j] == buffer2[j] )
                match += buffer1[j];
              else if ( match != string.Empty )
              {
                output.WriteLine(match);
                match = string.Empty;
              }
            }
          }
          if ( match != string.Empty )
            output.WriteLine(match);
        } // StreamWriter closed;
      } // StreamReaders Closed
      return;
    }
    public static int Main(string[] args)
    {
      try {
        if ( args.Length != 2 )
          return 9;
        if ( !File.Exists(args[0]) )
          return 1;
        parse(args[0], args[1]);
      } catch (Exception e) { System.Windows.Forms.MessageBox.Show(e.Message); return 3; }
      return 0;
    }
  } // class Parser
} // namespace ElZooilogico
#8
by Aacini » 29 Aug 2017 07:17
This is another version of a Batch-JScript hybrid script that should run fast.
Code:
@if (@CodeSection == @Batch) @then
@echo off
CScript //nologo //E:JScript "%~F0" < input.txt > output.txt
goto :EOF
@end
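// Sentinels: with lines of equal length, the prefixes "1"/"2" and the suffixes
// "10"/"20" force a mismatch at the first position and right after the data,
// and a match at the very last position, so the loops need no bounds checks.
var line1 = "1"+WScript.Stdin.ReadLine()+"10",
    line2 = "2"+WScript.Stdin.ReadLine()+"20",
    len = line1.length-1, i, j;
// Skip the initial run of mismatches.
for ( i = 0; line1.charAt(i) != line2.charAt(i); ++i );
while ( i < len ) {
   // Advance i over the run of matches that starts at j, then output it.
   for ( j = i; line1.charAt(i) == line2.charAt(i); ++i );
   WScript.Stdout.WriteLine(line1.slice(j,i));
   // Skip mismatches up to the next match (or the final sentinel).
   while ( line1.charAt(i) != line2.charAt(i) ) ++i;
}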
This program took 25 seconds to process the 1.52 MB input(2).txt file on my old and slow computer; the generated output has 150,164 lines. These are a few lines from the beginning and end of the output file:
Code:
RGRHRGRGRHRGRGRGIGRFRFIGRFRFIGRFRFIGRFRFIGRFRFIGRFRFIGRFRFIGRFRFIGRFRFIGRFRFIGRFRFIGRFRFIGRFRFIGRFRFIGRFRFIGRFRFGDGARGRHRGRGRHRGRHRGIGRFRFIGRFRFIGRFRFIGRFRFIGRGRGIGRGRGIGRGRGIGRGRGIGRFRFIGRFRFIGRFRFIGRFRFIGR
R
IGR
R
IGR
R
IGR
R
GDGARGRHRGRGRHRGRIRGIGRFRFIGRFRFIGRFRFIGRFRFIGRFRFIGRFRFIGRFRFIGRFRFIGR
R
IGR
R
IGR
R
IGR
R
IGR
R
IGR
R
IGR
R
IGR
R
GDGARGRHRGRGRHRGRRRGIGR
R
IGR
R
IGR
R
. . .
R
IGR
R
GDGARGRHRGRHKIKMKMRGIGR
R
IGR
R
IGR
R
IGR
R
IGR
R
IGR
R
IGR
IGR
R
IGR
R
IGR
R
IGR
R
IGR
R
IGR
R
IGR
IGR
R
IGR
R
GDGA
Antonio
#9
by plasma33 » 29 Aug 2017 19:55
Hello everyone,
Sorry for my delayed response.
All of your code worked perfectly, as it should. I can't believe how fast it all runs.
Just one question: is it possible to use a matrix to generate the common substrings? I have an IDENTITY (scoring) matrix that could be used to generate common substrings with a longest-common-substring implementation.
Please follow the link below for the ID matrix:
https://www.mediafire.com/file/ll6jjw3vxbibcag/IDENTITY
Thanks again, guys.
Plasma33
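As a rough illustration of the matrix-based question above, here is a minimal JavaScript sketch of the classic dynamic-programming longest common substring, with the character comparison routed through a scoring function. It is not from the thread. The linked matrix is not reproduced here, so `identityScore` is only an assumed stand-in (1 for identical letters, 0 otherwise), and both function names are hypothetical. Any pair with a positive score is treated as a match.

```javascript
// Assumed stand-in for the IDENTITY matrix: 1 for identical letters, 0 otherwise.
function identityScore(x, y) {
  return x === y ? 1 : 0;
}

// Hypothetical helper: longest common substring of a and b, at any offset.
// prev[j] holds the length of the common run ending at a[i-2] and b[j-1].
function longestCommonSubstring(a, b, score) {
  var prev = [], curr, best = 0, bestEnd = 0, i, j;
  for (j = 0; j <= b.length; j++) prev[j] = 0;
  for (i = 1; i <= a.length; i++) {
    curr = [0];
    for (j = 1; j <= b.length; j++) {
      // A positive score extends the diagonal run, anything else resets it.
      curr[j] = score(a.charAt(i - 1), b.charAt(j - 1)) > 0 ? prev[j - 1] + 1 : 0;
      if (curr[j] > best) {
        best = curr[j]; // length of the longest run seen so far
        bestEnd = i;    // position in a just past the end of that run
      }
    }
    prev = curr; // only the previous row is needed
  }
  return a.slice(bestEnd - best, bestEnd);
}

console.log(longestCommonSubstring("XABCDY", "ZZABCD", identityScore)); // prints ABCD
```

Unlike the position-by-position comparison used earlier in the thread, this finds a common substring even when it sits at different offsets in the two strings. The running time grows with the product of the two lengths, so it would be slow on inputs as large as the uploaded file.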