How to extract data from website?

Discussion forum for all Windows batch related topics.

Moderator: DosItHelp

PaperTronics
Posts: 118
Joined: 02 Apr 2017 06:11

Re: How to extract data from website?

#16 Post by PaperTronics » 12 May 2017 07:41

Hackoo wrote:Hi :)
This is another tweaked version that extracts all links from the source code of a website. The results can also be filtered by search strings such as "Mediafire", "Aacini" and "Thebateam". If you want to extract the InnerText too, just uncomment the rest of this line after HREF (remove the apostrophe), so that
Link.WriteLine HREF '^& " ========> " ^& InnerText
becomes

Code: Select all

Link.WriteLine HREF ^&  " ========> " ^& InnerText
or simply write it like this:

Code: Select all

Link.WriteLine HREF
So the whole code of ExtractLinks.bat is:

Code: Select all

@echo off
Title Extracting HREF links from website source code by Hackoo 2017
REM Extract all links from source code of a website, and also, can be filtered by string to be searched
mode con cols=75 lines=3 & color 9E
set "vbsfile=%tmp%\%~n0.vbs"
set "InputFile=Doc.txt"
If Not exist "%InputFile%" (
   Color 0C
   
   echo(
   echo  The "%InputFile%" does not exist, please check it and re-run this batch file
   pause>nul
   exit
)
Set "OutPutFile=All_Links.txt"
set Filter_Strings="Mediafire" "Aacini" "Thebateam"
echo(
echo       Please Wait a While ... Extracting Links is in Progress ....
Call :ExtractLinks "%InputFile%" "%OutPutFile%"
For %%a in (%Filter_Strings%) Do (
   REM %%~a strips the quotes from the filter string so the whole path can be quoted
   Type "%OutPutFile%" | find /I "%%~a" > "%~dp0%%~a_Links.txt"
   If exist "%~dp0%%~a_Links.txt" Start "" "%~dp0%%~a_Links.txt"
)
start "" "%OutPutFile%" & Exit
::*************************************************************************************************
:ExtractLinks <InputFile> <OutPutFile>
(
   echo InputFile = Wscript.Arguments(0^)
   echo OutPutFile = Wscript.Arguments(1^)
   echo Call ExtractLinks(InputFile,OutPutFile^)
   echo '-------------------------------------------------------------------------------------------
   echo Function ExtractLinks(InputFile,OutPutFile^)
   echo      Set fso = CreateObject("Scripting.FileSystemObject"^) 
   echo      Set f = Fso.OpenTextFile(InputFile,1^)
   echo      Set Link = fso.OpenTextFile(OutPutfile,2,True,-1^)
   echo      Data = f.ReadAll
   echo      Set reLink = New RegExp
   echo      reLink.Global = True
   echo      reLink.IgnoreCase = True 
   echo      reLink.Pattern = "<a\b[^>]*\bhref=(?:([""'])([\s\S]+?)\1|([^\s>]*))[^>]*>([\s\S]+?)</a>"
   echo      Set reText = New RegExp
   echo      reText.GLobal = True
   echo      reText.Pattern = "<[^>]*>"     
   echo      For Each Match in reLink.Execute(Data^)
   echo          HREF = Match.SubMatches(1^) ^& Match.SubMatches(2^)
   echo          InnerText = reText.Replace(Match.SubMatches(3^), ""^)
   echo 'If you want to extract the InnerText just uncomment this line after HREF (get rid from quote^)
   echo          Link.WriteLine HREF '^&  " ========> " ^& InnerText
   echo      Next 
   echo End Function
   echo '-------------------------------------------------------------------------------------------
)>"%vbsfile%"
Cscript /nologo "%vbsfile%" "%~1" "%~2"
exit /b
::*************************************************************************************************


I appreciate the time and effort you've put into this code. Can you:
1. Make it extract the "TheBATeam" Links only? I was too lazy to dig in the source code.
2. Make it not echo the "Please Wait... Extracting Links in Progress" thingy?
3. Make it not open the files when the extracting is finished.

I have to say, your method is pretty fast!

Thanks,
PaperTronics

Hackoo
Posts: 103
Joined: 15 Apr 2014 17:59

Re: How to extract data from website?

#17 Post by Hackoo » 12 May 2017 08:39

PaperTronics wrote:I appreciate the time and effort you've put into this code. Can you:
1. Make it extract the "TheBATeam" Links only? I was too lazy to dig in the source code.
2. Make it not echo the "Please Wait... Extracting Links in Progress" thingy?
3. Make it not open the files when the extracting is finished.

I have to say, your method is pretty fast!
Thanks,
PaperTronics

I have one question: which tool or script did you use to get the source code of the website?
Here is the modification you requested :)

Code: Select all

@echo off
Title Extracting HREF links from website source code by Hackoo 2017
REM Extract all links from source code of a website, and also, can be filtered by string to be searched
mode con cols=75 lines=3 & color 9E
set "vbsfile=%tmp%\%~n0.vbs"
set "InputFile=Doc.txt"
If Not exist "%InputFile%" (
   Color 0C
   echo(
   echo  The "%InputFile%" does not exist, please check it and re-run this batch file
   pause>nul
   exit
)
Set "OutPutFile=All_Links.txt"
set Filter_Strings="Thebateam"
Call :ExtractLinks "%InputFile%" "%OutPutFile%"
For %%a in (%Filter_Strings%) Do (
   Type "%OutPutFile%" | find /I "%%~a" > "%~dp0%%~a_Links.txt"
)
Exit
::*************************************************************************************************
:ExtractLinks <InputFile> <OutPutFile>
(
   echo InputFile = Wscript.Arguments(0^)
   echo OutPutFile = Wscript.Arguments(1^)
   echo Call ExtractLinks(InputFile,OutPutFile^)
   echo '-------------------------------------------------------------------------------------------
   echo Function ExtractLinks(InputFile,OutPutFile^)
   echo      Set fso = CreateObject("Scripting.FileSystemObject"^) 
   echo      Set f = Fso.OpenTextFile(InputFile,1^)
   echo      Set Link = fso.OpenTextFile(OutPutfile,2,True,-1^)
   echo      Data = f.ReadAll
   echo      Set reLink = New RegExp
   echo      reLink.Global = True
   echo      reLink.IgnoreCase = True 
   echo      reLink.Pattern = "<a\b[^>]*\bhref=(?:([""'])([\s\S]+?)\1|([^\s>]*))[^>]*>([\s\S]+?)</a>"
   echo      Set reText = New RegExp
   echo      reText.GLobal = True
   echo      reText.Pattern = "<[^>]*>"     
   echo      For Each Match in reLink.Execute(Data^)
   echo          HREF = Match.SubMatches(1^) ^& Match.SubMatches(2^)
   echo          InnerText = reText.Replace(Match.SubMatches(3^), ""^)
   echo          Link.WriteLine HREF
   echo      Next 
   echo End Function
   echo '-------------------------------------------------------------------------------------------
)>"%vbsfile%"
Cscript /nologo "%vbsfile%" "%~1" "%~2"
exit /b
::*************************************************************************************************
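
For readers who want to see what the regular expression in the script captures, here is a minimal sketch in JavaScript (the language of the JScript answers later in this thread). It assumes the same pattern as above; the function name extractLinks and any sample HTML you feed it are made up for illustration. Capture groups 2 and 3 correspond to SubMatches(1) and SubMatches(2) in the VBScript, and group 4 to SubMatches(3).

```javascript
// Sketch only: the same pattern as in the VBScript above, written as a JavaScript literal.
// Groups: 1 = quote character, 2 = quoted href, 3 = unquoted href, 4 = inner HTML.
function extractLinks(html) {
  var reLink = /<a\b[^>]*\bhref=(?:(["'])([\s\S]+?)\1|([^\s>]*))[^>]*>([\s\S]+?)<\/a>/gi;
  var reTag = /<[^>]*>/g;
  var links = [];
  var match;
  while ((match = reLink.exec(html)) !== null) {
    links.push({
      // Only one of groups 2 and 3 takes part in a given match
      href: match[2] || match[3] || "",
      // Strip nested tags from the anchor body, like reText.Replace does
      text: match[4].replace(reTag, "")
    });
  }
  return links;
}
```

Each returned entry holds the href from whichever alternative matched (quoted or unquoted) and the anchor text with nested tags removed.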

thefeduke
Posts: 211
Joined: 05 Apr 2015 13:06
Location: MA South Shore, USA

Re: How to extract data from website?

#18 Post by thefeduke » 12 May 2017 16:19

Hackoo wrote:Here is the modification you requested :)
Good work. This reply is directed more to you than to @PaperTronics. I don't know what editor you use, but escaping those special characters for ECHO looks tedious. I favor a more WYSIWYG technique, which I use for most of my inline test files: the .vbs code is stored verbatim between a GoTo :EndOfWebSite.vbsFile line and the matching label, and the :TempFile routine copies those lines to a file under %Temp%\~Scripts~. Here is your code, slightly modified so that the .vbs code is entered more simply.

Code: Select all

@echo off
Title Extracting HREF links from website source code by Hackoo 2017
::
::  Posted: Fri May 12, 2017 10:39 am by Hackoo 
::  http://www.dostips.com/forum/viewtopic.php?p=52303#p52303
::  Post subject: Re: How to extract data from website?
::  thefeduke altered :ExtractLinks to eliminate those hard to work with ECHOes
::
REM Extract all links from source code of a website, and also, can be filtered by string to be searched
mode con cols=75 lines=3 & color 9E
set "vbsfile=%tmp%\%~n0.vbs"
set "InputFile=Doc.txt"
If Not exist "%InputFile%" (
   Color 0C
   echo(
   echo  The "%InputFile%" does not exist, please check it and re-run this batch file
   pause>nul
   exit
)
Set "OutPutFile=All_Links.txt"
set Filter_Strings="Thebateam"
Call :ExtractLinks "%InputFile%" "%OutPutFile%"
For %%a in (%Filter_Strings%) Do (
   Type "%OutPutFile%" | find /I "%%~a" > "%~dp0%%~a_Links.txt"
)
Exit
::*************************************************************************************************
:ExtractLinks <InputFile> <OutPutFile>
(
   echo InputFile = Wscript.Arguments(0^)
   echo OutPutFile = Wscript.Arguments(1^)
   echo Call ExtractLinks(InputFile,OutPutFile^)
   echo '-------------------------------------------------------------------------------------------
   echo Function ExtractLinks(InputFile,OutPutFile^)
   echo      Set fso = CreateObject("Scripting.FileSystemObject"^) 
   echo      Set f = Fso.OpenTextFile(InputFile,1^)
   echo      Set Link = fso.OpenTextFile(OutPutfile,2,True,-1^)
   echo      Data = f.ReadAll
   echo      Set reLink = New RegExp
   echo      reLink.Global = True
   echo      reLink.IgnoreCase = True 
   echo      reLink.Pattern = "<a\b[^>]*\bhref=(?:([""'])([\s\S]+?)\1|([^\s>]*))[^>]*>([\s\S]+?)</a>"
   echo      Set reText = New RegExp
   echo      reText.GLobal = True
   echo      reText.Pattern = "<[^>]*>"     
   echo      For Each Match in reLink.Execute(Data^)
   echo          HREF = Match.SubMatches(1^) ^& Match.SubMatches(2^)
   echo          InnerText = reText.Replace(Match.SubMatches(3^), ""^)
   echo          Link.WriteLine HREF
   echo      Next 
   echo End Function
   echo '-------------------------------------------------------------------------------------------
)>"%vbsfile%"
Rem.Cscript /nologo "%vbsfile%" "%~1" "%~2"
Call :TempFile "WebSite.vbs" "file"
Call Cscript /nologo "%Temp%\~Scripts~\%~n0_WebSite.vbs" "%~1" "%~2"
exit /b
::*************************************************************************************************

GoTo :EndOfWebSite.vbsFile
InputFile = Wscript.Arguments(0)
OutPutFile = Wscript.Arguments(1)
Call ExtractLinks(InputFile,OutPutFile)
'-------------------------------------------------------------------------------------------
Function ExtractLinks(InputFile,OutPutFile)
     Set fso = CreateObject("Scripting.FileSystemObject")
     Set f = Fso.OpenTextFile(InputFile,1)
     Set Link = fso.OpenTextFile(OutPutfile,2,True,-1)
     Data = f.ReadAll
     Set reLink = New RegExp
     reLink.Global = True
     reLink.IgnoreCase = True
     reLink.Pattern = "<a\b[^>]*\bhref=(?:([""'])([\s\S]+?)\1|([^\s>]*))[^>]*>([\s\S]+?)</a>"
     Set reText = New RegExp
     reText.GLobal = True
     reText.Pattern = "<[^>]*>"
     For Each Match in reLink.Execute(Data)
         HREF = Match.SubMatches(1) & Match.SubMatches(2)
         InnerText = reText.Replace(Match.SubMatches(3), "")
         Link.WriteLine HREF
     Next
End Function
'-------------------------------------------------------------------------------------------
:EndOfWebSite.vbsFile

:TempFile Name.Ext_Val[In] OutFormatVal[In]
    @echo Off & SetLocal EnableDelayedExpansion
    If %~1==. First-argument-is-mandatory-but-an-empty-string
    If %~2==. Second-argument-is-mandatory-but-an-empty-string
    For %%E In ("%~1") DO Set "fName=%%~nxE"
    If Not Exist "%Temp%\~Scripts~"         MkDir "%Temp%\~Scripts~"
    If Exist "%Temp%\~Scripts~\%~n0_%fName%" DEL "%Temp%\~Scripts~\%~n0_%fName%"
    For /f "delims=:" %%i in (
        'findstr /nir /c:"^goto[ ]*\:EndOf%fName%" /c:"^\:EndOf%fName%" "%~fs0"'
    ) Do Set "DataRange=!DataRange! %%i"
    For /f "tokens=1,2" %%i in ("%DataRange%") Do (Set /A "BeginData=%%i+1" & Set /A "EndData=%%j-1")
   (For /L %%i In (2 1 %BeginData%) Do Set /P "="
        For /L %%i In (!BeginData! 1 %EndData%) Do (
            Set "line=" &Set /P "line="
            If /I "%~2" EQU "File" Echo(!line!
            set "whole=!Whole!!line!"
        )
        If /I "%~2" EQU "Line" Echo(!whole!
   ) < "%~f0"   >"%Temp%\~Scripts~\%~n0_%fName%"
   EndLocal
Exit /B
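
The idea behind :TempFile can be sketched outside of batch as well. The following JavaScript is only an illustration under assumptions: the function name extractEmbedded is hypothetical, and it works on a string rather than on the batch file itself. It collects the lines between the GoTo :EndOf<name> line and the matching :EndOf<name> label, which is the range that the findstr line numbers mark out above.

```javascript
// Sketch: return the lines found between "goto :EndOf<name>" and ":EndOf<name>".
// extractEmbedded and its arguments are illustrative names, not part of the batch file.
function extractEmbedded(scriptText, name) {
  var lines = scriptText.split(/\r?\n/);
  var endMarker = ":endof" + name.toLowerCase();
  var body = [];
  var inside = false;
  for (var i = 0; i < lines.length; i++) {
    var lower = lines[i].toLowerCase();
    // Remove a leading "goto" plus optional spaces, as the findstr pattern allows
    var label = lower.replace(/^goto\s*/, "");
    var isGoto = label !== lower;
    if (!inside && isGoto && label.indexOf(endMarker) === 0) {
      inside = true;                 // payload starts on the next line
    } else if (inside && lower.indexOf(endMarker) === 0) {
      break;                         // closing label reached
    } else if (inside) {
      body.push(lines[i]);           // keep the line exactly as written
    }
  }
  return body;
}
```

Like the batch routine, the match on the marker lines is case-insensitive and only looks at the start of the line.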

John A.

PaperTronics
Posts: 118
Joined: 02 Apr 2017 06:11

Re: How to extract data from website?

#19 Post by PaperTronics » 13 May 2017 06:36

Thanks for the help, @Hackoo and @thefeduke.


I'm having another problem in my program. The links are extracted in the wrong order, not because of your code but because of the order in which they appear in the website's source code. So I want to sort the links.

I found a way to do this using the year and month of publication in the middle of each link, e.g.: http://www.thebateam.org/2017/02/how-to-customize-cmd-completely-by.html

But the code I tried to apply wasn't working. I need some pro help here.


Answer to Hackoo's question:
I didn't use any tool to get the source code of the site. thebateam.org is my own website, so I could get its source code easily. If you want to do the same for any other website, I suggest:

1. Go to the website from which you want to extract the code
2. Press Ctrl+U. The source code should open immediately in a new tab


Thanks,
PaperTronics

Aacini
Expert
Posts: 1914
Joined: 06 Dec 2011 22:15
Location: México City, México
Contact:

Re: How to extract data from website?

#20 Post by Aacini » 13 May 2017 14:39

Aacini wrote:Please, try this version of the code:

Code: Select all

@if (@CodeSection == @Batch) @then

@echo off
cscript //nologo //E:JScript "%~F0" < Doc.txt > output.txt
goto :EOF

@end

var search = /http:\/\/www\.mediafire\.com[^"]*/g, file = WScript.StdIn.ReadAll(), match;
while ( match = search.exec(file) ) WScript.Stdout.WriteLine(match[0]);

If it still doesn't work, post the output from the command-line window...

Antonio



PaperTronics wrote:The error states

Code: Select all

C:\Users\pratik\Desktop\BatchStore\DummyBase.bat(1, 6) Microsoft JScript compilation error: Conditional compilation is turned off



OK. Such an error is unusual. However, the next version should run correctly on your computer:

Code: Select all

@echo off

 > extract.js echo var search = /http:\/\/www\.mediafire\.com[^^"]*/g, file = WScript.StdIn.ReadAll(), match;
>> extract.js echo while ( match = search.exec(file) ) WScript.Stdout.WriteLine(match[0]);

cscript //nologo extract.js < Doc.txt


PaperTronics wrote:I'm having another problem in my program. The links are extracted in the wrong order, not because of your code but because of the order in which they appear in the website's source code. So I want to sort the links.

I found a way to do this using the year and month of publication in the middle of each link, e.g.: http://www.thebateam.org/2017/02/how-to-customize-cmd-completely-by.html

But the code I tried to apply wasn't working. I need some pro help here.


This works here:

Code: Select all

@echo off

 > extract.js echo var search = /http:\/\/www\.thebateam\.org[^^"]*/g, file = WScript.StdIn.ReadAll(), match;
>> extract.js echo while ( match = search.exec(file) ) WScript.Stdout.WriteLine(match[0]);

cscript //nologo extract.js < Doc.txt | sort /+26 > output.txt
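
Why /+26: the prefix http://www.thebateam.org/ is 25 characters long, so the year starts at character 26 and sort compares the lines from there. A rough JavaScript sketch of the same idea follows; sortFromColumn is a made-up name, and it uses plain string comparison, which may not order every character exactly as sort.exe does.

```javascript
// Sketch: order lines by the text that starts at a 1-based column, like "sort /+26".
// Plain string comparison is used here; sort.exe may treat case and locale differently.
function sortFromColumn(lines, column) {
  var copy = lines.slice(0);               // leave the caller's array untouched
  copy.sort(function (a, b) {
    var ka = a.substring(column - 1);      // column 26 -> index 25
    var kb = b.substring(column - 1);
    if (ka < kb) return -1;
    if (ka > kb) return 1;
    return 0;
  });
  return copy;
}
```

With links of the form shown above, sorting from column 26 orders them by year, then month, then post name.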

Antonio

PaperTronics
Posts: 118
Joined: 02 Apr 2017 06:11

Re: How to extract data from website?

#21 Post by PaperTronics » 19 May 2017 06:14

Thanks, Aacini. The sorting and extracting method is working perfectly now. All I need now is a download program that can download any file from the internet. I had one, but it wasn't very good and was a little buggy.

Hackoo
Posts: 103
Joined: 15 Apr 2014 17:59

Re: How to extract data from website?

#22 Post by Hackoo » 19 May 2017 11:09

PaperTronics wrote:All I need now is a download program that can download any file from the internet. I had one, but it wasn't very good and was a little buggy.
Hi :)
Can you provide a sample direct link so we can test the download?

PaperTronics
Posts: 118
Joined: 02 Apr 2017 06:11

Re: How to extract data from website?

#23 Post by PaperTronics » 20 May 2017 06:47

Hackoo wrote:Can you provide a sample direct link so we can test the download?


Sample : http://www.mediafire.com/file/4tovbku6k ... eecher.rar

The downloading program should download the program that is in the link without asking for its name.

Hackoo
Posts: 103
Joined: 15 Apr 2014 17:59

Re: How to extract data from website?

#24 Post by Hackoo » 20 May 2017 09:11

PaperTronics wrote:
Can you provide us a sample direct link to test the downloading ?

Sample : http://www.mediafire.com/file/4tovbku6k ... eecher.rar
The downloading program should download the program that is in the link without asking for its name.
Hi 8)
NB: For this script to work, you must give it a direct link to the file :wink:
Just give this batch file a try; it downloads the file to your desktop :)
I tested it before posting it here and it works for me 5/5; I hope it will work on your side too! :mrgreen:

Code: Select all

@echo off
Title Batch script to download a file from a direct link by Hackoo
Color 9E & Mode con cols=90 lines=3
Set "URL=http://download1334.mediafire.com/i77fj7bj37xg/4tovbku6kcercc7/Speecher.rar"
REM To extract the name of the file to be downloaded from the URL.
For %%F in ("%URL%") Do (
    Set "MyProgram=%%~nxF"
    Set "MyProgram_Name=%%~nF"
)
REM We set the Location of MyProgram where to be downloaded
Set "Location=%userprofile%\Desktop\%MyProgram%"
REM If there is any previous version of MyProgram we delete it.
If Exist "%Location%" Del "%Location%"
REM We download the last version of MyProgram from its original web site.
If Not Exist "%Location%" (
   echo(
   echo    Please wait a while ... Downloading the last version of "%MyProgram_Name%" is in progress ...
   Call:Download "%URL%" "%Location%"
)
Explorer.exe /select,"%Location%"
Exit
::*********************************************************************************
:Download <url> <File>
Powershell.exe -command "(New-Object System.Net.WebClient).DownloadFile('%~1','%~2')"
exit /b
::*********************************************************************************
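
The For %%F loop relies on the ~nx and ~n modifiers to split the file name off the URL. As an illustration only, the same split could look like this in JavaScript; fileNameFromUrl is a hypothetical helper, and dropping a query string or fragment is an extra assumption that the batch version does not make.

```javascript
// Sketch: take the last path segment of a URL as the file name (like %%~nxF)
// and the part before the last dot as the base name (like %%~nF).
function fileNameFromUrl(url) {
  // Assumption: anything after ? or # is not part of the file name
  var path = url.split("?")[0].split("#")[0];
  var name = path.substring(path.lastIndexOf("/") + 1);
  var dot = name.lastIndexOf(".");
  return {
    fileName: name,
    baseName: dot > 0 ? name.substring(0, dot) : name
  };
}
```

For the URL used in the script this yields Speecher.rar as the file name and Speecher as the base name.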

aGerman
Expert
Posts: 4678
Joined: 22 Jan 2010 18:01
Location: Germany

Re: How to extract data from website?

#25 Post by aGerman » 20 May 2017 10:00

Mediafire doesn't want you to download the file directly: they offer their service for free and earn their money via advertising. That is why the id in the direct link changes: the "i77fj7bj37xg" in your link was "2xcjc69j2npg" when I browsed the site. I'm not saying it's impossible, but it would take a lot of effort to get the current direct link to download the file.

Steffen

PaperTronics
Posts: 118
Joined: 02 Apr 2017 06:11

Re: How to extract data from website?

#26 Post by PaperTronics » 21 May 2017 01:55

As @aGerman said, it takes a lot of effort to get the current direct download link to the file, so I think I should use an alternative method:
Download the source code of each of the mediafire pages, use Aacini's algorithm to find the names of the programs, and then use download.exe (the program I had previously selected for downloading the files) to download those programs.

The only problem with download.exe is that it requires the name of the program it is downloading, which is why I asked you all to suggest another downloading program.

igor_andreev
Posts: 16
Joined: 25 Feb 2017 12:55
Location: Russia

Re: How to extract data from website?

#27 Post by igor_andreev » 21 May 2017 02:36

Approximate order of actions:
1. Download the mediafire URL with wget to anyname.tmp
2. In anyname.tmp (it's just an HTML page), find the line containing "DownloadButtonAd-startDownload gbtnSecondary"
3. Extract the direct link
I did steps 2 and 3 in one line with sed and grep:
type anyname.tmp | sed s/\x27/\n/g | grep -o "^http:\/\/.*$"
or by sed only:
type anyname.tmp | sed s/\x27/\n/g | sed "/^http:\/\/.*$/!d"
4. wget direct-link
5. Profit :)
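
Steps 2 and 3 can be sketched in JavaScript as well, for anyone without sed and grep. This is a rough equivalent, not a tested replacement: findDirectLinks is a made-up name, and it assumes, like the sed command, that the direct link sits between apostrophes in the page source.

```javascript
// Sketch of steps 2 and 3: break the page at every apostrophe (sed s/\x27/\n/g),
// then keep only the pieces that start with http:// (grep "^http:\/\/.*$").
function findDirectLinks(html) {
  var pieces = html.replace(/'/g, "\n").split("\n");
  var links = [];
  for (var i = 0; i < pieces.length; i++) {
    var m = /^http:\/\/.*$/.exec(pieces[i]);
    if (m) links.push(m[0]);
  }
  return links;
}
```

The function returns every piece that starts with http://, so the caller still has to pick the right one if the page contains more than one such link.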

aGerman
Expert
Posts: 4678
Joined: 22 Jan 2010 18:01
Location: Germany

Re: How to extract data from website?

#28 Post by aGerman » 21 May 2017 04:23

A few weeks ago we had a similar topic.
viewtopic.php?f=3&t=7797

Adapted to meet your requirements:

Code: Select all

@if (@a)==(@b) @end /* Batch part:

@echo off &setlocal
::                                    mediafire site where to find the direct link                 directory where to save the file
cscript //nologo //e:jscript "%~fs0" "http://www.mediafire.com/file/4tovbku6kcercc7/Speecher.rar" "%userprofile%\Desktop"
pause
exit /b


JScript Part : */

var objIE = null;
try {
  WScript.Echo('Searching link ...');
  objIE = new ActiveXObject('InternetExplorer.Application');
  // objIE.Visible = true;
  objIE.Navigate(WScript.Arguments(0));
  while (objIE.Busy) { WScript.Sleep(100); }
  WScript.Sleep(3000);
  var link = objIE.document.getElementsByClassName('DownloadButtonAd-startDownload gbtnSecondary')[0].getAttribute('href');
  WScript.Echo('Found: ' + link);

  WScript.Echo('Downloading ...');
  var objXMLHTTP = new ActiveXObject('MSXML2.ServerXMLHTTP');
  objXMLHTTP.open('GET', link, false);
  objXMLHTTP.send();

  var objADOStream = new ActiveXObject('ADODB.Stream');
  objADOStream.Type = 1;
  objADOStream.Mode = 3;
  objADOStream.Open();
  objADOStream.Write(objXMLHTTP.responseBody);
  objADOStream.Position = 0;

  objIE.Quit();
  objIE = null;

  WScript.Echo('Saving ...');
  var objFSO = new ActiveXObject('Scripting.FileSystemObject');
  objADOStream.SaveToFile(objFSO.BuildPath(WScript.Arguments(1), objFSO.GetFileName(link)), 2);
  objADOStream.Close();
  WScript.Quit(0);
}
catch(e) {
  if (objIE != null) { objIE.Quit(); }
  WScript.Echo('Error: ' + e.message);
  WScript.Quit(1);
}



Even when the site has finished loading, the link isn't available immediately. It is updated after a while; in the meantime you'll see "Preparing Download" if you browse the site manually. I can't predict how long that takes, which is why I added a 3-second delay (WScript.Sleep(3000);). That may be longer than necessary, or not long enough.

You should be aware that as soon as Mediafire decides to change the site (e.g. they change the class name of the style) the script won't work anymore.
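
Instead of a fixed delay, the script could check repeatedly until the link shows up. The helper below is only a sketch of that idea: pollUntil, its parameters and the check function are assumptions, and how the page signals that the link is ready has not been verified.

```javascript
// Sketch: call check() until it returns a truthy value or the attempts run out.
// sleep is passed in so the helper does not depend on the host;
// under cscript it could be function (ms) { WScript.Sleep(ms); }.
function pollUntil(check, sleep, intervalMs, maxAttempts) {
  for (var attempt = 1; attempt <= maxAttempts; attempt++) {
    var result = check();
    if (result) return result;                      // value is ready
    if (attempt < maxAttempts) sleep(intervalMs);   // wait before the next try
  }
  return null;                                      // gave up
}
```

In the script above, check could read the href of the download button and return it only once it looks like a direct link, so the wait ends as soon as the link is there and is capped by maxAttempts otherwise.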

Steffen

PaperTronics
Posts: 118
Joined: 02 Apr 2017 06:11

Re: How to extract data from website?

#29 Post by PaperTronics » 26 May 2017 05:37

@aGerman
The script worked when I placed it on my desktop. But when I copied and pasted it into my program's code, it gave the same error as Aacini's code:

Code: Select all

C:\Users\pratik\Desktop\BATCHS~1\DUMMYB~1.BAT(1, 6) Microsoft JScript compilation error: Conditional compilation is turned off

I did some research and found that the error appears whenever other batch code is placed in the same file. So I made a separate file containing your code, and in my original file I wrote the command

Code: Select all

Start DownloadLinks.bat

but still the same error. I don't know if something is wrong with my computer or what.

thefeduke
Posts: 211
Joined: 05 Apr 2015 13:06
Location: MA South Shore, USA

Re: How to extract data from website?

#30 Post by thefeduke » 26 May 2017 09:09

PaperTronics wrote:so I made a separate file in which I placed your code and in my original file I wrote the command

Code: Select all

Start DownloadLinks.bat
but still the same error. I don't know if something is wrong with my computer or what.
When the first operand of the start command is quoted, it is taken as the title of the started window, not as the program name. It is therefore safer to always pass "" as an empty title, as in

Code: Select all

Start "" DownloadLinks.bat
John A.
