The ghostmachine4 & isacmahad Debate thread
Moderator: DosItHelp

OK, I answered my own question on this; now I would like some input.
What are your thoughts on wget? Is it a decent tool? I am used to it from Linux; how is it on Windows, any good?
wget fetches entire sites or single web pages for you. Just use it and see... read the docs, it's that simple. If you are doing a big project, I suggest you choose, learn, and use Python. The equivalents of wget in Python are urllib2 and urllib. Plus you can parse web sites using BeautifulSoup etc. There are lots of modules for you if you want to play with HTML parsing, and XML as well... doing them in shell is a hassle.
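For instance, a minimal sketch of that approach (Python 2 with BeautifulSoup 3 assumed; the URL is a placeholder) that fetches a page and lists its links:

import urllib2
from BeautifulSoup import BeautifulSoup   # BeautifulSoup 3; in bs4 it would be "from bs4 import BeautifulSoup"

html = urllib2.urlopen('http://example.com/').read()   # placeholder URL
soup = BeautifulSoup(html)
for a in soup.findAll('a', href=True):                 # every link on the page
    print a['href']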
Thank you for the reply, ghostmachine4.
I have been playing with wget and I read through the help text.
Perhaps there is something I'm missing here and just overlooked.
It is a simple question, I know, but I don't have the answer.
How do I output the content to a specified directory instead of the current one I'm running from?
Thanks again, you have been a great help on this journey. Another question: what are your thoughts or suggestions on web crawling?
What other tools might you recommend for this kind of task?

Wget Compiled Help wrote:
‘--spider’
When invoked with this option, Wget will behave as a Web spider, which means that it will not download the pages, just check that they are there. For example, you can use Wget to check your bookmarks:
wget --spider --force-html -i bookmarks.html
This feature needs much more work for Wget to get close to the functionality of real web spiders.
Wget seems too weak at searching for the specific content I want to grab, and when it comes to logic for checking page freshness, there is none built in. In this case I do not wish, as of yet, to write my own, but to study some examples and go from there.
I did some searching around and ran into a wiki definition with some example programs, but I am not too clear on where to start.
I would like to be able to grab the source (the information defined in the source: addresses, links, etc.) and parse it as needed, sending the output to designated files & dirs.
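For example, a minimal sketch of such a freshness check (assuming Python 2's urllib2; the URL and timestamp are placeholders) would ask the server via HTTP's If-Modified-Since header:

import urllib2

req = urllib2.Request('http://example.com/')                          # placeholder URL
req.add_header('If-Modified-Since', 'Sat, 01 Jan 2011 00:00:00 GMT')  # when we last fetched it
try:
    resp = urllib2.urlopen(req)
    print 'page changed; server sent a fresh copy'
except urllib2.HTTPError, e:
    if e.code == 304:          # 304 Not Modified: our copy is still fresh
        print 'page has not changed since our last visit'
    else:
        raise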
isacmahad wrote: How do I output the content to a specified directory instead of the current one I'm running from?
I don't have wget in front of me as I write, but check the man page for the output options. If nothing fits, just download into the current directory first and move the files to your destination once it finishes.
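If memory serves, the switch you want is -P (long form --directory-prefix); the path and URL here are placeholders:

wget -P C:\downloads -r http://example.com/

That mirrors the site into C:\downloads instead of the current directory.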
isacmahad wrote: Another question: what are your thoughts or suggestions on web crawling?
wget already does the job for you on web crawling... automatically.
Web crawling is just "emulating" your normal web browsing. If you look inside your web browser, it is "internally" doing things for you, i.e. checking the headers, seeing where you came from, accepting a cookie from the server, etc. You have to study the HTTP protocol a bit more for this.
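A minimal sketch of that in Python 2 (urllib2 and cookielib assumed; the URL and header values are placeholders), carrying cookies across requests and sending browser-style headers:

import urllib2, cookielib

jar = cookielib.CookieJar()                                        # holds cookies between requests
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
req = urllib2.Request('http://example.com/page2')                  # placeholder URL
req.add_header('Referer', 'http://example.com/')                   # where we "came from"
req.add_header('User-Agent', 'Mozilla/5.0')                        # look like a browser
resp = opener.open(req)
print resp.info()                                                  # response headers from the server
for c in jar:                                                      # cookies the server handed us
    print c.name, c.value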
isacmahad wrote: What other tools might you recommend for this kind of task?
Another tool is curl. Of course, if you are hardcore and want to do your own in a programming language, Python comes with modules like the ones I mentioned: urllib2, urllib, cookielib, httplib, etc. Those modules let you fetch web pages from sites programmatically. Same with Perl.
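For a quick taste of curl (the URL and output name are placeholders):

curl -o page.html http://example.com/

The -o switch writes the response body to the named file instead of standard output.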
A bit off topic for one moment, just from general paying attention and whatnot. You seem to spend a large amount of time behind your machine. Might I ask how many years you have been playing around with these boxes? I am observant, just not very tech savvy, and 3000+ hits is rather large for a stone-age topic like DOS.
However, I do wish to learn. I enjoy learning about computers, because they seem to be, and already are, leading the way to the future.
As quoted somewhere or other: it is not in the silo but in the middle of the desert with a satellite connection.
I tip my hat to you, sir.