How to copy website content using the wget command in Linux


Similar to wget-ftp-index.html

Need to get a directory listing for a web site that is presented like an FTP site? Here's my first working run over HTTP:

I did the following:

# Create a working directory and an empty index.html file; the empty file
# makes wget save the first real page as index.html.1 rather than index.html
mkdir /tmp/theserver
cd /tmp/theserver
>index.html

# Grab the first page; it will be saved as index.html.1
wget http://theserver
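
Why the empty file matters: by default wget refuses to clobber an existing file of the same name and saves the new download with a numeric suffix instead (this assumes a stock wget with no -nc or -O options), which is exactly what the numbering below relies on. A quick check at this point:

# The directory now holds the empty marker file plus the first real page;
# fetching the same URL again would simply create index.html.2, and so on
ls -l index.html index.html.1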

# Now the script I used: it starts at page 1 and works through the saved pages until there are no more:
COUNT=1; while :; do FILES=`cat index.html.$COUNT | grep "a href" | sed -e 's,^.*=",,' -e 's,".*$,,' | grep /$ | grep -v '^/$'`; if [ -n "$FILES" ]; then echo "$FILES" | while read; do PAGE=`cat index.html.$COUNT | grep "<b>Index of" | awk '{print $3}' | sed -e 's,<.*$,,' -e 's,^/*,/,' -e 's,/*$,/,'`; URL="http://theserver$PAGE$REPLY"; echo GETTING: $URL; wget $URL; done; fi; COUNT=`expr $COUNT + 1`; if [ ! -e index.html.$COUNT ]; then break; fi; done

# Here I've split it for easy reading:

COUNT=1
while :; do
     # Pull out every href ending in / (the sub-directory links), dropping the bare "/" entry
     FILES=`cat index.html.$COUNT | grep "a href" | sed -e 's,^.*=",,' -e 's,".*$,,' | grep /$ | grep -v '^/$'`
     if [ -n "$FILES" ]; then
          echo "$FILES" | while read; do
               # Work out which directory this page describes from its "Index of" heading
               PAGE=`cat index.html.$COUNT | grep "<b>Index of" | awk '{print $3}' | sed -e 's,<.*$,,' -e 's,^/*,/,' -e 's,/*$,/,'`
               # $REPLY holds the link read above; wget saves the new listing as the next index.html.N
               URL="http://theserver$PAGE$REPLY"
               echo GETTING: $URL
               wget $URL
          done
     fi
     # Move on to the next saved page; stop once there isn't one
     COUNT=`expr $COUNT + 1`
     if [ ! -e index.html.$COUNT ]; then
          break
     fi
done
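
To see what those two pipelines actually pull out of a single saved page, they can be run by hand against one file (index.html.1 here):

# The directory the page describes, taken from its "Index of ..." heading and
# normalised to exactly one leading and one trailing slash
cat index.html.1 | grep "<b>Index of" | awk '{print $3}' | sed -e 's,<.*$,,' -e 's,^/*,/,' -e 's,/*$,/,'

# The sub-directory links on that page: href values ending in /, minus the bare "/"
cat index.html.1 | grep "a href" | sed -e 's,^.*=",,' -e 's,".*$,,' | grep /$ | grep -v '^/$'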

All of the pages are saved in one directory as index.html.x, where x keeps incrementing until there are no more pages to fetch.

Here is a sample of the pages it was reading; each one is a plain directory listing with a title and heading of the form:

Index of /software/Docs
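
The full HTML isn't reproduced above, but a rough stand-in, consistent with what the script greps for (the <b>Index of heading and the a href links), could be written like this; the file name sample-listing.html and its contents are purely illustrative:

cat > sample-listing.html <<'EOF'
<html><head><title>Index of /software/Docs</title></head><body>
<h1><b>Index of /software/Docs</b></h1>
<a href="/">Parent Directory</a>
<a href="manuals/">manuals/</a>
<a href="readme.txt">readme.txt</a>
</body></html>
EOF

# With index.html.$COUNT swapped for sample-listing.html, the FILES pipeline
# above prints only manuals/ and the PAGE pipeline prints /software/Docs/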

Then everything is searchable any way you want. To begin, here's a sample search:

cd /tmp/theserver
grep linux *
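
As a closing aside, a recursive wget can also mirror a tree like this directly, keeping the real directory layout instead of a flat pile of numbered files. A minimal sketch, with theserver still standing in for the real host:

# -r follows links recursively, -np keeps it from climbing above the starting
# directory, and -l 5 caps the recursion depth; files land under ./theserver/
wget -r -np -l 5 http://theserver/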
