Similar to wget-ftp-index.html
Need to get a directory listing for a web site that is presented like an FTP site? Here's my first working run over HTTP:
I did the following:
# Set up a working directory and create an empty index.html file; the empty file makes wget start saving at index.html.1 rather than index.html
mkdir /tmp/theserver
cd /tmp/theserver
>index.html
# Grab the first page; it will be saved as index.html.1
wget http://theserver
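wget won't overwrite the index.html we created, so this fetch (and every one the loop below makes) lands in the next numbered file. At this point the directory should hold just the two files, which makes for a quick sanity check (assuming nothing else was written here):

ls
# index.html  index.html.1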
# Now the script I used, starting at page 1 and working through until there are no more pages:
COUNT=1; while :; do FILES=`cat index.html.$COUNT | grep "a href" | sed -e 's,^.*=",,' -e 's,".*$,,' | grep '/$' | grep -v '^/$'`; if [ -n "$FILES" ]; then echo "$FILES" | while read; do PAGE=`cat index.html.$COUNT | grep "<b>Index of" | awk '{print $3}' | sed -e 's,<.*$,,' -e 's,^/*,/,' -e 's,/*$,/,'`; URL="http://theserver$PAGE$REPLY"; echo GETTING: $URL; wget "$URL"; done; fi; COUNT=`expr $COUNT + 1`; if [ ! -e index.html.$COUNT ]; then break; fi; done
# Here I've split it for easy reading:
COUNT=1
while :; do
  FILES=`cat index.html.$COUNT | grep "a href" | sed -e 's,^.*=",,' -e 's,".*$,,' | grep '/$' | grep -v '^/$'`
  if [ -n "$FILES" ]; then
    echo "$FILES" | while read; do
      PAGE=`cat index.html.$COUNT | grep "<b>Index of" | awk '{print $3}' | sed -e 's,<.*$,,' -e 's,^/*,/,' -e 's,/*$,/,'`
      URL="http://theserver$PAGE$REPLY"
      echo GETTING: $URL
      wget "$URL"
    done
  fi
  COUNT=`expr $COUNT + 1`
  if [ ! -e index.html.$COUNT ]; then
    break
  fi
done
All the pages end up in one directory as index.html.x, where x increments until there are no more pages to fetch.
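For comparison, GNU wget's own recursive mode can do a similar crawl by itself. This is a different approach from the numbered-file loop above; it mirrors the tree under a ./theserver/ directory instead (a sketch, assuming a reasonably recent wget):

# -r = recursive, -np = never ascend to the parent directory,
# -l inf = no depth limit (the default recursion depth is 5)
wget -r -np -l inf http://theserver/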
Here is a sample of the pages it was reading:
Index of /software/Docs
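For reference, here is roughly what the markup the pipeline keys on looks like; this is an assumed Apache-style index snippet, not copied from the real server:

<b>Index of /software/Docs</b>
<a href="subdir/">subdir/</a>
<a href="somefile.txt">somefile.txt</a>

grep "a href" finds the link lines, the two sed expressions cut away everything but the href value, grep '/$' keeps only the directory links, and grep -v '^/$' throws out a bare "/" link. The "<b>Index of" line supplies the current directory path for building the next URL.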
Then this is searchable any way you want. To begin, here's a sample search:
cd /tmp/theserver
grep linux *
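grep's -l flag prints just the names of the matching files instead of every matching line, and -i makes the match case-insensitive; from the listed index.html.x files you can read the "Index of" line to recover which directory each one came from:

grep -il linux index.html.*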