Harvestman: controlling the list of out-of-domain links

cropped screenshot of the 'Out-of-Domain' control group

Count of current entries in the Spider's list of out-bound links (plus References links)

"Add to an Ignore list" textbox

"Reference" button: Copies a partial URL from the textbox into the references list, adds an asterisk before and after.; Adds the resulting partial URL to the Spider's internal ignore list.
"Ignore" button: Copies a partial URL from the textbox into the ignore list, adds an asterisk before and after.; Adds the resulting partial URL to the Spider's internal ignore list.
"Clear" button: Clears the reference list; Clears the ignore list; Clears the Spider's internal ignore list.

Copy of Spider's list of out-bound links

The References listbox

URLs in the References listbox are ignored by the Spider but included in Harvestman's report.

The Ignore listbox

URLs in the Ignore listbox are totally ignored.
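
The difference between the two listboxes can be sketched as a lookup. This sketch assumes the asterisks act as glob-style wildcards, which the page does not state outright; the function names and example URLs are hypothetical, and Python's fnmatch stands in for whatever matching Harvestman really uses.

```python
from fnmatch import fnmatchcase


def matches_any(url, patterns):
    """True if the URL matches at least one wildcard pattern."""
    # Note: fnmatch also treats '?' and '[' as special characters, so this is
    # only a rough stand-in for plain asterisk wildcards.
    return any(fnmatchcase(url, pattern) for pattern in patterns)


def classify(url, references, ignored):
    """Label an out-of-domain URL by the listbox whose entry it matches."""
    if matches_any(url, references):
        return "reference"    # skipped by the Spider, kept in the report
    if matches_any(url, ignored):
        return "ignored"      # skipped entirely
    return "unfiltered"       # matches neither listbox


references = ["*example.org/docs*"]
ignored = ["*ads.example.net*"]

kinds = [
    classify("http://example.org/docs/intro.html", references, ignored),
    classify("http://ads.example.net/banner.js", references, ignored),
    classify("http://other.example.com/", references, ignored),
]
# kinds == ["reference", "ignored", "unfiltered"]
```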

"Break on match" textbox

Locate a bad link

So, you crawl your site, and you have sixty pages, and one of those pages links to can2can.biz instead of can2can.biz/ (without the trailing slash, the domain index page is detected as out of domain!). Out of sixty pages, how do you find it? Type '*can2can.biz' into the "Break on match" textbox and re-crawl your site. Harvestman's 'Crawl all' will stop crawling as soon as the bad URL is added to the list, with the URL and title shown up in the control area.
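
Why '*can2can.biz' catches only the bad link can be sketched as follows, again assuming glob-style matching where the asterisk stands for any run of characters. The function name, the page names and the http:// scheme are made up for illustration; this is not Harvestman's code.

```python
from fnmatch import fnmatchcase


def crawl_until_match(found_links, break_pattern):
    """Walk (page, link) pairs in crawl order and stop at the first match.

    Returns the (page, link) pair that triggered the break, or None.
    """
    for page, link in found_links:
        if fnmatchcase(link, break_pattern):
            return page, link
    return None


# Hypothetical crawl results: each pair is (page being crawled, link found on it).
links = [
    ("http://can2can.biz/index.html", "http://can2can.biz/"),
    ("http://can2can.biz/about.html", "http://can2can.biz"),   # missing slash
    ("http://can2can.biz/contact.html", "http://can2can.biz/"),
]

hit = crawl_until_match(links, "*can2can.biz")
# hit == ("http://can2can.biz/about.html", "http://can2can.biz")
```

The pattern has no trailing asterisk, so a link must end in can2can.biz to match; the correct form with the trailing slash does not, and the crawl stops on the page carrying the bad link.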
