Harvestman Control Section
The controls, top row left to right
- Domain: (textbox) Enter the domain name to be crawled, with no "/".
- GoTo (button) Click this button to initialise the Spider and load the domain. Pressing Enter in the Domain textbox has the same effect as clicking this button.
- Crawl 1 (button) Click to crawl a single page from the queue. It will always crawl if a page is queued.
- Crawl all (button) Click to crawl all queued pages, adding pages to the queue as links are found. This can run for a very long time. It will exit before the end of the queue on certain tests.
- ReCrawl (button) Clicking is the same as clicking GoTo and then Crawl all. A click-saver when reading and re-reading a site to see whether fixes work.
- Help (button) Shows the help page (brief reminders, not as extensive as these pages). Click OK to return.
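The queue behaviour behind Crawl 1 and Crawl all can be pictured with a small sketch. This is not the Spider's actual code: the class name CrawlQueue, the fetch_links callback and the limit parameter are all hypothetical, and limit merely stands in for the kind of stop test mentioned above (such as the Limit checkbox described further down).

```python
from collections import deque


class CrawlQueue:
    """Hypothetical model of the page queue, not the Spider's real internals."""

    def __init__(self, start_url, fetch_links, limit=None):
        self.queue = deque([start_url])   # pages waiting to be crawled
        self.seen = {start_url}           # every URL ever queued
        self.fetch_links = fetch_links    # assumed callback: url -> list of links
        self.limit = limit                # assumed page cutoff for crawl_all
        self.crawled = []                 # pages crawled so far, in order

    def crawl_one(self):
        """Like Crawl 1: crawl a single queued page, if there is one."""
        if not self.queue:
            return None
        url = self.queue.popleft()
        self.crawled.append(url)
        for link in self.fetch_links(url):
            if link not in self.seen:     # queue each new link only once
                self.seen.add(link)
                self.queue.append(link)
        return url

    def crawl_all(self):
        """Like Crawl all: keep crawling until the queue is empty or the limit is hit."""
        while self.queue:
            if self.limit is not None and len(self.crawled) >= self.limit:
                break
            self.crawl_one()
        return self.crawled
```

In this sketch crawl_one ignores the limit, which mirrors the note that Crawl 1 still crawls one page per click even when a cutoff is set.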
Most-recent Page Data, textboxes below Domain
- Reading Page: (textbox) the URL of the last page crawled
- Titled: (textbox) Content of any Title-tag found in the page header
- (un-labeled large textbox) The HTML of the page as reported by the Spider (links are stripped out).
From clicking GoTo until the first page is crawled, this box shows the content of spiders.txt (or "page not found").
- Keywords: (textbox) Content of any meta-keyword-tag found in the page header
Remaining controls, down on right
- No Querys (checkbox) If checked, query strings are not included in links. For one site with heavy .asp use, this gave 12 pages with no query strings versus 200 with them.
- Avoid (2 textboxes) Enter a pattern to avoid in URLs. "/Sale/" will do nothing! "*/Sale/*" (with stars) will not crawl the Sale folder. The small box is the count of loaded patterns (*.zip and *.ac* are preloaded).
- From cache (checkbox) If this box is checked, the Spider will try to crawl the local cache. This is faster and puts less load on your server, but don't expect to see updates to a page unless you uncheck the box and read live.
- Cache: (textbox) The Spider's cache folder. Do NOT use the same folder as any browser; provide a folder private to the Spider. If a valid path is provided, the Spider will cache any page fetched from the web.
- Limit [100] (checkbox) If checked, the Crawl all button will stop at [100] pages crawled (the default is 100; the cutoff is editable). Crawl 1 will still crawl one page per click. A safety measure when first checking a client's site that might auto-spawn a thousand pages of spam.
- MUST MATCH: (textbox) Any text here requires an exact match (stars mean 'anything') in the URL before it is queued. A powerful filter!
- Report writer group
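The way the No Querys, Avoid and MUST MATCH controls combine can be illustrated with a short sketch. It assumes that a pattern must match the whole URL and that only the star is special, which would explain why "/Sale/" does nothing while "*/Sale/*" works. The function names, the case-sensitive comparison and the order of the checks are assumptions made for illustration, not the Spider's actual implementation.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def star_match(pattern, url):
    """True if the WHOLE url matches pattern, where '*' means 'anything'."""
    # Escape everything except the stars, then require a full match.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, url, flags=re.DOTALL) is not None


def strip_query(url):
    """Drop the query string, as the No Querys checkbox is described to do."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))


def should_queue(url, avoid, must_match="", no_queries=False):
    """Return the URL to queue, or None if a filter rejects it (hypothetical)."""
    if no_queries:
        url = strip_query(url)
    if any(star_match(p, url) for p in avoid):
        return None                       # hit an Avoid pattern
    if must_match and not star_match(must_match, url):
        return None                       # failed the MUST MATCH filter
    return url


preloads = ["*.zip", "*.ac*"]             # the two preloaded Avoid patterns
print(star_match("/Sale/", "http://example.com/Sale/item.html"))    # False
print(star_match("*/Sale/*", "http://example.com/Sale/item.html"))  # True
print(should_queue("http://example.com/a.asp?id=3", preloads, no_queries=True))
```

The last call prints http://example.com/a.asp, showing how many query-string variants of one .asp page would collapse into a single queued URL.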