Harvestman Control Section
The controls, top row, left to right:
- Domain: (textbox) Enter the domain name to be crawled, with no trailing "/"
- GoTo (button) Click this button to initialise the Spider and load the domain. Pressing the Enter key in the Domain textbox is equivalent to clicking this button.
- Ready and Reading (checkboxes): If Ready is checked, at least one address is queued for reading. If Reading is checked, the Spider is reading and processing, even if it seems slow.
- Crawl 1 (button) Click to crawl a single page from the queue. Always crawls if a page is queued.
- Crawl all (button) Click to crawl all queued pages (checks the Reading box), adding pages to the queue as links are found; this can run for a very long time. Certain tests make it exit before the end of the queue (un-checking Reading).
- ReCrawl (button) Clicking this is the same as clicking GoTo and then Crawl all; a click-saver when reading and re-reading to see whether fixes work.
Controls down the right side:
- Keywords (checkbox) If checked, the page's normal body text is compared to the Keywords list in the head, and missing keywords are reported as "not found". Keywords used as the clickable text in links are NOT considered part of the body, so false errors are common.
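The keyword check described above can be sketched roughly as follows. This is an assumption about the mechanism, not the Spider's actual code; the function name and the regex-based tag stripping are illustrative only. Note how anchor text is removed first, which is exactly why link-only keywords produce "not found" reports:

```python
# Hypothetical sketch of the Keywords check: each meta keyword is
# searched for in the visible body text, EXCLUDING link (anchor) text.
import re

def missing_keywords(meta_keywords, body_html):
    # Drop anchors first, so clickable link text does not count as body.
    body = re.sub(r"<a\b[^>]*>.*?</a>", " ", body_html,
                  flags=re.IGNORECASE | re.DOTALL)
    # Strip remaining tags to approximate the visible text.
    text = re.sub(r"<[^>]+>", " ", body).lower()
    return [kw for kw in (k.strip().lower() for k in meta_keywords.split(","))
            if kw and kw not in text]

# "sale" appears only inside a link, so it is reported as missing.
print(missing_keywords("honey, sale",
                       "<p>Fresh honey</p> <a href='/s'>sale</a>"))
# -> ['sale']
```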
- No Querys (checkbox) If checked, query strings are not included in links. On one site with heavy .asp use, there were 12 pages without query strings and 200 with them.
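Dropping the query string from a discovered link can be done with the standard library, as in this sketch (an assumption about how No Querys works; the function name is made up):

```python
# Sketch: strip the query string (and fragment) from a link before
# queuing it, so ?id=1, ?id=2, ... all collapse to one page.
from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

print(strip_query("http://example.com/page.asp?id=7&sort=asc"))
# -> http://example.com/page.asp
```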
- Avoid (2 textboxes) Enter a URL pattern to avoid. "/Sale/" will do nothing!! "*/Sale/*" (with stars) will keep the /Sale/ folder from being crawled. The small box shows the count of loaded patterns (*.zip and *.ac* are preloaded).
- From cache (checkbox) If checked, the Spider will try to crawl from the local cache. Faster, and less load on your server, but don't expect to see page updates unless you un-check this and read live.
- Cache: (textbox) The Spider's cache folder. Do NOT use the same folder as any browser; provide a folder private to the Spider. If a valid path is provided, the Spider will cache any page fetched from the web.
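One plausible shape for the cache behaviour is sketched below. This is purely an assumption: the hash-based filename, the `cache_path`/`fetch` names, and the `download` stub are all illustrative, not the Spider's real layout.

```python
# Sketch: map each URL to a file in a Spider-private cache folder.
# Cached pages are served when "From cache" is on; anything fetched
# live is written to the cache.
import hashlib
import os

def download(url):
    # Stand-in for the live HTTP fetch (not the point of this sketch).
    return "<html>live copy of %s</html>" % url

def cache_path(cache_folder, url):
    # Hash the URL so any URL becomes a safe, unique filename.
    name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
    return os.path.join(cache_folder, name)

def fetch(url, cache_folder, from_cache):
    path = cache_path(cache_folder, url)
    if from_cache and os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return f.read()            # serve the cached copy
    html = download(url)               # read live
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)                  # cache the page just fetched
    return html
```

The hashing step is one design choice for keeping the cache folder flat; a real implementation might instead mirror the site's folder structure.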
- Limit [100] (checkbox) If checked, the Crawl all button stops after [100] pages crawled (the default is 100; the cutoff is editable). Crawl 1 will still crawl one page per click. A safety when first checking a client's site that might auto-spawn a thousand pages of spam.
- MUST MATCH: (textbox) Any text entered here must exactly match the URL (stars mean "anything") before a page is queued. A powerful filter!
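Taken together, Avoid and MUST MATCH amount to a queue-admission test. The sketch below assumes the stars behave like shell-style wildcards (which fits the "/Sale/" vs "*/Sale/*" note above); the function name and exact semantics are assumptions:

```python
# Sketch: decide whether a URL may be queued. MUST MATCH (if set)
# must match the whole URL, and no Avoid pattern may match it.
# "*" matches anything, so bare "/Sale/" matches nothing useful.
from fnmatch import fnmatchcase

def may_queue(url, avoid_patterns, must_match=""):
    if must_match and not fnmatchcase(url, must_match):
        return False
    return not any(fnmatchcase(url, p) for p in avoid_patterns)

avoid = ["*/Sale/*", "*.zip", "*.ac*"]          # includes the preloads
print(may_queue("http://example.com/Sale/item.htm", avoid))  # -> False
print(may_queue("http://example.com/honey.htm", avoid))      # -> True
print(may_queue("http://example.com/honey.htm", avoid,
                "*example.com*"))                            # -> True
```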
- Report writer group
Most-recent Page Data, textboxes below Domain:
- Reading Page: (textbox) The URL of the last page the Spider started to crawl
- Titled: (textbox) Content of any Title tag found in the page header
- (un-labeled large textbox) The HTML of the page as reported by the Spider (links are stripped out)
From clicking GoTo until the first page is crawled, this box shows the content of robots.txt (or "page not found").
- Keywords: (textbox) Content of any meta keywords tag found in the page header
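The "links are stripped out" behaviour of the large textbox suggests the Spider pulls the hrefs for its queue and then removes the anchor markup. A minimal sketch of that idea, assuming regex-based parsing (a real crawler would use a full HTML parser; all names here are illustrative):

```python
# Sketch: collect href targets for the queue, then show the page
# HTML with the anchor tags removed.
import re

LINK = re.compile(r"<a\b[^>]*>.*?</a>", re.IGNORECASE | re.DOTALL)
HREF = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def extract_links(html):
    # hrefs found inside anchor tags, in document order
    return [HREF.search(a).group(1)
            for a in LINK.findall(html) if HREF.search(a)]

def strip_links(html):
    return LINK.sub("", html)

page = '<p>Hi</p><a href="/next.htm">next</a>'
print(extract_links(page))   # -> ['/next.htm']
print(strip_links(page))     # -> <p>Hi</p>
```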
Last edited =