Despite efforts to make it seem complex and arcane, XML is both simple and powerful. XML is the markup system used for the RSS feeds behind podcasting, and it is also the markup used by Google and other search engines to read the structure of our sites.
Beginning with version 1.3 build 0, Harvestman will write out a list of all the page URLs it finds in crawling your site, in proper Google Sitemap XML. By itself, this simply gives Google (or another search engine) the same list of pages it would get by crawling your site itself, and its bots still have to crawl your pages to get their data.
The Sitemap format supports certain "optional" data tags for the pages reported, and including these optional tags can significantly improve your relationship with the crawlers. The first optional tag is the priority tag, a number from 0.0 to 1.0 indicating that page's importance to the site, in your opinion. If you don't set a value, 0.5 (dead average) is assumed. The second optional tag is the changefreq tag, a statement from the list "always", "hourly", "daily", "weekly", "monthly", "yearly" or "never", which reflects how often you expect to change the information on the page (use yearly for most pages, weekly or daily for news, and always for generated stuff). The third optional tag is the lastmod tag, which contains the date the page was last really edited. This makes the lastmod tag our power tool for getting Google to notice our edits!
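To make the three optional tags concrete, here is a minimal Python sketch of what one page's entry looks like in Sitemap XML. It is not Harvestman's own code: the function name url_entry, the example URL and the sample values are made up for illustration, and the range check assumes the 0.0 to 1.0 priority scale of the published Sitemap protocol.

```python
import xml.etree.ElementTree as ET

# Change frequencies allowed by the Sitemap protocol.
CHANGEFREQ_VALUES = {"always", "hourly", "daily", "weekly",
                     "monthly", "yearly", "never"}

def url_entry(loc, priority=None, changefreq=None, lastmod=None):
    """Build one <url> element (hypothetical helper, not Harvestman's API).

    Optional tags are written only when a value was supplied.
    """
    url = ET.Element("url")
    ET.SubElement(url, "loc").text = loc
    if lastmod is not None:
        # Date the page was last really edited, e.g. YYYY-MM-DD.
        ET.SubElement(url, "lastmod").text = lastmod
    if changefreq is not None:
        if changefreq not in CHANGEFREQ_VALUES:
            raise ValueError(f"unknown changefreq: {changefreq!r}")
        ET.SubElement(url, "changefreq").text = changefreq
    if priority is not None:
        if not 0.0 <= priority <= 1.0:
            raise ValueError("priority must be between 0.0 and 1.0")
        ET.SubElement(url, "priority").text = f"{priority:.3f}"
    return url

# Example: a news page that changes daily and matters more than average.
entry = url_entry("http://www.example.com/news.html",
                  priority=0.8, changefreq="daily", lastmod="2006-03-01")
entry_xml = ET.tostring(entry, encoding="unicode")
print(entry_xml)
```

A page with no optional values simply gets a bare loc tag, and the crawler falls back to its defaults.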
The simplest way for you to create all those XML tags for the sitemap is to have Harvestman write the tags for you, using information it scans from your pages.
Harvestman responds to four tags in the head section of your page.
When the Write Sitemap button is clicked, the Sitemap writer module iterates through the lists of page URLs and XML strings, constructing a properly formatted XML file with the optional tags included for those of your pages where values were found (and leaving out the pages you no-listed).
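The writer step described above can be pictured roughly as follows. This is only a sketch under assumptions, not Harvestman's actual module: the names write_sitemap, page_urls, optional_xml and nolisted are invented here, and it assumes the crawler has already produced two parallel lists, one of page URLs and one of ready-made optional-tag strings (empty where no values were found).

```python
from xml.sax.saxutils import escape

def write_sitemap(page_urls, optional_xml, nolisted=frozenset()):
    """Return the text of a sitemap file (hypothetical helper).

    page_urls    -- list of page URLs found by the crawl
    optional_xml -- parallel list of optional-tag strings ("" if none)
    nolisted     -- URLs to leave out of the sitemap
    """
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for url, extra in zip(page_urls, optional_xml):
        if url in nolisted:
            continue  # no-listed pages never reach the file
        # escape() turns & into &amp; so querystring URLs stay valid XML
        lines.append(f"  <url><loc>{escape(url)}</loc>{extra}</url>")
    lines.append("</urlset>")
    return "\n".join(lines) + "\n"

# Example with made-up URLs: one plain page, one with optional tags,
# and one no-listed script page.
urls = ["http://www.example.com/",
        "http://www.example.com/item.php?id=7&size=small",
        "http://www.example.com/bigpicture.php"]
extras = ["",
          "<changefreq>weekly</changefreq><priority>0.700</priority>",
          ""]
sitemap_text = write_sitemap(urls, extras,
                             nolisted={"http://www.example.com/bigpicture.php"})
print(sitemap_text)
```

Saving sitemap_text to a file named sitemap.xml in the site root is the usual last step.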
As of 3.1 build 11, Harvestman is case sensitive: "date" works, "Date" does not.
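The effect of case sensitivity can be shown with a small Python sketch that scans the head section of a page for meta tags. It is an illustration only, not how Harvestman is implemented: the class name HeadMetaScanner is invented, and the set of recognized names is an assumption built from the two names this page mentions ("date" and "robots").

```python
from html.parser import HTMLParser

# Assumed names for illustration; matched exactly, so "Date" is ignored.
RECOGNIZED = {"date", "robots"}

class HeadMetaScanner(HTMLParser):
    """Collect recognized <meta name=... content=...> tags found in <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attributes = dict(attrs)
            name = attributes.get("name")
            if name in RECOGNIZED:  # case-sensitive comparison
                self.found[name] = attributes.get("content", "")

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False

# Example page: the lowercase "date" tag is picked up, "Date" is not.
page = ('<html><head>'
        '<meta name="date" content="2006-03-01"/>'
        '<meta name="Date" content="1999-01-01"/>'
        '<meta name="robots" content="noindex"/>'
        '</head><body></body></html>')
scanner = HeadMetaScanner()
scanner.feed(page)
print(scanner.found)
```

The safe habit is to type the tag names in lowercase, exactly as documented.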
Consider the case where you have a script-generated page (the page file is a script which presents variable HTML to the visitor) which provides an enlarged picture of one item from your inventory. This page might be linked to from many pages, and yet it cannot present meaningful content without a correct querystring. No-list the page: it will not bloat your sitemap, and the same <meta name="robots" content="noindex"/> will instruct any crawler bots that find it to leave it out of the site's data as well. On the other hand, if you have data-driven, script-generated pages, each providing the details of some item, you would want to include all of the URLs generated by crawling your site with queries enabled (you can't know when someone will do a Google search for "Size small, extra long sleeves").