Optional Sitemap entries

Google Basic Sitemap XML

Despite efforts to make it seem complex and arcane, XML is both simple and powerful. XML is the markup system used for the RSS feeds behind podcasting, and it is also the markup used by Google and other search engines to read the structure of our sites.

Beginning with version 1.3 build 0, Harvestman writes out a list of all the page URLs it finds while crawling your site, in proper Google Sitemap XML. On its own, this simply gives Google (or any other search engine) the same list of pages it would get by crawling your site itself, and its bots still have to crawl your pages to get their data.

The Sitemap format supports certain "optional" data tags for the pages reported, and including these optional tags can significantly improve your relationship with the crawlers. The first optional tag is the priority tag, a number from 0.0 to 1.0 indicating that page's importance to the site, in your opinion; if you don't set a value, 0.5 (dead average) is assumed. The second optional tag is the changefreq tag, a statement from the list "always", "hourly", "daily", "weekly", "monthly", "yearly" or "never", which reflects how often you expect to change the information on the page (use yearly for most pages, weekly or daily for news, and always for generated stuff). The third optional tag is the lastmod tag, which contains the date the page was last really edited; this makes the lastmod tag our power tool for getting Google to notice our edits!
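As an illustration of the three optional tags, here is a minimal Python sketch that builds a single url entry and rejects values the Sitemap protocol does not allow. It is not Harvestman's code; the function name url_entry and the example.com address are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET
from datetime import date

# The change frequencies defined by the Sitemap protocol.
CHANGEFREQS = {"always", "hourly", "daily", "weekly", "monthly", "yearly", "never"}


def url_entry(loc, priority=None, changefreq=None, lastmod=None):
    """Build one <url> element; optional tags are added only when given.

    Hypothetical helper for illustration, not part of Harvestman.
    """
    url = ET.Element("url")
    ET.SubElement(url, "loc").text = loc
    if lastmod is not None:
        # date.fromisoformat raises ValueError if the string is not a valid ISO date.
        ET.SubElement(url, "lastmod").text = date.fromisoformat(lastmod).isoformat()
    if changefreq is not None:
        if changefreq not in CHANGEFREQS:
            raise ValueError(f"unknown changefreq: {changefreq!r}")
        ET.SubElement(url, "changefreq").text = changefreq
    if priority is not None:
        if not 0.0 <= priority <= 1.0:
            raise ValueError("priority must be between 0.0 and 1.0")
        ET.SubElement(url, "priority").text = f"{priority:.3f}"
    return url


# Example address is a placeholder.
entry = url_entry("http://www.example.com/news.html",
                  priority=0.8, changefreq="weekly", lastmod="2009-10-21")
print(ET.tostring(entry, encoding="unicode"))
```

A page with no optional values simply produces a url element containing only loc, which is what a basic sitemap looks like.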

Best Practices (to make Googlebot love your site)

(All the best crawler-bots, not just Googlebot)

Use all the optional tags for every page.
Always use the slowest changefreq you think reasonable.
  If you say "weekly" and only update the page twice a year, the bots learn to ignore you.
  If you say "never", you can still use lastmod to tell the bots that "never" was today.
Name your sitemap "Sitemap.xml" and place it in the root folder.
Include a line with "Sitemap: http://(put your site here)/Sitemap.xml" in robots.txt, in the root.
Place a copy of gss.xsl in the root folder (there is a copy in the zip).
Sign up for Google Webmaster tools and submit your sitemap.
When a page changes, make sure to update its date meta tag.
After significant changes on one or more pages, use Harvestman to make a new Sitemap.
Because all the bots read the robots.txt file and find your sitemap there, rely on the lastmod tag to notify them of page changes.
Re-submit your sitemap only after changes with an urgency that cannot wait for the next regular crawl.
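One way to check the robots.txt step is to parse the file and see which sitemaps it declares. The sketch below uses Python's standard urllib.robotparser (site_maps() needs Python 3.8 or later); the robots.txt content and the example.com address are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Sitemap: http://www.example.com/Sitemap.xml
"""


def declared_sitemaps(robots_text):
    """Return the sitemap URLs declared in a robots.txt body (empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


print(declared_sitemaps(ROBOTS_TXT))
```

An empty list here would mean the bots have no pointer to your sitemap and can only learn of it through a manual submission.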

Harvestman's new meta-tags

The simplest way for you to create all those XML tags for the sitemap is to have Harvestman write the tags for you, using information it scans from your pages.

Harvestman responds to four tags in the head section of your page:

<meta name="robots" content="noindex"/>

will place "<robots>noindex</robots>" in the SiteXml list

When the sitemap writer reads the list, the page url will not be included

Any other content in the robots-meta is ignored
<meta name="SMPriority" content="0.500"/>

will place "<priority>0.500</priority>" in the SiteXml list
<meta name="SMFreq" content="yearly"/>

will place "<changefreq>yearly</changefreq>" in the SiteXml list
<meta name="date" content="2009-10-21"/>

will place "<lastmod>2009-10-21</lastmod>" in the SiteXml list
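The mapping from the four meta tags to sitemap elements can be sketched with Python's standard html.parser. This is only an illustration of the idea, not Harvestman's implementation: the class and function names are invented, the name match is deliberately case sensitive, and treating a robots value such as "noindex, nofollow" as a plain "noindex" is an assumption about what "any other content is ignored" means.

```python
from html.parser import HTMLParser

# Meta names from the list above, mapped to the element each one produces.
# Lookup is case sensitive, so name="Date" is not recognised.
META_TO_ELEMENT = {
    "robots": "robots",
    "SMPriority": "priority",
    "SMFreq": "changefreq",
    "date": "lastmod",
}


class SitemapMetaScanner(HTMLParser):
    """Hypothetical scanner that collects sitemap values from meta tags."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta .../> tags also arrive here via handle_startendtag.
        if tag != "meta":
            return
        attributes = dict(attrs)
        element = META_TO_ELEMENT.get(attributes.get("name"))
        content = attributes.get("content")
        if element is None or content is None:
            return
        if element == "robots":
            # Assumption: only the noindex token matters; the rest is dropped.
            tokens = [token.strip() for token in content.split(",")]
            if "noindex" not in tokens:
                return
            content = "noindex"
        self.found[element] = content


def scan_head(html):
    """Return a dict of sitemap element name to value for one page."""
    scanner = SitemapMetaScanner()
    scanner.feed(html)
    return scanner.found


PAGE = """<html><head>
<meta name="SMPriority" content="0.500"/>
<meta name="SMFreq" content="yearly"/>
<meta name="date" content="2009-10-21"/>
<meta name="Date" content="1999-01-01"/>
</head><body></body></html>"""

for element, value in scan_head(PAGE).items():
    print(f"<{element}>{value}</{element}>")
```

The capitalised "Date" tag in the sample page is skipped, so the lastmod value stays 2009-10-21.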

When the Write Sitemap button is clicked, the Sitemap writer module iterates through the lists of page URLs and XML strings, constructing a properly formatted XML file that includes the optional tags for those of your pages where values were found (and leaves out the pages you no-listed).
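A writer of this kind might look roughly like the Python sketch below. It is not Harvestman's module: the write_sitemap function, the shape of the page list (a URL paired with a dict of scanned values), and the example.com addresses are all assumptions for illustration. ET.indent requires Python 3.9 or later.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemap protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
OPTIONAL_TAGS = ("lastmod", "changefreq", "priority")


def write_sitemap(pages):
    """Return sitemap XML for a list of (url, tags) pairs.

    Hypothetical writer: tags is a dict of values scanned from each page.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, tags in pages:
        if tags.get("robots") == "noindex":
            continue  # no-listed pages are left out of the sitemap
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        for name in OPTIONAL_TAGS:
            if name in tags:  # only write optional tags that were found
                ET.SubElement(url, name).text = tags[name]
    ET.indent(urlset)  # pretty-print the tree in place
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Placeholder pages: one fully tagged, one bare, one no-listed.
PAGES = [
    ("http://www.example.com/",
     {"priority": "1.000", "changefreq": "monthly", "lastmod": "2009-10-21"}),
    ("http://www.example.com/about.html", {}),
    ("http://www.example.com/enlarge.php", {"robots": "noindex"}),
]

print(write_sitemap(PAGES))
```

Only the first two pages appear in the output; the third is skipped because it carries the noindex value.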

As of 3.1 build 11, Harvestman is case sensitive: "date" works, "Date" does not.

Why no-list

Consider the case where you have a script-generated page (the page file is a script which presents variable HTML to the visitor) that provides an enlarged picture of one item from your inventory. This page might be linked to from many pages, yet it cannot present meaningful content without a correct query string. No-list the page: it will not bloat your sitemap, and the same <meta name="robots" content="noindex"/> will instruct any crawler bots that find it to leave it out of the site's data as well. If, on the other hand, you have data-driven, script-generated pages that each provide the details of some item, you would want to include all of the URLs generated by crawling your site with queries enabled (you can't know when someone will do a Google search for "Size small, extra long sleeves").
