Programming

Creating A Google Sitemap For An ASP.NET Website

Monday, September 11th, 2006

Any SEO worth his (or her) salt knows that Google Sitemaps are a great way to tell Google about the pages that are available on your site. Using a sitemap, you can pass Google an inventory of your content and let wee Googlebot crawl your pages without having to rely on your shaky navigational system!

What Is A Google Sitemap?

A Sitemap is a very simple XML file that you use to list the URL of each page in your site. We’ll get to the syntax and structure of the Sitemap XML in a minute, but first let’s look at the data you can record for each URL:

  • URL: You’ll need the URL for each page you list. This is represented by the loc element.
  • Last modified: Not required, but if your ASP.NET application records the date each page was updated, you could include it here. This is represented by the lastmod element. I’m assuming that Google will check this against the last crawl date and crawl the page again if it’s been updated since.
  • Change Frequency: Not required, but allows you to specify how often the page is likely to change, using the changefreq element (always, hourly, daily, weekly, monthly, yearly or never). For example, content on your homepage is more likely to be updated than an archived article from 2005. Google say that this tag is considered a hint and not a command, and that pages may be crawled more or less frequently than specified.
  • Priority: You can use the priority element to assign a weighting to each page, indicating its importance relative to the other pages on your site. The valid values are from 0.0 to 1.0. Could be useful if you can find a way to assign priority to your pages.

Sample Sitemap

Here’s a sample sitemap. Notice that it starts with the XML declaration at the top. The first (root) element is urlset and within it, each URL is defined by a <url> element. Note that for the moment, I’m just using the URL (loc) and the priority value.

<?xml version="1.0" encoding="utf-8" ?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
	<url>
		<loc>http://www.yoursite.com/Pages.aspx?pageid=21</loc>
		<priority>0.5</priority>
	</url>
	<url>
		<loc>http://www.yoursite.com/Pages.aspx?pageid=22</loc>
		<priority>0.5</priority>
	</url>
</urlset>

Creating Our Sitemap File

For this example, I’m going to assume you’ve got a fairly basic content management system, running on ASP.NET. Let’s say each page is stored as an entry in a database table called pages. You have fields for page_id, date_created and date_updated among others. Our sitemap is going to focus on displaying each page’s URI and last updated information.

In Visual Web Developer Express, create a new .aspx page in your website (I called mine Sitemap.aspx for the sake of originality). When the page appears in the editor, strip out all the HTML, except the ASP.NET declaration on the very first line. The page should look something like this:

<%@ Page Language="VB" AutoEventWireup="false" CodeFile="Sitemap.aspx.vb" Inherits="Sitemap" %>

Now, press F7 to open up the codebehind file, and let’s get to work.

First, add the namespaces you’ll need to access the database, retrieve the data and write an XML file. At the very least, you’ll need System.Data and System.Xml, plus System.Text for the Encoding class used in the next step.

Second, create a Page_Load event. We want the sitemap to be generated dynamically each time the page is accessed. Within the Page_Load, we begin by writing the headers of the XML document:

Response.Clear()
' A sitemap is plain XML rather than an RSS feed, so send it as text/xml
Response.ContentType = "text/xml"
Dim objX As New XmlTextWriter(Response.OutputStream, Encoding.UTF8)
objX.WriteStartDocument()

Next, we write the root element, urlset, and set its xmlns attribute to declare the Google Sitemap namespace:

objX.WriteStartElement("urlset")
objX.WriteAttributeString("xmlns", "http://www.google.com/schemas/sitemap/0.84")

At this point, we’ll access the pages table in the database. The script below assumes you already have an open SqlConnection called sqlConn. It creates a SqlDataReader which iterates through the available pages, most recently updated first. As it iterates through the results, it creates a start element (url) using WriteStartElement, within which two other elements are nested using WriteElementString. The url element is finally closed using WriteEndElement.

' sqlConn is assumed to be an open SqlConnection
Dim oCmd As New SqlClient.SqlCommand("SELECT page_id, date_updated FROM pages ORDER BY date_updated DESC", sqlConn)
Dim oRdr As SqlClient.SqlDataReader = oCmd.ExecuteReader()
If oRdr.HasRows Then
    While oRdr.Read()
        objX.WriteStartElement("url")
        ' ToString() keeps the concatenation valid under Option Strict On
        objX.WriteElementString("loc", "http://www.yoursite.com/Pages.aspx?pageid=" & oRdr("page_id").ToString())
        objX.WriteElementString("priority", "0.7")
        objX.WriteEndElement() 'URL
    End While
End If
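
The loop above only writes loc and priority, even though the pages table also records date_updated. As a rough sketch of how the optional lastmod element could be added, here is a hypothetical helper that writes one url entry from the current reader row. WriteUrlEntry and SitemapHelpers are made-up names, not part of .NET. It assumes date_updated is a SQL datetime column that may be NULL, and that the query returns both page_id and date_updated. In a Web Site project a module like this would live in the App_Code folder.

```vb
Imports System.Data
Imports System.Globalization
Imports System.Xml

Public Module SitemapHelpers
    ' Hypothetical helper: writes a single <url> entry for the current row,
    ' adding <lastmod> only when the row has an update date.
    Public Sub WriteUrlEntry(ByVal writer As XmlWriter, ByVal record As IDataRecord)
        writer.WriteStartElement("url")
        writer.WriteElementString("loc", "http://www.yoursite.com/Pages.aspx?pageid=" & record("page_id").ToString())

        ' Assumption: date_updated is a datetime column and may be NULL
        Dim updatedIndex As Integer = record.GetOrdinal("date_updated")
        If Not record.IsDBNull(updatedIndex) Then
            Dim updated As DateTime = record.GetDateTime(updatedIndex)
            ' lastmod uses the W3C date format, for example 2006-09-11
            writer.WriteElementString("lastmod", updated.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture))
        End If

        writer.WriteElementString("priority", "0.7")
        writer.WriteEndElement() ' url
    End Sub
End Module
```

Inside the While loop, the four writer calls would then collapse to a single line: WriteUrlEntry(objX, oRdr).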

Finally we close our database objects and publish the XML document using objX.Flush:

oRdr.Close()
oCmd.Dispose()
objX.WriteEndElement() 'URLset
objX.WriteEndDocument()
objX.Flush()
objX.Close()
Response.End()
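
For reference, here is one way the whole codebehind file could fit together. Treat it as a sketch under a few assumptions rather than the definitive version: the connection string name SiteDb is invented for the example, the Using blocks are a swapped-in alternative to closing the database objects by hand, and the page is sent as text/xml.

```vb
Imports System.Configuration
Imports System.Data.SqlClient
Imports System.Text
Imports System.Xml

Partial Class Sitemap
    Inherits System.Web.UI.Page

    ' AutoEventWireup is false in the page directive, so the handler is wired up with Handles
    Protected Sub Page_Load(ByVal sender As Object, ByVal e As System.EventArgs) Handles Me.Load
        ' Assumption: web.config contains a connection string named "SiteDb"
        Dim connString As String = ConfigurationManager.ConnectionStrings("SiteDb").ConnectionString

        Response.Clear()
        Response.ContentType = "text/xml"

        Dim objX As New XmlTextWriter(Response.OutputStream, Encoding.UTF8)
        objX.WriteStartDocument()
        objX.WriteStartElement("urlset")
        objX.WriteAttributeString("xmlns", "http://www.google.com/schemas/sitemap/0.84")

        ' Using blocks dispose the connection, command and reader even if an error occurs
        Using sqlConn As New SqlConnection(connString)
            sqlConn.Open()
            Using oCmd As New SqlCommand("SELECT page_id FROM pages ORDER BY date_updated DESC", sqlConn)
                Using oRdr As SqlDataReader = oCmd.ExecuteReader()
                    While oRdr.Read()
                        objX.WriteStartElement("url")
                        objX.WriteElementString("loc", "http://www.yoursite.com/Pages.aspx?pageid=" & oRdr("page_id").ToString())
                        objX.WriteElementString("priority", "0.7")
                        objX.WriteEndElement() ' url
                    End While
                End Using
            End Using
        End Using

        objX.WriteEndElement() ' urlset
        objX.WriteEndDocument()
        objX.Flush()
        objX.Close()
        Response.End()
    End Sub
End Class
```

The main design choice here is the Using blocks: the database connection is opened and released inside Page_Load, so nothing is left open if the query fails halfway through.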

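Now that you’re done, you should be able to call up the Sitemap.aspx file in your web browser. Type in the URL and the XML document should appear. If your browser offers to download the file instead (FireFox did this to me, while Internet Explorer displayed it), check the content type the page sends: a feed type such as application/rss+xml tends to trigger a download prompt, whereas text/xml is normally displayed in the browser.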

Now, to finish the job off, create an account for Google Sitemaps, and follow the instructions there to set up a profile for your site and add your sitemap. After a few hours, Google should start to tell you if it parsed the sitemap correctly.

Strip HTML Tags From A String Using Regular Expressions

Thursday, August 31st, 2006

It’s possible to use regular expressions to remove HTML tags from a string in order to ‘sanitize’ the string, so that it can be used in a tag for instance. For this example, I’m using VB.NET.

Let’s start with a text string that we’ve retrieved from our database. It’s HTML-formatted, so it contains paragraphs, lists, <strong> and <em> elements. We want to remove all the HTML from the string so that we can truncate the text and use a portion to create a dynamic META description for our web-based application.

In our codebehind file, we first need to add the regular expression functionality, so at the top of the file add:

Imports System.Text.RegularExpressions

Now, create a new function called vbTagStripper:

Function vbTagStripper(ByVal inputString As String) As String
    ' Replace anything that looks like an HTML tag with an empty string
    inputString = Regex.Replace(inputString, "<(.|\n)*?>", String.Empty)
    Return inputString
End Function

When we use the function, we’ll pass in the string we want to sanitize. The Regex.Replace will search the string for instances of the pattern <(.|\n)*?> and will replace any matches with an empty string, effectively erasing them from the string.

Exploring The Regex Pattern

What does <(.|\n)*?> mean? Well, the opening < and closing > represent the start and end of your HTML tags. Within the parentheses are the items to search for: the period represents any character except a newline, and the \n represents a newline character, so between them they match any character at all. The *? immediately after the parentheses is an instruction to match zero or more repeats of the items inside the parentheses, but as few as possible, so each match stops at the first > it reaches. Without the ?, a single match would run from the first < in the string to the last >, taking all the text in between with it. This should effectively remove every HTML tag in the string.

Calling The vbTagStripper Function

In order to use the function, start off with a string. My string is coming from the database via a SqlDataReader object, so you might display the string in a label by doing the following:

lblArticle.Text = vbTagStripper(oReader("Article_Text").ToString())

How To: Make A Copy Of A SQL Server Database Table

Monday, August 21st, 2006

I’ve been working on a couple of projects recently where I’ve had to retain legacy databases and integrate them into new websites.

In order to do this without damaging the original tables, I find it useful to make a copy of the original database table and use that for the development work. Since I do all of my bespoke CMS development on hosted Microsoft SQL Server databases, I had to hunt down a quick method to copy an existing database table into a new one.

Let’s say we have an old table (oldnews) containing news items for a website. The structure is generally sound, but we don’t want to risk any damage to the original table. The following syntax copies the data from oldnews to the new table newnews:

INSERT INTO newnews SELECT * FROM oldnews

This copies the rows into an existing table, so newnews has to exist already with a matching column layout. If the table doesn’t exist yet, INSERT INTO won’t create it; in SQL Server, the SELECT * INTO newnews FROM oldnews form creates the table as part of the copy.

You can even copy table data from a different database on the same server by specifying the full database path:

INSERT INTO newnews SELECT * FROM olddb.dbo.oldnews

If you’ve got any more insight into table copying techniques, drop your wisdom in the comments!