Extracting links from a piece of HTML is a very common task, and most programmers come across it at some point. I have always used regular expressions for this, and they have always worked for me, no complaints there. However, I was curious to find another way to do it.

Here is what I did.

Used CFHTTP to get the HTML code.

<cfhttp url="http://www.anujgakhar.com" method="get" />
<cfset myVar = cfhttp.FileContent>

Put it all in a CF XML object.

<!--- cfxml parses its body as XML, so the fetched markup must be well formed --->
<cfxml variable="myXml">
<cfoutput>#ToString(myVar)#</cfoutput>
</cfxml>

Got all links using XPath.

<cfset allLinks = XmlSearch(myXml, "//*[local-name()='a']")>
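A possible refinement, sketched here rather than taken from the steps above: some anchors have no href attribute (named anchors such as <a name="top">, for example), and an XPath predicate can filter them out up front. The variable name linksWithHref is made up for illustration.

```cfml
<!--- Sketch: select only <a> elements that carry an href attribute --->
<cfset linksWithHref = XmlSearch(myXml, "//*[local-name()='a'][@href]")>
<cfoutput>#ArrayLen(linksWithHref)# links with an href found</cfoutput>
```

Note that local-name() comparisons are case sensitive, so this only matches lowercase <a> tags, which is what XHTML requires anyway.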

Put everything inside a CF query.

<cfset myLinks = QueryNew("title,link")>
<cfloop from="1" to="#ArrayLen(allLinks)#" index="i">
<cfset QueryAddRow(myLinks, 1)>
<cfset QuerySetCell(myLinks, "title", allLinks[i].XmlText)>
<!--- Not every anchor has an href (e.g. named anchors), so check first --->
<cfif StructKeyExists(allLinks[i].XmlAttributes, "href")>
<cfset QuerySetCell(myLinks, "link", allLinks[i].XmlAttributes.href)>
</cfif>
</cfloop>
<cfdump var="#myLinks#">

And it works! I was delighted to see the results. The one condition is that the markup must be well-formed XML, which in practice means valid XHTML; otherwise the XML parsing step throws an error. Nothing special, I know, but at least I found out which people don't have valid markup on their sites! Ha!
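To avoid an error on pages that are not well formed, the parse could be guarded. The following is only a sketch under my own assumptions: it uses IsXML() to test the fetched markup and swaps in XmlParse() for the cfxml tag, and it reuses the variable names from the steps above.

```cfml
<!--- Sketch: only parse when the fetched markup is well-formed XML --->
<cfhttp url="http://www.anujgakhar.com" method="get" />
<cfset myVar = cfhttp.FileContent>
<cfif IsXML(myVar)>
    <!--- XmlParse is used here in place of cfxml; both produce an XML document object --->
    <cfset myXml = XmlParse(myVar)>
    <cfset allLinks = XmlSearch(myXml, "//*[local-name()='a']")>
    <cfoutput>Found #ArrayLen(allLinks)# links.</cfoutput>
<cfelse>
    <cfoutput>The page is not well-formed XML, so it cannot be parsed this way.</cfoutput>
</cfif>
```

IsXML() effectively parses the document once before XmlParse() parses it again; wrapping the parse in cftry/cfcatch would be an alternative that avoids the double work.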