Extracting Links using Xpath

On November 21, 2007, in ColdFusion, by Anuj Gakhar

Extracting links from a piece of HTML is a very common task, and most programmers will have come across it at some point. I have always used regular expressions for this, and they have always worked for me. However, I was curious to find another way to do it.

Here is what I did.

Used CFHTTP to get the HTML code.

[xml]
<cfhttp url="https://www.anujgakhar.com" method="get" />
<cfset myVar = cfhttp.FileContent>[/xml]

Put it all in a CF XML Object

[xml]
<cfxml variable="myXml">
<Cfoutput>#ToString(myVar)#</Cfoutput>
</cfxml>[/xml]

Got all links using XPath

[xml]
<cfset allLinks = XmlSearch(myXml, "//*[local-name()='a']")>[/xml]

Put everything inside a CF query.

[xml]
<cfset myLinks = QueryNew("title,link")>
<cfloop from = "1" to="#ArrayLen(allLinks)#" index="this">
<cfset queryAddRow(myLinks,1)>
<cfset querySetCell(myLinks, "title", allLinks[this].XmlText) />
<cfset querySetCell(myLinks, "link", allLinks[this].XmlAttributes.href) />
</cfloop>
<cfdump var = "#mylinks#">[/xml]
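
One caveat with the loop above: an anchor without an href attribute (a named anchor, for example) would make the XmlAttributes.href lookup fail. Here is a minimal sketch of a more defensive version, assuming the same myXml variable as before. It narrows the XPath with an [@href] predicate so only anchors that carry an href are returned. The names hrefLinks and i are my own for illustration.

[xml]
<!--- Sketch: select only anchors that actually have an href attribute --->
<cfset hrefLinks = XmlSearch(myXml, "//*[local-name()='a'][@href]")>
<cfset myLinks = QueryNew("title,link")>
<cfloop from="1" to="#ArrayLen(hrefLinks)#" index="i">
<cfset QueryAddRow(myLinks, 1)>
<!--- XmlText is empty when the anchor only wraps other elements, such as an image --->
<cfset QuerySetCell(myLinks, "title", Trim(hrefLinks[i].XmlText))>
<cfset QuerySetCell(myLinks, "link", hrefLinks[i].XmlAttributes.href)>
</cfloop>
<cfdump var="#myLinks#">[/xml]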

And it works! I was delighted to see the results. The one condition is that the page must be well-formed XML, which in practice means valid XHTML; ordinary HTML with unclosed tags will not parse. Nothing special, I know, but at least I found out which people don’t have valid markup on their sites! Ha!
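
Since the parse fails on markup that is not well-formed, one option is to check the content before building the XML object. This is only a sketch, assuming the myVar variable from the CFHTTP step; it uses IsXml() to test the string and XmlParse() as an alternative to the cfxml tag. The fallback message is just a placeholder.

[xml]
<!--- Sketch: only parse when the fetched content is well-formed XML --->
<cfif IsXml(myVar)>
<cfset myXml = XmlParse(myVar)>
<cfelse>
<cfoutput>This page is not well-formed XML, so XPath cannot be used on it.</cfoutput>
</cfif>[/xml]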

Tagged with: Xpath • XML

17 Responses to Extracting Links using Xpath

Robert Gatti says:

November 24, 2007 at 7:14 am

Thanks for the great idea! I never thought about using XPath to parse HTML documents; I’ve always used a simple regular expression. Unfortunately, I think that the majority of websites out there are neither valid HTML nor valid XHTML. With the number of Internet Explorer bugs and the crazy workarounds for them, and the number of people who just don’t care about anything other than IE, I don’t see the vast majority of websites validating correctly. Again, thanks for the great idea.

Anuj Gakhar says:

November 26, 2007 at 10:04 am

That’s right, Robert. The majority of websites won’t have valid XHTML.
However, this was just something I was playing around with in my very rare free time 🙂
But glad to know you liked the idea. Cheers!

Dan G. Switzer, II says:

November 26, 2007 at 9:01 pm

You can always run the output through jTidy to make the output XHTML compliant. For the sole purpose of finding links, this should work pretty well.

Anuj Gakhar says:

November 27, 2007 at 1:22 pm

Haven’t used jTidy before. Does it give you tidied-up output on the fly?

Dan G. Switzer, II says:

November 27, 2007 at 2:07 pm

@Anuj:

Per the jTidy site:
http://jtidy.sourceforge.net/

“JTidy is a Java port of HTML Tidy, a HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document that is being processed, which effectively makes you able to use JTidy as a DOM parser for real-world HTML.”

In a nutshell, you can use jTidy to clean up HTML generated by Word or any Rich Text HTML editor. Obviously if the HTML is really malformed, you can run into issues with the layout not being exactly what the user might expect (since malformed HTML can produce all sorts of interesting layout issues.)

However, for the purpose of extracting links contained in anchor tags, it should do a really good job of that. jTidy will ensure the HTML is XHTML compliant–making it possible to parse with XPATH.

Anuj Gakhar says:

November 27, 2007 at 2:52 pm

Thanks Dan. I like it. This basically means we can set it up as a CFX tag and then parse any HTML document as DOM. I know PHP has some built-in functions for DOM parsing, but with jTidy we can do it in CF as well. Thanks again!

Dan G. Switzer, II says:

November 27, 2007 at 2:59 pm

@Anuj:

The DOM parsing in jTidy isn’t very good. If you actually want to parse it into an actual DOM tree item, you’ll probably want to use another DOM parser (although you’ll still want to use jTidy to clean up the output first.)

Also, jTidy is pretty easy to interface to from CF directly. Jeff Coughlin posted a solution of using jTidy to automatically build a TOC from heading tags within some CMS generated markup:

http://www.jeffcoughlin.com/?pg=9&fn=3&id=1

Anuj Gakhar says:

November 27, 2007 at 3:14 pm

Oh, that’s very nice! I like the example he demonstrated, and thanks to you for saving me time, as I was going to start testing out jTidy if you hadn’t sent me this example!

Do you know of any other DOM parsers that can be used ?

Dan G. Switzer, II says:

November 27, 2007 at 3:41 pm

@Anuj:

Check out the jTidy Sourceforge forums. There are a few discussions on the list about various DOM parsers that people have used.

Also, I should clarify and say that I have successfully used the jTidy DOM parser. It’s just not particularly robust and can at times be frustrating to work with. I’m hacking this code together from a CFC I wrote to give an example, so I’m not sure it works as-is:

// an HTML string to parse
sHtml = "hello worldAll your base are belong to us!";

// init jTidy
jTidy = createObject("java", "org.w3c.tidy.Tidy");
jTidyConfigObj = createObject("java", "org.w3c.tidy.Configuration");

// set configuration items
jTidy.setCharEncoding(jTidyConfigObj.utf8);
jTidy.setMakeClean(true);
jTidy.setDropFontTags(true);
jTidy.setXHTML(true);
jTidy.setRawOut(true);
jTidy.setSmartIndent(true);
jTidy.setWord2000(true);
jTidy.setDropEmptyParas(true);
jTidy.setShowWarnings(false);
jTidy.setFixComments(true);
jTidy.setQuiet(true);

// read the HTML string as a Java string
oReadBuffer = createObject("java", "java.lang.String").init(sHtml).getBytes();
// convert string to ByteArrayInputStream–which is needed by jTidy
oHtmlBAIS = createObject("java", "java.io.ByteArrayInputStream").init(oReadBuffer);

// do the parsing (take an input stream and make it print pretty)
oTidyDOM = jTidy.parseDOM(oHtmlBAIS, javaCast("null", ""));

// close the BAIS stream
oHtmlBAIS.close();

// get all the paragraph tags
oParagraphTags = oTidyDOM.getElementsByTagName(javaCast("string", "p"));

The “oParagraphTags” would contain all the p tags in the document.

I’ve actually used jTidy to create a CFC that converts HTML to a Plain Text document. I needed the functionality for e-mailing content from a knowledge base system for non-HTML based clients.

Dan G. Switzer, II says:

November 27, 2007 at 3:43 pm

Crud, the sHtml string didn’t show up properly. It should be:

sHtml = "[html][body][p]hello world[/p][p]All your base are belong to us![/p][/body][/html]";

Just replace the brackets ([ and ]) with greater than and less than characters.

Anuj Gakhar says:

November 27, 2007 at 3:49 pm

Yeah, I saw that in the downloadable code in the example you sent over earlier. Cool stuff! I’ve got something to play with for the next couple of days 🙂
Cheers mate!

Anuj Gakhar says:

November 27, 2007 at 4:31 pm

In my example above, I could also do this:

<cfset allLinks = XmlSearch(myXml, "//*[starts-with(name(), 'a') and string-length(name()) = 1]") />

instead of

<cfset allLinks = XmlSearch(myXml, "//*[local-name()='a']")>

Cool!

ana gomez says:

May 27, 2008 at 2:54 am

That is very useful code. Thanks

house dj says:

September 10, 2008 at 1:55 pm

Hi, it doesn’t work for me. Is this an indicator of invalid HTML code? Please say NO! My webmaster tells me that my page is valid now! Thanks for your help.

Anuj Gakhar says:

September 10, 2008 at 3:47 pm

@house dj, that probably means invalid HTML. What’s the URL you are trying to parse?

Paradores says:

September 15, 2008 at 1:58 pm

Really handy code – thanks a lot

Anuj Gakhar says:

September 15, 2008 at 2:43 pm

Thanks Paradores
