Extracting Links using XPath

On November 21, 2007, in ColdFusion, by Anuj Gakhar

Extracting links from a piece of HTML is a very common task, and most programmers will have come across this requirement at some point. I have always used regular expressions for this, and they have always worked for me, no complaints there. However, I was curious to find another way to do it.
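
For comparison, here is roughly what the regex route looks like. This is just a rough, untested sketch: the html variable is a placeholder for whatever markup you are scanning, and the pattern only picks up double-quoted href attributes.

[xml]
<!--- grab each double-quoted href out of the anchor tags --->
<cfset links = ArrayNew(1)>
<cfset pos = 1>
<cfloop condition="true">
<cfset m = REFindNoCase('<a[^>]*href\s*=\s*"([^"]+)"', html, pos, true)>
<cfif m.pos[1] EQ 0><cfbreak></cfif>
<cfset ArrayAppend(links, Mid(html, m.pos[2], m.len[2]))>
<cfset pos = m.pos[1] + m.len[1]>
</cfloop>
<cfdump var="#links#">[/xml]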

Here is what I did.

Used CFHTTP to get the HTML code.

[xml]
<cfhttp url="https://www.anujgakhar.com" method="get" />
<cfset myVar = cfhttp.FileContent>[/xml]

Put it all in a CF XML Object

[xml]
<cfxml variable="myXml">
<cfoutput>#ToString(myVar)#</cfoutput>
</cfxml>[/xml]

Got all the links using XPath

[xml]
<cfset allLinks = XmlSearch(myXml, "//*[local-name()='a']")>[/xml]

Put everything inside a CF query.

[xml]
<cfset myLinks = QueryNew('title,link')>
<cfloop from="1" to="#ArrayLen(allLinks)#" index="i">
<cfset queryAddRow(myLinks, 1)>
<cfset querySetCell(myLinks, "title", allLinks[i].XmlText)>
<cfset querySetCell(myLinks, "link", allLinks[i].XmlAttributes.href)>
</cfloop>
<cfdump var="#myLinks#">[/xml]

And it works! I was delighted to see the results. The only condition, I must say, is that the HTML has to be valid HTML or XHTML. Nothing special, I know, but at least I found out which people don't have valid HTML on their sites! Ha!
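
If you would rather not have the whole template error out on pages that do not validate, one option is to wrap the parse in a cftry. Just a rough, untested sketch along the lines of the code above:

[xml]
<cftry>
    <cfxml variable="myXml">
        <cfoutput>#ToString(myVar)#</cfoutput>
    </cfxml>
    <cfset allLinks = XmlSearch(myXml, "//*[local-name()='a']")>
    <cfcatch type="any">
        <!--- the markup is not well-formed, so XPath cannot be used on it --->
        <cfset allLinks = ArrayNew(1)>
    </cfcatch>
</cftry>[/xml]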


17 Responses to Extracting Links using XPath

  1. Robert Gatti says:

    Thanks for the great idea! I never thought about using XPath for parsing HTML documents; I've always used a simple regular expression. Unfortunately, I think the majority of websites out there are neither valid HTML nor valid XHTML. Between the number of Internet Explorer bugs, the crazy workarounds for them, and the number of people who just don't care about anything other than IE, I just don't see the vast majority of websites validating correctly. Again, thanks for the great idea.

  2. Anuj Gakhar says:

    That's right, Robert. The majority of websites won't have valid XHTML.
    However, this was just something I was playing around with in my very rare free time 🙂
    But glad to know you liked the idea. Cheers!

  3. You can always run the output through jTidy to make the output XHTML compliant. For the sole purpose of finding links, this should work pretty well.

  4. Anuj Gakhar says:

    I haven't used jTidy before. Does it give you a tidied-up output on the fly?

  5. @Anuj:

    Per the jTidy site:
    http://jtidy.sourceforge.net/

    “JTidy is a Java port of HTML Tidy, a HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document that is being processed, which effectively makes you able to use JTidy as a DOM parser for real-world HTML.”

    In a nutshell, you can use jTidy to clean up HTML generated by Word or any Rich Text HTML editor. Obviously if the HTML is really malformed, you can run into issues with the layout not being exactly what the user might expect (since malformed HTML can produce all sorts of interesting layout issues.)

    However, for the purpose of extracting links contained in anchor tags, it should do a really good job. jTidy will ensure the HTML is XHTML compliant, making it possible to parse it with XPath.
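
    Something along these lines ought to work. This is just a rough, untested sketch, using jTidy's parse(in, out) method plus the cfhttp.FileContent from Anuj's post:

    // rough, untested sketch: tidy the fetched HTML into XHTML, then XPath it
    jTidy = createObject("java", "org.w3c.tidy.Tidy");
    jTidy.setXHTML(true);        // force XHTML output so it will parse as XML
    jTidy.setQuiet(true);
    jTidy.setShowWarnings(false);

    // cfhttp.FileContent is whatever the cfhttp call in the post fetched
    oReadBuffer = createObject("java", "java.lang.String").init(cfhttp.FileContent).getBytes();
    inStream = createObject("java", "java.io.ByteArrayInputStream").init(oReadBuffer);
    outStream = createObject("java", "java.io.ByteArrayOutputStream").init();

    // parse(in, out) writes the tidied markup to the output stream
    jTidy.parse(inStream, outStream);

    // now it is well-formed, so the XPath from the post works on it
    cleanXml = xmlParse(outStream.toString());
    allLinks = xmlSearch(cleanXml, "//*[local-name()='a']");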

  6. Anuj Gakhar says:

    Thanks Dan, I like it. This basically means we can set it up as a CFX tag and then parse any HTML document as a DOM. I know PHP has some built-in functions for DOM parsing, but with jTidy we can do it in CF as well. Thanks again!

  7. @Anuj:

    The DOM parsing in jTidy isn't very good. If you want to parse the markup into an actual DOM tree, you'll probably want to use another DOM parser (although you'll still want to use jTidy to clean up the output first).

    Also, jTidy is pretty easy to interface to from CF directly. Jeff Coughlin posted a solution of using jTidy to automatically build a TOC from heading tags within some CMS generated markup:

    http://www.jeffcoughlin.com/?pg=9&fn=3&id=1

  8. Anuj Gakhar says:

    Oh, that's very nice! I like the example he demonstrated... and thanks to you for saving my time, as I was going to start testing out jTidy myself if you hadn't sent me this example!

    Do you know of any other DOM parsers that can be used?

  9. @Anuj:

    Check out the jTidy Sourceforge forums. There are a few discussions there on the various DOM parsers that people have used.

    Also, I should clarify and say that I have successfully used the jTidy DOM parser. It's just not particularly robust and can at times be frustrating to work with. I'm hacking this code together from a CFC I wrote, to give an example, so I'm not sure it works as-is:

    // an HTML string to parse
    sHtml = "hello worldAll your base are belong to us!";

    // init jTidy
    jTidy = createObject("java", "org.w3c.tidy.Tidy");
    jTidyConfigObj = createObject("java", "org.w3c.tidy.Configuration");

    // set configuration items
    jTidy.setCharEncoding(jTidyConfigObj.utf8);
    jTidy.setMakeClean(true);
    jTidy.setDropFontTags(true);
    jTidy.setXHTML(true);
    jTidy.setRawOut(true);
    jTidy.setSmartIndent(true);
    jTidy.setWord2000(true);
    jTidy.setDropEmptyParas(true);
    jTidy.setShowWarnings(false);
    jTidy.setFixComments(true);
    jTidy.setQuiet(true);

    // read the HTML string as a Java string
    oReadBuffer = createObject("java", "java.lang.String").init(sHtml).getBytes();
    // convert string to ByteArrayInputStream, which is needed by jTidy
    oHtmlBAIS = createObject("java", "java.io.ByteArrayInputStream").init(oReadBuffer);

    // do the parsing (take an input stream and make it print pretty)
    oTidyDOM = jTidy.parseDOM(oHtmlBAIS, javaCast("null", ""));

    // close the BAIS stream
    oHtmlBAIS.close();

    // get all the paragraph tags
    oParagraphTags = oTidyDOM.getElementsByTagName(javaCast("string", "p"));

    The "oParagraphTags" would contain all the p tags in the document.
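
    From there you could walk the NodeList with the usual org.w3c.dom calls, something along these lines (again untested, and it assumes the paragraphs only contain plain text):

    // loop the NodeList; item() is zero-based
    for (i = 0; i LT oParagraphTags.getLength(); i = i + 1) {
        oPara = oParagraphTags.item(javaCast("int", i));
        // getFirstChild() is the text node inside the p tag
        writeOutput(oPara.getFirstChild().getNodeValue() & "<br />");
    }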

    I’ve actually used jTidy to create a CFC that converts HTML to a Plain Text document. I needed the functionality for e-mailing content from a knowledge base system for non-HTML based clients.

  10. Crud, the sHtml string didn’t show up properly. It should be:

    sHtml = "[html][body][p]hello world[/p][p]All your base are belong to us![/p][/body][/html]";

    Just replace the brackets ([ and ]) with greater than and less than characters.

  11. Anuj Gakhar says:

    Yeah, I saw that in the downloadable code in that example you sent over earlier. Cool stuff! I've got something to play with for the next couple of days 🙂
    Cheers mate!

  12. Anuj Gakhar says:

    In my example above, I could also do this:

    <cfset allLinks = XmlSearch(myXml, "//*[starts-with(name(), 'a') and string-length(name()) = 1]") />

    instead of

    <cfset allLinks = XmlSearch(myXml, "//*[local-name()='a']")>

    Cool!

  13. ana gomez says:

    That is very useful code. Thanks

  14. house dj says:

    Hi, it doesn't work for me. Is this an indicator of invalid HTML code? Please say NO! My webmaster tells me that my page is valid now! Thanks for your help.

  15. Anuj Gakhar says:

    @house dj, that probably means invalid HTML. What's the URL you are trying to parse?
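
    One quick way to check is to run the fetched markup through IsXML() before trying the XPath step. Just a rough sketch, with a placeholder URL:

    <cfhttp url="http://www.example.com" method="get" />
    <cfoutput>Well-formed: #IsXML(cfhttp.FileContent)#</cfoutput>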

  16. Paradores says:

    Really handy code, thanks a lot!
