Tuesday, July 15, 2003

Relative to Absolute Links

As I mentioned in my previous entry, I gave a bit of thought to how one can convert links from relative to absolute in any given piece of HTML.

Scanning the HTML manually was not an option. HTML has comments, irregular syntax and character encodings, so doing it in an ad-hoc manner is guaranteed to produce bad results.
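To make that concrete, here is a minimal sketch of the kind of ad-hoc scanner I mean. It is written in present-day C#, and the class name NaiveLinkScanner, the method name FindHrefs and the regular expression are made up for illustration; this is not code from my weblog.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NaiveLinkScanner
{
    // Deliberately simplistic: only understands double-quoted href attributes
    // and knows nothing about comments, scripts or other quoting styles.
    private static readonly Regex HrefPattern =
        new Regex("href=\"([^\"]*)\"", RegexOptions.IgnoreCase);

    // Returns every href value the pattern happens to match, in document order.
    public static List<string> FindHrefs(string html)
    {
        List<string> found = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            found.Add(m.Groups[1].Value);
        }
        return found;
    }
}
```

Given the input `<!-- <a href="old.html">old</a> --><a href=new.html>new</a><a href='other.html'>other</a>`, this scanner reports only old.html: the one link that sits inside a comment. It misses both live links because one is unquoted and the other uses single quotes. Each such case can be patched, but the list of special cases keeps growing.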

What I wanted was an HTML parser and an HTML producer. However, while there are many such beasts in the Java world, I couldn't find any for C#, and my weblog solution was written in C#.

Finally, it dawned on me: why not use the HTML parser/producer built into Internet Explorer? IE6 already knows how to convert HTML into a DOM and vice versa, and it is a fairly well-tested and robust solution. Once that light bulb lit up, I produced the following code in less than an hour:

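// Requires a reference to the Microsoft.mshtml interop assembly.
using System;
using mshtml;

// relativeLocation is the base address that relative links are resolved against.
private static string ConvertToAbsoluteUrls (string html, Uri relativeLocation) {
    // Let MSHTML parse the markup into a DOM.
    IHTMLDocument2 doc = new HTMLDocumentClass ();
    doc.write (new object [] { html });
    doc.close ();

    // doc.links holds <area> elements as well as <a> elements, so iterate as
    // IHTMLElement; casting every item to IHTMLAnchorElement fails on an <area>.
    foreach (IHTMLElement element in doc.links) {
        // Flag 2 returns the attribute exactly as written in the source.
        string href = element.getAttribute ("href", 2) as string;
        if (href != null) {
            Uri addr = new Uri (relativeLocation, href);
            element.setAttribute ("href", addr.AbsoluteUri, 0);
        }
    }

    foreach (IHTMLImgElement image in doc.images) {
        IHTMLElement element = (IHTMLElement)image;
        string src = element.getAttribute ("src", 2) as string;
        if (src != null) {
            Uri addr = new Uri (relativeLocation, src);
            image.src = addr.AbsoluteUri;
        }
    }

    // Serialize the modified DOM back to HTML (body contents only).
    return doc.body.innerHTML;
}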

Feel free to use this function in your code.
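The actual relative-to-absolute step is done by the two-argument Uri constructor, which resolves a link against a base address. The sketch below isolates just that step so it can be tried without MSHTML; the class name UriResolutionSketch, the method name Resolve and the example.com addresses are placeholders chosen for illustration.

```csharp
using System;

public static class UriResolutionSketch
{
    // Resolves a link, exactly as written in the page, against the page's own address.
    // Throws UriFormatException if either string cannot be parsed as a URI.
    public static string Resolve(string pageUrl, string link)
    {
        Uri baseUri = new Uri(pageUrl);
        return new Uri(baseUri, link).AbsoluteUri;
    }
}
```

For a page at http://example.com/blog/2003/entry.html, the link images/pic.png resolves to http://example.com/blog/2003/images/pic.png, ../about.html resolves to http://example.com/blog/about.html, and /index.html resolves to http://example.com/index.html. A link that is already absolute comes back unchanged, which is why the function above does not need to check whether a link is relative before converting it.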

   12:26 AM

Content of this site is © Dejan Jelovic. All rights reserved.