Dec 09 2011

HTML Parsers and other libraries: don’t write your own till you know, and have tested, what already exists

Category: java, open source
Ulrich Palha @ 3:53 pm

It seemed simple…

I needed to extract some HTML content for one of the projects I was working on. Fetching a page and parsing out its title tag seemed easy enough, so I quickly threw together the following code.

	public String getTitleFromUrl(String url) {
		String html = getHtml(url);
		return getTitle(html);
	}

	private String getTitle(String html) {
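		// Non-greedy, case-insensitive, and allowed to span line breaks
		Pattern pattern = Pattern.compile("<title>(.+?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);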
		Matcher matcher = pattern.matcher(html);
		String title = "";
		if (matcher.find()) {
			title = matcher.group(1);
		}
		return title;
	}

	private String getHtml(String url) {
		InputStream in = null;
		String html = "";
		try {
			in = new URL(url).openStream();
			html = IOUtils.toString(in);
		} catch (IOException e) {
			// error handling elided in the original post
		} finally {
			IOUtils.closeQuietly(in);
		}
		return html;
	}

But was it worth the time?

Next, I needed to retrieve all of the links in the HTML that point to external sites. I soon realized that this would be more complicated than the title, because of relative URLs and potentially malformed HTML.
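To give a feel for the relative-URL part of the problem, here is a minimal sketch that uses only java.net.URI. It assumes the href values have already been pulled out of the page somehow, which is the hard part with malformed HTML. The class and method names are hypothetical and not from my project, and comparing hosts is only one rough way to define "external" (it treats www.example.com and example.com as different sites).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only.
class ExternalLinks {

	// Resolves each href against the page URL and keeps only the links
	// whose host differs from the host of the page itself.
	static List<String> externalLinks(String pageUrl, List<String> hrefs) {
		URI base = URI.create(pageUrl);
		List<String> external = new ArrayList<>();
		for (String href : hrefs) {
			URI resolved;
			try {
				// Handles "/about", "../x.html", "//host/path", "#top", etc.
				resolved = base.resolve(href.trim());
			} catch (IllegalArgumentException e) {
				continue; // skip hrefs that are not valid URIs
			}
			// Opaque URIs such as mailto: have no host and are skipped.
			String host = resolved.getHost();
			if (host != null && !host.equalsIgnoreCase(base.getHost())) {
				external.add(resolved.toString());
			}
		}
		return external;
	}
}
```

Even this small helper has to deal with root-relative paths, parent-directory paths, protocol-relative links, fragments, mailto: links and invalid hrefs, and it still says nothing about finding the anchors in the first place.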

Since it was not core to my project, I did not want to spend time writing an HTML parser and find myself shaving a yak, so I looked for an open source solution. There were plenty of Java HTML parsers to choose from. I settled on jsoup.

The code to retrieve the title tag was now much more concise.

	public String getTitleFromUrl(String url) {
		Document doc = null;
		try {
			doc = Jsoup.connect(url).get();
		} catch (IOException e) {
			// error handling elided in the original post
		}
		// doc is still null if the request failed, so guard against that
		return (doc != null) ? doc.title() : "";
	}

Know and use the libraries

This reminded me of Effective Java Item 47: Know and use the libraries.

"Generally speaking, library code is likely to be better than code that you’d write yourself and is likely to improve over time…. By using a standard library, you take advantage of the knowledge of the experts who wrote it and the experience of those who used it before you."

Joshua Bloch was referring to the Java standard libraries, but I expect that the same advice applies to jsoup, which has been around for almost two years and is actively maintained.

Conclusion

Before you start writing code that is not core to your application, find out what already exists and see if it will meet your needs. You should definitely not write a Java HTML parser without first looking at the many libraries that already do this: jsoup is a great place to start.
