All I wanted was the contents of the title tag, but I ended up trying a bit more.
Java: JDK 1.6
External library used: HTML Parser
Setting up HTML Parser
Download HTMLParser Version 1.6 (Release Build Jun 10, 2006) from the HTML Parser site.
Extract the archive and add the following four JAR files to the classpath.
htmlparser.jar
htmllexer.jar
thumbelina.jar
filterbuilder.jar
Sample code
Source code for a sample that extracts a page's title, links, images, comments, and so on.
import java.util.*;
import org.htmlparser.*;
import org.htmlparser.filters.*;
import org.htmlparser.http.*;
import org.htmlparser.util.*;
import org.htmlparser.tags.*;
import org.htmlparser.nodes.*;

public class WebPageInfo {

  public static void main(String[] args) throws Exception {
    String url = "http://www.nilab.info/";
    WebPageInfo w = new WebPageInfo(url);
    System.out.println("URL: " + url);
    System.out.println("Title: " + w.getTitle());
    String[] links = w.getLinkUrls();
    for (int i = 0; i < links.length; i++) {
      System.out.println("Link: " + links[i]);
    }
    String[] images = w.getImageUrls();
    for (int i = 0; i < images.length; i++) {
      System.out.println("Image: " + images[i]);
    }
    String[] texts = w.getTexts();
    for (int i = 0; i < texts.length; i++) {
      System.out.println("Text: " + texts[i]);
    }
    String[] comments = w.getComments();
    for (int i = 0; i < comments.length; i++) {
      System.out.println("Comment: " + comments[i]);
    }
    // String[] tags = w.getTags();
    // for (int i = 0; i < tags.length; i++) {
    //   System.out.println("Tag: " + tags[i]);
    // }
  }

  private String title;
  private List<String> text = new ArrayList<String>();
  private List<String> comment = new ArrayList<String>();
  private List<LinkTag> linktag = new ArrayList<LinkTag>();
  private List<ImageTag> imagetag = new ArrayList<ImageTag>();
  private List<TagNode> tag = new ArrayList<TagNode>();

  public WebPageInfo(String url) throws Exception {
    Parser.getConnectionManager().setRedirectionProcessingEnabled(true);
    Parser.getConnectionManager().setCookieProcessingEnabled(true);
    Parser parser = new Parser();
    // Set the URL to read
    parser.setResource(url);
    // Specify what to extract (a null filter retrieves all top-level nodes)
    NodeFilter filter = null;
    // AndFilter, OrFilter, RegexFilter, etc...
    // filter = new TagNameFilter("A");
    // Parse
    try {
      parse(parser, filter);
    } catch (EncodingChangeException ece) {
      // Retrying after Parser#reset reportedly makes the parser use the proper encoding (unverified)
      ece.printStackTrace();
      parser.reset();
      parse(parser, filter);
    }
  }

  private void parse(Parser parser, NodeFilter filter) throws ParserException {
    NodeList list = parser.parse(filter);
    NodeIterator i = list.elements();
    while (i.hasMoreNodes()) {
      analyze(i.nextNode());
    }
  }

  public String getTitle() {
    return title;
  }

  public String[] getTexts() {
    return text.toArray(new String[text.size()]);
  }

  public String[] getComments() {
    return comment.toArray(new String[comment.size()]);
  }

  public String[] getLinkUrls() {
    String[] urls = new String[linktag.size()];
    for (int i = 0; i < linktag.size(); i++) {
      urls[i] = linktag.get(i).getLink(); // absolute URL
    }
    return urls;
  }

  public String[] getImageUrls() {
    String[] urls = new String[imagetag.size()];
    for (int i = 0; i < imagetag.size(); i++) {
      urls[i] = imagetag.get(i).getImageURL(); // absolute URL
    }
    return urls;
  }

  public String[] getTags() {
    String[] tags = new String[tag.size()];
    for (int i = 0; i < tag.size(); i++) {
      tags[i] = tag.get(i).toHtml();
    }
    return tags;
  }

  // Sorts a node into the matching list, then visits its children.
  private void analyze(Node node) throws ParserException {
    if (node instanceof TextNode) { // Text
      text.add(((TextNode) node).getText());
    } else if (node instanceof RemarkNode) { // Remark
      comment.add(((RemarkNode) node).getText());
    } else if (node instanceof TagNode) { // Tag
      if (node instanceof TitleTag) {
        title = ((TitleTag) node).getTitle();
      } else if (node instanceof LinkTag) {
        linktag.add((LinkTag) node);
      } else if (node instanceof ImageTag) {
        imagetag.add((ImageTag) node);
      }
      tag.add((TagNode) node);
    }
    // Recurse into child nodes
    NodeList children = node.getChildren();
    if (children != null) {
      NodeIterator i = children.elements();
      while (i.hasMoreNodes()) {
        analyze(i.nextNode());
      }
    }
  }
}
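Output of the sample code
The contents of the <title> tag (TitleTag) are also processed as text (TextNode).
Whitespace and line breaks come through unchanged, which is why some of the Text: lines are blank.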
URL: http://www.nilab.info/
Title: NI-Lab.
Link: http://www.nilab.info/
Link: http://www.nilab.info/en/
Link: http://www.nilab.info/wii/
Link: http://www.nilab.info/m/
Link: http://www.nilab.info/iphone/
Link: http://www.nilab.info/#etc
Link:
Link: http://friendfeed.com/nilab
Link: http://nilab.timelog.jp/
Link: http://twitter.com/nilab
(omitted)
Image: http://www.nilab.info/zlashdot32x32.png
Image: http://www.nilab.info/zlashdot32x32.png
Image: http://www.nilab.info/cheapjap32x32.png
Image: http://www.nilab.info/music32x32.png
Image: http://www.nilab.info/d32x32.png
Image: http://www.nilab.info/d32x32.png
Image: http://www.nilab.info/image/mailaddress.png
Image: http://www.pagerank.net/pagerank.gif
Image: http://www.trackword.net/img/minilogoh.gif
Image: http://www.nilab.info/kohaku/kohaku/oxfordminidictionary.jpg
(omitted)
Text: NI-Lab.
Text:
(omitted)
Text: FriendFeed
Text:
Text: Timelog
Text:
Text: Twitter
Text:
Text: Amebaなう(アメーバなう)
Text:
Text: Wassr
Text:
Text: アバウトミー : @nifty
Text:
Text: NI-Lab. - はてな
Text: -
Text: 電子栞
Text: -
Text: RSS バキューム★
(omitted)
Comment: Hatena Account Auto-Discovery : since 2005-12-23
Comment: OpenID delegate by TypeKey : since 2005-12-23
Comment: NI-Lab. is N-9 Irritation Laboratory. since 2010-01-11
Comment: NI-Lab. is NInja Laboratory. since 2009-06-25
Comment: NI-Lab. is Nasi goreng Itadakimasu Laboratory. since 2009-04-26
Comment: NI-Lab. is Niigata Injection Laboratory. since 2008-11-11
Comment: NI-Lab. is Neanderthal Ibuprofen Laboratory. since 2008-10-12
Comment: NI-Lab. is NEET Iitaidakechaunkato Laboratory. since 2007-07-13
Comment: NI-Lab. is Nested Izzat Laboratory. since 2006-07-01
Comment: NI-Lab. is Needless Inqilab Laboratory. since 2005-10-16
Comment: NI-Lab. is Networked Intelligent Lifeform Assembled for Battle. (http://www.cyborgname.com/cyborger.cgi?acronym=NI-Lab.&robotchoice=governor2k3) since 2005-05-03
Comment: NI-Lab. is Negative Investigation Laboratory.
Comment: NI-Lab. is Nyanko Itaraiina Laboratory.
Comment: NI-Lab. is Next Iterative Labratory.
Comment: SiteSearch Google : since 2007-03-04
Comment: SiteSearch Google
Comment: SiteSearch Yahoo : since 2007-10-08
Comment: SiteSearch Yahoo
Comment: copyright
Comment: free link
Comment: etc
Comment: Search Google
Comment: Search Yahoo! JAPAN
Comment: Amazone.co.jp Search
Comment: goo dictionary
Comment: 便利なリンク
Comment: Google PageRank 2006-11-11 追加
Comment: added since 2005-07-15
Comment: added track word since 2006-01-19
Comment: added Google Analytics 2006-04-13
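The output contains blank Text: lines, an empty Link: line, and repeated Image: URLs. If that gets in the way, the arrays returned by getTexts(), getLinkUrls(), and getImageUrls() could be tidied up afterwards. Below is a minimal sketch using only the JDK. ResultCleaner and its method names are hypothetical helpers made up for this illustration; they are not part of HTML Parser or of the sample above.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical helper for post-processing the extracted strings.
final class ResultCleaner {

    // Trims each entry (String#trim, so ASCII whitespace and control
    // characters only) and drops entries that are null or empty afterwards.
    static List<String> dropBlank(String[] values) {
        List<String> result = new ArrayList<String>();
        for (String value : values) {
            if (value == null) {
                continue;
            }
            String trimmed = value.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    // Drops blank entries, then removes duplicates while keeping
    // the order in which each value first appeared.
    static List<String> distinctNonBlank(String[] values) {
        return new ArrayList<String>(new LinkedHashSet<String>(dropBlank(values)));
    }

    public static void main(String[] args) {
        String[] texts = { "NI-Lab.", "\n  ", "FriendFeed", "" };
        String[] images = {
            "http://www.nilab.info/d32x32.png",
            "http://www.nilab.info/d32x32.png"
        };
        System.out.println(dropBlank(texts));
        System.out.println(distinctNonBlank(images));
    }
}
```

With the sample above, this would be called as, for example, ResultCleaner.dropBlank(w.getTexts()). Note that full-width spaces are not removed by String#trim, so Japanese pages may need an extra check.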
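References
-HTML Parser ← the official site of the library used here
-Information Retrieval / Web Search Systems Exercise - Tokyo Denki University - HTML Parser ← the samples on this page were extremely helpful.
-HTML parser library for Java "HTMLParser 1.5" released | Enterprise | Mycom Journal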
Posted by NI-Lab. (@nilab)