All I wanted was the information in the title tag, but I ended up trying a bit more.

Java: JDK 1.6
External library used: HTML Parser

Setting up HTML Parser

Download HTMLParser Version 1.6 (Release Build Jun 10, 2006) from the HTML Parser site.

Extract the archive and add the following four JAR files to the classpath.

htmlparser.jar
htmllexer.jar
thumbelina.jar
filterbuilder.jar
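
Before running the sample, it can help to confirm that the JARs really are visible to the JVM. The following is only a minimal sketch: ClasspathCheck and isAvailable are hypothetical names made up for this check, and org.htmlparser.Parser is used as the probe simply because the sample code below depends on it.

```java
// Hypothetical helper for checking the classpath; not part of HTML Parser.
class ClasspathCheck {

  // Returns true if the named class can be found on the current classpath.
  static boolean isAvailable(String className) {
    try {
      // initialize=false: only look the class up, do not run its static initializers.
      Class.forName(className, false, ClasspathCheck.class.getClassLoader());
      return true;
    } catch (ClassNotFoundException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // Assumption: this is the class the sample code instantiates.
    String name = "org.htmlparser.Parser";
    if (isAvailable(name)) {
      System.out.println(name + " found");
    } else {
      System.out.println(name + " NOT found; check the classpath");
    }
  }
}
```

Run it with the same classpath setting you plan to use for the sample. If the class is reported as not found, the JAR paths are probably wrong.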

Sample code

Sample source code that extracts the page title, a list of links, a list of images, a list of comments, and so on.


import java.util.*;
import org.htmlparser.*;
import org.htmlparser.filters.*;
import org.htmlparser.http.*;
import org.htmlparser.util.*;
import org.htmlparser.tags.*;
import org.htmlparser.nodes.*;
 
public class WebPageInfo {
 
  public static void main(String[] args) throws Exception {
 
    String url = "http://www.nilab.info/";
    WebPageInfo w = new WebPageInfo(url);
 
    System.out.println("URL: " + url);
    System.out.println("Title: " + w.getTitle());
 
    String[] links = w.getLinkUrls();
    for (int i = 0; i < links.length; i++) {
      System.out.println("Link: " + links[i]);
    }
 
    String[] images = w.getImageUrls();
    for (int i = 0; i < images.length; i++) {
      System.out.println("Image: " + images[i]);
    }
 
    String[] texts = w.getTexts();
    for (int i = 0; i < texts.length; i++) {
      System.out.println("Text: " + texts[i]);
    }
 
    String[] comments = w.getComments();
    for (int i = 0; i < comments.length; i++) {
      System.out.println("Comment: " + comments[i]);
    }
    
    // String[] tags = w.getTags();
    // for (int i = 0; i < tags.length; i++) {
    //   System.out.println("Tag: " + tags[i]);
    // }
  }
 
  private String title;
  private List<String> text = new ArrayList<String>();
  private List<String> comment = new ArrayList<String>();
  private List<LinkTag> linktag = new ArrayList<LinkTag>();
  private List<ImageTag> imagetag = new ArrayList<ImageTag>();
  private List<TagNode> tag = new ArrayList<TagNode>();
 
  public WebPageInfo(String url) throws Exception {
 
    Parser.getConnectionManager().setRedirectionProcessingEnabled(true);
    Parser.getConnectionManager().setCookieProcessingEnabled(true);
 
    Parser parser = new Parser();
 
    // Specify the URL
    parser.setResource(url);
 
    // Specify what to extract
    NodeFilter filter = null;
    // AndFilter, OrFilter, RegexFilter, etc...
    // filter = new TagNameFilter("A");
 
    // Parse
    try {
      parse(parser, filter);
    } catch (EncodingChangeException ece) {
      // Reportedly, retrying after Parser#reset makes it process with the proper encoding (unverified)
      ece.printStackTrace();
      parser.reset();
      parse(parser, filter);
    }
  }
 
  private void parse(Parser parser, NodeFilter filter) throws ParserException {
    NodeList list = parser.parse(filter);
    NodeIterator i = list.elements();
    while (i.hasMoreNodes()) {
      analyze(i.nextNode());
    }
  }
 
  public String getTitle() {
    return title;
  }
 
  public String[] getTexts() {
    return text.toArray(new String[text.size()]);
  }
 
  public String[] getComments() {
    return comment.toArray(new String[comment.size()]);
  }
 
  public String[] getLinkUrls() {
    String[] urls = new String[linktag.size()];
    for (int i = 0; i < linktag.size(); i++) {
      urls[i] = linktag.get(i).getLink(); // absolute URL
    }
    return urls;
  }
 
  public String[] getImageUrls() {
    String[] urls = new String[imagetag.size()];
    for (int i = 0; i < imagetag.size(); i++) {
      urls[i] = imagetag.get(i).getImageURL(); // absolute URL
    }
    return urls;
  }
 
  public String[] getTags() {
    String[] tags = new String[tag.size()];
    for (int i = 0; i < tag.size(); i++) {
      tags[i] = tag.get(i).toHtml();
    }
    return tags;
  }
 
  private void analyze(Node node) throws ParserException {
 
    if (node instanceof TextNode) { // Text
      text.add(((TextNode) node).getText());
 
    } else if (node instanceof RemarkNode) { // Remark
      comment.add(((RemarkNode) node).getText());
 
    } else if (node instanceof TagNode) { // Tag
 
      if (node instanceof TitleTag) {
        title = ((TitleTag) node).getTitle();
 
      } else if (node instanceof LinkTag) {
        linktag.add((LinkTag) node);
 
      } else if (node instanceof ImageTag) {
        imagetag.add((ImageTag) node);
 
      }
 
      tag.add((TagNode) node);
    }
 
    // Recurse into child nodes
    NodeList children = node.getChildren();
    if (children != null) {
      NodeIterator i = children.elements();
      while (i.hasMoreNodes())
        analyze(i.nextNode());
    }
  }
}

Output of the sample code

The contents of the <title> tag (TitleTag) are also processed as text (TextNode).
Whitespace and line breaks are kept as-is.
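
If the whitespace-only entries get in the way, the array returned by getTexts() could be tidied up afterwards. This is a rough sketch only: TextCleaner and clean are hypothetical names, not part of HTML Parser, and the only entity it handles is &nbsp; (an assumption made for brevity; all other entities are left untouched).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of HTML Parser.
class TextCleaner {

  // Returns the trimmed, non-blank entries of texts, in their original order.
  static String[] clean(String[] texts) {
    List<String> result = new ArrayList<String>();
    for (String t : texts) {
      if (t == null) {
        continue;
      }
      // Assumption: only &nbsp; is replaced; other entities stay as they are.
      String s = t.replace("&nbsp;", " ").trim();
      if (s.length() > 0) {
        result.add(s);
      }
    }
    return result.toArray(new String[result.size()]);
  }

  public static void main(String[] args) {
    // Sample input modeled on the kind of text nodes the sample prints.
    String[] raw = { "NI-Lab.", "\n\t&nbsp;&nbsp;&nbsp;", "FriendFeed", "\n\t", " - " };
    for (String s : clean(raw)) {
      System.out.println("Text: " + s);
    }
  }
}
```

With the sample class above, this would be called as TextCleaner.clean(w.getTexts()). Note that trimming also shortens separators such as " - " to "-", which may or may not be what you want.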


URL: http://www.nilab.info/
Title: NI-Lab.
Link: http://www.nilab.info/
Link: http://www.nilab.info/en/
Link: http://www.nilab.info/wii/
Link: http://www.nilab.info/m/
Link: http://www.nilab.info/iphone/
Link: http://www.nilab.info/#etc
Link: 
Link: http://friendfeed.com/nilab
Link: http://nilab.timelog.jp/
Link: http://twitter.com/nilab
(omitted)
Image: http://www.nilab.info/zlashdot32x32.png
Image: http://www.nilab.info/zlashdot32x32.png
Image: http://www.nilab.info/cheapjap32x32.png
Image: http://www.nilab.info/music32x32.png
Image: http://www.nilab.info/d32x32.png
Image: http://www.nilab.info/d32x32.png
Image: http://www.nilab.info/image/mailaddress.png
Image: http://www.pagerank.net/pagerank.gif
Image: http://www.trackword.net/img/minilogoh.gif
Image: http://www.nilab.info/kohaku/kohaku/oxfordminidictionary.jpg
(omitted)
Text: NI-Lab.
Text: 
	&nbsp;&nbsp;&nbsp;
(omitted)
Text: FriendFeed
Text: 
	
Text: Timelog
Text: 
	
Text: Twitter
Text: 
	
Text: Amebaなう(アメーバなう)
Text: 
	
Text: Wassr
Text: 
	
Text: アバウトミー : @nifty
Text: 
	
Text: NI-Lab. - はてな
Text:  - 
Text: 電子栞
Text:  - 
Text: RSS バキューム★
(omitted)
Comment:  Hatena Account Auto-Discovery : since 2005-12-23 
Comment:  OpenID delegate by TypeKey : since 2005-12-23 
Comment:  NI-Lab. is N-9 Irritation Laboratory. since 2010-01-11 
Comment:  NI-Lab. is NInja Laboratory. since 2009-06-25 
Comment:  NI-Lab. is Nasi goreng Itadakimasu Laboratory. since 2009-04-26 
Comment:  NI-Lab. is Niigata Injection Laboratory. since 2008-11-11 
Comment:  NI-Lab. is Neanderthal Ibuprofen Laboratory. since 2008-10-12 
Comment:  NI-Lab. is NEET Iitaidakechaunkato Laboratory. since 2007-07-13 
Comment:  NI-Lab. is Nested Izzat Laboratory. since 2006-07-01 
Comment:  NI-Lab. is Needless Inqilab Laboratory. since 2005-10-16 
Comment:  NI-Lab. is Networked Intelligent Lifeform Assembled for Battle. (http://www.cyborgname.com/cyborger.cgi?acronym=NI-Lab.&robotchoice=governor2k3) since 2005-05-03 
Comment:  NI-Lab. is Negative Investigation Laboratory. 
Comment:  NI-Lab. is Nyanko Itaraiina Laboratory. 
Comment:  NI-Lab. is Next Iterative Labratory. 
Comment:  SiteSearch Google : since 2007-03-04 
Comment:  SiteSearch Google 
Comment:  SiteSearch Yahoo : since 2007-10-08 
Comment:  SiteSearch Yahoo 
Comment:  copyright 
Comment:  free link 
Comment:  etc 
Comment:  Search Google 
Comment:  Search Yahoo! JAPAN 
Comment:  Amazone.co.jp Search 
Comment:  goo dictionary 
Comment:  便利なリンク 
Comment:  Google PageRank 2006-11-11 追加 
Comment:  added since 2005-07-15 
Comment:  added track word since 2006-01-19 
Comment:  added Google Analytics 2006-04-13 
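
The output above contains an empty "Link:" entry and the same image URL more than once. If a unique list is wanted, the URL arrays could be post-processed along these lines. This is a sketch under assumptions: UrlTidy and distinctNonEmpty are hypothetical names, and URLs are compared as plain strings without any normalization.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper; not part of HTML Parser.
class UrlTidy {

  // Drops null/blank entries and duplicates, keeping first-seen order.
  static String[] distinctNonEmpty(String[] urls) {
    // LinkedHashSet removes duplicates while preserving insertion order.
    Set<String> seen = new LinkedHashSet<String>();
    for (String u : urls) {
      if (u != null && u.trim().length() > 0) {
        seen.add(u.trim());
      }
    }
    return seen.toArray(new String[seen.size()]);
  }

  public static void main(String[] args) {
    String[] images = {
      "http://www.nilab.info/zlashdot32x32.png",
      "http://www.nilab.info/zlashdot32x32.png",
      "",
      "http://www.nilab.info/music32x32.png"
    };
    for (String s : distinctNonEmpty(images)) {
      System.out.println("Image: " + s);
    }
  }
}
```

With the sample class above, this would be called as UrlTidy.distinctNonEmpty(w.getImageUrls()) or UrlTidy.distinctNonEmpty(w.getLinkUrls()).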

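References

-HTML Parser ← the official site of the library used here
-Information Retrieval / Web Search System Exercises - Tokyo Denki University - HTML Parser ← the samples on this page were very helpful.
-HTML parser library for Java "HTMLParser 1.5" released | Enterprise | Mycom Journal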

tags: zlashdot Java Html Java Parser

Posted by NI-Lab. (@nilab)