[ヅ] Regular expressions for extracting URLs from text in Java (2010-01-13)

I tried writing code that extracts multiple URLs from a string using Java's standard regular expression library.
Environment: JDK 1.6 + Windows XP.

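Source code

In most cases STANDARD_URL_MATCH_PATTERN (the standard URL extraction pattern) should be enough. If a URL's parameters contain raw (not percent-encoded) Japanese text, use GREEDY_URL_MATCH_PATTERN (the greedy URL extraction pattern) instead, because the standard pattern stops at the first non-ASCII character.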


import java.util.*;
import java.util.regex.*;
 
public class UrlExtractor {
 
  public static void main(String[] args) {
    String text =
      "abcde " +
      "<br> </br> & ' \"" +
      "http://www.nilab.info/ " +
      "http://WWW.NILAB.INFO/ " +
      "HTTP://WWW.NILAB.INFO/ " +
      "http://www.nilab.info/index.html " +
      "http://www.nilab.info/in dex.html " +
      "xyzhttp://www.nilab.info/hoge " +
      "https://www.nilab.info/ " +
      "xyzhttps://www.nilab.info/abc " +
      "http://www.nilab.info/ http://nilab.info/index.html こんにちは " +
      "http://www.nilab.info/redirect.cgi?http://nilab.info/index.html " +
      "http://www.nilab.info/wiki?こんにちは " +
      "<a href=http://www.nilab.info/>ホームページ</a> " +
      "<a href=\"http://www.nilab.info/\">ホームページ</a> " +
      "hello http://localhost/test.cgi?%E3%81%82%E3%81%84%E3%81%86%E3%81%88%E3%81%8A good-bye " +
      "";
    
    System.out.println("STANDARD --------------------------------------------------");
    print(extract(STANDARD_URL_MATCH_PATTERN, text));
    
    System.out.println("GREEDY --------------------------------------------------");
    print(extract(GREEDY_URL_MATCH_PATTERN, text));
  }
  
  // Standard URL extraction pattern: http:// or https:// followed by ASCII URL characters only
  private static final Pattern STANDARD_URL_MATCH_PATTERN = Pattern.compile("(http://|https://){1}[\\w\\.\\-/:\\#\\?\\=\\&\\;\\%\\~\\+]+", Pattern.CASE_INSENSITIVE);
 
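  // Greedy URL extraction pattern: accepts everything except control characters, space,
  // ( ) " < > and the range 0x7F-0xFF, so raw Japanese characters are matched too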
  private static final Pattern GREEDY_URL_MATCH_PATTERN = Pattern.compile("(http|https):([^\\x00-\\x20()\"<>\\x7F-\\xFF])*", Pattern.CASE_INSENSITIVE);
  
  public static String[] extract(Pattern pattern, String text){
    List<String> list = new ArrayList<String>();
    Matcher matcher = pattern.matcher(text);
    while(matcher.find()){
      list.add(matcher.group());
    }
    return list.toArray(new String[list.size()]);
  }
  
  private static void print(String[] s){
    for(int i=0; i<s.length; i++){
      System.out.println(s[i]);
    }
  }
 
}

Output


STANDARD --------------------------------------------------
http://www.nilab.info/
http://WWW.NILAB.INFO/
HTTP://WWW.NILAB.INFO/
http://www.nilab.info/index.html
http://www.nilab.info/in
http://www.nilab.info/hoge
https://www.nilab.info/
https://www.nilab.info/abc
http://www.nilab.info/
http://nilab.info/index.html
http://www.nilab.info/redirect.cgi?http://nilab.info/index.html
http://www.nilab.info/wiki?
http://www.nilab.info/
http://www.nilab.info/
http://localhost/test.cgi?%E3%81%82%E3%81%84%E3%81%86%E3%81%88%E3%81%8A
GREEDY --------------------------------------------------
http://www.nilab.info/
http://WWW.NILAB.INFO/
HTTP://WWW.NILAB.INFO/
http://www.nilab.info/index.html
http://www.nilab.info/in
http://www.nilab.info/hoge
https://www.nilab.info/
https://www.nilab.info/abc
http://www.nilab.info/
http://nilab.info/index.html
http://www.nilab.info/redirect.cgi?http://nilab.info/index.html
http://www.nilab.info/wiki?こんにちは
http://www.nilab.info/
http://www.nilab.info/
http://localhost/test.cgi?%E3%81%82%E3%81%84%E3%81%86%E3%81%88%E3%81%8A
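
The two result lists above differ only in the wiki?こんにちは line. The following is a minimal sketch that isolates that difference and also shows the trade-off of the greedy pattern: Japanese text written directly after a URL, without a space, is swallowed as well. The class name PatternDifference, the short field names STANDARD and GREEDY, and the sample strings are made up for this illustration; the two regular expressions are copied unchanged from the listing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternDifference {

  // Same expressions as STANDARD_URL_MATCH_PATTERN and GREEDY_URL_MATCH_PATTERN.
  static final Pattern STANDARD = Pattern.compile(
      "(http://|https://){1}[\\w\\.\\-/:\\#\\?\\=\\&\\;\\%\\~\\+]+", Pattern.CASE_INSENSITIVE);
  static final Pattern GREEDY = Pattern.compile(
      "(http|https):([^\\x00-\\x20()\"<>\\x7F-\\xFF])*", Pattern.CASE_INSENSITIVE);

  // Collects every match of the pattern in the text, in order of appearance.
  static List<String> extract(Pattern pattern, String text) {
    List<String> found = new ArrayList<String>();
    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
      found.add(matcher.group());
    }
    return found;
  }

  public static void main(String[] args) {
    // Raw Japanese in the query string: only the greedy pattern keeps it.
    String query = "see http://www.nilab.info/wiki?こんにちは for details";
    System.out.println(extract(STANDARD, query)); // [http://www.nilab.info/wiki?]
    System.out.println(extract(GREEDY, query));   // [http://www.nilab.info/wiki?こんにちは]

    // Japanese sentence text right after the URL: the greedy pattern takes it too.
    String sentence = "サイトはhttp://www.nilab.info/です。";
    System.out.println(extract(STANDARD, sentence)); // [http://www.nilab.info/]
    System.out.println(extract(GREEDY, sentence));   // [http://www.nilab.info/です。]
  }
}
```

So the choice depends on the input: the standard pattern is the safer one for Japanese prose with embedded URLs, and the greedy pattern fits text where URLs are delimited by spaces, quotes or angle brackets.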

References

-java.util.regex (Java Platform SE 6)
-634 - Regular Expression Editor - URL extraction (regular expressions)
-Java: converting URLs in a string into links / Chat&Messenger
-Tried extracting URLs from email - kikou's diary

# If anyone knows a nicer regular expression, I'd love to have it...
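
One possible direction, offered only as a rough sketch and not as a definitive answer: GREEDY_URL_MATCH_PATTERN as written also matches a bare "http:" (the "//" is not required and the trailing group may repeat zero times), and both patterns keep a period or comma that ends the surrounding sentence. The variant below requires "//" plus at least one following character, and trims trailing punctuation in a separate step. The names StricterUrlExtractor, STRICTER_GREEDY and trimTrailingPunctuation are hypothetical, and the set of trimmed characters is an assumption that may need tuning for real input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StricterUrlExtractor {

  // Hypothetical variant of the greedy pattern (not from the original listing):
  // "//" is mandatory and at least one character must follow it.
  static final Pattern STRICTER_GREEDY = Pattern.compile(
      "https?://[^\\x00-\\x20()\"<>\\x7F-\\xFF]+", Pattern.CASE_INSENSITIVE);

  // Heuristic: drop trailing characters that usually belong to the sentence, not the URL.
  static String trimTrailingPunctuation(String url) {
    int end = url.length();
    while (end > 0 && ".,;:!".indexOf(url.charAt(end - 1)) >= 0) {
      end--;
    }
    return url.substring(0, end);
  }

  // Finds all candidates with the stricter pattern and trims each one.
  static List<String> extract(String text) {
    List<String> urls = new ArrayList<String>();
    Matcher matcher = STRICTER_GREEDY.matcher(text);
    while (matcher.find()) {
      urls.add(trimTrailingPunctuation(matcher.group()));
    }
    return urls;
  }

  public static void main(String[] args) {
    String text = "http: is a scheme, see http://www.nilab.info/wiki?こんにちは.";
    System.out.println(extract(text)); // [http://www.nilab.info/wiki?こんにちは]
  }
}
```

Trimming is a heuristic: a URL that really ends in one of those characters loses it, so whether this step helps depends on the text being processed.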

tags: zlashdot Java Java RegularExpression

Posted by NI-Lab. (@nilab)

ヅラッシュ！ by NI-Lab.
