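I extract URLs from tweet text using the regular expression from [ヅ] JavaでテキストからURLを抽出する正規表現 (a regular expression for extracting URLs from text in Java), then expand any shortened URLs among them.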

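There are reportedly more than 90 URL shortening services, so calling each service's own API would get nowhere. The code below has been tested with bit.ly, t.co, goo.gl and others, and I believe it works with any URL shortening service.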

Javaでbit.ly等の短縮URLを展開する方法 « 来栖川電算

That article provides excellent code, so I use it here with a few modifications.

Source code:


import java.io.*;
import java.net.*;
import java.util.*;
import java.util.regex.*;
 
public class UrlExpander {
 
  public static void main(String[] args) throws Exception {
    String[] samples = {
      "テスト的な何か。 http://t.co/CFzUpytJ",
      "テスト的な何か。 http://t.co/AebrSPWJ / ヅラッシュ! http://t.co/4E452uqL",
    };
    for(String text : samples){
      Map<String, String> urlPairs = getExpandedUrls(text);
      String replaced = replaceUrls(text, urlPairs);
      System.out.println("source  : " + text);
      System.out.println("replaced: " + replaced);
    }
  }
 
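  // Extract URLs from the text and expand any shortened ones.
  // Returns a map from each URL found in the text to its expanded form.
  public static Map<String, String> getExpandedUrls(String text) throws IOException, ProtocolException {
    // Extract the URLs
    String[] urls = UrlExtractor.extract(UrlExtractor.STANDARD_URL_MATCH_PATTERN, text);
    // Expand the shortened URLs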
    Map<String, String> result = new HashMap<String, String>();
    for(String url : urls){
      String expandedUrl = UrlUtility.expandUrl(new URL(url)).toExternalForm();
      result.put(url, expandedUrl);
    }
    return result;
  }
 
  // Replace each shortened URL in the text with its expanded URL.
  // String#replace matches literally, so regex metacharacters in URLs ('.', '?', '+') are safe.
  public static String replaceUrls(String text, Map<String, String> urlPairs){
    for(Map.Entry<String, String> entry : urlPairs.entrySet()) {
      text = text.replace(entry.getKey(), entry.getValue());
    }
    return text;
  }
 
  // Ref. - [ヅ] JavaでテキストからURLを抽出する正規表現
  //        http://www.nilab.info/z3/20100113_zlashdot_001094.html
  private static class UrlExtractor {

    // Standard URL extraction pattern
    private static final Pattern STANDARD_URL_MATCH_PATTERN = Pattern.compile(
      "(http://|https://){1}[\\w\\.\\-/:\\#\\?\\=\\&\\;\\%\\~\\+]+",
      Pattern.CASE_INSENSITIVE);
  
    // Greedy URL extraction pattern
    private static final Pattern GREEDY_URL_MATCH_PATTERN = Pattern.compile(
      "(http|https):([^\\x00-\\x20()\"<>\\x7F-\\xFF])*",
      Pattern.CASE_INSENSITIVE);
 
    public static String[] extract(Pattern pattern, String text) {
      List<String> list = new ArrayList<String>();
      Matcher matcher = pattern.matcher(text);
      while (matcher.find()) {
        list.add(matcher.group());
      }
      return list.toArray(new String[list.size()]);
    }
  }
 
  // Ref. - Javaでbit.ly等の短縮URLを展開する方法 << 来栖川電算
  //        http://kurusugawa.jp/2011/05/27/fast-universal-java-url-expander/
  private static class UrlUtility {
 
    public static URL expandUrl(URL aUrl) throws IOException, ProtocolException {
      final URLConnection tURLConnection = aUrl.openConnection(Proxy.NO_PROXY);
      if (!(tURLConnection instanceof HttpURLConnection)) {
        return aUrl;
      }
      final HttpURLConnection tHttpURLConnection = (HttpURLConnection) tURLConnection;
      tHttpURLConnection.setRequestMethod("HEAD");
      tHttpURLConnection.setInstanceFollowRedirects(false);
      tHttpURLConnection.connect();
 
      final String tExpandedUrl;
      final String tLocation = tHttpURLConnection.getHeaderField("Location");
      if (tLocation != null && tLocation.startsWith("http")) {
        final int tResponseCode = tHttpURLConnection.getResponseCode();
        if (tResponseCode == HttpURLConnection.HTTP_MOVED_PERM || tResponseCode == HttpURLConnection.HTTP_MOVED_TEMP) {
          tExpandedUrl = expandUrl(new URL(encode(tLocation))).toExternalForm();
        } else {
          tExpandedUrl = tLocation;
        }
      } else {
        tExpandedUrl = tHttpURLConnection.getURL().toExternalForm();
      }
 
      return new URL(encode(tExpandedUrl));
    }
 
    // @formatter:off
    private static final String[] HEX = {
      "80","81","82","83","84","85","86","87","88","89","8A","8B","8C","8D","8E","8F",
      "90","91","92","93","94","95","96","97","98","99","9A","9B","9C","9D","9E","9F",
      "A0","A1","A2","A3","A4","A5","A6","A7","A8","A9","AA","AB","AC","AD","AE","AF",
      "B0","B1","B2","B3","B4","B5","B6","B7","B8","B9","BA","BB","BC","BD","BE","BF",
      "C0","C1","C2","C3","C4","C5","C6","C7","C8","C9","CA","CB","CC","CD","CE","CF",
      "D0","D1","D2","D3","D4","D5","D6","D7","D8","D9","DA","DB","DC","DD","DE","DF",
      "E0","E1","E2","E3","E4","E5","E6","E7","E8","E9","EA","EB","EC","ED","EE","EF",
      "F0","F1","F2","F3","F4","F5","F6","F7","F8","F9","FA","FB","FC","FD","FE","FF",
      "00","01","02","03","04","05","06","07","08","09","0A","0B","0C","0D","0E","0F",
      "10","11","12","13","14","15","16","17","18","19","1A","1B","1C","1D","1E","1F",
      "20","21","22","23","24","25","26","27","28","29","2A","2B","2C","2D","2E","2F",
      "30","31","32","33","34","35","36","37","38","39","3A","3B","3C","3D","3E","3F",
      "40","41","42","43","44","45","46","47","48","49","4A","4B","4C","4D","4E","4F",
      "50","51","52","53","54","55","56","57","58","59","5A","5B","5C","5D","5E","5F",
      "60","61","62","63","64","65","66","67","68","69","6A","6B","6C","6D","6E","6F",
      "70","71","72","73","74","75","76","77","78","79","7A","7B","7C","7D","7E","7F",
    };
    // @formatter:on
    
    private static String encode(String aUrl) throws UnsupportedEncodingException {
      final byte[] tBytes = aUrl.getBytes("ISO-8859-1");
      final int tLength = tBytes.length;
      final StringBuilder tBuilder = new StringBuilder(tLength * 3);
      for (int tIndex = 0; tIndex < tLength; tIndex++) {
        final int tIntAt = (int) tBytes[tIndex];
        if (tIntAt < 0) {
          tBuilder.append('%');
          tBuilder.append(HEX[tIntAt + 128]);
        } else {
          tBuilder.append((char) tIntAt);
        }
      }
      return tBuilder.toString();
    }
  }
}

Output:


source  : テスト的な何か。 http://t.co/CFzUpytJ
replaced: テスト的な何か。 http://www.nilab.info/z3/
source  : テスト的な何か。 http://t.co/AebrSPWJ / ヅラッシュ! http://t.co/4E452uqL
replaced: テスト的な何か。 http://www.nilab.info/ / ヅラッシュ! http://www.nilab.info/z3/

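The shortened URLs are expanded to their original URLs.

Of these shortened URLs, only http://t.co/4E452uqL is a multi-level one: it is a shortened URL of http://htn.to/ghMHEQ, which is itself a shortened URL.

The actual shortened URLs and what they expand to: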
- http://t.co/CFzUpytJ → http://www.nilab.info/z3/
- http://t.co/AebrSPWJ → http://www.nilab.info/
- http://t.co/4E452uqL → http://htn.to/ghMHEQ → http://www.nilab.info/z3/
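
The multi-level case can be illustrated offline. The following is a minimal sketch, not the code used in this post: it follows a chain of redirects with a loop instead of recursion, and it stops after a fixed number of hops. The class name `RedirectChain`, the `nextHop` lookup function, and the limit of 10 hops are all assumptions made for illustration; the redirect table simply mirrors the chain listed above, so no network access is involved.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class RedirectChain {

  // Assumed upper bound on hops, so a redirect loop cannot run forever.
  static final int MAX_HOPS = 10;

  // Follows redirects starting at 'start'. 'nextHop' returns the redirect
  // target of a URL, or null when the URL does not redirect.
  public static String resolve(String start, Function<String, String> nextHop) {
    String current = start;
    for (int hop = 0; hop < MAX_HOPS; hop++) {
      String next = nextHop.apply(current);
      if (next == null || next.equals(current)) {
        return current; // no further redirect: this is the final URL
      }
      current = next;
    }
    return current; // hop limit reached: return the last URL seen
  }

  public static void main(String[] args) {
    // In-memory stand-in for the HTTP lookups, mirroring the chain above.
    Map<String, String> redirects = new HashMap<>();
    redirects.put("http://t.co/4E452uqL", "http://htn.to/ghMHEQ");
    redirects.put("http://htn.to/ghMHEQ", "http://www.nilab.info/z3/");

    // prints "http://www.nilab.info/z3/"
    System.out.println(resolve("http://t.co/4E452uqL", redirects::get));
  }
}
```

In a real expander the lookup function would issue an HTTP request and read the Location header; keeping it as a parameter makes the chain-following logic easy to test on its own.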

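Reference information follows.

The following points had to be taken into account.

・The expanded URL may contain multi-byte characters
 →Handled by URL-encoding it
・When the host serving the expanded URL is IIS, the path is omitted from the result of HttpURLConnection#getURL()
 →Handled by calling tHttpURLConnection.setInstanceFollowRedirects(false) and reading the Location header without accessing IIS
・It is slow
 →Handled by specifying URL#openConnection(Proxy.NO_PROXY)
 →Handled by using the HEAD method so that the body is skipped
・It is hard to tell which URLs do not need expanding
 →Handled by attempting the expansion anyway and using either the Location header or HttpURLConnection#getURL()
・Redirects may happen in multiple stages
 →Handled by calling expandUrl recursively while the HTTP status code is 301 or 302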

Javaでbit.ly等の短縮URLを展開する方法 « 来栖川電算
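
As an illustration of the first point, here is a minimal sketch of percent-encoding only the non-ASCII characters of a URL. It is a variant, not the encode() method from the listing: that method re-reads the string as ISO-8859-1 bytes, whereas this sketch assumes the string already holds correctly decoded Unicode characters and encodes them as UTF-8. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class NonAsciiUrlEncoder {

  // Percent-encodes every non-ASCII code point as its UTF-8 bytes.
  // ASCII characters, including existing %XX escapes, are left untouched.
  public static String encodeNonAscii(String url) {
    StringBuilder sb = new StringBuilder(url.length() * 3);
    for (int i = 0; i < url.length(); ) {
      int cp = url.codePointAt(i);
      if (cp < 0x80) {
        sb.append((char) cp);
      } else {
        byte[] bytes = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
        for (byte b : bytes) {
          sb.append('%').append(String.format("%02X", b & 0xFF));
        }
      }
      i += Character.charCount(cp); // advance past surrogate pairs correctly
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // example.com is a placeholder host.
    // prints "http://example.com/%E6%97%A5%E6%9C%AC"
    System.out.println(encodeNonAscii("http://example.com/日本"));
  }
}
```

Which byte encoding is right depends on how the Location header value was decoded into a String in the first place, so treat the choice of UTF-8 here as an assumption to verify against the servers involved.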

Also, according to the comments on that article, accessing Amazon.co.jp or Amazon S3 URLs with the HEAD method returns 403 Forbidden. In that case, removing the line


tHttpURLConnection.setRequestMethod("HEAD");

reportedly makes it work.

However, unlike HEAD, GET tries to fetch the entire content, so it should be slower and use more bandwidth.
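
One possible middle ground is to keep HEAD as the default and fall back to GET only when the server answers 403. The following is a rough sketch of that idea, not something taken from the referenced article: the class `HeadThenGet` and both method names are hypothetical, it assumes the URL is http or https, and it has not been checked against Amazon's servers.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.Proxy;
import java.net.URL;

public class HeadThenGet {

  // Assumption: only 403 Forbidden triggers the GET retry.
  public static boolean shouldRetryWithGet(int status) {
    return status == HttpURLConnection.HTTP_FORBIDDEN;
  }

  // Returns the Location header of the response, or null if there is none.
  // Tries HEAD first and repeats the request with GET if HEAD is refused.
  public static String locationOrNull(URL url) throws IOException {
    HttpURLConnection conn = open(url, "HEAD");
    try {
      if (!shouldRetryWithGet(conn.getResponseCode())) {
        return conn.getHeaderField("Location");
      }
    } finally {
      conn.disconnect();
    }
    conn = open(url, "GET");
    try {
      conn.getResponseCode(); // sends the request and reads the response headers
      return conn.getHeaderField("Location");
    } finally {
      conn.disconnect(); // the response body is never read here
    }
  }

  // Assumes an http/https URL; other schemes would fail the cast.
  private static HttpURLConnection open(URL url, String method) throws IOException {
    HttpURLConnection conn = (HttpURLConnection) url.openConnection(Proxy.NO_PROXY);
    conn.setRequestMethod(method);
    conn.setInstanceFollowRedirects(false);
    return conn;
  }
}
```

With this approach the extra cost of GET is paid only for hosts that reject HEAD. How much of the body is actually transferred before the disconnect depends on the server and the connection handling.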

tags: java twitter

Posted by NI-Lab. (@nilab)