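URLs are pulled out of the tweet text with the regular expression from "[ヅ] JavaでテキストからURLを抽出する正規表現" (a regular expression for extracting URLs from text in Java). For expanding the shortened URLs, the following article had excellent code:
"There are reportedly more than 90 URL shortening services, so calling each service's own API would never end. The code below has been tested with bit.ly, t.co, goo.gl, and others, and I believe it works with any URL shortening service."
Javaでbit.ly等の短縮URLを展開する方法 « 来栖川電算 (How to expand bit.ly and other shortened URLs in Java, 来栖川電算)
I use that code here with a few modifications.
Source code: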
import java.io.*;
import java.net.*;
import java.util.*;
import java.util.regex.*;

public class UrlExpander {

    public static void main(String[] args) throws Exception {
        String[] samples = {
            "テスト的な何か。 http://t.co/CFzUpytJ",
            "テスト的な何か。 http://t.co/AebrSPWJ / ヅラッシュ! http://t.co/4E452uqL",
        };
        for (String text : samples) {
            Map<String, String> urlPairs = getExpandedUrls(text);
            String replaced = replaceUrls(text, urlPairs);
            System.out.println("source : " + text);
            System.out.println("replaced: " + replaced);
        }
    }

    // Extract the URLs from the text and expand any shortened URLs
    public static Map<String, String> getExpandedUrls(String text) throws IOException, ProtocolException {
        // Extract the URLs
        String[] urls = UrlExtractor.extract(UrlExtractor.STANDARD_URL_MATCH_PATTERN, text);
        // Expand the shortened URLs (key: URL as found in the text, value: expanded URL)
        Map<String, String> result = new HashMap<String, String>();
        for (String url : urls) {
            String expandedUrl = UrlUtility.expandUrl(new URL(url)).toExternalForm();
            result.put(url, expandedUrl);
        }
        return result;
    }

    // Replace each shortened URL in the text with its expanded URL
    public static String replaceUrls(String text, Map<String, String> urlPairs) {
        for (Map.Entry<String, String> entry : urlPairs.entrySet()) {
            // String#replace substitutes literally. replaceAll would treat the URL
            // as a regular expression, where '.', '?' and '+' are metacharacters.
            text = text.replace(entry.getKey(), entry.getValue());
        }
        return text;
    }

    // Ref. - [ヅ] JavaでテキストからURLを抽出する正規表現
    // http://www.nilab.info/z3/20100113_zlashdot_001094.html
    private static class UrlExtractor {

        // Standard URL extraction pattern
        private static final Pattern STANDARD_URL_MATCH_PATTERN = Pattern.compile(
            "(http://|https://){1}[\\w\\.\\-/:\\#\\?\\=\\&\\;\\%\\~\\+]+",
            Pattern.CASE_INSENSITIVE);

        // Greedy URL extraction pattern
        private static final Pattern GREEDY_URL_MATCH_PATTERN = Pattern.compile(
            "(http|https):([^\\x00-\\x20()\"<>\\x7F-\\xFF])*",
            Pattern.CASE_INSENSITIVE);

        // Return every match of the pattern in the text, in order of appearance
        public static String[] extract(Pattern pattern, String text) {
            List<String> list = new ArrayList<String>();
            Matcher matcher = pattern.matcher(text);
            while (matcher.find()) {
                list.add(matcher.group());
            }
            return list.toArray(new String[list.size()]);
        }
    }

    // Ref. - Javaでbit.ly等の短縮URLを展開する方法 << 来栖川電算
    // http://kurusugawa.jp/2011/05/27/fast-universal-java-url-expander/
    private static class UrlUtility {

        public static URL expandUrl(URL aUrl) throws IOException, ProtocolException {
            final URLConnection tURLConnection = aUrl.openConnection(Proxy.NO_PROXY);
            if (!(tURLConnection instanceof HttpURLConnection)) {
                // Not an HTTP(S) URL: nothing to expand
                return aUrl;
            }
            final HttpURLConnection tHttpURLConnection = (HttpURLConnection) tURLConnection;
            // HEAD skips the response body; redirects are followed by hand via the Location header
            tHttpURLConnection.setRequestMethod("HEAD");
            tHttpURLConnection.setInstanceFollowRedirects(false);
            tHttpURLConnection.connect();
            final String tExpandedUrl;
            final String tLocation = tHttpURLConnection.getHeaderField("Location");
            if (tLocation != null && tLocation.startsWith("http")) {
                final int tResponseCode = tHttpURLConnection.getResponseCode();
                if (tResponseCode == HttpURLConnection.HTTP_MOVED_PERM || tResponseCode == HttpURLConnection.HTTP_MOVED_TEMP) {
                    // 301 or 302: the target may itself be a shortened URL, so expand it recursively
                    tExpandedUrl = expandUrl(new URL(encode(tLocation))).toExternalForm();
                } else {
                    // Any other status with a Location header: use the header value without following it
                    tExpandedUrl = tLocation;
                }
            } else {
                // No absolute Location header: treat the current URL as already expanded
                tExpandedUrl = tHttpURLConnection.getURL().toExternalForm();
            }
            return new URL(encode(tExpandedUrl));
        }

        // Hex digits indexed by (signed byte value + 128):
        // entries 0..127 cover bytes 0x80..0xFF, entries 128..255 cover 0x00..0x7F.
        // encode() only looks up negative bytes, so only the first half is used.
        // @formatter:off
        private static final String[] HEX = {
            "80","81","82","83","84","85","86","87","88","89","8A","8B","8C","8D","8E","8F",
            "90","91","92","93","94","95","96","97","98","99","9A","9B","9C","9D","9E","9F",
            "A0","A1","A2","A3","A4","A5","A6","A7","A8","A9","AA","AB","AC","AD","AE","AF",
            "B0","B1","B2","B3","B4","B5","B6","B7","B8","B9","BA","BB","BC","BD","BE","BF",
            "C0","C1","C2","C3","C4","C5","C6","C7","C8","C9","CA","CB","CC","CD","CE","CF",
            "D0","D1","D2","D3","D4","D5","D6","D7","D8","D9","DA","DB","DC","DD","DE","DF",
            "E0","E1","E2","E3","E4","E5","E6","E7","E8","E9","EA","EB","EC","ED","EE","EF",
            "F0","F1","F2","F3","F4","F5","F6","F7","F8","F9","FA","FB","FC","FD","FE","FF",
            "00","01","02","03","04","05","06","07","08","09","0A","0B","0C","0D","0E","0F",
            "10","11","12","13","14","15","16","17","18","19","1A","1B","1C","1D","1E","1F",
            "20","21","22","23","24","25","26","27","28","29","2A","2B","2C","2D","2E","2F",
            "30","31","32","33","34","35","36","37","38","39","3A","3B","3C","3D","3E","3F",
            "40","41","42","43","44","45","46","47","48","49","4A","4B","4C","4D","4E","4F",
            "50","51","52","53","54","55","56","57","58","59","5A","5B","5C","5D","5E","5F",
            "60","61","62","63","64","65","66","67","68","69","6A","6B","6C","6D","6E","6F",
            "70","71","72","73","74","75","76","77","78","79","7A","7B","7C","7D","7E","7F",
        };
        // @formatter:on

        // Percent-encode every byte >= 0x80 and leave ASCII characters untouched
        private static String encode(String aUrl) throws UnsupportedEncodingException {
            final byte[] tBytes = aUrl.getBytes("ISO-8859-1");
            final int tLength = tBytes.length;
            final StringBuilder tBuilder = new StringBuilder(tLength * 3);
            for (int tIndex = 0; tIndex < tLength; tIndex++) {
                final int tIntAt = (int) tBytes[tIndex];
                if (tIntAt < 0) {
                    // Negative as a signed byte means 0x80..0xFF
                    tBuilder.append('%');
                    tBuilder.append(HEX[tIntAt + 128]);
                } else {
                    tBuilder.append((char) tIntAt);
                }
            }
            return tBuilder.toString();
        }
    }
}
Output:
source : テスト的な何か。 http://t.co/CFzUpytJ
replaced: テスト的な何か。 http://www.nilab.info/z3/
source : テスト的な何か。 http://t.co/AebrSPWJ / ヅラッシュ! http://t.co/4E452uqL
replaced: テスト的な何か。 http://www.nilab.info/ / ヅラッシュ! http://www.nilab.info/z3/
The shortened URLs are expanded back to the original URLs.
Of these, only http://t.co/4E452uqL is a multi-stage shortened URL: it is a shortened URL for http://htn.to/ghMHEQ, which is itself a shortened URL.
The actual shortened URLs and the URLs they expand to:
- http://t.co/CFzUpytJ → http://www.nilab.info/z3/
- http://t.co/AebrSPWJ → http://www.nilab.info/
- http://t.co/4E452uqL → http://htn.to/ghMHEQ → http://www.nilab.info/z3/
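The multi-stage case can also be handled with a loop instead of recursion. The following is only a sketch under my own assumptions: the class name RedirectChain, the nextHop function (which stands in for "send one request and read the Location header") and the maxHops limit are all hypothetical and are not part of the code above. The hop limit is there to guard against redirect loops, which the recursive version does not check for.

```java
import java.util.function.Function;

// Hypothetical helper, not part of the original code.
final class RedirectChain {

    /**
     * Follows a redirect chain iteratively.
     *
     * @param start   the URL to start from
     * @param nextHop returns the redirect target of a URL, or null if it does not redirect
     *                (in real use this would issue one HTTP request)
     * @param maxHops upper bound on the number of redirects followed
     * @return the last URL reached
     */
    static String resolve(String start, Function<String, String> nextHop, int maxHops) {
        String current = start;
        for (int hop = 0; hop < maxHops; hop++) {
            String next = nextHop.apply(current);
            if (next == null || next.equals(current)) {
                return current; // no further redirect
            }
            current = next;
        }
        return current; // hop limit reached: return how far we got
    }
}
```

Passing the get method of a Map that holds http://t.co/4E452uqL → http://htn.to/ghMHEQ and http://htn.to/ghMHEQ → http://www.nilab.info/z3/ as nextHop walks the same chain as the third example above without touching the network.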
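Reference information follows.
"The following points had to be taken into account.
- The expanded URL sometimes contains multibyte characters
  → handled by URL-encoding it
- When the server hosting the expanded URL is IIS, the path is dropped from the result of HttpURLConnection#getURL()
  → handled by calling tHttpURLConnection.setInstanceFollowRedirects(false) and reading the Location header without accessing IIS
- It is slow
  → handled by calling URL#openConnection(Proxy.NO_PROXY)
  → handled by using the HEAD method so that the body is ignored
- It is hard to tell which URLs do not need expanding
  → handled by attempting expansion anyway and using either the Location header or HttpURLConnection#getURL()
- Redirects sometimes happen in multiple stages
  → handled by calling expandUrl recursively while the HTTP status code is 301 or 302"
Javaでbit.ly等の短縮URLを展開する方法 « 来栖川電算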
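To make the first point in that list concrete, here is a rough sketch of the same percent-encoding idea without the lookup table. The class and method names (LocationEncoder, percentEncodeNonAscii) are made up for illustration. It assumes, as the original encode method does, that the header value was decoded as ISO-8859-1, so each char corresponds to one raw byte; characters outside ISO-8859-1 would come out as '?'.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original code.
final class LocationEncoder {

    /** Percent-encodes every byte >= 0x80 and leaves ASCII characters untouched. */
    static String percentEncodeNonAscii(String url) {
        // Assumption: the string was decoded as ISO-8859-1, so one char equals one raw byte
        byte[] bytes = url.getBytes(StandardCharsets.ISO_8859_1);
        StringBuilder out = new StringBuilder(bytes.length * 3);
        for (byte b : bytes) {
            if (b < 0) {
                // Negative as a signed byte means 0x80..0xFF
                out.append('%').append(String.format("%02X", b & 0xFF));
            } else {
                out.append((char) b);
            }
        }
        return out.toString();
    }
}
```

For example, if a server sends the UTF-8 bytes E6 97 A5 (the character 日) unescaped in its Location header, the three bytes become %E6%97%A5. String.format is slower than a table lookup, which hardly matters next to the cost of the HTTP request itself.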
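Also, according to the comments on that article, accessing Amazon.co.jp or Amazon S3 URLs with the HEAD method results in 403 Forbidden. In that case, removing
tHttpURLConnection.setRequestMethod("HEAD");
reportedly makes it work.
However, unlike HEAD, GET tries to fetch the entire content, so processing should be slower and traffic should increase.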
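One possible compromise is to keep HEAD as the default and fall back to GET only when HEAD is refused. This is an untested idea of mine, not something from the original article: the class name HeadWithGetFallback, the 5-second timeouts, and treating 405 Method Not Allowed the same way as 403 are all assumptions. The sketch handles a single hop and assumes an http or https URL.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.Proxy;
import java.net.URL;

// Hypothetical helper, not part of the original code.
final class HeadWithGetFallback {

    /** 403 comes from the comments on the article; 405 is my own assumption. */
    static boolean shouldRetryWithGet(int status) {
        return status == HttpURLConnection.HTTP_FORBIDDEN
            || status == HttpURLConnection.HTTP_BAD_METHOD;
    }

    /** Returns the Location header for one hop, or null if the response has none. */
    static String fetchLocation(URL url) throws IOException {
        HttpURLConnection head = open(url, "HEAD");
        try {
            if (!shouldRetryWithGet(head.getResponseCode())) {
                return head.getHeaderField("Location");
            }
        } finally {
            head.disconnect();
        }
        // HEAD was refused: retry once with GET
        HttpURLConnection get = open(url, "GET");
        try {
            get.getResponseCode(); // sends the request and reads the status line and headers
            return get.getHeaderField("Location");
        } finally {
            get.disconnect(); // the body is never read here
        }
    }

    private static HttpURLConnection open(URL url, String method) throws IOException {
        // Assumes an http or https URL; other schemes would fail this cast
        HttpURLConnection conn = (HttpURLConnection) url.openConnection(Proxy.NO_PROXY);
        conn.setRequestMethod(method);
        conn.setInstanceFollowRedirects(false);
        conn.setConnectTimeout(5000); // arbitrary values
        conn.setReadTimeout(5000);
        return conn;
    }
}
```

Because the GET branch disconnects without reading the body, it should usually avoid downloading the whole content, although part of the body may already have been transferred by then.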
tags: java twitter
Posted by NI-Lab. (@nilab)