I wanted to process several videos in one go, so I wrote this.
All it does is fetch the HTML and pick the URL out of it.


import java.io.*;
import java.net.*;
import java.util.*;
import java.util.regex.*;
 
public class Ytd {
 
  // Pass video_id values (strings like Qn-TcgJbVuM) as arguments
  public static void main(String[] args) throws Exception {
 
    HttpURLConnection.setFollowRedirects(true);
 
    String[] video_id = args;
    //String[] video_id = {"Qn-TcgJbVuM"};
    for(int i=0; i<video_id.length; i++){
      System.out.println(getURL(video_id[i]));
    }
  }
 
  private static String getURL(String video_id) throws Exception {
 
    // The one-argument encode() is deprecated; name the charset explicitly
    video_id = URLEncoder.encode(video_id, "UTF-8");
 
    URL url = new URL("http://youtube.com/watch?v=" + video_id);
    //URL url = new URL("http://www.youtube.com/v/" + youtube_id);
    //URL url = new URL("http://www.youtube.com/watch_video?v=" + youtube_id);
 
    URLConnection con = url.openConnection();
    con.connect();
 
    // Fetch the HTML and extract the t value that belongs to this video
    String t = null;
    // try-with-resources closes the reader (and the underlying stream)
    try (BufferedReader br = new BufferedReader(
           new InputStreamReader(con.getInputStream(), "UTF-8"))) {
 
      // .+? is the reluctant (shortest-match) quantifier
      Pattern pattern = Pattern.compile(".*&t=(.+?)&.*");
      String line;
      while((line = br.readLine()) != null){
         Matcher m = pattern.matcher(line);
         if(m.matches()){
           t = m.group(1);
           break;
         }
      }
    }
 
    String u =
      "http://www.youtube.com/get_video?" +
      "video_id=" + video_id + "&t=" + t;
 
    return u;
  }
 
}

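Some video_id values seem to fail, but this is as far as I am taking it for now.
Feed the printed URLs to a download tool to fetch everything at once. Note that wget seems to fail on the HTTP redirect, so use something like Irvine instead.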

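My main reference was "google videoやyoutubeとかの動画を落として保存" (Download and save videos from google video, youtube and the like). Its source code is published, which helped a lot.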

I do not know Java regular expressions well, so I referred to "Javaプログラミング よろづ話 - 正規表現によるマッチング" (Matching with regular expressions).
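As a small aside on the regex itself, here is a minimal sketch of the t extraction on its own, assuming the page contains a fragment shaped like the player2.swf URL from the header log. The class name TValueExtractor and the sample line are made up for illustration. It uses find() instead of matches(), so the pattern needs no leading or trailing .*, and it picks the first &t= in the line, whereas the greedy .* in the program above would pick the last one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TValueExtractor {
    // Reluctant (.+?) so the capture stops at the first '&' after "&t="
    private static final Pattern T_PARAM = Pattern.compile("&t=(.+?)&");

    /** Returns the t value found in the line, or null when there is none. */
    static String extractT(String line) {
        Matcher m = T_PARAM.matcher(line);
        // find() searches anywhere in the line; matches() would need the whole line to fit
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical input line, modelled on the player URL in the header log
        String line = "player2.swf?video_id=Qn-TcgJbVuM&l=262"
                + "&t=OEgsToPDskIX4Xt1GLR0a17wAixFHTqN&nc=13369344";
        System.out.println(extractT(line));
    }
}
```

Returning null for "not found" mirrors the original program, where t simply stays null when no line matches.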

Other notes from the detours along the way

I first tried to do it in Perl and gave up.


#!/usr/local/bin/perl
 
# Ref.
#   - HTTP modules
#     http://homepage3.nifty.com/hippo2000/perltips/HTTP.html
#
# HTTP::Message <--- HTTP::Response (HTTP::Response inherits from HTTP::Message)
#
 
use strict;
use warnings;
 
use FileHandle;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
 
my $movie_url_prefix = 'http://www.youtube.com/v/';
my $id = 'Qn-TcgJbVuM';
 
my $movie_url = $movie_url_prefix . $id;
print "$movie_url\n";
 
# fetch the embed URL
# NOTE: LWP::UserAgent follows GET redirects by default, so the final
# response normally carries no Location header.
my $ua = LWP::UserAgent->new;
my $req = HTTP::Request->new('GET', $movie_url);
my $res = $ua->request($req);
my $loc = $res->header('Location');
my $headers = $res->headers();
print $headers->as_string();
 
die "No Location header in the response for $movie_url\n" unless defined $loc;
my $req2 = HTTP::Request->new(GET => $loc);
my $res2 = $ua->request($req2);
my $content = $res2->content;
 
my $fh = FileHandle->new("> $id");
if(defined $fh){
  binmode $fh;  # the body is binary data
  print $fh $content;
  $fh->close;
}else{
  die "Couldn't open the file $id: $!";
}

Java code I used while experimenting.


// For inspecting request headers (call before connecting;
// getRequestProperties() throws IllegalStateException afterwards)
private static void printRequestHeaders(URLConnection con){
  System.err.println("Request Headers:");
  Map<String, List<String>> req_headers = con.getRequestProperties();
  for(Map.Entry<String, List<String>> e : req_headers.entrySet()){
    System.err.println(e.getKey() + ": " + e.getValue());
  }
  System.err.println();
}
 
// For inspecting response headers (the status line is stored under the null key)
private static void printResponseHeaders(URLConnection con){
  System.err.println("Response Headers:");
  Map<String, List<String>> res_headers = con.getHeaderFields();
  for(Map.Entry<String, List<String>> e : res_headers.entrySet()){
    System.err.println(e.getKey() + ": " + e.getValue());
  }
  System.err.println();
}
 
// Read the whole stream into a byte array
private static byte[] toBytes(InputStream src)throws IOException{
 
  ByteArrayOutputStream out = new ByteArrayOutputStream();
  byte[] buf = new byte[8192];
  int n;
  while((n = src.read(buf)) != -1){
    out.write(buf, 0, n);
  }
 
  return out.toByteArray();
}

I set "YouTube - 或るはてなブックマーカーの挑戦" as the test target.

Accessing http://www.youtube.com/watch?v=Qn-TcgJbVuM returns HTML with http://www.youtube.com/get_video?video_id=Qn-TcgJbVuM&t=OEgsToPDskIX4Xt1GLR0a17wAixFHTqN embedded in it. That URL apparently answers with HTTP/1.x 303 See Other, and in the end the browser downloads http://youtube-617.vo.llnwd.net/d1/00/F7/Qn-TcgJbVuM.flv.
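To make the relation between those URLs concrete, here is a rough sketch that takes a player URL carrying video_id and t in its query string and assembles the matching get_video URL. The class GetVideoUrl and its methods are hypothetical helpers, not part of the program above, and the query parser is deliberately naive (no percent-decoding, last value wins).

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class GetVideoUrl {
    /** Splits the query part of a URL into name/value pairs (no percent-decoding). */
    static Map<String, String> queryParams(String url) {
        Map<String, String> params = new LinkedHashMap<>();
        int q = url.indexOf('?');
        if (q < 0) {
            return params;
        }
        for (String pair : url.substring(q + 1).split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }

    /** Builds the get_video URL from a URL that carries video_id and t. */
    static String build(String playerUrl) {
        Map<String, String> p = queryParams(playerUrl);
        String id = p.get("video_id");
        String t = p.get("t");
        if (id == null || t == null) {
            throw new IllegalArgumentException("video_id or t missing: " + playerUrl);
        }
        return "http://www.youtube.com/get_video?video_id=" + id + "&t=" + t;
    }

    public static void main(String[] args) {
        // Player URL as it appears in the Live HTTP Headers log
        System.out.println(build("http://www.youtube.com/player2.swf"
                + "?video_id=Qn-TcgJbVuM&l=262&t=OEgsToPDskIX4Xt1GLR0a17wAixFHTqN&nc=13369344"));
    }
}
```

Throwing when t is missing is a design choice of this sketch; the original program would instead print a URL ending in t=null.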

I inspected the HTTP headers with Live HTTP Headers for Mozilla Firefox.


----------------------------------------------------------
http://www.youtube.com/watch?v=Qn-TcgJbVuM
 
GET /watch?v=Qn-TcgJbVuM HTTP/1.1
Host: www.youtube.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; ja; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: ja,en;q=0.7,en-us;q=0.3
Accept-Encoding: gzip,deflate
Accept-Charset: EUC-JP,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
 
HTTP/1.x 200 OK
Date: Mon, 14 Aug 2006 23:06:28 GMT
Server: Apache
Set-Cookie: watched_video_id_list=1e6813a4cc0dd78074042e5437d25bf8WwEAAABzCwAAAFFuLVRjZ0piVnVN; path=/; domain=.youtube.com
Set-Cookie: VISITOR_INFO1_LIVE=oZr3pU92K3g; path=/; domain=.youtube.com; expires=Thu, 11-Aug-2016 23:06:28 GMT
Content-Encoding: gzip
Cache-Control: no-cache
Keep-Alive: timeout=300
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=utf-8
----------------------------------------------------------
http://www.youtube.com/player2.swf?video_id=Qn-TcgJbVuM&l=262&t=OEgsToPDskIX4Xt1GLR0a17wAixFHTqN&nc=13369344
 
GET /player2.swf?video_id=Qn-TcgJbVuM&l=262&t=OEgsToPDskIX4Xt1GLR0a17wAixFHTqN&nc=13369344 HTTP/1.1
Host: www.youtube.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; ja; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: ja,en;q=0.7,en-us;q=0.3
Accept-Encoding: gzip,deflate
Accept-Charset: EUC-JP,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Referer: http://www.youtube.com/watch?v=Qn-TcgJbVuM
Cookie: watched_video_id_list=1e6813a4cc0dd78074042e5437d25bf8WwEAAABzCwAAAFFuLVRjZ0piVnVN; VISITOR_INFO1_LIVE=oZr3pU92K3g
 
HTTP/1.x 200 OK
Date: Mon, 14 Aug 2006 23:06:40 GMT
Server: Apache
Last-Modified: Fri, 21 Jul 2006 21:46:23 GMT
Etag: "41d3-6d75e9c0"
Accept-Ranges: bytes
Content-Length: 16851
Keep-Alive: timeout=300
Connection: Keep-Alive
Content-Type: application/x-shockwave-flash
----------------------------------------------------------
http://www.youtube.com/get_video?video_id=Qn-TcgJbVuM&t=OEgsToPDskIX4Xt1GLR0a17wAixFHTqN
 
GET /get_video?video_id=Qn-TcgJbVuM&t=OEgsToPDskIX4Xt1GLR0a17wAixFHTqN HTTP/1.1
Host: www.youtube.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; ja; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: ja,en;q=0.7,en-us;q=0.3
Accept-Encoding: gzip,deflate
Accept-Charset: EUC-JP,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cookie: watched_video_id_list=1e6813a4cc0dd78074042e5437d25bf8WwEAAABzCwAAAFFuLVRjZ0piVnVN; VISITOR_INFO1_LIVE=oZr3pU92K3g
 
HTTP/1.x 303 See Other
Date: Mon, 14 Aug 2006 23:06:49 GMT
Server: Apache
Cache-Control: no-cache
Location: http://youtube-617.vo.llnwd.net/d1/00/F7/Qn-TcgJbVuM.flv
Keep-Alive: timeout=300
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=utf-8
----------------------------------------------------------
http://youtube-617.vo.llnwd.net/d1/00/F7/Qn-TcgJbVuM.flv
 
GET /d1/00/F7/Qn-TcgJbVuM.flv HTTP/1.1
Host: youtube-617.vo.llnwd.net
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; ja; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: ja,en;q=0.7,en-us;q=0.3
Accept-Encoding: gzip,deflate
Accept-Charset: EUC-JP,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
 
HTTP/1.x 200 OK
Server: Apache/2.0.54
Accept-Ranges: bytes
Content-Length: 10980203
Content-Type: video/flv
Age: 46208
Date: Mon, 14 Aug 2006 23:06:50 GMT
Last-Modified: Fri, 11 Aug 2006 00:26:40 GMT
Connection: close
----------------------------------------------------------

Using Embed, the mechanism for placing a video on a web page outside YouTube with an object tag, I obtained http://www.youtube.com/v/Qn-TcgJbVuM.
Accessing http://www.youtube.com/v/Qn-TcgJbVuM redirects with HTTP/1.x 302 Found to http://www.youtube.com/watch_video?v=Qn-TcgJbVuM, which in turn redirects with HTTP/1.x 303 See Other to http://www.youtube.com/p.swf?video_id=Qn-TcgJbVuM&eurl=&iurl=http%3A//sjl-static11.sjl.youtube.com/vi/Qn-TcgJbVuM/2.jpg&t=OEgsToPDskKk8n8_GUvdjAjxYg00gHYB.
I tried to reproduce this browser-like access from Perl and Java, but failed.
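One way to watch a chain like this from Java is to switch off automatic redirect handling and walk the hops by hand. The following is only a sketch under that assumption: RedirectTracer, resolve, isRedirect and follow are names invented here, and the endpoints in the log date from 2006, so I would not expect them to answer the same way today. The part worth noting is resolve(), because the 303 response in the log carries a relative Location (/p.swf?...) that has to be resolved against the current URL.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

final class RedirectTracer {
    /** Resolves a (possibly relative) Location header against the current URL. */
    static String resolve(String current, String location) {
        return URI.create(current).resolve(location).toString();
    }

    /** Status codes this sketch treats as redirects. */
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303 || status == 307;
    }

    /** Follows redirects one hop at a time and returns the final URL. */
    static String follow(String start, int maxHops) throws IOException {
        String current = start;
        for (int hop = 0; hop < maxHops; hop++) {
            HttpURLConnection con =
                    (HttpURLConnection) URI.create(current).toURL().openConnection();
            con.setInstanceFollowRedirects(false); // inspect every hop ourselves
            try {
                int status = con.getResponseCode();
                String location = con.getHeaderField("Location");
                System.err.println(status + " " + current);
                if (!isRedirect(status) || location == null) {
                    return current;
                }
                current = resolve(current, location);
            } finally {
                con.disconnect();
            }
        }
        throw new IOException("Too many redirects from " + start);
    }
}
```

A call such as RedirectTracer.follow("http://www.youtube.com/v/Qn-TcgJbVuM", 5) would print each status and URL to stderr; cookies and the Referer header are not carried over here, which may well matter for the real site.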


----------------------------------------------------------
http://www.youtube.com/v/Qn-TcgJbVuM
 
GET /v/Qn-TcgJbVuM HTTP/1.1
Host: www.youtube.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; ja; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: ja,en;q=0.7,en-us;q=0.3
Accept-Encoding: gzip,deflate
Accept-Charset: EUC-JP,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
 
HTTP/1.x 302 Found
Date: Mon, 14 Aug 2006 23:08:55 GMT
Server: Apache
Location: http://www.youtube.com/watch_video?v=Qn-TcgJbVuM
Content-Length: 297
Keep-Alive: timeout=300
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1
----------------------------------------------------------
http://www.youtube.com/watch_video?v=Qn-TcgJbVuM
 
GET /watch_video?v=Qn-TcgJbVuM HTTP/1.1
Host: www.youtube.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; ja; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: ja,en;q=0.7,en-us;q=0.3
Accept-Encoding: gzip,deflate
Accept-Charset: EUC-JP,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
 
HTTP/1.x 303 See Other
Date: Mon, 14 Aug 2006 23:08:07 GMT
Server: Apache
Cache-Control: no-cache
Location: /p.swf?video_id=Qn-TcgJbVuM&eurl=&iurl=http%3A//sjl-static11.sjl.youtube.com/vi/Qn-TcgJbVuM/2.jpg&t=OEgsToPDskKk8n8_GUvdjAjxYg00gHYB
Keep-Alive: timeout=300
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/plain
----------------------------------------------------------
http://www.youtube.com/p.swf?video_id=Qn-TcgJbVuM&eurl=&iurl=http%3A//sjl-static11.sjl.youtube.com/vi/Qn-TcgJbVuM/2.jpg&t=OEgsToPDskKk8n8_GUvdjAjxYg00gHYB
 
GET /p.swf?video_id=Qn-TcgJbVuM&eurl=&iurl=http%3A//sjl-static11.sjl.youtube.com/vi/Qn-TcgJbVuM/2.jpg&t=OEgsToPDskKk8n8_GUvdjAjxYg00gHYB HTTP/1.1
Host: www.youtube.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; ja; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: ja,en;q=0.7,en-us;q=0.3
Accept-Encoding: gzip,deflate
Accept-Charset: EUC-JP,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
 
HTTP/1.x 200 OK
Date: Mon, 14 Aug 2006 23:08:59 GMT
Server: Apache
Last-Modified: Tue, 06 Jun 2006 21:43:23 GMT
Etag: "2fcf-23df74c0"
Accept-Ranges: bytes
Content-Length: 12239
Keep-Alive: timeout=300
Connection: Keep-Alive
Content-Type: application/x-shockwave-flash
----------------------------------------------------------

It seems the HTTP 302 and HTTP 303 responses end up delivering Flash content as well, so download tools sometimes fetch that unneeded content as is.
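A tool could guard against this by looking at the Content-Type of the final response before saving anything. Below is a minimal sketch of such a check, assuming the video is served as video/flv as in the log above (other servers may label it differently). ContentTypeFilter and its methods are hypothetical names.

```java
import java.util.Locale;

final class ContentTypeFilter {
    /** Strips parameters such as "; charset=utf-8" and lower-cases the media type. */
    static String mediaType(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semi = contentType.indexOf(';');
        String type = semi < 0 ? contentType : contentType.substring(0, semi);
        return type.trim().toLowerCase(Locale.ROOT);
    }

    /** True only for the media type the .flv response in the log was served with. */
    static boolean isFlashVideo(String contentType) {
        return mediaType(contentType).equals("video/flv");
    }

    public static void main(String[] args) {
        // Content-Type values taken from the header log
        System.out.println(isFlashVideo("video/flv"));
        System.out.println(isFlashVideo("application/x-shockwave-flash"));
    }
}
```

With URLConnection the value to pass in would come from getContentType().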

While I am at it, a bit of trivia about the magic number in the first few bytes of a SWF file.

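The header of a swf file is laid out as

1. "CWS" in the first 3 bytes (the compressed form; an uncompressed file starts with "FWS")
2. the version number in the 4th byte
3. ...

so, for example, a swf created with FlashMX has a version number of 0x06 or lower,
but changing this byte to 0x08, the Flash8 equivalent, with a binary editor or the like makes the newer features work.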

7bit | AS memo
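The header layout quoted above is easy to check in code. This is a small sketch that reads the signature and version from the first four bytes of a buffer; SwfHeader is a made-up class name, and accepting "FWS" next to "CWS" is my addition based on the uncompressed form of the format.

```java
import java.nio.charset.StandardCharsets;

final class SwfHeader {
    /** Returns the 3-byte signature as text, or "" when the buffer is too short. */
    static String signature(byte[] header) {
        if (header == null || header.length < 3) {
            return "";
        }
        return new String(header, 0, 3, StandardCharsets.US_ASCII);
    }

    /** Returns the version stored in the 4th byte, or -1 when it is missing. */
    static int version(byte[] header) {
        if (header == null || header.length < 4) {
            return -1;
        }
        return header[3] & 0xFF; // treat the byte as unsigned
    }

    /** True when the buffer starts with a SWF signature followed by a version byte. */
    static boolean looksLikeSwf(byte[] header) {
        String sig = signature(header);
        return version(header) >= 0 && (sig.equals("CWS") || sig.equals("FWS"));
    }

    public static void main(String[] args) {
        // Hand-written sample header: compressed SWF, version 8
        byte[] sample = {'C', 'W', 'S', 8};
        System.out.println(signature(sample) + " version " + version(sample));
    }
}
```

Combined with a helper that reads a response body into a byte array, a check like looksLikeSwf() could flag a download that turned out to be the player rather than the video.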

Comments

Oops, I had forgotten to close the BufferedReader.

tags: zlashdot Movie Java YouTube

Posted by NI-Lab. (@nilab)