The test environment is Debian GNU/Linux Lenny.
$ uname -mrvs
Linux 2.6.26-2-amd64 #1 SMP Tue Jan 25 05:59:43 UTC 2011 x86_64
$ ruby -v
ruby 1.8.7 (2008-08-11 patchlevel 72) [x86_64-linux]
$ dpkg -l | grep hpricot
ii libhpricot-ruby 0.6-2 A fast, enjoyable HTML parser
ii libhpricot-ruby1.8 0.6-2 A fast, enjoyable HTML parser
Sample code.
It replaces a href, img src, embed src, and param value (only where the param's name is "movie") with XXXXXXXXXX.
$ cat ./replace_urls.rb
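#!/usr/bin/env ruby
$KCODE='u'
require 'hpricot'
require 'open-uri'

# Overwrite the URL-bearing attributes with a placeholder and return the
# re-serialized HTML. On error, report to stderr and return nil.
def replace_urls(doc)
  begin
    (doc/:a).each{|elem|
      elem[:href] = 'XXXXXXXXXX'
    }
    (doc/:img).each{|elem|
      elem[:src] = 'XXXXXXXXXX'
    }
    (doc/:embed).each{|elem|
      elem[:src] = 'XXXXXXXXXX'
    }
    # Only the param that carries the movie URL is rewritten.
    (doc/:param).each{|elem|
      if elem[:name] == 'movie'
        elem[:value] = 'XXXXXXXXXX'
      end
    }
    return doc.to_html
  rescue
    $stderr.puts $!.inspect
  end
end

url = 'http://www.nilab.info/lab/0/link_image_flash.html'
# Read the body into a String. open-uri returns a StringIO only for small
# responses (a Tempfile otherwise), so calling .string is not always safe.
html = open(url).read
puts '--- original html ----------------------'
puts html
result = replace_urls(Hpricot(html))
puts '--- replaced html ----------------------'
puts result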
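Output.
At first glance only the targeted attributes were replaced and the rest of the markup was left as it was... but the stray "</p>" after "</object>", which is invalid in the original HTML, has disappeared.
Also, "<!DOCTYPE html>" has turned into "<!DOCTYPE html SYSTEM>".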
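If the altered doctype matters, one possible workaround is to patch the serialized string afterwards with plain string handling. The sketch below is only an illustration under that assumption: doctype_of and restore_doctype are hypothetical helper names, not part of Hpricot, and the regex does not handle a doctype with an internal subset ([...]).

```ruby
# Hypothetical helper: return the first <!DOCTYPE ...> declaration, or nil.
def doctype_of(html)
  html[/<!DOCTYPE[^>]*>/i]
end

# Hypothetical helper: put the original document's doctype back into the
# re-serialized output. The output is returned unchanged when either
# document has no doctype.
def restore_doctype(original_html, output_html)
  original = doctype_of(original_html)
  return output_html if original.nil? || doctype_of(output_html).nil?
  # Block form so that the replacement text is inserted literally.
  output_html.sub(/<!DOCTYPE[^>]*>/i) { original }
end

# Shortened stand-ins for the fetched HTML and the Hpricot output.
before = "<!DOCTYPE html>\n<html></html>"
after  = "<!DOCTYPE html SYSTEM>\n<html></html>"
puts restore_doctype(before, after)
```

In the script, this would be called with the fetched HTML string and the string returned by replace_urls.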
$ ruby ./replace_urls.rb
--- original html ----------------------
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>link image flash (実験サンプル用HTML)</title>
</head>
<body>
<h1>link image flash (実験サンプル用HTML)</h1>
<p><img src="http://www.nilab.info/z3/z3_profile.jpg" /></p>
<object classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=9,0,0,0" width="550" height="400" id="20090405_FireSmokeFx" align="middle"><param name="allowScriptAccess" value="sameDomain" /><param name="allowFullScreen" value="false" /><param name="movie" value="http://www.nilab.info/zurazure2/20090405_FireSmokeFx.swf" /><param name="quality" value="high" /><param name="bgcolor" value="#333333" /> <embed src="http://www.nilab.info/zurazure2/20090405_FireSmokeFx.swf" quality="high" bgcolor="#333333" width="550" height="400" name="20090405_FireSmokeFx" align="middle" allowScriptAccess="sameDomain" allowFullScreen="false" type="application/x-shockwave-flash" pluginspage="http://www.adobe.com/go/getflashplayer_jp" /></object></p>
<p><a href="/">NI-Lab.</a></p>
</body>
</html>
--- replaced html ----------------------
<!DOCTYPE html SYSTEM>
<html>
<head>
<meta charset="UTF-8" />
<title>link image flash (実験サンプル用HTML)</title>
</head>
<body>
<h1>link image flash (実験サンプル用HTML)</h1>
<p><img src="XXXXXXXXXX" /></p>
<object id="20090405_FireSmokeFx" align="middle" height="400" classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" width="550" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=9,0,0,0"><param name="allowScriptAccess" value="sameDomain" /><param name="allowFullScreen" value="false" /><param name="movie" value="XXXXXXXXXX" /><param name="quality" value="high" /><param name="bgcolor" value="#333333" /> <embed name="20090405_FireSmokeFx" pluginspage="http://www.adobe.com/go/getflashplayer_jp" allowfullscreen="false" src="XXXXXXXXXX" allowscriptaccess="sameDomain" type="application/x-shockwave-flash" align="middle" height="400" quality="high" width="550" bgcolor="#333333"></embed></object>
<p><a href="XXXXXXXXXX">NI-Lab.</a></p>
</body>
</html>
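Structural changes like the dropped "</p>" are easy to miss by eye. As a rough check, the opening and closing tags can be counted before and after and the counts compared. The sketch below uses only the Ruby standard library. tag_sequence, tag_counts, and changed_tags are hypothetical helper names, and the regex-based scan is a simplification, not a real HTML parser.

```ruby
# Hypothetical helper: list tag names in document order, with closing tags
# prefixed by "/". Attributes, text, comments and the doctype are ignored.
def tag_sequence(html)
  html.scan(/<(\/?)([A-Za-z][A-Za-z0-9]*)/).map { |slash, name| slash + name.downcase }
end

# Count how many times each tag token occurs.
def tag_counts(html)
  counts = Hash.new(0)
  tag_sequence(html).each { |tag| counts[tag] += 1 }
  counts
end

# Return { tag => [count_before, count_after] } for every tag whose count differs.
def changed_tags(before, after)
  b = tag_counts(before)
  a = tag_counts(after)
  diff = {}
  (b.keys | a.keys).sort.each do |tag|
    diff[tag] = [b[tag], a[tag]] unless b[tag] == a[tag]
  end
  diff
end

# Shortened stand-ins for the two documents in the transcript.
original = '<p><img src="a.jpg" /></p><object><embed src="a.swf" /></object></p>'
replaced = '<p><img src="XXXXXXXXXX" /></p><object><embed src="XXXXXXXXXX"></embed></object>'

changed_tags(original, replaced).each do |tag, (before, after)|
  puts "#{tag}: #{before} -> #{after}"
end
```

For these stand-in strings the check reports one added "/embed" and one fewer "/p", which matches the differences visible in the transcript.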
tags: ruby html_parser hpricot
Posted by NI-Lab. (@nilab)