The test environment is Debian GNU/Linux Lenny.


$ uname -mrvs
Linux 2.6.26-2-amd64 #1 SMP Tue Jan 25 05:59:43 UTC 2011 x86_64
 
$ ruby -v
ruby 1.8.7 (2008-08-11 patchlevel 72) [x86_64-linux]
 
$ dpkg -l | grep hpricot
ii  libhpricot-ruby                     0.6-2                      A fast, enjoyable HTML parser
ii  libhpricot-ruby1.8                  0.6-2                      A fast, enjoyable HTML parser

Sample code.
It replaces a href, img src, embed src, param value (for the param named "movie"), and so on with XXXXXXXXXX.


$ cat ./replace_urls.rb
#!/usr/bin/env ruby
$KCODE='u'
 
require 'hpricot'
require 'open-uri'
 
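# Replace URL-bearing attributes in the parsed document with a placeholder
# and return the resulting HTML as a String.
def replace_urls(doc)
  begin
    (doc/:a).each{|elem|
      elem[:href] = 'XXXXXXXXXX'
    }
    (doc/:img).each{|elem|
      elem[:src] = 'XXXXXXXXXX'
    }
    (doc/:embed).each{|elem|
      elem[:src] = 'XXXXXXXXXX'
    }
    # Only the param named "movie" holds a URL; leave the others untouched.
    (doc/:param).each{|elem|
      if elem[:name] == 'movie'
        elem[:value] = 'XXXXXXXXXX'
      end
    }
    return doc.to_html
  rescue
    # On any error, report it on stderr; the method then returns nil.
    $stderr.puts $!.inspect
  end
end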
 
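url = 'http://www.nilab.info/lab/0/link_image_flash.html'
# open-uri may return a StringIO or a Tempfile depending on the response size,
# so read the body into a String instead of relying on StringIO#string.
html = open(url).read
 
puts '--- original html ----------------------'
puts html
 
result = replace_urls(Hpricot(html))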
 
puts '--- replaced html ----------------------'
puts result

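Output.
At first it looked as though only the specified attributes had been replaced while the rest of the markup was left intact... but then I noticed that the "</p>" after "</object>", which was syntactically invalid to begin with, had disappeared.
On top of that, "<!DOCTYPE html>" has turned into "<!DOCTYPE html SYSTEM>".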


$ ruby ./replace_urls.rb
--- original html ----------------------
<!DOCTYPE html>
<html>
        <head>
                <meta charset="UTF-8">
                <title>link image flash (実験サンプル用HTML)</title>
        </head>
        <body>
                <h1>link image flash (実験サンプル用HTML)</h1>
 
                <p><img src="http://www.nilab.info/z3/z3_profile.jpg" /></p>
 
                <object classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=9,0,0,0" width="550" height="400" id="20090405_FireSmokeFx" align="middle"><param name="allowScriptAccess" value="sameDomain" /><param name="allowFullScreen" value="false" /><param name="movie" value="http://www.nilab.info/zurazure2/20090405_FireSmokeFx.swf" /><param name="quality" value="high" /><param name="bgcolor" value="#333333" />     <embed src="http://www.nilab.info/zurazure2/20090405_FireSmokeFx.swf" quality="high" bgcolor="#333333" width="550" height="400" name="20090405_FireSmokeFx" align="middle" allowScriptAccess="sameDomain" allowFullScreen="false" type="application/x-shockwave-flash" pluginspage="http://www.adobe.com/go/getflashplayer_jp" /></object></p>
 
                <p><a href="/">NI-Lab.</a></p>
        </body>
</html>
--- replaced html ----------------------
<!DOCTYPE html SYSTEM>
<html>
        <head>
                <meta charset="UTF-8" />
                <title>link image flash (実験サンプル用HTML)</title>
        </head>
        <body>
                <h1>link image flash (実験サンプル用HTML)</h1>
 
                <p><img src="XXXXXXXXXX" /></p>
 
                <object id="20090405_FireSmokeFx" align="middle" height="400" classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" width="550" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=9,0,0,0"><param name="allowScriptAccess" value="sameDomain" /><param name="allowFullScreen" value="false" /><param name="movie" value="XXXXXXXXXX" /><param name="quality" value="high" /><param name="bgcolor" value="#333333" />   <embed name="20090405_FireSmokeFx" pluginspage="http://www.adobe.com/go/getflashplayer_jp" allowfullscreen="false" src="XXXXXXXXXX" allowscriptaccess="sameDomain" type="application/x-shockwave-flash" align="middle" height="400" quality="high" width="550" bgcolor="#333333"></embed></object>
 
                <p><a href="XXXXXXXXXX">NI-Lab.</a></p>
        </body>
</html>
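
The side effects noted above (the dropped "</p>" and the rewritten DOCTYPE) were found by reading the output by eye. As a rough sketch, the same comparison could be done mechanically with nothing but the standard library. The helper names tag_counts, tag_count_diff and doctype_of are hypothetical and are not part of Hpricot; the tally is a crude regex count rather than a real parse, and the two sample strings are abbreviated stand-ins for the original and replaced HTML.

```ruby
# Hypothetical helpers (not part of Hpricot) for spotting markup that a
# parse/serialize round trip changed besides the attributes we meant to edit.

# Count opening and closing tags in an HTML string, keyed by e.g. "p" or "/p".
# This is a rough regex-based tally, not a real parser.
def tag_counts(html)
  counts = Hash.new(0)
  html.scan(/<(\/?[A-Za-z][A-Za-z0-9]*)/) { |m| counts[m[0].downcase] += 1 }
  counts
end

# Return the tags whose counts differ, as { tag => [before, after] }.
def tag_count_diff(before_html, after_html)
  before = tag_counts(before_html)
  after  = tag_counts(after_html)
  diff = {}
  (before.keys | after.keys).sort.each do |tag|
    diff[tag] = [before[tag], after[tag]] if before[tag] != after[tag]
  end
  diff
end

# Return the first <!DOCTYPE ...> declaration, or nil if there is none.
def doctype_of(html)
  html[/<!DOCTYPE[^>]*>/i]
end

# Abbreviated stand-ins for the original and the replaced HTML.
original = '<!DOCTYPE html><p><img src="a.jpg" /></p>' +
           '<object><embed src="a.swf" /></object></p>'
replaced = '<!DOCTYPE html SYSTEM><p><img src="X" /></p>' +
           '<object><embed src="X"></embed></object>'

p tag_count_diff(original, replaced)
p doctype_of(original) == doctype_of(replaced)
```

With these sample strings the diff reports "/p" going from 2 to 1 (the stray closing tag that was dropped) and "/embed" going from 0 to 1, and the DOCTYPE comparison is false. The "/embed" entry is only the self-closing "<embed ... />" being written back out as "<embed ...></embed>", so a check like this flags candidates to look at rather than definite breakage.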

tags: ruby html_parser hpricot

Posted by NI-Lab. (@nilab)