Thursday, September 30, 2010

Nokogiri HTML and XML parser

Nokogiri is a new Ruby gem that provides HTML, XML, SAX and Reader parsers.

It parses and searches XML/HTML faster than Hpricot (the current de facto Ruby HTML parser) and boasts XPath support, CSS3 selector support (a big deal, because CSS3 selectors are mega powerful), and the ability to be used as a "drop in" replacement for Hpricot.

In one Hpricot vs. Nokogiri benchmark, Nokogiri was 7 times faster at initially loading an XML document, 5 times faster at searching for content with an XPath expression, and 1.62 times faster at searching for content with a CSS selector.

Here is an example:

require 'nokogiri'
require 'open-uri'

# Get a Nokogiri::HTML::Document for the page we're interested in...
# (URI.open is the open-uri entry point on current Ruby versions;
# plain open() no longer accepts URLs there.)
doc = Nokogiri::HTML(URI.open('http://www.google.com/search?q=tenderlove'))

# Do funky things with it using Nokogiri::XML::Node methods...

# Search for nodes by CSS
doc.css('h3.r a.l').each do |link|
  puts link.content
end

####

# Search for nodes by XPath
doc.xpath('//h3/a[@class="l"]').each do |link|
  puts link.content
end

####

# Or mix and match.
doc.search('h3.r a.l', '//h3/a[@class="l"]').each do |link|
  puts link.content
end

Source: http://nokogiri.org/

