Web scraping with Ruby (for beginners, 2023)

Here is how you can scrape website content using Ruby.


Tools

An HTTP client to request the website data
An HTML parser to extract content from the response

Ruby ships with net/http in its standard library for performing simple GET requests. For parsing, we will use Nokogiri, a Ruby gem that makes it easy to work with XML and HTML.


How to scrape website content with Ruby

Install Nokogiri

gem install nokogiri

Create a new Ruby file and name it whatever you want.

touch scrape.rb

Load a page with a simple GET request using net/http.

require 'net/http'

url = "https://www.google.com" # Must be an absolute URL
response = Net::HTTP.get_response(URI(url))

# Stop with a non-zero exit status if the request did not succeed
if response.code != "200"
  puts "Error: #{response.code}"
  exit 1
end

puts response.body

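We’re using get_response instead of the get method so that we can check the status code first, in case the request failed. The get method returns only the body as a string, while get_response returns a response object that also carries the status code.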
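
The same check can also be written against the response class rather than the code string. The sketch below is one possible way to do it: status_line is a hypothetical helper name (not part of net/http), and the two responses are built by hand so the example runs without a network connection.

```ruby
require 'net/http'

# Hypothetical helper: Net::HTTPSuccess is the parent class of every
# 2xx response, so this accepts 200, 201, 204 and so on.
def status_line(response)
  if response.is_a?(Net::HTTPSuccess)
    "OK: #{response.code}"
  else
    "Error: #{response.code}"
  end
end

# Hand-built responses for illustration only; a real one would come
# from Net::HTTP.get_response(URI(url)).
ok        = Net::HTTPOK.new('1.1', '200', 'OK')
not_found = Net::HTTPNotFound.new('1.1', '404', 'Not Found')

puts status_line(ok)
puts status_line(not_found)
```

Note that response.code is a string ("200"), which is why the comparison in the script uses quotes.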

Run it to make sure everything works.

ruby scrape.rb

Using Nokogiri, we can easily parse the HTML document:

require 'net/http'
require 'nokogiri'

url = "https://www.google.com"
response = Net::HTTP.get_response(URI(url))

# Stop with a non-zero exit status if the request did not succeed
if response.code != "200"
  puts "Error: #{response.code}"
  exit 1
end

doc = Nokogiri::HTML(response.body)

print(doc)

Now it’s up to you which elements you want to retrieve from the parsed document. For example, to get the meta tags:

# get meta data
doc.css('meta').each do |meta|
  puts meta
end

What is css in Nokogiri?

The css method takes a CSS selector. It has nothing to do with reading stylesheets; it selects any element you want, such as the meta tags above or more specific nodes like:

doc.css('nav ul.menu li a', 'article h2').each do |link|
  puts link.content
end

You can pass a single selector, or several selectors separated by commas.

Real example: scraping the w3schools homepage

In this example, we want to get all available subjects on w3schools.com.

First, we need to find where this information is located by opening the site and inspecting the page.


When this tutorial was written, the lessons were wrapped in a div with the id subtopnav, and each lesson was wrapped in an <a> tag.

Remember, the owner of the website can change the site structure at any time, so the selector may need updating.

require 'net/http'
require 'nokogiri'

url = "https://www.w3schools.com/"
response = Net::HTTP.get_response(URI(url))

# Stop with a non-zero exit status if the request did not succeed
if response.code != "200"
  puts "Error: #{response.code}"
  exit 1
end

doc = Nokogiri::HTML(response.body)

doc.css('#subtopnav a').each do |link|
  puts link.content
end

If you want the whole tag, print link itself instead.
link.content prints only the text content.