Web scraping with Ruby (for beginners, 2023)
Beginner guide on how to scrape website content with Ruby
Here is how you can scrape website content using Ruby.
Tools
- Request website data with HTTP Client
- Parsing the HTML content
Ruby has a built-in library, net/http, for performing simple GET requests. For the parsing, we will use Nokogiri, a Ruby gem that makes it easy to work with XML and HTML.
Link: Nokogiri
How to scrape website content with Ruby
Install Nokogiri
gem install nokogiri
Create a new Ruby file and name it whatever you want.
touch scrape.rb
Fetch a page with a simple GET request using net/http.
require 'net/http'
url = "https://www.google.com" # Must be absolute URL
response = Net::HTTP.get_response(URI(url))
if response.code != "200"
  puts "Error: #{response.code}"
  exit 1 # non-zero exit status signals the failure to the shell
end
puts response.body
We’re using get_response instead of the get method so we can check the status code first, in case the request failed.
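To make the difference concrete: Net::HTTP.get returns only the body as a string, while Net::HTTP.get_response returns a response object whose status can be inspected. Below is a minimal sketch that wraps the check in a helper. The method name fetch_body is hypothetical and not part of net/http.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper (not part of net/http): returns the response body
# for a successful request and nil otherwise.
def fetch_body(url)
  response = Net::HTTP.get_response(URI(url))

  # Net::HTTPSuccess covers every 2xx status, not only "200".
  return response.body if response.is_a?(Net::HTTPSuccess)

  warn "Error: #{response.code} #{response.message}"
  nil
end
```

Calling fetch_body("https://www.google.com") would then give you the HTML on success and nil on failure. Note that get_response does not follow redirects, so a 301 or 302 ends up in the error branch of this sketch.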
Run it to make sure everything works.
ruby scrape.rb
Using Nokogiri, we can easily parse the HTML document:
require 'net/http'
require 'nokogiri'
url = "https://www.google.com"
response = Net::HTTP.get_response(URI(url))
if response.code != "200"
  puts "Error: #{response.code}"
  exit 1 # non-zero exit status signals the failure to the shell
end
doc = Nokogiri::HTML(response.body)
print(doc)
Now it’s up to you which elements you retrieve from the parsed document. For example, to get the meta tags:
# get meta data
doc.css('meta').each do |meta|
  puts meta
end
What is css in Nokogiri?
The css method takes a CSS selector and returns every node that matches it. It is not limited to styling-related markup: the selector can target any element you want, including the meta tags above or more specific nodes like:
doc.css('nav ul.menu li a', 'article h2').each do |link|
  puts link.content
end
You can search with a single selector, or with several selectors separated by commas.
Real example: scraping the w3schools homepage
In this example, we want to get all available subjects on w3schools.com.
First, we need to find out where this information is located by opening the site and inspecting its HTML.
When this tutorial was written, the lessons were wrapped in a div with the id subtopnav, and each lesson was wrapped in an a (anchor) tag.
Remember, the owner of the website can change the site structure at any time, so a selector that works today may stop matching later.
require 'net/http'
require 'nokogiri'
url = "https://www.w3schools.com/"
response = Net::HTTP.get_response(URI(url))
if response.code != "200"
  puts "Error: #{response.code}"
  exit 1 # non-zero exit status signals the failure to the shell
end
doc = Nokogiri::HTML(response.body)
doc.css('#subtopnav a').each do |link|
  puts link.content
end
If you want the whole tag, including its attributes, print link itself. Calling content on it returns only the text inside the tag.