Faster Ruby scraping program with threads
Learn how to make a faster web scraping program with Ruby using threads
We’ve learned how to make a simple web scraping program with Ruby in the previous article. In this article, we will learn how to make it faster using threads.
Threads only help when you need to fetch multiple resources at once. If you only need to fetch a single resource, a plain sequential request is simpler and just as fast.
What is a Thread in Ruby?
“Threads are the Ruby implementation for a concurrent programming model.” - Ruby official docs on Thread
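As a minimal sketch of what that means in practice, the snippet below starts three threads that each pause briefly and then waits for all of them. Here `sleep` stands in for a slow network call, and the task names and the 0.2 second delay are made up for illustration.

```ruby
# Each thread sleeps to simulate waiting on I/O, then records that it finished.
# The names and the 0.2 second delay are illustrative assumptions.
results = Queue.new # Queue is safe to share between threads

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)

threads = %w[a b c].map do |name|
  Thread.new do
    sleep 0.2
    results << name
  end
end

# Wait for every thread before reading the results.
threads.each(&:join)
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

finished = []
finished << results.pop until results.empty?

puts "Finished: #{finished.sort.join(', ')}"
puts "Elapsed: #{elapsed.round(2)} seconds"
```

Because the three pauses overlap, the elapsed time should be close to 0.2 seconds rather than 0.6.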
Case study
Single resource scraping
Let’s say we have this simple program to fetch a website’s title:
require 'net/http'
require 'nokogiri'
url = "https://www.google.com"
response = Net::HTTP.get_response(URI(url))
if response.code != "200"
  puts "Error: #{response.code}"
  exit
end
doc = Nokogiri::HTML(response.body)
print(doc.title)
We’re using Nokogiri and Net::HTTP, as discussed in the “web scraping with Ruby for beginners” article.
Multiple resources scraping (The slow way)
Now let’s say we want to fetch multiple websites. We put the URLs in an array and iterate through it, fetching one after another.
require 'net/http'
require 'nokogiri'
# Record the start time so we can report how long the whole run takes
beginning_time = Time.now
urls = [
  "https://www.w3schools.com/",
  "https://www.google.com/",
  "https://www.youtube.com/",
  "https://www.wikipedia.com/",
  "https://onebite.dev/"
]
urls.each do |url|
  response = Net::HTTP.get_response(URI(url))
  if response.code != "200"
    puts "Error fetching #{url}: #{response.code}"
  else
    doc = Nokogiri::HTML(response.body)
    puts "Successfully fetched content from #{url}"
    # For demonstration purposes, we're printing only the title of each website.
    # You can print the entire content by using `puts doc`.
    puts "Title: #{doc.title}"
  end
end
end_time = Time.now
puts "Program finished at #{(end_time - beginning_time)*1000} milliseconds"
Benchmark results (running 3 times):
- Program finished at 4393.287 milliseconds
- Program finished at 3686.382 milliseconds
- Program finished at 3237.197 milliseconds
Multiple resources scraping (With thread)
Let’s make it faster! We can use threads to fetch multiple resources at once. Thread.new creates a new thread and starts running its block right away, and calling join on a thread (Thread#join) waits for that thread to finish.
require 'net/http'
require 'nokogiri'
beginning_time = Time.now
urls = [
  "https://www.w3schools.com/",
  "https://www.google.com/",
  "https://www.youtube.com/",
  "https://www.wikipedia.com/",
  "https://onebite.dev/"
]
threads = []
urls.each do |url|
  threads << Thread.new do
    response = Net::HTTP.get_response(URI(url))
    if response.code != "200"
      puts "Error fetching #{url}: #{response.code}"
    else
      doc = Nokogiri::HTML(response.body)
      puts "Successfully fetched content from #{url}"
      # For demonstration purposes, we're printing only the title of each website.
      # You can print the entire content by using `puts doc`.
      puts "Title: #{doc.title}"
    end
  end
end
# Join all threads to ensure they complete before the main thread exits
threads.each(&:join)
end_time = Time.now
puts "Program finished at #{(end_time - beginning_time)*1000} milliseconds"
Benchmark results (running 3 times):
- Program finished at 799.802 milliseconds
- Program finished at 647.711 milliseconds
- Program finished at 763.01 milliseconds
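That is roughly 5 times faster on average: about 737 milliseconds with threads versus about 3,772 milliseconds with the normal loop. The gain comes from the requests being I/O-bound, so the threads wait on the network at the same time instead of one after another.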
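Printing from inside each thread works for a demo, but output from different threads can interleave. One alternative, sketched here, is to let each thread return its result and collect it with `Thread#value`, which waits for the thread and returns the last expression of its block. To keep the sketch runnable offline, `fetch_title` is a hypothetical stand-in for the Net::HTTP and Nokogiri code above, and the titles it returns are made up.

```ruby
# Hypothetical stand-in for the real fetch-and-parse step: it sleeps to
# imitate network latency and returns a made-up title.
def fetch_title(url)
  sleep 0.1
  "Title of #{url}"
end

urls = [
  "https://www.google.com/",
  "https://onebite.dev/"
]

# Pass url into Thread.new so each thread receives it as its own block argument.
threads = urls.map do |url|
  Thread.new(url) { |u| [u, fetch_title(u)] }
end

# Thread#value joins the thread and returns the block's result,
# so the pairs come back in the same order as urls.
titles = threads.map(&:value).to_h

titles.each { |url, title| puts "#{url} => #{title}" }
```

Swapping the body of `fetch_title` for the real request and parsing code should leave the rest of the sketch unchanged.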
Benchmark results
Here are the benchmark results for the normal loop versus threads:
Method | Run # | Time (milliseconds) |
---|---|---|
Using Thread | 1 | 799.802 |
Using Thread | 2 | 647.711 |
Using Thread | 3 | 763.01 |
Normal Loop | 1 | 4393.287 |
Normal Loop | 2 | 3686.382 |
Normal Loop | 3 | 3237.197 |
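To try a comparison like this without hitting real websites, the sketch below uses `Benchmark.realtime` from Ruby's standard library and replaces each request with a `sleep`. The `fake_fetch` helper, the example.com URLs, and the 0.1 second delay are assumptions for illustration, so the absolute numbers will not match the table above.

```ruby
require 'benchmark'

# Hypothetical stand-in for one HTTP request: just wait 0.1 seconds.
def fake_fetch(_url)
  sleep 0.1
end

# Placeholder URLs; they are never actually requested.
urls = Array.new(5) { |i| "https://example.com/page#{i}" }

# Normal loop: each simulated request waits for the previous one.
loop_seconds = Benchmark.realtime do
  urls.each { |url| fake_fetch(url) }
end

# Threads: all simulated requests wait at the same time.
thread_seconds = Benchmark.realtime do
  urls.map { |url| Thread.new { fake_fetch(url) } }.each(&:join)
end

puts "Normal loop:  #{(loop_seconds * 1000).round(1)} milliseconds"
puts "Using thread: #{(thread_seconds * 1000).round(1)} milliseconds"
```

With five simulated requests, the loop should take about five times as long as the threaded version, which mirrors the ratio measured against the real sites.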