Faster Ruby scraping program with threads

In the previous article, we learned how to make a simple web scraping program with Ruby. In this article, we will make it faster using threads.

Threads only help when you need to fetch multiple resources. If you only need to fetch a single resource, a plain sequential request is simpler and no slower.

What is a Thread in Ruby?

“Threads are the Ruby implementation for a concurrent programming model.” - Ruby official docs on Thread
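To make that definition concrete, here is a minimal sketch with no network involved. The `slow_task` method is a hypothetical stand-in for a slow request (it just sleeps), and the thread names are made up for illustration. Three tasks of 0.2 seconds each should finish in roughly 0.2 seconds total rather than 0.6, because the threads wait at the same time.

```ruby
# Hypothetical stand-in for a slow network call: it only sleeps.
def slow_task(name)
  sleep 0.2
  "#{name} done"
end

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)

# Start one thread per task; each block begins running right away.
threads = %w[a b c].map { |name| Thread.new { slow_task(name) } }

# Thread#value waits for the thread to finish and returns the block's result.
results = threads.map(&:value)

elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

puts results.inspect
puts "Elapsed: #{elapsed.round(2)} seconds"
```

`Thread#value` is used here instead of `join` only because we want the return value of each block; it waits for the thread in the same way.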

Case study

Single resource scraping

Let’s say we have this simple program that fetches a website’s title:

require 'net/http'
require 'nokogiri'

url = "https://www.google.com"
response = Net::HTTP.get_response(URI(url))

if response.code != "200"
    puts "Error: #{response.code}"
    exit
end

doc = Nokogiri::HTML(response.body)

print(doc.title)

We’re using Nokogiri and Net::HTTP, as discussed in the web scraping with Ruby for beginners article.

Multiple resources scraping (The slow way)

Now let’s say we want to fetch multiple websites. We put the URLs in an array and iterate through it, fetching them one at a time.

require 'net/http'
require 'nokogiri'

urls = [
    "https://www.w3schools.com/",
    "https://www.google.com/",
    "https://www.youtube.com/",
    "https://www.wikipedia.com/",
    "https://onebite.dev/"   
]

urls.each do |url|
    response = Net::HTTP.get_response(URI(url))

    if response.code != "200"
        puts "Error fetching #{url}: #{response.code}"
    else
        doc = Nokogiri::HTML(response.body)
        puts "Successfully fetched content from #{url}"
        # For demonstration purposes, we're printing only the title of each website.
        # You can print the entire content by using `puts doc`.
        puts "Title: #{doc.title}"
    end
end

Benchmark results (running 3 times):

Program finished at 4393.287 milliseconds
Program finished at 3686.382 milliseconds
Program finished at 3237.197 milliseconds
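The sequential script above does not show how it was timed. One possible way to get numbers like these is Ruby's standard `Benchmark` module. In this sketch, `elapsed_ms` is a hypothetical helper name, and the `sleep` call is only a placeholder for the `urls.each` loop.

```ruby
require 'benchmark'

# Hypothetical helper: runs the given block and returns the elapsed
# wall-clock time in milliseconds.
def elapsed_ms
  Benchmark.realtime { yield } * 1000
end

# Placeholder workload; in the real script this would be the urls.each loop.
ms = elapsed_ms { sleep 0.1 }

puts "Program finished at #{ms.round(3)} milliseconds"
```

Wrapping the whole loop in a single block keeps the timing code separate from the scraping code, so the same helper can time the threaded version too.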

Multiple resources scraping (With thread)

Let’s make it faster! We can use threads to fetch multiple resources at the same time. Thread.new creates a new thread and runs the given block inside it.

Calling join on a thread makes the main program wait for that thread to finish.

require 'net/http'
require 'nokogiri'

beginning_time = Time.now

urls = [
    "https://www.w3schools.com/",
    "https://www.google.com/",
    "https://www.youtube.com/",
    "https://www.wikipedia.com/",
    "https://onebite.dev/"   
]

threads = []

urls.each do |url|
    threads << Thread.new do
        response = Net::HTTP.get_response(URI(url))

        if response.code != "200"
            puts "Error fetching #{url}: #{response.code}"
        else
            doc = Nokogiri::HTML(response.body)
            puts "Successfully fetched content from #{url}"
            # For demonstration purposes, we're printing only the title of each website.
            # You can print the entire content by using `puts doc`.
            puts "Title: #{doc.title}"
        end
    end
end

# Join all threads to ensure they complete before the main thread exits
threads.each(&:join)


end_time = Time.now
puts "Program finished at #{(end_time - beginning_time)*1000} milliseconds"

Benchmark results (running 3 times):

Program finished at 799.802 milliseconds
Program finished at 647.711 milliseconds
Program finished at 763.01 milliseconds

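On average, that’s about 5 times faster (roughly 737 milliseconds versus 3,772 milliseconds)! The gain comes from overlapping the network waits: while one thread is waiting for a response, Ruby can run the others.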
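One thing to watch out for: when several threads call `puts` at the same time, their output can arrive in any order, and an exception raised inside a thread is re-raised when you call `join`. A possible variation, sketched below, collects each thread's result and prints everything at the end. `fetch_all` is a hypothetical helper, the example.com URLs are placeholders, and the block passed to it only simulates a fetch with `sleep` so the sketch runs without network access.

```ruby
# Hypothetical helper: runs the given block once per URL, each in its own
# thread, and returns a hash of url => result in the original URL order.
def fetch_all(urls)
  threads = urls.map do |url|
    Thread.new do
      begin
        yield(url)
      rescue StandardError => e
        # Turn a failure into a value so one bad URL does not stop the rest.
        "Error: #{e.message}"
      end
    end
  end

  # Thread#value waits for each thread and returns its block's result.
  urls.zip(threads.map(&:value)).to_h
end

# Placeholder URLs and a simulated fetch; in the real script the block would
# call Net::HTTP.get_response and return the parsed title.
titles = fetch_all(%w[https://example.com/a https://example.com/b]) do |url|
  sleep 0.1
  raise "boom" if url.end_with?("b")
  "Title of #{url}"
end

titles.each { |url, title| puts "#{url}: #{title}" }
```

Because the results are printed from the main thread after all threads have finished, the output order always matches the order of the URLs.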

Benchmark results

Here are the benchmark results for the normal loop versus threads:

Method	Run #	Time (milliseconds)
Using Thread	1	799.802
Using Thread	2	647.711
Using Thread	3	763.01
Normal Loop	1	4393.287
Normal Loop	2	3686.382
Normal Loop	3	3237.197