OneBite.Dev - Coding blog in a bite size

Faster Ruby scraping program with threads

Learn how to make a faster web scraping program with Ruby using threads

We’ve learned how to make a simple web scraping program with Ruby in the previous article. In this article, we will learn how to make it faster using threads.

Threads only help when you need to fetch multiple resources at once. If you only need to fetch a single resource, a plain sequential request is simpler and there is nothing to run in parallel.

What is Thread in Ruby?

“Threads are the Ruby implementation for a concurrent programming model.” - Ruby official docs on Thread
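To make this concrete, here is a minimal sketch of the calls the rest of the article relies on: Thread.new starts a block running concurrently, and Thread#value waits for the thread (like join) and returns the block’s result. The sleep call stands in for a slow network request, and the method names slow_square and squares_with_threads are made up for illustration.

```ruby
# Hypothetical stand-in for slow work such as an HTTP request.
def slow_square(n)
  sleep(0.1) # simulate waiting on the network
  n * n
end

def squares_with_threads(numbers)
  # Start one thread per number; the blocks run concurrently.
  threads = numbers.map { |n| Thread.new { slow_square(n) } }
  # Thread#value joins each thread and returns its block's result.
  threads.map(&:value)
end

p squares_with_threads([1, 2, 3]) # prints [1, 4, 9]
```

Because the three sleeps overlap, the whole call takes roughly as long as one of them rather than the sum of all three.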

Case study

Single resource scraping

Let’s say we have this simple program to fetch a website’s title:

require 'net/http'
require 'nokogiri'

url = "https://www.google.com"
response = Net::HTTP.get_response(URI(url))

if response.code != "200"
    puts "Error: #{response.code}"
    exit
end

doc = Nokogiri::HTML(response.body)

print(doc.title)

We’re using Nokogiri and Net::HTTP, as discussed in the web scraping with Ruby for beginners article.

Multiple resources scraping (The slow way)

Now let’s say we want to fetch multiple websites. We put the URLs in an array and iterate through it, fetching one site at a time.

require 'net/http'
require 'nokogiri'

urls = [
    "https://www.w3schools.com/",
    "https://www.google.com/",
    "https://www.youtube.com/",
    "https://www.wikipedia.com/",
    "https://onebite.dev/"   
]

urls.each do |url|
    response = Net::HTTP.get_response(URI(url))

    if response.code != "200"
        puts "Error fetching #{url}: #{response.code}"
    else
        doc = Nokogiri::HTML(response.body)
        puts "Successfully fetched content from #{url}"
        # For demonstration purposes, we're printing only the title of each website.
        # You can print the entire content by using `puts doc`.
        puts "Title: #{doc.title}"
    end
end

Benchmark results for this version (three runs) are listed in the table at the end of the article.

Multiple resources scraping (With thread)

Let’s make it faster! We can use threads to fetch multiple resources at once. Thread.new creates a new thread and runs the given block in it.

Calling join on a thread makes the main program wait for that thread to finish.

require 'net/http'
require 'nokogiri'

beginning_time = Time.now

urls = [
    "https://www.w3schools.com/",
    "https://www.google.com/",
    "https://www.youtube.com/",
    "https://www.wikipedia.com/",
    "https://onebite.dev/"   
]

threads = []

urls.each do |url|
    threads << Thread.new do
        response = Net::HTTP.get_response(URI(url))

        if response.code != "200"
            puts "Error fetching #{url}: #{response.code}"
        else
            doc = Nokogiri::HTML(response.body)
            puts "Successfully fetched content from #{url}"
            # For demonstration purposes, we're printing only the title of each website.
            # You can print the entire content by using `puts doc`.
            puts "Title: #{doc.title}"
        end
    end
end

# Join all threads to ensure they complete before the main thread exits
threads.each(&:join)


end_time = Time.now
puts "Program finished in #{(end_time - beginning_time) * 1000} milliseconds"
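Printing from inside each thread means lines from different sites can show up in any order. One alternative, sketched here under stated assumptions, is to have each thread return its result and collect everything with Thread#value once the threads are done. The helper fetch_all, the FAKE_TITLES hash, and the example.com-style URLs are hypothetical; the fetcher is passed in as a callable so the sketch runs without a network connection.

```ruby
# Hypothetical helper: runs `fetcher` for every URL in its own thread and
# returns a hash of url => result, in the same order as `urls`.
def fetch_all(urls, fetcher)
  threads = urls.map do |url|
    Thread.new do
      begin
        fetcher.call(url)
      rescue StandardError => e
        # Turn a failure into a value so one bad URL does not stop the rest.
        "Error: #{e.message}"
      end
    end
  end
  # Thread#value waits for each thread and returns its block's result.
  urls.zip(threads.map(&:value)).to_h
end

# Stand-in fetcher so the sketch runs offline. In the real program this would
# wrap Net::HTTP.get_response and Nokogiri::HTML(response.body).title.
FAKE_TITLES = {
  "https://a.example/" => "Site A",
  "https://b.example/" => "Site B"
}.freeze

fake_fetcher = lambda do |url|
  sleep(0.05)            # simulate network latency
  FAKE_TITLES.fetch(url) # raises KeyError for unknown URLs
end

urls = FAKE_TITLES.keys + ["https://missing.example/"]
fetch_all(urls, fake_fetcher).each do |url, title|
  puts "#{url}: #{title}"
end
```

The requests still run concurrently, but the output is printed by the main thread in the order of the original array.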

In the benchmark below, the threaded version ran roughly four to six times faster than the normal loop!

Benchmark results

These are the benchmark results for the normal loop and the threaded version:

Method        Run #   Time (milliseconds)
Using Thread  1       799.802
Using Thread  2       647.711
Using Thread  3       763.01
Normal Loop   1       4393.287
Normal Loop   2       3686.382
Normal Loop   3       3237.197
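The gap in these numbers comes from waiting: most of each request is spent idle on the network, and Ruby threads can overlap that waiting. The following sketch reproduces the effect without any network access. The names measure_ms and fake_request are hypothetical, and the 100 ms sleep is an assumed stand-in for one HTTP request.

```ruby
# Returns the wall-clock time of the given block in milliseconds.
def measure_ms
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000
end

# Hypothetical stand-in for one HTTP request that takes about 100 ms.
def fake_request
  sleep(0.1)
end

TASKS = 5

# Normal loop: each request starts only after the previous one finishes.
loop_ms = measure_ms { TASKS.times { fake_request } }

# Threads: all requests start right away and wait at the same time.
thread_ms = measure_ms do
  Array.new(TASKS) { Thread.new { fake_request } }.each(&:join)
end

puts "Normal loop:  #{loop_ms.round(1)} ms"
puts "Using thread: #{thread_ms.round(1)} ms"
```

With simulated requests the loop takes about the sum of the sleeps while the threaded version takes about as long as a single sleep. Real timings depend on the network and the sites involved, so expect the ratio to vary from run to run, as it does in the table above.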