OneBite.Dev - Coding blog in a bite size

How to prevent blocked or forbidden errors when scraping with Newspaper3k

How to get past blocked or forbidden responses when scraping content with Newspaper3k in Python

Python has a great library for scraping articles called Newspaper: https://newspaper.readthedocs.io

But sometimes the download fails on a URL with a blocked or forbidden error. It's probably a protection set up by the site owner.

Here’s how we can get around this by setting a browser-like User-Agent:

from newspaper import Config

# Identify as a regular browser instead of the library's default user agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

# Optional: add your proxy information
PROXIES = {
  'http': "http://ip_address:port_number",
  'https': "https://ip_address:port_number"
}

config = Config()
config.browser_user_agent = USER_AGENT
config.proxies = PROXIES  # optional
config.request_timeout = 10  # seconds

Providing proxy information is optional.
