How to avoid blocked or forbidden responses when scraping with Newspaper3k
How to get past blocked or forbidden content errors when scraping with Newspaper3k in Python
Python has a great library for scraping articles called Newspaper: https://newspaper.readthedocs.io
But sometimes it fails on a URL with one of these errors:
- HTTP 400 Bad Request error
- HTTP 403 Forbidden client error
- HTTP 406 Not Acceptable client error
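This is probably a protection set up by the site owner, which typically rejects requests that do not look like they come from a regular browser.
Here’s how we can work around it by setting a browser-like User-Agent: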
from newspaper import Config
# A User-Agent string copied from a real browser (Firefox 78 on macOS)
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
# Optional: add your proxy information
PROXIES = {
    'http': "http://ip_address:port_number",
    'https': "https://ip_address:port_number"
}
config = Config()
config.browser_user_agent = USER_AGENT
config.proxies = PROXIES  # optional
config.request_timeout = 10  # seconds
Providing proxy information is optional.
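The configuration above only takes effect once it is passed to an article. As a minimal sketch, assuming the usual Newspaper3k flow of `Article(url, config=config)` followed by `download()` and `parse()`, the steps could be wrapped like this. The helper name `fetch_article` and its parameters are made up for illustration and are not part of the library.

```python
from typing import Optional

# Same browser-like User-Agent string as in the configuration above.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) "
    "Gecko/20100101 Firefox/78.0"
)


def fetch_article(url: str,
                  user_agent: str = USER_AGENT,
                  proxies: Optional[dict] = None,
                  timeout: int = 10):
    """Download and parse one article using a browser-like User-Agent.

    fetch_article is a hypothetical helper, not a Newspaper3k function.
    """
    # Imported inside the function so newspaper3k is only required
    # when the helper is actually called.
    from newspaper import Article, Config

    config = Config()
    config.browser_user_agent = user_agent
    config.request_timeout = timeout  # seconds
    if proxies:  # optional, leave as None to connect directly
        config.proxies = proxies

    # Hand the config to the article so the download uses these settings.
    article = Article(url, config=config)
    article.download()
    article.parse()
    return article
```

For example, `article = fetch_article("https://example.com/some-article")` (a placeholder URL) should give access to `article.title` and `article.text`. If the site still answers with 403 after this change, it is probably checking more than the User-Agent, and the header alone may not be enough.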