mnmkng opened this issue on Oct 20, 2020 · 3 comments
Labels
Epic: An epic is a large body of work that can be broken down into a number of smaller issues.
t-tooling: Issues with this label are in the ownership of the tooling team.
There are several issues now which are related to the way we handle HTTP status codes in crawlers.
CheerioCrawler throws an exception when it encounters a 500+ status code, but processes responses with 400+ status codes normally.
PuppeteerCrawler does not throw an exception for any status code.
SessionPool makes both crawlers throw on 401, 403 and 429 status codes.
None of the above is configurable. We need to design an easy-to-understand process and configuration for the handling of status codes. Maybe it could all be left to SessionPool by making useSessionPool true by default. Or we could have two configurable layers: a throwOnStatusCodes option on the crawlers and a retireSessionOnStatusCodes option on SessionPool.
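To make the two-layer idea concrete, here is a minimal sketch of how the proposed options might interact. Neither throwOnStatusCodes nor retireSessionOnStatusCodes exists in the SDK at the time of writing. The resolveStatusCodeAction helper, the throwOnServerErrors flag and the default values are hypothetical, chosen only to mirror the current behavior described above.

```javascript
// Hypothetical sketch: none of these names are part of the Apify SDK.
// The defaults mirror the behavior described in this issue.
const DEFAULT_RETIRE_SESSION_ON = [401, 403, 429];

function resolveStatusCodeAction(statusCode, options = {}) {
    const {
        // Crawler layer: explicit list of codes that fail the request.
        throwOnStatusCodes = [],
        // Crawler layer: CheerioCrawler-like default of failing on 500+.
        throwOnServerErrors = true,
        // SessionPool layer: codes that mark the session as blocked.
        retireSessionOnStatusCodes = DEFAULT_RETIRE_SESSION_ON,
        useSessionPool = true,
    } = options;

    // The session layer only applies when the pool is in use.
    const retireSession = useSessionPool
        && retireSessionOnStatusCodes.includes(statusCode);

    // A retired session also fails the request, as it does today.
    const throwError = retireSession
        || throwOnStatusCodes.includes(statusCode)
        || (throwOnServerErrors && statusCode >= 500);

    return { retireSession, throwError };
}
```

Under these assumptions, passing `{ retireSessionOnStatusCodes: [401, 429] }` would let a 403 response reach the page handler, while `{ throwOnStatusCodes: [404] }` would make a 404 fail the request without touching the session.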
mnmkng added the Epic label on Oct 20, 2020
Right. We've run into a situation where we need to handle a 403 explicitly, and have so far come up empty on how that could be best achieved.
It doesn't appear any of our handler code is ever invoked when this happens. While I could edit crawler_utils.js to make it so it is, does anyone know of a simpler workaround?
Yeah, this is long overdue and we still have not found the time to add those features. A better workaround, although just as awkward as editing crawler_utils.js, would be this:
const { STATUS_CODES_BLOCKED } = require('apify/build/constants');
// It looks like this: [401, 403, 429], so you could:
delete STATUS_CODES_BLOCKED[1];
It's important to modify the array in place. You can also inject your own custom status codes. It's internal, so it can stop working at any point without notice, but for the time being, it should solve your problem.
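Why the in-place part matters can be shown with a small self-contained sketch. It does not import the SDK: the `constants` object and `sdkReference` below are stand-ins, and it assumes the SDK checks status codes through a reference to the array that it captured at import time.

```javascript
// Stand-in for the SDK's internal constant. The real one lives in
// 'apify/build/constants' and may change without notice.
const constants = { STATUS_CODES_BLOCKED: [401, 403, 429] };

// Stand-in for SDK code that keeps its own reference to the array.
const { STATUS_CODES_BLOCKED: sdkReference } = constants;
const isBlocked = (statusCode) => sdkReference.includes(statusCode);

// Reassigning creates a new array; the captured reference is unaffected.
constants.STATUS_CODES_BLOCKED = [401, 429];
const blockedAfterReassign = isBlocked(403); // still true

// Mutating in place changes the array that is actually read.
const index = sdkReference.indexOf(403);
if (index !== -1) sdkReference.splice(index, 1);
const blockedAfterSplice = isBlocked(403); // now false

// Custom status codes can be injected the same way.
sdkReference.push(418);
```

Using `splice` instead of `delete` is a stylistic choice here: both mutate the original array, but `delete` leaves an empty slot behind while `splice` removes the element and shortens the array.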