Improve HTTP status code handling #812

Open
mnmkng opened this issue Oct 20, 2020 · 3 comments
Labels
Epic An epic is a large body of work that can be broken down into a number of smaller issues. t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@mnmkng
Member

mnmkng commented Oct 20, 2020

There are several issues now which are related to the way we handle HTTP status codes in crawlers.

  • CheerioCrawler throws an exception when it encounters a 500+ status code, but processes responses with 400+ status codes as usual.
  • PuppeteerCrawler does not throw an exception for any status code.
  • SessionPool makes both crawlers throw on 401, 403 and 429 status codes.

None of the above is configurable. We need to design an easy-to-understand process and configuration for handling status codes. Maybe it could all be left to SessionPool by making useSessionPool true by default. Or we could have two configurable layers: a throwOnStatusCodes option on the crawlers and a retireSessionOnStatusCodes option on SessionPool.
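
A minimal sketch of how the two-layer idea could behave, for discussion only. Neither throwOnStatusCodes nor retireSessionOnStatusCodes exists in the library; the resolveStatusAction helper and the example code lists are hypothetical and only illustrate that the two decisions would be independent of each other.

```javascript
// Hypothetical sketch: these options are proposals, not an existing API.
// `throwOnStatusCodes` would be a crawler option and
// `retireSessionOnStatusCodes` a SessionPool option.
function resolveStatusAction(statusCode, options = {}) {
  const {
    throwOnStatusCodes = [],
    retireSessionOnStatusCodes = [],
  } = options;

  return {
    // Layer 1 (SessionPool): should the current session be retired?
    retireSession: retireSessionOnStatusCodes.includes(statusCode),
    // Layer 2 (crawler): should the request fail and be retried?
    throwError: throwOnStatusCodes.includes(statusCode),
  };
}

// Example configuration (assumed values, chosen for illustration).
const config = {
  throwOnStatusCodes: [500, 502, 503],
  retireSessionOnStatusCodes: [401, 403, 429],
};

console.log(resolveStatusAction(403, config));
console.log(resolveStatusAction(500, config));
console.log(resolveStatusAction(404, config));
```

With a split like this, a user who wants to handle a 403 in their own handler would only need to drop 403 from the session layer, without touching how the crawler treats server errors.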

@mnmkng mnmkng added the Epic An epic is a large body of work that can be broken down into a number of smaller issues. label Oct 20, 2020
@Wintereise

Right. We've run into a situation where we need to handle a 403 explicitly, and have so far come up empty on how that could be best achieved.

It doesn't appear that any of our handler code is ever invoked when this happens. While I could edit crawler_utils.js so that it is, does anyone know of a simpler workaround?

@mnmkng
Member Author

mnmkng commented Apr 10, 2021

Yeah, this is long overdue and we still have not found the time to add those features. A better, although similarly awkward, workaround than editing crawler_utils.js would be this:

const { STATUS_CODES_BLOCKED } = require('apify/build/constants');

// It looks like this: [401, 403, 429], so you could remove 403 in place:
const index = STATUS_CODES_BLOCKED.indexOf(403);
if (index !== -1) STATUS_CODES_BLOCKED.splice(index, 1);

It's important to modify the array in place. You can also inject your own custom status codes. This constant is internal, so it can stop working at any point without notice, but for the time being, it should solve your problem.
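
To illustrate why the in-place modification matters, here is a small self-contained sketch. It does not touch the real library: the blocked array and the isBlocked function are stand-ins, assuming the library holds a reference to the same array object that the constants module exports.

```javascript
// Stand-in for the shared array exported by the internal constants module.
const blocked = [401, 403, 429];

// Stand-in for library code that checks responses against that same array.
const isBlocked = (statusCode) => blocked.includes(statusCode);

// Building a new array does NOT affect the library, because it still
// holds the original array object.
let myCopy = blocked;
myCopy = myCopy.filter((code) => code !== 403);
const blockedAfterFilter = isBlocked(403);

// Mutating the original array in place is visible to the library.
const index = blocked.indexOf(403);
if (index !== -1) blocked.splice(index, 1);
const blockedAfterSplice = isBlocked(403);

// Custom status codes can be injected the same way.
blocked.push(503);

console.log({ blockedAfterFilter, blockedAfterSplice, blocked });
```

The same reasoning is why reassigning the destructured constant to a new array would have no effect on the crawler.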

@B4nan
Member

B4nan commented Jul 27, 2022

#1423 adds configurability for the session pool:

const crawler = new CheerioCrawler({
  sessionPoolOptions: { blockedStatusCodes: [401, 403, 429, 500] },
  // ...
});

This will be available in Crawlee 3.0.2.

@mtrunkat mtrunkat added the t-tooling Issues with this label are in the ownership of the tooling team. label Sep 12, 2023
4 participants