Currently, when crawling pages with Pseudo URLs, it often happens that the crawler spends most of its time enqueueing thousands of pages in the request queue and the user has no way of limiting this behavior. They may set the maxRequestsPerCrawl option, but that only limits the pages actually crawled, not the requests enqueued. Thus, the user may end up with 100 pages crawled and thousands in the queue.
This will be especially important when switching to the per-request priced persistent queue.
We could add an enqueuedRequests property to RequestQueue that would be initialized automatically to the current value from storage and then incremented in memory with each added request.
We would also add an options.requestLimit configuration property to RequestQueue. After reaching this limit, .addRequest() would return null (or some similar sentinel) and stop enqueueing further requests.
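To make the proposal concrete, here is a minimal sketch of the intended behavior. It uses a hypothetical in-memory `LimitedRequestQueue` class, not the real RequestQueue; `initialCount` stands in for the value that would be loaded from storage, and skipping duplicates before checking the limit is an assumption of this sketch.

```javascript
// Hypothetical in-memory stand-in for RequestQueue; not the actual SDK class.
class LimitedRequestQueue {
    constructor({ requestLimit = Infinity, initialCount = 0 } = {}) {
        this.requestLimit = requestLimit;
        // In the proposal this counter would be initialized from storage.
        this.enqueuedRequests = initialCount;
        this.seenUrls = new Set();
        this.pending = [];
    }

    addRequest(request) {
        // Assumption: duplicates are reported as already present and never count.
        if (this.seenUrls.has(request.url)) {
            return { wasAlreadyPresent: true };
        }
        // Once the limit is reached, refuse to enqueue and signal it with null.
        if (this.enqueuedRequests >= this.requestLimit) {
            return null;
        }
        this.seenUrls.add(request.url);
        this.pending.push(request);
        this.enqueuedRequests += 1;
        return { wasAlreadyPresent: false };
    }
}
```

With `requestLimit: 2`, the third distinct URL passed to `addRequest()` yields `null` and `enqueuedRequests` stays at 2, so callers enqueueing links from a page can stop as soon as they see `null`.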
I think this is a good idea. Maybe I'd name the option differently, e.g. maxRequestCount, to make it clearer.
One note: if the new request has forefront: true, should we enqueue it or not when the limit is reached? To be strictly logically correct, we should, since forefront means the request has some kind of priority.
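A rough sketch of the forefront exception, under the assumption that forefront requests bypass the limit but still count toward enqueuedRequests. `PriorityAwareQueue` is a hypothetical in-memory class for illustration only, and `maxRequestCount` is the option name suggested above, not an existing option.

```javascript
// Hypothetical in-memory queue; illustrates one possible forefront policy.
class PriorityAwareQueue {
    constructor({ maxRequestCount = Infinity } = {}) {
        this.maxRequestCount = maxRequestCount;
        this.enqueuedRequests = 0;
        this.pending = [];
    }

    addRequest(request, { forefront = false } = {}) {
        const limitReached = this.enqueuedRequests >= this.maxRequestCount;
        // Regular requests are rejected once the limit is reached.
        if (limitReached && !forefront) {
            return null;
        }
        // Forefront requests go to the head of the queue, even past the limit.
        if (forefront) {
            this.pending.unshift(request);
        } else {
            this.pending.push(request);
        }
        this.enqueuedRequests += 1;
        return request;
    }
}
```

The trade-off is that the limit becomes soft: code that marks every request as forefront could still grow the queue without bound, so a stricter policy may be preferable.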
@mtrunkat @jancurn