-14

Stack Exchange is implementing four new restrictions to prevent unauthorized automated access to the network. These restrictions should not affect day-to-day network usage; however, they are still meaningful changes, and we need your help ensuring that authorized network use remains uninhibited.

Here are the restrictions we are adding to the platform:

  1. We will restrict the amount of content that Microsoft Bing stores after crawling the site (details here). This action is not expected to significantly decrease the relevance of search results, but may impact auxiliary functionality such as Bing Chat and full-page previews.
  2. We will heavily rate-limit access to the platform from Microsoft’s IP addresses. This includes Microsoft Azure and, therefore, will impact any community member running a service that accesses the site via Microsoft Azure. However, traffic will generally be exempt from this restriction if it uses the API - see the list of exemptions below. We may also implement similar restrictions for other cloud platforms (e.g. AWS and GCP) at a later date.
  3. We will alter the robots.txt file on our chat servers to disallow all indexing, except indexing by known and recognized search engines.
  4. We will restrict the rate at which users can submit automated requests to the site through scripts and code without using the API, i.e., actions originating from sources other than human interaction with a browser.

Of these steps, #1 (Bing search) is unlikely to directly impact the users of the site, barring marginal impacts to search relevance for those who use Bing.

For #2 (MS Azure), we do not know of any community projects that are utilizing Microsoft Azure as their primary means to access the network, but if there are any such projects, we would like to know about them so that we can find a way to ensure your access is unimpeded.

For #3, we do not expect any substantial impact to community members. We are intentionally leaving Stack Exchange Chat open to indexing by common search engines so that users will be minimally impacted by the change.
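
For illustration, a policy along these lines can be expressed in robots.txt by allowing named crawlers and disallowing everyone else. This is only a sketch: the agent names shown here are examples, and the actual file we deploy may differ.

    # Known search engines: allowed to index
    User-agent: Googlebot
    Disallow:

    User-agent: bingbot
    Disallow:

    # Everything else: no indexing
    User-agent: *
    Disallow: /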

Finally, #4 - scripts and automated requests. Well, that’s the big one, eh? The change we are making will impact userscripts that issue a high volume of requests to the site. Here are some details we can share with you:

  • We are evaluating what an appropriate rate-limit would be to match our desired level of restriction. Our initial guess is that the new rate-limit will be set to around 60 requests per minute.
  • This new rate-limit will not apply to site access through the API. Rate-limiting for the API is managed separately (a short sketch of API-based access follows this list).
  • Users accessing the site through a browser normally should not encounter any issues after this change.
  • Users who have userscripts installed that do not issue a high volume of requests should not be impacted by this change. For example, scripts that are purely graphical enhancements will not be impacted.
  • Users who have userscripts installed that do issue a high volume of requests may encounter unexpected behavior while navigating the site. We do not know yet exactly how disruptive these behaviors are liable to be, and we are evaluating our options for minimally disruptive approaches that still achieve our rate-limiting goals.
  • We do not plan on implementing stricter general use rate-limits. In other words, we will only limit traffic if it comes from non-human sources. You don’t need to worry about encountering new rate-limits through normal usage of the site. (Unfortunately I’m unable to share details as to how this is implemented.)
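
As referenced above, here is a minimal sketch of reading data through api.stackexchange.com rather than scraping HTML pages. It is illustrative only: the /2.3 endpoint and the backoff field are part of the public API, while the commented-out key is a placeholder for one registered on Stack Apps.

    # Minimal sketch: reading data via the public API (the exempted path above)
    # instead of scraping HTML. "YOUR_APP_KEY" is a placeholder for a key
    # registered on Stack Apps; omitting it just lowers the daily quota.
    import time
    import requests

    API_ROOT = "https://api.stackexchange.com/2.3"

    def api_get(endpoint, **params):
        params.setdefault("site", "stackoverflow")
        # params["key"] = "YOUR_APP_KEY"
        resp = requests.get(f"{API_ROOT}/{endpoint}", params=params, timeout=30)
        resp.raise_for_status()
        data = resp.json()
        # The API signals its own throttling via a "backoff" field (seconds);
        # sleeping for it keeps a tool from tripping API-side rate-limits.
        if "backoff" in data:
            time.sleep(data["backoff"])
        return data["items"]

    for q in api_get("questions", pagesize=5, order="desc", sort="activity"):
        print(q["question_id"], q["title"])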

Exemptions to the new rate-limits

There are three categories of exemptions to these new rate-limits. Traffic falling into any of the following buckets will not be restricted:

  1. Traffic that reaches the network through the API will not be restricted by these new rate-limits.
  2. In order to minimize disruption to curator tools, automated traffic to chat (a.k.a. “Bonfire”) will not be subject to the new rate-limits.
  3. In order to minimize disruption to moderator workflow, moderators will be fully exempt from these new rate-limits on any site they moderate.

We have created the exemption list above to help keep the potential for community impact small. We do not want to impact moderator userscripts and other 3rd party tools moderators use to get work done, so moderators are fully exempt from these new restrictions on the site they moderate. Similarly, we know that curators route a significant amount of curation activity through Chat (“Bonfire”), so we are fully excluding Chat from the new rate-limits to minimize impact.

We need your help to make sure this change is as smooth as possible.

We need your help. I’ll be real with you: we’re making this change. We are currently monitoring the expected impact of the restrictions we are drafting. We will implement restrictions on Microsoft Azure and related services over the next few weeks. Bullet #4 above will be implemented no earlier than the last week of October. When it goes live, we want the impact on you to be as minimal as possible – none at all, if we can manage it.

We hope the exceptions above considerably reduce the potential for curator impact. However, they may not provide complete coverage for site curators, and we’d like to avoid as much disruption to the curator workflow as we can. We strongly recommend userscript developers switch over to API usage as soon as possible, but we recognize that this isn’t the easiest change to make, especially on relatively short notice. To this end, we need to start… (drumroll)… a list.

Do you know of any commonly-used curator userscripts or bots that could submit a meaningful amount of traffic to the network? To be on the safe side, we’re interested in reviewing any userscript, program, or bot that may submit more than 10 requests per minute.

I have created a community wiki answer below. Edit any userscript, program, bot, or service into that answer to add it to our list to review.

With any luck, we’ll be able to get this change out as smoothly as possible and with as little disruption as we can manage. Thanks for your help in advance!

11
  • 18
    Google isn’t satisfied with buying the data from you, they need you to also sabotage their competitors, eh? Who doesn’t love anticompetitive monopolies? Google’s actions have destroyed the internet, and their search engine doesn’t even work any more (search for “Facebook” and be shocked by how few results you get - Google has literally thrown out 99% of the internet in recent years), but god forbid anyone be allowed to compete fairly with their broken product.
    – Jeremy
    Commented 5 hours ago
  • 1
    Usually on Stack Overflow it helps to include relevant information from links within the post; it feels like it might be relevant that you're trying to stop content from showing up in Bing Chat, which your first link talks about? Or is this really just about money again.
    – Sayse
    Commented 4 hours ago
  • 10
    TOTALLY not related to the data dump kill project, right? Commented 4 hours ago
  • @Sayse it's always about the money Commented 4 hours ago
  • 2
    So we're just giving up on that whole day dream of building a library of knowledge for the public good, are we? If the only way someone can find what they need in your repository of knowledge is to pay for the tool to search it, it's not for the public good; it's for profit. It wouldn't be so bad, except you're basically paywalling content that was donated to you because we were under the impression that we were doing something to advance knowledge for everyone, not just the people who could pay.
    – ColleenV
    Commented 2 hours ago
  • 3
    Would it be possible to do this in a way that doesn't affect the entire network but excludes things like review queues, the Staging Ground, user pages (maybe tag pages as well) etc and other things commonly accessed by userscripts but not from unwanted means? If you just want to protect posts from unwanted GenAI training, why not just rate-limit public posts for automated traffic and maybe question lists if necessary?
    – dan1st
    Commented 2 hours ago
  • 3
    @dan1st Part of the goal of compiling a list of key curator utilities is to assess the correct approach to minimizing disruption. Allowing access to certain parts of the site is an option for mitigating the impact. For example, an exemption for Chat / Bonfire was not part of the first-draft plan (a long time ago), but after a deeper review of community tools, it became clear that categorically exempting Chat was the most expedient way to avoid community disruption. Put another way, I'm reiterating what's in the post: please, list your tools, and we'll evaluate what we can do.
    – Slate StaffMod
    Commented 1 hour ago
  • To be clear, for Bing search, will it be given the nocache or noarchive tag? Commented 56 mins ago
  • @Sonic We are using NOARCHIVE, not NOCACHE.
    – Slate StaffMod
    Commented 47 mins ago
  • 1
    @Slate Perfect, thanks! Since the discontinuation of Google cache, Bing cache is now the largest provider of cached search results on the Web, so it'd be really bad to lose that. Glad to confirm it's not going away. Commented 42 mins ago
  • @Sonic For sure. Just to make explicit then, to the extent possible, we really don't want to disrupt the quality of Stack Exchange's search results.
    – Slate StaffMod
    Commented 36 mins ago

6 Answers

20

The following is a table of scripts, services, bots, and programs used by curators to aid in maintaining the network. We will review this table and communicate with community members as needed to ensure continuity of service.

As moderators will be granted a blanket exception to the new rate-limits, there is no need to include such services if they are moderator-only and used exclusively on the site one moderates. There is also no need to list a community-driven tool if it utilizes the API for all requests (excluding Chat activity), or is otherwise covered by the exemptions listed in the main post. Finally, if your userscript or tool submits fewer than 10 automated requests to the site per minute, there is no need to list it here, either.

To include a row, please add:

  • The name of your script or tool and a canonical link to where it may be found.
  • The function that your script or tool performs.
  • The reach that script or tool has across the network (e.g., who would be impacted if it went down)
  • Who the point of contact is for that script or tool

Thank you to the network’s moderators, who helped start this list.

Script/bot name and link | Function | Est. impact / reach | Point of contact
Charcoal | Spam and abuse mitigation | Used heavily network-wide | Charcoal HQ
SOBotics | Various (moderation/curation) | Large-scale automated access to SO | Various/SO chatroom
Global Flag Summary | Viewing one's flag statuses on all sites at once | Script makes a request to all per-site flag history pages at once | Stack Apps post
SE Data Dump Transformer | Automatically downloads and optionally converts the new-style data dumps from all SE sites | Significant community and archival value by ensuring continued access to the data dump | Zoe
12
  • 1
    Please accept this so that it would be on top. Commented 4 hours ago
  • 5
    @ShadowWizard "You can accept your own answer in 2 days" - lol. I tried. I'll come back and do it in 47 hours.
    – Slate StaffMod
    Commented 4 hours ago
  • Pretty sure that staff can get a magic button for that, I'll ask around. Commented 4 hours ago
  • 7
    Let's just voting-mob it to the top instead. ;) - Side note: I don't think self-answers, even if CW, can be pinned to the top via acceptance (see: meta.stackexchange.com/a/5235)
    – Spevacus Mod
    Commented 4 hours ago
  • I thought so too, but apparently not.
    – Slate StaffMod
    Commented 4 hours ago
  • @Spevacus ah, oops. :] Commented 4 hours ago
  • @Slate why don't you ask for one...i'd imagine that could be useful sometimes...
    – Starship
    Commented 4 hours ago
  • 4
    The magic button is UPDATE Posts SET OwnerUserId = <another staff member> WHERE Id = 403003
    – Glorfindel Mod
    Commented 3 hours ago
  • @Glorfindel Don't even need to do that. Staff have a post mod menu item to change the post owner that's built in.
    – Catija
    Commented 1 hour ago
  • @Catija I'd actually forgotten that we can reassign ownership. So I can do it. I just need to wait until tomorrow to ask at work for a willing <s>inbox notification victim</s> volunteer to graciously take on this answer's ownership.
    – Slate StaffMod
    Commented 45 mins ago
  • 1
    I volunteer as tribute
    – Cesar M StaffMod
    Commented 40 mins ago
  • 1
    @CesarM You are now the post owner for this answer. Congratulations! May the community have mercy on your inbox.
    – Slate StaffMod
    Commented 39 mins ago
16

Since SE decided to respond to everyone except me on the internal announcement, this answer is a direct and mostly unmodified copy of the questions I posted on the internal announcement [mod-only link] on 2024-08-22. A couple of notes have been added after the fact to account for answers provided to other people that further emphasise the various points I've outlined. Additionally, a couple of extra notes have been added, as a form of bot detection was rolled out some time around 2024-08-30 and resulted in problems a week later (2024-09-06) that killed Boson. This rollout was apparently unrelated to this change, but it took several days and calling out SE[6] to actually figure that out.


TL;DR:

  1. What is a request and how do you count it? The post demonstrates how one singular webpage load can amount to nearly 60 requests entirely on its own, and that most page loads invoke more than one request, which drastically reduces the number of effective page loads possible before blocks appear.
  2. Have you accounted for power users with a significant amount of traffic from, among other things, mass-opening search pages and running self-hosted stackapps that don't take the form of userscripts?
    • Even though individual applications may not exceed 60 or even just 10 requests per minute, combined traffic from power users can trivially exceed this in active periods, especially when operating at scale.
  3. While not formatted as a question, how this is implemented actually matters. See both questions 1 and 2, and the rest of this post. This point isn't possible to summarise.
    • Failure to actually correctly detect traffic can and likely will result in applications like the comment archive being taken offline, because its traffic is combined with my traffic and everything else I host and use. A moderator exemption will not help the other things I host that make requests that aren't authed under my account.[5]
  4. How exactly is the block itself implemented? Is the offending user slapped with an IP ban? This affects whether or not this change screws over shared networks, including (but not limited to) workplace networks and VPNs
  5. The API is mentioned as a migration target, but it isn't exhaustive enough, nor are extensions to it planned for some of the core functionality that risks being affected by this change - also not formatted as a question, but the API is not a fully viable replacement for plain requests to the main site.

See the rest of the post for context and details.


I cannot add anything to the list, as it either can't be reviewed or doesn't meet the threshold to be added to the list, but as a self-hoster, I am heavily impacted by this.

We do not plan on implementing stricter general use rate-limits. In other words, we will only limit traffic if it comes from non-human sources. [...] (Unfortunately I’m unable to share details as to how this is implemented.)

See, this worries me. If you [SE] can't tell us how you tell bots from non-bots, that means there's a good chance of incorrect classifications in both directions. I have no idea how you're implementing this, but if you [SE], for example, make the mistake of relying on stuff "humans normally do", you [SE] can and will run into problems with adblock users.

I'm taking an educated guess here, because this is the only trivial way to axe stuff like Selenium without affecting identical-looking real users. Going by user-agent is another obvious choice, but this doesn't block Selenium and similar tooling.

There's also the third option of using dark magic in CloudFlare, in which case, I'm completely screwed for reasons I'll describe in the bullet point on Boson momentarily.


There are two problems I see with this if the incorrect classifications are bad enough:

  1. Any IP with more than a couple people actively using the site can and will be slapped with a block, even if they're normal users
    • Not going after IPs makes it trivial to tell how the bot detections are made, which means it's trivial to bypass without using VPNs.
    • VPNs may also be disproportionately affected here; there are many moderators who regularly use VPNs for all or most internet use who will get slapped with blocks.
  2. Any IP with a sufficient number of stackapps running (of any type) can and will run into problems.

I fall solidly into category #2, and occasionally #1 (via both work and varying VPN use).

On my network, I host the following stackapps:

  • Boson, the bot running the comment archive (2 API requests per minute + 2 * n chat messages, n >= 0, + multiple requests to log in)

    • Very occasionally, CloudFlare decides to slap Boson's use of the API, which sometimes sends it into a reboot-relog spiral. I'm pretty sure I've mitigated this, but there are still plenty of failure modes I haven't accounted for. When this happens, I often then get slapped with a recaptcha that I have to manually solve to get Boson back up. Also interestingly, CloudFlare only ever slaps API use, and not chat use, even though the number of chat messages posted (in a minute where there are new comments to post to chat) exceeds the number of API requests made by a potentially significant margin.

      Also, generally speaking, chatbots on multiple chat domains have to do three logins in a short period of time. Based on observational experience, somewhere between 4 and 8 logins is a CloudFlare trigger.

      The number of CF-related problems has gone drastically up in the last few months as a result of CF being, well, CF. It detects perfectly normal traffic as being suspicious on a semi-regular basis, and this kills my tools and is annoying to recover from, because a few of the blocks are hard to get around. Finishing up the note from earlier in this answer, if dark CloudFlare magic is used to implement this system, I will have problems within the first few hours of this being released, because that's just how CloudFlare works.

      Don't you just love CloudFlare? /s

      Editor's note: On 2024-09-06, an apparently unrelated bot detection change happened that independently killed boson. It initially looked like parts of the rate limiting change, but SE has denied this. They also denied making any recent changes to bot detection, and instead suspected CloudFlare might've done something that broke it separately. In either case, this highlights my point; CloudFlare-based bot detection will break stuff, intentionally or otherwise. If they don't even have to enable something that can break community bots and tooling, CF itself is a problem as long as it's operational.

      For context, based on the internet archive and observations from devtools, the specific bot detection system used is JavaScript detections, which requires a full browser environment to run. This is not an option with Boson, because it's fully headless, and written in C++ where I'm not masochistic enough to set up webdriver support.

      IA observations suggest this particular form of bot detection was enabled on 2024-08-30. It's still unclear why it took a week to run into problems - it might be a coincidence, or it might've only become a problem on 2024-09-06 for complicated CF config reasons I'm not going to pretend to understand.

      If you too would like to get blocked from bot detection, you can curl https://stackoverflow.com/login (intentionally 404 - 404 pages result in far more aggressive blocks than any other pages) four times. Also, as a preemptive note, SE was notified about the details of this several times (after the fact, however[7]) when they took an interest after being called out in public.

  • An unnamed, partly unregistered, closed-source bulk-action comment moderation tool. It does 1 API request per minute (and has for the last 1-2 years). When actively used[1], it does up to 15-20 per minute, with a combination of API requests (the main comment deletion method) and the undocumented bulk deletion endpoint[2][3], because this thing is designed for moderation at scale.

    • In addition to the requests already listed, in certain configurations, it too goes through an automated login process.
    • The majority of the requests are API requests, but the login required to make bulk deletion work cannot be done through the API.
    • Just like boson, the comment collection process semi-regularly gets killed by CloudFlare during API calls.
  • Every quarter, I'll be downloading the data dump - and likely have to repeat several requests, because the download system appears to be flaky under load. This is done automatically through Stack Exchange data dump downloader and transformer, which makes somewhere between 10 and 20 page requests per second, including redirects, and with peaks during the login process for reasons that will be shown later.

    Editor's note: The data dump downloader shouldn't be affected. Based on science done on the JS detection bot killer from 2024-09-06 to 2024-09-07, Selenium is unlikely to be affected. If the rate limit becomes a problem, I'll artificially lower the request volume until it cooperates (a throttling sketch follows below). Please open an issue on GitHub if it breaks anyway - ensuring we have continued access to the data dump is the only priority I have left atm.

  • I have an RSS feed set up to read a meta feed, running every 15 minutes

  • I had plans to host a few more stackapps as well, but those plans were delayed by the strike and strike fallout. If those plans continue at some point, I'll be making ✨ even more requests ✨, as these were also planned in the form of chatbots, and chatbots cannot be moved over to the API.

  • I very occasionally run various informal tooling to monitor Stuff. These are in the form of bash scripts that use curl, and ntfy to tell me if whatever it is I'm looking for has happened. These are all applications where the API is an infeasible strategy for something this tiny. The last such script I made ran every 6 hours

  • Whenever I do bot/stackapp development, the number of requests goes up significantly for a short period of time (read: up to several hours) due to higher-than-normal request rates for debugging purposes

Editor's note: Most of these, with the exception of Boson (and the data dump downloader), were axed a few days ahead of the initial announcement deadline as an attempt to load shed after not getting any response from SE for nearly two weeks, and the initial deadline approaching rapidly. Boson, as previously mentioned, was later killed when an unrelated bot detection change appeared to take the place of the rate limit change. All the applications listed here (again, except the data dump downloader) are now disabled and/or axed, and all my future plans for stackapps are scrapped due to the lack of response from SE.
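
For illustration, "artificially lowering the request volume" in a Selenium-driven tool can be as simple as enforcing a minimum gap between page loads. This is only a sketch with placeholder timing and URLs, not the actual downloader code.

    # Sketch only: enforce a minimum gap between Selenium page loads so a
    # browser-driven tool stays well under ~60 requests per minute.
    # The gap and URLs are placeholders; this is not the downloader's real code.
    import time
    from selenium import webdriver

    class ThrottledDriver:
        def __init__(self, driver, min_gap=5.0):
            self.driver = driver
            self.min_gap = min_gap   # seconds between loads (~12 loads/minute)
            self._last_load = 0.0

        def get(self, url):
            # Sleep off whatever remains of the minimum gap, then load the page.
            remaining = self.min_gap - (time.time() - self._last_load)
            if remaining > 0:
                time.sleep(remaining)
            self.driver.get(url)
            self._last_load = time.time()

    td = ThrottledDriver(webdriver.Firefox())
    try:
        for url in ["https://stackoverflow.com", "https://meta.stackexchange.com"]:
            td.get(url)
    finally:
        td.driver.quit()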

In addition, I run a crapton of userscripts (including a few very request-heavy userscripts), and actively use SO and chat. Under my normal use, I can load several pages per minute, and depending on how you count requests (a problem I'm commenting on later), this totals potentially hundreds of web requests.

During burninations, I also load a full pagesize 50 search page worth of questions to delete. This means that over the course of around 20 seconds, my normal use can burn through around 55 page loads, not including requests to delete questions. Mass-opening search pages is a semi-common use-case as well, and will result in problems with the currently proposed limits - 60/min is extremely restrictive for power users.

Though the vast majority of these individual things do not exceed 60 requests per minute, when you combine this activity, I have a problem. If any of my activity is incorrectly (or even correctly, in the case of the bots I have running) identified as automated, I get yeeted and my tools get killed. If said killing is done by CloudFlare, recovery is going to be a pain in the ass.

With limits as strict as the proposed limits, it'll become significantly harder to self-host multiple stackapps without running into even more rate limiting problems. The vast majority of current anti-whatever tooling is IP-based, so if this system is too, the activity will be totalled, and I will eventually and inevitably get IP-blocked.

Even though some applications have a very infrequent usage time, if Something Happens™ or the run times happen to overlap, I can trivially exceed the request limit. This is especially true of logins following an internet outage or site outage, as an outage that kills my scripts forces relogs, and again, logging in is expensive when done at scale. It's already annoying enough with CloudFlare getting in the way.

While a moderator exemption will reduce some of the problem, if the result is an IP block, I have a[4] bot account that does a chunk of these requests. Other requests again are fully unauthenticated, but these make up an extreme minority of the total number of requests made.

Editor's note: as demonstrated by the bot detection rollout earlier in September, while there may be exemptions here, that apparently does not extend to anything else that could break bots and tooling. We may be safe from the rate limit, but not necessarily the next bot detection system they enable quietly, or have enabled for them by CloudFlare.

Misc. other problems

To this end, moderators will be granted a unilateral exception to the new rate-limit on any site they moderate

Moderators are far from the only people with a use pattern like mine - there are lots of active users doing a lot in the network, or on their favourite site.


Note that we strongly recommend userscript developers switch over to API usage as soon as possible

There are multiple things there aren't endpoints for. There's no way to log in, there's no way to download the data dumps, there's no way to log into chat, there's no way to post to chat, etc.

Advanced Flagging, for example, posts feedback to certain bots by sending messages in chat. It does so with manual non-API calls, because there's simply no way to do this via the API.
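
For context, this is roughly what such a non-API chat call looks like - fetch the fkey embedded in a chat page, then POST to the room's messages/new endpoint. The sketch below assumes an already logged-in session (which, as noted, cannot be obtained through the API); the room ID is a placeholder, and this is not Advanced Flagging's actual code.

    # Rough sketch of posting a chat message without the API: scrape the fkey
    # from a chat page, then POST to the room's messages/new endpoint.
    # Assumes `session` already carries logged-in cookies; ROOM_ID is a placeholder.
    import re
    import requests

    CHAT_HOST = "https://chat.stackoverflow.com"
    ROOM_ID = 123456  # placeholder

    def post_chat_message(session: requests.Session, text: str) -> None:
        page = session.get(f"{CHAT_HOST}/rooms/{ROOM_ID}")
        fkey = re.search(r'name="fkey"[^>]*value="([^"]+)"', page.text).group(1)
        resp = session.post(
            f"{CHAT_HOST}/chats/{ROOM_ID}/messages/new",
            data={"text": text, "fkey": fkey},
        )
        resp.raise_for_status()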

The exclusion of endpoints for some of these things is either implied to be, or has explicitly been stated to be, by design. As convenient as it would be, the API as it currently stands is not a viable substitute for everything userscripts or bots need to do. This is especially true for certain actions, given the heavy rate limiting, tiny quota, and small to non-existent capacity for bulk API requests.

Editor's note: SE answered someone else internally who asked about an API for chat around 3 hours after I posted my questions, where they confirmed they would not be adding chat support to the API in the foreseeable future. This further underlines my point; not everything can just be switched over to the API and be expected to work. This will result in stuff breaking.

What is a request?

Here are a few samples of requests made from various pages. Note that non-SO domains are omitted, including gravatar, metasmoke, ads, cookielaw, and third-party JS CDNs. Also note that in this entire section, I'll be referring specifically to stackoverflow.com, but that's just because I don't feel like writing a placeholder for any domain in the network. Whenever I refer to stackoverflow.com, it can be substituted with any network URL.
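
For reference, tallies like the ones below can be reproduced by exporting a HAR file from the browser's network tab and counting entries per domain - a small sketch, with a placeholder filename:

    # Count requests per domain from a HAR export of the browser's network tab.
    # "page.har" is a placeholder filename.
    import json
    from collections import Counter
    from urllib.parse import urlparse

    with open("page.har", encoding="utf-8") as f:
        har = json.load(f)

    counts = Counter(
        urlparse(entry["request"]["url"]).netloc
        for entry in har["log"]["entries"]
    )

    for domain, n in counts.most_common():
        print(f"{n:4d}  {domain}")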

Requests from a random question page with no answers:

Requests made (incl. userscripts, excl. blocked) Requests blocked (uBlock) Userscript requests Domain
13 2 cdn.sstatic.net
4a 2 i.sstatic.net
3 stackoverflow.com
2 qa.sockets.stackexchange.com
1 1 api.stackexchange.com

Editor's note: After the release of the previously mentioned unrelated bot detection and the (also unrelated) recent tags experiment, requests to stackoverflow.com went from 3 to 7, meaning 9 questions/min is enough to get rate limited. The bot detection adds 3 requests alone, so all the counts are now out of date - and they're on the lower end of things.

Requests from a random question page with 4 answers:

Requests made (incl. userscripts, excl. blocked) Requests blocked (uBlock) Userscript requests Domain
13 2 cdn.sstatic.net
12a 3 i.sstatic.net
3 stackoverflow.com
2 qa.sockets.stackexchange.com

Requests from the flag dashboard:

Requests made (incl. userscripts, excl. blocked) Requests blocked (uBlock) Userscript requests Domain
31a i.sstatic.net
8 2 cdn.sstatic.net
1 stackoverflow.com
1 qa.sockets.stackexchange.com

Requests made when expanding any post with flags:

Requests made (incl. userscripts, excl. blocked) Requests blocked (uBlock) Userscript requests Domain
3 stackoverflow.com
1 i.sstatic.net

Requests from https://stackoverflow.com/admin/show-suspicious-votes:

Requests made (incl. userscripts, excl. blocked) Requests blocked (uBlock) Userscript requests Domain
18 2 cdn.sstatic.net
33a i.sstatic.net
1 stackoverflow.com
1 qa.sockets.stackexchange.com

Requests from the login page through a completed login:

Requests made (incl. userscripts, excl. blocked and failed) Requests blocked (uBlock) Failed requests Userscript requests Domain
1 1 askubuntu.com
30 6 cdn.sstatic.net
18 i.sstatic.net
1 1 mathoverflow.net
1 1 serverfault.net
1 1 stackapps.com
1 1 stackexchange.com
6 stackoverflow.com
1 1 superuser.com

a: Changes based on the number of users on the page. On large Q&A threads or any form of listing page, this can get big.

While mods are allegedly exempt, this still shows a pretty big problem: how do you count one request? Even if it's just requests to stackoverflow.com, question pages amplify one page load to 3 requests, and that's just with the current requests made. Depending on how you count the number of requests, just loading one singular strategic page (probably a search page with pagesize=50) can be enough to cap out the entire request limit. The vast majority of pages in the network do some amount of request amplification.

The login page in particular is, by far, the worst. One singular login makes requests to every other site in the network, meaning the successful requests total 12 just for that one action.

This leads to the question posed in the section header: what is a request, and how is it counted? How do you [SE] plan to ensure that the already restrictive 60 requests/min limit doesn't affect normal users?

Footnotes

  1. Admittedly, it hasn't for over a year because strike and the following lack of motivation to keep going
  2. Part of the manual tooling deals with bulk operations on identical comments. I could queue them up to the mountain of a backlog I generate when reviewing comments, or I could tank a large number of problematic comments in one request, drastically freeing up quota for other parts of the application.
  3. This doesn't run all the time either, and is in a separate module from the bulk of the API requests
  4. Technically two, but one of them is not in use yet because strike.
  5. I also just realised that I can bypass this by running all the stackapps under my account, but this means running live bot accounts with moderator access, and insecurely storing credentials on a server exposed to the internet. For obvious reasons, this is a bad idea, but if self-hosting isn't accounted for, this is the only choice I have to maintain the comment archive.
  6. I still stand by my comments saying Stack Exchange, Inc. killed the comment archive - whether or not it's related to the change announced here, someone at Stack Exchange, Inc. rolled out bot detection either intentionally, or accidentally by willingly continuing to use CloudFlare in spite of it already having been disruptive to bots and other community tools. Granted, the challenge platform detection in particular is far more disruptive than other parts of CloudFlare, but CloudFlare has killed API access for me several times (and I'm not counting the bug on 2024-08-16 - I'm exclusively talking about rate limiting blocks by CF under normal operation, and while complying with the backoff signals in API responses).
  7. SE was not notified ahead of time because the initial rollout timeline for the rate limit coincided with the bot blocking kicking in, and I had been ignored internally for three weeks straight at the time, so I assumed this was intentional and opted to ensure the comment archive could remain functional somewhere over writing a bug report. There are a lot more private details as to why the bug report came after the fact (read: after SE suddenly decided to respond to me), which SE has been told multiple times.
10
  • 4
    Hi Zoe, it appears as though the vast majority of your answer assumes that the rate-limits we are implementing will not work as designed, e.g. cannot properly differentiate between automated and human requests. If the system is not working as intended, you should file a bug report so we can fix it. If you suspect the system is rate-limiting in an unreasonably aggressive way, and the blocks are interfering with your normal usage of the site, open a support or feature request on Meta; alternatively, send in a support email, and be sure to include relevant diagnostic information.
    – Slate StaffMod
    Commented 5 hours ago
  • 2
    Additionally, as mentioned in the original post, I am not able to answer questions about how the block functions on a technical level. My aim is to share enough information so that you may understand whether the system is behaving properly, without needing to know the exact details of its operation. Unfortunately, however, I can't even partially answer those particular questions as you have written them, because they are seeking the information I have already said I cannot provide.
    – Slate StaffMod
    Commented 5 hours ago
  • 1
    Finally, your question about how requests are counted is again predicated on the notion that the implementation will not work as designed. However, it cannot be broken in the way suggested here, because a casual user using the site can easily and quickly surpass 60 requests per minute sum total with only a few page loads. If the implementation does not work as designed, please file a bug report. If it does work as designed, then it should be fairly clear what a "request" is to someone writing a script that sends requests. Thank you for your feedback.
    – Slate StaffMod
    Commented 5 hours ago
  • 8
    @Slate "Finally, your question about how requests are counted is again predicated on the notion that the implementation will not work as designed." - the data dump downloader is browser-based. It's right in the middle of a grey area. It's automation, but a full browser that loads stuff like a normal browser. One request isn't straight-forward in this context. If it's affected, users can be too. Commented 4 hours ago
  • 1
    I mean, at some point if you take enough steps that purposefully evade these rate-limits, it's inevitable that you're going to find grey area. I can tell you that browser emulation goes against the spirit and goal of these rate-limits, if nothing else. If we have a reliable way to restrict browser-emulated automated requests, I expect we probably will do that, provided the browser-emulated requests don't fall under the exceptions above. But we are also being very careful to avoid impacting regular user participation, and that is of course the highest priority.
    – Slate StaffMod
    Commented 4 hours ago
  • On a more practical level, there are all sorts of browser emulation approaches and I'm not sure exactly how any given approach is going to be handled. You're sort of venturing into inconsistent and unsupported territory with that (and, frankly? this is true even aside from the rate-limits). If there's something specific you want us to test to ensure continued functionality, send it over and I'll see if we can make time to at least evaluate it.
    – Slate StaffMod
    Commented 4 hours ago
  • 7
    It isn't browser emulation, it's a full browser. It's using the full version of Firefox through Selenium, a common browser automation tool that also happens to be relatively common in scraping (it also supports chrome, though my tool does not support anything but FF). The source code is at github.com/LunarWatcher/se-data-dump-transformer, along with usage instructions Commented 4 hours ago
  • 1
    All right, I'll add it to the list above for folks to review. Thanks. I'm leaving the estimated impact / reach blank, because you can speak far better to the impact of your project than I can.
    – Slate StaffMod
    Commented 4 hours ago
  • 7
    @Slate Zoe didn't quite make that assumption, but I do. The rate-limits you're implementing cannot work as designed, because all requests on the internet are issued by computer programs: there is fundamentally no way to distinguish between "computer program with a human sitting behind it" and "computer program running on a laptop in a cupboard". This is a well-studied problem, and to my knowledge, nobody has ever made this work. We can probably assume that it is not generally understood how to do so.
    – wizzwizz4
    Commented 2 hours ago
  • 3
    I respect that view, @wizzwizz4, and I am familiar with these challenges too. Unfortunately the most I can say is: if you encounter issues accessing the network, you should submit a bug report or support request. If you believe that it is not an "if," but a "when," then I want you to know that I understand the rationale and respect the claim, but my advice does not change.
    – Slate StaffMod
    Commented 1 hour ago
14

We need your help

Many of us don't want to help you sell our content, which we agreed was under CC BY-SA, i.e. allowing commercial use. You're killing the data dump (which already was incomplete, e.g. no images), and now you're killing scraping.

The original spirit of Stack Exchange was to share knowledge to anyone for free (as opposed to, e.g. Experts Exchange). Nowadays, the current spirit of Stack Exchange Inc. is how to maximize profits by selling that content.

1
  • 8
    For what little it's worth, there is an unofficial community reupload of both current versions of the new-style data dump (2024-06-30, with versions uploaded on 2024-08-05 and 2024-08-29 respectively). This does not change the situation, but it at least means we continue having access for a little longer, even if SE doesn't seem to want it Commented 3 hours ago
12

Why are Bing & Azure being singled out? What about AWS, GCP, and Alibaba clouds/Search providers?

This strikes me as anti-competitive behavior stemming from Stack Overflow's recent buddy-buddy relationship with Google.

Is this related to an exclusivity or contractual agreement with Google in order to reduce effectiveness of Bing search, and devalue the Azure cloud for services that interact with Stack Overflow?

image of the Monopoly game board

5
  • It is inaccurate to claim other cloud providers will not be subject to analogous rules. While I cannot comment materially on our overall strategy or our relationship with any business in particular, the post does specify that "[we] may also implement similar restrictions for other cloud platforms (e.g. AWS and GCP) at a later date." Based on the information available to me, in the long term, I would expect any service to experience rate-limits if it accesses the network via automated means without prior authorization or a stated policy exemption.
    – Slate StaffMod
    Commented 2 hours ago
  • 5
    @Slate -- saying you "may" do something merely leaves the door open that you might. It certainly doesn't convey that you will or that you intend to do a thing. The post does NOT say that you actually plan to implement changes evenly regardless of the source of the traffic.
    – AMtwo
    Commented 2 hours ago
  • 1
    @Slate I'm also curious if Stack is working with other search providers to ensure parity across them, or is Bing the only search engine targeted by the changes?
    – AMtwo
    Commented 2 hours ago
  • 4
    Assuming at least some non-completely-bad faith/with the benefit of the doubt, it would be possible that "Bing Chat and full-page previews" are considered bigger problems (in the perspective of SE Inc) than other things right now so that might be why these are limited. I won't make statements on whether that's ok (ethically, legally or whatever) or not though.
    – dan1st
    Commented 2 hours ago
  • Very well could simply be a case of "let's test this on just one of the major targets before we hit all of them"
    – Kevin B
    Commented 17 mins ago
3

Could high-rep users be exempted from this? I'd imagine that the vast vast vast majority of problematic automated requests come from lower rep users (or more likely 1 rep/non-registered users) and the vast majority of non-problematic automated requests come from higher rep users.

This would solve a good portion of the problems with this idea.

7
  • 13
    We have actually already taken this into account! In general, high-rep users should not see any additional rate-limiting, even when submitting automated requests to the site e.g. via userscripts. The exact implementation of these rate-limits is less black-and-white than stated in the original post, and more complex exceptions exist than the ones strictly listed there. However, due to the associated complexity, please keep in mind that we can't guarantee those exceptions will be completely infallible.
    – Slate StaffMod
    Commented 5 hours ago
  • @Slate would you mind saying how high rep is high rep? 100? 1000? 10000? Is it rep one site or rep on the site in question? Anyways, I do appreciate that you thought of this.
    – Starship
    Commented 4 hours ago
  • 3
    Pretty sure the exact details are kept hidden or vague on purpose, as there are high rep users who become trolls, and might use that info for bad stuff. Commented 4 hours ago
  • 1
    Okay... well, if so, could I get at least a very general range? Because depending on the answer to that, how much of a help this is differs quite a bit. @ShadowWizard
    – Starship
    Commented 4 hours ago
  • 1
    @Starship I would imagine it's linked to the Trusted User privilege, but we're not going to get any further details. (As much as I dislike the Cloudflare setup, this is one of those things I agree with secrecy about: it is literally security by obscurity, and only effective as long as the obscurity remains.)
    – wizzwizz4
    Commented 2 hours ago
  • While it might solve many problems, there are also many legitimate sockpuppets for automation reasons and these probably don't have that much reputation, e.g. Natty (that might use the API though, I don't know).
    – dan1st
    Commented 1 hour ago
  • @dan1st I didn't say it would solve everything, but it's a step in the right direction, is it not?
    – Starship
    Commented 54 mins ago
1

We will heavily rate-limit access to the platform from Microsoft’s IP addresses.

(How) will this impact people using services like Windows 365 to access SE via a browser?

1
  • 2
    The folks I need to ask have left work for the day. I'll get back to you on this, probably tomorrow.
    – Slate StaffMod
    Commented 41 mins ago
