-274

Update October 29th, 2024

Our rate-limit for automated traffic coming from AWS is enabled, and is operating as expected.

Update October 21st, 2024

Our rate-limit for automated traffic coming from GCP is now enabled, and is operating as expected. Per our last update, we are proceeding on schedule to rate-limit AWS beginning October 28th.

Update October 7th, 2024

We have finished our temporary logging period, and as of today (October 7th), rate-limits have been enabled for Microsoft Azure IP address ranges (item #2 below). We have also implemented restrictions on what Microsoft stores after crawling the site, and restricted the indexing of Stack Exchange Chat ("Bonfire") servers (items #1 and #3 below).

Our current plan is to turn on rate-limits for Google Cloud Platform (GCP) on October 21st and rate-limits for Amazon Web Services (AWS) on October 28th. The general rate-limit for automated requests from scrapers (item #4 below) is currently scheduled for early November.

At this time, based on our pre-launch monitoring, we do not believe our efforts will impact a significant number of community tools. If your tools experience any Cloudflare-related blocks, we would strongly encourage you to post a support request here on Meta Stack Exchange so we can review your report promptly and offer any assistance we're able to. Please make sure to include a ray ID from the Cloudflare block when you do. For more information on Cloudflare blocks and ray IDs, please see this FAQ post.


Stack Exchange is implementing four new restrictions to prevent unauthorized automated access to the network. These restrictions should not affect day-to-day network usage; however, they are still meaningful changes, and we need your help ensuring that authorized network use remains uninhibited.

Here are the restrictions we are adding to the platform:

  1. We will restrict the amount of content that Microsoft Bing stores after crawling the site (details here). This action is not expected to significantly decrease the relevance of search results, but may impact auxiliary functionality such as Bing Chat and full-page previews.
  2. We will heavily rate-limit access to the platform from Microsoft’s IP addresses. This includes Microsoft Azure and, therefore, will impact any community member running a service that accesses the site via Microsoft Azure. However, traffic will generally be exempt from this restriction if it uses the API - see the list of exemptions below. We are planning to implement analogous restrictions for other cloud platforms, e.g. AWS and GCP, at a later date.
  3. We will alter the robots.txt file on our chat servers to disallow all indexing, except indexing by known and recognized search engines.
  4. We will restrict the rate at which users can submit automated requests to the site through scripts and code without using the API, e.g., actions originating from sources other than human interaction with a browser.

Of these steps, #1 (Bing search) is unlikely to directly impact the users of the site, barring marginal impacts to search relevance for those who search with Bing.

For #2 (MS Azure), we do not know of any community projects that are utilizing Microsoft Azure as their primary means to access the network, but if there are any such projects, we would like to know about them so that we can find a way to ensure your access is unimpeded.

For #3, we do not expect any substantial impact to community members. We are intentionally leaving Stack Exchange Chat open to indexing by common search engines so that users will be minimally impacted by the change.
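
For illustration only: below is a minimal sketch of how a crawler could check whether chat's robots.txt allows it to index a page, using Python's standard urllib.robotparser. The user-agent names are placeholders; the actual allow-list of "known and recognized" search engines has not been published.

```python
# Minimal sketch: check whether a crawler may index a chat page according to
# robots.txt. The user-agent names below are placeholders; the allow-list of
# "known and recognized" search engines has not been published.
from urllib import robotparser

CHAT_ROBOTS_URL = "https://chat.stackexchange.com/robots.txt"

def may_index(user_agent: str, page_url: str) -> bool:
    parser = robotparser.RobotFileParser(CHAT_ROBOTS_URL)
    parser.read()  # fetch and parse the live robots.txt
    return parser.can_fetch(user_agent, page_url)

if __name__ == "__main__":
    for agent in ("Googlebot", "SomeNewSearchBot"):  # hypothetical crawlers
        verdict = "allowed" if may_index(agent, "https://chat.stackexchange.com/rooms/1") else "disallowed"
        print(f"{agent}: {verdict}")
```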

Finally, #4 - scripts and automated requests. Well, that’s the big one, eh? The change we are making will impact userscripts that issue a high volume of requests to the site. Here are some details we can share with you:

  • We are evaluating what an appropriate rate-limit would be to match our desired level of restriction. Our initial guess is that the new rate-limit will be set to around 60 requests per minute (see the throttling sketch after this list).
  • This new rate-limit will not apply to site access through the API. Rate-limiting for the API is managed separately.
  • Users accessing the site through a browser normally should not encounter any issues after this change.
  • Users who have userscripts installed that do not issue a high volume of requests should not be impacted by this change. For example, scripts that are purely graphical enhancements will not be impacted.
  • Users who have userscripts installed that do issue a high volume of requests may encounter unexpected behavior while navigating the site. We do not know yet exactly how disruptive these behaviors are liable to be, and we are evaluating our options for minimally disruptive approaches that still achieve our rate-limiting goals.
  • We do not plan on implementing stricter general use rate-limits. In other words, we will only limit traffic if it comes from non-human sources. You don’t need to worry about encountering new rate-limits through normal usage of the site. (Unfortunately I’m unable to share details as to how this is implemented.)
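
To make the provisional figure above concrete, here is a minimal client-side throttling sketch for a tool that makes direct (non-API) requests. The 60-per-minute value simply mirrors the provisional number quoted in the list; it is not a confirmed limit, and the example URL is a placeholder rather than a recommendation.

```python
# Sketch of a client-side throttle that keeps a script under a requests-per-
# minute budget. The 60/min value mirrors the provisional figure quoted above;
# it is not a confirmed limit, and the example URL is a placeholder.
import time
from collections import deque

import requests

class MinuteThrottle:
    def __init__(self, per_minute: int = 60):
        self.per_minute = per_minute
        self.sent = deque()  # monotonic timestamps of requests in the last minute

    def wait(self) -> None:
        now = time.monotonic()
        while self.sent and now - self.sent[0] >= 60:
            self.sent.popleft()                    # drop entries older than one minute
        if len(self.sent) >= self.per_minute:
            time.sleep(60 - (now - self.sent[0]))  # sleep until the oldest request ages out
            self.sent.popleft()
        self.sent.append(time.monotonic())

throttle = MinuteThrottle(per_minute=60)

def fetch(url: str) -> requests.Response:
    throttle.wait()  # block as needed before each request
    return requests.get(url, timeout=30)

# Example usage (placeholder URL): fetch("https://stackoverflow.com/questions")
```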

Exemptions to the new rate-limits

There are three categories of exemptions to these new rate-limits. Traffic falling into any of the following three buckets will not be restricted by these new rate-limits:

  1. Traffic that reaches the network through the API will not be restricted by these new rate-limits.
  2. In order to minimize disruption to curator tools, automated traffic to chat (a.k.a. “Bonfire”) will not be subject to the new rate-limits.
  3. In order to minimize disruption to moderator workflow, moderators will be fully exempt from these new rate-limits on any site they moderate.

We have created the exemption list above to help keep the potential for community impact small. We do not want to impact moderator userscripts and other 3rd party tools moderators use to get work done, so moderators are fully exempt from these new restrictions on the site they moderate. Similarly, we know that curators route a significant amount of curation activity through Chat (“Bonfire”), so we are fully excluding Chat from the new rate-limits to minimize impact.
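
For developers weighing the API exemption (#1 above), here is a rough sketch of reading data through the public 2.3 API instead of scraping HTML, while honouring the backoff field the API can return when it wants clients to slow down. The site, sort, and page-size values are example choices only, not a prescribed configuration.

```python
# Sketch: fetch recent questions via the public API (exempt from the new
# rate-limits) and honour the `backoff` field it may return. The site, sort,
# and page size are example values only.
import time

import requests

API_URL = "https://api.stackexchange.com/2.3/questions"

def recent_questions(site: str = "stackoverflow", pagesize: int = 10) -> list:
    resp = requests.get(
        API_URL,
        params={"site": site, "pagesize": pagesize, "order": "desc", "sort": "activity"},
        timeout=30,
    )
    data = resp.json()
    if "backoff" in data:
        # The API asks clients to pause this many seconds before the next call.
        time.sleep(data["backoff"])
    return data.get("items", [])

for question in recent_questions():
    print(question["question_id"], question["title"])
```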

We need your help to make sure this change is as smooth as possible.

We need your help. I’ll be real with you: we’re making this change. We are currently monitoring the expected impact of the restrictions we are drafting. We will implement restrictions on Microsoft Azure and related services over the next few weeks. Bullet #4 above will be implemented no earlier than the last week of October. When it goes live, we want the impact on you to be as minimal as possible – none at all, if we can manage it.

We hope the above exceptions help to considerably minimize the potential for curator impact. However, these exceptions may not provide complete coverage for site curators. We’d like to avoid as much disruption to the curator workflow as we can. We strongly recommend userscript developers switch over to API usage as soon as possible, but we recognize that this isn’t the easiest change to make, especially on relatively short notice. To this end, we need to start… (drumroll)... a list.

Do you know of any commonly-used curator userscripts or bots that could submit a meaningful amount of traffic to the network? To be on the safe side, we’re interested in reviewing any userscript, program, or bot that may submit more than 10 requests per minute.

I have created a community wiki answer below. Edit any userscript, program, bot, or service into that answer to add it to our list to review.

With any luck, we’ll be able to get this change out as smoothly as possible and with as little disruption as we can manage. Thanks for your help in advance!

32
  • 213
    TOTALLY not related to the data dump kill project, right? Commented Sep 19 at 18:15
  • 211
    So we're just giving up on that whole day dream of building a library of knowledge for the public good, are we? If the only way someone can find what they need in your repository of knowledge is to pay for the tool to search it, it's not for the public good; it's for profit. It wouldn't be so bad, except you're basically paywalling content that was donated to you because we were under the impression that we were doing something to advance knowledge for everyone, not just the people who could pay.
    – ColleenV
    Commented Sep 19 at 19:55
  • 6
    Would it be possible to do this in a way that doesn't affect the entire network but excludes things like review queues, the Staging Ground, user pages (maybe tag pages as well) etc and other things commonly accessed by userscripts but not from unwanted means? If you just want to protect posts from unwanted GenAI training, why not just rate-limit public posts for automated traffic and maybe question lists if necessary?
    – dan1st
    Commented Sep 19 at 19:59
  • 8
    @dan1st Part of the goal of compiling a list of key curator utilities is to assess the correct approach to minimizing disruption. Allowing access to certain parts of the site is an option for mitigating the impact. For example, an exemption for Chat / Bonfire was not part of the first-draft plan (a long time ago), but after a deeper review of community tools, it became clear that categorically exempting Chat was the most expedient way to avoid community disruption. Put another way, I'm reiterating what's in the post: please, list your tools, and we'll evaluate what we can do.
    – Slate StaffMod
    Commented Sep 19 at 20:42
  • 7
    @Sonic We are using NOARCHIVE, not NOCACHE.
    – Slate StaffMod
    Commented Sep 19 at 21:43
  • 25
    @Slate Perfect, thanks! Since the discontinuation of Google cache, Bing cache is now the largest provider of cached search results on the Web, so it'd be really bad to lose that. Glad to confirm it's not going away. Commented Sep 19 at 21:48
  • 91
    "except indexing by known and recognized search engines" - which are those?
    – Bergi
    Commented Sep 19 at 22:47
  • 11
    @HannahVernon - I'm sure that's what it is too, but there would be a big difference between "we're trying to stop bing chat because we've seen undeniable evidence that se content has been used nefariously to do xyz" vs "GeminiAI are going to stop paying us if we're giving away our milk for free". It definitely is clear which one of these it is more likely to be though
    – Sayse
    Commented Sep 20 at 8:15
  • 32
    @Sayse it is CLEARLY "how can we gate keep access to a Creative Commons freely reproducible resource". Just look back at the last months: the dump mess and so on. Personally I find it quite insulting that we are fed these corporate-lie-filled posts, as if we were too stupid to see the real intent behind these moves. Commented Sep 20 at 8:31
  • 86
    So this post says what SE is doing, but what I can't figure out is why this is being done. What problem are you solving? Are there too many search engines that overload your servers? Or are you very upset by language models learning on the Q&A content? Or what?
    – Ruslan
    Commented Sep 20 at 9:38
  • 52
    robots.txt only keeps honest people out
    – S.S. Anne
    Commented Sep 20 at 18:44
  • 55
    What is the purpose of this change? What kind of abuse was occurring that this is trying to prevent? This just seems to be a completely arbitrary unnecessary change to me, maybe I am missing something.
    – CPlus
    Commented Sep 21 at 15:49
  • 28
    Just curious -- exactly which authorization do I need to access CC BY-SA content? In bulk, if I desire? By script, if I choose? I know that you are desperately seeking to protect what you consider your intellectual property -- but it isn't, is it? The aggregation may be more than its parts, but it's still CC BY-SA, isn't it? Commented Sep 22 at 8:12
  • 15
    "Let's be a bit more realistic in our criticism here." SE is simply putting their own profit way in front of the mission of building a knowledge library free for everyone and do not deserve another bit of content unless they start providing data dumps without limitations again. This would be my realistic criticism here, but who knows, maybe it's also too strong. Commented Sep 30 at 21:04
  • 23
    @CPlus Exactly. They call it "community data protection", but what they really mean is "profit protection". Microsoft didn't pay (or not enough), so they're blocking Microsoft. See stackoverflow.blog/2024/09/30/ongoing-community-data-protection Of course, they're hiding their true motives behind the usual "community" and "socially responsible" jargon, but given the context, it becomes clear enough. Commented Oct 2 at 10:53

14 Answers

58

The following is a table of scripts, services, bots, and programs used by curators to aid in maintaining the network. We will review this table and communicate with community members as needed to ensure continuity of service.

As moderators will be granted a blanket exception to the new rate-limits, there is no need to include such services if they are moderator-only and used exclusively on the site one moderates. There is also no need to list a community-driven tool if it utilizes the API for all requests (excluding Chat activity), or is otherwise covered by the exemptions listed in the main post. Finally, if your userscript or tool submits fewer than 10 automated requests to the site per minute, there is no need to list it here, either.

To include a row, please add:

  • The name of your script or tool and a canonical link to where it may be found.
  • The function that your script or tool performs.
  • The reach that script or tool has across the network (e.g., who would be impacted if it went down)
  • Who the point of contact is for that script or tool

Thank you to the network’s moderators, who helped start this list.

Script/bot name and link | Function | Est. impact / reach | Point of contact
Charcoal | Spam and abuse mitigation | Used heavily network-wide | Charcoal HQ
SOBotics | Various (moderation/curation) | Large-scale automated access to SO | Various/SO chatroom
Global Flag Summary | Viewing one's flag statuses on all sites at once | Script makes a request to all per-site flag history pages at once | Stack Apps post
SE Data Dump Transformer | Automatically downloads and optionally converts the new-style data dumps from all SE sites | Significant community and archival value by ensuring continued access to the data dump | Zoe
Internet Archive Wayback Machine | Archives both Stack Exchange sites and outgoing links | Helps to reduce link rot and maintain the quality of posts (e.g. verify whether an external link is/was legit, whether there was content drift, whether a link needs to be replaced or content incorporated into the post) | Internet Archive
10
  • 4
    Always glad to share my experience, @Slate . :P Curious, is it possible to poke through the top Stack Apps tools and see if any of them would need to be added to this? Maybe a notification there of some sort to let app creators know about this change? I know this question is featured network wide but ... dunno, just throwing out some ideas.
    – Catija
    Commented Sep 19 at 22:34
  • 2
    @Catija That's a good call, I'll ping the StackApps mods.
    – Slate StaffMod
    Commented Sep 19 at 22:46
  • 3
    I'm wondering if I should add Sloshy here. It's a fairly small tool, and if it qualifies, then all of SOBotics probably should too, which is a significant undertaking and possibly outside of my competence.
    – tripleee
    Commented Sep 20 at 5:49
  • The "question" / announcement post repeatedly mentions Bonfire; I'm guessing you mean github.com/SOBotics/Bonfire?
    – tripleee
    Commented Sep 20 at 5:51
  • 11
    @tripleee It's the internal code name for chat. Commented Sep 20 at 7:15
  • @tripleee what about OakBot? Is it considered important enough to keep as well? Commented Sep 20 at 14:45
  • @ShadowWizard OakBot is not part of SOBotics Commented Sep 20 at 15:09
  • 5
    It might be a good idea to add github.com/SOBotics/chatexchange to the list (aside from SOBotics in general) as this library is used by a few chatbots and it goes through the login flow IIRC.
    – dan1st
    Commented Sep 23 at 21:50
  • 3
    For the record, all chat libraries go through the login flow, as there are no other ways to connect to chat, and are therefore at risk of being affected. Here's 8, and SOBotics has another list with some overlap Commented Sep 25 at 13:33
  • 1
    I have stuff, but I have a good feeling I won't be affected, and even if I am, I can figure out a way around it.
    – starball
    Commented Oct 4 at 8:49
246

We need your help

Many of us don't want to help you sell our content, which we agreed was under CC BY-SA, i.e. allowing commercial use. You're killing the data dump (which already was incomplete, e.g. no images), and now you're killing scraping.

The original spirit of Stack Exchange was to share knowledge with anyone for free (as opposed to, e.g., Experts Exchange). Nowadays, the spirit of Stack Exchange Inc. is to maximize profits by selling that content.

20
  • 57
    For what little it's worth, there is an unofficial community reupload of currently both versions of the new-style data dump (2024-06-30, with the versions uploaded on 2024-08-05 and 2024-08-29 respectively). This does not change the situation, but it at least means we continue having access for a little longer, even if SE doesn't seem to want it Commented Sep 19 at 18:49
  • 12
    @Zoe-Savethedatadump Don't be surprised if that gets taken down eventually, you may want to think of a backup plan up front.
    – Mast
    Commented Sep 20 at 8:02
  • 7
    @Mast Torrent is difficult to kill, so even if IA complies with a takedown request from SE, as long as there are non-IA seeders of the torrents, it'll continue to work as-is. Not sure about uploads after a takedown (probably just making torrents directly - there's already a website that has a copy of the list from the MSE post, so the list should remain live), but uploads are not my department :p Commented Sep 20 at 13:45
  • 2
    @Mast any suggestions on other places to mirror the SE data dumps? Commented Sep 20 at 15:48
  • 2
    Zoe has a good point about the torrent being hard to kill. If a large enough group of people seeds it, it's decentralized enough. May want to organize that though. Perhaps through Discord. Ensure you have the numbers, so to say. You'd want 10 people for each file at least, because numbers will go down eventually.
    – Mast
    Commented Sep 20 at 16:08
  • 6
    @Mast Yes. It is time to go underground :) Viva la resistance!!! Commented Sep 20 at 16:12
  • 2
    @Mast Thanks, many torrents barely have any seeders. Do you know any other alternatives? Commented Sep 20 at 19:18
  • @FranckDernoncourt IIRC, some discussions mentioned some (academic?) dataset site I forget the name of. It isn't super applicable though, and I don't think they seed the torrent. Unfortunately, outside archive.org, the amount of applicable sites is low. Commented Sep 21 at 0:52
  • 2
    @Zoe-Savethedatadump sounds like Zenodo. Max size is 50 GB, and upon request, 200 GB. So maybe ok indeed. Commented Sep 21 at 0:56
  • @Mast The current seed numbers are not great. According to the trackers, 2024-06-30 (original release) is at 8 seeders, and 2024-06-30 (revised release) is at just 4. March 2024 and December 2023 are difficult, because they apparently don't have trackers, at least not at the time I downloaded. According to my torrent client, December 2023 has 1 and March 2024 has 0 (but again, no tracker, so it's difficult to tell how accurate those numbers are). Commented Sep 21 at 0:57
  • 4
    @FranckDernoncourt I checked - I was thinking about academictorrents.com Commented Sep 21 at 0:58
  • @Zoe-Savethedatadump thanks, then no limit, but need seeders. Commented Sep 21 at 0:59
  • 3
    @JustinThymetheSecond e.g., ads and Teams. Commented Oct 12 at 19:13
  • 4
    @JustinThymetheSecond SE founders chose the CC BY-SA license to avoid that situation. That doesn't ban anyone from monetizing it but I'm just saying that I'm not working for free to help others monetize it. Commented Oct 13 at 0:46
  • 1
    @FranckDernoncourt Or they chose that license to deliberately mask their true intentions. Step one: convince everyone you do not intend to monetize submissions. Step two: accumulate a huge database of user-supplied data for free. Step three: change the site, give the data to another site, close SE, say 'surprise', and start charging for that data. Commented Oct 14 at 14:38
208

disallow all indexing, except indexing by known and recognized search engines

Ask yourself "what if everyone did this?". The answer is that it would prevent any new search engines from ever becoming useful, and so making search a permanent oligopoly.

6
  • 35
    So you find out their goal. What are you going to do with this information? :) Commented Sep 20 at 13:09
  • 67
    While you reflect on your finding, please also consider the totally unrelated OpenAI partnership and the similarly totally unrelated "Let's kill the dump so only a selected few (totally unrelated to OpenAI) can easily access the data" Commented Sep 20 at 13:54
  • 4
    For what little it's worth, the bullet also says "We will alter the robots.txt file on our chat servers to disallow all indexing [...]". This is still an incredibly negative move, and one that could expand to main at some point in the future, but at least for now, it's fairly low-impact. And just to be clear, I use a small search engine myself that's unlikely to get recognition by SE, so if this ever makes it to main, it will break my use of the site hard. But at least for now, this bullet point isn't completely breaking for most people Commented Sep 23 at 20:00
  • 1
    Everyone already does this, thanks to Cloudflare. And there are only 3 search engines with a unique index, everyone else just pays Bing to use their API
    – Anonymous
    Commented Sep 27 at 12:12
  • 3
    Stack Exchange is a huge deal. This will significantly damage non-major search engines even without everyone else also doing so. Commented Sep 30 at 20:05
  • And that is a BAD thing? 'Search Engines' have now been so substantially monetized that they are nothing more than another paid advertising service. Really, all you are doing on a search engine is finding those 'sources' that have paid the most for you to find them first. AI does NOT stand for 'Artificial Intelligence', it stands for 'Algorithmic Intelligence'. Commented Oct 13 at 0:21
100

Since SE decided to respond to everyone except me on the internal announcement, this answer is a direct and mostly unmodified copy of the questions I posted on the internal announcement [mod-only link] on 2024-08-22. A couple of notes have been added after the fact to account for answers provided to other people that further emphasise the various points I've outlined. Additionally, a couple of extra notes have been added, as a form of bot detection was rolled out some time around 2024-08-30 that resulted in problems a week later (2024-09-06) that killed Boson. This rollout was apparently unrelated to this change, but it took several days and calling out SE [6] to actually figure that out.


TL;DR:

  1. What is a request and how do you count it? The post demonstrates how one singular webpage load can amount to nearly 60 requests entirely on its own, and that most page loads invoke more than one request, which drastically reduces the number of effective page loads before blocks appear.
  2. Have you accounted for power users with a significant amount of traffic from, among other things, mass-opening search pages and running self-hosted stackapps that don't take the form of userscripts?
    • Even though individual applications may not exceed 60 or even just 10 requests per minute, combined traffic from power users can trivially exceed this in active periods, especially when operating at a scale.
  3. While not formatted as a question, how this is implemented actually matters. See both questions 1 and 2, and the rest of this post. This point isn't possible to summarise.
    • Failure to actually correctly detect traffic can and likely will result in applications like the comment archive being taken offline, because its traffic is combined with my traffic and everything else I host and use. A moderator exemption will not help the other things I host that make requests that aren't authed under my account. [5]
  4. How exactly is the block itself implemented? Is the offending user slapped with an IP ban? This affects whether or not this change screws over shared networks, including (but not limited to) workplace networks and VPNs
  5. The API is mentioned as a migration target, but it isn't exhaustive enough, nor are extensions to it planned for some of the core functionality that risks being affected by this change - also not formatted as a question, but the API is not a fully viable replacement for plain requests to the main site.

See the rest of the post for context and details.


I cannot add anything to the list, as it either can't be reviewed or doesn't meet the threshold to be added to the list, but as a self-hoster, I am heavily impacted by this.

We do not plan on implementing stricter general use rate-limits. In other words, we will only limit traffic if it comes from non-human sources. [...] (Unfortunately I’m unable to share details as to how this is implemented.)

See, this worries me. If you [SE] can't tell us how you tell bots from non-bots, that means there's a good chance of incorrect classifications in both directions. I have no idea how you're implementing this, but if you [SE], for example, make the mistake of relying on stuff "humans normally do", you [SE] can and will run into problems with adblock users.

I'm taking an educated guess here, because this is the only trivial way to axe stuff like Selenium without affecting identical-looking real users. Going by user-agent is another obvious choice, but this doesn't block selenium and similar tooling.

There's also the third option of using dark magic in CloudFlare, in which case, I'm completely screwed for reasons I'll describe in the bullet point on Boson momentarily.


There are two problems I see with this if the incorrect classifications are bad enough:

  1. Any IP with more than a couple people actively using the site can and will be slapped with a block, even if they're normal users
    • Not going after IPs makes it trivial to tell how the bot detections are made, which means it's trivial to bypass without using VPNs.
    • VPNs may also be disproportionately affected here; there are many moderators who regularly use VPNs for all or most internet use who will get slapped with blocks.
  2. Any IP with a sufficient number of stackapps running (of any type) can and will run into problems.

I fall solidly into category #2, and occasionally #1 (via both work and varying VPN use).

On my network, I host the following stackapps:

  • Boson, the bot running the comment archive (2 API requests per minute + 2 * n chat messages, n >= 0, + multiple requests to log in)

    • Very occasionally, CloudFlare decides to slap Boson's use of the API, which very occasionally sends it into a reboot-relog spiral. I'm pretty sure I've mitigated this, but there are still plenty of failure modes I haven't accounted for. When this happens, I often then get slapped with a recaptcha that I have to manually solve to get boson back up. Also interestingly, CloudFlare only ever slaps API use, and not chat use, even though the number of chat messages posted (in a minute where there are new comments to post to chat) exceeds that of the number of API requests made by a potentially significant margin.

      Also, generally speaking, chatbots on multiple chat domains have to do three logins in a short period of time. Based on observational experience, somewhere between 4 and 8 logins is a CloudFlare trigger.

      The number of CF-related problems has gone drastically up in the last few months as a result of CF being, well, CF. It detects perfectly normal traffic as being suspicious on a semi-regular basis, and this kills my tools and is annoying to recover from, because a few of the blocks are hard to get around. Finishing up the note from earlier in this answer, if dark CloudFlare magic is used to implement this system, I will have problems within the first few hours of this being released, because that's just how CloudFlare works.

      Don't you just love CloudFlare? /s

      Editor's note: On 2024-09-06, an apparently unrelated bot detection change happened that independently killed boson. It initially looked like parts of the rate limiting change, but SE has denied this. They also denied making any recent changes to bot detection, and instead suspected CloudFlare might've done something that broke it separately. In either case, this highlights my point; CloudFlare-based bot detection will break stuff, intentionally or otherwise. If they don't even have to enable something that can break community bots and tooling, CF itself is a problem as long as it's operational.

      For context, based on the internet archive and observations from devtools, the specific bot detection system used is JavaScript detections, which requires a full browser environment to run. This is not an option with Boson, because it's fully headless, and written in C++ where I'm not masochistic enough to set up webdriver support.

      IA observations suggest this particular form of bot detection was enabled on 2024-08-30. It's still unclear why it took a week to run into problems - it might be a coincidence, or it might've only become a problem on 2024-09-06 for complicated CF config reasons I'm not going to pretend to understand.

      If you too would like to get blocked from bot detection, you can curl https://stackoverflow.com/login (intentionally 404 - 404 pages result in far more aggressive blocks than any other pages) four times. Also, as a preemptive note, SE was notified about the details of this several times (after the fact, however [7]) when they took an interest after being called out in public

  • An unnamed, partly unregistered, closed-source bulk-action comment moderation tool. It does 1 API request per minute (and has for the last 1-2 years). When actively used [1], it does up to 15-20 per minute, with a combination of API requests (the main comment deletion method) and via the undocumented bulk deletion endpoint [2][3], because this thing is designed for moderation at a scale.

    • In addition to the requests already listed, in certain configurations, it too goes through an automated login process.
    • The majority of the requests are API requests, but the login required to make bulk deletion work cannot be done through the API.
    • Just like boson, the comment collection process semi-regularly gets killed by CloudFlare during API calls.
  • Every quarter, I'll be downloading the data dump - and likely have to repeat several requests, because the download system appears to be flaky under load. This is done automatically through Stack Exchange data dump downloader and transformer, which makes somewhere between 10 and 20 page requests per second, including redirects, and with peaks during the login process for reasons that will be shown later.

    Editor's note: The data dump downloader shouldn't be affected. Based on science done on the JS detection bot killer from 2024-09-06 to 2024-09-07, Selenium is unlikely to be affected. If the rate limit becomes a problem, I'll artificially lower the request volume until it cooperates. Please open an issue on GitHub if it breaks anyway - ensuring we have continued access to the data dump is the only priority I have left atm.

  • I have an RSS feed set up to read a meta feed, running every 15 minutes

  • I had plans to host a few more stackapps as well, but those plans were delayed by the strike and strike fallout. If those plans continue at some point, I'll be making ✨ even more requests ✨, as these were also planned in the form of chatbots, and chatbots cannot be moved over to the API.

  • I very occasionally run various informal tooling to monitor Stuff. These are in the form of bash scripts that use curl, and ntfy to tell me if whatever it is I'm looking for has happened. These are all applications where the API is an infeasible strategy for something this tiny. The last such script I made ran every 6 hours

  • Whenever I do bot/stackapp development, the number of requests goes up significantly for a short period of time (read: up to several hours) due to higher-than-normal request rates for debugging purposes

Editor's note: Most of these, with the exception of Boson (and the data dump downloader), were axed a few days ahead of the initial announcement deadline as an attempt to load shed after not getting any response from SE for nearly two weeks, and the initial deadline approaching rapidly. Boson, as previously mentioned, was later killed when an unrelated bot detection change appeared to take the place of the rate limit change. All the applications listed here (again, except the data dump downloader) are now disabled and/or axed, and all my future plans for stackapps are scrapped due to the lack of response from SE.

In addition, I run a crapton of userscripts (including a few very request-heavy userscripts), and actively use SO and chat. Under my normal use, I can load several pages per minute, and depending on how you count requests (a problem I'm commenting on later), this totals potentially hundreds of web requests.

During burninations, I also load a full pagesize 50 search page worth of questions to delete. This means that over the course of around 20 seconds, my normal use can burn through around 55 page loads, not including requests to delete questions. Mass-opening search pages is a semi-common use-case as well, and will result in problems with the currently proposed limits - 60/min is extremely restrictive for power users.

Though the vast majority of these individual things do not exceed 60 requests per minute, when you combine this activity, I have a problem. If any of my activity is incorrectly (or even correctly, in the case of the bots I have running) identified as automated, I get yeeted and my tools get killed. If said killing is done by CloudFlare, recovery is going to be a pain in the ass.

With limits as strict as the proposed limits, it'll become significantly harder to self-host multiple stackapps without running into even more rate limiting problems. The vast majority of current anti-whatever tooling is IP-based, so if this system is too, the activity will be totalled, and I will eventually and inevitably get IP-blocked.

Even though some applications have a very infrequent usage time, if Something Happens™ or the run times happen to overlap, I can trivially exceed the request limit. This is especially true of logins following an internet outage or site outage, as an outage that kills my scripts forces relogs, and again, logging in is expensive when done at a scale. It's already annoying enough with CloudFlare getting in the way.

While a moderator exemption will reduce some of the problems, if the result is an IP block, I have a bot account [4] that does a chunk of these requests. Other requests again are fully unauthenticated, but these make up an extreme minority of the total number of requests made.

Editor's note: as demonstrated by the bot detection rollout earlier in September, while there may be exemptions here, that apparently does not extend to anything else that could break bots and tooling. We may be safe from the rate limit, but not necessarily the next bot detection system they enable quietly, or have enabled for them by CloudFlare.

Misc. other problems

To this end, moderators will be granted a unilateral exception to the new rate-limit on any site they moderate

Moderators are far from the only people to have a use pattern like mine - there's lots of active users doing a lot in the network, or on their favourite site.


Note that we strongly recommend userscript developers switch over to API usage as soon as possible

There are multiple things there aren't endpoints for. There's no way to log in, there's no way to download the data dumps, there's no way to log into chat, there's no way to post to chat, etc.

Advanced Flagging, for example, posts feedback to certain bots by sending messages in chat. It does so with manual non-API calls, because there's simply no way to do this via the API.
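
For context, the way chat tooling does this today (chatexchange and friends) is roughly: reuse a logged-in browser-style session, scrape the fkey token from the room page, and POST the message straight to the chat server. A rough sketch of that pattern follows - the endpoint path and the fkey/text field names reflect how these community libraries are commonly described, so treat them as an approximation rather than a documented or supported interface.

```python
# Rough sketch of the non-API chat posting pattern referenced above, assuming an
# already-authenticated requests.Session (cookies from a normal site login).
# The endpoint path and the `fkey`/`text` field names are assumptions based on
# how community chat libraries are commonly described, not a supported API.
import re

import requests

CHAT_HOST = "https://chat.stackoverflow.com"

def post_chat_message(session: requests.Session, room_id: int, text: str) -> None:
    room_html = session.get(f"{CHAT_HOST}/rooms/{room_id}", timeout=30).text
    fkey_match = re.search(r'name="fkey"[^>]*value="([^"]+)"', room_html)
    if fkey_match is None:
        raise RuntimeError("could not locate the fkey token on the room page")
    session.post(
        f"{CHAT_HOST}/chats/{room_id}/messages/new",
        data={"fkey": fkey_match.group(1), "text": text},
        timeout=30,
    ).raise_for_status()
```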

The exclusion of endpoints for some of these things is either implied to be, or has explicitly been said to be, by design. As convenient as it would be, as it currently stands, the API is not a viable substitute for everything userscripts or bots need to do. This is especially true for certain actions, given the heavy rate limiting, tiny quota, and small to non-existent capacity for bulk API requests.

Editor's note: SE answered someone else internally who asked about an API for chat around 3 hours after I posted my questions, where they confirmed they would not be adding chat support to the API in the foreseeable future. This further underlines my point; not everything can just be switched over to the API and be expected to work. This will result in stuff breaking.

What is a request?

Here's a few samples of requests made from various pages. Note that non-SO domains are omitted, including gravatar, metasmoke, ads, cookielaw, and third-party JS CDNs. Also note that in this entire section, I'll be referring specifically to stackoverflow.com, but that's just because I don't feel like writing a placeholder for any domain in the network. Whenever I refer to stackoverflow.com, it can be substituted with any network URL

Requests from a random question page with no answers:

Requests made (incl. userscripts, excl. blocked) Requests blocked (uBlock) Userscript requests Domain
13 2 cdn.sstatic.net
4a 2 i.sstatic.net
3 stackoverflow.com
2 qa.sockets.stackexchange.com
1 1 api.stackexchange.com

Editor's note: After the release of the previously mentioned unrelated bot detection and the (also unrelated) recent tags experiment, requests to stackoverflow.com went from 3 to 7, meaning 9 questions/min is enough to get rate limited. The bot detection adds 3 requests alone, so all the counts are now out of date - and they're on the lower end of things.

Requests from a random question page with 4 answers:

Requests made (incl. userscripts, excl. blocked) Requests blocked (uBlock) Userscript requests Domain
13 2 cdn.sstatic.net
12a 3 i.sstatic.net
3 stackoverflow.com
2 qa.sockets.stackexchange.com

Requests from the flag dashboard:

Requests made (incl. userscripts, excl. blocked) Requests blocked (uBlock) Userscript requests Domain
31a i.sstatic.net
8 2 cdn.sstatic.net
1 stackoverflow.com
1 qa.sockets.stackexchange.com

Requests made when expanding any post with flags:

Requests made (incl. userscripts, excl. blocked) Requests blocked (uBlock) Userscript requests Domain
3 stackoverflow.com
1 i.sstatic.net

Requests from https://stackoverflow.com/admin/show-suspicious-votes:

Requests made (incl. userscripts, excl. blocked) Requests blocked (uBlock) Userscript requests Domain
18 2 cdn.sstatic.net
33a i.sstatic.net
1 stackoverflow.com
1 qa.sockets.stackexchange.com

Requests from the login page through a completed login:

Requests made (incl. userscripts, excl. blocked and failed) Requests blocked (uBlock) Failed requests Userscript requests Domain
1 1 askubuntu.com
30 6 cdn.sstatic.net
18 i.sstatic.net
1 1 mathoverflow.net
1 1 serverfault.net
1 1 stackapps.com
1 1 stackexchange.com
6 stackoverflow.com
1 1 superuser.com

a: Changes based on the number of users on the page. On large Q&A threads or any form of listing page, this can get big.

While mods are allegedly exempt, this still shows a pretty big problem; how do you count one request? Even if it's just requests to stackoverflow.com, question pages amplify one request to 3, and that's just with the current requests made. Depending on how you count the number of requests, just loading one singular strategic page (probably a search page with pagesize=50) can be enough to cap out the entire request limit. The vast majority of pages in the network do some amount of request amplification.

The login page in particular is, by far, the worst. One singular login makes requests to every other site in the network, meaning the total succeeded requests account for 12 requests just on that one action.

This leads to the question proposed in the section header: what is a request, and how is it counted? How do you [SE] plan to ensure that the already restrictive 60 requests/min limit doesn't affect normal users?
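
If you want to produce per-domain tallies like the tables above for your own pages, one way is to export a HAR capture from the browser's network tab and count entries per host. A small sketch, with "page.har" as a hypothetical filename for whichever page load was exported:

```python
# Sketch: tally requests per host from a HAR capture exported via the browser's
# network tab, to produce per-domain counts like the tables above.
# "page.har" is a hypothetical filename for whichever page load was exported.
import json
from collections import Counter
from urllib.parse import urlparse

def requests_per_host(har_path: str) -> Counter:
    with open(har_path, encoding="utf-8") as f:
        har = json.load(f)
    hosts = Counter()
    for entry in har["log"]["entries"]:
        hosts[urlparse(entry["request"]["url"]).netloc] += 1
    return hosts

if __name__ == "__main__":
    for host, count in requests_per_host("page.har").most_common():
        print(f"{count:4d}  {host}")
```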

Footnotes

  1. Admittedly, it hasn't for over a year because strike and the following lack of motivation to keep going
  2. Part of the manual tooling deals with bulk operations on identical comments. I could queue them up to the mountain of a backlog I generate when reviewing comments, or I could tank a large number of problematic comments in one request, drastically freeing up quota for other parts of the application.
  3. This doesn't run all the time either, and is in a separate module from the bulk of the API requests
  4. Technically two, but one of them is not in use yet because strike.
  5. I also just realised that I can bypass this by running all the stackapps under my account, but this means running live bot accounts with moderator access, and insecurely storing credentials on a server exposed to the internet. For obvious reasons, this is a bad idea, but if self-hosting isn't accounted for, this is the only choice I have to maintain the comment archive.
  6. I still stand by my comments saying Stack Exchange, Inc. killed the comment archive - whether or not it's related to the change announced here or not, someone at Stack Exchange, Inc. rolled out bot detection either intentionally, or accidentally by willingly continuing to use CloudFlare in spite of it already having been disruptive to bots and other community tools. Granted, the challenge platform detection in particular is far more disruptive than other parts of CloudFlare, but CloudFlare has killed API access for me several times (and I'm not counting the bug on 2024-08-16 - I'm exclusively talking about rate limiting blocks by CF under normal operation, and while complying with the backoff signals in API responses)
  7. SE was not notified ahead of time: the initial rollout timeline for the rate limit coincided with the bot blocking kicking in, and having been ignored for three weeks straight internally at the time, I assumed this was intentional and opted to ensure the comment archive could remain functional somewhere over writing a bug report. There are a lot more private details as to why the bug report came after the fact (read: after SE suddenly decided to respond to me), which SE has been told in detail multiple times.
31
  • 10
    Hi Zoe, it appears as though the vast majority of your answer assumes that the rate-limits we are implementing will not work as designed, e.g. cannot properly differentiate between automated and human requests. If the system is not working as intended, you should file a bug report so we can fix it. If you suspect the system is rate-limiting in an unreasonably aggressive way, and the blocks are interfering with your normal usage of the site, open a support or feature request on Meta; alternatively, send in a support email, and be sure to include relevant diagnostic information.
    – Slate StaffMod
    Commented Sep 19 at 16:39
  • 8
    Additionally, as mentioned in the original post, I am not able to answer questions about how the block functions on a technical level. My aim is to share enough information so that you may understand whether the system is behaving properly, without needing to know the exact details of its operation. Unfortunately, however, I can't even partially answer those particular questions as you have written them, because they are seeking the information I have already said I cannot provide.
    – Slate StaffMod
    Commented Sep 19 at 16:39
  • 7
    Finally, your question about how requests are counted is again predicated on the notion that the implementation will not work as designed. However, it cannot be broken in the way suggested here, because a casual user using the site can easily and quickly surpass 60 requests per minute sum total with only a few page loads. If the implementation does not work as designed, please file a bug report. If it does work as designed, then it should be fairly clear what a "request" is to someone writing a script that sends requests. Thank you for your feedback.
    – Slate StaffMod
    Commented Sep 19 at 16:40
  • 42
    @Slate "Finally, your question about how requests are counted is again predicated on the notion that the implementation will not work as designed." - the data dump downloader is browser-based. It's right in the middle of a grey area. It's automation, but a full browser that loads stuff like a normal browser. One request isn't straight-forward in this context. If it's affected, users can be too. Commented Sep 19 at 18:00
  • 31
    It isn't browser emulation, it's a full browser. It's using the full version of Firefox through Selenium, a common browser automation tool that also happens to be relatively common in scraping (it also supports chrome, though my tool does not support anything but FF). The source code is at github.com/LunarWatcher/se-data-dump-transformer, along with usage instructions Commented Sep 19 at 18:26
  • 76
    @Slate Zoe didn't quite make that assumption, but I do. The rate-limits you're implementing cannot work as designed, because all requests on the internet are issued by computer programs: there is fundamentally no way to distinguish between "computer program with a human sitting behind it" and "computer program running on a laptop in a cupboard". This is a well-studied problem, and to my knowledge, nobody has ever made this work. We can probably assume that it is not generally understood how to do so.
    – wizzwizz4
    Commented Sep 19 at 19:48
  • 13
    I respect that view, @wizzwizz4, and I am familiar with these challenges too. Unfortunately the most I can say is: if you encounter issues accessing the network, you should submit a bug report or support request. If you believe that it is not an "if," but a "when," then I want you to know that I understand the rationale and respect the claim, but my advice does not change.
    – Slate StaffMod
    Commented Sep 19 at 20:33
  • 14
    @Slate based on your comments here, would it be roughly correct to describe a "request" as "a page load and any resources requested by that page not in response to any human interaction"? The main motivation to have a definition is in order to know what might or might not exceed the stated "10 requests per minute"; one needs some workable (even if vague) definition of what constitutes a request in order to determine that. Assuming that definition is correct, I think that would be workable enough to make that determination without giving away any implementation details.
    – Ryan M
    Commented Sep 19 at 23:30
  • 12
    @Slate Sorry for being blunt... but given the general hate for this move that hopefully you have noticed from the downvotes and the general reception here... what makes you think that someone would want to help with "cannot properly differentiate between automated and human requests. If the system is not working as intended, you should file a bug report so we can fix it."? If someone in the company is scared that the pathetic gatekeeping of the datadump can be circumvented, given that the community never agreed with that.. I fear that preventing robot automation is a company problem. [cont] Commented Sep 20 at 8:36
  • 15
    @Slate To be even more blunt... Please, don't insult our intelligence. We are not stupid. The timing of this move (BTW, did someone have time to hear from your legal team about the answer promised months ago on the data protection issues amid all of this?) clearly shows this is PURPOSELY targeted at shooting down projects like the one Zoe mentioned, so it is quite ironic to ask the victims to help you kill them off. Also, since Zoe mentions that this answer is a copy of another post on the "secret mod only boards"... how come you are discussing this now and not when she first posted it?? Commented Sep 20 at 8:49
  • 7
    @RyanM One request in this context should approximately be understood as "what takes place when you visit a single page in the browser."
    – Slate StaffMod
    Commented Sep 20 at 14:10
  • 20
    I did not expect this change to be well-loved, @SPArcheon. But deception is not my style. I try to be as transparent and clear as I am able to be, both within the company and within the community. To the best of my knowledge, this change set is not targeted at killing data dump utilization. To that end, if someone who encounters issues does not want to report it because they believe the change is being made in bad faith, I can understand and respect that perspective. However, regardless of how such a user feels, I know intellectually that the only way to get an issue fixed is to report it.
    – Slate StaffMod
    Commented Sep 20 at 14:14
  • 18
    @Slate Again, this is not against you personally. My issue is actually a sum of many factors. First, the company constantly claims one goal when evidence points to other agendas - just think back to some recent disclosures by ex-employees about when the "kill dump protocol" was first mentioned. Second, tell me why Zoe's concerns during the private phase weren't even addressed? Again, not a you problem, but she was ignored while the company was in a position where no one would know (how convenient that the mods can't comment about what happens in the gods' halls) [cont] Commented Sep 23 at 7:58
  • 11
    [cont] Replying only when things go public and everyone will notice is not what I would call a good sign. Third, the "we can't tell the full reasons because it is not our role to"... then where are those who can? Why are you left here alone if you can't comment on the higher-level strategy this change fits into?? It is not great to see those individuals go silent when they would be most needed, yet be quick to go around giving empty closed-door PR talks at conventions most users are too far away to care about. Commented Sep 23 at 8:02
  • 9
    @Slate Lastly, let me point out that nowhere I suggested that "people should not report bugs". I instead stated that I doubt that the users that don't like the change will help you with improving your "automation detection" algorithms. That is not the same. Commented Sep 23 at 8:04
85
+50

Let me make it clear.

You ask us to help with solving a problem you created in an attempt to solve another problem. You refuse to disclose the latter problem.

Is it me, or does it sound mighty shady?

PS: "Preventing unauthorized automated access". Unauthorized? Seriously? Why not go all the way and call it: "Preventing criminal access to our data".

2
  • 21
    "Preventing criminal access to our data"... was the joke about them labeling automated access as "criminal"... or them claiming the data as "their"? Commented Sep 20 at 13:56
  • @SPArcheon Under DMCA, unauthorized access might be criminal access by definition. Only a court can tell you one way or the other. Commented Oct 11 at 20:44
57

except indexing by known and recognized search engines

While I of course have concerns over other parts of this statement, this is particularly concerning to me.

Which engines are "known and recognized"?

Limiting the user base

This essentially allows SE to bar all but the biggest search engines, which will either disincentivize existing SE users to use those engines, or disincentivize existing users of those engines to use SE.

Existing SE users who also use an "unrecognized" search engine would likely have to find a new search engine. Personally, as a software developer and frequenter of Stack Overflow, I feel that this would be my choice. I can't honestly say that if SE weren't indexed by my search engine I would cease to use it in protest, as SO is too valuable a resource not to be used, but I would rather not switch from DuckDuckGo back to Google or Bing, both of which are forcing AI features that I find only disrupt my experience and workflow. Forcing a portion of SE users (primarily SO users, who I feel are not unlikely to use "alternative" search engines) to switch will surely sour the attitude of the community.

There's also a small group of potential users to be lost. People who do not use SE, but use search engines not approved by the site, will either have a bad experience searching the network, or be unable to discover it entirely. Especially with more and more people moving away from Google and Bing, this will disincentivize those users from using the network, counter to the recent initiative to find new active users.

Suppressing new search engines

SE is a huge part of the internet that many would say they can't live without. Due to its influence, decisions like this have the power to make or break new and up-and-coming search engines. This decision essentially solidifies SE's pre-approved list as the only search engines worth using for many, preventing new ones from getting their start.

The decision creates unfair hurdles for smaller search engines to overcome. Even if they get big despite this change, at what point will SE consider adding them to their "allow list"?

SE's search bar won't fix this.

Realistically, users aren't going to use the search feature on their respective SE sites.

First, it's added friction - "type query" vs "type stackoverflow.com, click search bar, type query". That's far too much effort for most users.

Second, it excludes other sources of knowledge. Searching only within SE excludes Q&As and other resources relevant to the query that are hosted outside the network. Assuming that SE has all the answers is... well, rather stupid, so limiting searches for knowledge to just SE is generally not a viable option.

Finally, SE search is quite frankly an inferior experience to what users would get with any half-decent search engine. Investing in the SE search experience would likely not come close to being a complete solution, as it "solves" - realistically, "poorly mitigates" - only one of many problems.

Conclusion

Overall, I think allowing only specific search engines to index the site is a bad decision for users of smaller search engines, as well as for those search engines themselves. It will sour attitudes, deter users, and solidify the monopolies held over the internet by the few large companies who provide search engines, telling a not insignificant group of users and potential users that they're second-class citizens of the Stack Exchange Network because, for whatever reason, they chose to use a search engine that big SE doesn't approve of.

5
  • 35
    If SE blocks my search engine, then I guess I just won't be visiting SE anymore when I'm searching for something.
    – Gloweye
    Commented Sep 25 at 9:36
  • 6
    Re "solidify the monopolies held over the internet by a few large companies", I wonder if that's an intended outcome, in the hope of then being in a position to extract rents from those large companies. That's how big European publishers acted in the last decade: they saw big tech make a lot of profits and decided to lobby for laws which would solidify the big tech oligopolies but also try to force them to give some money to legacy publishers (link tax, article 17 etc.). Maybe StackExchange behaves like a legacy publisher these days.
    – Nemo
    Commented Sep 30 at 12:10
  • 5
    Interesting they formed a partnership with a large search provider earlier in the year and then started blocking smaller search engines out later in the same year. Coincidence? Commented Oct 6 at 13:12
  • 4
    Remember: Bing is a "known and recognised search engine" and they're explicitly blocking it! I guess Google is paying them quite a pretty sum to do this!
    – jwenting
    Commented Oct 7 at 10:12
  • 4
    Ironically, if SE doesn't show up in a particular search engine, sites that rip SE content verbatim will. And those sites probably already use many tricks to circumvent rate limiting of any kind. Commented Oct 11 at 12:25
54

There's a better way

Stack Exchange should publish all the data it holds in an easily downloadable and mirrorable machine-readable format. Maybe it can be updated once a month by an automatic process. Then there would be no need for scraping, and if a particular scraper did pose an excessive server load, there would be few objections to limiting it, because that would be the wrong way to get the data.

5
  • 42
    So... the data dumps? Commented Oct 1 at 15:25
  • 5
    @JourneymanGeek ding ding ding!
    – TylerH
    Commented Oct 1 at 15:35
  • 23
    But if they did that how would they be able to sell access to the content we donated and cash in on the AI hype?
    – ColleenV
    Commented Oct 1 at 17:19
  • 5
    Note that I received a 7-day network-wide ban after posting this answer. Commented Oct 20 at 9:49
  • 4
    @CriticizingSEisbannable Did they say that it was because of that answer, or did they say there was some other reason?
    – Starship
    Commented Oct 21 at 17:40
52

Will this hinder the Wayback Machine's ability to archive pages from the Stack Exchange network? (If so, I'd argue that it should get a special exemption from the rate-limiting.)

8
  • 16
    Based on the information available to me, blocking archival services is not an intended outcome of this change set. That said, I'm not sure what the impact would be (if any). I will need to ask the engineering team tomorrow, and/or see if there are ways to mitigate the impact if present. By the way, @talex, please don't answer unknown questions on our behalf with your assumptions, especially if they express certainty. These sorts of responses from community members are very misleading and make answering questions transparently more challenging to do.
    – Slate StaffMod
    Commented Sep 22 at 19:48
  • 24
    I've talked with the engineering teams. We have no plans to limit the Wayback Machine / Internet Archive at this time. Any impact would be unintended. If you encounter any issues using the Wayback Machine, I would recommend reporting it as a bug.
    – Slate StaffMod
    Commented Sep 23 at 14:40
  • 9
    @ShadowWizard "Any impact would be unintended." and "If you encounter any issues using the Wayback Machine, I would recommend reporting it as a bug." both sound like pretty strong statements regarding the company's intent for the foreseeable future. Is your issue with the phrase "at this time" in the sentence before those two?
    – Ryan M
    Commented Sep 25 at 10:37
  • 4
    My interpretation of "at this time" would be "We currently don't plan to do this and also currently don't have plans to do this in the future" but if AI companies (or similar) decide to use the Wayback machine for getting their training data, that might change.
    – dan1st
    Commented Sep 25 at 11:07
  • 6
    @dan1st hey, and what about us evil meddling users documenting ToS and policy changes and controversial actions on a third-party site? We are entitled to some hate too!... Sarcasm aside, the Archive has plenty of other uses, like documenting the censorship of featured questions tags during some big network-wide strike Commented Sep 25 at 16:29
  • 5
    @Slate One of the ingestion pathways for the Wayback Machine involves volunteers running downloader scripts on their home IPs, identical to the bots you want to ban. Commented Oct 1 at 15:17
  • 1
    Gonna need accredited bot accounts at this rate (can't use human mod accounts, they know too much)
    – SamB
    Commented Oct 1 at 16:55
  • 1
    @SamB cdn.sstatic.net/Winterbash/img/hat/6942050.svg
    – starball
    Commented Oct 1 at 18:34
38

We will heavily rate-limit access to the platform from Microsoft’s IP addresses.

(How) will this impact people accessing the SE network using services like Windows 365 or Amazon WorkSpaces?

5
  • 8
    Tentative conclusion - if you run heavy-handed scripts from it, it'll get rate-limited. If you're just using Stack Overflow in a browser on Windows 365, you probably won't get rate-limited.
    – Slate StaffMod
    Commented Sep 20 at 14:21
  • 24
    @Slate "you probably won't get rate-limited". Not very assuring, right?
    – M--
    Commented Sep 23 at 16:00
  • 11
    @M-- I mean, I'm not trying to mislead people by saying that. I say "probably" instead of "certainly" because I can't guarantee that Windows 365 won't ever accidentally be impacted. But let me be clear: It isn't the goal to rate-limit Windows 365 users, and I can't see a reason why it would happen. In the event it does happen, report it so that it can be fixed.
    – Slate StaffMod
    Commented Sep 23 at 16:09
  • 6
    @Slate let me clarify, I am not implying that you are not genuine (and I apologize if my comment hinted that). It's the nature of these changes. In the pursuit of monetizing our collective data, SO is introducing changes that most definitely will result in users facing "bugs", hence, you advising us to report them as they arise.
    – M--
    Commented Sep 23 at 16:20
  • 8
    @M-- Yeah, that's fair, and no worries - I don't take it personally, I just want to make sure I'm being as clear as I can :) For what it's worth, I do empathize. I want to make the change as easy as possible. And... yeah, there's no denying that this is a complex change that risks introducing bugs.
    – Slate StaffMod
    Commented Sep 23 at 16:28
28

Could high-rep users be exempted from this? I'd imagine that the vast vast vast majority of problematic automated requests come from lower rep users (or more likely 1 rep/non-registered users) and the vast majority of non-problematic automated requests come from higher rep users.

This would solve a good portion of the problems with this idea.

11
  • 26
    We have actually already taken this into account! In general, high-rep users should not see any additional rate-limiting, even when submitting automated requests to the site e.g. via userscripts. The exact implementation of these rate-limits is less black-and-white than stated in the original post, and more complex exceptions exist than the ones strictly listed there. However, due to the associated complexity, please keep in mind that we can't guarantee those exceptions will be completely infallible.
    – Slate StaffMod
    Commented Sep 19 at 17:05
  • 9
    @Slate would you mind saying how high rep is high rep? 100? 1000? 10000? Is it rep on one site or rep on the site in question? Anyways, I do appreciate that you thought of this.
    – Starship
    Commented Sep 19 at 17:50
  • 10
    Pretty sure the exact details are kept hidden or vague on purpose, as there are high rep users who become trolls, and might use that info for bad stuff. Commented Sep 19 at 17:54
  • 1
    Okay... well, if so, could I at least get a very general range? Because how much of a help this is depends a lot on the answer to that. @ShadowWizard
    – Starship
    Commented Sep 19 at 17:55
  • 2
    @Starship I would imagine it's linked to the Trusted User privilege, but we're not going to get any further details. (As much as I dislike the Cloudflare setup, this is one of those things I agree with secrecy about: it is literally security by obscurity, and only effective as long as the obscurity remains.)
    – wizzwizz4
    Commented Sep 19 at 19:43
  • 2
    While it might solve many problems, there are also many legitimate sockpuppets for automation reasons and these probably don't have that much reputation, e.g. Natty (that might use the API though, I don't know).
    – dan1st
    Commented Sep 19 at 20:31
  • @dan1st I didn't say it would solve everything, but it's a step in the right direction, is it not?
    – Starship
    Commented Sep 19 at 21:35
  • 8
    @Slate I don't suppose the complexity is due (in part) to using network rep to exempt users network wide instead of only on sites where they have high rep? For example, Charcoal users allow the tool to flag on their behalf to help smaller sites stay spam free. Now, I think that's done through the API, so maybe not a concern directly but I guess it seems reasonable for the system to assume that the user should be exempt everywhere. I was actually wondering why mods were only exempt on their own sites, too for similar reasons.
    – Catija
    Commented Sep 19 at 22:44
  • 1
    @Slate here's a high rep user running into issues: meta.stackoverflow.com/questions/431678/… Would you take a look to see if this is related to the recent changes?
    – M--
    Commented Sep 23 at 16:02
  • 5
    @M-- These changes haven't gone out yet. Rate limits won't be active until, at earliest, the last week in October.
    – Slate StaffMod
    Commented Sep 23 at 16:04
  • @M-- that's just Cloudflare blocking certain IP addresses based on... god knows what. However, it's not new, and not related to what's announced here. Commented Sep 25 at 10:15
16

It seems like you have made some attempts to reduce the negative impact by asking which tools the Community uses. However, there are many other small user scripts and applications that are probably only used by a few people and not worth considering on their own. Community user scripts and other apps have all sorts of reasons (including but not limited to moderation) to access various pages. I guess most access to question/answer pages can use the API, but this isn't the case for more niche things.

That being said,

What stops you from limiting these restrictions to question/answer and similar pages?

From your post (and other posts in the past), it seems like the main problem you are dealing with is scraping of the user content you are hosting (if it isn't, correct me and tell us what else you want to achieve with this). If you want to block this, you could do that by implementing these restrictions on question/answer pages, maybe the search, and a few other pages you don't want people to scrape. This would mean most Community applications and user scripts that don't request many additional question/answer (and similar) pages should be unaffected.

Please only restrict the pages you really want to restrict. There isn't much reason to limit automated access to the rest of the site, and doing so causes unnecessary damage.
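
For what it's worth, here is roughly what the API route looks like for plain question data - a minimal sketch, assuming the public api.stackexchange.com/2.3/questions endpoint and the third-party requests library, run without an API key (so it is subject to the lower anonymous quota):

    # Fetch recent questions via the public API instead of scraping HTML pages.
    # Assumes the documented /2.3/questions endpoint and the requests library;
    # no API key is used here, so the lower anonymous quota applies.
    import time
    import requests

    API = "https://api.stackexchange.com/2.3/questions"

    def recent_questions(site="stackoverflow", pagesize=10):
        resp = requests.get(API, params={
            "site": site,
            "pagesize": pagesize,
            "order": "desc",
            "sort": "activity",
        })
        resp.raise_for_status()
        data = resp.json()
        if "backoff" in data:
            # The API asks clients to wait this many seconds before further requests.
            time.sleep(data["backoff"])
        return [(q["question_id"], q["title"]) for q in data["items"]]

    if __name__ == "__main__":
        for qid, title in recent_questions():
            print(qid, title)

It's the more niche pages that tools still have to fetch as HTML, and that is exactly where a blanket restriction would hurt the most.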

6
  • 16
    You put your finger on the issue with "your content" - it is not their content. They have permission (a license) to use the content created by the community. A license which, by the way, while allowing them to profit from said content, also requires them not to prevent others from also using said content. You could take the data dumps, create your own "questions asked and answered" website with them, and be 100% in the clear (as long as you properly attribute the questions/answers to their original users). But to answer your question: they want profit/big numbers/money - that is all
    – CharonX
    Commented Sep 23 at 9:48
  • 2
    @CharonX My fault, fixed - I don't know why I wrote that.
    – dan1st
    Commented Sep 23 at 10:25
  • 13
    I raised this as a possibility internally. We're optimistic that this idea could work, and agree that it would alleviate a number of community pain points around this change. We do need to do some technical/trade exploration before we can commit to it, so don't take this as a firm "yes" - yet.
    – Slate StaffMod
    Commented Sep 23 at 14:56
  • 5
    Small update here. Based on the data we're seeing from the logging mode we're currently running, we actually suspect there should be very few (if any) new false positive detections of community tools / scripts. So for now I'll say that this approach is definitely still an option, but it's likely to be something we consider as a fallback, rather than something we implement at the start.
    – Slate StaffMod
    Commented Sep 26 at 20:36
  • 3
    Thank you for being transparent. I hope that your logging doesn't miss significant cases of user scripts/similar applications and that you are able to minimize false positives as much as possible (even though I still think it would be preferable not to restrict pages that don't need to be restricted).
    – dan1st
    Commented Sep 26 at 21:33
  • 2
    @dan1st I hope so, too!
    – Slate StaffMod
    Commented Sep 26 at 22:28
14

This announcement is quite useless without information on how this is going to be implemented:

We will restrict the rate at which users can submit automated requests to the site through scripts and code without using the API, e.g., actions originating from sources other than human interaction with a browser.

In particular, will you use a captcha? If so, which one, and configured how? Cloudflare, for example, is known to be very problematic for Tor users, and in some configurations it forces everyone to allow third-party cookies as well as proprietary JavaScript and assorted surveillance technology in order to visit a website.

It's disingenuous to state that blocking the Internet Archive's Wayback Machine is not an intended goal unless you have already ruled out any implementation method known to interfere with legitimate uses like the Wayback Machine.

2
  • 6
    Generally, if a user trips this ratelimit, they will see a Cloudflare captcha page.
    – Slate StaffMod
    Commented Sep 25 at 15:22
  • 3
    @Slate Thanks for confirming. So what appearance will be used, "managed"? How will analytics be used to evaluate success? Do you have an estimate of how many actual humans are misclassified as non-human or otherwise blocked by Cloudflare? What demographics have been deemed acceptable to lose (e.g. Tor users, people who block JavaScript, people with ad blockers, people from certain countries with higher captcha failure rates, whatever)?
    – Nemo
    Commented Sep 27 at 7:31
6

I just remembered this was a thing - during the period SE was getting DDoSed, I set up the website uptime monitoring service I use for other things (Uptime Kuma) to check on a few key services for me: main SE and MSE chat, MSE, SO, and SU. It checks that the site is up, and posts a message in a chatroom if there's an error.

This in theory gives me an independent way to check if the network is down, and it's a tool I'm running on other resources anyway.

As I understand it, the service makes a request, checks the status code, and reports if it's unhealthy. I'm reasonably sure I'm within the request limit, but I'm pondering setting a longer time between requests (I currently leave about a minute between site checks). I do check multiple sites, though, so it'd be closer to 5 requests per minute, depending on how many requests a single check involves.
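
Concretely, each check boils down to something like the following - a minimal sketch of the kind of probe described above, not Uptime Kuma's actual implementation; the site list and interval are just examples:

    # One HTTP request per site per round; alert when the status code looks unhealthy.
    # The site list and interval below are illustrative, not a recommendation.
    import time
    import requests

    SITES = [
        "https://stackexchange.com",
        "https://meta.stackexchange.com",
        "https://stackoverflow.com",
        "https://superuser.com",
    ]
    CHECK_INTERVAL = 60  # seconds between rounds, i.e. roughly one request per minute per site

    def is_up(url):
        try:
            return requests.get(url, timeout=10).status_code < 400
        except requests.RequestException:
            return False

    if __name__ == "__main__":
        while True:
            for url in SITES:
                if not is_up(url):
                    print(f"ALERT: {url} looks down")  # the real tool posts to a chatroom
            time.sleep(CHECK_INTERVAL)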

We are evaluating what an appropriate rate-limit would be to match our desired level of restriction. Our initial guess is that the new rate-limit will be set to around 60 requests per minute.

Would this be network-wide or per site?

There are mentions (on a deleted post?) that this may get extended to other cloud providers. I currently run most of my services on a dedicated server, though I might switch some of them to my home server. I'd like to think what I'm doing is of limited impact, but if it's something that could affect my overall access to SE, I'd consider it undesirable.

If I suspect I'm affected by these restrictions - is there a path to check whether I am, and to mitigate the impact on the network? I'm currently running a dedicated server on Scaleway, but I might move in the future.

As for tools like this -

Technically, this feels like ~90% of the behaviour y'all are trying to prevent (as written, though since I'm not scraping, less so in spirit): automated, non-human interactions with the network. Practically, it's a very useful tool for an avid SE user and hobbyist server admin. Should I be checking (and with whom) before setting up monitoring tools on the network?

6

Doesn't this violate CC BY-SA 4.0, under which user created content is licensed across all SE?

You are free to:

  1. Share — copy and redistribute the material in any medium or format for any purpose, even commercially.

  2. Adapt — remix, transform, and build upon the material for any purpose, even commercially.

  3. The licensor cannot revoke these freedoms as long as you follow the license terms.

and

No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

Emphasis mine.

12
  • 5
    Can you elaborate on why that would violate the license? These are restrictions on the site, not the content, and all posts are still accessible by humans.
    – dan1st
    Commented Oct 15 at 20:56
  • 2
    @dan1st Doesn't the current download page make a sort of threat that the download is only for personal use and if users are found to be using it for commercial means, they will refuse access in the future? That seems like an additional legal term. Also, I think you have to have an account to download it and a profile on every site to get the entire network - so... that's a lot of legal/technical stuff.
    – Catija
    Commented Oct 16 at 12:56
  • 1
    I guess the argument would be that it's just for the purpose of using the tool to download the dump and so they're "allowed" to police that, since there's no requirement that content licensed this way be easily accessible. But, since the dump has been so easily accessible until this change, it certainly feels like they're going against the spirit of it, even if "technically" they're not.
    – Catija
    Commented Oct 16 at 13:02
  • @Catija Indeed, that's mostly what I was referring to; though I'm also not sure whether making a resource much less available than before, and restricting the methods by which it can be accessed to the point that usability is significantly limited (no automated tools), is acceptable either.
    – Neinstein
    Commented Oct 16 at 13:36
  • Well, that would be the data dump restrictions violating the license but I don't see how changing restrictions for bots would do that.
    – dan1st
    Commented Oct 16 at 14:46
  • The CC BY-SA license doesn't apply to StackExchange Inc., which has a separate private license per ToS «you grant Stack Overflow the perpetual and irrevocable right and license to access, use, process, copy, distribute, export, display and to commercially exploit such Subscriber Content, » stackoverflow.com/legal/terms-of-service/public
    – Nemo
    Commented Oct 18 at 12:32
  • 1
    @Nemo User created content on SE is explicitly CC BY-SA. Just click on the "share" link under any question or answer to verify this.
    – Neinstein
    Commented Oct 19 at 5:16
  • @Neinstein Yes, that's what I wrote. There are two licenses. A public license for the general public (CC BY-SA) and a private license to StackExchange Inc. Have you read the ToS? Are you familiar with the concept of public license? en.wikipedia.org/wiki/Public_copyright_license
    – Nemo
    Commented Oct 20 at 6:27
  • @Nemo I'm not sure if I understood your point. The data in question is user created data, to which CC BY-SA applies explicitly as per SE ToS.
    – Neinstein
    Commented Oct 20 at 9:26
  • @Neinstein The CC BY-SA applies to the reusers who use the CC BY-SA. StackExchange Inc is not such a user. Are you familiar with the concept of dual licensing?
    – Nemo
    Commented Oct 21 at 13:13
  • @Nemo why would that apply to my point? As I already said, I'm talking about SE restricting access to user created data, which is explicitly under CC BY-SA and which is not dual licensed. As per SE TOS, users provide the data under said license, which applies to SE itself as well. There's no room for dual licensing, not that I know of at least. But if you think my point is not in accordance with the TOS, feel free to direct me to the specific part you think I'm in disagreement with.
    – Neinstein
    Commented Oct 21 at 13:22
  • SE Inc. is not bound to the CC BY-SA. Whatever the CC BY-SA asks from licensees is irrelevant for SE Inc. I don't know how to state this more clearly. Have you actually read the legal code?
    – Nemo
    Commented Oct 28 at 5:28
