Blog hosted on GitHub Pages not getting indexed by Google Search #7478
-
I've got a weird problem where none of my Material for MkDocs pages hosted on GitHub Pages are getting indexed by Google Search. No matter what I do in Google Search Console, I just get the "Discovered - currently not indexed" status, and the only thing that works is to manually submit individual URLs for indexing, one by one (very tedious and time-consuming, and rate-limited at around 10 URLs a day). What am I doing wrong - or what could be the root cause of this behavior?

What happened?

I used to run a Jekyll-based blog for this, also hosted on GitHub Pages (but using the built-in GitHub Pages GHA workflow for publishing). I also used Google Analytics at the time. This worked just fine and I never had to bother making sure my blog was indexed by Google. Maybe Google Analytics made sure all valid and visited URLs were indexed automatically.

I moved my blog onto Material for MkDocs along with a custom GHA workflow which builds and publishes my blog, and I also moved away from Google Analytics. At the same time, I did change the URL of my blog pages (although I have a forwarding mechanism to the new URL for each blog post). At some point, I noticed all of my blog posts got de-listed by Google, as I saw a massive drop in visits to my website. Perhaps this happened because I changed the URLs of the posts 🤷. I also use Umami instead of Google Analytics now. Once I logged into Google Search Console (GSC), I noticed none of the blog posts were indexed by Google under the new URLs. That's fair, I did change the post URLs after all. So I figured I needed to tell GSC where those blog posts reside now.

What did I do?
Months later, GSC still won't index any of my pages unless I explicitly tell it to do so - and I can only do this one URL at a time.

Screenshots

What am I doing wrong?

I've noticed another user (👋 @Zwyx) is having pretty much the exact same problem, where Google refuses to index their pages. And since they are using Docusaurus (source here), it really feels like this is not a Material for MkDocs problem per se.
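As a quick sanity check for pages stuck in "Discovered - currently not indexed", it's worth ruling out an accidental noindex or a missing canonical on the published pages. A minimal sketch, not taken from the original post; it assumes the Python `requests` package and uses a placeholder URL:

```python
import requests

# Placeholder URL - substitute one of your own blog post URLs.
URL = "https://example.github.io/blog/2024/some-post/"

resp = requests.get(URL, timeout=10)
html = resp.text.lower()

print("HTTP status:    ", resp.status_code)
print("X-Robots-Tag:   ", resp.headers.get("x-robots-tag", "<none>"))
print("noindex in HTML:", "noindex" in html)

# MkDocs (and therefore Material for MkDocs) emits a canonical link when
# `site_url` is set in mkdocs.yml; a missing or wrong canonical can hurt indexing.
for line in resp.text.splitlines():
    if 'rel="canonical"' in line:
        print("canonical tag:  ", line.strip())
```

If the status is 200, there is no noindex, and the canonical matches the page's own URL, the problem is more likely on the crawling/quota side, as discussed in the replies below.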
-
There seems to be a tool for figuring out why pages are discovered but not indexed. The URL Inspection Tool should give you a report, but you need to prove ownership of the site first. Hope that helps, and please do let us know if this has anything at all to do with Material for MkDocs.
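The URL Inspection Tool can also be driven programmatically via the Search Console URL Inspection API, which makes checking many pages less tedious than clicking through the UI one by one. A minimal sketch, assuming a Google Cloud service account that has already been added as a user on the verified Search Console property; the key file path, property URL, and page URL below are placeholders:

```python
from google.oauth2 import service_account
from google.auth.transport.requests import AuthorizedSession

# Placeholders - substitute your own service-account key, property, and page URL.
KEY_FILE = "service-account.json"
SITE_URL = "https://example.github.io/"          # the property as registered in GSC
PAGE_URL = "https://example.github.io/blog/2024/some-post/"

SCOPES = ["https://www.googleapis.com/auth/webmasters.readonly"]

creds = service_account.Credentials.from_service_account_file(KEY_FILE, scopes=SCOPES)
session = AuthorizedSession(creds)

# URL Inspection API endpoint (Search Console API v1).
resp = session.post(
    "https://searchconsole.googleapis.com/v1/urlInspection/index:inspect",
    json={"inspectionUrl": PAGE_URL, "siteUrl": SITE_URL},
)
resp.raise_for_status()

result = resp.json()["inspectionResult"]["indexStatusResult"]
# coverageState is e.g. "Submitted and indexed" or "Discovered - currently not indexed".
print(PAGE_URL, "->", result.get("coverageState"))
```

Note that the service account has to be granted access to the property in Search Console before this works; that is an assumption of this sketch, not something stated in the reply above.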
-
Over at Nype - https://npe.cm/ - we had the same issue; we moved away from GitHub Pages and we're still recovering from it.

1st issue is basically that, due to how GitHub Pages is set up, everything is stored on a limited number of servers behind a limited number of IP addresses, so a quota threshold gets hit and Google pauses crawling those servers altogether.

2nd issue (my guess) is that any problem with a site is "amplified" by the slow crawling, and Google sort of abandons a site after too many problems occur. You can check the status of your crawl stats in Google Search Console.

3rd issue (my guess) is that in your case, you set up your redirections on the 404 page. Googlebot should in theory handle JavaScript and detect the location change, but when a bad URL loads, GitHub sends a 404 server status code first, and only later does the correct page load (see the first sketch after this reply). At Nype we played around with 404-page JavaScript redirects.

4th issue (my guess) is that in your case, you now have a lot of pages with possibly low traffic / a low number of backlinks, so Google is reluctant to add so many links at once 🤔

We have also used the currently trending script https://github.com/goenning/google-indexing-script - it helped a bit, but due to the 2nd issue, manual requests for indexing worked better. I personally didn't do it as I'm not the property owner. (A sketch of that kind of API call follows further below.)

I hope at least some of the above helps you out 😅 but yeah, after working a bit with Google "docs", A LOT of things are not said directly, and users have to guess too much imo.
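To illustrate the 3rd point, here is a quick way to see what status code GitHub Pages actually returns for an old, now-forwarded URL before any JavaScript runs - which is the status a crawler records on the initial fetch. A minimal sketch, assuming the Python `requests` package and a placeholder old URL:

```python
import requests

# Placeholder - substitute one of your old (pre-migration) blog post URLs.
OLD_URL = "https://example.github.io/old/path/to/post/"

# Fetch without executing any JavaScript - roughly what a plain crawler fetch sees.
resp = requests.get(OLD_URL, timeout=10, allow_redirects=True)

# If the forwarding lives on the custom 404 page, this prints 404 even though a
# browser user ends up on the new post - Google may treat the URL as a hard 404.
print("status code:", resp.status_code)
print("final URL:  ", resp.url)
```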
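For reference, the script linked above appears to rely on Google's Indexing API, which is officially intended only for job-posting and livestream pages - which may be part of why results vary for regular blogs. A minimal sketch of that kind of call, assuming a service account with the Indexing API enabled and using placeholder values:

```python
from google.oauth2 import service_account
from google.auth.transport.requests import AuthorizedSession

# Placeholders - substitute your own service-account key and page URL.
KEY_FILE = "service-account.json"
PAGE_URL = "https://example.github.io/blog/2024/some-post/"

SCOPES = ["https://www.googleapis.com/auth/indexing"]

creds = service_account.Credentials.from_service_account_file(KEY_FILE, scopes=SCOPES)
session = AuthorizedSession(creds)

# Notify Google that the URL was added/updated; the default quota is around
# 200 publish requests per day.
resp = session.post(
    "https://indexing.googleapis.com/v3/urlNotifications:publish",
    json={"url": PAGE_URL, "type": "URL_UPDATED"},
)
resp.raise_for_status()
print(resp.json())
```

Whether Google actually indexes ordinary blog URLs submitted this way is not guaranteed, which matches the experience described above that manual indexing requests worked better.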