Disable search engine crawling of non-canonical forks #90

Open

PathogenDavid opened this issue Jul 25, 2024 · 2 comments

@PathogenDavid
Member

PathogenDavid commented Jul 25, 2024

It can be convenient in forks to enable deployment to GitHub Pages for the purposes of testing. However, this inadvertently creates duplicate copies of the documentation accessible on the wider public internet, which means search engines have the potential to find them.

This runs the risk of polluting search results with content which is likely outdated. I believe it also runs the risk of harming the SEO of the official documentation website. (I'm no SEO expert but my understanding is Google in particular harshly penalizes websites which duplicate other websites.)

We should generate a robots.txt and/or add the appropriate meta tags to non-canonical copies of the docs website.
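
For reference, a minimal robots.txt that tells every crawler to stay away (which is what we'd want on fork deployments) would look something like this:

```
User-agent: *
Disallow: /
```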

As a semi-related aside (since the sitemap location is referenced from robots.txt), we should also enable sitemap.xml generation. It looks like it just needs to be turned on.
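
Assuming the site is built with DocFX (and that I'm reading its docs right), turning this on should just be a matter of adding a sitemap entry under build in docfx.json; the baseUrl below is a placeholder for the canonical docs URL:

```json
{
  "build": {
    "sitemap": {
      "baseUrl": "https://example.com/docs"
    }
  }
}
```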

@glopesdev
Member

Having a robots.txt makes complete sense; I just never got around to diving into how it works properly 🙂

@PathogenDavid
Member Author

PathogenDavid commented Jul 26, 2024

One thing I didn't really think about when writing this is that the main website's robots.txt is what actually matters since the docs repo is nested in a subdirectory.

(Similarly for forks, the robots.txt in the GitHub Pages website of the user or the organization associated with the fork is what actually matters.)

This means we should probably just go the route of adding `<meta name="robots" content="noindex, nofollow">` tags to the `<head>` of every page instead.
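
A rough sketch of what that could look like as a post-build step (the canonical repository name and output directory below are placeholders, and this naively assumes the opening `<head>` tag has no attributes):

```python
# Hypothetical post-build step: mark fork deployments as non-indexable by
# injecting a robots meta tag into every generated page.
import os
from pathlib import Path

CANONICAL_REPO = "example-org/docs"  # placeholder for the canonical repository
ROBOTS_META = '<meta name="robots" content="noindex, nofollow">'


def inject_noindex(site_dir: str) -> None:
    # GITHUB_REPOSITORY is set by GitHub Actions to "owner/repo" for the current run.
    if os.environ.get("GITHUB_REPOSITORY", "").lower() == CANONICAL_REPO:
        return  # canonical build: leave the pages indexable

    for page in Path(site_dir).rglob("*.html"):
        html = page.read_text(encoding="utf-8")
        if ROBOTS_META not in html:
            # Insert the tag immediately after the opening <head> element.
            page.write_text(html.replace("<head>", f"<head>{ROBOTS_META}", 1),
                            encoding="utf-8")


if __name__ == "__main__":
    inject_noindex("_site")
```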
