Implement `respectRobotsTxtFile` crawler option #1144

When enabled, this option makes the crawler automatically fetch the robots.txt file for the site of the current request and skip URLs excluded by its `Disallow` directives.
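To make the intended behavior concrete, here is a minimal sketch of such a check using only the Python standard library. The `RobotsGate` class and the `fetch_text` callback are hypothetical names introduced for illustration; they are not part of Crawlee, and a real implementation would likely use the crawler's own HTTP client and handle fetch failures.

```python
from __future__ import annotations

from collections.abc import Callable
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsGate:
    """Hypothetical helper: caches one parsed robots.txt per origin."""

    def __init__(self, fetch_text: Callable[[str], str], user_agent: str = '*') -> None:
        # `fetch_text` is an assumed callback that returns the body of a URL as text.
        self._fetch_text = fetch_text
        self._user_agent = user_agent
        self._parsers: dict[str, RobotFileParser] = {}

    def is_allowed(self, url: str) -> bool:
        """Return True if robots.txt of the URL's origin permits fetching it."""
        parts = urlsplit(url)
        origin = f'{parts.scheme}://{parts.netloc}'
        parser = self._parsers.get(origin)
        if parser is None:
            # First request for this origin: fetch and parse its robots.txt once.
            parser = RobotFileParser()
            parser.parse(self._fetch_text(f'{origin}/robots.txt').splitlines())
            self._parsers[origin] = parser
        return parser.can_fetch(self._user_agent, url)


# Usage with a fake fetcher, so the example needs no network access.
def fake_fetch(url: str) -> str:
    return 'User-agent: *\nDisallow: /private/\n'


gate = RobotsGate(fake_fetch)
candidates = ['https://example.com/public', 'https://example.com/private/page']
allowed = [u for u in candidates if gate.is_allowed(u)]
```

Caching per origin keeps the cost to one robots.txt fetch per site rather than one per request.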

The JS version was implemented in the following PRs:

* https://github.com/apify/crawlee/pull/2910
* https://github.com/apify/crawlee/pull/2916
* https://github.com/apify/crawlee/pull/2913

We will first need to implement the `RobotsTxtFile` and `Sitemap` classes:

* https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/robots.ts
* https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/sitemap.ts
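As a rough starting point, the two classes could look something like the sketch below. The method names (`is_allowed`, `get_sitemaps`, `from_xml_string`) are assumptions chosen for illustration and may not match the final Python API; the sketch also skips fetching, sitemap indexes, and gzip handling, and relies on the standard library's `urllib.robotparser` rather than any particular parser choice made upstream.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# XML namespace used by the sitemaps.org protocol.
SITEMAP_NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'


class RobotsTxtFile:
    """Hypothetical sketch of a parsed robots.txt file."""

    def __init__(self, url: str, content: str) -> None:
        self._parser = RobotFileParser(url)
        self._parser.parse(content.splitlines())

    def is_allowed(self, url: str, user_agent: str = '*') -> bool:
        """Check the URL against the Allow/Disallow rules for the user agent."""
        return self._parser.can_fetch(user_agent, url)

    def get_sitemaps(self) -> list[str]:
        """Return sitemap URLs declared via `Sitemap:` lines, if any."""
        return self._parser.site_maps() or []


class Sitemap:
    """Hypothetical sketch of a flat list of URLs read from a sitemap."""

    def __init__(self, urls: list[str]) -> None:
        self.urls = urls

    @classmethod
    def from_xml_string(cls, content: str) -> Sitemap:
        """Collect every <loc> value from a sitemap XML document."""
        root = ET.fromstring(content)
        urls = [loc.text.strip() for loc in root.iter(f'{SITEMAP_NS}loc') if loc.text]
        return cls(urls)


robots = RobotsTxtFile(
    'https://example.com/robots.txt',
    'User-agent: *\nDisallow: /admin\nSitemap: https://example.com/sitemap.xml\n',
)
sitemap = Sitemap.from_xml_string(
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    '<url><loc>https://example.com/blog</loc></url>'
    '</urlset>'
)
```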
