
SAME_HOSTNAME not working on non www URLs #955

Description

@ROYOSTI

When using EnqueueStrategy.SAME_HOSTNAME, I noticed it does not work properly on non-www URLs.

In the debugger I noticed that origin is passed to _check_enqueue_strategy, and that origin is taken from context.request.loaded_url when it is available.
As a result, every URL that is checked mismatches because of the difference in hostname.


I tested this with multiple URLs, with and without the www prefix, and got the same behaviour.


Changing the line to origin = context.request.url fixes this issue, but I have no idea what implications this would have for the rest of the code.
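To make the mismatch concrete, here is a minimal sketch using only the standard library. It is not the library's actual implementation: same_hostname is a hypothetical stand-in for the hostname comparison, and the example.com URLs, including the assumed redirect from the bare domain to the www host, are made up for illustration.

```python
from urllib.parse import urlparse


def same_hostname(origin_url: str, target_url: str) -> bool:
    """Hypothetical stand-in for a same-hostname check: exact hostname equality."""
    return urlparse(origin_url).hostname == urlparse(target_url).hostname


# Assumed scenario: the request was enqueued for the bare domain,
# but the page that actually loaded lives on the www host.
request_url = "https://example.com/"
loaded_url = "https://www.example.com/"

# A link found on the page that points at the non-www host.
link = "https://example.com/about"

# Using the loaded URL as origin: hostnames differ, so the link is rejected.
print(same_hostname(loaded_url, link))   # False

# Using the original request URL as origin: hostnames match.
print(same_hostname(request_url, link))  # True
```

Under these assumptions, which URL is used as the origin decides whether the link passes the check, which matches what I see in the debugger.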

I use the PlaywrightCrawler in my code with context.enqueue_links.

Labels

bug: Something isn't working.
t-tooling: Issues with this label are in the ownership of the tooling team.