Skip to content

fix: fix match check for specified enqueue strategy for requests with redirect#1199

Merged
janbuchar merged 2 commits into
apify:masterfrom
Mantisus:fix-enqueue-strategy
May 20, 2025
Merged

fix: fix match check for specified enqueue strategy for requests with redirect#1199
janbuchar merged 2 commits into
apify:masterfrom
Mantisus:fix-enqueue-strategy

Conversation

@Mantisus

Copy link
Copy Markdown
Collaborator

Description

  • Fixes match check for specified enqueue strategy for requests with redirect. Before this PR, the check used the final url after the redirect, after that the original url will be used.

Issues

Testing

  • Added tests for enqueue strategy with redirect simulation.

@Mantisus Mantisus self-assigned this May 16, 2025
@Mantisus Mantisus requested a review from Copilot May 16, 2025 19:27

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes the match check for the enqueue strategy by ensuring that the original URL is used for comparison instead of the final URL after a redirect. Key changes include adding a new field (loaded_url) to test inputs, updating tests to assign the loaded URL, and modifying the crawler’s commit handler to use context.request.url for the URL check.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
tests/unit/crawlers/_basic/test_basic_crawler.py Added a new "loaded_url" field in the test input dataclass and updated test cases and request context assignment to simulate original URLs.
src/crawlee/crawlers/_basic/_basic_crawler.py Changed the URL used for the match check by replacing the earlier "origin" variable with context.request.url.

add_requests_call.get('strategy', 'all'),
target_url=urlparse(dst_request.url),
origin_url=urlparse(origin),
origin_url=urlparse(context.request.url),

Copilot AI May 16, 2025

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests assign the original URL to context.request.loaded_url but the production code uses context.request.url for the enqueue strategy check. Ensure that the correct property (either 'url' or 'loaded_url') is used consistently across the system to fix the redirect issue.

Suggested change
origin_url=urlparse(context.request.url),
origin_url=urlparse(context.request.loaded_url or context.request.url),

Copilot uses AI. Check for mistakes.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests assign context.request.loaded_url to simulate a redirect. The start url should be used when we check the strategy.


@crawler.router.handler('start')
async def start_handler(context: BasicCrawlingContext) -> None:
context.request.loaded_url = test_input.loaded_url

Copilot AI May 16, 2025

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test assigns the original URL to context.request.loaded_url but if the commit handler is intended to use the original URL for enqueue strategy testing, consider aligning the property used (for example, updating context.request.url) to match the production logic.

Suggested change
context.request.loaded_url = test_input.loaded_url
context.request.url = test_input.loaded_url

Copilot uses AI. Check for mistakes.

@Mantisus Mantisus May 16, 2025

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is correct, we do this to simulate a redirect.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should add a comment here? Copilot has a point, this looks sort of fishy 🙂

Something along the lines of "Assign test value to loaded_url - BasicCrawler does not do any navigation by itself".

@Mantisus Mantisus requested review from Pijukatel and janbuchar May 16, 2025 19:35

@vdusek vdusek left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@janbuchar janbuchar left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but please add that one comment.


@crawler.router.handler('start')
async def start_handler(context: BasicCrawlingContext) -> None:
context.request.loaded_url = test_input.loaded_url

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should add a comment here? Copilot has a point, this looks sort of fishy 🙂

Something along the lines of "Assign test value to loaded_url - BasicCrawler does not do any navigation by itself".

@janbuchar janbuchar merged commit d84c30c into apify:master May 20, 2025
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

When the page is redirected, the strategy parameter of context.enqueue_links is no longer effective.

5 participants