fix: fix match check for specified enqueue strategy for requests with redirect#1199
Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR fixes the match check for the enqueue strategy by ensuring that the original URL is used for comparison instead of the final URL after a redirect. Key changes include adding a new field (loaded_url) to test inputs, updating tests to assign the loaded URL, and modifying the crawler’s commit handler to use context.request.url for the URL check.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/unit/crawlers/_basic/test_basic_crawler.py | Added a new "loaded_url" field in the test input dataclass and updated test cases and request context assignment to simulate original URLs. |
| src/crawlee/crawlers/_basic/_basic_crawler.py | Changed the URL used for the match check by replacing the earlier "origin" variable with context.request.url. |
| add_requests_call.get('strategy', 'all'), | ||
| target_url=urlparse(dst_request.url), | ||
| origin_url=urlparse(origin), | ||
| origin_url=urlparse(context.request.url), |
There was a problem hiding this comment.
The tests assign the original URL to context.request.loaded_url but the production code uses context.request.url for the enqueue strategy check. Ensure that the correct property (either 'url' or 'loaded_url') is used consistently across the system to fix the redirect issue.
| origin_url=urlparse(context.request.url), | |
| origin_url=urlparse(context.request.loaded_url or context.request.url), |
There was a problem hiding this comment.
Tests assign context.request.loaded_url to simulate a redirect. The start url should be used when we check the strategy.
|
|
||
| @crawler.router.handler('start') | ||
| async def start_handler(context: BasicCrawlingContext) -> None: | ||
| context.request.loaded_url = test_input.loaded_url |
There was a problem hiding this comment.
The test assigns the original URL to context.request.loaded_url but if the commit handler is intended to use the original URL for enqueue strategy testing, consider aligning the property used (for example, updating context.request.url) to match the production logic.
| context.request.loaded_url = test_input.loaded_url | |
| context.request.url = test_input.loaded_url |
There was a problem hiding this comment.
This is correct, we do this to simulate a redirect.
There was a problem hiding this comment.
Perhaps we should add a comment here? Copilot has a point, this looks sort of fishy 🙂
Something along the lines of "Assign test value to loaded_url - BasicCrawler does not do any navigation by itself".
janbuchar
left a comment
There was a problem hiding this comment.
LGTM, but please add that one comment.
|
|
||
| @crawler.router.handler('start') | ||
| async def start_handler(context: BasicCrawlingContext) -> None: | ||
| context.request.loaded_url = test_input.loaded_url |
There was a problem hiding this comment.
Perhaps we should add a comment here? Copilot has a point, this looks sort of fishy 🙂
Something along the lines of "Assign test value to loaded_url - BasicCrawler does not do any navigation by itself".
Description
Issues
Testing