Skip to content

feat!: Implement RequestManagerTandem, remove add_request from RequestList, accept any iterable in RequestList constructor#777

Merged
janbuchar merged 43 commits into
masterfrom
request-provider-tandem
Dec 19, 2024
Merged

feat!: Implement RequestManagerTandem, remove add_request from RequestList, accept any iterable in RequestList constructor#777
janbuchar merged 43 commits into
masterfrom
request-provider-tandem

Conversation

@janbuchar

@janbuchar janbuchar commented Dec 3, 2024

Copy link
Copy Markdown
Collaborator

Tandem, or in tandem, is an arrangement in which two or more animals, machines, or people are lined up one behind another, all facing in the same direction.[1] Tandem can also be used more generally to refer to any group of persons or objects working together, not necessarily in line.[1]
(https://en.wikipedia.org/wiki/Tandem)

Breaking changes

  • RequestList does not support .drop(), .reclaim_request(), .add_request() and add_requests_batched() anymore
    • RequestManagerTandem with a RequestQueue should be used for this use case, await list.to_tandem() can be used as a shortcut
  • The RequestProvider interface has been renamed to RequestManager and moved to the crawlee.request_loaders package
  • RequestList has been moved to the crawlee.request_loaders package
  • The BasicCrawler.get_request_provider method has been renamed to BasicCrawler.get_request_manager and it does not accept the id and name arguments anymore
  • The request_provider parameter of BasicCrawler.__init__ has been renamed to request_manager

TODO

  • new tests
  • fix existing tests

@janbuchar janbuchar added t-tooling Issues with this label are in the ownership of the tooling team. adhoc Ad-hoc unplanned task added during the sprint. labels Dec 3, 2024
@janbuchar janbuchar requested review from Pijukatel and vdusek December 3, 2024 15:30
@github-actions github-actions Bot added this to the 104th sprint - Tooling team milestone Dec 3, 2024
Comment thread src/crawlee/storages/_request_list.py
@janbuchar janbuchar changed the title feat!: Implement RequestProviderTandem feat!: Implement RequestProviderTandem, remove addRequest from RequestList Dec 4, 2024
@janbuchar janbuchar changed the title feat!: Implement RequestProviderTandem, remove addRequest from RequestList feat!: Implement RequestSourceTandem, remove addRequest from RequestList Dec 4, 2024
Comment thread src/crawlee/storages/_request_source.py
@github-actions github-actions Bot added the tested Temporary label used only programatically for some analytics. label Dec 4, 2024
@janbuchar janbuchar requested a review from Pijukatel December 4, 2024 13:45
@janbuchar janbuchar marked this pull request as ready for review December 4, 2024 13:45
Comment thread tests/unit/storages/test_request_source_tandem.py Outdated

@vdusek vdusek left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the tests, it seems to be working, and I like how it's used. Good job. However, I have a few comments (& maybe concerns)...

Comment thread src/crawlee/storages/__init__.py Outdated
Comment thread src/crawlee/basic_crawler/_basic_crawler.py Outdated
Comment thread src/crawlee/storages/__init__.py
Comment thread src/crawlee/storages/_request_source.py Outdated
Comment thread src/crawlee/storages/_request_source.py Outdated
Comment thread src/crawlee/storages/_request_source_tandem.py Outdated
Comment thread src/crawlee/storages/_request_source_tandem.py Outdated
Comment thread tests/unit/basic_crawler/test_basic_crawler.py Outdated
Comment thread tests/unit/basic_crawler/test_basic_crawler.py Outdated
Comment thread tests/unit/basic_crawler/test_basic_crawler.py Outdated
vdusek

This comment was marked as resolved.

janbuchar and others added 2 commits December 6, 2024 16:24

@vdusek vdusek left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm starting to like it, good job. Just a few more details...

Comment thread src/crawlee/request_sources/__init__.py Outdated
Comment thread docs/guides/request_storage.mdx Outdated

## Processing requests from multiple sources

In some cases, you might need to combine requests from multiple sources, most frequently from a static list of URLs and a <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>, where the queue takes care of persistence and retrying failed requests.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe mention and add the link to the RequestList as well?

Comment thread src/crawlee/request_sources/_request_list.py Outdated
Comment thread src/crawlee/request_sources/_request_source.py

@vdusek vdusek left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks! Only a few nits...

And before merging, could you please:

  • Update the PR title to match the new version.
  • Mention the breaking changes in the upgrading guide.
  • We will also need to update the Request storage and Result storage guides to reflect these changes. It doesn't have to happen now, but could you at least open an issue for that?

Comment thread docs/guides/code/request_storage/rl_with_crawler_example.py Outdated
Comment thread docs/guides/code/request_storage/tandem_example.py Outdated
Comment thread src/crawlee/basic_crawler/_basic_crawler.py Outdated
@janbuchar janbuchar changed the title feat!: Implement RequestSourceTandem, remove addRequest from RequestList feat!: Implement RequestManagerTandem, remove add_request from RequestList Dec 19, 2024
@janbuchar janbuchar changed the title feat!: Implement RequestManagerTandem, remove add_request from RequestList feat!: Implement RequestManagerTandem, remove add_request from RequestList, accept any iterable in RequestList constructor Dec 19, 2024

@vdusek vdusek left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@janbuchar janbuchar merged commit 4172652 into master Dec 19, 2024
@janbuchar janbuchar deleted the request-provider-tandem branch December 19, 2024 16:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

adhoc Ad-hoc unplanned task added during the sprint. t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics. v0.5

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants