Skip to content

feat: Add use_state context method#682

Merged
janbuchar merged 19 commits into
apify:masterfrom
Mantisus:impl-use-state
Dec 10, 2024
Merged

feat: Add use_state context method#682
janbuchar merged 19 commits into
apify:masterfrom
Mantisus:impl-use-state

Conversation

@Mantisus

@Mantisus Mantisus commented Nov 11, 2024

Copy link
Copy Markdown
Collaborator

Description

Add use_state context helper method with auto-save behavior inspired by TS's realization of crawlee.
Adds a get_auto_saved_value method for kvs. Working with internal cache and saving data on PERSIST_STATE event

(At this point, it's more of a gut check and synchronization with the team. It will probably require some more refinement)

Issues

Testing

For correct testing, it is necessary to use event_manager in tests for KeyValueStore, therefore I have added a new fixture.

Checklist

  • CI passed

@janbuchar janbuchar self-requested a review November 12, 2024 11:42
@Mantisus Mantisus changed the title First implementation use_state feat: First implementation use_state Nov 12, 2024

@janbuchar janbuchar left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the right way to go, see my comments please 🙂

Comment thread tests/unit/basic_crawler/test_basic_crawler.py Outdated
Comment thread tests/unit/basic_crawler/test_basic_crawler.py Outdated
"""
return await self._resource_client.get_public_url(key)

async def get_auto_saved_value(self, key: str, default_value: dict[str, Any] | None = None) -> dict[str, Any]:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, crawlee-js does this in two separate ways. What you ported over is the "regular" one. Apart from that, we have adaptive crawling, where useState bypasses get_auto_saved_value. See https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts for details.

The general idea behind this is that we don't want to "save" the changed state before we know that the request handler succeeded.

We should probably discuss what would be the best way forward. Try to give the javascript adaptive crawler a glance, but if it doesn't make any sense, just say so so that we don't waste any time. Then we'll talk more 🙂

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked up adaptive-playwright-crawler, if I understand its workings correctly, it completely overrides the default useState

We could do the same thing when we need to implement this in Python. Given that this behavior is specific to one particular implementation of crawler.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's basically how it works. The reason is that in adaptive crawling, we routinely run request handler functions twice and we do not want to introduce duplicities to the storages (or inconsistent state).

However, re-running the request handler is common enough to warrant only "commiting" storage changes after a request handler is finished in all cases, not just the adaptive one. In the javascript version, this functionality was not made default 1. because it would break BC and 2. because it was a bit of an experiment.

What I want to say is, we should think this through. Once we implement this, it will be much harder to change. By the way, adaptive crawling for the python version is very much on the roadmap 🙂 (#249)

@vdusek vdusek added the t-tooling Issues with this label are in the ownership of the tooling team. label Nov 13, 2024
@vdusek vdusek added this to the 102nd sprint - Tooling team milestone Nov 13, 2024
@Mantisus Mantisus changed the title feat: First implementation use_state feat: Implementation use_state Nov 13, 2024
@vdusek vdusek changed the title feat: Implementation use_state feat: Add use_state context method Nov 26, 2024
@Mantisus Mantisus requested a review from janbuchar November 26, 2024 11:15
@vdusek vdusek self-requested a review November 27, 2024 13:43
vdusek

This comment was marked as resolved.

Mantisus and others added 5 commits November 27, 2024 15:51
Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>
Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>
Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>
Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>

@vdusek vdusek left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job 🙂

@janbuchar janbuchar left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense, but I have a couple of concerns 🙂

Comment thread src/crawlee/_types.py Outdated
Comment thread src/crawlee/basic_crawler/_basic_crawler.py Outdated
Comment thread src/crawlee/basic_crawler/_basic_crawler.py Outdated
Comment thread src/crawlee/storages/_key_value_store.py Outdated
Comment thread src/crawlee/storages/_key_value_store.py Outdated
@janbuchar

Copy link
Copy Markdown
Collaborator

One thing we came up with @vdusek - we should mark the use_state helper as experimental. We may want to change the behavior later on, so let's make that clear.

Comment thread src/crawlee/_types.py
Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>
Comment thread src/crawlee/storages/_key_value_store.py
Comment thread tests/unit/basic_crawler/test_basic_crawler.py Outdated
@janbuchar janbuchar self-requested a review December 4, 2024 21:25
@Mantisus Mantisus added the enhancement New feature or request. label Dec 6, 2024

@janbuchar janbuchar left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems legit, thanks!

@janbuchar janbuchar merged commit 868b41e into apify:master Dec 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request. t-tooling Issues with this label are in the ownership of the tooling team.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement use_state context helper method

3 participants