feat: Add use_state context method#682
Conversation
use_stateuse_state
janbuchar
left a comment
There was a problem hiding this comment.
Seems like the right way to go, see my comments please 🙂
| """ | ||
| return await self._resource_client.get_public_url(key) | ||
|
|
||
| async def get_auto_saved_value(self, key: str, default_value: dict[str, Any] | None = None) -> dict[str, Any]: |
There was a problem hiding this comment.
Currently, crawlee-js does this in two separate ways. What you ported over is the "regular" one. Apart from that, we have adaptive crawling, where useState bypasses get_auto_saved_value. See https://github.com/apify/crawlee/blob/master/packages/playwright-crawler/src/internals/adaptive-playwright-crawler.ts for details.
The general idea behind this is that we don't want to "save" the changed state before we know that the request handler succeeded.
We should probably discuss what would be the best way forward. Try to give the javascript adaptive crawler a glance, but if it doesn't make any sense, just say so so that we don't waste any time. Then we'll talk more 🙂
There was a problem hiding this comment.
I looked up adaptive-playwright-crawler, if I understand its workings correctly, it completely overrides the default useState
We could do the same thing when we need to implement this in Python. Given that this behavior is specific to one particular implementation of crawler.
There was a problem hiding this comment.
Yes, that's basically how it works. The reason is that in adaptive crawling, we routinely run request handler functions twice and we do not want to introduce duplicities to the storages (or inconsistent state).
However, re-running the request handler is common enough to warrant only "commiting" storage changes after a request handler is finished in all cases, not just the adaptive one. In the javascript version, this functionality was not made default 1. because it would break BC and 2. because it was a bit of an experiment.
What I want to say is, we should think this through. Once we implement this, it will be much harder to change. By the way, adaptive crawling for the python version is very much on the roadmap 🙂 (#249)
use_stateuse_state
use_stateuse_state context method
Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>
Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>
Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>
Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>
janbuchar
left a comment
There was a problem hiding this comment.
It makes sense, but I have a couple of concerns 🙂
|
One thing we came up with @vdusek - we should mark the |
Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>
Description
Add
use_statecontext helper method with auto-save behavior inspired by TS's realization ofcrawlee.Adds a
get_auto_saved_valuemethod forkvs. Working with internal cache and saving data onPERSIST_STATEevent(At this point, it's more of a gut check and synchronization with the team. It will probably require some more refinement)
Issues
use_statecontext helper method #191Testing
For correct testing, it is necessary to use event_manager in tests for KeyValueStore, therefore I have added a new fixture.
Checklist