How to reset crawlee URL cache? #351

Description

@GHOST-LOVE-YOU

I want to use FastAPI with Crawlee to scrape post content from a forum. I expect the crawler to re-run on every POST request, but on the second request the links that were requested before are not requested again. I have learned two solutions from this post:

  1. Set environment variables
    I did this, but it didn't work:
@app.post("/crawl")
async def crawl(auth_data: AuthData):
    if not authenticate(auth_data.username, auth_data.password):
        raise HTTPException(status_code=401, detail="Unauthorized")
    
    if not crawler_semaphore.locked():
        async with crawler_semaphore:
            try:
                crawl_results.clear()
                await run_crawler()
                return {"status": "success", "data": crawl_results}
            except Exception as e:
                raise HTTPException(status_code=500, detail=str(e))
    else:
        raise HTTPException(status_code=429, detail="Crawler is already running. Please try again later.")


async def run_crawler():
    config = Configuration(
        persist_storage=False,
        purge_on_start=True,
    )
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=500,
        request_handler=router,
        configuration=config,
    )
    unique_key = str(uuid.uuid4())
    request = BaseRequestData(url='https://localhost:8888', unique_key=unique_key)
    
    await crawler.run([request])
  2. Modify the unique_key for each request
    I did this for the start URL and it worked, but I don't know how to set a unique_key for the requests created by enqueue_links:
await context.enqueue_links(
    selector=f"a[href='{post['url']}']",
    label='DETAIL',
)
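For the second approach, one possible pattern is to derive each unique_key from a per-run ID plus the URL, so the same links can be requested again in a later run while still being deduplicated within a single run. The sketch below uses only the standard library. `RequestSpec`, `new_run_id`, `per_run_requests`, and `ToyQueue` are hypothetical names for illustration, not Crawlee APIs, and the toy queue only imitates deduplication by key. How such keys get attached to links discovered by enqueue_links depends on the Crawlee version, so that wiring is an assumption to check against the docs.

```python
import uuid
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class RequestSpec:
    """Hypothetical stand-in for a crawler request (not a Crawlee class)."""
    url: str
    unique_key: str
    label: Optional[str] = None


def new_run_id() -> str:
    """Return a fresh identifier for one crawl run."""
    return uuid.uuid4().hex


def per_run_requests(
    urls: Iterable[str], run_id: str, label: Optional[str] = None
) -> List[RequestSpec]:
    """Build requests whose unique_key is scoped to the given run."""
    return [
        RequestSpec(url=u, unique_key=f"{run_id}:{u}", label=label)
        for u in urls
    ]


class ToyQueue:
    """Toy queue that skips any unique_key it has already seen."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()
        self.accepted: List[RequestSpec] = []

    def add(self, requests: Iterable[RequestSpec]) -> int:
        """Add requests, ignoring seen keys; return how many were accepted."""
        added = 0
        for req in requests:
            if req.unique_key in self._seen:
                continue
            self._seen.add(req.unique_key)
            self.accepted.append(req)
            added += 1
        return added
```

With this scheme, adding the same URL twice under one run ID yields a single request, while a new run ID makes the same URLs eligible again even if the queue's state survived from the previous run.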

Thank you very much for your help!

Labels: bug, t-tooling
