I want to use FastAPI + Crawlee to scrape the post content of a forum. I want my crawler to re-run every time a POST request is made, but on the second request the links that were requested before are not requested again. I have learned two solutions from this post:
- Set environment variables
I tried this, but it didn't work:
@app.post("/crawl")
async def crawl(auth_data: AuthData):
    if not authenticate(auth_data.username, auth_data.password):
        raise HTTPException(status_code=401, detail="Unauthorized")
    if not crawler_semaphore.locked():
        async with crawler_semaphore:
            try:
                crawl_results.clear()
                await run_crawler()
                return {"status": "success", "data": crawl_results}
            except Exception as e:
                raise HTTPException(status_code=500, detail=str(e))
    else:
        raise HTTPException(status_code=429, detail="Crawler is already running. Please try again later.")


async def run_crawler():
    config = Configuration(
        persist_storage=False,
        purge_on_start=True,
    )
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=500,
        request_handler=router,
        configuration=config,
    )
    unique_key = str(uuid.uuid4())
    request = BaseRequestData(url='https://localhost:8888', unique_key=unique_key)
    await crawler.run([request])
- Modify the unique_key for each request
I did this for the base URL and it really worked, but I don't know how to set a unique_key for the requests created by enqueue_links:
await context.enqueue_links(
    selector=f"a[href='{post['url']}']",
    label='DETAIL',
)
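To make the unique_key idea concrete, here is a minimal sketch of why the second run skips links and why a per-run key avoids that. It assumes deduplication is keyed on unique_key. `ToyQueue`, `run_scoped_key`, and the detail URL are hypothetical stand-ins for illustration, not part of Crawlee:

```python
import uuid


def run_scoped_key(url: str, run_id: str) -> str:
    """Hypothetical helper: build a deduplication key that differs per crawl run."""
    return f"{run_id}:{url}"


class ToyQueue:
    """Hypothetical stand-in for a request queue that deduplicates on unique_key."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def add(self, unique_key: str) -> bool:
        """Return True if the request is accepted, False if it is a duplicate."""
        if unique_key in self._seen:
            return False
        self._seen.add(unique_key)
        return True


queue = ToyQueue()
url = "https://localhost:8888/post/1"  # hypothetical detail-page URL

# Keyed on the URL alone: the second attempt is treated as a duplicate.
first_plain = queue.add(url)
second_plain = queue.add(url)

# Keyed on a fresh run id plus the URL: every run is accepted.
first_scoped = queue.add(run_scoped_key(url, str(uuid.uuid4())))
second_scoped = queue.add(run_scoped_key(url, str(uuid.uuid4())))
```

This is what already works for the start request above. What I am missing is how to attach such a key to the requests that enqueue_links creates.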
Thank you very much for your help!