Both the Playwright and BeautifulSoup crawlers suffer from the same issue with enqueue_links.
BeautifulSoup:
import asyncio
import logging

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

logging.basicConfig(level=logging.INFO)


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        await context.enqueue_links()
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }
        await context.push_data(data)

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
or Playwright:
import asyncio
import logging

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

logging.basicConfig(level=logging.INFO)


async def main() -> None:
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        await context.enqueue_links()
        record = {
            'request_url': context.request.url,
            'page_url': context.page.url,
            'page_title': await context.page.title(),
        }
        await context.push_data(record)

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
Only 16 URLs are processed:
[
"https://crawlee.dev/",
"https://crawlee.dev/docs/guides/typescript-project",
"https://crawlee.dev/docs/guides/javascript-rendering",
"https://crawlee.dev/docs/guides/avoid-blocking",
"https://crawlee.dev/docs/guides/cheerio-crawler-guide",
"https://crawlee.dev/docs/guides/jsdom-crawler-guide",
"https://crawlee.dev/api/core/class/AutoscaledPool",
"https://crawlee.dev/docs/guides/proxy-management",
"https://crawlee.dev/docs/guides/result-storage",
"https://crawlee.dev/docs/guides/request-storage",
"https://crawlee.dev/api/utils/namespace/social",
"https://crawlee.dev/api/basic-crawler/interface/BasicCrawlerOptions",
"https://crawlee.dev/api/utils",
"https://crawlee.dev/docs/quick-start",
"https://crawlee.dev/docs/deployment/aws-cheerio",
"https://crawlee.dev/docs/deployment/gcp-cheerio"
]
Clearly, there are many more pages on crawlee.dev than are being processed.
The problem is probably somewhere in BasicCrawler's _check_enqueue_strategy or a similar function.
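For reference, here is a minimal sketch of the behaviour I would expect from a same-hostname check. It is not Crawlee's actual implementation: the helper is_same_hostname and the sample links are hypothetical and only illustrate which links should survive the filter, assuming relative links are resolved against the page they were found on.

```python
from urllib.parse import urljoin, urlparse


def is_same_hostname(origin_url: str, target_url: str) -> bool:
    """Hypothetical helper: True if target_url resolves to origin_url's hostname."""
    # Resolve relative links against the page they were found on.
    resolved = urljoin(origin_url, target_url)
    return urlparse(resolved).hostname == urlparse(origin_url).hostname


# Made-up sample links, as they might appear in href attributes.
origin = 'https://crawlee.dev/'
links = [
    '/docs/quick-start',
    'https://crawlee.dev/api/utils',
    'https://github.com/apify/crawlee',
    'docs/examples',
]

# Keep only the links that stay on the origin's hostname.
kept = [link for link in links if is_same_hostname(origin, link)]
print(kept)
```

If the real check compares links before resolving them, or compares something other than the hostname, relative links like the ones above could be dropped, which would be consistent with the small number of pages processed.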