Skip to content

fix: Do not add a request that is already in progress to MemoryRequestQueueClient#1384

Merged
vdusek merged 1 commit into
apify:masterfrom
Mantisus:fix-memory-storage
Sep 3, 2025
Merged

fix: Do not add a request that is already in progress to MemoryRequestQueueClient#1384
vdusek merged 1 commit into
apify:masterfrom
Mantisus:fix-memory-storage

Conversation

@Mantisus

@Mantisus Mantisus commented Sep 1, 2025

Copy link
Copy Markdown
Collaborator

Description

  • Do not add a request that is already in progress for MemoryRequestQueueClient.

Issues

@Mantisus Mantisus requested a review from vdusek September 1, 2025 21:33
@Mantisus Mantisus self-assigned this Sep 1, 2025
@Mantisus Mantisus changed the title fix: Do not add a request that is already in progress to MemoryRequestQueueClient. fix: Do not add a request that is already in progress to MemoryRequestQueueClient Sep 1, 2025
@vdusek

vdusek commented Sep 2, 2025

Copy link
Copy Markdown
Collaborator

Old version:

┌───────────────────────────────┬────────────────┐
│ requests_finished             │ 9024           │
│ requests_failed               │ 0              │
│ retry_histogram               │ [9024]         │
│ request_avg_failed_duration   │ None           │
│ request_avg_finished_duration │ 482.4ms        │
│ requests_finished_per_minute  │ 541            │
│ requests_failed_per_minute    │ 0              │
│ request_total_duration        │ 1h 12min 33.1s │
│ requests_total                │ 9024           │
│ crawler_runtime               │ 16min 40.5s    │
└───────────────────────────────┴────────────────┘

New version:

┌───────────────────────────────┬─────────────┐
│ requests_finished             │ 4512        │
│ requests_failed               │ 0           │
│ retry_histogram               │ [4512]      │
│ request_avg_failed_duration   │ None        │
│ request_avg_finished_duration │ 444.8ms     │
│ requests_finished_per_minute  │ 579         │
│ requests_failed_per_minute    │ 0           │
│ request_total_duration        │ 33min 26.9s │
│ requests_total                │ 4512        │
│ crawler_runtime               │ 7min 47.7s  │
└───────────────────────────────┴─────────────┘

With the new version, we only process about half as many requests.

However, based on the SDK's deduplication, there should only be 2362 unique pages, so there still seems to be an issue.

cc @Pijukatel

@Mantisus

Mantisus commented Sep 2, 2025

Copy link
Copy Markdown
Collaborator Author

However, based on the SDK's deduplication, there should only be 2362 unique pages, so there still seems to be an issue.

2363 if the start URL is https://crawlee.dev/
4512 if the start URL is http://crawlee.dev/

This is a non-deduplication error.

In the JS version, similarly

@vdusek

vdusek commented Sep 2, 2025

Copy link
Copy Markdown
Collaborator

Oh, thanks for the explanation

@vdusek vdusek requested a review from Pijukatel September 2, 2025 11:47

@vdusek vdusek left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, let's wait for @Pijukatel as well, but I believe this is fine

@vdusek vdusek merged commit 3af326c into apify:master Sep 3, 2025
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MemoryStorageClient issue with deduplication

3 participants