
When split_pdf_page is enabled, only the last set of API requests succeeds. #220


Open
issj6 opened this issue Oct 14, 2024 · 1 comment

issj6 commented Oct 14, 2024

Describe the bug
When I set split_pdf_page=True and split_pdf_concurrency_level=15, every split request except the last one fails.
For example, if the PDF is divided into 10 sets, the client logs:
ERROR: Failed to send request for page 1
...
WARNING: Failed to partition set #1, its elements will be omitted in the final result.
...
WARNING: Failed to partition set #9, its elements will be omitted in the final result.
INFO: Successfully partitioned set #10, elements added to the final result.

To Reproduce
code:

import os

import requests
import unstructured_client
from unstructured_client.models import operations, shared

# Placeholder credentials; the server URL is left blank here.
os.environ["UNSTRUCTURED_API_KEY"] = "EMPTY"
os.environ["UNSTRUCTURED_API_URL"] = ""

# Pass a single requests session to the client for all API calls.
requests_client = requests.Session()
client = unstructured_client.UnstructuredClient(
    api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"),
    server_url=os.getenv("UNSTRUCTURED_API_URL"),
    client=requests_client,
)

filename = "./test_pdf.pdf"

# Read the PDF up front so the file handle is closed before the request is sent.
with open(filename, "rb") as file:
    content = file.read()

req = operations.PartitionRequest(
    partition_parameters=shared.PartitionParameters(
        files=shared.Files(
            content=content,
            file_name=filename,
        ),
        strategy=shared.Strategy.HI_RES,
        split_pdf_page=True,
        split_pdf_concurrency_level=15,
        chunking_strategy=shared.ChunkingStrategy("by_title"),
    )
)

try:
    res = client.general.partition(req)
    element_dicts = list(res.elements)

    print(element_dicts)
    for e in element_dicts:
        print(e["text"])
except Exception as e:
    print(e)
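
Judging by the console output, the failed sets are only logged and the call still returns, so the `except` branch above does not fire and the result is silently partial. Below is a minimal sketch for spotting that, assuming each returned element is a dict with a `metadata.page_number` field. The helper name `missing_pages` and the sample data are illustrative and not part of the client. With chunking enabled a chunk may report only its first page, so treat the output as a hint rather than proof.

```python
from typing import Any, Dict, Iterable, List


def missing_pages(elements: Iterable[Dict[str, Any]], total_pages: int) -> List[int]:
    """Return the 1-based page numbers that no element claims to come from."""
    seen = set()
    for element in elements:
        # Assumption: page numbers live under element["metadata"]["page_number"].
        page = (element.get("metadata") or {}).get("page_number")
        if isinstance(page, int):
            seen.add(page)
    return [p for p in range(1, total_pages + 1) if p not in seen]


# Hypothetical data: only the last page of a 23-page PDF came back.
sample = [{"text": "last page", "metadata": {"page_number": 23}}]
print(missing_pages(sample, total_pages=23))
```

In the script above this could be called as `missing_pages(element_dicts, total_pages=23)` right after the partition call.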

Console Information:

INFO: Preparing to split document for partition.
INFO: Concurrency level set to 15
INFO: Splitting pages 1 to 23 (23 total)
INFO: Determined optimal split size of 2 pages.
INFO: Partitioning 11 files with 2 page(s) each.
INFO: Partitioning 1 file with 1 page(s).
INFO: Partitioning set #1 (pages 1-2).
INFO: Partitioning set #2 (pages 3-4).
INFO: Partitioning set #3 (pages 5-6).
INFO: Partitioning set #4 (pages 7-8).
INFO: Partitioning set #5 (pages 9-10).
INFO: Partitioning set #6 (pages 11-12).
INFO: Partitioning set #7 (pages 13-14).
INFO: Partitioning set #8 (pages 15-16).
INFO: Partitioning set #9 (pages 17-18).
INFO: Partitioning set #10 (pages 19-20).
INFO: Partitioning set #11 (pages 21-22).
INFO: Partitioning set #12 (pages 23-23).
ERROR: Failed to send request for page 1
ERROR: Failed to send request for page 3
ERROR: Failed to send request for page 5
ERROR: Failed to send request for page 7
ERROR: Failed to send request for page 9
ERROR: Failed to send request for page 11
ERROR: Failed to send request for page 13
ERROR: Failed to send request for page 15
ERROR: Failed to send request for page 17
ERROR: Failed to send request for page 19
ERROR: Failed to send request for page 21
WARNING: Failed to partition set #1, its elements will be omitted in the final result.
WARNING: Failed to partition set #2, its elements will be omitted in the final result.
WARNING: Failed to partition set #3, its elements will be omitted in the final result.
WARNING: Failed to partition set #4, its elements will be omitted in the final result.
WARNING: Failed to partition set #5, its elements will be omitted in the final result.
WARNING: Failed to partition set #6, its elements will be omitted in the final result.
WARNING: Failed to partition set #7, its elements will be omitted in the final result.
WARNING: Failed to partition set #8, its elements will be omitted in the final result.
WARNING: Failed to partition set #9, its elements will be omitted in the final result.
WARNING: Failed to partition set #10, its elements will be omitted in the final result.
WARNING: Failed to partition set #11, its elements will be omitted in the final result.
INFO: Successfully partitioned set #12, elements added to the final result.
INFO: Successfully partitioned the document.
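
To make the pattern in the log easier to read, here is a small sketch of the page-range arithmetic it implies: 23 pages with a split size of 2 gives 12 sets, and the pages named in the ERROR lines are the first pages of sets 1 to 11. The function `split_ranges` is a hypothetical helper written for this illustration, not the client's actual splitting code.

```python
from typing import List, Tuple


def split_ranges(total_pages: int, split_size: int) -> List[Tuple[int, int]]:
    """Return inclusive 1-based (first_page, last_page) ranges, one per set."""
    return [
        (start, min(start + split_size - 1, total_pages))
        for start in range(1, total_pages + 1, split_size)
    ]


# Values taken from the log above: 23 pages, split size of 2 pages.
ranges = split_ranges(total_pages=23, split_size=2)

# Every set except the last one failed in the log.
failed_first_pages = [first for first, _ in ranges[:-1]]

print(len(ranges))          # 12 sets in total
print(ranges[-1])           # (23, 23), the only set that succeeded
print(failed_first_pages)   # first pages of the 11 failed sets: 1, 3, ..., 21
```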

sam-ayo commented Oct 23, 2024

I get the same error

@awalker4 awalker4 transferred this issue from Unstructured-IO/unstructured-api Jan 29, 2025