Sisyphe.core.sisyphePDF

External packages/modules

class Sisyphe.core.sisyphePDF.SisypheParsePdf

Description

This class parses PDF files using OCR:

  • extract all strings from the PDF by OCR,

  • search for table fields (table column headers),

  • extract the table field values.

Inheritance

object -> SisypheParsePdf

Creation: 15/12/2025 Last revision: 10/01/2026

appendFieldNamesToExclude(fields: str | list[str])

Add strings to the list of strings to exclude from the OCR detection.

Parameters

fields : str | list[str]

list of strings to exclude.

appendFieldNamesToExtract(fields: str | list[str])

Add fields to the list of table fields (header column) to search for in the PDF.

Parameters

fields : str | list[str]

list of fields to be searched in the PDF.

clearFieldNames() -> None

Clear the list of table fields to be searched in the PDF.

classmethod detect(filename: str) -> list[tuple[list[list[float]], str, float]]

Extract all strings from a PDF.

Parameters

filename : str

PDF filename

Returns

list[tuple[list[list[float]], str, float]]

One tuple for each string extracted from the PDF:

  • list[list[float]], four corners of the detected string’s bounding box.

  • str, detected string.

  • float, OCR confidence level, in percent.
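The return shape described above can be illustrated with hand-written stand-in data. The following is a minimal sketch: the sample values are invented for illustration (they are not real OCR output), the helper filter_by_confidence is hypothetical and not part of SisypheParsePdf, and the confidence is assumed to lie on a 0 to 100 scale because it is documented as a percentage.

```python
# One detect() result, as documented: (bounding box corners, text, confidence).
Detection = tuple[list[list[float]], str, float]


def filter_by_confidence(extracted: list[Detection], threshold: float) -> list[str]:
    """Return the text of every detection whose confidence is >= threshold.

    Hypothetical helper, not part of SisypheParsePdf.
    """
    return [text for _box, text, conf in extracted if conf >= threshold]


# Invented sample, shaped like the documented return value of detect().
sample: list[Detection] = [
    ([[10.0, 10.0], [90.0, 10.0], [90.0, 30.0], [10.0, 30.0]], "Volume", 98.5),
    ([[10.0, 40.0], [90.0, 40.0], [90.0, 60.0], [10.0, 60.0]], "Vo1ume", 41.0),
]

# Keep only the strings the OCR engine was reasonably sure about.
confident = filter_by_confidence(sample, threshold=80.0)
```

With real data, the list passed to filter_by_confidence would be the value returned by SisypheParsePdf.detect(filename).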

getFieldNamesToExclude() -> list[str]

Get the list of strings to exclude from the OCR detection.

Returns

list[str]

list of strings to exclude.

getFieldNamesToExtract() -> list[str]

Get the list of table fields to be searched in the PDF.

Returns

list[str]

list of fields (header column) to be searched in the PDF.

classmethod hasField(extracted: list[tuple[list[list[float]], str, float]], fieldname: str) -> bool

Check if a string was extracted from a PDF.

Parameters

extracted : list[tuple[list[list[float]], str, float]]

list of extracted strings from the PDF, processed with the detect() method.

fieldname : str

string to search for in the PDF.

Returns

bool

True if the string was extracted from the PDF.

hasFieldNamesToExclude() -> bool

Check if the list of strings to exclude from the OCR detection is not empty.

Returns

bool

True if the list of strings to exclude is not empty.

hasFieldNamesToExtract() -> bool

Check if the list of table fields (header column) to be searched in the PDF is not empty.

Returns

bool

True if the list of table fields to be searched in the PDF is not empty.

classmethod hasFields(extracted: list[tuple[list[list[float]], str, float]], fieldnames: list[str]) -> dict[str, tuple[bool, list[int]]]

Check if a list of strings was extracted from a PDF.

Parameters

extracted : list[tuple[list[list[float]], str, float]]

list of extracted strings from the PDF, processed with the detect() method.

fieldnames : list[str]

list of strings to search for in the PDF.

Returns

dict[str, tuple[bool, list[int]]]

  • keys are the strings searched for

  • values are tuples of a bool, True if the key string was extracted from the PDF, and a list of the indices of the key string within the extracted list.
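The structure of this return value can be sketched as follows. This is an illustration only, not the library's implementation: the function has_fields_sketch is hypothetical, it assumes exact string equality (the matching rule actually used by hasFields is not stated here), and the sample data are invented.

```python
# One detect() result, as documented: (bounding box corners, text, confidence).
Detection = tuple[list[list[float]], str, float]


def has_fields_sketch(
    extracted: list[Detection], fieldnames: list[str]
) -> dict[str, tuple[bool, list[int]]]:
    """Build a dict shaped like the documented return value of hasFields().

    Assumption: a field matches only when it equals the detected string exactly.
    """
    result: dict[str, tuple[bool, list[int]]] = {}
    for name in fieldnames:
        # Positions in `extracted` whose detected text equals the field name.
        indices = [i for i, (_box, text, _conf) in enumerate(extracted) if text == name]
        result[name] = (len(indices) > 0, indices)
    return result


# Invented sample: a dummy bounding box is reused for brevity.
box = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
sample: list[Detection] = [
    (box, "Volume", 99.0),
    (box, "Mean", 97.0),
    (box, "Volume", 95.0),
]

found = has_fields_sketch(sample, ["Volume", "Std"])
```

Here "Volume" occurs at indices 0 and 2 of the sample, so its entry is (True, [0, 2]), while "Std" is absent and maps to (False, []).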

loadFieldNames(filename: str) -> None

Load the XML file that stores the table fields to be searched in the PDF.

Parameters

filename : str

parse(filename: str, wait: DialogWait | None = None) -> tuple[DataFrame, list]

PDF parsing:

  • OCR processing to extract all strings from the PDF

  • search table fields (table column header)

  • extract table field values

Parameters

filename : str

PDF file to parse.

wait : DialogWait | None

optional progress dialog (default None).

Returns

tuple[DataFrame, list]

  • DataFrame of extracted table values

  • list of all strings extracted from the PDF
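A typical call sequence might look like the sketch below. It uses only method names documented in this section, but the order of calls is an assumption, and the wrapper extract_table is hypothetical. The parser is passed in as an argument so that the sketch does not depend on how a SisypheParsePdf instance is constructed, which is not described here.

```python
def extract_table(parser, pdf_path, fields, exclude=None):
    """Configure a parser, then parse one PDF.

    Hypothetical wrapper: `parser` is expected to be a SisypheParsePdf
    instance (or any object exposing the same documented methods).
    """
    # Table column headers to look for in the PDF.
    parser.setFieldNamesToExtract(list(fields))
    # Optional strings to ignore in the OCR detection.
    if exclude:
        parser.setFieldNamesToExclude(list(exclude))
    # parse() is documented to return (DataFrame of table values, all strings).
    table, strings = parser.parse(pdf_path)
    return table, strings
```

The optional wait argument of parse() is left at its default of None, so no progress dialog is involved.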

saveFieldNames(filename: str) -> None

Save the XML file that stores the table fields to be searched in the PDF.

Parameters

filename : str

setFieldNamesToExclude(fields: list[str]) -> None

Set the list of strings to exclude from the OCR detection.

Parameters

fields : str | list[str]

list of strings to exclude.

setFieldNamesToExtract(fields: list[str]) -> None

Set the list of table fields (header column) to be searched in the PDF.

Parameters

fields : str | list[str]

list of fields to be searched in the PDF.