Sisyphe.core.sisyphePDF
External packages/modules
easyocr, OCR, https://github.com/JaidedAI/EasyOCR
pandas, data analysis and manipulation tool, https://pandas.pydata.org/
pymupdf, PDF processing, https://pymupdf.readthedocs.io/
- class Sisyphe.core.sisyphePDF.SisypheParsePdf
Description
This class performs OCR processing on PDF files.
OCR processing to extract all strings from the PDF,
search table fields (table column header),
extract table field values.
Inheritance
object -> SisypheParsePdf
Creation: 15/12/2025 Last revision: 10/01/2026
- appendFieldNamesToExclude(fields: str | list[str])
Add strings to the list of strings to exclude from the OCR detection.
Parameters
- fields : str | list[str]
list of strings to exclude.
- appendFieldNamesToExtract(fields: str | list[str])
Add fields to the list of table fields (header column) to search for in the PDF.
Parameters
- fields : str | list[str]
list of fields to be searched in the PDF.
- clearFieldNames() -> None
Clear the list of table fields to be searched in the PDF.
- classmethod detect(filename: str) -> list[tuple[list[list[float]], str, float]]
Extract all strings from a PDF.
Parameters
- filename : str
PDF filename
Returns
- list[tuple[list[list[float]], str, float]]
One tuple for each string extracted from the PDF:
list[list[float]], four corners of the detected string’s bounding box.
str, detected string.
float, OCR confidence level, in percent.
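The return structure can be illustrated without running OCR. The sketch below uses hand-written sample tuples in the documented shape (the coordinates, strings, and confidence values are made up for illustration) and a hypothetical helper, `filter_by_confidence`, that is not part of Sisyphe. It assumes the confidence is expressed in percent, as stated above.

```python
# Type alias matching the documented return type of detect().
Detection = tuple[list[list[float]], str, float]


def filter_by_confidence(extracted: list[Detection], threshold: float) -> list[str]:
    """Return the detected strings whose confidence is at least `threshold`.

    Hypothetical helper for illustration; not part of Sisyphe.
    """
    return [text for _bbox, text, conf in extracted if conf >= threshold]


# Made-up sample: (four bounding-box corners, detected string, confidence in percent).
sample: list[Detection] = [
    ([[10.0, 10.0], [90.0, 10.0], [90.0, 30.0], [10.0, 30.0]], "Volume", 98.5),
    ([[10.0, 40.0], [90.0, 40.0], [90.0, 60.0], [10.0, 60.0]], "12.4", 91.0),
    ([[10.0, 70.0], [90.0, 70.0], [90.0, 90.0], [10.0, 90.0]], "V0lume", 42.0),
]

# Keep only the strings detected with at least 90 % confidence.
reliable = filter_by_confidence(sample, threshold=90.0)
```

With a real PDF, the list passed to such a helper would be the value returned by detect().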
- getFieldNamesToExclude() -> list[str]
Get the list of strings to exclude from the OCR detection.
Returns
- list[str]
list of strings to exclude.
- getFieldNamesToExtract() -> list[str]
Get the list of table fields to be searched in the PDF.
Returns
- list[str]
list of fields (header column) to be searched in the PDF.
- classmethod hasField(extracted: list[tuple[list[list[float]], str, float]], fieldname: str) -> bool
Check if a string was extracted from a PDF.
Parameters
- extracted : list[tuple[list[list[float]], str, float]]
list of extracted strings from the PDF, processed with the detect() method.
- fieldname : str
string to search for in the PDF.
Returns
- bool
True if the string was extracted from the PDF.
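The lookup that hasField() describes can be sketched in plain Python. This is not the Sisyphe implementation: the function name `has_field` is hypothetical, the sample detections are made up, and exact, case-sensitive matching is an assumption, since the matching rule is not stated here.

```python
# Type alias matching the documented return type of detect().
Detection = tuple[list[list[float]], str, float]


def has_field(extracted: list[Detection], fieldname: str) -> bool:
    """Return True if `fieldname` equals one of the detected strings.

    Illustrative stand-in; exact, case-sensitive matching is an assumption.
    """
    return any(text == fieldname for _bbox, text, _conf in extracted)


# Made-up detections in the shape returned by detect().
box = [[0.0, 0.0], [50.0, 0.0], [50.0, 12.0], [0.0, 12.0]]
extracted: list[Detection] = [(box, "Name", 99.0), (box, "Dose", 87.5)]

found = has_field(extracted, "Dose")      # "Dose" is among the detected strings
missing = has_field(extracted, "Volume")  # "Volume" is not
```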
- hasFieldNamesToExclude() -> bool
Check if the list of strings to exclude from the OCR detection is not empty.
Returns
- bool
True if the list of strings to exclude is not empty.
- hasFieldNamesToExtract() -> bool
Check if the list of table fields (header column) to be searched in the PDF is not empty.
Returns
- bool
True if the list of table fields to be searched in the PDF is not empty.
- classmethod hasFields(extracted: list[tuple[list[list[float]], str, float]], fieldnames: list[str]) -> dict[str, tuple[bool, list[int]]]
Check if a list of strings was extracted from a PDF.
Parameters
- extracted : list[tuple[list[list[float]], str, float]]
list of extracted strings from the PDF, processed with the detect() method.
- fieldnames : list[str]
list of strings to search for in the PDF.
Returns
- dict[str, tuple[bool, list[int]]]
keys are the strings to search for.
values are tuples of a bool, True if the key string was extracted from the PDF, and a list of the indices of the key string within the extracted list.
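The shape of the returned dictionary is easier to see on a small example. The sketch below is a plain-Python stand-in, not the Sisyphe implementation: `has_fields` is a hypothetical name, the detections are made up, and exact string matching is an assumption.

```python
# Type alias matching the documented return type of detect().
Detection = tuple[list[list[float]], str, float]


def has_fields(
    extracted: list[Detection], fieldnames: list[str]
) -> dict[str, tuple[bool, list[int]]]:
    """Map each field name to (found, indices of its occurrences in `extracted`).

    Illustrative stand-in; exact string matching is an assumption.
    """
    result: dict[str, tuple[bool, list[int]]] = {}
    for name in fieldnames:
        # Positions in `extracted` whose detected string equals the field name.
        indices = [i for i, (_bbox, text, _conf) in enumerate(extracted) if text == name]
        result[name] = (bool(indices), indices)
    return result


# Made-up detections: "Dose" appears twice, "Volume" never.
box = [[0.0, 0.0], [50.0, 0.0], [50.0, 12.0], [0.0, 12.0]]
extracted: list[Detection] = [(box, "Dose", 95.0), (box, "Name", 99.0), (box, "Dose", 90.0)]

result = has_fields(extracted, ["Dose", "Volume"])
```

A field that is not found maps to False and an empty index list, so callers can test the bool before using the indices.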
- loadFieldNames(filename: str) -> None
Load the XML file that stores the table fields to be searched in the PDF.
Parameters
- filename : str
XML filename.
- parse(filename: str, wait: DialogWait | None = None) -> tuple[DataFrame, list]
PDF parsing:
OCR processing to extract all strings from the PDF
search table fields (table column header)
extract table field values
Parameters
- filename : str
PDF file to parse.
- wait : DialogWait | None
optional progress dialog (default None).
Returns
- tuple[DataFrame, list]
DataFrame of extracted table values
list of all strings extracted from the PDF
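A typical call sequence might look like the sketch below. It uses only the method names documented in this section, but several points are assumptions: that SisypheParsePdf can be constructed without arguments, that clearing the field list before appending is the intended way to start from a known state, and the wrapper name `extract_table` itself, which is hypothetical.

```python
def extract_table(pdf_path: str, fields: list[str]):
    """Parse `pdf_path` and return (DataFrame of table values, all extracted strings)."""
    # Imported lazily so that this module still loads when Sisyphe is not installed.
    from Sisyphe.core.sisyphePDF import SisypheParsePdf

    parser = SisypheParsePdf()  # assumption: the constructor takes no arguments
    parser.clearFieldNames()    # start from an empty list of table fields
    parser.appendFieldNamesToExtract(fields)
    # wait=None runs the parsing without a progress dialog.
    return parser.parse(pdf_path, wait=None)
```

For instance, `table, strings = extract_table("report.pdf", ["Dose", "Volume"])`, where the filename and the field names are placeholders.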
- saveFieldNames(filename: str) -> None
Save the XML file that stores the table fields to be searched in the PDF.
Parameters
- filename : str
XML filename.