Sisyphe.core.sisyphePDF

External packages/modules

easyocr, OCR, https://github.com/JaidedAI/EasyOCR

pandas, data analysis and manipulation tool, https://pandas.pydata.org/

pymupdf, PDF processing, https://pymupdf.readthedocs.io/

class Sisyphe.core.sisyphePDF.SisypheParsePdf

Description

This class performs OCR processing on PDF files:

OCR processing to extract all strings from the PDF,
search for table fields (table column headers),
extract the table field values.

Inheritance

object -> SisypheParsePdf

Creation: 15/12/2025

Last revision: 16/01/2026

appendFieldNamesToExclude(fields: str | list[str])

Add strings to the list of strings to exclude from the OCR detection.

Parameters

fields : str | list[str], string or list of strings to exclude.

appendFieldNamesToExtract(fields: str | list[str])

Add fields to the list of table fields (header column) to search for in the PDF.

Parameters

fields : str | list[str], field or list of fields to be searched in the PDF.
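Since fields may be a single string or a list of strings, both forms can be appended in the same way. The following is a minimal sketch of that normalization, not the Sisyphe implementation: the helper as_field_list, the to_extract list and the field names "Target", "Entry" and "Length" are all hypothetical.

```python
from __future__ import annotations


def as_field_list(fields: str | list[str]) -> list[str]:
    """Return a list of field names from a single string or a list of strings."""
    if isinstance(fields, str):
        return [fields]
    return list(fields)


# Hypothetical in-memory list standing in for the fields to extract.
to_extract: list[str] = []
to_extract.extend(as_field_list("Target"))             # single string
to_extract.extend(as_field_list(["Entry", "Length"]))  # list of strings
```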

clearFieldNames() → None

Clear the list of table fields to be searched in the PDF.

classmethod detect(filename: str) → list[tuple[list[list[float]], str, float]]

Extract all strings from a PDF.

Parameters

filename : str, PDF filename.

Returns

list[tuple[list[list[float]], str, float]]

One tuple for each string extracted from the PDF:

list[list[float]], four corners of the detected string’s bounding box.

str, detected string.

float, OCR confidence level, in percent.
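To illustrate the shape of this return value, here is a minimal sketch that unpacks each tuple and keeps only the strings above a confidence threshold. The sample data and the helper confident_strings are hypothetical and only mimic the documented structure; they are not produced by Sisyphe.

```python
# Hypothetical sample shaped like the documented return value of detect():
# (four bounding-box corners, detected string, confidence).
extracted = [
    ([[10.0, 10.0], [90.0, 10.0], [90.0, 30.0], [10.0, 30.0]], "Name", 98.5),
    ([[10.0, 40.0], [90.0, 40.0], [90.0, 60.0], [10.0, 60.0]], "T1", 42.0),
]


def confident_strings(items, threshold):
    """Keep strings whose confidence is at least `threshold`.

    `threshold` must use the same unit as the confidence values in `items`.
    """
    return [text for _box, text, conf in items if conf >= threshold]


kept = confident_strings(extracted, 50.0)
```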

getFieldNamesToExclude() → list[str]

Get the list of strings to exclude from the OCR detection.

Returns

list[str]: list of strings to exclude.

getFieldNamesToExtract() → list[str]

Get the list of table fields to be searched in the PDF.

Returns

list[str]: list of fields (header column) to be searched in the PDF.

classmethod hasField(extracted: list[tuple[list[list[float]], str, float]], fieldname: str) → bool

Check if a string was extracted from a PDF.

Parameters

extracted : list[tuple[list[list[float]], str, float]], list of strings extracted from the PDF, as returned by the detect() method.
fieldname : str, string to search for in the PDF.

Returns

bool: True if the string was extracted from the PDF.
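The check can be pictured as a membership test over the detected strings. The sketch below assumes an exact, case-sensitive comparison, which is an assumption: the matching rule actually used by hasField() is not stated here. The function has_field and the sample data are hypothetical.

```python
def has_field(extracted, fieldname):
    """Return True if `fieldname` equals one of the detected strings.

    Assumption: exact, case-sensitive match (the real rule may differ).
    """
    return any(text == fieldname for _box, text, _conf in extracted)


# Hypothetical detect()-shaped sample.
box = [[0.0, 0.0], [50.0, 0.0], [50.0, 10.0], [0.0, 10.0]]
sample = [(box, "Target", 97.0), (box, "Entry", 91.0)]

found = has_field(sample, "Target")
missing = has_field(sample, "Length")
```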

hasFieldNamesToExclude() → bool

Check if the list of strings to exclude from the OCR detection is not empty.

Returns

bool: True if the list of strings to exclude is not empty.

hasFieldNamesToExtract() → bool

Check if the list of table fields (header column) to be searched in the PDF is not empty.

Returns

bool: True if the list of table fields to be searched in the PDF is not empty.

classmethod hasFields(extracted: list[tuple[list[list[float]], str, float]], fieldnames: list[str]) → dict[str, tuple[bool, list[int]]]

Check if a list of strings was extracted from a PDF.

Parameters

extracted : list[tuple[list[list[float]], str, float]], list of strings extracted from the PDF, as returned by the detect() method.
fieldnames : list[str], list of strings to search for in the PDF.

Returns

dict[str, tuple[bool, list[int]]]

keys are the strings to search for

values are tuples of a bool (True if the key string was extracted from the PDF) and a list of the indices of the key string within the extracted list.
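A minimal sketch of how a dictionary with this layout could be built from detect()-shaped data follows. It assumes exact string matching, and the function has_fields and the sample data are hypothetical rather than taken from Sisyphe.

```python
def has_fields(extracted, fieldnames):
    """Map each field name to (found, indices within `extracted`).

    Assumption: exact, case-sensitive match (the real rule may differ).
    """
    result = {}
    for name in fieldnames:
        indices = [i for i, (_box, text, _conf) in enumerate(extracted)
                   if text == name]
        result[name] = (bool(indices), indices)
    return result


# Hypothetical detect()-shaped sample; "Target" appears twice.
box = [[0.0, 0.0], [50.0, 0.0], [50.0, 10.0], [0.0, 10.0]]
sample = [(box, "Target", 97.0), (box, "Entry", 91.0), (box, "Target", 88.0)]

report = has_fields(sample, ["Target", "Length"])
```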

loadFieldNames(filename: str) → None

Load the XML file that stores the table fields to be searched in the PDF.

Parameters

filename : str, XML filename.

parseOCR(filename: str, linc: float = 0.98, wait: DialogWait | None = None) → tuple[DataFrame, list]

PDF OCR parsing:

OCR processing to extract all strings from the PDF
search table fields (table column header)
extract table field values

Parameters

filename : str, PDF file to parse.
linc : float, line increment threshold (0.9 to 0.9999).
wait : DialogWait | None, progress dialog.

Returns

tuple[DataFrame, list]

DataFrame of extracted table values

list of all strings extracted from the PDF
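The last two steps (locating a column header, then collecting the values beneath it) can be sketched with bounding boxes alone. This is an illustrative approach based on horizontal overlap, not the algorithm used by parseOCR(), and it does not model the linc parameter. The helpers x_range, top and column_values, and the sample data, are hypothetical.

```python
def x_range(box):
    """Horizontal extent (min x, max x) of a four-corner bounding box."""
    xs = [point[0] for point in box]
    return min(xs), max(xs)


def top(box):
    """Smallest y coordinate of a four-corner bounding box."""
    return min(point[1] for point in box)


def column_values(extracted, header):
    """Strings located below `header` whose horizontal extent overlaps it."""
    headers = [box for box, text, _conf in extracted if text == header]
    if not headers:
        return []
    hbox = headers[0]
    hx0, hx1 = x_range(hbox)
    below = []
    for box, text, _conf in extracted:
        x0, x1 = x_range(box)
        if top(box) > top(hbox) and x0 < hx1 and x1 > hx0:
            below.append((top(box), text))
    # Order values from the top of the page to the bottom.
    return [text for _y, text in sorted(below)]


def rect(x0, y0, x1, y1):
    """Four corners of an axis-aligned box, in detect() order."""
    return [[x0, y0], [x1, y0], [x1, y1], [x0, y1]]


# Hypothetical two-column table: headers "X" and "Y" with two rows each.
sample = [
    (rect(10, 10, 50, 20), "X", 99.0), (rect(60, 10, 100, 20), "Y", 99.0),
    (rect(10, 30, 50, 40), "1.0", 95.0), (rect(60, 30, 100, 40), "2.0", 95.0),
    (rect(10, 50, 50, 60), "3.0", 95.0), (rect(60, 50, 100, 60), "4.0", 95.0),
]

table = {name: column_values(sample, name) for name in ("X", "Y")}
```

A dictionary of equal-length columns such as table can be passed directly to pandas.DataFrame to obtain a tabular result.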

parseRenishawReport(filename: str) → tuple[DataFrame, str]

Renishaw robotic neurosurgery system report parsing:

search table fields (table column header)
extract table field values
extract orientation: ‘lr’ lateral right, ‘ll’ lateral left, ‘sa’ sagittal anterior, ‘sp’ sagittal posterior

Parameters

filename : str, PDF Renishaw report to parse.

Returns

tuple[DataFrame, str]

DataFrame of extracted table values

str, orientation code (‘lr’, ‘ll’, ‘sa’ or ‘sp’)
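The orientation codes listed above can be turned into readable labels with a small lookup table. This is only a convenience sketch: the names ORIENTATION_LABELS and describe_orientation are hypothetical and not part of Sisyphe.

```python
# Codes and labels as listed in the parseRenishawReport() description.
ORIENTATION_LABELS = {
    "lr": "lateral right",
    "ll": "lateral left",
    "sa": "sagittal anterior",
    "sp": "sagittal posterior",
}


def describe_orientation(code):
    """Return the label for an orientation code, or 'unknown' if unrecognized."""
    return ORIENTATION_LABELS.get(code, "unknown")


label = describe_orientation("sa")
```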

saveFieldNames(filename: str) → None

Save the XML file that stores the table fields to be searched in the PDF.

Parameters

filename : str, XML filename.
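The XML layout written by saveFieldNames() is not described here. The sketch below only illustrates a save/load round trip of a field list with the standard library, using an invented layout (a fields root element with one field child per name). The functions save_field_names and load_field_names, the layout and the field names are all hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def save_field_names(filename, fields):
    """Write field names to an XML file (hypothetical layout)."""
    root = ET.Element("fields")
    for name in fields:
        ET.SubElement(root, "field").text = name
    ET.ElementTree(root).write(filename, encoding="utf-8", xml_declaration=True)


def load_field_names(filename):
    """Read field names back from the hypothetical layout."""
    root = ET.parse(filename).getroot()
    return [element.text or "" for element in root.findall("field")]


# Round trip through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "fields.xml")
    save_field_names(path, ["Target", "Entry"])
    loaded = load_field_names(path)
```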

setFieldNamesToExclude(fields: list[str]) → None

Set the list of strings to exclude from the OCR detection.

Parameters

fields : str | list[str], list of strings to exclude.

setFieldNamesToExtract(fields: list[str]) → None

Set the list of table fields (header column) to be searched in the PDF.

Parameters

fields : str | list[str], list of fields to be searched in the PDF.