High-level functions API
extract_text
pdfminer.high_level.extract_text(pdf_file, password='', page_numbers=None, maxpages=0, caching=True, codec='utf-8', laparams=None)
Parses a PDF file and returns the text it contains as a single string. Every argument except pdf_file is optional, and the defaults are reasonable for most documents.
- Parameters
pdf_file – Path to the PDF file to be worked on.
password – For encrypted PDFs, the password to decrypt.
page_numbers – List of zero-indexed page numbers to extract.
maxpages – The maximum number of pages to parse.
caching – Whether resources should be cached.
codec – Text decoding codec.
laparams – LAParams object from pdfminer.layout.
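Because page_numbers is zero-indexed while PDF viewers usually count pages from 1, a small conversion step is easy to get wrong. The following is a minimal sketch of one way to wrap extract_text for that case. The helper names to_zero_indexed and extract_pages are hypothetical and not part of pdfminer; only extract_text and its keyword arguments come from the signature above.

```python
def to_zero_indexed(pages):
    """Convert 1-based page numbers (as shown in a PDF viewer) into the
    sorted, de-duplicated zero-indexed list that page_numbers expects."""
    pages = list(pages)
    if any(p < 1 for p in pages):
        raise ValueError("page numbers are 1-based and must be >= 1")
    return sorted({p - 1 for p in pages})


def extract_pages(path, pages, password=""):
    """Return the text of the given 1-based pages of the PDF at `path`.

    Hypothetical convenience wrapper, not a pdfminer function.
    """
    # Imported here so to_zero_indexed stays usable even where
    # pdfminer.six is not installed.
    from pdfminer.high_level import extract_text

    return extract_text(
        path,
        password=password,
        page_numbers=to_zero_indexed(pages),
    )
```

With a placeholder file such as report.pdf, calling extract_pages("report.pdf", [1, 3]) would ask extract_text for the pages with zero-based indices 0 and 2.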
extract_text_to_fp
pdfminer.high_level.extract_text_to_fp(inf, outfp, output_type='text', codec='utf-8', laparams=None, maxpages=0, page_numbers=None, password='', scale=1.0, rotation=0, layoutmode='normal', output_dir=None, strip_control=False, debug=False, disable_caching=False, **kwargs)
Parses text from the inf file-like object and writes it to the outfp file-like object. Every argument except inf and outfp is optional, and the defaults are reasonable for most documents. Take care with laparams: passing an empty LAParams object is not the same as passing None. Returns nothing, because it operates on two streams; pass a StringIO as outfp to collect the output as a string.
- Parameters
inf – A file-like object to read the PDF structure from, such as a file object returned by the builtin open() function or a BytesIO.
outfp – A file-like object to write the text to.
output_type – May be ‘text’, ‘xml’, ‘html’, ‘tag’. Only ‘text’ works properly.
codec – Text decoding codec.
laparams – An LAParams object from pdfminer.layout. The default is None, which may not lay out the text correctly.
maxpages – The maximum number of pages to parse.
page_numbers – Zero-indexed page numbers to operate on.
password – For encrypted PDFs, the password to decrypt.
scale – Scale factor.
rotation – Rotation factor.
layoutmode – Default is ‘normal’; see pdfminer.converter.HTMLConverter.
output_dir – If given, creates an ImageWriter for extracted images.
strip_control – Strip control characters from the output.
debug – Output more logging data.
disable_caching – Disable caching of resources.
other –
- Returns
Nothing; the extracted text is written to outfp.
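Since extract_text_to_fp writes to a stream instead of returning a value, a common pattern is to hand it a StringIO and read the result back. The sketch below illustrates that pattern under stated assumptions: the helper names as_binary_stream and pdf_to_string are hypothetical, the default output_type of 'text' is kept, and the use_layout_analysis flag exists only to make the choice between LAParams() and None explicit.

```python
from io import BytesIO, StringIO


def as_binary_stream(source):
    """Wrap raw bytes in a BytesIO; pass file-like objects through unchanged."""
    if isinstance(source, (bytes, bytearray)):
        return BytesIO(source)
    return source


def pdf_to_string(source, use_layout_analysis=True, password=""):
    """Collect the output of extract_text_to_fp as a string.

    Hypothetical convenience wrapper, not a pdfminer function.
    `source` is PDF bytes or a binary file-like object.
    """
    # Imported here so as_binary_stream stays usable even where
    # pdfminer.six is not installed.
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    # An empty LAParams() and None are not equivalent, so the choice
    # is made explicitly instead of relying on the default.
    laparams = LAParams() if use_layout_analysis else None

    outfp = StringIO()
    extract_text_to_fp(
        as_binary_stream(source),
        outfp,
        laparams=laparams,
        password=password,
    )
    return outfp.getvalue()
```

A file opened with open("report.pdf", "rb") (a placeholder path) can be passed directly as source; comparing the results for use_layout_analysis=True and False on the same file is a quick way to see how much the laparams choice matters for a given document.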