High-level functions API

extract_text

pdfminer.high_level.extract_text(pdf_file, password='', page_numbers=None, maxpages=0, caching=True, codec='utf-8', laparams=None)

Parses a PDF file and returns the text it contains. Every argument except pdf_file is optional, and the defaults are reasonable. Returns a string containing all of the extracted text.

Parameters
  • pdf_file – Path to the PDF file to be worked on

  • password – For encrypted PDFs, the password to decrypt.

  • page_numbers – List of zero-indexed page numbers to extract.

  • maxpages – The maximum number of pages to parse; 0 (the default) means no limit.

  • caching – Whether resources should be cached.

  • codec – Text decoding codec

  • laparams – LAParams object from pdfminer.layout.
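As a minimal sketch of how the parameters above fit together, the helper below extracts only the first few pages of a document. The helper name first_pages_text and its arguments are hypothetical and not part of pdfminer; only extract_text and its password and page_numbers parameters come from the signature documented here.

```python
from typing import List


def first_pages_text(pdf_path: str, count: int, password: str = "") -> str:
    """Return the text of the first `count` pages of the PDF at `pdf_path`."""
    # Imported inside the function so the helper can be defined even where
    # pdfminer.six is not installed; a top-level import works just as well.
    from pdfminer.high_level import extract_text

    # page_numbers is zero-indexed, so the first `count` pages are 0..count-1.
    pages: List[int] = list(range(count))

    return extract_text(pdf_path, password=password, page_numbers=pages)
```

Calling first_pages_text("example.pdf", 2), where "example.pdf" is a placeholder path, would request pages 0 and 1 only.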

extract_text_to_fp

pdfminer.high_level.extract_text_to_fp(inf, outfp, output_type='text', codec='utf-8', laparams=None, maxpages=0, page_numbers=None, password='', scale=1.0, rotation=0, layoutmode='normal', output_dir=None, strip_control=False, debug=False, disable_caching=False, **kwargs)

Parses text from the inf file-like object and writes it to the outfp file-like object. Every argument except inf and outfp is optional, and the defaults are reasonable. Beware of laparams: passing an empty LAParams is not the same as passing None! Returns nothing, since it acts on two streams; pass a StringIO as outfp to get the text as a string.
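The StringIO approach mentioned above can be sketched roughly as follows. The helper name pdf_to_string and its use_layout flag are hypothetical and exist only for illustration; extract_text_to_fp, LAParams, and the output_type and laparams parameters are the ones documented in this section.

```python
from io import StringIO


def pdf_to_string(pdf_path: str, use_layout: bool = True) -> str:
    """Extract the text of the PDF at `pdf_path` and return it as a string."""
    # Imported inside the function so the helper can be defined even where
    # pdfminer.six is not installed; a top-level import works just as well.
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    # An LAParams() instance and None are not equivalent here, so the
    # choice between them is made explicit.
    laparams = LAParams() if use_layout else None

    out = StringIO()
    # The input must be opened in binary mode; the output is a text stream.
    with open(pdf_path, "rb") as inf:
        extract_text_to_fp(inf, out, output_type="text", laparams=laparams)

    return out.getvalue()
```

Because the function returns nothing, the extracted text is read back from the StringIO with getvalue() once the call has finished.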

Parameters
  • inf – a file-like object to read PDF structure from, such as a file handler (using the builtin open() function) or a BytesIO.

  • outfp – a file-like object to write the text to.

  • output_type – May be ‘text’, ‘xml’, ‘html’, ‘tag’. Only ‘text’ works properly.

  • codec – Text decoding codec

  • laparams – An LAParams object from pdfminer.layout. The default is None, but the layout may then not come out correctly.

  • maxpages – The number of pages after which to stop parsing.

  • page_numbers – zero-indexed page numbers to operate on.

  • password – For encrypted PDFs, the password to decrypt.

  • scale – Scale factor

  • rotation – Rotation factor

  • layoutmode – Default is ‘normal’, see pdfminer.converter.HTMLConverter

  • output_dir – If given, creates an ImageWriter for extracted images.

  • strip_control – Whether to strip control characters from the extracted text.

  • debug – Output more logging data

  • disable_caching – Whether to disable resource caching.

  • other

Returns