Welcome to pdfminer.six’s documentation!
Pdfminer.six is a Python package for extracting information from PDF documents.
Check out the source on GitHub.
Features
Parse all objects from a PDF document into Python objects.
Analyze and group text in a human-readable way.
Extract text, images (JPG, JBIG2 and Bitmaps), table-of-contents, tagged contents and more.
Support for (almost all) features from the PDF-1.7 specification.
Support for Chinese, Japanese and Korean (CJK) languages as well as vertical writing.
Support for various font types (Type1, TrueType, Type3, and CID).
Support for basic encryption (RC4).
Installation instructions
Before using pdfminer.six, install it with pip on Python 2.7 or newer.
$ pip install pdfminer.six
Note that Python 2.7 support is dropped as of January 2020.
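To confirm that the installation succeeded, a short check along these lines may help. It is only a sketch: the helper name `installed_version` is hypothetical, and it relies on the standard-library `importlib.metadata` module, which assumes Python 3.8 or newer.

```python
from importlib import metadata


def installed_version(dist_name="pdfminer.six"):
    """Return the installed version of dist_name, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # The distribution is not installed in the current environment.
        return None


# Prints a version string if pdfminer.six is installed, otherwise None.
print(installed_version())
```

If this prints `None`, pip most likely installed the package into a different Python environment than the one running the check.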
Common use-cases
Get started with command-line tools if you just want to extract text from a PDF once.
Get started using the high-level functions if you want to integrate pdfminer.six with your Python code.
Get started using the composable components API when you want to tailor the behavior of pdfminer.six to your needs.
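As a rough illustration of the high-level route, the sketch below wraps `extract_text` from `pdfminer.high_level` in a small helper. The helper name `pdf_to_text` and the file name in the usage note are hypothetical, and the import is placed inside the function only so that the helper can be defined in environments where pdfminer.six is not yet installed.

```python
from pathlib import Path


def pdf_to_text(pdf_path):
    """Return the text of the PDF at pdf_path as a single string."""
    path = Path(pdf_path)
    if not path.is_file():
        # Fail early with a clear error before handing the path to pdfminer.six.
        raise FileNotFoundError(f"No such PDF file: {path}")

    # Imported on first use, so defining this helper does not itself
    # require pdfminer.six to be installed.
    from pdfminer.high_level import extract_text

    return extract_text(str(path))
```

Calling `pdf_to_text("example.pdf")` on an existing file should return the extracted text as a string. Finer control over parsing and layout analysis is the territory of the composable components API.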
Contributing
We welcome all contributors to pdfminer.six! Before you start, please take a look at the contribution guide.