How do I extract text and text coordinates from a PDF file?

Posted on 2021-01-29 19:33:05

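I want to use PDFMiner to extract every text box, together with its coordinates, from a PDF file.

Many other Stack Overflow posts explain how to extract all of the text in reading order, but how do I get at the intermediate step, where each piece of text is still paired with its position?

Given a PDF file, the output should look something like this: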

489, 41,  "Signature"
500, 52,  "b"
630, 202, "a_g_i_r"
1 Answer
  • 面试哥
    面试哥 2021-01-29

    This is the smallest working solution I have found. Newlines are converted to underscores in the final output.

    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfpage import PDFTextExtractionNotAllowed
    from pdfminer.pdfinterp import PDFResourceManager
    from pdfminer.pdfinterp import PDFPageInterpreter
    from pdfminer.pdfdevice import PDFDevice
    from pdfminer.layout import LAParams
    from pdfminer.converter import PDFPageAggregator
    import pdfminer
    
    # Open a PDF file.
    fp = open('/Users/me/Downloads/test.pdf', 'rb')
    
    # Create a PDF parser object associated with the file object.
    parser = PDFParser(fp)
    
    # Create a PDF document object that stores the document structure.
    # If the file is encrypted, pass the password as the 2nd argument.
    document = PDFDocument(parser)
    
    # Check if the document allows text extraction. If not, abort.
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed
    
    # Create a PDF resource manager object that stores shared resources.
    rsrcmgr = PDFResourceManager()
    
    # Create a PDF device object (replaced by the page aggregator below).
    device = PDFDevice(rsrcmgr)
    
    # BEGIN LAYOUT ANALYSIS
    # Set parameters for analysis.
    laparams = LAParams()
    
    # Create a PDF page aggregator object.
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    
    # Create a PDF interpreter object.
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    
    def parse_obj(lt_objs):
    
        # loop over the object list
        for obj in lt_objs:
    
            # if it's a textbox, print text and location
            # (bbox is (x0, y0, x1, y1), measured from the page's bottom-left corner)
            if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
                print("%6d, %6d, %s" % (obj.bbox[0], obj.bbox[1], obj.get_text().replace('\n', '_')))
    
            # if it's a container, recurse
            elif isinstance(obj, pdfminer.layout.LTFigure):
                parse_obj(obj._objs)
    
    # loop over all pages in the document
    for page in PDFPage.create_pages(document):
    
        # read the page into a layout object
        interpreter.process_page(page)
        layout = device.get_result()
    
        # extract text from this object
        parse_obj(layout._objs)
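
If you are on pdfminer.six (the maintained fork), the same traversal can probably be written more compactly with `extract_pages` from `pdfminer.high_level`. The following is only a sketch under that assumption, not a definitive implementation. The helper names `format_row`, `to_top_left`, and `iter_text_boxes` are made up for illustration and are not part of the library. Because PDFMiner measures `bbox` from the bottom-left corner of the page, `to_top_left` shows one way to flip the y-axis when top-down coordinates are needed.

```python
def format_row(bbox, text):
    """Format one text box as 'x0, y0, text', showing newlines as underscores."""
    x0, y0 = bbox[0], bbox[1]
    return "%6d, %6d, %s" % (x0, y0, text.replace("\n", "_"))


def to_top_left(bbox, page_height):
    """Convert a bottom-left-origin bbox (x0, y0, x1, y1) to a top-left origin."""
    x0, y0, x1, y1 = bbox
    return (x0, page_height - y1, x1, page_height - y0)


def iter_text_boxes(pdf_path):
    """Yield (page_number, bbox, text) for each text box found in the file."""
    # Imported lazily (an illustrative choice) so that the two helpers above
    # remain usable even where pdfminer.six is not installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTFigure, LTTextContainer

    def walk(objs):
        for obj in objs:
            if isinstance(obj, LTTextContainer):
                # Text boxes carry both the position and the text.
                yield obj.bbox, obj.get_text()
            elif isinstance(obj, LTFigure):
                # Figures are containers, so recurse into them.
                yield from walk(obj)

    for page_number, layout in enumerate(extract_pages(pdf_path), start=1):
        for bbox, text in walk(layout):
            yield page_number, bbox, text


# Hypothetical usage, assuming a file named test.pdf exists:
# for page_number, bbox, text in iter_text_boxes("test.pdf"):
#     print(page_number, format_row(bbox, text))
```

Returning tuples instead of printing inside the traversal keeps the coordinates available for later steps such as sorting, filtering by region, or writing to CSV.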
    

