Detecting and changing strings in a PDF

Posted on 2021-01-29 17:29:47

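I would like to be able to detect a pattern in a PDF and flag it somehow.

For example, in this PDF there is the string *2. I would like to parse the PDF, detect all instances of *[integer], and do something to call attention to the matches (such as highlighting them in yellow or adding a symbol in the margin).

I would prefer to do this in Python, but I am open to other languages. So far, I have been able to read the PDF's text with pyPdf, and I can use a regular expression to detect the pattern. What I have not figured out is how to flag the matches and re-save the PDF.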

1 answer
  • 面试哥 2021-01-29

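    Either people are not interested or Python is not up to it, so here is a solution in Perl :-).
    Seriously, as noted above, you do not need to "change strings". PDF annotations are the solution. I had a small project involving annotations not long ago, and some of the code comes from there. However, my content parser was not universal, and you do not need full-blown parsing anyway, meaning the ability to change the content and write it back. So I resorted to an external tool. The PDF library I use is somewhat low-level, but I do not mind. It also means you are expected to know PDF internals reasonably well to understand what is going on. Otherwise, just use the tool.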
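To make "annotation" a bit more concrete for readers coming from the Python side, here is a minimal sketch of what a highlight annotation dictionary looks like in PDF syntax. The function names are made up for illustration, and the corner order of QuadPoints simply follows the Perl code in this answer. It only builds the dictionary text; attaching it to a page's /Annots array and writing the file is still the job of a PDF library.

```python
def pdf_number(value, precision=1):
    """Format a number as plain decimal text (PDF has no exponent notation)."""
    return f"{value:.{precision}f}"


def highlight_annotation(x1, y1, x2, y2, color=(1, 1, 0)):
    """Return the PDF source text of a Highlight annotation dictionary.

    (x1, y1) is the lower-left and (x2, y2) the upper-right corner of the
    text to mark, in PDF user space (origin at the bottom-left of the page).
    Hypothetical helper, not part of any library.
    """
    rect = [x1, y1, x2, y2]
    # Same corner order as the Perl code in this answer:
    # top-left, top-right, bottom-left, bottom-right.
    quad = [x1, y2, x2, y2, x1, y1, x2, y1]
    return (
        "<< /Type /Annot /Subtype /Highlight"
        f" /Rect [{' '.join(map(pdf_number, rect))}]"
        f" /QuadPoints [{' '.join(map(pdf_number, quad))}]"
        f" /C [{' '.join(str(c) for c in color)}] >>"
    )


print(highlight_annotation(72, 700, 90, 712))
```

The default color (1, 1, 0) is RGB yellow, matching the C entry the script writes.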

    For example, this command marks all gerunds in the OP's file:

    perl pdf_hl.pl -f westlaw.pdf -p '\S*ing'
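For the *[integer] case from the question, the whole-word default matters. The sketch below is a rough Python stand-in for how the comma-separated patterns are combined, assuming the -w option wraps the combined pattern in \b on both sides; build_pattern and the sample sentence are invented for illustration. Because \b cannot match between a space and '*', a pattern like \*\d+ would probably need -now.

```python
import re


def build_pattern(csv, whole_words=True, case_sensitive=False):
    """Combine comma-separated regex fragments into one compiled pattern."""
    alternatives = "|".join(f"(?:{p})" for p in csv.split(","))
    body = f"(?:{alternatives})"
    if whole_words:
        # Assumed equivalent of the script's -w flag.
        body = rf"\b{body}\b"
    return re.compile(body, 0 if case_sensitive else re.IGNORECASE)


text = "See Smith v. Jones, holding that *2 filing was timely."

# The gerund pattern from the command above.
print(build_pattern(r"\S*ing").findall(text))

# Whole-word anchoring: no word boundary exists between ' ' and '*'.
print(build_pattern(r"\*\d+").findall(text))

# Without the anchors the page marker is found.
print(build_pattern(r"\*\d+", whole_words=False).findall(text))
```

With the sample sentence, the first call finds the two gerunds, the second finds nothing, and the third finds the page marker.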

    The code (the comments inside are worth reading too):

    use strict;
    use warnings;
    use XML::Simple;
    use CAM::PDF;
    use Getopt::Long;
    use Regexp::Assemble;
    
    #####################################################################
    #
    #  This is PDF highlight mark-up tool.
    #  Though fully functional, it's still a prototype proof-of-concept.
    #  Please don't feed it with non-pdf files or patterns like '\d*' 
    #  (because you probably want '\d+', don't you?).
    #  
    #  Requires muPDF-tools installed and in the PATH, plus some CPAN modules.
    #
    #  ToDo:
    #  - error handling is primitive if any.
    #  - cropped files (CropBox) are processed incorrectly. Fix it.
    #  - of course there can be other useful parameters.
    #  - allow loading them from file.
    #  - allow searching across lines (e.g. for multi-word patterns)
    #    and certainly across "spans" within a line (see mudraw output).
    #  - multi-color mark-up, not just yellow.
    #  - control over output file name.
    #  - compress output (use cleanoutput method instead of output,
    #    plus more robust (think compressed object streams) compressors 
    #    may be useful).
    #  - file list processing.
    #  - annotations are not just colorful marks on the page, their 
    #    dictionaries can contain all sorts of useful information, which may 
    #    be extracted automatically further up the food chain i.e. by 
    #    whoever consumes these files (date, time, author, comments, actual 
    #    text below, etc., etc., plus think of customized appearance streams,
    #    placing them on layers, etc.).
    #  - ???
    #
    #   Most complexity in the code comes from adding appearance 
    #   dictionary (AP). You can safely delete it, because most viewers don't 
    #   need AP for standard annotations. Ironically, muPDF-viewer wants it 
    #   (otherwise highlight placement is not 100% correct), and since I relied 
    #   on muPDF-tools, I thought it be proper to create PDFs consumable by 
    #   their viewer... Firefox wants AP too, btw.
    #
    #####################################################################
    
    my ($file, $csv);
    my ($c_flag, $w_flag) = (0, 1);
    GetOptions('-f=s' => \$file,   '-p=s' => \$csv, 
               '-c!'  => \$c_flag, '-w!'  => \$w_flag) 
        and defined($file)
        and defined($csv)
    or die "\nUsage: perl $0 -f FILE -p LIST -c -w\n\n",
           "\t-f\t\tFILE\t PDF file to annotate\n",
           "\t-p\t\tLIST\t comma-separated patterns\n",
           "\t-c or -noc\t\t be case sensitive (default = no)\n",
           "\t-w or -now\t\t whole words only (default = yes)\n";
    my $re = Regexp::Assemble->new
        ->add(split(',', $csv))
        ->anchor_word($w_flag)
        ->flags($c_flag ? '' : 'i')
        ->re;
    my $xml = qx/mudraw -ttt $file/;
    my $tree = XMLin($xml, ForceArray => [qw/page block line span char/]);
    my $pdf = CAM::PDF->new($file);
    
    sub __num_nodes_list {
        my $precision = shift;
        [ map {CAM::PDF::Node->new('number', sprintf("%.${precision}f", $_))} @_ ]
    }
    
    sub add_highlight {
        my ($idx, $x1, $y1, $x2, $y2) = @_;
        my $p = $pdf->getPage($idx);
    
        # mirror vertically to get to normal cartesian plane 
        my ($X1, $Y1, $X2, $Y2) = $pdf->getPageDimensions($idx);
        ($x1, $y1, $x2, $y2) = ($X1 + $x1, $Y2 - $y2, $X1 + $x2, $Y2 - $y1);
        # corner radius
        my $r = 2;
    
        # AP appearance stream
        my $s = "/GS0 gs 1 1 0 rg 1 1 0 RG\n";
        $s .= "1 j @{[sprintf '%.0f', $r * 2]} w\n";
        $s .= "0 0 @{[sprintf '%.1f', $x2 - $x1]} ";
        $s .= "@{[sprintf '%.1f',$y2 - $y1]} re B\n";
    
        my $highlight = CAM::PDF::Node->new('dictionary', {
            Subtype => CAM::PDF::Node->new('label', 'Highlight'),
            Rect => CAM::PDF::Node->new('array', 
              __num_nodes_list(1, $x1 - $r, $y1 - $r, $x2 + $r * 2, $y2 + $r * 2)),
            QuadPoints => CAM::PDF::Node->new('array', 
                __num_nodes_list(1, $x1, $y2, $x2, $y2, $x1, $y1, $x2, $y1)),
            BS => CAM::PDF::Node->new('dictionary', {
                S => CAM::PDF::Node->new('label', 'S'),
                W => CAM::PDF::Node->new('number', 0),
            }),
            Border => CAM::PDF::Node->new('array', 
                __num_nodes_list(0, 0, 0, 0)),
            C => CAM::PDF::Node->new('array', 
                __num_nodes_list(0, 1, 1, 0)),
    
            AP => CAM::PDF::Node->new('dictionary', {
                N => CAM::PDF::Node->new('reference', 
                    $pdf->appendObject(undef, 
                        CAM::PDF::Node->new('object',
                            CAM::PDF::Node->new('dictionary', {
                                Subtype => CAM::PDF::Node->new('label', 'Form'),
                                BBox => CAM::PDF::Node->new('array',
                                  __num_nodes_list(1, -$r, -$r, $x2 - $x1 + $r * 2, 
                                                     $y2 - $y1 + $r * 2)),
                                Resources => CAM::PDF::Node->new('dictionary', {
                                    ExtGState => CAM::PDF::Node->new('dictionary', {
                                        GS0 => CAM::PDF::Node->new('dictionary', {
                                            BM => CAM::PDF::Node->new('label', 
                                                'Multiply'),
                                        }),
                                    }),
                                }),
                                StreamData => CAM::PDF::Node->new('stream', $s),
                                Length => CAM::PDF::Node->new('number', length $s),
                            }),
                        ),
                    ,0),
                ),
            }),
        });
    
        $p->{Annots} ||= CAM::PDF::Node->new('array', []);
        push @{$pdf->getValue($p->{Annots})}, $highlight;
    
        $pdf->{changes}->{$p->{Type}->{objnum}} = 1
    }
    
    my $page_index = 1;
    for my $page (@{$tree->{page}}) {
        for my $block (@{$page->{block}}) {
            for my $line (@{$block->{line}}) {
                for my $span (@{$line->{span}}) {
                    my $string = join '', map {$_->{c}} @{$span->{char}};
                    while ($string =~ /$re/g) {
                        my ($x1, $y1) = 
                            split ' ', $span->{char}->[$-[0]]->{bbox};
                        my (undef, undef, $x2, $y2) = 
                            split ' ', $span->{char}->[$+[0] - 1]->{bbox};
                        add_highlight($page_index, $x1, $y1, $x2, $y2)
                    }
                }
            }
        }
        $page_index ++
    }
    $pdf->output($file =~ s/(.{4}$)/++$1/r);
    
    __END__
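For readers who would rather follow the idea in Python, the main loop and the vertical flip can be sketched as below. This is not a port of the script: it assumes each glyph comes with a "x1 y1 x2 y2" bounding-box string in a top-left-origin coordinate system (as the Perl code treats the extractor output), and the function names, the sample span and the page size are made up. Unlike the script, it skips empty matches, which is the case the \d* warning in the header comment is about.

```python
import re


def flip_to_pdf_space(box, media_box):
    """Convert a top-left-origin box to PDF user space (bottom-left origin).

    box is (x1, y1, x2, y2) with y growing downwards; media_box is the
    page's (llx, lly, urx, ury).
    """
    x1, y1, x2, y2 = box
    llx, _lly, _urx, ury = media_box
    return (llx + x1, ury - y2, llx + x2, ury - y1)


def match_boxes(chars, pattern, media_box):
    """Yield (matched_text, pdf_box) for every regex match in one span.

    chars is a list of (character, "x1 y1 x2 y2") pairs, one per glyph.
    """
    text = "".join(c for c, _ in chars)
    for m in re.finditer(pattern, text):
        if m.end() == m.start():
            continue  # skip empty matches such as those produced by r"\d*"
        # Left/top come from the first matched glyph, right/bottom from the last.
        x1, y1, _, _ = map(float, chars[m.start()][1].split())
        _, _, x2, y2 = map(float, chars[m.end() - 1][1].split())
        yield m.group(), flip_to_pdf_space((x1, y1, x2, y2), media_box)


# Hypothetical span: six glyphs, each 6 pt wide, on a 612 x 792 pt page.
span = [(c, f"{100 + 6 * i} 50 {106 + 6 * i} 62") for i, c in enumerate("see *2")]

for matched, box in match_boxes(span, r"\*\d+", (0, 0, 612, 792)):
    print(matched, box)
```

Each resulting box could then be turned into one highlight annotation per match, which is what add_highlight does in the script.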
    

    P.S. I tagged the question with "Perl" to get some feedback (code corrections, etc.) from the community.


