bigd: a command-line tool for scraping files from a webpage; a concurrent file downloader

bigd : a command-line tool for scraping files from a webpage.

Usage examples are based on the following options:

Allowed options:
  -h [ --help ]                 produce help message
  -u [ --url ] arg              page to download from
  -t [ --type ] arg             type of file to download
  -m [ --match ] arg            wildcard pattern (supersedes type)
  -n [ --threads ] arg (=10)    number of files to simultaneously download
  -d [ --depth ] arg (=0)       recursive depth
  -a [ --download-archive ] arg archive file path
  -f [ --folder ] arg (=./)     folder of where to download content to

For example, to concurrently download files of type mp3 from a given URL:

./bigd --url <URL> --type mp3

Alternatively, provide a wildcard pattern, which supersedes the --type parameter:

./bigd --url <URL> --match "*.mp3"
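
To make the wildcard idea concrete, here is a minimal C++ sketch of how a pattern such as "*.mp3" could be tested against a link. It is not taken from the bigd source: the function name wildcardMatch is hypothetical, and support for '?' (exactly one character) is an assumption, since only '*' appears in the examples here.

```cpp
#include <string>

// Hypothetical helper (not from bigd): returns true if `text` matches
// `pattern`, where '*' matches any run of characters (including none)
// and '?' matches exactly one character. Matching is case-sensitive.
bool wildcardMatch(const std::string& pattern, const std::string& text) {
    std::size_t p = 0;                        // position in pattern
    std::size_t t = 0;                        // position in text
    std::size_t starPos = std::string::npos;  // index of the last '*' seen
    std::size_t resumeAt = 0;                 // text index to retry from

    while (t < text.size()) {
        if (p < pattern.size() && pattern[p] == '*') {
            // Remember the star; first try letting it match nothing.
            starPos = p++;
            resumeAt = t;
        } else if (p < pattern.size() &&
                   (pattern[p] == '?' || pattern[p] == text[t])) {
            ++p;
            ++t;
        } else if (starPos != std::string::npos) {
            // Mismatch: let the last '*' absorb one more character.
            p = starPos + 1;
            t = ++resumeAt;
        } else {
            return false;
        }
    }
    // Any trailing '*' in the pattern can match the empty string.
    while (p < pattern.size() && pattern[p] == '*') ++p;
    return p == pattern.size();
}
```

With this sketch, wildcardMatch("*.mp3", "song.mp3") is true, while wildcardMatch("*.mp3", "song.mp3.txt") is false because the pattern must cover the whole link.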

Note:

This tool works by first scraping the given URL for href links. If a link matches a requested type of content, bigd downloads it; otherwise it tries to recurse into the link (provided the --depth flag is set). Because of this, bigd works best with Apache-style directory listings and webpages with direct links to the 'type' of content that one wishes to scrape. There is no browser DOM emulation and no JavaScript execution.
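
The scrape-then-decide flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not bigd's actual code: the names extractLinks, hasType, classify and Action are hypothetical, the regular expression only handles quoted href attributes, and a real implementation would also need to resolve relative links against the page URL.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect the value of every quoted href attribute.
// Plain text matching only; nothing is parsed as a DOM.
std::vector<std::string> extractLinks(const std::string& html) {
    static const std::regex hrefRe(R"re(href\s*=\s*["']([^"']*)["'])re",
                                   std::regex::icase);
    std::vector<std::string> links;
    for (std::sregex_iterator it(html.begin(), html.end(), hrefRe), end;
         it != end; ++it) {
        links.push_back((*it)[1].str());
    }
    return links;
}

// Hypothetical helper: true if `link` ends with "." followed by `type`.
bool hasType(const std::string& link, const std::string& type) {
    const std::string suffix = "." + type;
    return link.size() > suffix.size() &&
           link.compare(link.size() - suffix.size(), suffix.size(),
                        suffix) == 0;
}

enum class Action { Download, Recurse, Skip };

// Decide what to do with a link: download it if it matches a requested
// type, otherwise recurse into it while depth remains, else ignore it.
Action classify(const std::string& link,
                const std::vector<std::string>& types,
                int depthRemaining) {
    for (const auto& type : types) {
        if (hasType(link, type)) return Action::Download;
    }
    return depthRemaining > 0 ? Action::Recurse : Action::Skip;
}
```

Because links are found by text matching alone, anything a page injects via script after loading is invisible to this approach, which is consistent with the limitation noted above.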

Usage notes:

  • Multiple file types can be specified by repeating --type (e.g. -t mp3 -t jpg).
  • A wildcard pattern given with --match can be used instead of --type and supersedes it (for example, -m "*.jpg").
  • Unless specified using the --folder flag, all content is downloaded to the current working directory.
  • A thread pool is used to scrape content concurrently, so bigd should generally prove quicker than tools that download one file at a time, such as wget.
  • The default threading value results in simultaneous downloading of 10 files. This can be overridden via the --threads flag.
  • An optional history of downloaded content (a 'download archive') will be written to a file when specified by the --download-archive flag.
  • The download archive is also used to ensure that bigd doesn't attempt to re-download content already downloaded.
  • Recursive downloading is supported with the --depth argument but is disabled by default (a depth of zero).
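
As an illustration of the download-archive behaviour in the notes above, the following C++ sketch keeps a set of already-downloaded URLs backed by a text file. The on-disk format used by bigd is not documented here, so the one-URL-per-line layout is an assumption, and the class name DownloadArchive and its methods are hypothetical.

```cpp
#include <fstream>
#include <set>
#include <string>

// Hypothetical sketch of a download archive (assumed format: one URL
// per line). Not thread-safe: concurrent workers would need a mutex
// around contains() and record().
class DownloadArchive {
public:
    // Load any existing entries; a missing file means an empty archive.
    explicit DownloadArchive(const std::string& path) : path_(path) {
        std::ifstream in(path_);
        std::string line;
        while (std::getline(in, line)) {
            if (!line.empty()) seen_.insert(line);
        }
    }

    // True if `url` has already been recorded as downloaded.
    bool contains(const std::string& url) const {
        return seen_.count(url) != 0;
    }

    // Record `url` and append it to the file.
    // Returns false if it was already present (nothing is written).
    bool record(const std::string& url) {
        if (!seen_.insert(url).second) return false;
        std::ofstream out(path_, std::ios::app);
        out << url << '\n';
        return true;
    }

private:
    std::string path_;
    std::set<std::string> seen_;
};
```

A downloader built on this sketch would call contains() before fetching a link and record() after a successful download, so that a later run with the same archive file skips content it already has.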

Building

Using Homebrew is the most straightforward option. Add the tap and install:

brew tap benhj/bigd
brew install bigd

Or use cmake:

mkdir build
cd build
cmake ..
make
make install

Or compile directly:

clang++ -std=c++11 bigd.cpp -lcurl -lboost_program_options -lboost_filesystem -lboost_system -o bigd

Contributing

Please create an issue if you find a bug or have an enhancement request, or follow the usual fork and pull-request workflow.

License

Adheres to The Hacky As Fuck Software License.
