bigd: a command-line tool for scraping files from a webpage (a concurrent file downloader)

bigd is a command-line tool that scrapes a webpage for links and downloads the matching files concurrently.

Usage examples are based on the following options:

Allowed options:
  -h [ --help ]                 produce help message
  -u [ --url ] arg              page to download from
  -t [ --type ] arg             type of file to download
  -m [ --match ] arg            wildcard pattern (supersedes type)
  -n [ --threads ] arg (=10)    number of files to simultaneously download
  -d [ --depth ] arg (=0)       recursive depth
  -a [ --download-archive ] arg archive file path
  -f [ --folder ] arg (=./)     folder of where to download content to

For example, to concurrently download files of type mp3 from a given url:

./bigd --url <URL> --type mp3

Alternatively, provide a wildcard pattern with --match, which supersedes the --type parameter:

./bigd --url <URL> --match "*.mp3"
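To make the --match behaviour concrete, here is a minimal sketch of how a wildcard such as "*.mp3" could be tested against a link. It is not taken from the bigd source: the function name wildcardMatch is hypothetical, and the sketch assumes '*' matches any run of characters, '?' matches exactly one, and matching is case-sensitive.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not bigd's actual code): returns true if `text`
// matches `pattern`. Assumed semantics: '*' matches any run of characters
// (including none), '?' matches exactly one character, case-sensitive.
bool wildcardMatch(const std::string& pattern, const std::string& text) {
    std::size_t p = 0;                        // current position in pattern
    std::size_t t = 0;                        // current position in text
    std::size_t starPos = std::string::npos;  // index of the last '*' seen
    std::size_t resume = 0;                   // text index to retry from
    while (t < text.size()) {
        if (p < pattern.size() && pattern[p] == '*') {
            // Remember the star and first try matching it against nothing.
            starPos = p++;
            resume = t;
        } else if (p < pattern.size() &&
                   (pattern[p] == '?' || pattern[p] == text[t])) {
            ++p;
            ++t;
        } else if (starPos != std::string::npos) {
            // Mismatch: let the last star absorb one more character.
            p = starPos + 1;
            t = ++resume;
        } else {
            return false;
        }
    }
    // Any pattern characters left over must all be stars.
    while (p < pattern.size() && pattern[p] == '*') ++p;
    return p == pattern.size();
}
```

Under these assumptions, wildcardMatch("*.mp3", "song.mp3") is true, while wildcardMatch("*.mp3", "song.mp3.txt") is false because the pattern must cover the whole link.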

Note:

This tool works by first scraping the given URL for href links. If a link matches a requested type of content, bigd downloads it; otherwise it tries to recurse into the link (provided the --depth flag is set). Because of this, bigd works best with Apache-style directory listings and with webpages that link directly to the 'type' of content one wishes to scrape. bigd does not emulate a browser DOM and does not execute JavaScript.
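The scrape-then-classify step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not bigd's implementation: the names extractHrefs, hasType, Plan and classify are hypothetical, only double-quoted href attributes are recognised, and a "type" is treated as a plain file-extension suffix.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect the values of double-quoted href attributes.
// A regex is a simplification; it is not a full HTML parser.
std::vector<std::string> extractHrefs(const std::string& html) {
    static const std::regex hrefRe(R"rx(href\s*=\s*"([^"]*)")rx",
                                   std::regex::icase);
    std::vector<std::string> links;
    for (std::sregex_iterator it(html.begin(), html.end(), hrefRe), end;
         it != end; ++it) {
        links.push_back((*it)[1].str());
    }
    return links;
}

// Hypothetical helper: true if `link` ends with "." + type (e.g. ".mp3").
bool hasType(const std::string& link, const std::string& type) {
    const std::string suffix = "." + type;
    return link.size() >= suffix.size() &&
           link.compare(link.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Links split into files to fetch and pages to descend into.
struct Plan {
    std::vector<std::string> toDownload;
    std::vector<std::string> toRecurse;
};

// Assumption: a depth of zero means "do not recurse", matching the default.
Plan classify(const std::string& html, const std::string& type, int depth) {
    Plan plan;
    for (const auto& link : extractHrefs(html)) {
        if (hasType(link, type)) {
            plan.toDownload.push_back(link);
        } else if (depth > 0) {
            plan.toRecurse.push_back(link);
        }
    }
    return plan;
}
```

In a recursive version, each entry in toRecurse would presumably be fetched and classified again with depth - 1. Links that are only produced by scripts never appear in the raw HTML, which is why a page of this kind yields nothing to download.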

Usage notes:

Multiple file types can be specified by repeating --type (e.g. -t mp3 -t jpg).
A wildcard pattern, specified with --match, can be provided instead of --type and supersedes it (for example -m "*.jpg").
Unless specified using the --folder flag, all content is downloaded to the current working directory.
A thread pool is used to download content concurrently, so bigd should prove quicker than sequential tools like wget.
The default threading value results in simultaneous downloading of 10 files. This can be overridden via the --threads flag.
An optional history of downloaded content (a 'download archive') will be written to a file when specified by the --download-archive flag.
The download archive is also used to ensure that bigd doesn't attempt to re-download content already downloaded.
Recursive downloading is supported with the --depth argument but is disabled by default (a depth of zero).
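The interaction between the thread pool and the download archive in the notes above can be pictured with the following sketch. It is an assumption-laden illustration rather than bigd's code: fetchAll is a hypothetical name, the archive is modelled as an in-memory set of URLs (the on-disk format of the real archive file is not described here), and the actual download is abstracted as a callback so the example involves no network access.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <set>
#include <string>
#include <thread>
#include <vector>

// Hypothetical sketch: run `fetch` on every URL not already in `archive`,
// using up to `threads` worker threads. Returns the URLs that were fetched;
// each one is also recorded in `archive` so it is never fetched twice.
std::vector<std::string> fetchAll(
        const std::vector<std::string>& urls,
        std::set<std::string>& archive,
        unsigned threads,
        const std::function<void(const std::string&)>& fetch) {
    if (threads == 0) threads = 1;  // always make progress
    std::mutex mtx;                 // guards next, archive and fetched
    std::size_t next = 0;           // index of the next unclaimed URL
    std::vector<std::string> fetched;

    auto worker = [&]() {
        for (;;) {
            std::string url;
            {
                std::lock_guard<std::mutex> lock(mtx);
                // Skip anything the archive says was already downloaded.
                while (next < urls.size() && archive.count(urls[next]) != 0) {
                    ++next;
                }
                if (next == urls.size()) return;
                url = urls[next++];
                // Simplification: record the URL before fetching, which
                // also de-duplicates repeated links on the same page.
                archive.insert(url);
            }
            fetch(url);  // runs outside the lock, so fetches overlap
            std::lock_guard<std::mutex> lock(mtx);
            fetched.push_back(url);
        }
    };

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < threads; ++i) pool.emplace_back(worker);
    for (auto& t : pool) t.join();
    return fetched;
}
```

With threads set to 10, this mirrors the default of ten simultaneous downloads. A more careful version would record a URL in the archive only after its fetch succeeds, and would write the archive back to the file given by --download-archive.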

Building

Using Homebrew is the most straightforward route. Add the tap and install:

brew tap benhj/bigd
brew install bigd

Or use cmake:

mkdir build
cd build
cmake ..
make
make install

Or compile directly:

clang++ -std=c++11 bigd.cpp -lcurl -lboost_program_options -lboost_filesystem -lboost_system -o bigd

Contributing

Please create an issue if you find a bug or have an enhancement request, or follow the usual fork and pull-request workflow.

License

Adheres to The Hacky As Fuck Software License.
