Skip to content

Command line utility to extract URLs from text or HTML using an RFC 3986 compliant parser.

License

Notifications You must be signed in to change notification settings

divyekapoor/grepurls

Repository files navigation

GrepURLs / UGrep

This is a simple C++ program that reads its input extracts URLs from it and prints those URLs to stdout. URLs conforming to the RFC 3986 should be parsed successfully. URLs that are delimited by various characters such as (), <>, {}, [], "", '' are also supported.

Build Status

Installation

Just fetch and run Make:

git clone --recursive https:/divyekapoor/grepurls.git
cd grepurls
./configure && make && sudo make install

by default, we install to /usr/local/bin/. If you'd like to install to another path, just edit the Makefile.

Note: The binary is installed as "grepurls" and "ugrep". You may use either one as per your convenience.

If you like, you can also download a pre-built binary from the Releases page.

Usage

ugrep input.html input1.txt input2.txt
cat /tmp/input.html /tmp/input1.txt /tmp/input2.txt | ugrep

OR

grepurls input.html input1.txt input2.txt
cat /tmp/input.html /tmp/input1.txt /tmp/input2.txt | grepurls

You can use standard shell expansion to run ugrep over entire directories:

ugrep mirrored_website/*.html

The URLs will be output 1 per line.

Here's an example:

$ ugrep testdata/contrib_uri.txt
http://de.wikipedia.org/wiki/Uniform_Resource_Identifier
ftp://ftp.is.co.za/rfc/rfc1808.txt
file:///C:/Users/Benutzer/Desktop/Uniform%20Resource%20Identifier.html
file:///etc/fstab
geo:48.33,14.122;u=22.5
ldap://[2001:db8::7]/c=GB?objectClass?one
gopher://gopher.floodgap.com
mailto:[email protected]
sip:[email protected]
news:comp.infosystems.www.servers.unix
data:text/plain;charset=iso-8859-7,%be%fa%be
tel:+1-816-555-1212
telnet://192.0.2.16:80/
urn:oasis:names:specification:docbook:dtd:xml:4.1.2
git:/rails/rails.git
crid://broadcaster.com/movies/BestActionMovieEver
http://nobody:[email protected]:8080/cgi-bin/script.php?action=submit&pageid=86392001#section_2

Uninstall

sudo make uninstall

About

Command line utility to extract URLs from text or HTML using an RFC 3986 compliant parser.

Resources

License

Stars

Watchers

Forks

Packages

No packages published