A simple command-line tool to extract text from Wikipedia and/or PDFs built using Python

MistaAsh/WikiExtractor

WikiExtractor

This is WikiExtractor! A simple, easy-to-use Python web scraping tool that extracts information from Wikipedia pages.

As an added feature, we have also included a simple PDF extractor that uses the Tesseract OCR engine to extract text from PDF files.

Installation

To contribute and work on the repository, you need Python installed on your system. If you do not have Python installed, you can install it from here.

Fork and clone the repository from GitHub.

git clone https://github.com/<your-username-here>/WikiExtractor.git

Change into the directory where the repository was cloned.

cd WikiExtractor

To execute the script, you will need to install the dependencies. Creating a virtual environment for this is recommended.

# Create a virtual environment (not necessary but recommended)
python3 -m venv <name-of-virtual-environment>
source <name-of-virtual-environment>/bin/activate

# Install the dependencies
pip install -r requirements.txt

Wikipedia Extractor

Use the following command to run the script.

python wiki_extractor.py --keyword=<your_keyword> --num_urls=<your_num_urls> --output=<your_output_JSON_file>

Replace each <> placeholder with the appropriate value. Make sure the output file name ends in .json to prevent errors.
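To make the three flags and the .json requirement concrete, here is a minimal sketch of how a command line like the one above could be parsed with argparse. The function names build_parser and parse_cli are hypothetical, and the actual wiki_extractor.py may parse and validate its arguments differently.

```python
import argparse
from typing import Sequence


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three flags shown in the README command."""
    parser = argparse.ArgumentParser(prog="wiki_extractor.py")
    parser.add_argument("--keyword", required=True, help="search term")
    parser.add_argument("--num_urls", type=int, required=True,
                        help="number of Wikipedia URLs to process")
    parser.add_argument("--output", required=True, help="output JSON file")
    return parser


def parse_cli(argv: Sequence[str]) -> argparse.Namespace:
    """Parse argv and reject output names that do not end in .json.

    This check is an assumption: it only illustrates the README's advice
    to end the output file name with .json.
    """
    parser = build_parser()
    args = parser.parse_args(argv)
    if not args.output.lower().endswith(".json"):
        parser.error("--output must end with .json")
    return args


# Example call with explicit arguments instead of sys.argv.
args = parse_cli(["--keyword=Python", "--num_urls=3", "--output=python.json"])
print(args.keyword, args.num_urls, args.output)
```

Rejecting a bad output name up front, before any network request is made, is one way to avoid the errors mentioned above.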


PDF Extractor

To use the PDF Extractor, you also need to install the Tesseract OCR Engine from here. You also need to install Poppler from here and add its bin folder to the system PATH.

To run the script, use this command in the terminal:

python pdf_extractor.py

Implementation

WikiExtractor is implemented in Python. The code is modular so that it can be easily integrated into other projects.

The Wikipedia extractor relies on Google's search ranking to find the most relevant pages. It first sends a GET request to the Google search engine with the keyword as the search term and collects the Wikipedia URLs that the search engine returns for it. The extractor then sends a GET request to each of those URLs and extracts the relevant information from the HTML page.
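The search-then-filter step described above can be sketched with the Python standard library alone. This is an illustration, not the repository's code: the names LinkCollector, wikipedia_urls, and fetch are hypothetical, the search URL format and the /url?q= link wrapping are assumptions about Google's basic HTML results, and Google may block or alter responses to automated requests.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, quote_plus, urlparse
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def wikipedia_urls(search_html, num_urls):
    """Return up to num_urls unique Wikipedia article URLs found in the HTML."""
    collector = LinkCollector()
    collector.feed(search_html)
    found = []
    for href in collector.hrefs:
        if len(found) >= num_urls:
            break
        parsed = urlparse(href)
        # Assumption: result links may be wrapped as /url?q=<target>&...
        if parsed.path == "/url":
            href = parse_qs(parsed.query).get("q", [""])[0]
            parsed = urlparse(href)
        is_wiki = parsed.netloc.lower().endswith(".wikipedia.org")
        if is_wiki and parsed.path.startswith("/wiki/") and href not in found:
            found.append(href)
    return found


def fetch(url):
    """Send a GET request and return the response body as text."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


# Usage (performs network requests, so it is left commented out):
# html = fetch("https://www.google.com/search?q=" + quote_plus("Alan Turing wikipedia"))
# for url in wikipedia_urls(html, num_urls=3):
#     page_html = fetch(url)
```

Keeping the link filtering separate from the network calls means the filter can be checked against saved HTML without sending any requests.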

The PDF extractor uses the Tesseract OCR engine to extract text from PDF files. It first downloads the PDF file, then runs the Tesseract OCR engine on it, and finally writes the extracted text to a JSON file.
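The OCR-to-JSON flow above might look roughly like the following sketch. It assumes the pdf2image and pytesseract packages as the Python bindings for Poppler and Tesseract. The README does not name the libraries pdf_extractor.py actually uses, and the function names, the page_N key layout, and the use of a local file path (skipping the download step) are all hypothetical.

```python
import json


def ocr_pdf(pdf_path):
    """Return one OCR'd text string per page of a local PDF file."""
    # Imported lazily so the JSON helper below works without the OCR stack.
    from pdf2image import convert_from_path  # requires Poppler on PATH
    import pytesseract  # requires the Tesseract binary to be installed

    pages = convert_from_path(pdf_path)  # one PIL image per page
    return [pytesseract.image_to_string(page) for page in pages]


def write_pages_json(page_texts, output_path):
    """Write page texts to a JSON file keyed by 1-based page number."""
    record = {
        f"page_{number}": text.strip()
        for number, text in enumerate(page_texts, start=1)
    }
    with open(output_path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, ensure_ascii=False, indent=2)
    return record


# Usage (needs Tesseract, Poppler, and a real PDF, so it is commented out):
# write_pages_json(ocr_pdf("document.pdf"), "document.json")
```

Rendering each page to an image before OCR is why Poppler is needed in addition to Tesseract in a setup like this one.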


Future Updates

The next version of this tool will use multiprocessing to speed up extraction.
