manuelciosici/ExchangeAndBrown

Exchange and Brown

Version 0.2

(c) 2015 - 2020 Manuel R. Ciosici and UNSILO.com



Introduction

This repository contains software for clustering words based on their distributional similarity using Brown clustering, Exchange or a combination of the two.

If you use the code, please cite the paper:

	
@inproceedings{ciosici-etal-2020-accelerated,
    title = "Accelerated High-Quality Mutual-Information Based Word Clustering",
    author = "Ciosici, Manuel R.  and
      Assent, Ira  and
      Derczynski, Leon",
    booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.303",
    pages = "2491--2496",
    language = "English",
    ISBN = "979-10-95546-34-4",
}
	

Building

In order to compile you need the following:

  • A C++ compiler that supports C++14 and OpenMP (e.g., the newest GNU compiler)

  • The CMake program for building the code

Instructions for building (OS X or Linux):

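Using a terminal, change the working directory to the src directory and run the following commands (replace gcc-8 and g++-8 with the names of your installed compilers):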

mkdir cmake-build-release

cd cmake-build-release

cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=gcc-8 -DCMAKE_CXX_COMPILER=g++-8 ..

cmake --build . --target help -- -j

The help target lists the available build targets. You can replace help with any of the listed targets, or build all of them with:

cmake --build . --target all -- -j


Usage

Simple

Build the target simple_brown and run the binary with the same name.

./simple_brown --help

Advanced

Running on a new corpus requires that you first build the targets Brown and exchange_runner. The process is split into phases: reading, reordering, and clustering. A few additional binaries handle miscellaneous tasks.

Reading

./Brown read --help

Lists the required parameters. This binary reads a plain text corpus into a binary object that is used by all later phases.

It is also possible to read skip-grams using the command

./Brown read-skip --help
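To make the difference between the two reading modes concrete, here is a minimal Python sketch of counting adjacent bigrams versus skip-grams. It is an illustration only, not the repository's C++ reader: the function name count_pairs, the max_skip parameter, and the exact skip-gram definition (pairing each word with words up to max_skip positions further right) are assumptions.

```python
from collections import Counter


def count_pairs(tokens, max_skip=0):
    """Count ordered (left, right) word pairs (hypothetical helper).

    max_skip=0 yields adjacent bigrams only. max_skip=k additionally pairs
    each word with the words up to k positions further to the right.
    """
    counts = Counter()
    for i, left in enumerate(tokens):
        for gap in range(1, max_skip + 2):
            if i + gap < len(tokens):
                counts[(left, tokens[i + gap])] += 1
    return counts


tokens = "the cat sat on the mat".split()
bigrams = count_pairs(tokens)                 # adjacent pairs only
skipgrams = count_pairs(tokens, max_skip=1)   # also pairs one word apart
```

With max_skip=1, the pair ("the", "sat") is counted even though the two words are not adjacent.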

Reordering

./Brown reorder --help

This binary reads the binary corpus file from the previous step and reorders words in the vocabulary so that low ID words are high-frequency words. This is necessary for both Brown and Exchange. This binary can be extended to implement other reordering strategies if needed.
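The frequency-based reordering described above can be sketched in Python as follows. This is an assumption-laden illustration rather than the repository's implementation: the function name reorder_by_frequency and the alphabetical tie-breaking rule are hypothetical.

```python
from collections import Counter


def reorder_by_frequency(tokens):
    """Assign IDs so that more frequent words get lower IDs (hypothetical helper)."""
    freq = Counter(tokens)
    # Sort by descending frequency; break ties alphabetically so the
    # result is deterministic (the tie-breaking rule is an assumption).
    ordered = sorted(freq, key=lambda word: (-freq[word], word))
    return {word: word_id for word_id, word in enumerate(ordered)}


tokens = "the cat sat on the mat the end".split()
word_to_id = reorder_by_frequency(tokens)
encoded = [word_to_id[word] for word in tokens]
```

Here "the" occurs three times and therefore receives ID 0, while the remaining words share the higher IDs.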

Clustering

Brown clustering

./Brown induce_brown --help

Runs the Brown algorithm on top of the binary corpus file. It outputs both a flat clustering and a hierarchical one. The hierarchical clustering uses the same format as wcluster.
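A hierarchical clustering in this style can be cut at different depths to obtain coarser or finer flat clusterings. The Python sketch below assumes the wcluster paths layout of one tab-separated line per word containing the bit-string path, the word, and its count; verify this against your actual output file. The sample data and the helper names parse_paths and clusters_at_depth are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample in the assumed layout: bit-string path, word, count.
SAMPLE_PATHS = "0\tthe\t3\n0\ta\t2\n10\tcat\t1\n10\tdog\t1\n11\tsat\t1\n"


def parse_paths(text):
    """Parse tab-separated (path, word, count) lines into tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        bits, word, count = line.split("\t")
        entries.append((bits, word, int(count)))
    return entries


def clusters_at_depth(entries, depth):
    """Group words by the first `depth` bits of their path."""
    groups = defaultdict(list)
    for bits, word, _count in entries:
        groups[bits[:depth]].append(word)
    return dict(groups)


entries = parse_paths(SAMPLE_PATHS)
coarse = clusters_at_depth(entries, 1)  # cut the tree one level below the root
fine = clusters_at_depth(entries, 2)    # cut one level deeper
```

Shorter prefixes merge subtrees, so the depth-1 cut places "cat", "dog", and "sat" in one cluster, while the depth-2 cut separates "sat" from the other two.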

Exchange clustering

./exchange_runner --help

Allows you to create clusters using Exchange. The difference between EXCHANGE and EXCHANGE_STEPS is that EXCHANGE_STEPS outputs the clustering at the end of every iteration, which allows for model selection.
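For readers unfamiliar with Exchange, the following is a deliberately naive Python sketch of the general idea: each word is tentatively moved to every other cluster, and the move that most improves the average mutual information (AMI) of adjacent clusters is kept. It recomputes AMI from scratch for every candidate move, so it is far slower than, and not a description of, the repository's implementation. The function names, the round-robin initialization, and the use of base-2 logarithms are assumptions. The per-iteration snapshots in history mirror the kind of output that makes model selection possible.

```python
import math
from collections import Counter


def ami(bigrams, assign):
    """Average mutual information (in bits) between adjacent clusters."""
    total = sum(bigrams.values())
    joint, left, right = Counter(), Counter(), Counter()
    for (w1, w2), n in bigrams.items():
        c1, c2 = assign[w1], assign[w2]
        joint[(c1, c2)] += n
        left[c1] += n
        right[c2] += n
    return sum(
        (n / total) * math.log2(n * total / (left[c1] * right[c2]))
        for (c1, c2), n in joint.items()
    )


def exchange(bigrams, assign, num_clusters, max_iterations=10):
    """Greedy word-by-word exchange; returns the final assignment and snapshots."""
    assign = dict(assign)
    history = []
    for _ in range(max_iterations):
        moved = False
        for word in sorted(assign):
            original = assign[word]
            best_cluster, best_score = original, ami(bigrams, assign)
            for candidate in range(num_clusters):
                if candidate == original:
                    continue
                assign[word] = candidate
                score = ami(bigrams, assign)
                if score > best_score + 1e-12:  # accept strict improvements only
                    best_cluster, best_score = candidate, score
            assign[word] = best_cluster
            moved = moved or best_cluster != original
        history.append(dict(assign))  # snapshot after every iteration
        if not moved:
            break
    return assign, history


tokens = "the cat sat the dog sat the cat ran the dog ran".split()
bigrams = Counter(zip(tokens, tokens[1:]))
# Round-robin initialization over 3 clusters (an assumption for this sketch).
initial = {word: i % 3 for i, word in enumerate(sorted(set(tokens)))}
final, history = exchange(bigrams, initial, num_clusters=3)
```

Because a move is accepted only when it strictly increases AMI, the score never decreases from one iteration to the next, and the loop stops as soon as a full pass moves no word.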

Brown clustering on top of Exchange

Run Exchange as described in the previous step, then use the following binary to run Brown on top of the resulting clusters:

./compute_brown_over_clusters --help

Miscellaneous

  1. To get some basic information about a clustering:

./clustering_facts --help

  2. To get the Average Mutual Information of a flat clustering:

./print_clustering_ami --help
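As a point of reference for what Average Mutual Information measures, here is a small worked example in Python. It follows the usual class-bigram definition, summing p(c1, c2) * log(p(c1, c2) / (p(c1) * p(c2))) over adjacent cluster pairs. The function name, the toy data, and the choice of base-2 logarithms are assumptions; the binary above may use a different logarithm base or counting convention.

```python
import math
from collections import Counter


def average_mutual_information(bigrams, assign):
    """AMI in bits of a flat clustering, computed from word bigram counts."""
    total = sum(bigrams.values())
    joint, left, right = Counter(), Counter(), Counter()
    for (w1, w2), n in bigrams.items():
        c1, c2 = assign[w1], assign[w2]
        joint[(c1, c2)] += n   # count of cluster c1 followed by cluster c2
        left[c1] += n          # count of c1 in the left position
        right[c2] += n         # count of c2 in the right position
    return sum(
        (n / total) * math.log2(n * total / (left[c1] * right[c2]))
        for (c1, c2), n in joint.items()
    )


# Toy corpus in which "a" and "b" strictly alternate.
bigrams = Counter({("a", "b"): 2, ("b", "a"): 2})
separate = average_mutual_information(bigrams, {"a": 0, "b": 1})
merged = average_mutual_information(bigrams, {"a": 0, "b": 0})
```

With "a" and "b" in separate clusters, the current cluster fully determines the next one and the AMI is 1 bit; with both words in a single cluster, the clustering carries no information and the AMI is 0.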


License

This software is subject to the terms of The MIT License, which has been included in this repository.


Contact

Please contact Manuel R. Ciosici ([email protected]) with comments, questions, or bugs.


References

  1. Brown, P. F., deSouza, P. V., Mercer, R. L., Della Pietra, V. J., and Lai, J. C. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18(4), 467--479. http://dl.acm.org/ft_gateway.cfm?id=176316

  2. Ciosici, M. R. (2016). Improving Quality of Hierarchical Clustering for Large Data Series. Aarhus University. Retrieved from http://arxiv.org/abs/1608.01238

