derekeder edited this page Mar 11, 2013 · 32 revisions

Dedupe class

Public methods

__init__(self, init=None)

Load or initialize a data model.

Example

# initialize from a settings file
deduper = dedupe.Dedupe(settings_file)

# initialize from a defined set of fields
fields = {
        'Site name': {'type': 'String'},
        'Address': {'type': 'String'},
        'Zip': {'type': 'String', 'Has Missing':True},
        'Phone': {'type': 'String', 'Has Missing':True},
        }

deduper = dedupe.Dedupe(fields)

Keyword arguments

init -- a field definition or a file location for a settings file

A field definition is a dictionary where the keys are the fields that will be used for training a model and the values are the field specifications.

Field types include

  • String

A 'String' type field must have as its key the name of a field as it appears in the data dictionary, and a type declaration, e.g. {'Phone': {'type': 'String'}}

Longer example of a field definition:

fields = {'name':       {'type': 'String'},
          'address':    {'type': 'String'},
          'city':       {'type': 'String'},
          'cuisine':    {'type': 'String'}
          }

Settings files are typically generated by saving the settings learned in a previous session. If you need details for this file see the method writeSettings.

train(self, data_sample, training_source=None)

Learn field weights from a file of labeled examples or a round of interactive labeling.

Examples

# load training data from an existing file
deduper.train(data_sample, training_file)

# train with active learning and human input
deduper.train(data_sample, dedupe.training.consoleLabel)

Keyword arguments

data_sample - a sample of record pairs
training_source - either a path to a file of labeled examples or a labeling function

In the data sample, each element is a tuple of two records. Each record is, in turn, a tuple of the record's key and a record dictionary.

In the record dictionary, the keys are the names of the record fields and the values are the record values.

For example, a data_sample with only one pair of records,

[
  (
   (854, {'city': 'san francisco',
          'address': '300 de haro st.',
          'name': "sally's cafe & bakery",
          'cuisine': 'american'}),
   (855, {'city': 'san francisco',
         'address': '1328 18th st.',
         'name': 'san francisco bbq',
         'cuisine': 'thai'})
   )
 ]

The labeling function will be used to do active learning. The function will be supplied a list of examples that the learner is most 'curious' about, that is, examples where we are most uncertain about how they should be labeled. The labeling function will label these, and based upon what we learn from these examples, the labeling function will be supplied with new examples that the learner is now most curious about. This will continue until the labeling function signals that it is done labeling.

The labeling function must be a function that takes two arguments. The first argument is a sequence of pairs of records. The second argument is the data model.

The labeling function must return two outputs. The function must return a dictionary of labeled pairs and a finished flag.

The dictionary of labeled pairs must have two keys, 1 and 0, corresponding to record pairs that are duplicates or nonduplicates, respectively. The values of the dictionary must be a sequence of record pairs, like the sequence that was passed in.

The 'finished' flag should take the value False for active learning to continue, and the value True to stop active learning.

i.e.

def labelFunction(record_pairs, data_model):
    ...
    return labeled_pairs, finished

For a working example, see consoleLabel in training.
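As an illustration of the required signature, here is a minimal, hypothetical labeling function. It auto-labels a pair as a duplicate when the 'name' fields match exactly and stops after one round; a real labeling function would ask a human to decide, as consoleLabel does.

```python
# Hypothetical auto-labeler (illustrative only, not part of dedupe):
# labels a pair as a duplicate when the 'name' fields match exactly.
def autoLabel(record_pairs, data_model):
    labeled_pairs = {1: [], 0: []}
    for pair in record_pairs:
        (_, record_1), (_, record_2) = pair
        if record_1['name'] == record_2['name']:
            labeled_pairs[1].append(pair)  # duplicates
        else:
            labeled_pairs[0].append(pair)  # distinct records
    finished = True  # stop active learning after one round
    return labeled_pairs, finished
```

This could then be passed as the training_source, just as dedupe.training.consoleLabel is in the example above.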

Labeled example files are typically generated by saving the examples labeled in a previous session. If you need details for this file see the method writeTraining.

blockingFunction(self, ppc=1, uncovered_dupes=1)

Returns a function that takes in a record dictionary and returns a list of blocking keys for the record. We will learn the best blocking predicates if we don't have them already.

Example

blocker = deduper.blockingFunction()

Keyword arguments

ppc - Limits the Proportion of Pairs Covered that we allow a predicate to cover. If a predicate puts together a fraction of possible pairs greater than ppc, that predicate will be removed from consideration.

   As the size of the data increases, the user will generally
   want to reduce ppc.

   ppc should be a value between 0.0 and 1.0

uncovered_dupes - The number of true dupe pairs in our training data that we can accept will not be put into any block. If true duplicates are never in the same block, we will never compare them, and may never declare them to be duplicates.

               However, requiring that we cover every single
               true dupe pair may mean that we have to use
               blocks that put together many, many distinct pairs
               that we'll have to expensively compare as well.
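To make the idea of blocking concrete, here is a toy sketch, not the learned function that blockingFunction returns, in which the blocking key is simply the first word of the 'name' field; records that share a key land in the same block:

```python
from collections import defaultdict

# Toy blocker (illustrative only): the blocker returned by
# blockingFunction() uses learned predicates instead.
def toy_blocker(record):
    return [record['name'].split()[0]]

def make_blocks(data):
    # Group record ids by blocking key.
    blocks = defaultdict(list)
    for record_id, record in data.items():
        for key in toy_blocker(record):
            blocks[key].append(record_id)
    return blocks
```

Applied to three restaurant records named "sally's cafe & bakery", "sally's diner", and "san francisco bbq", the first two share the key "sally's" and fall into one block, so only they would be compared.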

goodThreshold(self, blocks, recall_weight=1.5)

Returns the threshold that maximizes the expected F score, a weighted average of precision and recall for a sample of blocked data.

Example

threshold = deduper.goodThreshold(blocked_data, recall_weight=2)

Keyword arguments

blocks - Sequence of tuples of records, where each tuple is a set of records covered by a blocking predicate

recall_weight - Sets the tradeoff between precision and recall. I.e. if you care twice as much about recall as you do precision, set recall_weight to 2.
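For intuition, a weighted F score of this kind can be written as the standard F-beta measure, with the recall weight playing the role of beta; this sketch is our reading of the tradeoff, not a quote of dedupe's internals:

```python
# F-beta score: recall_weight > 1 favors recall,
# recall_weight < 1 favors precision.
def f_score(precision, recall, recall_weight=1.5):
    beta_sq = recall_weight ** 2
    return ((1 + beta_sq) * precision * recall /
            (beta_sq * precision + recall))
```

With recall_weight=2, a high-recall operating point scores better than a high-precision one with the same values swapped, which is why raising recall_weight pushes the chosen threshold down.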

duplicateClusters(self, blocks, threshold=.5)

Partitions blocked data and returns a list of clusters, where each cluster is a tuple of record ids

Example

clustered_dupes = deduper.duplicateClusters(blocked_data, threshold)

Keyword arguments

blocks - Sequence of tuples of records, where each tuple is a set of records covered by a blocking predicate

threshold - Number between 0 and 1 (default is .5). We will only consider record pairs as duplicates if their estimated duplicate likelihood is greater than the threshold.

          Lowering the number will increase recall, raising it
          will increase precision
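Since each cluster is a tuple of record ids, post-processing is a matter of mapping ids back to your records. A minimal sketch (the sample cluster data here is hypothetical):

```python
# Map each record id to the index of the cluster it belongs to,
# given output shaped like duplicateClusters' return value:
# a list of tuples of record ids.
def cluster_membership(clustered_dupes):
    membership = {}
    for cluster_id, cluster in enumerate(clustered_dupes):
        for record_id in cluster:
            membership[record_id] = cluster_id
    return membership
```

The resulting dictionary can be joined back against the original data, e.g. to write a cluster id column alongside each source record.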

writeSettings(self, file_name)

Write a settings file that contains the data model and predicates

Example

deduper.writeSettings(settings_file)

Keyword arguments

file_name - path to file

writeTraining(self, file_name)

Write the labeled examples to a json file.

Example

deduper.writeTraining(training_file)

Keyword arguments

file_name - path to a json file
