Forest Gregg edited this page Feb 17, 2014 · 32 revisions

Dedupe

Defining a model, __init__(init=None)

Load or initialize a data model.

Example usage

# initialize from a settings file
deduper = dedupe.Dedupe('my_learned_settings')

or

# initialize from a defined set of fields
fields = {
        'Site name': {'type': 'String'},
        'Address': {'type': 'String'},
        'Zip': {'type': 'String', 'Has Missing':True},
        'Phone': {'type': 'String', 'Has Missing':True},
        }

deduper = dedupe.Dedupe(fields)

Keyword arguments

init A field definition or a file location for a settings file. Settings files are typically generated by saving the settings learned in a previous session. If you need details for this file see the method writeSettings.

Field Definitions

A field definition is a dictionary where the keys are the fields that will be used for training a model and the values are the field specifications.

Field types include

  • String
  • Custom
  • LatLong
  • Set
  • Interaction

String Types

A 'String' type field must have as its key the name of a field as it appears in the data dictionary, and a 'type' declaration, e.g. {'Phone': {'type': 'String'}}. The string type expects fields to be of class string. Missing data should be represented as an empty string ''.

String types are compared using affine gap string distance.
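
Affine gap distance charges more to open a gap than to extend one, so a single long insertion (a missing middle name, say) is penalized less than the same number of scattered edits. A minimal sketch of the idea, using Gotoh-style dynamic programming; the penalty values and the function name here are illustrative choices, not dedupe's actual implementation:

```python
INF = float('inf')

def affine_gap_distance(s1, s2, mismatch=1.0, gap_open=1.0, gap_extend=0.5):
    """Minimal affine gap edit distance (illustrative penalties)."""
    n, m = len(s1), len(s2)
    # M: best cost ending in an aligned pair of characters
    # X: best cost ending in a gap in s1 (a character of s2 unmatched)
    # Y: best cost ending in a gap in s2 (a character of s1 unmatched)
    M = [[INF] * (m + 1) for _ in range(n + 1)]
    X = [[INF] * (m + 1) for _ in range(n + 1)]
    Y = [[INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for j in range(1, m + 1):
        X[0][j] = gap_open + gap_extend * (j - 1)
    for i in range(1, n + 1):
        Y[i][0] = gap_open + gap_extend * (i - 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if s1[i - 1] == s2[j - 1] else mismatch
            M[i][j] = sub + min(M[i-1][j-1], X[i-1][j-1], Y[i-1][j-1])
            # opening a gap costs gap_open; continuing one costs gap_extend
            X[i][j] = min(M[i][j-1] + gap_open, X[i][j-1] + gap_extend)
            Y[i][j] = min(M[i-1][j] + gap_open, Y[i-1][j] + gap_extend)
    return min(M[n][m], X[n][m], Y[n][m])
```

With these penalties, appending two characters costs gap_open + gap_extend = 1.5, less than two separate single-character gaps would.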

Custom Types

A 'Custom' type field must have as its key the name of a field as it appears in the data dictionary, a 'type' declaration, and a 'comparator' declaration. The comparator must be a function that takes in two field values and returns a number or a numpy.nan (not a number, appropriate when a distance is not well defined, as when one of the fields is missing).

Example custom comparator:

import numpy

def sameOrNotComparator(field_1, field_2) :
    if field_1 and field_2 :
        if field_1 == field_2 :
            return 1
        else:
            return 0
    else :
        return numpy.nan

Field definition:

{'Zip': {'type': 'Custom', 
         'comparator' : sameOrNotComparator}} 

LatLong

A 'LatLong' type field must have as its key the name of a field as it appears in the data dictionary and a 'type' declaration. LatLong fields are compared using the Haversine Formula. A 'LatLong' field must consist of tuples of floats corresponding to a latitude and a longitude. Missing data should be represented by a tuple of zeros: (0.0, 0.0)

{'Location': {'type': 'LatLong'}} 
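
The Haversine Formula gives the great-circle distance between two latitude/longitude points. A self-contained sketch of the computation (the function name and the 6371 km mean Earth radius are our choices for illustration, not dedupe internals):

```python
import math

def haversine_km(point_1, point_2):
    """Great-circle distance in km between two (lat, lon) tuples in degrees."""
    lat1, lon1 = point_1
    lat2, lon2 = point_2
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    # haversine of the central angle
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))
```

One degree of longitude at the equator works out to roughly 111 km under this radius.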

Set

A 'Set' type field must have as its key the name of a field as it appears in the data dictionary and a 'type' declaration. Set fields are compared using the Jaccard index. Missing data is not implemented for this field type.

{'Co-authors': {'type': 'Set'}} 
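
The Jaccard index is the size of the intersection of two sets divided by the size of their union. A sketch (the empty-set convention here is our assumption, not necessarily dedupe's):

```python
def jaccard(set_1, set_2):
    """Jaccard similarity: |intersection| / |union|, in [0, 1]."""
    if not set_1 and not set_2:
        return 0.0  # convention chosen here; two empty sets have no overlap to measure
    return len(set_1 & set_2) / float(len(set_1 | set_2))
```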

Interaction

An 'Interaction' type field can have as its key any name you choose; it must include a 'type' declaration and an 'Interaction Fields' declaration. An interaction field multiplies the values of the declared fields.

The 'Interaction Fields' must be a sequence of names of other fields you have defined in your field definition.

{'Name'     : {'type': 'String'}, 
 'Zip'      : {'type': 'Custom', 
               'comparator' : sameOrNotComparator},
 'Name-Zip' : {'type': 'Interaction', 
               'Interaction Fields' : ['Name', 'Zip']}} 

Categorical

Categorical variables are useful when you are dealing with qualitatively different types of things. For example, you may have data on businesses and you find that taxi cab businesses tend to have very similar names but law firms don't. Categorical variables would let you indicate whether two records are both taxi companies, both law firms, or one of each.

Dedupe represents these three possibilities using two dummy variables:

taxi-taxi      0 0
lawyer-lawyer  1 0
taxi-lawyer    0 1
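
The encoding in the table above can be sketched as a small helper. This is purely illustrative of the dummy-variable idea; the function and its conventions are hypothetical, not dedupe's internal representation:

```python
def categorical_dummies(cat_1, cat_2, categories=('taxi', 'lawyer')):
    """Encode a pair of category values as the two dummy variables above:
    (both records in the second category, records in different categories)."""
    both_second = float(cat_1 == cat_2 == categories[1])
    mixed = float(cat_1 != cat_2)
    return (both_second, mixed)
```

A pair of taxi companies is the reference case (0, 0); a pair of law firms is (1, 0); one of each is (0, 1).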

A categorical field declaration must include a list of all the different strings that you want to treat as different categories.

So if your data looks like this

'Name'          'Business Type'
AAA Taxi        taxi
AA1 Taxi        taxi
Hindelbert Esq  lawyer

You would create a definition like:

{'Business Type'    : {'type': 'Categorical',
                       'Categories' : ['taxi', 'lawyer']}}

Source

Usually different data sources vary in how many duplicates are contained within them and the patterns that make two pairs of records likely to be duplicates. If you are trying to link records from more than one data set, it can be useful to take these differences into account.

If your data has a field that indicates its source, something like

'Name'         'Source'
John Adams     Campaign Contributions
John Q. Adams  Lobbyist Registration
John F. Adams  Lobbyist Registration

You can take these sources into account with the following field definition:

{'Source'    : {'type': 'Source',
                'Categories' : ['Campaign Contributions', 'Lobbyist Registration']}}

Dedupe will create a categorical variable for the source and then cross-interact it with all the other variables. This has the effect of letting dedupe learn three different models at once. Let's say that we had defined another variable called name. Then our total model would have the following fields

bias
Name
Source
Source:Name
different sources
different sources:Name

Bias + Name would predict the probability that a pair of records were duplicates if both records were from Campaign Contributions.

Bias + Source + Name + Source:Name would predict the probability that a pair of records were duplicates if both records were from Lobbyist Registration.

Bias + different sources + Name + different sources:Name would predict the probability that a pair of records were duplicates if one record was from each of the two sources.
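
The three sub-models described above can be illustrated with a hypothetical helper that expands a pair's name similarity and source labels into the six features listed. The function, its names, and the choice of reference source are our illustration, not dedupe's internal representation:

```python
def source_features(name_score, source_1, source_2,
                    base_source='Campaign Contributions'):
    """Expand one record pair into the bias, source, and interaction features."""
    same_source = source_1 == source_2
    # dummy: both records from the non-reference source
    other = float(same_source and source_1 != base_source)
    # dummy: records from different sources
    diff = float(not same_source)
    return {
        'bias': 1.0,
        'Name': name_score,
        'Source': other,
        'Source:Name': other * name_score,
        'different sources': diff,
        'different sources:Name': diff * name_score,
    }
```

Because the dummies switch the interaction terms on and off, each of the three source combinations effectively gets its own weights on Name.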

Missing Data

If a field has missing data, you can set 'Has Missing' : True in the field definition. This creates a new, additional field representing whether the data was present or not and zeros out the missing data. If there is missing data, but you did not declare 'Has Missing' : True then the missing data will simply be zeroed out.

If you define an interaction with a field that you declared to have missing data, then 'Has Missing' : True will also be set for the Interaction field.

Longer example of a field definition:

fields = {'name'         : {'type' : 'String'},
          'address'      : {'type' : 'String'},
          'city'         : {'type' : 'String'},
          'zip'          : {'type' : 'Custom', 'comparator' : sameOrNotComparator},
          'cuisine'      : {'type' : 'String', 'Has Missing': True},
          'name-address' : {'type' : 'Interaction', 'Interaction Fields' : ['name', 'address']}
          }

train(data_sample, training_source=None)

Learn field weights from file of labeled examples or round of interactive labeling.

Example usage

See our CSV and MySQL examples for methods of creating a data dictionary data_d. To create a data sample, see the dataSample documentation.

# given data_d, a list of frozendicts, grab a sample
# see CSV example
data_sample = dedupe.dataSample(data_d, 150000)

# load training data from an existing file
deduper.train(data_sample, 'my_training')

or

# given data_d, a list of frozendicts, grab a sample
data_sample = dedupe.dataSample(data_d, 150000)

# train with active learning and human input
deduper.train(data_sample, dedupe.training.consoleLabel)

Keyword arguments

data_sample A sample of record pairs.

training_source Either a path to a file of labeled examples or a labeling function.

Additional detail

In the sample of record_pairs, each element is a tuple of two records. Each record is, in turn, a tuple of the record's key and a record dictionary.

In the record dictionary the keys are the names of the record field and values are the record values.

For example, a data_sample with only one pair of records,

[
  (
   (854, {'city': 'san francisco',
          'address': '300 de haro st.',
          'name': "sally's cafe & bakery",
          'cuisine': 'american'}),
   (855, {'city': 'san francisco',
         'address': '1328 18th st.',
         'name': 'san francisco bbq',
         'cuisine': 'thai'})
   )
 ]

The labeling function will be used to do active learning. The function will be supplied a list of examples that the learner is most 'curious' about, that is, examples where we are most uncertain about how they should be labeled. The labeling function will label these, and based upon what we learn from those examples, the labeling function will be supplied with new examples that the learner is now most curious about. This will continue until the labeling function signals that it is done labeling.

The labeling function must be a function that takes two arguments. The first argument is a sequence of pairs of records. The second argument is the data model.

The labeling function must return two outputs. The function must return a dictionary of labeled pairs and a finished flag.

The dictionary of labeled pairs must have two keys, 1 and 0, corresponding to record pairs that are duplicates or nonduplicates respectively. The values of the dictionary must be a sequence of record pairs, like the sequence that was passed in.

The 'finished' flag should take the value False for active learning to continue, and the value True to stop active learning.

i.e.

def labelFunction(record_pairs, data_model) :
    ...
    return (labeled_pairs, finished)

For a working example, see consoleLabel in training.
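
As an illustration of the required signature, here is a hypothetical non-interactive labeling function that marks a pair as a duplicate when the 'name' fields match exactly. It is a sketch only; real sessions would use consoleLabel or another human-in-the-loop function:

```python
def exactNameLabel(record_pairs, data_model):
    """Label record pairs by exact match on the 'name' field (illustrative)."""
    labeled = {1: [], 0: []}
    for pair in record_pairs:
        (_, record_1), (_, record_2) = pair
        if record_1['name'] == record_2['name']:
            labeled[1].append(pair)  # duplicates
        else:
            labeled[0].append(pair)  # distinct
    finished = True  # stop active learning after a single round
    return labeled, finished
```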

Labeled example files are typically generated by saving the examples labeled in a previous session. If you need details for this file see the method writeTraining.

blockingFunction(ppc=1, uncovered_dupes=1)

Returns a function that takes in a record dictionary and returns a list of blocking keys for the record. We will learn the best blocking predicates if we don't have them already.

Example usage

blocker = deduper.blockingFunction()

Keyword arguments

ppc Limits the Proportion of Pairs Covered that we allow a predicate to cover. If a predicate puts together a fraction of possible pairs greater than the ppc, that predicate will be removed from consideration.

As the size of the data increases, the user will generally want to reduce ppc.

ppc should be a value between 0.0 and 1.0

uncovered_dupes The number of true duplicate pairs in our training data that we accept will not be put into any block. If true duplicates are never in the same block, we will never compare them, and may never declare them to be duplicates.

However, requiring that we cover every single true duplicate pair may mean that we have to use blocks that put together many, many distinct pairs that we'll also have to expensively compare.

goodThreshold(blocks, recall_weight=1.5)

Returns the threshold that maximizes the expected F score, a weighted average of precision and recall for a sample of blocked data.

Example usage

threshold = deduper.goodThreshold(blocked_data, recall_weight=2)

Keyword arguments

blocks Sequence of tuples of records, where each tuple is a set of records covered by a blocking predicate.

recall_weight Sets the tradeoff between precision and recall. I.e. if you care twice as much about recall as you do precision, set recall_weight to 2.
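
This tradeoff matches the familiar F-beta score, where recall_weight plays the role of beta (recall is weighted beta times as much as precision). A sketch of that computation, for intuition; dedupe computes its expected F score internally:

```python
def f_score(precision, recall, recall_weight=1.5):
    """F-beta score with beta = recall_weight (illustrative)."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = recall_weight ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
```

Raising recall_weight shifts the best threshold toward one that favors recall over precision.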

duplicateClusters(blocks, threshold=.5)

Partitions blocked data and returns a list of clusters, where each cluster is a tuple of record ids.

Example usage

clustered_dupes = deduper.duplicateClusters(blocked_data, threshold)

Keyword arguments

blocks Sequence of tuples of records, where each tuple is a set of records covered by a blocking predicate.

threshold Number between 0 and 1 (default is .5). We will only consider record pairs as duplicates if their estimated duplicate likelihood is greater than the threshold.

Lowering the number will increase recall, raising it will increase precision.

writeSettings(file_name)

Write a settings file that contains the data model and predicates

Example usage

deduper.writeSettings('my_learned_settings')

Keyword arguments

file_name Path to file.

writeTraining(file_name)

Write to a json file that contains labeled examples.

Example usage

deduper.writeTraining('my_training')

Keyword arguments

file_name Path to a json file.

Convenience

dataSample(data, sample_size)

Randomly sample pairs of records from a data dictionary

Example usage

data_sample = dedupe.dataSample(data_d, 150000)

Keyword arguments

data_d A dictionary-like object indexed by record ID where the values are dictionaries representing records.

sample_size Number of record tuples to return. 150,000 is typically a good size.

blockData(data_d, blocker)

Takes in a data dictionary and a blockingFunction and returns blocks of data to be compared.

Example usage

blocked_data = dedupe.blockData(data_d, blocker)

Keyword arguments

data_d A dictionary-like object indexed by record ID where the values are dictionaries representing records.

blocker A blockingFunction object.
