Answering Questions

brian717 edited this page Aug 4, 2014 · 3 revisions

Raw data cannot be accessed from outside openPDS. Instead, openPDS provides a compute engine that operates on raw data on the machine running the openPDS instance. This is done by instantiating an InternalDataStore, which is implemented in the openPDS/oms_pds/internal package. That package contains modules for the Mongo and SQL backends. To analyze data within the PDS and store answers, the type of InternalDataStore used must match the backend that the PDS is configured to store to. For most current deployments, this is Mongo.
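
The requirement that the store type match the configured backend can be illustrated with a small factory. This is only a sketch: the class names, the `make_internal_data_store` helper, and the `data_type` argument to getData are hypothetical stand-ins, not the actual contents of openPDS/oms_pds/internal.

```python
# Hypothetical stand-ins for illustration only; the real implementations
# live in the openPDS/oms_pds/internal package and may look different.
class InternalDataStore:
    """Interface assumed from the text: read raw data, save answers."""

    def getData(self, data_type):  # argument is an assumption
        raise NotImplementedError

    def saveAnswer(self, key, value):
        raise NotImplementedError


class MongoInternalDataStore(InternalDataStore):
    backend = "mongo"


class SqlInternalDataStore(InternalDataStore):
    backend = "sql"


# Map each supported backend name to the store class that talks to it.
_STORES = {
    "mongo": MongoInternalDataStore,
    "sql": SqlInternalDataStore,
}


def make_internal_data_store(configured_backend):
    """Return a store matching the backend the PDS is configured to use."""
    try:
        return _STORES[configured_backend]()
    except KeyError:
        raise ValueError("Unsupported PDS backend: %r" % (configured_backend,))
```

Choosing the store through a single lookup like this keeps the backend decision in one place, so an analysis task never reads from one backend while saving answers to another.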

Because the data can be very large and the operations on it computationally expensive, openPDS generates answers asynchronously via Celery tasks. To answer a new question, implement a new Celery task, which can reside anywhere within the openPDS app. The task should instantiate a new InternalDataStore, query it for data via the getData method, process the data, and store the result using the saveAnswer method. Answers are stored as key-value pairs, where the key is a unique identifier for the answer; the same key is also used as the OAuth scope that authorizes external applications to access that answer. The task must then be scheduled to run at designated intervals during the day. The interval should be longer than the task's worst-case runtime, and should also take into account how often the raw data is refreshed and how the answer will be presented (for example, whether it is shown to the user as "real-time" data).
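
The read, process, and save flow described above might look roughly like the following. It is a minimal sketch under stated assumptions: `InMemoryDataStore`, the "steps" data type, the "daily_step_total" answer key, and the task path are all hypothetical. The store is passed in so the example runs without a database, whereas a real task would instantiate its own InternalDataStore and be registered as a Celery task. The schedule entry assumes Celery's `CELERYBEAT_SCHEDULE` setting is used for periodic tasks.

```python
from datetime import timedelta


class InMemoryDataStore:
    """Hypothetical stand-in for an InternalDataStore, backed by a dict."""

    def __init__(self, raw):
        self._raw = raw
        self.answers = {}

    def getData(self, data_type):  # argument is an assumption
        return list(self._raw.get(data_type, []))

    def saveAnswer(self, key, value):
        # Answers are key-value pairs; the key doubles as the OAuth scope.
        self.answers[key] = value


def compute_daily_step_total(store):
    """Body of a would-be Celery task: query, process, save the answer."""
    readings = store.getData("steps")
    total = sum(reading["count"] for reading in readings)
    store.saveAnswer("daily_step_total", {"total": total})
    return total


# Assumed periodic schedule. The interval should exceed the task's
# worst-case runtime and reflect how often raw data is refreshed.
CELERYBEAT_SCHEDULE = {
    "daily-step-total": {
        "task": "oms_pds.tasks.compute_daily_step_total",  # hypothetical path
        "schedule": timedelta(hours=1),
    },
}
```

For example, calling `compute_daily_step_total(InMemoryDataStore({"steps": [{"count": 100}, {"count": 250}]}))` stores `{"total": 350}` under the key "daily_step_total". The one-hour interval is a placeholder; it should be tuned to the actual task runtime and data refresh rate.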