Initial Proposal

Note: this document is still in draft.

This document outlines the proposal for the data infrastructure project, which is spun off from the SCEL weatherbox server subsystem.

The purpose of the data infrastructure is to create a platform that makes it easy to collect and store sensor data from any source. This platform will eventually fully support the efforts of the forecasting team, which, given good access to data, will be able to do its own independent research.

https://www.draw.io/#G0Bxowpw1NF2d3YmRaRjBHWTJTckE

Outline

I. Motivation
II. Details
III. Goals
IV. Technical Modules

Previous Work

Issues with the previous project

  • Documentation was poor
  • Difficult to contribute
  • Availability was poor
  • Hard to interface with
  • Limited to one data type

Motivation and Summary

It is currently very difficult to reliably gather time series data from embedded sensor devices. This project aims to provide the software infrastructure to reliably collect data, add new sensors, support new sensor types, and analyze the collected data.

Specifications

High Level Categories:

  • Availability
  • Interfaces
  • Libraries
  • Graphing
  • Contributions
  • Documentation
  • Extensibility
  • Logging
  • Verification
  • Validation

Misc:

  • Outside users should be able to easily view nodes publicly
  • Each node deployment should be trackable
  • Lab users should be able to download datasets using any scripting language
  • We should be able to validate the data that is collected
  • We should be able to detect whether a sensor is down
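The last two requirements (validating collected data and detecting a down sensor) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and not part of the proposal: the Reading fields, the sensor names, the value ranges in VALID_RANGES, and the 15-minute silence threshold.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Reading:
    node_id: str         # hypothetical node identifier
    sensor: str          # hypothetical sensor name, e.g. "temperature_c"
    value: float
    timestamp: datetime  # timezone-aware UTC time of the measurement


# Assumed plausible ranges per sensor; real limits would come from datasheets.
VALID_RANGES = {
    "temperature_c": (-40.0, 85.0),
    "humidity_pct": (0.0, 100.0),
}


def is_valid(reading: Reading) -> bool:
    """Return True if the sensor is known and its value is within range."""
    limits = VALID_RANGES.get(reading.sensor)
    if limits is None:
        return False  # unknown sensor types are rejected
    low, high = limits
    # NaN fails both comparisons, so it is rejected as well.
    return low <= reading.value <= high


def down_nodes(last_seen: dict[str, datetime], now: datetime,
               max_silence: timedelta = timedelta(minutes=15)) -> list[str]:
    """Return the IDs of nodes that have not reported within max_silence."""
    return sorted(node for node, seen in last_seen.items()
                  if now - seen > max_silence)
```

A check like down_nodes could run periodically against the most recent timestamp stored per node. The right threshold depends on how often each node is expected to report.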

Okay this is really hard.

Technical Modules

The block diagram contains the following modules:

  • Client - Primary interface into the data infrastructure - sensors with transport layers such as ZigBee will dump their data to these clients.
  • Messaging Bus/Gateway - Monitors all of the clients and makes sure that they are authorized to send data. Rejects invalid clients.
  • Data Backend - Contains all of the logic necessary to store and process data. Contains a publicly accessible API that can be used to build client applications.
  • Compute Backend - Able to run large compute jobs such as graphing or analysis scripts. Serves dataset results and graphs through a filesystem or the API.
  • Client Gateway
  • Queue
  • Client Connector
  • Reverse Proxy/Balancer
  • Gateway Database
  • Gateway
  • Gateway Queue
  • Worker - Processes data and makes sure that they
  • Node Manager - Manages the registration and display of data nodes.
  • API - Serves data from the core/mirror database. Used by the node manager.
  • Core database - Database that stores all of our data. Currently PostgreSQL.
  • Mirror database - Public, read-only database for public users. Mirrored from the core database.
  • Gateway Queue - Main queue that exists between the gateway and the worker scripts. This makes it possible to upgrade the gateway without losing any data in the network.
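To make the gateway/queue/worker split more concrete, here is a minimal Python sketch of the idea: the gateway rejects unauthorized clients and enqueues everything else, and a worker drains the queue into storage. The class and function names, the AUTHORIZED_CLIENTS allow-list, and the message layout are all hypothetical. The in-process queue.Queue and the plain list are stand-ins only. A real deployment would presumably need a durable message broker and the core database for the queue to survive a gateway upgrade.

```python
import queue

# Hypothetical allow-list of registered client IDs.
AUTHORIZED_CLIENTS = {"client-a", "client-b"}


class Gateway:
    """Accepts messages from authorized clients and enqueues them."""

    def __init__(self, out_queue: queue.Queue):
        self.out_queue = out_queue

    def receive(self, client_id: str, payload: dict) -> bool:
        """Enqueue the payload; return False if the client is rejected."""
        if client_id not in AUTHORIZED_CLIENTS:
            return False  # reject invalid clients
        self.out_queue.put({"client_id": client_id, **payload})
        return True


def run_worker(in_queue: queue.Queue, store: list) -> int:
    """Drain the queue into the store; return the number of messages processed."""
    processed = 0
    while True:
        try:
            message = in_queue.get_nowait()
        except queue.Empty:
            return processed
        store.append(message)  # stand-in for an insert into the core database
        in_queue.task_done()
        processed += 1
```

Because the gateway only holds a reference to the queue, a gateway instance can be replaced while messages that are already enqueued stay available to the worker.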

System Validation

To make sure that upgrades to the system complete successfully, we need to have proper validation tools and processes in place.

We can use tools such as Docker and Vagrant to help us test.

More on this later.

Education

A large motivation behind this project is to educate students and expose them to a project that follows the engineering process. We should think about how to bring students who have almost no experience up to speed quickly.

  • Code Review Process
  • Alumni Contributors

Problems

  • How can contributions be small enough for students? Can we design our system so that small contributions are easier to make?

Possible Names

  • Data Platform
  • Sensor Platform
  • Data Sensor Platform

Authors

Contributing authors:

kluong

Created by kluong on 2016/06/20 00:05.

  • Last modified: 2021/09/19 21:59