Managing Data

Services based on Amazon’s S3 protocol for object storage are ubiquitous and have many advantages in terms of cost, scalability, and ease of use. Unfortunately, in a distributed environment they also have a couple of critical deficiencies: searching via stored metadata is inefficient, and there are no facilities for federating S3 services from different providers.

The Nuvla data management model takes advantage of the positive aspects of S3, while providing global management of metadata for efficient search across providers.

WORM Model

The data management model follows Write-Once, Read-Mostly (WORM) semantics. This model:

  • Optimises read access to data, simplifying access to the data via the underlying services and improving performance.

  • Facilitates the replication of data objects, allowing “hot” data to be accessed efficiently on multiple providers.

  • Improves the reproducibility of data analyses by providing unique identifiers for “versions” of data objects.

Applications can also read data directly from other sources as necessary. For example, an application can read streamed data directly from a sensor, something common at the edge of the hybrid computing platform.

Concretely, the implementation consists of three Nuvla resources:

  • data-object: This resource is a proxy for data stored in a bucket/object within S3 from a given provider. This resource manages the lifecycle of an S3 object, allowing easy upload and download of the data.

  • data-record: This resource provides additional, user-specified metadata for an object. This allows rich, domain-specific metadata to be attached to objects and consequently, precise searching for relevant data objects.

  • data-set: This resource defines dynamic collections of data-object and/or data-record resources via filters. These can be defined by administrators, managers, or users.

Together, these provide a flexible data management framework, applicable to a wide range of use cases. The usual (simplified) workflow consists of 1) creating a data-object (and implicitly the S3 object), 2) optionally adding rich metadata in a data-record object, and 3) finding (and using) the data-object resources included in a data-set.

Data-Object Resources

The data-object resource allows users to create S3 objects on service providers and to store simple metadata concerning those objects.

Creating

The following diagram provides an overview of the workflow for creating a data-object resource.

Workflow to Create Data-Object Resource

The workflow consists of the following steps:

  1. Create data-object resource by providing bucket, object, and S3 credential.

  2. Request pre-signed upload URL via “upload” action.

  3. Use pre-signed upload URL to upload object contents to S3.

  4. Mark object as “ready” (and read-only) via the “ready” action.

Note that the bucket that will contain the data must already exist. (Future versions will create the bucket automatically for you.) The object will be created when you upload the data to S3.
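The four steps above can be sketched as a plan of HTTP requests. This is only a minimal sketch, not the Nuvla client: the endpoint paths, the body field names, the use of HTTP PUT for the upload, and the helper plan_create_workflow are assumptions made for illustration. The resource id and pre-signed URL are placeholders for values that the server would return in steps 1 and 2. Nothing is sent over the network.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    """One HTTP request in the workflow (a description only, nothing is sent)."""
    method: str
    url: str
    body: Optional[dict] = None


def plan_create_workflow(base_url, bucket, obj, credential, resource_id, upload_url):
    """Return the four requests of the create workflow, in order.

    resource_id and upload_url stand in for values returned by steps 1 and 2.
    All paths and field names are illustrative assumptions.
    """
    resource = f"{base_url}/api/{resource_id}"
    return [
        # 1. Create the data-object from bucket, object, and S3 credential.
        Request("POST", f"{base_url}/api/data-object",
                {"bucket": bucket, "object": obj, "credential": credential}),
        # 2. Ask for a pre-signed upload URL via the "upload" action.
        Request("POST", f"{resource}/upload"),
        # 3. Upload the contents directly to S3 with the pre-signed URL.
        Request("PUT", upload_url),
        # 4. Mark the object as "ready" (and read-only).
        Request("POST", f"{resource}/ready"),
    ]


steps = plan_create_workflow(
    "https://nuvla.example.org", "new-bucket-for-tests", "new-object-for-tests",
    "credential/hypothetical-id", "data-object/hypothetical-id",
    "https://s3.example.org/presigned-upload",
)
for step in steps:
    print(step.method, step.url)
```

Only step 3 targets the S3 provider; the other three requests go to the Nuvla server.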

Reading

The data can be consumed by others via the “download” action only once the object has been marked as “ready”. From that point on, the object data can no longer be modified.

Workflow to Read Data-Object Resource

The workflow consists of the following steps:

  1. Request pre-signed download URL via the “download” action.

  2. Use pre-signed download URL to download object contents from S3.

Note that only the pre-signed URL is generated by the Nuvla server. The heavyweight access to the data itself passes directly between the client and the provider’s S3. This ensures that the data transfer uses the highest possible bandwidth.
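The rule stated at the start of this section (download only once the object is “ready”, and no modification afterwards) can be illustrated with a toy in-memory model. This is a sketch under assumptions: the class DataObject, its method names, and the intermediate state names “new” and “uploading” are hypothetical and do not describe the server implementation.

```python
class DataObject:
    """Toy in-memory model of the write-once lifecycle (names are assumptions)."""

    def __init__(self):
        self.state = "new"
        self.content = None

    def upload(self, content: bytes):
        # Content may be written only before the object is marked ready.
        if self.state == "ready":
            raise PermissionError("object is ready and therefore read-only")
        self.content = content
        self.state = "uploading"

    def ready(self):
        # Marking the object ready freezes its content.
        if self.content is None:
            raise RuntimeError("nothing has been uploaded yet")
        self.state = "ready"

    def download(self) -> bytes:
        # Reading is allowed only once the object is ready.
        if self.state != "ready":
            raise PermissionError("download is only allowed once the object is ready")
        return self.content


obj = DataObject()
obj.upload(b"hello")
obj.ready()
print(obj.download())  # b'hello'
```

Calling upload again after ready raises PermissionError, which mirrors the read-only guarantee that makes replication and reproducible analyses straightforward.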

Deleting

The data object and underlying S3 object can be deleted.

Workflow to Delete Data-Object Resource

The workflow consists of the following steps:

  1. Delete the data-object resource via an HTTP DELETE request.

  2. Server verifies access and deletes object from S3.

  3. Server also deletes the bucket if it is empty.

Once the object is deleted, it is no longer accessible either through Nuvla or the underlying S3.
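The server-side part of this workflow (steps 2 and 3) can be sketched with a plain dictionary standing in for the S3 provider. The function delete_object and the nested-dictionary store are hypothetical, and the access check from step 2 is deliberately left out.

```python
def delete_object(store: dict, bucket: str, obj: str) -> None:
    """Delete obj from bucket, then drop the bucket if it has become empty.

    store maps bucket names to {object name: content}; it only imitates S3.
    """
    del store[bucket][obj]      # step 2: delete the object
    if not store[bucket]:       # step 3: delete the bucket if it is now empty
        del store[bucket]


store = {"new-bucket-for-tests": {"a.txt": b"1", "b.txt": b"2"}}

delete_object(store, "new-bucket-for-tests", "a.txt")
print(store)  # the bucket remains because b.txt is still there

delete_object(store, "new-bucket-for-tests", "b.txt")
print(store)  # {}
```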

Data-Record Resources

…….

The data-record resources provide rich metadata for data objects, either created through data-object resources in Nuvla or externally.

These resources are simply JSON documents that are searched, created, updated, and deleted via the standard SCRUD patterns of the API.

Aside from a small set of predefined keys, the schema for the data-record resources is open, allowing users, managers, and administrators to attach their own domain-specific metadata to a data object.

An example data-record resource looks like the following:

{
  "id": "data-record/1e48a9d1-9a26-453c-b392-13e179fd53b4",
  "resource-type": "data-record",
  "name": "data-object-1",
  "description": "data-object-1 description",
  "created": "2019-03-14T16:16:04.768Z",
  "updated": "2019-03-14T16:16:04.768Z",

  "infrastructure-service": "infrastructure-service/b3ec10fb-086e-42f2-8fdb-a215f6ef2089",

  "resource:protocol": "NFS",
  "resource:object": "data-object/6a1be147871c7a542ccd0047b5a03b20",
  
  "data:bucket": "new-bucket-for-tests",
  "data:object": "new-object-for-tests",
  "data:content-type": "text/plain",
  "data:bytes": 1024,
  "data:timestamp": "2019-03-14T16:16:04Z",

  "data:nfsIP": "159.100.242.78",
  "data:nfsDevice": "/nfs-root",
  
  "data:protocols": [
    "tcp+nfs"
  ],

  "gnss:mission": "random"
}

WARNING: Although the schema is open, all the key prefixes must be defined as data-record-key-prefix resources. Having prefixed attributes avoids collisions between domains. Only the Nuvla administrator can define these prefixes.

NOTE: It is strongly recommended to provide a data-record-key resource for each domain-specific key. These resources provide semantic information about the key to help humans provide the right information.
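The prefix rule can be checked on the client side before a record is submitted. The following is a minimal sketch, assuming that a prefix is the part of a key before the first colon and that keys without a colon belong to the predefined set. The function unregistered_prefixes is hypothetical, and the set of registered prefixes would in practice come from the data-record-key-prefix resources.

```python
def unregistered_prefixes(record: dict, registered: set) -> set:
    """Return the key prefixes used in record that are not in registered.

    A key counts as prefixed when it contains a colon; keys without a colon
    (id, name, created, ...) are assumed to be predefined and are skipped.
    """
    used = {key.split(":", 1)[0] for key in record if ":" in key}
    return used - registered


record = {
    "id": "data-record/1e48a9d1-9a26-453c-b392-13e179fd53b4",
    "data:bucket": "new-bucket-for-tests",
    "data:bytes": 1024,
    "gnss:mission": "random",
}

print(unregistered_prefixes(record, {"data", "gnss"}))  # set()
print(unregistered_prefixes(record, {"data"}))          # {'gnss'}
```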

Data-Set Resources

The data-set resources define dynamic collections of data-object and/or data-record resources based on filters over those resources. The data-set resource can also identify appropriate applications based on file types or other criteria.

These resources are simply JSON documents that are searched, created, updated, and deleted via the standard SCRUD patterns of the API.

An example data-set resource looks like the following:

{
  "id": "data-set/ca96c2c1-317f-40cf-b9f3-8ebd0d87e7a2",
  "resource-type": "data-set",
  "name": "GREAT (CLK)",
  "description": "GREAT (CLK) data at ESA",
  "created": "2019-03-14T16:16:03.406Z",
  "updated": "2019-03-14T16:16:03.406Z",

  "module-filter": "data-accept-content-types='application/x-clk'",
  "data-record-filter": "gnss:mission='great' and data:content-type='application/x-clk'"
}

This data-set selects only data-record resources that relate to the GNSS mission “great” and have a content type of application/x-clk. The data-set also identifies appropriate applications based on this content type via the module-filter value.
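To make the selection concrete, the sketch below applies a filter of this shape to a few in-memory records. It is a deliberately small stand-in, not the server's filter engine: it only understands key='value' clauses joined by “and”, it would mis-split a value that itself contains “ and ”, and the functions parse_filter and select as well as the sample records are hypothetical.

```python
import re

# One clause of the form key='value'; keys may contain letters, digits, ':', '.', '-'.
CLAUSE = re.compile(r"^\s*([\w:.-]+)\s*=\s*'([^']*)'\s*$")


def parse_filter(text: str) -> dict:
    """Parse "k1='v1' and k2='v2'" into {"k1": "v1", "k2": "v2"}."""
    conditions = {}
    for clause in re.split(r"\s+and\s+", text):
        match = CLAUSE.match(clause)
        if match is None:
            raise ValueError(f"unsupported clause: {clause!r}")
        conditions[match.group(1)] = match.group(2)
    return conditions


def select(records: list, text: str) -> list:
    """Return the records whose attributes satisfy every clause of the filter."""
    conditions = parse_filter(text)
    return [r for r in records
            if all(r.get(key) == value for key, value in conditions.items())]


records = [
    {"name": "a", "gnss:mission": "great", "data:content-type": "application/x-clk"},
    {"name": "b", "gnss:mission": "great", "data:content-type": "text/plain"},
    {"name": "c", "gnss:mission": "random", "data:content-type": "application/x-clk"},
]

selected = select(records, "gnss:mission='great' and data:content-type='application/x-clk'")
print([r["name"] for r in selected])  # ['a']
```

Because the collection is defined by a filter rather than by a list of identifiers, a record added later with matching attributes becomes part of the data-set without any change to the data-set itself.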

Any user can define their own data sets and share those definitions with others by setting an appropriate ACL.


Copyright 2020, SixSq