Data Lakes and Data APIs


Data Lake Strategy

I’m currently working on two problems. The first is to engineer a solution that lets us amass data from all over the place at a rapid pace. The second is to use that data as a source for our API services.

A now-common solution to the first problem is the data lake. I’m not going to rehash the Dixon blog post on data lakes. You can read it for yourself.

A promise of the data lake is to ingest massive amounts of data at a rapid pace. Postponing the structuring of data until you actually need it makes it faster to ingest. There is an obvious downside to this model, which leads me to my second problem: real-time APIs cannot afford to spend time structuring potential results at query time.

Data artifacts have to be discovered from the lake and structured if necessary. We’ll call this the “mining and refining” process.

Mining data

The data lake is a giant repository of artifacts. The goal of the mining process is to discover the artifacts that the business might care about. Miners are not about structuring the data. Instead, they find the artifacts and orchestrate the refinement process. A data lake usually has a metadata catalog. Miners should rely on that catalog to find artifacts.
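To make the mining step concrete, here is a minimal sketch in Python. The catalog interface and the refine callback are assumptions for the sake of the example; a real miner would use whatever metadata catalog the lake actually exposes.

# Minimal miner sketch: discover artifacts via the lake's metadata catalog
# and hand each one off to a refinement process. The Catalog interface and
# the refine callback are hypothetical stand-ins.
from dataclasses import dataclass
from typing import Callable, Iterable, Protocol


@dataclass
class CatalogEntry:
    artifact_id: str
    content_type: str
    location: str


class Catalog(Protocol):
    def search(self, content_type: str) -> Iterable[CatalogEntry]: ...


def mine(catalog: Catalog, content_type: str,
         refine: Callable[[CatalogEntry], None]) -> int:
    """Find artifacts of interest and orchestrate their refinement.

    The miner never structures data itself; it only discovers artifacts
    and triggers the refiners.
    """
    discovered = 0
    for entry in catalog.search(content_type=content_type):
        refine(entry)  # hand off to the refinement process
        discovered += 1
    return discovered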

Refining data

Artifacts uncovered from the lake may be refined into data structures. The refinement process consists of transforming the artifact into its structured parts.
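As a sketch, a refiner could look like the function below. The raw artifact format (a comma-separated line) is invented purely for illustration; real refiners would parse whatever the lake actually holds, such as scanned documents or log dumps.

# Minimal refiner sketch: turn a raw artifact into a structured record.
# The comma-separated input format is invented for illustration only.
from dataclasses import dataclass


@dataclass
class CustomerRecord:
    first_name: str
    last_name: str


def refine_customer(raw_artifact: str) -> CustomerRecord:
    """Extract a structured customer record from a raw artifact."""
    first_name, last_name = raw_artifact.split(",", maxsplit=1)
    return CustomerRecord(first_name=first_name.strip(),
                          last_name=last_name.strip())


# Example: refine_customer("John, Rambo") -> CustomerRecord("John", "Rambo")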

Storing data

Typical data lake implementations move structured data into a data warehouse. The warehouse and its sub-components support the reporting needs. This definition is too narrow: implementations should move structured data into a data store tailored to the use case. Reporting is a common use case, but APIs are another.

So, to summarize: we have a data lake, miners, refiners, and a data warehouse for our reporting. But we still don’t have APIs.

[Diagram: data lake, miners, refiners, and the data warehouse]

Data Lake and APIs

APIs are a common pattern to allow applications to interact with business data. I often hear developers say things like, “I want to be able to query the data lake”. It makes me cringe. What they really mean is they want to query data that comes from the data lake. I should refer people to Dixon’s metaphor and add that we should never drink from the lake directly.

Read-only APIs

In order to prepare data for APIs, we should once again use the miner/refiner pattern. The use case is no longer reporting, but APIs. This means the target data store should be different. Depending on our needs, we could pick Elasticsearch, DynamoDB, MongoDB or any other implementation. Evaluating and choosing the right one is left to the reader.
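A read-only endpoint over such a store could be as small as the Flask sketch below. The refined store is represented here by a plain dict; in practice it would be a client for whichever store the refiners feed (Elasticsearch, DynamoDB, MongoDB, or any other implementation).

# Read-only API sketch (Flask). Reads hit the refined store only; the data
# lake is never queried directly. The dict is a stand-in for the real store.
from flask import Flask, abort, jsonify

app = Flask(__name__)

# Hypothetical backing store, warmed by the miner/refiner pipeline.
refined_orders = {
    "order-123": {"customerId": "cust-1",
                  "items": [{"id": "item-9", "quantity": 3}]},
}


@app.route("/orders/<order_id>")
def get_order(order_id: str):
    order = refined_orders.get(order_id)
    if order is None:
        abort(404)
    return jsonify(order)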

[Diagram: miner/refiner pipeline feeding the API backing store]

Of course, in order to get there, we cheated a little: we said the APIs had to be read-only. If they weren’t, a user could push a new asset through the API, and the API would become another source for the data lake. Eventually the newly added asset would get mined and refined back into the OLTP store. That process can take a long time, which makes no sense: if I create a new asset through the API, I could have to wait hours before I can query it back from the same API.

Read-write APIs

If you are designing a proper API to manage your data assets, you probably need read/write endpoints: an endpoint to store data as well as an endpoint to retrieve it. All of this should work in real time, of course.

The implementation strategy is similar to caching strategies: in some ways, our OLTP data store is acting as a cache.

There are a few classic caching strategies. Caches are typically subordinate to the applications they serve; they sit between the application and the data source. Here, the data may come from the application or from the refinement pipelines sourced by the lake.

Refined data warms the data store in the background. Our data store also acts as a write-through cache for all data added by the APIs in real time.
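A sketch of that write path might look like the class below, assuming a hypothetical OLTP store and a queue standing in for the lake’s ingestion point: writes land in the OLTP store first (so the API can read them back immediately) and a copy is forwarded to the lake for later mining and refining.

# Write-through sketch: the OLTP store serves reads right away, while every
# write is also forwarded to the data lake's ingestion point. Both backends
# are hypothetical stand-ins.
import queue
from typing import Any, Dict


class WriteThroughStore:
    def __init__(self, oltp: Dict[str, Any], lake_queue: "queue.Queue[Any]") -> None:
        self.oltp = oltp              # stand-in for the OLTP data store
        self.lake_queue = lake_queue  # stand-in for the lake ingestion point

    def put(self, asset_id: str, asset: Any) -> None:
        """Store the asset locally first, then enqueue a copy for the lake."""
        self.oltp[asset_id] = asset             # immediately readable by the API
        self.lake_queue.put((asset_id, asset))  # eventually mined and refined

    def get(self, asset_id: str) -> Any:
        """Reads are always served locally, never from the lake."""
        return self.oltp.get(asset_id)


store = WriteThroughStore(oltp={}, lake_queue=queue.Queue())
store.put("order-123", {"customerId": "cust-1"})
assert store.get("order-123") is not None  # readable right away, no waiting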

[Diagram: a real-time asset flowing from the API into the OLTP store, the data lake, and finally the data warehouse]

The above example shows the movement of a real-time data asset over time. It begins in the API and moves into the OLTP data store. From there, the write-through store pushes it into the data lake. The analytics miner/refiner discovers and structures it before finally pushing it into the data warehouse. The asset flows naturally through the data streams.

Relationships

Managing relationships between assets is an important part of data management. The miner/refiner rig wrestles with the same constraint. Here is an example to consider: A TIFF image enters a store’s data lake. The document is a scanned fax that displays customer information near the top right of the page and a list of items to order towards the middle. This fax is an artifact that represents an order from a customer.

Imagine that we deploy a miner to recognize this kind of document. It can discover the order but doesn’t need to know how to extract data from it. That is a job for the refinement process. There should be many refinement processes deployed — at least one per structured asset we want to extract. In this case, pretend that there is a refiner for customer, one for item, and one for order.

Relationships complicate things a bit. The miners have to orchestrate the work of the refinement processes in order to execute them in the correct order. For example, the order structure must have references to the customer structures; it should not contain the actual customer data. This means that the miner must trigger the customer refiner first to extract a customer identifier. That identifier might be the result of a lookup by customer name or the fruit of creating a new customer asset. Once the customer identifier is obtained, the ordered items are identified in the same manner, and then the order refinement process can begin.
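That orchestration could be sketched as below. The three refiner callbacks are hypothetical, as is the `lines` attribute on the artifact; what matters is the ordering: resolve the customer first, then the items, and only then refine the order so it can reference both by identifier.

# Orchestration sketch: run the refiners in dependency order so the refined
# order references customers and items by identifier instead of embedding
# their data. All refiner callbacks and the artifact shape are hypothetical.
from typing import Any, Callable, Dict, List


def refine_scanned_order(artifact: Any,
                         refine_customer: Callable[[Any], str],
                         refine_item: Callable[[Any, Any], str],
                         refine_order: Callable[..., Dict[str, Any]]) -> Dict[str, Any]:
    # 1. Resolve the customer first (lookup by name, or create a new asset)
    #    so we have a customer identifier to reference.
    customer_id = refine_customer(artifact)

    # 2. Identify every ordered item in the same manner.
    item_ids: List[str] = [refine_item(artifact, line) for line in artifact.lines]

    # 3. Only now can the order itself be refined, referencing the customer
    #    and items by id.
    return refine_order(artifact, customer_id=customer_id, item_ids=item_ids)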

[Diagram: the miner orchestrating the customer, item, and order refiners]

Refined customer structure

The JSON structure below depicts what a refined customer asset might look like.

{
    "id": "4f7a7452-fa3e-4807-89c0-55f1aac27be4",
    "customer": {
        "firstName": "John",
        "lastName": "Rambo"
    }
}

Refined item structure

The item asset includes its unique identifier and a name.

{
    "id": "4279a3aa-6f74-4278-bd2a-b35b05609067",
    "name": "Blue blox"
}

Refined order structure

The refined order contains only references back to the sub-structures it is composed of.

{
    "id": "67020c85-7513-48a2-841d-2846fd7f90c7",
    "timestamp": "2008-09-15T15:53:00",
    "customerId": "4f7a7452-fa3e-4807-89c0-55f1aac27be4",
    "items": [
        {
            "id": "4279a3aa-6f74-4278-bd2a-b35b05609067",
            "quantity": 3
        }
    ]
}

Conclusion

The miner/refiner pattern is essential to operationalize data from the lake. It can be used to hydrate the data warehouse as well as warm the API backing stores.

Having a callback mechanism to orchestrate the refiners is critical to discovering relationships between assets.

The operational data store must act as a write-through cache in order to use the data lake in conjunction with real-time read/write APIs. It needs to store the real-time input data locally and act as a source for the data lake. Following this pattern ensures that the asset also eventually reaches the data warehouse for reporting.

Pumping data out of the lake doesn’t actually pull it out of the lake; it makes a copy that ends up in the operational data store. The purpose of this solution is to keep data “hot” so that the APIs that serve it can respond in real time, so be smart about how you design your application requirements: avoid keeping hot data around longer than necessary, and keep only the attributes that you care about.