If the input space is finite and known in advance, batch processing is the way to go. Predicting customer churn from a fixed historical data feed that is refreshed each week is a classic example.
The beauty of knowing the input space is that you can compute predictions in advance and store them in a database. When a user requests a prediction, it appears in (near) real time, and they never know it was prepared ahead of time.
Crucially, the inputs aren't constantly changing, and they're not user-configurable at runtime.
This is the lasagne served in the bain-marie at your local restaurant. You order it, and the server immediately bags it for you. I'm hungry.
from typing import Dict

import numpy as np
import ray

# Step 1: Create a Ray Dataset from in-memory text prompts.
# (from_numpy puts the values in a column named "data".)
ds = ray.data.from_numpy(np.asarray(["Complete this", "for me"]))

# Step 2: Define a Predictor class for inference.
# Use a class to initialize the model just once in `__init__`
# and re-use it for inference across multiple batches.
class HuggingFacePredictor:
    def __init__(self):
        from transformers import pipeline
        # Initialize a pre-trained GPT2 Huggingface pipeline.
        self.model = pipeline("text-generation", model="gpt2")

    # Logic for inference on 1 batch of data.
    def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, list]:
        # Get the predictions from the input batch.
        predictions = self.model(
            list(batch["data"]), max_length=20, num_return_sequences=1
        )
        # `predictions` is a list of length-one lists. For example:
        # [[{'generated_text': 'output_1'}], ..., [{'generated_text': 'output_2'}]]
        # Modify the output to get it into the following format instead:
        # ['output_1', 'output_2']
        batch["output"] = [sequences[0]["generated_text"] for sequences in predictions]
        return batch

# Step 3: Map the Predictor over the Dataset to get predictions.
# Use 2 parallel actors for inference. Each actor predicts on a
# different partition of data.
predictions = ds.map_batches(HuggingFacePredictor, concurrency=2)

# Step 4: Show one prediction output.
predictions.show(limit=1)
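Coming back to the churn example: once the weekly batch job has scored every customer, serving is just a key lookup. Here's a minimal sketch of that pattern, assuming a Redis store and hypothetical customer_id and churn_probability fields; any low-latency key-value store or database table plays the same role.

import json

import redis  # assumes the redis-py client and a running Redis instance

store = redis.Redis(host="localhost", port=6379)

def publish_predictions(rows):
    """Write the batch job's precomputed predictions to the store."""
    for row in rows:
        # `customer_id` and `churn_probability` are hypothetical column names.
        store.set(f"churn:{row['customer_id']}", json.dumps(row["churn_probability"]))

def get_prediction(customer_id: str):
    """Serve a prediction in (near) real time; it was computed hours ago."""
    value = store.get(f"churn:{customer_id}")
    return json.loads(value) if value is not None else None

The point is that the expensive model call never happens on the request path; the request only pays for a lookup.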
If you know some of the input space but not all of it, you will likely benefit from a mixture of batch processing and stream processing.
At runtime, your processor queries your feature store for the precomputed features while calculating the online features on the fly. The batch side takes computation off the request path, leaving only the remaining work to the online or stream processor.
This is the chef who is ready with roasted veggies and salads but has to cook the meat fresh upon order.
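In code, the request path for this hybrid pattern might look like the sketch below. Everything named here is a placeholder: feature_store.lookup stands in for your feature store client's read API (Feast, Tecton, a plain database table), and the feature names and model are illustrative.

from datetime import datetime, timezone

def handle_request(customer_id: str, cart_items: list, model, feature_store) -> float:
    """Hybrid serving: precomputed (batch) features plus online features."""
    # Batch side: features computed ahead of time and looked up by key.
    batch_features = feature_store.lookup(
        entity_id=customer_id,
        features=["lifetime_value", "orders_last_90d"],  # hypothetical feature names
    )
    # Stream/online side: features that only exist at request time.
    online_features = {
        "cart_size": len(cart_items),
        "hour_of_day": datetime.now(timezone.utc).hour,
    }
    # Combine both views of the input and predict.
    return model.predict({**batch_features, **online_features})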
If you can't determine any input space beforehand - for example, in real-time document extraction, where documents are unpredictable and continuously evolving - you're going all in on stream processing. You're flat-out optimizing your pipeline to stream as fast as possible and return those predictions to the consuming service or user.
This is the chef with nothing prepared except access to produce and expertise to cook a beautiful meal - think MasterChef!
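As a rough sketch of that streaming shape, here's a bare-bones loop with a stand-in event source in place of whatever queue or log (Kafka, Pub/Sub, and so on) your documents actually arrive on; the extract function is a placeholder for the real extraction model.

import time
from typing import Dict, Iterator

def document_stream() -> Iterator[Dict]:
    """Stand-in for a real consumer (Kafka, Pub/Sub, Kinesis, ...)."""
    docs = [{"doc_id": 1, "text": "Invoice #42 ..."}, {"doc_id": 2, "text": "Receipt ..."}]
    for doc in docs:
        yield doc
        time.sleep(0.1)  # simulate documents arriving over time

def extract(doc: Dict) -> Dict:
    """Stand-in for the real extraction model; runs per event, nothing precomputed."""
    return {"doc_id": doc["doc_id"], "fields": {"length": len(doc["text"])}}

# Process each document the moment it arrives and push the result downstream.
for doc in document_stream():
    result = extract(doc)
    print(result)  # in practice: write to a topic or call back the consuming service

Because the inputs can't be enumerated ahead of time, there is nothing to precompute; all of the latency budget is spent on the request path.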
Many modern machine learning systems combine batch and streaming. Perhaps once we figure out quantum computing, we'll only need streaming. Until then, mixing batch and streaming is the way to go.