The Koalja platform is a pipeline construction platform designed to make data processing pipelines simple and easy. Ultimately, users will not have to know about Kubernetes or containers -- they will simply drop/POST code and data into endpoints and necessary outcomes will follow. Scaling of processes can be automated. The provenance of data artifacts is tracked and turned into metadata, using smart tasks and smart links, managed by simple policy. The goal is to make the experience as simple as possible so as to focus on what matters to data scientist or business analyst.
export DOCKERNAMESPACE=<your-docker-hub-account-name>
make build-image docker# Install heptio contour
kubectl apply -f https://j.hept.io/contour-deployment-rbacOnce Contour is deployed, lookup the loadbalancer Service in the
heptio-contour namespace. Make sure to configure DNS appropriately for
all domains (see Domain management) such that
these domains are mapped to the Contour loadbalancer.
# Install the CRD
make install
# Deploy operator
make deployFor every pipeline, a hostname is created and served on the Contour ingress load-balancer.
The hostname is <pipeline-name>.<namespace>.<domain-suffix>.
The domain-suffix can be configured using a ConfigMap in the namespace of the pipeline
(or globally in the koalja-system namespace).
Using the ConfigMap you can specify one or more domain-suffixes. The suffix that is
used is selected using a label selector for every suffix. The suffix with the most specific
label selector that matches the labels of the Pipeline will be used.
The ConfigMap must be named koalja-domain-config and requires the following data format.
apiVersion: v1
kind: ConfigMap
metadata:
name: koalja-domain-config
data:
# These are example settings of domain.
# example.org will be used for pipelines having app=prod.
example.org: |
selector:
app: prod
# Default value for domain, for pipelines that does not have app=prod labels.
# Although it will match all pipelines, it is the least-specific rule so it
# will only be used if no other domain matches.
example.com: ""The Koalja Pipeline resource controller will launch multiple agents and services
for every pipeline resource. The containers for those agents and services are
configured using custom resources (see config/agents)
and a ConfigMap to select which resource will be used.
This way, Koalja can support multiple implementations of services & agents and you can choose which one to use. For example you can choose between (default) in memory implementations of ArangoDB database based implementations.
To select non-default implementations, create a ConfigMap like below and
add it to the namespace in which you're deploying your pipeline resources.
apiVersion: v1
kind: ConfigMap
metadata:
name: koalja-services-config
data:
config: |
annotated-value-registry-name: arangodb-annotatedvalue-registryThe Koalja Operator deploys pipeline components that rely on a FileSystem Storage Service for storing long term data assets. This service is deployed separately from the operator itself.
There are multiple implementations of this service.
- Local: An test only implementation that stores data on the FS of the k8s Nodes.
- S3: An implementation that stores data in a S3 compatible object store.
The selection between these implementation is made using an environment
variable named STORAGETYPE which can be set to either local or s3.
The make deploy target willuses this variable to deploy the appropriate resources.
The S3 storage service must be configured before it can be used. To do so, deploy a yaml file like this:
apiVersion: v1
kind: Secret
metadata:
name: koalja-s3-default-secrets
namespace: koalja-system
data:
access-key: <base64 encoded access key>
secret-key: <base64 encoded secret key>
type: aljabrio/koalja-flex-s3
---
apiVersion: v1
kind: ConfigMap
metadata:
name: koalja-s3-storage-config
namespace: koalja-system
data:
default: |
name: <name of the s3 bucket>
endpoint: <endpoint of the s3 server>
secretName: koalja-s3-default-secretsThe ConfigMap may contain multiple object storages, only a default
object storage is required.
- On GKE:
- Ensure that the nodes support
fusermount(e.g. use ubuntu images) kubectl apply -f config/platforms/gke
- Ensure that the nodes support
cmdPackage containing entry points (mainfunctions) for the binaries of this project.configDirectory containing (customizable) deployment configuration.dockerDirectory containing docker image build files.docsDirectory containing design & specification documents and experiments.examplesExample pipelines.frontendPipeline frontend (react application).hackKubebuilder related template files.pkgPackage root for all koalja libraries & services.third_partyDirectory containing 3rd party code mainly related to GRPC & protocol buffers.vendorExternal go libraries.