Quickstart
PREREQUISITES
Before you begin, make sure you have installed:
SETUP
To checkout the Marquez source code, run:
$ git clone git@github.com:MarquezProject/marquez.git && cd marquez
RUNNING WITH DOCKER
The easiest way to get up and running is with Docker. From the base of the Marquez repository, run:
$ docker-compose up
Tip: Use the
--build
flag to build images from source, or--pull
to pull a tagged image.
Marquez listens on port 5000
for all API calls and port 5001
for the admin interface. To verify the HTTP API server is running and listening on localhost
browse to http://localhost:5001.
Note: By default, the HTTP API does not require any form of authentication or authorization.
Example
In this example, we show how you can collect dataset and job metadata using Marquez. We encourage you to familiarize yourself with the data model and APIs of Marquez.
Note: The example shows how to collect metadata via direct HTTP API calls using
curl
. But, you can also get started using our client library for Java or Python.
STEP 1: CREATE A NAMESPACE
Before we can begin collecting metadata, we must first create a namespace. A namespace
helps you organize related dataset and job metadata. Note that datasets and jobs are unique within a namespace, but not across namespaces. For example, the job my-job
may exist in the namespace this-namespace
and other-namespace
, but not both. In this example, we’ll use the namespace my-namespace
:
REQUEST
$ curl -X PUT http://localhost:5000/api/v1/namespaces/my-namespace \
-H 'Content-Type: application/json' \
-d '{
"ownerName": "me",
"description": "My first namespace."
}'
RESPONSE
{
"name": "my-namespace",
"createdAt": "2020-06-30T20:29:53.521534Z",
"updatedAt": "2020-06-30T20:29:53.525528Z",
"ownerName": "me",
"description": "My first namespace."
}
Note: Marquez provides a
default
namespace to collect metadata, but we encourage you to create your own.
STEP 2: CREATE A SOURCE
Each dataset must be associated with a source. A source
is the physical location of a dataset, such as a table in a database, or a file on cloud storage. A source enables the logical grouping and mapping of physical datasets to their physical source. Below, let’s create the source my-source
for the database mydb
:
REQUEST
$ curl -X PUT http://localhost:5000/api/v1/sources/my-source \
-H 'Content-Type: application/json' \
-d '{
"type": "POSTGRESQL",
"connectionUrl": "jdbc:postgresql://localhost:5431/mydb",
"description": "My first source."
}'
RESPONSE
{
"type": "POSTGRESQL",
"name": "my-source",
"createdAt": "2020-06-30T20:30:56.535357Z",
"updatedAt": "2020-06-30T20:30:56.535357Z",
"connectionUrl": "jdbc:postgresql://localhost:5431/mydb",
"description": "My first source."
}
STEP 3: ADD DATASET TO NAMESPACE
Next, we need to create the dataset my-dataset
and associate it with the existing source my-source
. In Marquez, datasets have both a logical and physical name. The logical name is how your dataset is known to Marquez, while the physical name is how your dataset is known to your source. In this example, we refer to my-dataset
as the logical name and public.mytable
(format:schema.table
) as the physical name:
REQUEST
$ curl -X PUT http://localhost:5000/api/v1/namespaces/my-namespace/datasets/my-dataset \
-H 'Content-Type: application/json' \
-d '{
"type": "DB_TABLE",
"physicalName": "public.mytable",
"sourceName": "my-source",
"fields": [
{"name": "a", "type": "INTEGER"},
{"name": "b", "type": "TIMESTAMP"},
{"name": "c", "type": "INTEGER"},
{"name": "d", "type": "INTEGER"}
],
"description": "My first dataset."
}'
RESPONSE
{
"id": {
"namespace": "my-namespace",
"name": "my-dataset"
},
"type": "DB_TABLE",
"name": "my-dataset",
"physicalName": "public.mytable",
"createdAt": "2020-06-30T20:31:39.129483Z",
"updatedAt": "2020-06-30T20:31:39.259853Z",
"namespace": "my-namespace",
"sourceName": "my-source",
"fields": [
{"name": "a", "type": "INTEGER", "tags": [], "description": null},
{"name": "b", "type": "TIMESTAMP", "tags": [], "description": null},
{"name": "c", "type": "INTEGER", "tags": [], "description": null},
{"name": "d", "type": "INTEGER", "tags": [], "description": null}
],
"tags": [],
"lastModifiedAt": null,
"description": "My first dataset."
}
STEP 4: ADD JOB TO NAMESPACE
With my-dataset
in Marquez, we can collect metadata for the job my-job
:
REQUEST
$ curl -X PUT http://localhost:5000/api/v1/namespaces/my-namespace/jobs/my-job \
-H 'Content-Type: application/json' \
-d '{
"type": "BATCH",
"inputs": [{
"namespace": "my-namespace",
"name": "my-dataset"
}],
"outputs": [],
"location": "https://github.com/my-jobs/blob/124f6089ad4c5fcbb1d7b33cbb5d3a9521c5d32c",
"description": "My first job!"
}'
RESPONSE
{
"id": {
"namespace": "my-namespace",
"name": "my-job"
},
"type": "BATCH",
"name": "my-job",
"createdAt": "2020-06-30T20:32:55.570981Z",
"updatedAt": "2020-06-30T20:32:55.658594Z",
"namespace": "my-namespace",
"inputs": [{
"namespace": "my-namespace",
"name": "my-dataset"
}],
"outputs": [],
"location": "https://github.com/my-jobs/blob/124f6089ad4c5fcbb1d7b33cbb5d3a9521c5d32c",
"context": {},
"description": "My first job!",
"latestRun": null
}
STEP 5: CREATE A RUN
Now, let’s create a run for my-job
and capture any runtime arguments:
REQUEST
$ curl -X POST http://localhost:5000/api/v1/namespaces/my-namespace/jobs/my-job/runs \
-H 'Content-Type: application/json' \
-d '{
"args": {
"email": "me@example.com",
"emailOnFailure": false,
"emailOnRetry": true,
"retries": 1
}
}'
RESPONSE
{
"id": "d46e465b-d358-4d32-83d4-df660ff614dd",
"createdAt": "2020-06-30T20:34:40.146354Z",
"updatedAt": "2020-06-30T20:34:40.165768Z",
"nominalStartTime": null,
"nominalEndTime": null,
"state": "NEW",
"startedAt": null,
"endedAt": null,
"durationMs": null,
"args": {
"email": "me@example.com",
"emailOnFailure": "false",
"emailOnRetry": "true",
"retries": "1"
}
}
The call returns a run ID used to track the execution of our job.
Note: In this example, we use the ID
d46e465b-d358-4d32-83d4-df660ff614dd
to update the run metadata formy-job
, but you’ll want to replace the ID with your own.
STEP 6: START A RUN
Use d46e465b-d358-4d32-83d4-df660ff614dd
to start the run for my-job
:
REQUEST
$ curl -X POST http://localhost:5000/api/v1/jobs/runs/d46e465b-d358-4d32-83d4-df660ff614dd/start
RESPONSE
{
"id": "d46e465b-d358-4d32-83d4-df660ff614dd",
"createdAt": "2020-06-30T20:34:40.146354Z",
"updatedAt": "2020-06-30T20:37:43.746677Z",
"nominalStartTime": null,
"nominalEndTime": null,
"state": "RUNNING",
"startedAt": "2020-06-30T20:37:43.746677Z",
"endedAt": null,
"durationMs": null,
"args": {
"email": "me@example.com",
"emailOnFailure": "false",
"emailOnRetry": "true",
"retries": "1"
}
}
STEP 7: COMPLETE A RUN
Use d46e465b-d358-4d32-83d4-df660ff614dd
to complete the run for my-job
:
REQUEST
$ curl -X POST http://localhost:5000/api/v1/jobs/runs/d46e465b-d358-4d32-83d4-df660ff614dd/complete
RESPONSE
{
"id": "d46e465b-d358-4d32-83d4-df660ff614dd",
"createdAt": "2020-06-30T20:34:40.146354Z",
"updatedAt": "2020-06-30T20:38:25.657449Z",
"nominalStartTime": null,
"nominalEndTime": null,
"state": "COMPLETED",
"startedAt": "2020-06-30T20:37:43.746677Z",
"endedAt": "2020-06-30T20:38:25.657449Z",
"durationMs": 41911,
"args": {
"email": "me@example.com",
"emailOnFailure": "false",
"emailOnRetry": "true",
"retries": "1"
}
}
Summary
In this example, we showed you how to use Marquez to collect dataset and job metadata. We also walked you through the set of API calls to successfully mark a run as complete.