malloy logo Malloy Documentation
search

Queries

The basic syntax for a query in Malloy consists of a source and a "pipeline" of one or more stages separated by ->. The shape of the data defined in the original source is transformed by each stage.

query: flights -> { group_by: carrier; aggregate: flight_count is count() }

Sources

The source of a query can be a table, a source, or a named query.

A query against a table

query: table('malloy-data.faa.flights') -> { aggregate: flight_count is count() }
QUERY RESULTS
flight_​count
37,561,525
[
  {
    "flight_count": 37561525
  }
]
SELECT 
   COUNT( 1) as flight_count
FROM `malloy-data.faa.flights` as base
ORDER BY 1 desc

A query against a source

source: flights is table('malloy-data.faa.flights')

query: flights -> { aggregate: flight_count is count() }
QUERY RESULTS
flight_​count
37,561,525
[
  {
    "flight_count": 37561525
  }
]
SELECT 
   COUNT( 1) as flight_count
FROM `malloy-data.faa.flights` as flights
ORDER BY 1 desc

A query starting from another query

The leading -> is used when the source is a query:

query: flights_by_carrier is table('malloy-data.faa.flights') -> {
  group_by: carrier
  aggregate: flight_count is count()
}

query: -> flights_by_carrier -> { project: carrier; limit: 2 }

Implicit Sources When a query is defined as part of a source or nested inside another query stage, the source is implicit.

Defined as part of a source:

source: flights is table('malloy-data.faa.flights'){
  query: flights_by_carrier is {
    group_by: carrier
    aggregate: flight_count is count()
  }
}

Nested inside another query stage:

query: table('malloy-data.faa.flights') -> {
  group_by: dep_year is dep_time.year
  nest: by_carrier is {
    group_by: carrier
    aggregate: flight_count is count()
  }
}

Pipelines

A each stage of a pipeline performs transformation on the the source or a previous stage.

A stage can do one of:

  • a Reduction: a query containing group_by/aggregate which includes aggregation and/or a group_by to reduce the grain of the data being transformed

  • a Projection: select fields without reducing using project.

Example of a Reduction:

query: flights -> {
  where: distance > 1000        // Filtering
  top: 2                        // Limiting
  order_by: flight_count desc   // Ordering
  group_by: carrier             // Reducing
  aggregate: flight_count is count()
}

Example of a Projection:

  query: flights -> {
    project: *
    limit: 20
  }

Note that the operations in a stage are not order-sensitive like SQL; they can be arranged in any order.

A reference to a named query (which defines its own pipeline) can be the first stage in a pipeline.

query: flights -> by_carrier

Multi-Stage Pipelines

This example shows a pipeline with 3 stages, the multiple stages chained using ->. Each stage generates a CTE in the SQL (click "SQL" on the right to see what this looks like.)

query: table('malloy-data.faa.flights') -> {
  project: *
  where: dep_time > @2003
} -> {
  declare: flight_count is count()     -- declare defines a measure or dimension for use within query
  aggregate: flight_count
  nest: main_query is {
    group_by: carrier
    aggregate: flight_count
  }
} -> {
  project:
    main_query.carrier
    main_query.flight_count
    flight_count_as_a_percent_of_total is main_query.flight_count / flight_count * 100.0
}
QUERY RESULTS
carrierflight_​countflight_​count_​as_​a_​percent_​of_​total
WN2,026,37414.204
AA1,371,9949.617
DL1,345,8369.434
UA1,041,6037.301
MQ1,015,1007.115
[
  {
    "carrier": "WN",
    "flight_count": 2026374,
    "flight_count_as_a_percent_of_total": 14.20405056157329
  },
  {
    "carrier": "AA",
    "flight_count": 1371994,
    "flight_count_as_a_percent_of_total": 9.61711517527129
  },
  {
    "carrier": "DL",
    "flight_count": 1345836,
    "flight_count_as_a_percent_of_total": 9.43375832476411
  },
  {
    "carrier": "UA",
    "flight_count": 1041603,
    "flight_count_as_a_percent_of_total": 7.301209785107004
  },
  {
    "carrier": "MQ",
    "flight_count": 1015100,
    "flight_count_as_a_percent_of_total": 7.115434626111984
  }
]
WITH __stage0 AS (
  SELECT 
     base.carrier as carrier,
     base.origin as origin,
     base.destination as destination,
     base.flight_num as flight_num,
     base.flight_time as flight_time,
     base.tail_num as tail_num,
     base.dep_time as dep_time,
     base.arr_time as arr_time,
     base.dep_delay as dep_delay,
     base.arr_delay as arr_delay,
     base.taxi_out as taxi_out,
     base.taxi_in as taxi_in,
     base.distance as distance,
     base.cancelled as cancelled,
     base.diverted as diverted,
     base.id2 as id2
  FROM `malloy-data.faa.flights` as base
  WHERE base.dep_time>=TIMESTAMP('2004-01-01', 'UTC')
)
, __stage1 AS (
  SELECT
    group_set,
    CASE WHEN group_set=0 THEN
      COUNT( 1)
      END as flight_count__0,
    CASE WHEN group_set=1 THEN
      base.carrier
      END as carrier__1,
    CASE WHEN group_set=1 THEN
      COUNT( 1)
      END as flight_count__1
  FROM __stage0 as base
  CROSS JOIN (SELECT row_number() OVER() -1  group_set FROM UNNEST(GENERATE_ARRAY(0,1,1)))
  GROUP BY 1,3
)
, __stage2 AS (
  SELECT
    ANY_VALUE(CASE WHEN group_set=0 THEN flight_count__0 END) as flight_count,
    ARRAY_AGG(CASE WHEN group_set=1 THEN STRUCT(
      carrier__1 as carrier, 
      flight_count__1 as flight_count
      ) END IGNORE NULLS  ORDER BY  flight_count__1 desc ) as main_query
  FROM __stage1
  ORDER BY 1 desc
)
SELECT 
   main_query_0.carrier as carrier,
   main_query_0.flight_count as flight_count,
   main_query_0.flight_count/base.flight_count*100.0 as flight_count_as_a_percent_of_total
FROM __stage2 as base
LEFT JOIN UNNEST(base.main_query) as main_query_0

This can also be broken into multiple named queries. The syntax to refer to a top-level query (not defined inside a source) like this for purposes of pipelining is -> source_query_name. Used in context:

query: recent_flights is flights -> {
  project: *
  where: dep_time > @2003
}

query: -> recent_flights -> {
  aggregate: flight_count
  nest: main_query is {
    group_by: carrier
    aggregate: flight_count
  }
} -> {
  project:
    main_query.carrier
    main_query.flight_count
    flight_count_as_a_percent_of_total is main_query.flight_count / flight_count * 100.0
}

Fields

In a query stage, fields (dimensions, measures, or queries) may be specified either by referencing an existing name or defining them inline.

query: flights -> {
  group_by: carrier
  aggregate: flight_count is count()
}

When referencing existing fields, wildcard expressions *, **, and some_join.* may be used.

See the Fields section for more information about the different kinds of fields and how they can be defined.

Filters

Filters specified at the top level of query stage apply to the whole stage.

At the query level

query: flights { where: distance > 1000 } -> {
  group_by: distance
  aggregate: flight_count
}

or in the stage.

query: flights -> {
  where: distance > 1000
  group_by: distance
  aggregate: flight_count
}

Filters may also be applied to a query's source, an entire source, or to a measure.

See the Filters section for more information.

Ordering and Limiting

Query stages may also include ordering and limiting specifications.

query: flights -> {
  top: 10
  group_by: carrier
  aggregate: flight_count
}

For detailed information on ordering and limiting, see the Ordering and Limiting section.