MATCH_RECOGNIZE#

Примечание

Ниже приведена оригинальная документация Trino. Скоро мы ее переведем на русский язык и дополним полезными примерами.

Синтаксис#

MATCH_RECOGNIZE (
  [ PARTITION BY column [, ...] ]
  [ ORDER BY column [, ...] ]
  [ MEASURES measure_definition [, ...] ]
  [ rows_per_match ]
  [ AFTER MATCH skip_to ]
  PATTERN ( row_pattern )
  [ SUBSET subset_definition [, ...] ]
  DEFINE variable_definition [, ...]
  )

Описание#

The MATCH_RECOGNIZE clause is an optional subclause of the FROM clause. It is used to detect patterns in a set of rows. Patterns of interest are specified using row pattern syntax based on regular expressions. The input to pattern matching is a table, a view or a subquery. For each detected match, one or more rows are returned. They contain requested information about the match.

Row pattern matching is a powerful tool when analyzing complex sequences of events. The following examples show some of the typical use cases:

  • in trade applications, tracking trends or identifying customers with specific behavioral patterns

  • in shipping applications, tracking packages through all possible valid paths,

  • in financial applications, detecting unusual incidents, which might signal fraud

Example#

In the following example, the pattern describes a V-shape over the totalprice column. A match is found whenever orders made by a customer first decrease in price, and then increase past the starting point:

SELECT * FROM orders MATCH_RECOGNIZE(
     PARTITION BY custkey
     ORDER BY orderdate
     MEASURES
              A.totalprice AS starting_price,
              LAST(B.totalprice) AS bottom_price,
              LAST(U.totalprice) AS top_price
     ONE ROW PER MATCH
     AFTER MATCH SKIP PAST LAST ROW
     PATTERN (A B+ C+ D+)
     SUBSET U = (C, D)
     DEFINE
              B AS totalprice < PREV(totalprice),
              C AS totalprice > PREV(totalprice) AND totalprice <= A.totalprice,
              D AS totalprice > PREV(totalprice)
     )

In the following sections, all subclauses of the MATCH_RECOGNIZE clause are explained with this example query.

Partitioning and ordering#

PARTITION BY custkey

The PARTITION BY clause allows you to break up the input table into separate sections, that are independently processed for pattern matching. Without a partition declaration, the whole input table is used. This behavior is analogous to the semantics of PARTITION BY clause in window specification. In the example, the orders table is partitioned by the custkey value, so that pattern matching is performed for all orders of a specific customer independently from orders of other customers.

ORDER BY orderdate

The optional ORDER BY clause is generally useful to allow matching on an ordered data set. For example, sorting the input by orderdate allows for matching on a trend of changes over time.

Row pattern measures#

The MEASURES clause allows to specify what information is retrieved from a matched sequence of rows.

MEASURES measure_expression AS measure_name [, ...]

A measure expression is a scalar expression whose value is computed based on a match. In the example, three row pattern measures are specified:

A.totalprice AS starting_price returns the price in the first row of the match, which is the only row associated with A according to the pattern.

LAST(B.totalprice) AS bottom_price returns the lowest price (corresponding to the bottom of the «V» in the pattern). It is the price in the last row associated with B, which is the last row of the descending section.

LAST(U.totalprice) AS top_price returns the highest price in the match. It is the price in the last row associated with C or D, which is also the final row of the match.

Measure expressions can refer to the columns of the input table. They also allow special syntax to combine the input information with the details of the match (see Row pattern recognition expressions).

Each measure defines an output column of the pattern recognition. The column can be referenced with the measure_name.

The MEASURES clause is optional. When no measures are specified, certain input columns (depending on ROWS PER MATCH clause) are the output of the pattern recognition.

Rows per match#

This clause can be used to specify the quantity of output rows. There are two main options:

ONE ROW PER MATCH

and

ALL ROWS PER MATCH

ONE ROW PER MATCH is the default option. For every match, a single row of output is produced. Output consists of PARTITION BY columns and measures. The output is also produced for empty matches, based on their starting rows. Rows that are unmatched (that is, neither included in some non-empty match, nor being the starting row of an empty match), are not included in the output.

For ALL ROWS PER MATCH, every row of a match produces an output row, unless it is excluded from the output by the exclusion syntax. Output consists of PARTITION BY columns, ORDER BY columns, measures and remaining columns from the input table. By default, empty matches are shown and unmatched rows are skipped, similarly as with the ONE ROW PER MATCH option. However, this behavior can be changed by modifiers:

ALL ROWS PER MATCH SHOW EMPTY MATCHES

shows empty matches and skips unmatched rows, like the default.

ALL ROWS PER MATCH OMIT EMPTY MATCHES

excludes empty matches from the output.

ALL ROWS PER MATCH WITH UNMATCHED ROWS

shows empty matches and produces additional output row for each unmatched row.

There are special rules for computing row pattern measures for empty matches and unmatched rows. They are explained in Evaluating expressions in empty matches and unmatched rows.

Unmatched rows can only occur when the pattern does not allow an empty match. Otherwise, they are considered as starting rows of empty matches. The option ALL ROWS PER MATCH WITH UNMATCHED ROWS is recommended when pattern recognition is expected to pass all input rows, and it is not certain whether the pattern allows an empty match.

After match skip#

The AFTER MATCH SKIP clause specifies where pattern matching resumes after a non-empty match is found.

The default option is:

AFTER MATCH SKIP PAST LAST ROW

With this option, pattern matching starts from the row after the last row of the match. Overlapping matches are not detected.

With the following option, pattern matching starts from the second row of the match:

AFTER MATCH SKIP TO NEXT ROW

In the example, if a V-shape is detected, further overlapping matches are found, starting from consecutive rows on the descending slope of the «V». Skipping to the next row is the default behavior after detecting an empty match or unmatched row.

The following AFTER MATCH SKIP options allow to resume pattern matching based on the components of the pattern. Pattern matching starts from the last (default) or first row matched to a certain row pattern variable. It can be either a primary pattern variable (they are explained in Row pattern syntax) or a union variable:

AFTER MATCH SKIP TO [ FIRST | LAST ] pattern_variable

It is forbidden to skip to the first row of the current match, because it results in an infinite loop. For example specifying AFTER MATCH SKIP TO A fails, because A is the first element of the pattern, and jumping back to it creates an infinite loop. Similarly, skipping to a pattern variable which is not present in the match causes failure.

All other options than the default AFTER MATCH SKIP PAST LAST ROW allow detection of overlapping matches. The combination of ALL ROWS PER MATCH WITH UNMATCHED ROWS with AFTER MATCH SKIP PAST LAST ROW is the only configuration that guarantees exactly one output row for each input row.

Row pattern syntax#

Row pattern is a form of a regular expression with some syntactical extensions specific to row pattern recognition. It is specified in the PATTERN clause:

PATTERN ( row_pattern )

The basic element of row pattern is a primary pattern variable. Like pattern matching in character strings searches for characters, pattern matching in row sequences searches for rows which can be «labeled» with certain primary pattern variables. A primary pattern variable has a form of an identifier and is defined by a boolean condition. This condition determines whether a particular input row can be mapped to this variable and take part in the match.

In the example PATTERN (A B+ C+ D+), there are four primary pattern variables: A, B, C, and D.

Row pattern syntax includes the following usage:

concatenation#

A B+ C+ D+

It is a sequence of components without operators between them. All components are matched in the same order as they are specified.

alternation#

A | B | C

It is a sequence of components separated by |. Exactly one of the components is matched. In case when multiple components can be matched, the leftmost matching component is chosen.

permutation#

PERMUTE(A, B, C)

It is equivalent to alternation of all permutations of its components. All components are matched in some order. If multiple matches are possible for different orderings of the components, the match is chosen based on the lexicographical order established by the order of components in the PERMUTE list. In the above example, the most preferred option is A B C, and the least preferred option is C B A.

grouping#

(A B C)

partition start anchor#

^

partition end anchor#

$

empty pattern#

()

exclusion syntax#

{- row_pattern -}

Exclusion syntax is used to specify portions of the match to exclude from the output. It is useful in combination with the ALL ROWS PER MATCH option, when only certain sections of the match are interesting.

If you change the example to use ALL ROWS PER MATCH, and the pattern is modified to PATTERN (A {- B+ C+ -} D+), the result consists of the initial matched row and the trailing section of rows.

Specifying pattern exclusions does not affect the computation of expressions in MEASURES and DEFINE clauses. Exclusions also do not affect pattern matching. They have the same semantics as regular grouping with parentheses.

It is forbidden to specify pattern exclusions with the option ALL ROWS PER MATCH WITH UNMATCHED ROWS.

quantifiers#

Pattern quantifiers allow to specify the desired number of repetitions of a sub-pattern in a match. They are appended after the relevant pattern component:

(A | B)*

There are following row pattern quantifiers:

  • zero or more repetitions:

*
  • one or more repetitions:

+
  • zero or one repetition:

?
  • exact number of repetitions, specified by a non-negative integer number:

{n}
  • number of repetitions ranging between bounds, specified by non-negative integer numbers:

{m, n}

Specifying bounds is optional. If the left bound is omitted, it defaults to 0. So, {, 5} can be described as «between zero and five repetitions». If the right bound is omitted, the number of accepted repetitions is unbounded. So, {5, } can be described as «at least five repetitions». Also, {,} is equivalent to *.

Quantifiers are greedy by default. It means that higher number of repetitions is preferred over lower number. This behavior can be changed to reluctant by appending ? immediately after the quantifier. With {3, 5}, 3 repetitions is the least desired option and 5 repetitions – the most desired. With {3, 5}?, 3 repetitions are most desired. Similarly, ? prefers 1 repetition, while ?? prefers 0 repetitions.

Row pattern union variables#

As explained in Row pattern syntax, primary pattern variables are the basic elements of row pattern. In addition to primary pattern variables, you can define union variables. They are introduced in the SUBSET clause:

SUBSET U = (C, D), ...

In the preceding example, union variable U is defined as union of primary variables C and D. Union variables are useful in MEASURES, DEFINE and AFTER MATCH SKIP clauses. They allow you to refer to set of rows matched to either primary variable from a subset.

With the pattern: PATTERN((A | B){5} C+) it cannot be determined upfront if the match contains any A or any B. A union variable can be used to access the last row matched to either A or B. Define SUBSET U = (A, B), and the expression LAST(U.totalprice) returns the value of the totalprice column from the last row mapped to either A or B. Also, AFTER MATCH SKIP TO LAST A or AFTER MATCH SKIP TO LAST B can result in failure if A or B is not present in the match. AFTER MATCH SKIP TO LAST U does not fail.

Row pattern variable definitions#

The DEFINE clause is where row pattern primary variables are defined. Each variable is associated with a boolean condition:

DEFINE B AS totalprice < PREV(totalprice), ...

During pattern matching, when a certain variable is considered for the next step of the match, the boolean condition is evaluated in context of the current match. If the result is true, then the current row, «labeled» with the variable, becomes part of the match.

In the preceding example, assume that the pattern allows to match B at some point. There are some rows already matched to some pattern variables. Now, variable B is being considered for the current row. Before the match is made, the defining condition for B is evaluated. In this example, it is only true if the value of the totalprice column in the current row is lower than totalprice in the preceding row.

The mechanism of matching variables to rows shows the difference between pattern matching in row sequences and regular expression matching in text. In text, characters remain constantly in their positions. In row pattern matching, a row can be mapped to different variables in different matches, depending on the preceding part of the match, and even on the match number.

It is not required that every primary variable has a definition in the DEFINE clause. Variables not mentioned in the DEFINE clause are implicitly associated with true condition, which means that they can be matched to every row.

Boolean expressions in the DEFINE clause allow the same special syntax as expressions in the MEASURES clause. Details are explained in Row pattern recognition expressions.

Row pattern recognition expressions#

Expressions in MEASURES and DEFINE clauses are scalar expressions evaluated over rows of the input table. They support special syntax, specific to pattern recognition context. They can combine input information with the information about the current match. Special syntax allows to access pattern variables assigned to rows, browse rows based on how they are matched, and refer to the sequential number of the match.

pattern variable references#

A.totalprice

U.orderdate

orderstatus

A column name prefixed with a pattern variable refers to values of this column in all rows matched to this variable, or to any variable from the subset in case of union variable. If a column name is not prefixed, it is considered as prefixed with the universal pattern variable, defined as union of all primary pattern variables. In other words, a non-prefixed column name refers to all rows of the current match.

It is forbidden to prefix a column name with a table name in the pattern recognition context.

classifier function#

CLASSIFIER()

CLASSIFIER(A)

CLASSIFIER(U)

The classifier function returns the primary pattern variable associated with the row. The return type is varchar. The optional argument is a pattern variable. It limits the rows of interest, the same way as with prefixed column references. The classifier function is particularly useful with a union variable as the argument. It allows you to determine which variable from the subset actually matched.

match_number function#

MATCH_NUMBER()

The match_number function returns the sequential number of the match within partition, starting from 1. Empty matches are assigned sequential numbers as well as non-empty matches. The return type is bigint.

logical navigation functions#

FIRST(A.totalprice, 2)

In the above example, the first function navigates to the first row matched to pattern variable A, and then searches forward until it finds two more occurrences of variable A within the match. The result is the value of the totalprice column in that row.

LAST(A.totalprice, 2)

In the above example, the last function navigates to the last row matched to pattern variable A, and then searches backwards until it finds two more occurrences of variable A within the match. The result is the value of the totalprice column in that row.

With the first and last functions the result is null, if the searched row is not found in the mach.

The second argument is optional. The default value is 0, which means that by default these functions navigate to the first or last row of interest. If specified, the second argument must be a non-negative integer number.

physical navigation functions#

PREV(A.totalprice, 2)

In the above example, the prev function navigates to the last row matched to pattern variable A, and then searches two rows backward. The result is the value of the totalprice column in that row.

NEXT(A.totalprice, 2)

In the above example, the next function navigates to the last row matched to pattern variable A, and then searches two rows forward. The result is the value of the totalprice column in that row.

With the prev and next functions, it is possible to navigate and retrieve values outside the match. If the navigation goes beyond partition bounds, the result is null.

The second argument is optional. The default value is 1, which means that by default these functions navigate to previous or next row. If specified, the second argument must be a non-negative integer number.

nesting of navigation functions#

It is possible to nest logical navigation functions within physical navigation functions:

PREV(FIRST(A.totalprice, 3), 2)

In case of nesting, first the logical navigation is performed. It establishes the starting row for the physical navigation. When both navigation operations succeed, the value is retrieved from the designated row.

Pattern navigation functions require at least one column reference or classifier function inside of their first argument. The following examples are correct:

LAST("pattern_variable_" || CLASSIFIER())

NEXT(U.totalprice + 10)

This is incorrect:

LAST(1)

It is also required that all column references and all classifier calls inside a pattern navigation function are consistent in referred pattern variables. They must all refer either to the same primary variable, the same union variable, or to the implicit universal pattern variable. The following examples are correct:

LAST(CLASSIFIER() = 'A' OR totalprice > 10) /* universal pattern variable */

LAST(CLASSIFIER(U) = 'A' OR U.totalprice > 10) /* pattern variable U */

This is incorrect:

LAST(A.totalprice + B.totalprice)

Aggregate functions#

It is allowed to use aggregate functions in a row pattern recognition context. Aggregate functions are evaluated over all rows of the current match or over a subset of rows based on the matched pattern variables. The running and final semantics are supported, with running as the default.

The following expression returns the average value of the totalprice column for all rows matched to pattern variable A:

avg(A.totalprice)

The following expression returns the average value of the totalprice column for all rows matched to pattern variables from subset U:

avg(U.totalprice)

The following expression returns the average value of the totalprice column for all rows of the match:

avg(totalprice)

Aggregation arguments#

In case when the aggregate function has multiple arguments, it is required that all arguments refer consistently to the same set of rows:

max_by(totalprice, tax) /* aggregate over all rows of the match */

max_by(CLASSIFIER(A), A.tax) /* aggregate over all rows matched to A */

This is incorrect:

max_by(A.totalprice, tax)

max_by(A.totalprice, A.tax + B.tax)

If an aggregate argument does not contain any column reference or classifier function, it does not refer to any pattern variable. In such a case other aggregate arguments determine the set of rows to aggregate over. If none of the arguments contains a pattern variable reference, the universal row pattern variable is implicit. This means that the aggregate function applies to all rows of the match:

count(1) /* aggregate over all rows of the match */

min_by(1, 2) /* aggregate over all rows of the match */

min_by(1, totalprice) /* aggregate over all rows of the match */

min_by(totalprice, 1) /* aggregate over all rows of the match */

min_by(A.totalprice, 1) /* aggregate over all rows matched to A */

max_by(1, A.totalprice) /* aggregate over all rows matched to A */

Nesting of aggregate functions#

Aggregate function arguments must not contain pattern navigation functions. Similarly, aggregate functions cannot be nested in pattern navigation functions.

Usage of the classifier and match_number functions#

It is allowed to use the classifier and match_number functions in aggregate function arguments. The following expression returns an array containing all matched pattern variables:

array_agg(CLASSIFIER())

This is particularly useful in combination with the option ONE ROW PER MATCH. It allows to get all the components of the match while keeping the output size reduced.

Row pattern count aggregation#

Like other aggregate functions in a row pattern recognition context, the count function can be applied to all rows of the match, or to rows associated with certain row pattern variables:

count(*), count() /* count all rows of the match */

count(totalprice) /* count non-null values of the totalprice column
                     in all rows of the match */

count(A.totalprice) /* count non-null values of the totalprice column
                       in all rows matched to A */

The count function in a row pattern recognition context allows special syntax to support the count(*) behavior over a limited set of rows:

count(A.*) /* count rows matched to A */

count(U.*) /* count rows matched to pattern variables from subset U */

RUNNING and FINAL semantics#

During pattern matching in a sequence of rows, one row after another is examined to determine if it fits the pattern. At any step, a partial match is known, but it is not yet known what rows will be added in the future or what pattern variables they will be mapped to. So, when evaluating a boolean condition in the DEFINE clause for the current row, only the preceding part of the match (plus the current row) is «visible». This is the running semantics.

When evaluating expressions in the MEASURES clause, the match is complete. It is then possible to apply the final semantics. In the final semantics, the whole match is «visible» as from the position of the final row.

In the MEASURES clause, the running semantics can also be applied. When outputting information row by row (as in ALL ROWS PER MATCH), the running semantics evaluate expressions from the positions of consecutive rows.

The running and final semantics are denoted by the keywords: RUNNING and FINAL, preceding a logical navigation function first or last, or an aggregate function:

RUNNING LAST(A.totalprice)

FINAL LAST(A.totalprice)

RUNNING avg(A.totalprice)

FINAL count(A.*)

The running semantics is default in MEASURES and DEFINE clauses. FINAL can only be specified in the MEASURES clause.

With the option ONE ROW PER MATCH, row pattern measures are evaluated from the position of the final row in the match. Therefore, running and final semantics are the same.

Evaluating expressions in empty matches and unmatched rows#

An empty match occurs when the row pattern is successfully matched, but no pattern variables are assigned. The following pattern produces an empty match for every row:

PATTERN(())

When evaluating row pattern measures for an empty match:

  • all column references return null

  • all navigation operations return null

  • classifier function returns null

  • match_number function returns the sequential number of the match

  • all aggregate functions are evaluated over an empty set of rows

Like every match, an empty match has its starting row. All input values which are to be output along with the measures (as explained in Rows per match), are the values from the starting row.

An unmatched row is a row that is neither part of any non-empty match nor the starting row of an empty match. With the option ALL ROWS PER MATCH WITH UNMATCHED ROWS, a single output row is produced. In that row, all row pattern measures are null. All input values which are to be output along with the measures (as explained in Rows per match), are the values from the unmatched row. Using the match_number function as a measure can help differentiate between an empty match and unmatched row.