MATCH_RECOGNIZE#
Примечание
Ниже приведена оригинальная документация Trino. Скоро мы ее переведем на русский язык и дополним полезными примерами.
Synopsis#
MATCH_RECOGNIZE (
[ PARTITION BY column [, ...] ]
[ ORDER BY column [, ...] ]
[ MEASURES measure_definition [, ...] ]
[ rows_per_match ]
[ AFTER MATCH skip_to ]
PATTERN ( row_pattern )
[ SUBSET subset_definition [, ...] ]
DEFINE variable_definition [, ...]
)
Description#
The MATCH_RECOGNIZE
clause is an optional subclause of the FROM
clause.
It is used to detect patterns in a set of rows. Patterns of interest are
specified using row pattern syntax based on regular expressions. The input to
pattern matching is a table, a view or a subquery. For each detected match, one
or more rows are returned. They contain requested information about the match.
Row pattern matching is a powerful tool when analyzing complex sequences of events. The following examples show some of the typical use cases:
in trade applications, tracking trends or identifying customers with specific behavioral patterns
in shipping applications, tracking packages through all possible valid paths,
in financial applications, detecting unusual incidents, which might signal fraud
Example#
In the following example, the pattern describes a V-shape over the
totalprice
column. A match is found whenever orders made by a customer
first decrease in price, and then increase past the starting point:
SELECT * FROM orders MATCH_RECOGNIZE(
PARTITION BY custkey
ORDER BY orderdate
MEASURES
A.totalprice AS starting_price,
LAST(B.totalprice) AS bottom_price,
LAST(U.totalprice) AS top_price
ONE ROW PER MATCH
AFTER MATCH SKIP PAST LAST ROW
PATTERN (A B+ C+ D+)
SUBSET U = (C, D)
DEFINE
B AS totalprice < PREV(totalprice),
C AS totalprice > PREV(totalprice) AND totalprice <= A.totalprice,
D AS totalprice > PREV(totalprice)
)
In the following sections, all subclauses of the MATCH_RECOGNIZE
clause are
explained with this example query.
Partitioning and ordering#
PARTITION BY custkey
The PARTITION BY
clause allows you to break up the input table into
separate sections, that are independently processed for pattern matching.
Without a partition declaration, the whole input table is used. This behavior
is analogous to the semantics of PARTITION BY
clause in window specification. In the example, the orders
table is
partitioned by the custkey
value, so that pattern matching is performed for
all orders of a specific customer independently from orders of other
customers.
ORDER BY orderdate
The optional ORDER BY
clause is generally useful to allow matching on an
ordered data set. For example, sorting the input by orderdate
allows for
matching on a trend of changes over time.
Row pattern measures#
The MEASURES
clause allows to specify what information is retrieved from a
matched sequence of rows.
MEASURES measure_expression AS measure_name [, ...]
A measure expression is a scalar expression whose value is computed based on a match. In the example, three row pattern measures are specified:
A.totalprice AS starting_price
returns the price in the first row of the
match, which is the only row associated with A
according to the pattern.
LAST(B.totalprice) AS bottom_price
returns the lowest price (corresponding
to the bottom of the «V» in the pattern). It is the price in the last row
associated with B
, which is the last row of the descending section.
LAST(U.totalprice) AS top_price
returns the highest price in the match. It
is the price in the last row associated with C
or D
, which is also the
final row of the match.
Measure expressions can refer to the columns of the input table. They also allow special syntax to combine the input information with the details of the match (see Row pattern recognition expressions).
Each measure defines an output column of the pattern recognition. The column
can be referenced with the measure_name
.
The MEASURES
clause is optional. When no measures are specified, certain
input columns (depending on ROWS PER MATCH clause) are
the output of the pattern recognition.
Rows per match#
This clause can be used to specify the quantity of output rows. There are two main options:
ONE ROW PER MATCH
and
ALL ROWS PER MATCH
ONE ROW PER MATCH
is the default option. For every match, a single row of
output is produced. Output consists of PARTITION BY
columns and measures.
The output is also produced for empty matches, based on their starting rows.
Rows that are unmatched (that is, neither included in some non-empty match, nor
being the starting row of an empty match), are not included in the output.
For ALL ROWS PER MATCH
, every row of a match produces an output row, unless
it is excluded from the output by the exclusion syntax. Output consists
of PARTITION BY
columns, ORDER BY
columns, measures and remaining
columns from the input table. By default, empty matches are shown and unmatched
rows are skipped, similarly as with the ONE ROW PER MATCH
option. However,
this behavior can be changed by modifiers:
ALL ROWS PER MATCH SHOW EMPTY MATCHES
shows empty matches and skips unmatched rows, like the default.
ALL ROWS PER MATCH OMIT EMPTY MATCHES
excludes empty matches from the output.
ALL ROWS PER MATCH WITH UNMATCHED ROWS
shows empty matches and produces additional output row for each unmatched row.
There are special rules for computing row pattern measures for empty matches and unmatched rows. They are explained in Evaluating expressions in empty matches and unmatched rows.
Unmatched rows can only occur when the pattern does not allow an empty match.
Otherwise, they are considered as starting rows of empty matches. The option
ALL ROWS PER MATCH WITH UNMATCHED ROWS
is recommended when pattern
recognition is expected to pass all input rows, and it is not certain whether
the pattern allows an empty match.
After match skip#
The AFTER MATCH SKIP
clause specifies where pattern matching resumes after
a non-empty match is found.
The default option is:
AFTER MATCH SKIP PAST LAST ROW
With this option, pattern matching starts from the row after the last row of the match. Overlapping matches are not detected.
With the following option, pattern matching starts from the second row of the match:
AFTER MATCH SKIP TO NEXT ROW
In the example, if a V-shape is detected, further overlapping matches are found, starting from consecutive rows on the descending slope of the «V». Skipping to the next row is the default behavior after detecting an empty match or unmatched row.
The following AFTER MATCH SKIP
options allow to resume pattern matching
based on the components of the pattern. Pattern matching starts from the last
(default) or first row matched to a certain row pattern variable. It can be
either a primary pattern variable (they are explained in
Row pattern syntax) or a
union variable:
AFTER MATCH SKIP TO [ FIRST | LAST ] pattern_variable
It is forbidden to skip to the first row of the current match, because it
results in an infinite loop. For example specifying AFTER MATCH SKIP TO A
fails, because A
is the first element of the pattern, and jumping back to
it creates an infinite loop. Similarly, skipping to a pattern variable which is
not present in the match causes failure.
All other options than the default AFTER MATCH SKIP PAST LAST ROW
allow
detection of overlapping matches. The combination of ALL ROWS PER MATCH WITH UNMATCHED ROWS
with AFTER MATCH SKIP PAST LAST ROW
is the only
configuration that guarantees exactly one output row for each input row.
Row pattern syntax#
Row pattern is a form of a regular expression with some syntactical extensions
specific to row pattern recognition. It is specified in the PATTERN
clause:
PATTERN ( row_pattern )
The basic element of row pattern is a primary pattern variable. Like pattern matching in character strings searches for characters, pattern matching in row sequences searches for rows which can be «labeled» with certain primary pattern variables. A primary pattern variable has a form of an identifier and is defined by a boolean condition. This condition determines whether a particular input row can be mapped to this variable and take part in the match.
In the example PATTERN (A B+ C+ D+)
, there are four primary pattern
variables: A
, B
, C
, and D
.
Row pattern syntax includes the following usage:
concatenation#
A B+ C+ D+
It is a sequence of components without operators between them. All components are matched in the same order as they are specified.
alternation#
A | B | C
It is a sequence of components separated by |
. Exactly one of the
components is matched. In case when multiple components can be matched, the
leftmost matching component is chosen.
permutation#
PERMUTE(A, B, C)
It is equivalent to alternation of all permutations of its components. All
components are matched in some order. If multiple matches are possible for
different orderings of the components, the match is chosen based on the
lexicographical order established by the order of components in the PERMUTE
list. In the above example, the most preferred option is A B C
, and the
least preferred option is C B A
.
grouping#
(A B C)
partition start anchor#
^
partition end anchor#
$
empty pattern#
()
exclusion syntax#
{- row_pattern -}
Exclusion syntax is used to specify portions of the match to exclude from the
output. It is useful in combination with the ALL ROWS PER MATCH
option,
when only certain sections of the match are interesting.
If you change the example to use ALL ROWS PER MATCH
, and the pattern is
modified to PATTERN (A {- B+ C+ -} D+)
, the result consists of the initial
matched row and the trailing section of rows.
Specifying pattern exclusions does not affect the computation of expressions in
MEASURES
and DEFINE
clauses. Exclusions also do not affect pattern
matching. They have the same semantics as regular grouping with parentheses.
It is forbidden to specify pattern exclusions with the option ALL ROWS PER MATCH WITH UNMATCHED ROWS
.
quantifiers#
Pattern quantifiers allow to specify the desired number of repetitions of a sub-pattern in a match. They are appended after the relevant pattern component:
(A | B)*
There are following row pattern quantifiers:
zero or more repetitions:
*
one or more repetitions:
+
zero or one repetition:
?
exact number of repetitions, specified by a non-negative integer number:
{n}
number of repetitions ranging between bounds, specified by non-negative integer numbers:
{m, n}
Specifying bounds is optional. If the left bound is omitted, it defaults to
0
. So, {, 5}
can be described as «between zero and five repetitions».
If the right bound is omitted, the number of accepted repetitions is unbounded.
So, {5, }
can be described as «at least five repetitions». Also, {,}
is
equivalent to *
.
Quantifiers are greedy by default. It means that higher number of repetitions
is preferred over lower number. This behavior can be changed to reluctant by
appending ?
immediately after the quantifier. With {3, 5}
, 3
repetitions is the least desired option and 5 repetitions – the most desired.
With {3, 5}?
, 3 repetitions are most desired. Similarly, ?
prefers 1
repetition, while ??
prefers 0 repetitions.
Row pattern union variables#
As explained in Row pattern syntax, primary pattern variables are the
basic elements of row pattern. In addition to primary pattern variables, you
can define union variables. They are introduced in the SUBSET
clause:
SUBSET U = (C, D), ...
In the preceding example, union variable U
is defined as union of primary
variables C
and D
. Union variables are useful in MEASURES
,
DEFINE
and AFTER MATCH SKIP
clauses. They allow you to refer to set of
rows matched to either primary variable from a subset.
With the pattern: PATTERN((A | B){5} C+)
it cannot be determined upfront if
the match contains any A
or any B
. A union variable can be used to
access the last row matched to either A
or B
. Define SUBSET U = (A, B)
, and the expression LAST(U.totalprice)
returns the value of the
totalprice
column from the last row mapped to either A
or B
. Also,
AFTER MATCH SKIP TO LAST A
or AFTER MATCH SKIP TO LAST B
can result in
failure if A
or B
is not present in the match. AFTER MATCH SKIP TO LAST U
does not fail.
Row pattern variable definitions#
The DEFINE
clause is where row pattern primary variables are defined. Each
variable is associated with a boolean condition:
DEFINE B AS totalprice < PREV(totalprice), ...
During pattern matching, when a certain variable is considered for the next
step of the match, the boolean condition is evaluated in context of the current
match. If the result is true
, then the current row, «labeled» with the
variable, becomes part of the match.
In the preceding example, assume that the pattern allows to match B
at some
point. There are some rows already matched to some pattern variables. Now,
variable B
is being considered for the current row. Before the match is
made, the defining condition for B
is evaluated. In this example, it is
only true if the value of the totalprice
column in the current row is lower
than totalprice
in the preceding row.
The mechanism of matching variables to rows shows the difference between pattern matching in row sequences and regular expression matching in text. In text, characters remain constantly in their positions. In row pattern matching, a row can be mapped to different variables in different matches, depending on the preceding part of the match, and even on the match number.
It is not required that every primary variable has a definition in the
DEFINE
clause. Variables not mentioned in the DEFINE
clause are
implicitly associated with true
condition, which means that they can be
matched to every row.
Boolean expressions in the DEFINE
clause allow the same special syntax as
expressions in the MEASURES
clause. Details are explained in
Row pattern recognition expressions.
Row pattern recognition expressions#
Expressions in MEASURES and DEFINE clauses are scalar expressions evaluated over rows of the input table. They support special syntax, specific to pattern recognition context. They can combine input information with the information about the current match. Special syntax allows to access pattern variables assigned to rows, browse rows based on how they are matched, and refer to the sequential number of the match.
pattern variable references#
A.totalprice
U.orderdate
orderstatus
A column name prefixed with a pattern variable refers to values of this column
in all rows matched to this variable, or to any variable from the subset in
case of union variable. If a column name is not prefixed, it is considered as
prefixed with the universal pattern variable
, defined as union of all
primary pattern variables. In other words, a non-prefixed column name refers to
all rows of the current match.
It is forbidden to prefix a column name with a table name in the pattern recognition context.
classifier function#
CLASSIFIER()
CLASSIFIER(A)
CLASSIFIER(U)
The classifier
function returns the primary pattern variable associated
with the row. The return type is varchar
. The optional argument is a
pattern variable. It limits the rows of interest, the same way as with prefixed
column references. The classifier
function is particularly useful with a
union variable as the argument. It allows you to determine which variable from
the subset actually matched.
match_number function#
MATCH_NUMBER()
The match_number
function returns the sequential number of the match within
partition, starting from 1
. Empty matches are assigned sequential numbers
as well as non-empty matches. The return type is bigint
.
Aggregate functions#
It is allowed to use aggregate functions in a row pattern recognition context.
Aggregate functions are evaluated over all rows of the current match or over a
subset of rows based on the matched pattern variables. The
running and final semantics are supported, with
running
as the default.
The following expression returns the average value of the totalprice
column
for all rows matched to pattern variable A
:
avg(A.totalprice)
The following expression returns the average value of the totalprice
column
for all rows matched to pattern variables from subset U
:
avg(U.totalprice)
The following expression returns the average value of the totalprice
column
for all rows of the match:
avg(totalprice)
Aggregation arguments#
In case when the aggregate function has multiple arguments, it is required that all arguments refer consistently to the same set of rows:
max_by(totalprice, tax) /* aggregate over all rows of the match */
max_by(CLASSIFIER(A), A.tax) /* aggregate over all rows matched to A */
This is incorrect:
max_by(A.totalprice, tax)
max_by(A.totalprice, A.tax + B.tax)
If an aggregate argument does not contain any column reference or
classifier
function, it does not refer to any pattern variable. In such a
case other aggregate arguments determine the set of rows to aggregate over. If
none of the arguments contains a pattern variable reference, the universal row
pattern variable is implicit. This means that the aggregate function applies to
all rows of the match:
count(1) /* aggregate over all rows of the match */
min_by(1, 2) /* aggregate over all rows of the match */
min_by(1, totalprice) /* aggregate over all rows of the match */
min_by(totalprice, 1) /* aggregate over all rows of the match */
min_by(A.totalprice, 1) /* aggregate over all rows matched to A */
max_by(1, A.totalprice) /* aggregate over all rows matched to A */
Nesting of aggregate functions#
Aggregate function arguments must not contain pattern navigation functions. Similarly, aggregate functions cannot be nested in pattern navigation functions.
Usage of the classifier
and match_number
functions#
It is allowed to use the classifier
and match_number
functions in
aggregate function arguments. The following expression returns an array
containing all matched pattern variables:
array_agg(CLASSIFIER())
This is particularly useful in combination with the option
ONE ROW PER MATCH
. It allows to get all the components of the match while
keeping the output size reduced.
Row pattern count aggregation#
Like other aggregate functions in a row pattern recognition context, the
count
function can be applied to all rows of the match, or to rows
associated with certain row pattern variables:
count(*), count() /* count all rows of the match */
count(totalprice) /* count non-null values of the totalprice column
in all rows of the match */
count(A.totalprice) /* count non-null values of the totalprice column
in all rows matched to A */
The count
function in a row pattern recognition context allows special syntax
to support the count(*)
behavior over a limited set of rows:
count(A.*) /* count rows matched to A */
count(U.*) /* count rows matched to pattern variables from subset U */
RUNNING
and FINAL
semantics#
During pattern matching in a sequence of rows, one row after another is
examined to determine if it fits the pattern. At any step, a partial match is
known, but it is not yet known what rows will be added in the future or what
pattern variables they will be mapped to. So, when evaluating a boolean
condition in the DEFINE
clause for the current row, only the preceding part
of the match (plus the current row) is «visible». This is the running
semantics.
When evaluating expressions in the MEASURES
clause, the match is complete.
It is then possible to apply the final
semantics. In the final
semantics, the whole match is «visible» as from the position of the final row.
In the MEASURES
clause, the running
semantics can also be applied. When
outputting information row by row (as in ALL ROWS PER MATCH
), the
running
semantics evaluate expressions from the positions of consecutive
rows.
The running
and final
semantics are denoted by the keywords:
RUNNING
and FINAL
, preceding a logical navigation function first
or
last
, or an aggregate function:
RUNNING LAST(A.totalprice)
FINAL LAST(A.totalprice)
RUNNING avg(A.totalprice)
FINAL count(A.*)
The running
semantics is default in MEASURES
and DEFINE
clauses.
FINAL
can only be specified in the MEASURES
clause.
With the option ONE ROW PER MATCH
, row pattern measures are evaluated from
the position of the final row in the match. Therefore, running
and
final
semantics are the same.
Evaluating expressions in empty matches and unmatched rows#
An empty match occurs when the row pattern is successfully matched, but no pattern variables are assigned. The following pattern produces an empty match for every row:
PATTERN(())
When evaluating row pattern measures for an empty match:
all column references return
null
all navigation operations return
null
classifier
function returnsnull
match_number
function returns the sequential number of the matchall aggregate functions are evaluated over an empty set of rows
Like every match, an empty match has its starting row. All input values which are to be output along with the measures (as explained in Rows per match), are the values from the starting row.
An unmatched row is a row that is neither part of any non-empty match nor the
starting row of an empty match. With the option ALL ROWS PER MATCH WITH UNMATCHED ROWS
, a single output row is produced. In that row, all row pattern
measures are null
. All input values which are to be output along with the
measures (as explained in Rows per match), are the values from the
unmatched row. Using the match_number
function as a measure can help
differentiate between an empty match and unmatched row.