Row pattern recognition в оконных структурах#

Примечание

Ниже приведена оригинальная документация Trino. Скоро мы ее переведем на русский язык и дополним полезными примерами.

A window structure can be defined in the WINDOW clause or in the OVER clause of a window operation. In both cases, the window specification can include row pattern recognition clauses. They are part of the window frame. The syntax and semantics of row pattern recognition in window are similar to those of the MATCH_RECOGNIZE clause.

This section explains the details of row pattern recognition in window structures, and highlights the similarities and the differences between both pattern recognition mechanisms.

Window with row pattern recognition#

Window specification:

(
[ existing_window_name ]
[ PARTITION BY column [, ...] ]
[ ORDER BY column [, ...] ]
[ window_frame ]
)

Window frame:

[ MEASURES measure_definition [, ...] ]
frame_extent
[ AFTER MATCH skip_to ]
[ INITIAL | SEEK ]
[ PATTERN ( row_pattern ) ]
[ SUBSET subset_definition [, ...] ]
[ DEFINE variable_definition [, ...] ]

Generally, a window frame specifies the frame_extent, which defines the «sliding window» of rows to be processed by a window function. It can be defined in terms of ROWS, RANGE or GROUPS.

A window frame with row pattern recognition involves many other syntactical components, mandatory or optional, and enforces certain limitations on the frame_extent.

Window frame with row pattern recognition:

[ MEASURES measure_definition [, ...] ]
ROWS BETWEEN CURRENT ROW AND frame_end
[ AFTER MATCH skip_to ]
[ INITIAL | SEEK ]
PATTERN ( row_pattern )
[ SUBSET subset_definition [, ...] ]
DEFINE variable_definition [, ...]

Description of the pattern recognition clauses#

The frame_extent with row pattern recognition must be defined in terms of ROWS. The frame start must be at the CURRENT ROW, which limits the allowed frame extent values to the following:

ROWS BETWEEN CURRENT ROW AND CURRENT ROW

ROWS BETWEEN CURRENT ROW AND <expression> FOLLOWING

ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING

For every input row processed by the window, the portion of rows enclosed by the frame_extent limits the search area for row pattern recognition. Unlike in MATCH_RECOGNIZE, where the pattern search can explore all rows until the partition end, and all rows of the partition are available for computations, in window structures the pattern matching can neither match rows nor retrieve input values outside the frame.

Besides the frame_extent, pattern matching requires the PATTERN and DEFINE clauses.

The PATTERN clause specifies a row pattern, which is a form of a regular expression with some syntactical extensions. The row pattern syntax is similar to the row pattern syntax in MATCH_RECOGNIZE. However, the anchor patterns ^ and $ are not allowed in a window specification.

The DEFINE clause defines the row pattern primary variables in terms of boolean conditions that must be satisfied. It is similar to the DEFINE clause of MATCH_RECOGNIZE. The only difference is that the window syntax does not support the MATCH_NUMBER function.

The MEASURES clause is syntactically similar to the MEASURES clause of MATCH_RECOGNIZE. The only limitation is that the MATCH_NUMBER function is not allowed. However, the semantics of this clause differs between MATCH_RECOGNIZE and window. While in MATCH_RECOGNIZE every measure produces an output column, the measures in window should be considered as definitions associated with the window structure. They can be called over the window, in the same manner as regular window functions:

SELECT cust_key, value OVER w, label OVER w
    FROM orders
    WINDOW w AS (
                 PARTITION BY cust_key
                 ORDER BY order_date
                 MEASURES
                        RUNNING LAST(total_price) AS value,
                        CLASSIFIER() AS label
                 ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
                 PATTERN (A B+ C+)
                 DEFINE
                        B AS B.value < PREV (B.value),
                        C AS C.value > PREV (C.value)
                )

Measures defined in a window can be referenced in the SELECT clause and in the ORDER BY clause of the enclosing query.

The RUNNING and FINAL keywords are allowed in the MEASURES clause. They can precede a logical navigation function FIRST or LAST, or an aggregate function. However, they have no effect. Every computation is performed from the position of the final row of the match, so the semantics is effectively FINAL.

The AFTER MATCH SKIP clause has the same syntax as the AFTER MATCH SKIP clause of MATCH_RECOGNIZE.

The INITIAL or SEEK modifier is specific to row pattern recognition in window. With INITIAL, which is the default, the pattern match for an input row can only be found starting from that row. With SEEK, if there is no match starting from the current row, the engine tries to find a match starting from subsequent rows within the frame. As a result, it is possible to associate an input row with a match which is detached from that row.

The SUBSET clause is used to define union variables as sets of primary pattern variables. You can use union variables to refer to a set of rows matched to any primary pattern variable from the subset:

SUBSET U = (A, B)

The following expression returns the total_price value from the last row matched to either A or B:

LAST(U.total_price)

If you want to refer to all rows of the match, there is no need to define a SUBSET containing all pattern variables. There is an implicit universal pattern variable applied to any non prefixed column name and any CLASSIFIER call without an argument. The following expression returns the total_price value from the last matched row:

LAST(total_price)

The following call returns the primary pattern variable of the first matched row:

FIRST(CLASSIFIER())

In window, unlike in MATCH_RECOGNIZE, you cannot specify ONE ROW PER MATCH or ALL ROWS PER MATCH. This is because all calls over window, whether they are regular window functions or measures, must comply with the window semantics. A call over window is supposed to produce exactly one output row for every input row. And so, the output mode of pattern recognition in window is a combination of ONE ROW PER MATCH and WITH UNMATCHED ROWS.

Processing input with row pattern recognition#

Pattern recognition in window processes input rows in two different cases:

  • upon a row pattern measure call over the window:

    some_measure OVER w
    
  • upon a window function call over the window:

    sum(total_price) OVER w
    

The output row produced for each input row, consists of:

  • all values from the input row

  • the value of the called measure or window function, computed with respect to the pattern match associated with the row

Processing the input can be described as the following sequence of steps:

  • Partition the input data accordingly to PARTITION BY

  • Order each partition by the ORDER BY expressions

  • For every row of the ordered partition:
    If the row is „skipped“ by a match of some previous row:
    • For a measure, produce a one-row output as for an unmatched row

    • For a window function, evaluate the function over an empty frame and produce a one-row output

    Otherwise:
    • Determine the frame extent

    • Try match the row pattern starting from the current row within the frame extent

    • If no match is found, and SEEK is specified, try to find a match starting from subsequent rows within the frame extent

    If no match is found:
    • For a measure, produce a one-row output for an unmatched row

    • For a window function, evaluate the function over an empty frame and produce a one-row output

    Otherwise:
    • For a measure, produce a one-row output for the match

    • For a window function, evaluate the function over a frame limited to the matched rows sequence and produce a one-row output

    • Evaluate the AFTER MATCH SKIP clause, and mark the „skipped“ rows

Empty matches and unmatched rows#

If no match can be associated with a particular input row, the row is unmatched. This happens when no match can be found for the row. This also happens when no match is attempted for the row, because it is skipped by the AFTER MATCH SKIP clause of some preceding row. For an unmatched row, every row pattern measure is null. Every window function is evaluated over an empty frame.

An empty match is a successful match which does not involve any pattern variables. In other words, an empty match does not contain any rows. If an empty match is associated with an input row, every row pattern measure for that row is evaluated over an empty sequence of rows. All navigation operations and the CLASSIFIER function return null. Every window function is evaluated over an empty frame.

In most cases, the results for empty matches and unmatched rows are the same. A constant measure can be helpful to distinguish between them:

The following call returns 'matched' for every matched row, including empty matches, and null for every unmatched row:

matched OVER (
              ...
              MEASURES 'matched' AS matched
              ...
             )