Powered by <TEI:TOK>
Maarten Janssen, 2014-

CorpusSearch in PSD vs. XPath in PSDX

TEITOK provides the option to search through PSDX files using XPath queries. For those who are used to searching PSD files with CorpusSearch (henceforth CS), here are some useful correspondences between CS and XPath.

A CS search first defines the desired result node and then defines a query. XPath has no such separation: the result node and the "query" are given in a single expression, and the query part can be empty. Consider the following examples.

CorpusSearch          XPath

node: X               //eTree[@Label="X"]

node: X               //eTree[@Label="X" and .//eTree[@Label="A"]]
query: (A exists)

node: X               //eTree[@Label="X" and .//eLeaf[@Text="a"]]
query: (a exists)

The first example is a result node without a query; the second and third specify the result node together with additional requirements on that node. Notice that in XPath it is necessary to specify whether you are looking for a non-terminal node (eTree) or a terminal node (eLeaf).
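The three examples above can be tried out on a toy document. The following is a minimal sketch, not TEITOK's own search code: the PSDX fragment is made up, and only the names forest, eTree/@Label and eLeaf/@Text are taken from this page. Python's standard xml.etree.ElementTree module supports only a subset of XPath (no "and" inside a predicate), so the second condition is checked in plain Python instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical PSDX fragment, invented for illustration only.
PSDX = """
<forest>
  <eTree Label="X">
    <eTree Label="A"><eLeaf Text="a"/></eTree>
  </eTree>
  <eTree Label="X">
    <eTree Label="B"><eLeaf Text="b"/></eTree>
  </eTree>
</forest>
"""

root = ET.fromstring(PSDX)

# node: X (no query)            ~ //eTree[@Label="X"]
all_x = root.findall(".//eTree[@Label='X']")

# node: X, query: (A exists)    ~ //eTree[@Label="X" and .//eTree[@Label="A"]]
x_with_A = [x for x in all_x if x.find(".//eTree[@Label='A']") is not None]

# node: X, query: (a exists)    ~ //eTree[@Label="X" and .//eLeaf[@Text="a"]]
x_with_a = [x for x in all_x if x.find(".//eLeaf[@Text='a']") is not None]

print(len(all_x), len(x_with_A), len(x_with_a))  # prints: 2 1 1
```

Both X nodes match the bare result node, but only the first one contains an A node and a terminal with the text "a".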

To simplify matters, this explanation abbreviates the XPath correspondence of (A exists) to //A; when used as a restriction, this is short for .//eTree[@Label="A"]. Using this shorthand (which cannot be used in the actual XPath search!), below are the XPath expressions for a number of CorpusSearch queries:

CorpusSearch          XPath
(A exists)            //A
(A Dominates B)       //A[.//B]
(A iDominates B)      //A[./B] or //A[B]
(A hasSister B)       //A[../B]
(A Precedes B)        //B/preceding-sibling::A
(A isRoot)            //forest/A or //A[parent::forest]
(A iDomsTotal 3)      //A[count(eTree) = 3]
(A domsWords 3)       //A[count(.//eLeaf) = 3]
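Since the shorthand cannot be used in the actual XPath search, each label has to be written out as eTree[@Label="..."] before a query is submitted. The helper below is a hypothetical sketch of that expansion (expand_shorthand is not part of TEITOK or of any library). It assumes the caller passes the set of node labels explicitly, so that XPath names such as eTree, eLeaf, forest and count are left untouched. A restriction becomes a second predicate on the node, which for non-positional tests selects the same nodes as joining the conditions with "and".

```python
import re
from typing import Iterable

def expand_shorthand(shorthand: str, labels: Iterable[str]) -> str:
    """Replace each bare label in `shorthand` by eTree[@Label="label"]."""
    # Longest labels first, so that a label like "NP-SBJ" wins over "NP".
    ordered = sorted(set(labels), key=len, reverse=True)
    if not ordered:
        return shorthand
    alternatives = "|".join(re.escape(label) for label in ordered)
    # A label only counts when it is not part of a longer name
    # and not already inside a quoted attribute value.
    pattern = re.compile(rf'(?<![\w"-])({alternatives})(?![\w"-])')
    return pattern.sub(lambda m: f'eTree[@Label="{m.group(1)}"]', shorthand)

print(expand_shorthand("//A[.//B]", {"A", "B"}))
# prints: //eTree[@Label="A"][.//eTree[@Label="B"]]
```

For example, expand_shorthand("//B/preceding-sibling::A", {"A", "B"}) yields //eTree[@Label="B"]/preceding-sibling::eTree[@Label="A"].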

In CS, nodes with the same name (label) are by default interpreted as the same node unless they are given a specific number, so (A Dominates B) and (B iDominates C) mean an A node dominating a B node that directly dominates a C node. In XPath, this works differently: further requirements on a node are written directly as restrictions on that node, whereas different nodes that happen to have the same label are specified independently.

CorpusSearch                                  XPath
(A Dominates B) and (B iDominates C)          //A[.//B[C]]
(A Dominates [1]B) and ([2]B iDominates C)    //A[.//B] and //B[C]
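The difference between the two readings shows up as soon as a document contains a B with a C child and another B without one. The sketch below uses a made-up PSDX fragment and the standard xml.etree.ElementTree module; the helpers descendants and children are hypothetical names introduced only for this illustration. It assumes that in the second reading the [2]B condition may be satisfied by any B in the document.

```python
import xml.etree.ElementTree as ET

# Hypothetical PSDX fragment, invented for illustration only.
PSDX = """
<forest>
  <eTree Label="A">
    <eTree Label="B"><eTree Label="C"><eLeaf Text="c"/></eTree></eTree>
  </eTree>
  <eTree Label="A">
    <eTree Label="B"><eLeaf Text="b"/></eTree>
  </eTree>
</forest>
"""
root = ET.fromstring(PSDX)

def descendants(node, label):
    """eTree nodes with this label anywhere below `node` (Dominates)."""
    return node.findall(f".//eTree[@Label='{label}']")

def children(node, label):
    """eTree nodes with this label directly below `node` (iDominates)."""
    return node.findall(f"eTree[@Label='{label}']")

# Same B: the B that A dominates must itself directly dominate a C.
same_b = [a for a in descendants(root, "A")
          if any(children(b, "C") for b in descendants(a, "B"))]

# Distinct Bs: A dominates some B, and some B (possibly another one)
# directly dominates a C.
some_b_has_c = any(children(b, "C") for b in descendants(root, "B"))
distinct_b = [a for a in descendants(root, "A")
              if descendants(a, "B") and some_b_has_c]

print(len(same_b), len(distinct_b))  # prints: 1 2
```

Only the first A satisfies the same-node reading, while both A nodes satisfy the reading with two independent B nodes.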
