
Powered by <TEI:TOK>
Maarten Janssen, 2014-

Querying PSDX Trees with XPath

Syntactic trees in TEITOK are stored in the PSDX format, which is the XML version of the Penn Treebank PSD format.

In the PSDX format, trees are represented as an XML hierarchy mimicking the syntactic tree.

PSDX elements and attributes

| Element | Attribute | Description |
|---|---|---|
| forest | | the root node of a syntactic tree |
| eTree | | any syntactic or morphosyntactic node (non-terminal) |
| | Label | syntactic or POS label |
| | index | numerical index encoding a syntactic dependency (matches the index of another element within the same tree) |
| eLeaf | | a lexical or empty terminal node |
| | Text | the lexical content of an eLeaf |
| | Notext | the empty content of an eLeaf (null categories) |
| | index | numerical index encoding a syntactic dependency (matches the index of another element within the same tree) |
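
To make the element and attribute inventory concrete, here is a minimal sketch in Python that parses a small PSDX-like fragment and prints its hierarchy. The fragment, its labels, and the helper name `describe` are invented for illustration and are not taken from an actual TEITOK corpus; real PSDX files may carry additional attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical PSDX fragment, invented for illustration only.
SAMPLE = """<forest>
  <eTree Label="IP-MAT">
    <eTree Label="NP-SBJ">
      <eLeaf Notext="*pro*"/>
    </eTree>
    <eTree Label="VB">
      <eLeaf Text="canta"/>
    </eTree>
  </eTree>
</forest>"""


def describe(node, depth=0):
    """Return one indented line per node, in document order."""
    attrs = " ".join(f'{name}="{value}"' for name, value in node.attrib.items())
    lines = ["  " * depth + node.tag + (" " + attrs if attrs else "")]
    for child in node:
        lines.extend(describe(child, depth + 1))
    return lines


forest = ET.fromstring(SAMPLE)
outline = describe(forest)
print("\n".join(outline))
```

The indentation of the printed outline mirrors the syntactic tree: the forest contains eTree nodes, and each terminal eLeaf carries either a Text or a Notext attribute.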

To query PSDX files, TEITOK offers an XPath search function. XPath is the most common way to address nodes in an XML tree. The idea behind it is comparable to that of a file path on your computer: slashes separate the levels of the XML hierarchy just as they separate folders (query language overview). The following table presents the XPath expressions that are most relevant for querying syntactic trees.

Some relevant XPath syntax

| Expression | Meaning |
|---|---|
| . | the current node |
| .. | the parent of the current node |
| [ ] | any predicate of a node |
| @ | any attribute |
| " " | any value |
| // | dominance |
| / | immediate dominance |
| preceding-sibling | precedence |
| and | conjunction of two search conditions |
| or | disjunction of two search conditions |
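
Several of these expressions can be tried outside TEITOK. The sketch below uses Python's standard `xml.etree.ElementTree` module, which supports only a subset of XPath (for instance, it has no `preceding-sibling` axis and no `and`/`or`), so it covers just `/`, `//`, `[ ]`, `@` and `..`. The sample tree is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical PSDX fragment, invented for illustration only.
SAMPLE = """<forest>
  <eTree Label="IP-MAT">
    <eTree Label="NP-SBJ">
      <eLeaf Notext="*pro*"/>
    </eTree>
    <eTree Label="VB">
      <eLeaf Text="canta"/>
    </eTree>
  </eTree>
</forest>"""

forest = ET.fromstring(SAMPLE)


def labels(nodes):
    """Collect the Label attribute of each node in a list."""
    return [node.get("Label") for node in nodes]


# "/"  immediate dominance: only eTree nodes directly under the forest
children = labels(forest.findall("./eTree"))
# "//" dominance: eTree nodes at any depth below the forest
descendants = labels(forest.findall(".//eTree"))
# "[ ]" with "@": restrict the match by an attribute value
verbs = labels(forest.findall(".//eTree[@Label='VB']"))
# "..": step from a matched leaf up to its parent node
parents = labels(forest.findall(".//eLeaf[@Text='canta']/.."))

print(children, descendants, verbs, parents)
```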

An example of a syntactic XPath query is the following:

//eTree/eTree[@Label="NP-SBJ" and ./eLeaf[@Notext="*pro*"]]

This query looks for a node of type NP-SBJ (a subject NP) that is the child of another eTree node and immediately dominates a terminal node whose @Notext attribute has the value "*pro*". Or, to put it differently, it finds the referential null subject of a phrasal element.
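
The logic of this query can be sketched in plain Python. The standard `xml.etree.ElementTree` module does not support `and` or nested predicates, so the two conditions are checked in a loop instead; this is an approximation of what the XPath expression selects, not how TEITOK evaluates it. The sample tree and the function name `null_subjects` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical PSDX fragment: one null subject, one overt subject.
SAMPLE = """<forest>
  <eTree Label="IP-MAT">
    <eTree Label="NP-SBJ"><eLeaf Notext="*pro*"/></eTree>
    <eTree Label="VB"><eLeaf Text="canta"/></eTree>
    <eTree Label="CP-ADV">
      <eTree Label="IP-SUB">
        <eTree Label="NP-SBJ">
          <eTree Label="NPR"><eLeaf Text="Maria"/></eTree>
        </eTree>
        <eTree Label="VB"><eLeaf Text="dorme"/></eTree>
      </eTree>
    </eTree>
  </eTree>
</forest>"""


def null_subjects(root):
    """Return (parent label, node) pairs for NP-SBJ nodes with a *pro* leaf child."""
    hits = []
    for parent in root.iter("eTree"):          # //eTree
        for child in parent.findall("eTree"):  # /eTree
            is_subject = child.get("Label") == "NP-SBJ"
            has_pro = any(leaf.get("Notext") == "*pro*"
                          for leaf in child.findall("eLeaf"))
            if is_subject and has_pro:
                hits.append((parent.get("Label"), child))
    return hits


forest = ET.fromstring(SAMPLE)
hits = null_subjects(forest)
print([parent_label for parent_label, _ in hits])
```

Only the subject of the main clause matches; the overt subject in the subordinate clause has no eLeaf child with a @Notext attribute.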

In the same manner, we can also look for all IP-SUB nodes (subordinate clauses) that have a sister node of type WNP:

//eTree[@Label="IP-SUB" and ../eTree[@Label="WNP"]]
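
The step up to the parent and back down to a sister can be illustrated with a short Python sketch. Elements in the standard `xml.etree.ElementTree` module do not know their parent, so the sketch first builds a child-to-parent map; the sample tree and the names `parent_of` and `has_wnp_sister` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical PSDX fragment: only the first IP-SUB has a WNP sister.
SAMPLE = """<forest>
  <eTree Label="IP-MAT">
    <eTree Label="CP-REL">
      <eTree Label="WNP"><eLeaf Text="que"/></eTree>
      <eTree Label="IP-SUB"><eTree Label="VB"><eLeaf Text="canta"/></eTree></eTree>
    </eTree>
    <eTree Label="CP-THT">
      <eTree Label="C"><eLeaf Text="que"/></eTree>
      <eTree Label="IP-SUB"><eTree Label="VB"><eLeaf Text="dorme"/></eTree></eTree>
    </eTree>
  </eTree>
</forest>"""

forest = ET.fromstring(SAMPLE)

# Map every element to its parent, to emulate the ".." step.
parent_of = {child: parent for parent in forest.iter() for child in parent}


def has_wnp_sister(node):
    """True if the parent of node has an eTree child labelled WNP."""
    parent = parent_of.get(node)
    if parent is None:
        return False
    return any(s.get("Label") == "WNP" for s in parent.findall("eTree"))


hits = [node for node in forest.iter("eTree")
        if node.get("Label") == "IP-SUB" and has_wnp_sister(node)]
hit_parents = [parent_of[node].get("Label") for node in hits]
print(hit_parents)
```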

Apart from moving up or down the tree, XPath can also compare numbers and strings. For instance, we can search for all IP-SUB nodes that have exactly three eTree nodes as children:

//eTree[@Label="IP-SUB" and count(eTree) = 3]
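
As a rough illustration of what `count(eTree) = 3` tests, the following Python sketch counts only the immediate eTree children of each IP-SUB node, not all of its descendants. The sample tree is invented, and the loop stands in for the XPath `count()` function, which the standard `xml.etree.ElementTree` module does not provide.

```python
import xml.etree.ElementTree as ET

# Hypothetical PSDX fragment: the first IP-SUB has three eTree children
# (and four eTree descendants), the second has only two children.
SAMPLE = """<forest>
  <eTree Label="IP-MAT">
    <eTree Label="IP-SUB">
      <eTree Label="NP-SBJ"><eLeaf Notext="*pro*"/></eTree>
      <eTree Label="VB"><eLeaf Text="viu"/></eTree>
      <eTree Label="NP-ACC">
        <eTree Label="NPR"><eLeaf Text="Maria"/></eTree>
      </eTree>
    </eTree>
    <eTree Label="IP-SUB">
      <eTree Label="NP-SBJ"><eLeaf Notext="*pro*"/></eTree>
      <eTree Label="VB"><eLeaf Text="dorme"/></eTree>
    </eTree>
  </eTree>
</forest>"""

forest = ET.fromstring(SAMPLE)

hits = []
for node in forest.iter("eTree"):
    # findall("eTree") returns immediate children only, like count(eTree)
    children = node.findall("eTree")
    if node.get("Label") == "IP-SUB" and len(children) == 3:
        hits.append([child.get("Label") for child in children])

print(hits)
```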

Or we can select nodes with the same @Label as their parent:

//eTree[@Label = ../@Label]
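
The comparison between a node's own @Label and that of its parent can be sketched in Python as follows. The sample tree (a coordination with an NP inside an NP) is invented for illustration, and the loop only approximates the XPath expression, since the standard `xml.etree.ElementTree` module cannot compare two attributes inside a predicate.

```python
import xml.etree.ElementTree as ET

# Hypothetical PSDX fragment: an NP directly inside another NP.
SAMPLE = """<forest>
  <eTree Label="IP-MAT">
    <eTree Label="NP">
      <eTree Label="NP"><eTree Label="N"><eLeaf Text="gatos"/></eTree></eTree>
      <eTree Label="CONJP">
        <eTree Label="CONJ"><eLeaf Text="e"/></eTree>
        <eTree Label="NP"><eTree Label="N"><eLeaf Text="ratos"/></eTree></eTree>
      </eTree>
    </eTree>
  </eTree>
</forest>"""

forest = ET.fromstring(SAMPLE)

hits = []
for parent in forest.iter("eTree"):
    for child in parent.findall("eTree"):
        # @Label = ../@Label: the child carries the same label as its parent
        if child.get("Label") == parent.get("Label"):
            hits.append(child)

first_words = [hit.find(".//eLeaf").get("Text") for hit in hits]
print(first_words)
```

The NP inside CONJP is not selected, because its parent is labelled CONJP rather than NP.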

For those who are used to CorpusSearch, here is a comparison.