Introduction
XARA is a rule-based PropBank labeler for Alpino XML files, written in Java. In addition to automatic role annotation, XARA can extract training instances (sets of features) from an XML-based treebank. Such instances can be used to train machine learning algorithms for automatic semantic role labeling. In our research, we used the TiMBL machine learning tool for this purpose.
XARA was written with the Alpino XML dependency treebank format and the PropBank annotation scheme in mind. However, with some adjustments, XARA can be used with any XML-based treebank.
At the moment, no extensive documentation is available. Moreover, the current code is specialized for our specific needs. But for everyone who might find XARA useful, a downloadable version is provided on this page. Below, some general information about the workings of XARA is provided.
How does the automatic tagging work?
The input for the tagger is a set of directories containing (Alpino) XML files, called a treebank. Each sentence is annotated separately by applying a set of rules. Rules are applied to local dependency domains (subtrees of the complete dependency tree). The local dependency domain to which a rule is applied is called the rule's context. A context is defined by an XPath expression that selects a group of nodes.
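To make the notion of a context concrete, the sketch below evaluates a context expression against a small document using the standard javax.xml.xpath API. It is not XARA's own code: the class name ContextDemo, its helper methods, and the toy sentence are assumptions made for illustration only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of XARA.
class ContextDemo {

    // Toy document loosely modelled on Alpino's nested <node> elements (invented for this sketch).
    static final String SAMPLE =
          "<alpino_ds>"
        + "<node cat='top'>"
        + "<node cat='smain'>"
        + "<node rel='su' word='Jan'/>"
        + "<node rel='hd' root='lees' word='leest'/>"
        + "<node rel='obj1' cat='np'>"
        + "<node rel='det' word='een'/>"
        + "<node rel='hd' word='boek'/>"
        + "</node></node></node></alpino_ds>";

    // Parse an XML string into a DOM document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Return every node selected by the context expression.
    static NodeList selectContexts(Document doc, String contextXPath) throws Exception {
        return (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(contextXPath, doc, XPathConstants.NODESET);
    }

    public static void main(String[] args) throws Exception {
        NodeList contexts = selectContexts(parse(SAMPLE), "//node[@cat='smain']");
        for (int i = 0; i < contexts.getLength(); i++) {
            Element root = (Element) contexts.item(i);
            System.out.println("context root: cat=" + root.getAttribute("cat"));
        }
    }
}
```

Each node returned by selectContexts is the root of one local dependency domain; rules would then be evaluated relative to that root.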
Rules are implemented by the Rule class. A rule consists of an XPath expression, which specifies a relative path from the context's root node to the target node, and an output label. Upon application of the rule, the target node is labeled with the output label. The output label can have three kinds of values:
- A non-negative number n, to label a node with Argn.
- The value -1, to label the node with the first available numbered argument.
- A string value, to label the node with an arbitrary label, for example an ArgM.
Formally, a rule in XARA can be defined as a (path, label) pair. Suppose, for example, that we want to select direct object nodes in the previously defined context and assign them the label Arg1. This can be written as:
(./node[@rel='obj1'], 1)
The first element of this pair is an XPath expression that selects direct object daughters in an Alpino XML file; the second element is a number that specifies which label to assign to these target nodes. In this case the label is the positive integer 1, which means the target node will receive the label Arg1.
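The three kinds of output label can be sketched as follows. RuleSketch is a hypothetical stand-in for a (path, label) pair, not XARA's Rule class, and reading "first available numbered argument" as "the lowest argument number not yet used in the current context" is an assumption.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a (path, label) pair; XARA's real Rule class may differ.
class RuleSketch {
    final String path;   // XPath relative to the context root
    final Object label;  // Integer (n or -1) or String (arbitrary label)

    RuleSketch(String path, int label) {
        this.path = path;
        this.label = label;
    }

    RuleSketch(String path, String label) {
        this.path = path;
        this.label = label;
    }

    // Turn the output label into the string written on the target node.
    // 'used' holds the argument numbers already assigned in the current context.
    String resolve(Set<Integer> used) {
        if (label instanceof String) {
            return (String) label;          // arbitrary label, e.g. an ArgM
        }
        int n = (Integer) label;
        if (n == -1) {
            // Assumption: "first available" means the lowest number not yet used.
            n = 0;
            while (used.contains(n)) {
                n++;
            }
        }
        used.add(n);
        return "Arg" + n;
    }

    public static void main(String[] args) {
        Set<Integer> used = new HashSet<>();
        System.out.println(new RuleSketch("./node[@rel='su']", 0).resolve(used));             // Arg0
        System.out.println(new RuleSketch("./node[@rel='obj1']", 1).resolve(used));           // Arg1
        System.out.println(new RuleSketch("./node[@rel='vc']", -1).resolve(used));            // Arg2
        System.out.println(new RuleSketch("./node[@rel='predm']", "ArgM-PRD").resolve(used)); // ArgM-PRD
    }
}
```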
The combination of a context and a set of rules is called a tagger in XARA and is implemented by the Tagger class. Each Tagger instance tags different parts of the XML tree. As an example, the listing below shows the definition of a tagger for passive participles in Dutch Alpino files:
    t = new Tagger();
    t.setContext(doc, "//node[@cat='ppart']"
        + "[preceding-sibling::node[@rel='hd'][@root='ben' or @root='heb']]");
    // rules for numbered arguments
    t.addRule(new Rule("./node[@rel='hd']", "PRED"));
    t.addRule(new Rule("./node[@rel='su']", 0));
    t.addRule(new Rule("./node[@rel='obj1']", 1));
    t.addRule(new Rule("./node[@rel='obj2']", 2));
    // rules for complements
    t.addRule(new Rule("./node[@rel='vc' and (@cat='cp' or @cat='ti' or @cat='oti' or @cat='ah1')]", -1));
    t.addRule(new Rule("./node[@rel='vc']", -1));
    t.addRule(new Rule("./node[@rel='pc' and @cat='pp']", -1));
    // rules for modifiers/adjectives
    t.addRule(new Rule("./node[(@word='niet' or @word='nooit' or @word='geen' or @word='nergens')"
        + " and (@pos='adv')]", "ArgM-NEG"));
    t.addRule(new Rule("./node[@rel='mod' and @cat='oti']", "ArgM-PNC"));
    t.addRule(new Rule("./node[@rel='predm']", "ArgM-PRD"));
    t.addRule(new Rule("./node[@rel='ld' ]", "ArgM-LOC"));
    // apply these rules
    t.tag();
After a new tagger instance (t) has been created, the context for the rules is defined first. The setContext method takes two parameters: an XML document and an XPath expression. The context expression in the example above selects ppart nodes that have a preceding sibling head node (rel='hd') whose root is ben (zijn) or heb (hebben). The context definition is followed by one or more rules, each added with the addRule method. addRule takes a single parameter, a Rule instance, which is created by providing an XPath expression and an output label. The third rule, for example, selects direct objects and labels them as Arg1. The last line of the listing calls tag(), the method that starts the actual tagging on the XML document specified in the context definition. That is, PropBank labels are added to the original XML document.
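The sketch below illustrates what applying a rule inside a context might look like with the standard DOM and XPath APIs: the context expression is evaluated on the document, and the rule path is evaluated relative to each context node. This is an illustration under stated assumptions, not XARA's implementation. The class TagSketch, its methods, the sample sentence, and the attribute name pb used to store the label are all hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; XARA's Tagger may work differently.
class TagSketch {
    static final XPath XPATH = XPathFactory.newInstance().newXPath();

    // Invented sample: an auxiliary head followed by a participle phrase.
    static final String SAMPLE =
          "<alpino_ds><node cat='smain'>"
        + "<node rel='hd' root='heb' word='heeft'/>"
        + "<node rel='vc' cat='ppart'>"
        + "<node rel='obj1' word='boek'/>"
        + "<node rel='hd' root='lees' word='gelezen'/>"
        + "</node></node></alpino_ds>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // For every context node, label each node reached by the relative rule path.
    // Returns the number of nodes that were labeled.
    static int applyRule(Document doc, String contextPath, String rulePath, String label)
            throws Exception {
        int labeled = 0;
        NodeList contexts = (NodeList) XPATH.evaluate(contextPath, doc, XPathConstants.NODESET);
        for (int i = 0; i < contexts.getLength(); i++) {
            NodeList targets = (NodeList) XPATH.evaluate(
                    rulePath, contexts.item(i), XPathConstants.NODESET);
            for (int j = 0; j < targets.getLength(); j++) {
                // "pb" is an assumed attribute name for the PropBank label.
                ((Element) targets.item(j)).setAttribute("pb", label);
                labeled++;
            }
        }
        return labeled;
    }

    // Read back the label of the first node matching nodePath ("" if unlabeled).
    static String labelOf(Document doc, String nodePath) throws Exception {
        return XPATH.evaluate(nodePath + "/@pb", doc);
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        String context = "//node[@cat='ppart']";
        applyRule(doc, context, "./node[@rel='hd']", "PRED");
        applyRule(doc, context, "./node[@rel='obj1']", "Arg1");
        System.out.println("gelezen -> " + labelOf(doc, "//node[@word='gelezen']"));
        System.out.println("boek    -> " + labelOf(doc, "//node[@word='boek']"));
    }
}
```

Because the rule path is relative to the context root, the auxiliary head outside the ppart node stays unlabeled even though it also has rel='hd'.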
How does the feature extraction work?
The learning tool I used in my project was TiMBL (Tilburg Memory Based Learner). To train a TiMBL classifier, a file with training data is needed. Training data is represented as a text file containing instances, with each line representing a single instance. An instance consists of a set of features separated by commas, followed by a target class. XARA is able to create such an instance base from a set of XML files. At the moment, the details of the features to be extracted are hard-coded in the FeatureExtractor class. This means that modifying the current set of features requires some programming skills.
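The instance format described above (comma-separated feature values followed by the target class, one instance per line) can be sketched as follows. The class InstanceDemo and the three feature values are hypothetical; the features that XARA actually extracts are defined in its FeatureExtractor class and are not reproduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; not XARA's FeatureExtractor.
class InstanceDemo {

    // Build one instance line: feature values first, target class last, comma-separated.
    static String toInstanceLine(List<String> features, String targetClass) {
        List<String> fields = new ArrayList<>(features);
        fields.add(targetClass);
        return String.join(",", fields);
    }

    public static void main(String[] args) {
        // Invented feature values: dependency relation, category, predicate root.
        List<String> features = Arrays.asList("obj1", "np", "lees");
        System.out.println(toInstanceLine(features, "Arg1")); // obj1,np,lees,Arg1
    }
}
```

A real extractor would also need to handle feature values that themselves contain commas, since those would break this line format.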
How to use XARA
XARA is a command-line utility (it does not have a graphical user interface). In addition to feature extraction and semantic role tagging, it can be used for various corpus-related tasks.
The general usage of XARA can be summarized as follows:
java Xara [Options]
For the above command to work, you must run Xara from the directory that contains Xara's .class files (Xara/classes). Alternatively, you can use the -cp option to tell the Java interpreter where the class files can be found, for example:
java -cp <path_to_class_files> Xara [Options]
The following options are currently available:
| Command | Description |
|---|---|
| java Xara -f -i input_dir -o output_file | Extract features from the XML files in <input_dir>. <input_dir> can also be a file name, in which case only features from that file are extracted. |
| java Xara -t -i input_dir | Use internally defined rules to tag all XML files in the input directory with semantic roles. |
| java Xara -q xpath_query -i input_dir | List all XML files in <input_dir> that satisfy xpath_query. |
| java Xara -h | Show options. |
| java Xara -p -i input_dir | Print the sentences in the input directory on screen (works only with Alpino XML files). |
| java Xara -f ... -x | With the -x option, two extra features are extracted that are not used for training. |
| java Xara -f ... -d | By default, duplicate files are automatically detected and ignored. With this option set, duplicates are not ignored. |
| java Xara -f0 ... | Only one feature is extracted: the target class (PropBank label). This speeds up the extraction process and can be useful when you are only interested in the target class. |
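As an illustration of what a query such as the -q option involves, the sketch below lists the XML files under a directory that contain at least one node matching an XPath expression. It is a guess at the general technique, not XARA's code: the class QuerySketch, its methods, and the two generated demo files are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;

// Hypothetical demo class; not part of XARA.
class QuerySketch {

    // Return all .xml files under inputDir in which the query selects at least one node.
    static List<Path> matchingFiles(Path inputDir, String query) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        XPathExpression expr = XPathFactory.newInstance().newXPath().compile(query);

        List<Path> xmlFiles;
        try (Stream<Path> paths = Files.walk(inputDir)) {
            xmlFiles = paths.filter(Files::isRegularFile)
                    .filter(p -> p.toString().endsWith(".xml"))
                    .sorted()
                    .collect(Collectors.toList());
        }

        List<Path> hits = new ArrayList<>();
        for (Path file : xmlFiles) {
            NodeList nodes = (NodeList) expr.evaluate(
                    builder.parse(file.toFile()), XPathConstants.NODESET);
            if (nodes.getLength() > 0) {
                hits.add(file);
            }
        }
        return hits;
    }

    // Create a throwaway directory with two tiny invented files for the demo.
    static Path writeDemoTreebank() throws IOException {
        Path dir = Files.createTempDirectory("xara-demo");
        Files.write(dir.resolve("a.xml"),
                "<alpino_ds><node cat='pp'/></alpino_ds>".getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("b.xml"),
                "<alpino_ds><node cat='np'/></alpino_ds>".getBytes(StandardCharsets.UTF_8));
        return dir;
    }

    public static void main(String[] args) throws Exception {
        for (Path hit : matchingFiles(writeDemoTreebank(), "//node[@cat='pp']")) {
            System.out.println(hit.getFileName());
        }
    }
}
```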
Beware of shell quoting! Make sure you keep the shell (your command interpreter) from interpreting any special characters in XPath queries. Use any of the following schemes:
    Xara -q '//node[@cat="pp"]' ...
    Xara -q "//node[@cat='pp']" ...
    Xara -q "//node[@cat=\"pp\"]" ...
XARA is written in Java, and requires Java 2 to run. Xara will run on any platform that supports Java.
Current version: 0.40.6 download (zip)
Note: This tool is distributed 'as-is', no support, no warranty. Xara is still in the development phase and should be considered experimental software.