ConllRegExp¶
Inherits from ConllRelator and ConllSelector. It provides a very expressive RegExp module that lets you query complex patterns in the Conll Tree/Graph.
Use this module when you want to match patterns in a sentence. Examples:
- Get all the tokens that begin with a consonant
- Get all the Parts-of-Speech of tokens that begin with a consonant
- Get all the tokens with a prepositional child and a parent that is a verb
- Get all the tokens with a child that is an adverb and begins with ‘fa’
RegExp can be as complex as you wish…
- Get all the tokens that do not contain numbers and have a child that is an adjective and a parent that is a verb beginning with ‘f’
- Get all the tokens that do not contain numbers and have a child that is an adjective, where this child itself has another child that is a conjunction starting with ‘a’ and having 3 letters
To learn all about RegExp in Conll trees read ConllRegExp
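The first two examples in the list above come down to an ordinary regular expression applied to one column. The sketch below uses only Python's standard `re` module on a hand-written list of (FORM, UPOS) pairs; it does not call ConllRegExp, and the names `tokens` and `CONSONANT_INITIAL` are illustrative only.

```python
import re

# Illustrative data: (FORM, UPOS) pairs copied from the example sentence.
tokens = [
    ("While", "SCONJ"),
    ("at", "ADP"),
    ("a", "DET"),
    ("supermarket", "NOUN"),
    ("independent", "ADJ"),
]

# First character must be a letter, and the negative lookahead rejects vowels.
CONSONANT_INITIAL = re.compile(r"^(?![aeiou])[a-z]", re.IGNORECASE)

# Tokens that begin with a consonant.
consonant_forms = [form for form, _ in tokens if CONSONANT_INITIAL.match(form)]

# Parts-of-Speech of the tokens that begin with a consonant.
consonant_upos = [upos for form, upos in tokens if CONSONANT_INITIAL.match(form)]

print(consonant_forms)
print(consonant_upos)
```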
Parameters¶
# the column indexes can be overridden if the input format is not Conll
ConllRegExp( IDX_ID=0, IDX_FORM=1, IDX_LEMMA=2,
IDX_UPOS=3, IDX_XPOS=4, IDX_MORPH=5,
IDX_HEAD=6, IDX_DEPREL=7, IDX_GRAPH=8 )
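To show why the column indexes are configurable, here is a minimal sketch in plain Python of index-based column lookup. The helper `read_columns` is hypothetical and is not part of ConllQuery; it only mirrors the idea behind the `IDX_*` parameters for a file whose LEMMA column comes before FORM.

```python
def read_columns(row, idx_id=0, idx_form=1, idx_lemma=2, idx_upos=3):
    """Pick named fields out of a row by column index (illustrative helper)."""
    return {
        "ID": row[idx_id],
        "FORM": row[idx_form],
        "LEMMA": row[idx_lemma],
        "UPOS": row[idx_upos],
    }

# Assumed non-standard layout: ID, LEMMA, FORM, UPOS.
custom_row = ["1", "while", "While", "SCONJ"]

# Swap the FORM and LEMMA indexes so the fields are read correctly.
fields = read_columns(custom_row, idx_form=2, idx_lemma=1)
print(fields)
```

Going by the signature above, the corresponding call for such a file would presumably be `ConllRegExp( IDX_FORM=2, IDX_LEMMA=1 )`.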
Methods¶
ConllRegExp.set_sentence( sentence )
> sentence (object or list(list(str))): the Conll sentence to process
ConllRegExp.set_target( target_index )
> target_index (int): zero-based index of the target token
resp = ConllRegExp.regexp( dict_pattern )
> dict_pattern (dict): nested dictionary encoding the regexp to match
> keys must be either relations or features
> values are standard regular expressions
> Example:
> # find all the prepositions starting with a
> # that have a parent that is a verb and is the root
> dict_pattern = { R.TOKEN: { F.UPOS: 'ADP', F.LEMMA: '^a.*' },
> R.PARENT: { F.XPOS: '^V.*', F.DEPREL: 'root' } }
>
> resp (list(int)): list of the indexes of tokens matching the regexp
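As a rough illustration of how a nested pattern of this shape can be evaluated, the sketch below re-implements the idea in plain Python. It is not the library's code: the string keys `"TOKEN"` and `"PARENT"` stand in for `R.TOKEN` and `R.PARENT`, the token dictionaries are a simplified stand-in for a Conll sentence, and the use of `re.search` is an assumption about the matching semantics.

```python
import re

# Simplified tokens; HEAD is the zero-based index of the parent (None for root).
sentence = [
    {"FORM": "he", "UPOS": "PRON", "DEPREL": "nsubj", "HEAD": 1},
    {"FORM": "released", "UPOS": "VERB", "DEPREL": "root", "HEAD": None},
    {"FORM": "music", "UPOS": "NOUN", "DEPREL": "obj", "HEAD": 1},
    {"FORM": "as", "UPOS": "ADP", "DEPREL": "case", "HEAD": 4},
    {"FORM": "artist", "UPOS": "NOUN", "DEPREL": "obl", "HEAD": 1},
]

def features_match(token, constraints):
    """True if every feature regex finds a match in the token's value."""
    return all(re.search(rx, token[feat]) for feat, rx in constraints.items())

def match_pattern(sentence, pattern):
    """Return indexes of tokens satisfying the TOKEN and PARENT constraints."""
    hits = []
    for i, token in enumerate(sentence):
        if not features_match(token, pattern.get("TOKEN", {})):
            continue
        if "PARENT" in pattern:
            head = token["HEAD"]
            # The root has no parent, so it cannot satisfy a PARENT constraint.
            if head is None or not features_match(sentence[head], pattern["PARENT"]):
                continue
        hits.append(i)
    return hits

# Nouns whose parent is a verb and the root.
pattern = {"TOKEN": {"UPOS": "NOUN"},
           "PARENT": {"UPOS": "VERB", "DEPREL": "root"}}
noun_hits = match_pattern(sentence, pattern)
print(noun_hits)
```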
resp = ConllRegExp.find( id_ , relation )
> id_ (int): index of the token we want to select
> relation (RELATION): relation or tuple of relations to find
> resp (list(int)): list of the indexes of tokens found
resp = ConllRegExp.select( id_ , feature )
> id_ (int): index of the token we want to select
> feature (FEATURE): name of the feature or tuple of features to get
> resp (str): string containing the values of the selected features
path = ConllRegExp.get_path_to_target( id_ )
> id_ (int): index of the token where the path to the target starts
> path (list(int)): list of ids of tokens in the path to target
path = ConllRegExp.get_path_to_root( id_ )
> id_ (int): index of the token where the path to the root starts
> path (list(int)): list of ids of tokens in the path to root
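The two path methods can be understood as walks along the HEAD column. The following sketch is a plain-Python approximation, not the library's implementation: it assumes zero-based token indexes, 1-based HEAD values with 0 marking the root, and paths that include both end points. The function names and the `HEADS` list (copied from the example sentence on this page) are illustrative.

```python
# HEAD column of the example sentence (1-based ids, 0 marks the root).
HEADS = [2, 11, 5, 5, 2, 8, 8, 2, 11, 11, 0, 11, 16, 16, 16, 11]

def path_to_root(heads, start):
    """Follow HEAD links from `start` (zero-based) up to the root token."""
    path = [start]
    while heads[path[-1]] != 0:
        parent = heads[path[-1]] - 1  # convert 1-based id to zero-based index
        if parent in path:
            raise ValueError("cycle in HEAD column")
        path.append(parent)
    return path

def path_to_target(heads, start, target):
    """Climb from `start` to the lowest common ancestor, then descend to `target`."""
    up = path_to_root(heads, start)
    down = path_to_root(heads, target)
    for i, node in enumerate(up):
        if node in down:
            j = down.index(node)
            return up[: i + 1] + down[:j][::-1]
    raise ValueError("tokens are not in the same tree")

# 'at' -> 'supermarket' -> 'working' -> 'released'
root_path = path_to_root(HEADS, 2)
# 'music' reaches 'While' through the root 'released' (index 10).
via_root = path_to_target(HEADS, 11, 0)
# 'supermarket' reaches 'While' without touching the root.
direct = path_to_target(HEADS, 4, 0)
print(root_path, via_root, direct)
```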
A ConllSentence¶
ID | FORM | LEMMA | UPOS | XPOS | MORPH | HEAD | DEPREL | GRAPH |
---|---|---|---|---|---|---|---|---|
1 | While | while | SCONJ | IN | _ | 2 | mark | _ |
2 | working | work | VERB | VBG | VerbForm=Ger | 11 | advcl | _ |
3 | at | at | ADP | IN | _ | 5 | case | _ |
4 | a | a | DET | DT | Definite=Ind\|PronType=Art | 5 | det | _ |
5 | supermarket | supermarket | NOUN | NN | Number=Sing | 2 | obl | _ |
6 | as | as | ADP | IN | _ | 8 | case | _ |
7 | a | a | DET | DT | Definite=Ind\|PronType=Art | 8 | det | _ |
8 | bagger | bagger | NOUN | NN | Number=Sing | 2 | obl | _ |
9 | , | , | PUNCT | , | _ | 11 | punct | _ |
10 | he | he | PRON | PRP | Case=Nom\|Gender=Masc\|Number=Sing\|Person=3\|PronType=Prs | 11 | nsubj | 2:nsubj |
11 | released | release | VERB | VBD | Mood=Ind\|Tense=Past\|VerbForm=Fin | 0 | root | _ |
12 | music | music | NOUN | NN | Number=Sing | 11 | obj | _ |
13 | as | as | ADP | IN | _ | 16 | case | _ |
14 | an | a | DET | DT | Definite=Ind\|PronType=Art | 16 | det | _ |
15 | independent | independent | ADJ | JJ | Degree=Pos | 16 | amod | _ |
16 | artist | artist | NOUN | NN | Number=Sing | 11 | obl | _ |
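Since `set_sentence` accepts a list(list(str)), a sentence like the one above can be prepared from tab-separated Conll text with a few lines of plain Python. This is a minimal sketch under the assumption that the input is tab-separated with `#` comment lines; `parse_conll` and `RAW` are illustrative names, not part of ConllQuery, and only the first three rows of the table are included.

```python
# First three rows of the example sentence as tab-separated Conll text.
RAW = "\n".join([
    "# sent_id = example",
    "1\tWhile\twhile\tSCONJ\tIN\t_\t2\tmark\t_",
    "2\tworking\twork\tVERB\tVBG\tVerbForm=Ger\t11\tadvcl\t_",
    "3\tat\tat\tADP\tIN\t_\t5\tcase\t_",
    "",
])

def parse_conll(text):
    """Split Conll text into one list of column strings per token."""
    rows = []
    for line in text.splitlines():
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        rows.append(line.split("\t"))
    return rows

sentence = parse_conll(RAW)
print(sentence[0])
```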
Usage¶
from ConllQuery import ConllRegExp
from ConllQuery.enumerations import FEATURES as F
from ConllQuery.enumerations import RELATION as R
# instantiation
parser = ConllRegExp()
# set the sentence object that will be processed
parser.set_sentence( sentence )
# find all the tokens that are prepositions or nouns
parser.regexp( { R.TOKEN: { F.UPOS: 'ADP|NOUN' }} )
> [2,4,5,7,11,12,15] # indexes start at zero
# find all the tokens that are verbs and roots
parser.regexp( { R.TOKEN: { F.UPOS: 'VERB',
F.DEPREL: 'root' }} )
> [10] # there is only one (released)
# set the target to the first token, 'While' (index 0)
parser.set_target( 0 )
# the tokens that pass through the root while going to the target
parser.regexp( { R.TO_TARGET: { F.DEPREL: 'root' } } )
> [9,10,11,12,13,14,15] # ids of tokens that pass through root
# you can put constraints on several relations
# example: find the nouns that have a parent root that is a verb
parser.regexp( {R.TOKEN: { F.UPOS: 'NOUN'},
R.PARENT: { F.UPOS: 'VERB',
F.DEPREL: 'root'}} )
> [11, 15] # retrieves both 'music' and 'artist'
# you can also negate some features
# example: find the nouns that do not have an adjective child
parser.regexp( {R.TOKEN: { F.UPOS: 'NOUN'},
R.CHILD: { F.UPOS: '^(?!.*ADJ).*$'}} )
> [4,7,11] # doesn't find 'artist' (independent artist)
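The negation in the last example relies on a standard negative lookahead rather than on any special ConllRegExp syntax. The sketch below checks the same expression against a few UPOS tags using only Python's `re` module; the tag list is illustrative, and how the library combines the results over several children is not shown here.

```python
import re

# Matches any string that does not contain 'ADJ' anywhere.
NOT_ADJ = re.compile(r"^(?!.*ADJ).*$")

tags = ["ADJ", "DET", "ADP", "NOUN"]

# Keep only the tags accepted by the negated pattern.
accepted = [tag for tag in tags if NOT_ADJ.search(tag)]
print(accepted)
```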