ConllQuery¶
Inherits from ConllRegExp, ConllSelector and ResponsePostProcessor. It provides a more complex but complete SQL-like language for querying a CoNLL tree or graph.
Use this module when you want to run SQL-like queries on your sentences. For example:
- Select the Lemmas of the Children of the ith word that are Prepositions
- Select the first 3 DependencyRelations of the Brothers of the ith word sorted alphabetically
- Select all the words that have a verb parent with a subject attachment and a prepositional child with a given lemma
- Build the path of Lemmas and POS from the ith token to the jth token
- Count how many Adjective children the kth word has
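To make the intent of these queries concrete, here is a minimal plain-Python sketch of the first and last examples on a sample sentence ("While working at a supermarket as a bagger, he released music as an independent artist"). It does not use ConllQuery: the tuple layout and the helpers `children`, `child_lemmas` and `count_children` are assumptions made for illustration, with zero-based token indexes and 1-based HEAD values.

```python
# Sketch only, not the ConllQuery API.
# Each token is (LEMMA, UPOS, HEAD); HEAD is the 1-based ID of the parent, 0 for the root.
LEMMA, UPOS, HEAD = 0, 1, 2
TOKENS = [
    ("while", "SCONJ", 2), ("work", "VERB", 11), ("at", "ADP", 5),
    ("a", "DET", 5), ("supermarket", "NOUN", 2), ("as", "ADP", 8),
    ("a", "DET", 8), ("bagger", "NOUN", 2), (",", "PUNCT", 11),
    ("he", "PRON", 11), ("release", "VERB", 0), ("music", "NOUN", 11),
    ("as", "ADP", 16), ("a", "DET", 16), ("independent", "ADJ", 16),
    ("artist", "NOUN", 11),
]


def children(tokens, i):
    """Zero-based indexes of the tokens whose HEAD points at token i."""
    return [j for j, tok in enumerate(tokens) if tok[HEAD] == i + 1]


def child_lemmas(tokens, i, upos):
    """Lemmas of the children of token i that carry the given UPOS tag."""
    return [tokens[j][LEMMA] for j in children(tokens, i) if tokens[j][UPOS] == upos]


def count_children(tokens, i, upos):
    """Number of children of token i that carry the given UPOS tag."""
    return sum(1 for j in children(tokens, i) if tokens[j][UPOS] == upos)


print(child_lemmas(TOKENS, 15, "ADP"))    # prepositional children of "artist"
print(count_children(TOKENS, 15, "ADJ"))  # adjective children of "artist"
```

ConllQuery expresses the same kind of filter declaratively, as a nested dictionary of relations and features, instead of hand-written loops.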
Parameters¶
# set the column indexes if the input does not use the standard CoNLL column order
ConllQuery( IDX_ID=0, IDX_FORM=1, IDX_LEMMA=2,
IDX_UPOS=3, IDX_XPOS=4, IDX_MORPH=5,
IDX_HEAD=6, IDX_DEPREL=7, IDX_GRAPH=8 )
Methods¶
ConllQuery.set_sentence( sentence )
> sentence (object or list(list(str))): the sentence in CoNLL format
ConllQuery.set_target( target_index )
> target_index (int): index of the target starting from zero
resp = ConllQuery.query( id_, dict_pattern, select,
sort=S.NONE, limit=L.NONE, encode=E.RAW )
> id_ (int): index of the token the query refers to,
> used only if 'id_' appears in dict_pattern, otherwise ignored
>
> dict_pattern (dict): nested dictionary encoding the regexp to match
> keys must be either relations or features
> values are standard regular expressions
> Example:
> # find all the prepositions starting with a
> # whose parent is the token at index id_ and is the root
> dict_pattern = { R.TOKEN: { F.UPOS: 'ADP', F.LEMMA: '^a.*' },
> R.PARENT: { F.ID: 'id_', F.DEPREL: 'root' } }
>
> select (FEATURE): name of the feature or tuple of features to get
>
> sort (SORT): name of the sort algorithm to use
>
> limit (LIMIT or int): name of the limit algorithm to use
> if it is an int, the number of elements to keep
>
> encode (ENCODER): name of the encoder or tuple of encoders to apply
>
> resp (any): encoded features of the elements matching the query
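The sort, limit and encode arguments post-process the selected values. The following plain-Python sketch shows one plausible reading of that pipeline; it is not the library's implementation. The function `post_process`, its keyword names and the order of the steps (sort, then limit, then encode) are assumptions; only the ':>:' separator is taken from the usage examples.

```python
# Sketch only: a hypothetical stand-in for the sort / limit / encode steps.
def post_process(values, sort_alphabetically=False, limit=None, concatenate=False):
    """Sort, truncate and encode a list of selected feature values."""
    out = sorted(values) if sort_alphabetically else list(values)
    if limit is not None:
        out = out[:limit]  # keep only the first `limit` elements
    # Concatenation joins the values with ':>:'; otherwise the list is returned raw.
    return ":>:".join(out) if concatenate else out


print(post_process(["as", "an"]))
print(post_process(["as", "an"], sort_alphabetically=True))
print(post_process(["as", "an"], concatenate=True))
```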
resp = ConllQuery.regexp( dict_pattern )
> dict_pattern (dict): nested dictionary encoding the regexp to match
> keys must be either relations or features
> values are standard regular expressions
> Example:
> # find all the prepositions starting with a
> # that have a parent that is a verb and is the root
> dict_pattern = { R.TOKEN: { F.UPOS: 'ADP', F.LEMMA: '^a.*' },
> R.PARENT: { F.XPOS: '^V.*', F.DEPREL: 'root' } }
>
> resp (list(int)): list of the indexes of tokens matching the regexp
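As an illustration of how such a nested pattern can be evaluated, here is a minimal plain-Python sketch. It is not the library's code: the string keys "TOKEN" and "PARENT" stand in for the R.* enumerations, the feature names stand in for F.*, and the use of `re.search` (rather than a full match) is an assumption.

```python
import re

# Sketch only: LEMMA UPOS XPOS HEAD DEPREL for each token of a sample sentence.
ROWS = """
while SCONJ IN 2 mark
work VERB VBG 11 advcl
at ADP IN 5 case
a DET DT 5 det
supermarket NOUN NN 2 obl
as ADP IN 8 case
a DET DT 8 det
bagger NOUN NN 2 obl
, PUNCT , 11 punct
he PRON PRP 11 nsubj
release VERB VBD 0 root
music NOUN NN 11 obj
as ADP IN 16 case
a DET DT 16 det
independent ADJ JJ 16 amod
artist NOUN NN 11 obl
"""
FIELDS = ("LEMMA", "UPOS", "XPOS", "HEAD", "DEPREL")
TOKENS = [dict(zip(FIELDS, line.split())) for line in ROWS.strip().splitlines()]


def features_match(token, feature_patterns):
    """True if every feature regex is found in the token's value."""
    return all(re.search(rx, token[feat]) for feat, rx in feature_patterns.items())


def match_pattern(tokens, pattern):
    """Zero-based indexes of tokens satisfying the TOKEN and PARENT constraints."""
    hits = []
    for i, tok in enumerate(tokens):
        if not features_match(tok, pattern.get("TOKEN", {})):
            continue
        if "PARENT" in pattern:
            head = int(tok["HEAD"])  # 1-based ID; 0 means the token has no parent
            if head == 0 or not features_match(tokens[head - 1], pattern["PARENT"]):
                continue
        hits.append(i)
    return hits


# Nouns whose parent has an XPOS tag starting with V.
nouns_under_verbs = match_pattern(
    TOKENS, {"TOKEN": {"UPOS": "NOUN"}, "PARENT": {"XPOS": "^V"}})
print(nouns_under_verbs)
```

On this sample sentence the documented example pattern (prepositions starting with "a" whose parent is a verbal root) matches nothing in this sketch, because every preposition attaches to a noun.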
resp = ConllQuery.find( id_ , relation )
> id_ (int): index of the token we want to select
> relation (RELATION): relation or tuple of relations to find
> resp (list(int)): list of the indexes of tokens found
resp = ConllQuery.select( id_ , feature )
> id_ (int): index of the token we want to select
> feature (FEATURE): name of the feature or tuple of features to get
> resp (str): string containing the values of the selected features
path = ConllQuery.get_path_to_target( id_ )
> id_ (int): index of the token where we start the path to target
> path (list(int)): list of ids of tokens in the path to target
path = ConllQuery.get_path_to_root( id_ )
> id_ (int): index of the token where we start the path to root
> path (list(int)): list of ids of tokens in the path to root
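Both path methods amount to walking HEAD links in the dependency tree. The sketch below shows the idea in plain Python under stated assumptions: `path_to_root` and `path_between` are hypothetical helpers, indexes are zero-based, HEAD values are 1-based with 0 marking the root, and the start token is included in the path. Whether ConllQuery returns paths in exactly this shape is not confirmed by the documentation.

```python
# Sketch only: 1-based HEAD of each token in a 16-token sample sentence (0 marks the root).
HEADS = [2, 11, 5, 5, 2, 8, 8, 2, 11, 11, 0, 11, 16, 16, 16, 11]


def path_to_root(heads, i):
    """Zero-based indexes from token i up to the root, following HEAD links."""
    path = [i]
    while heads[path[-1]] != 0:
        path.append(heads[path[-1]] - 1)
    return path


def path_between(heads, i, j):
    """Indexes from i up to the lowest common ancestor, then down to j."""
    up, down = path_to_root(heads, i), path_to_root(heads, j)
    ancestors = set(down)
    # First token on the way up from i that is also an ancestor of (or equal to) j.
    k = next(n for n, idx in enumerate(up) if idx in ancestors)
    return up[:k + 1] + down[:down.index(up[k])][::-1]


print(path_to_root(HEADS, 2))     # from "at" up to the root "released"
print(path_between(HEADS, 2, 0))  # from "at" to the target "While"
```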
A ConllSentence¶
ID | FORM | LEMMA | UPOS | XPOS | MORPH | HEAD | DEPREL | GRAPH |
---|---|---|---|---|---|---|---|---|
1 | While | while | SCONJ | IN | _ | 2 | mark | _ |
2 | working | work | VERB | VBG | VerbForm=Ger | 11 | advcl | _ |
3 | at | at | ADP | IN | _ | 5 | case | _ |
4 | a | a | DET | DT | Definite=Ind\|PronType=Art | 5 | det | _ |
5 | supermarket | supermarket | NOUN | NN | Number=Sing | 2 | obl | _ |
6 | as | as | ADP | IN | _ | 8 | case | _ |
7 | a | a | DET | DT | Definite=Ind\|PronType=Art | 8 | det | _ |
8 | bagger | bagger | NOUN | NN | Number=Sing | 2 | obl | _ |
9 | , | , | PUNCT | , | _ | 11 | punct | _ |
10 | he | he | PRON | PRP | Case=Nom\|Gender=Masc\|Number=Sing\|Person=3\|PronType=Prs | 11 | nsubj | 2:nsubj |
11 | released | release | VERB | VBD | Mood=Ind\|Tense=Past\|VerbForm=Fin | 0 | root | _ |
12 | music | music | NOUN | NN | Number=Sing | 11 | obj | _ |
13 | as | as | ADP | IN | _ | 16 | case | _ |
14 | an | a | DET | DT | Definite=Ind\|PronType=Art | 16 | det | _ |
15 | independent | independent | ADJ | JJ | Degree=Pos | 16 | amod | _ |
16 | artist | artist | NOUN | NN | Number=Sing | 11 | obl | _ |
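For experimenting with the examples, the table above can be written as the list(list(str)) that set_sentence is documented to accept. This is a minimal sketch in plain Python, assuming the default column order and whitespace-separated rows for readability; the names `ROWS` and `parse_conll` are illustrative and not part of ConllQuery.

```python
# Sketch only: the sample sentence, one token per line, nine columns each.
ROWS = """
1 While while SCONJ IN _ 2 mark _
2 working work VERB VBG VerbForm=Ger 11 advcl _
3 at at ADP IN _ 5 case _
4 a a DET DT Definite=Ind|PronType=Art 5 det _
5 supermarket supermarket NOUN NN Number=Sing 2 obl _
6 as as ADP IN _ 8 case _
7 a a DET DT Definite=Ind|PronType=Art 8 det _
8 bagger bagger NOUN NN Number=Sing 2 obl _
9 , , PUNCT , _ 11 punct _
10 he he PRON PRP Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs 11 nsubj 2:nsubj
11 released release VERB VBD Mood=Ind|Tense=Past|VerbForm=Fin 0 root _
12 music music NOUN NN Number=Sing 11 obj _
13 as as ADP IN _ 16 case _
14 an a DET DT Definite=Ind|PronType=Art 16 det _
15 independent independent ADJ JJ Degree=Pos 16 amod _
16 artist artist NOUN NN Number=Sing 11 obl _
"""


def parse_conll(text):
    """Split each non-empty line into its columns (no field here contains a space)."""
    return [line.split() for line in text.strip().splitlines()]


sentence = parse_conll(ROWS)
print(len(sentence), sentence[15][1])
```

Note that the ID column is 1-based while the token indexes used in the queries are zero-based, so `sentence[15]` is the token with ID 16.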
Usage¶
from ConllQuery import ConllQuery
from ConllQuery.enumerations import FEATURES as F
from ConllQuery.enumerations import RELATION as R
from ConllQuery.enumerations import SORT as S
from ConllQuery.enumerations import LIMITER as L
from ConllQuery.enumerations import ENCODER as E
# instantiation
parser = ConllQuery()
# set the sentence object that will be processed
parser.set_sentence( sentence )
# find the UPOS of the word at index 3
where = {R.TOKEN: {F.ID:"id_"}}
select = F.UPOS
parser.query( 3 , where, select )
> ['DET'] # indexes start at zero, so index 3 is the 4th word, 'a'
# find the LEMMA of the children of the word at index 15 that are ADP with DEPREL case
where = {R.PARENT: {F.ID:"id_"},
R.TOKEN:{F.UPOS:"=ADP", F.DEPREL:"=case"}}
select = F.LEMMA
parser.query( 15 , where, select )
> ['as'] # it works!
# find the FORM of the children of the word at index 15 whose LEMMA begins with 'a'
where = {R.PARENT: {F.ID:"id_"},
R.TOKEN:{F.LEMMA:"^a"}}
select = F.FORM
encode = E.RAW
parser.query( 15 , where, select , encode=encode )
> ['as','an'] # raw returns a list (default)
sort = S.ALPHABETICALLY
parser.query( 15 , where, select , sort=sort, encode=encode )
> ['an','as'] # this time the list is alphabetically sorted
encode = E.CONCATENATE
parser.query( 15 , where, select , encode=encode )
> 'as:>:an' # CONCATENATE joins the values with ':>:'
# set the target to 'While' (index 0)
parser.set_target( 0 )
# find the UPOS of the target word
where = {R.TOKEN: {F.ID:"target"}}
select = F.UPOS
encode = E.RAW
parser.query( 123456 , where, select , encode=encode )
> ['SCONJ'] # "target" is a keyword that refers to self.target_index
> # id_ is not used here
# find the UPOS of the nominative subject attached to the root
where = { R.PARENT: {F.DEPREL:"root"},
R.TOKEN:{ F.DEPREL:'nsubj', F.MORPH_CASE:'Case=Nom'} }
select = F.UPOS
encode = E.RAW
parser.query( 123456 , where, select , encode=encode )
> ['PRON'] # you can query using morphology
# find the FORM of the singular obl tokens attached to the root that have an ADJ child
where = { R.PARENT: {F.DEPREL:"root"},
R.TOKEN:{ F.DEPREL:'obl', F.MORPH_CASE:'Number=Sing'},
R.CHILD: { F.UPOS: 'ADJ'} }
select = F.FORM
encode = E.RAW
parser.query( 123456 , where, select , encode=encode )
> ['artist'] # you can query very complex patterns