ConllQuery

Inherits from ConllRegExp, ConllSelector and ResponsePostProcessor. It provides a more complex but complete SQL-like language for querying a CoNLL tree or graph.

Use this module when you want to run SQL-like queries on your sentences. For example:

Select the Lemmas of the Children of the ith word that are Prepositions
Select the first 3 DependencyRelations of the Brothers of the ith word sorted alphabetically
Select all the words that have a verb parent with a subject attachment and a prepositional child with a given lemma
Build the path of Lemmas and POS from the ith token to the jth token
Count how many Adjective children the kth word has

Parameters

# the user can set the column indexes if the input does not follow the standard CoNLL column order
ConllQuery( IDX_ID=0, IDX_FORM=1, IDX_LEMMA=2, 
             IDX_UPOS=3, IDX_XPOS=4, IDX_MORPH=5, 
             IDX_HEAD=6, IDX_DEPREL=7, IDX_GRAPH=8 )  

Methods

ConllQuery.set_sentence( sentence  )  
> sentence (object or list(list(str))): the CoNLL sentence to process

ConllQuery.set_target( target_index )  
> target_index (int): index of the target starting from zero

resp = ConllQuery.query( id_, dict_pattern, select, 
                         sort=S.NONE, limit=L.NONE, encode=E.RAW )  

> id_ (int): index of the token the query refers to, 
>            used only if the pattern contains 'id_', otherwise ignored
>
> dict_pattern (dict): nested dictionary encoding the regexp to match
>                   keys must be either relations or features
>                   values are standard regular expressions
>  Example: 
> # find all the prepositions starting with 'a'
> # whose parent is the token id_ and is the root
> dict_pattern = { R.TOKEN:  { F.UPOS: 'ADP',  F.LEMMA:  '^a.*' },
>                  R.PARENT: { F.ID: 'id_', F.DEPREL: 'root' } }
>
> select (FEATURE): name of the feature or tuple of features to get 
>
> sort (SORT): name of the sort algorithm to use 
>
> limit (LIMIT or int): name of the limit algorithm to use;
>                    if an int, the number of elements to keep
>
> encode (ENCODER): name of the encoder or tuple of encoders to apply 
>
> resp (any): encoded features of the elements matching the query
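
The sort, limit and encode arguments act on the list of selected values. As a rough illustration in plain Python (not the library's code), the sketch below assumes that sorting happens before limiting and reuses the ':>:' separator shown under Usage. `post_process` and its parameters are hypothetical names.

```python
def post_process(values, sort_alphabetically=False, limit=None, concatenate=False):
    """Hypothetical stand-in for the sort / limit / encode steps of a query."""
    result = list(values)
    if sort_alphabetically:
        result = sorted(result)
    if limit is not None:
        result = result[:limit]          # keep only the first `limit` elements
    if concatenate:
        return ":>:".join(result)        # string encoding, as in E.CONCATENATE
    return result                        # raw list, as in E.RAW


# DEPREL values of the children of 'released' in the example sentence.
deprels = ["advcl", "punct", "nsubj", "obj", "obl"]

print(post_process(deprels, sort_alphabetically=True, limit=3))
# ['advcl', 'nsubj', 'obj']
print(post_process(deprels, sort_alphabetically=True, limit=3, concatenate=True))
# advcl:>:nsubj:>:obj
```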

resp = ConllQuery.regexp( dict_pattern )  
> dict_pattern (dict): nested dictionary encoding the regexp to match
>                   keys must be either relations or features
>                   values are standard regular expressions
>  Example: 
> # find all the prepositions starting with a
> # that have a parent that is a verb and is the root
> dict_pattern = { R.TOKEN:  { F.UPOS: 'ADP',  F.LEMMA:  '^a.*' },
>                  R.PARENT: { F.XPOS: '^V.*', F.DEPREL: 'root' } }
>
> resp (list(int)): list of the indexes of tokens matching the regexp
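
To make the nested-dictionary idea concrete, here is a minimal plain-Python sketch of how such a pattern could be evaluated. It is not the library's implementation: the string keys ("token", "parent", "upos", ...) stand in for the R and F enumerations, `regexp` and `matches` are hypothetical helpers, and the use of `re.search` is an assumption.

```python
import re

# id form lemma upos xpos head deprel (subset of the columns of the example sentence)
SENTENCE = """
1 While while SCONJ IN 2 mark
2 working work VERB VBG 11 advcl
3 at at ADP IN 5 case
4 a a DET DT 5 det
5 supermarket supermarket NOUN NN 2 obl
6 as as ADP IN 8 case
7 a a DET DT 8 det
8 bagger bagger NOUN NN 2 obl
9 , , PUNCT , 11 punct
10 he he PRON PRP 11 nsubj
11 released release VERB VBD 0 root
12 music music NOUN NN 11 obj
13 as as ADP IN 16 case
14 an a DET DT 16 det
15 independent independent ADJ JJ 16 amod
16 artist artist NOUN NN 11 obl
"""
FIELDS = ("id", "form", "lemma", "upos", "xpos", "head", "deprel")
TOKENS = [dict(zip(FIELDS, line.split())) for line in SENTENCE.strip().splitlines()]


def matches(token, constraints):
    """True if every feature of the token matches its regular expression."""
    return all(re.search(rx, token[feat]) for feat, rx in constraints.items())


def regexp(tokens, pattern):
    """Return zero-based indexes of tokens satisfying the token and parent constraints."""
    hits = []
    for i, tok in enumerate(tokens):
        if not matches(tok, pattern.get("token", {})):
            continue
        if "parent" in pattern:
            head = int(tok["head"])      # HEAD is 1-based, 0 means no parent
            if head == 0 or not matches(tokens[head - 1], pattern["parent"]):
                continue
        hits.append(i)
    return hits


# Nouns whose lemma starts with 'a' and whose parent is a verb and the root.
nouns = {"token": {"upos": "NOUN", "lemma": "^a.*"},
         "parent": {"xpos": "^V.*", "deprel": "root"}}
print(regexp(TOKENS, nouns))   # [15]  -> 'artist'

# The preposition pattern from the example above finds nothing in this sentence,
# because every ADP here attaches to a noun.
preps = {"token": {"upos": "ADP", "lemma": "^a.*"},
         "parent": {"xpos": "^V.*", "deprel": "root"}}
print(regexp(TOKENS, preps))   # []
```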

resp = ConllQuery.find( id_ , relation )  
> id_ (int): index of the token we want to select
> relation (RELATION): relation or tuple of relations to find
> resp (list(int)): list of the indexes of tokens found
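
Relations such as parent, child and brother can all be derived from the HEAD column. The sketch below shows one way to do that in plain Python under the assumption of zero-based token indexes; the function names are hypothetical and are not part of ConllQuery.

```python
# HEAD column of the example sentence (1-based ids, 0 = root), one entry per token.
HEADS = [2, 11, 5, 5, 2, 8, 8, 2, 11, 11, 0, 11, 16, 16, 16, 11]


def parent(i):
    """Zero-based index of the parent of token i, or an empty list for the root."""
    head = HEADS[i]
    return [] if head == 0 else [head - 1]


def children(i):
    """Zero-based indexes of the tokens whose HEAD points at token i."""
    return [j for j, head in enumerate(HEADS) if head == i + 1]


def siblings(i):
    """Zero-based indexes of the other tokens that share token i's parent."""
    head = HEADS[i]
    if head == 0:
        return []
    return [j for j, other in enumerate(HEADS) if other == head and j != i]


print(parent(15))     # [10]            -> 'released'
print(children(15))   # [12, 13, 14]    -> 'as', 'an', 'independent'
print(siblings(15))   # [1, 8, 9, 11]   -> 'working', ',', 'he', 'music'
```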

resp = ConllQuery.select( id_ , feature )  
> id_ (int): index of the token we want to select
> feature (FEATURE): name of the feature or tuple of features to get
> resp (str): string containing the values of the selected features

path = ConllQuery.get_path_to_target( id_ )
> id_ (int): index of the token where the path to the target starts  
> path (list(int)): list of ids of tokens in the path to target

path = ConllQuery.get_path_to_root( id_ )
> id_ (int):  index of the token where the path to the root starts  
> path (list(int)): list of ids of tokens in the path to root
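
Both path methods amount to following HEAD pointers. The following plain-Python sketch illustrates the idea; it assumes zero-based indexes, includes both endpoints in the path, and routes the path to the target through the lowest common ancestor. Those choices, and the function names, are assumptions rather than documented behaviour of ConllQuery.

```python
# HEAD column of the example sentence (1-based ids, 0 = root), one entry per token.
HEADS = [2, 11, 5, 5, 2, 8, 8, 2, 11, 11, 0, 11, 16, 16, 16, 11]


def path_to_root(i):
    """Follow HEAD pointers from token i until the root is reached."""
    path = [i]
    while HEADS[path[-1]] != 0:
        path.append(HEADS[path[-1]] - 1)
    return path


def path_to_target(i, target):
    """Climb from token i to the lowest common ancestor, then descend to the target."""
    up = path_to_root(i)
    down = path_to_root(target)
    common = next(node for node in up if node in down)
    climb = up[:up.index(common) + 1]
    descend = down[:down.index(common)][::-1]
    return climb + descend


print(path_to_root(12))        # [12, 15, 10]        as -> artist -> released
print(path_to_target(12, 0))   # [12, 15, 10, 1, 0]  ... -> working -> While
```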

A ConllSentence

ID	FORM	LEMMA	UPOS	XPOS	MORPH	HEAD	DEPREL	GRAPH
1	While	while	SCONJ	IN	_	2	mark	_
2	working	work	VERB	VBG	VerbForm=Ger	11	advcl	_
3	at	at	ADP	IN	_	5	case	_
4	a	a	DET	DT	Definite=Ind|PronType=Art	5	det	_
5	supermarket	supermarket	NOUN	NN	Number=Sing	2	obl	_
6	as	as	ADP	IN	_	8	case	_
7	a	a	DET	DT	Definite=Ind|PronType=Art	8	det	_
8	bagger	bagger	NOUN	NN	Number=Sing	2	obl	_
9	,	,	PUNCT	,	_	11	punct	_
10	he	he	PRON	PRP	Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs	11	nsubj	2:nsubj
11	released	release	VERB	VBD	Mood=Ind|Tense=Past|VerbForm=Fin	0	root	_
12	music	music	NOUN	NN	Number=Sing	11	obj	_
13	as	as	ADP	IN	_	16	case	_
14	an	a	DET	DT	Definite=Ind|PronType=Art	16	det	_
15	independent	independent	ADJ	JJ	Degree=Pos	16	amod	_
16	artist	artist	NOUN	NN	Number=Sing	11	obl	_
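
A sentence like the one above can be turned into the list(list(str)) form accepted by set_sentence with a few lines of plain Python. This is only a sketch: `parse_conll` is a hypothetical helper, not part of ConllQuery, and the column constants simply mirror the defaults listed under Parameters.

```python
# Column positions, mirroring the default indexes listed under Parameters.
IDX_ID, IDX_FORM, IDX_LEMMA, IDX_UPOS, IDX_XPOS = 0, 1, 2, 3, 4
IDX_MORPH, IDX_HEAD, IDX_DEPREL, IDX_GRAPH = 5, 6, 7, 8

# First three tokens of the example sentence as tab-separated text.
CONLL_TEXT = (
    "1\tWhile\twhile\tSCONJ\tIN\t_\t2\tmark\t_\n"
    "2\tworking\twork\tVERB\tVBG\tVerbForm=Ger\t11\tadvcl\t_\n"
    "3\tat\tat\tADP\tIN\t_\t5\tcase\t_\n"
)


def parse_conll(text):
    """Split CoNLL text into one list of column strings per token."""
    rows = []
    for line in text.splitlines():
        # skip blank lines and comment lines
        if not line.strip() or line.startswith("#"):
            continue
        rows.append(line.split("\t"))
    return rows


sentence = parse_conll(CONLL_TEXT)
print(sentence[1][IDX_LEMMA])   # work
print(sentence[1][IDX_HEAD])    # 11
```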

Usage

from ConllQuery import ConllQuery
from ConllQuery.enumerations import FEATURES as F
from ConllQuery.enumerations import RELATION as R
from ConllQuery.enumerations import SORT     as S
from ConllQuery.enumerations import LIMITER  as L
from ConllQuery.enumerations import ENCODER  as E

# instantiation
parser = ConllQuery()

# set the sentence object that will be processed
parser.set_sentence( sentence ) 

# find the UPOS of the word at index 3
where  = {R.TOKEN: {F.ID:"id_"}}
select = F.UPOS 
parser.query( 3 , where, select  ) 
> ['DET']  # indexes start at zero, so index 3 is the fourth token, 'a'

# find the LEMMA of the children of 15th word that are ADP and case
where  = {R.PARENT: {F.ID:"id_"}, 
          R.TOKEN:{F.UPOS:"=ADP", F.DEPREL:"=case"}}
select = F.LEMMA 
parser.query( 15 , where, select ) 
> ['as'] # it works!  

# find the LEMMA of the children of the 15th word that begin with 'a'
where  = {R.PARENT: {F.ID:"id_"}, 
          R.TOKEN:{F.LEMMA:"^a"}}
select = F.LEMMA 
encode = E.RAW
parser.query( 15 , where, select , encode=encode ) 
> ['as','a']  # raw returns a list (default); the lemma of 'an' is 'a'

sort = S.ALPHABETICALLY
parser.query( 15 , where, select , sort=sort, encode=encode ) 
> ['a','as']  # this time the list is alphabetically sorted 

encode = E.CONCATENATE
parser.query( 15 , where, select , encode=encode ) 
> 'as:>:a'  # encoding returns a ':>:'.join string

# Set the target to while
parser.set_target( 0 )

# find the UPOS of the target word
where  = {R.TOKEN: {F.ID:"target"}}
select = F.UPOS 
encode = E.RAW
parser.query( 123456 , where, select , encode=encode ) 
> ['SCONJ']  # target is a keyword that refers to self.target_index
>          # id_ here is not used

# find the UPOS of the nominative subject attached to the root
where  = { R.PARENT: {F.DEPREL:"root"}, 
           R.TOKEN:{ F.DEPREL:'nsubj', F.MORPH_CASE:'Case=Nom'} }
select = F.UPOS 
encode = E.RAW
parser.query( 123456 , where, select , encode=encode ) 
> ['PRON']  # you can query using morphology

# find the FORM of the obl tokens attached to the root that have an adjective child
where  = { R.PARENT: {F.DEPREL:"root"}, 
           R.TOKEN:{ F.DEPREL:'obl', F.MORPH_CASE:'Number=Sing'}, 
           R.CHILD: { F.UPOS: 'ADJ'} }
select = F.FORM 
encode = E.RAW
parser.query( 123456 , where, select , encode=encode ) 
> ['artist']  # you can query very complex patterns