edu.mit.jmwe.data.concordance
Class ConcordanceToken

java.lang.Object
  extended by edu.mit.jmwe.data.Token
      extended by edu.mit.jmwe.data.concordance.ConcordanceToken
All Implemented Interfaces:
IConcordanceToken, IHasForm, IToken
Direct Known Subclasses:
ConcordanceTagger.TaggerToken

public class ConcordanceToken
extends Token
implements IConcordanceToken

Default implementation of IConcordanceToken.

This class requires JSemcor to be on the classpath.

Since:
jMWE 1.0.0
Version:
$Id: ConcordanceToken.java 620 2011-05-08 21:13:58Z markaf $
Author:
M.A. Finlayson

Field Summary
static Pattern semcorTokenPattern
          A compiled regular expression pattern that captures the string representation of tagged tokens.
static Pattern whitespaceDelimited
          A compiled regular expression that matches a non-empty run of whitespace.
 
Constructor Summary
ConcordanceToken(String text, String tag, int tokenNum, int partNum, String... stems)
          Constructs a new semcor token object with the specified text, tag, token number, part number and stems.
 
Method Summary
 boolean equals(Object obj)
           
 int getPartNumber()
          Returns the index of the part in the Semcor token from which this part was extracted.
 int getTokenNumber()
          Returns the index of the token in the Semcor sentence from which it was extracted.
 int hashCode()
           
static ConcordanceToken parse(String str)
          Parses a string of the form "test_NN_stem1_stem2_..._stemN_1_0" into a ConcordanceToken instance.
static List<ConcordanceToken> parseList(String str)
          Parses a string formed from the whitespace-separated concatenation of strings of the form "test_NN_stem1_stem2_..._stemN_1_0" into a list of corresponding ConcordanceToken instances.
 String toString()
           
static String toString(IConcordanceToken token)
          Returns the String representation of the given token.
static ConcordanceToken toToken(int tokenNum, int partNum, edu.mit.jsemcor.element.IWordform wf, edu.mit.jsemcor.element.ISentence sent)
          Constructs a semcor token object from the given token number, part number, IWordform, and sentence drawn from the semcor corpus.
static List<ConcordanceToken> toTokens(edu.mit.jsemcor.element.IToken t, int tokenNum, edu.mit.jsemcor.element.ISentence sent)
          Returns a list of ConcordanceToken objects, one per part, if the token specified by the token number in the sentence is a continuous MWE.
 
Methods inherited from class edu.mit.jmwe.data.Token
checkStems, checkString, getForm, getStems, getTag
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface edu.mit.jmwe.data.IToken
getStems, getTag
 
Methods inherited from interface edu.mit.jmwe.data.IHasForm
getForm
 

Field Detail

semcorTokenPattern

public static final Pattern semcorTokenPattern
A compiled regular expression pattern that captures the string representation of tagged tokens. Pattern: ([\\S&&[^_]]+)_([\\S&&[^_]]+)_(\\S*)_(\\d+)_(\\d+)
  1. ([\\S&&[^_]]+) group 1, the token string as it appears in the sentence
  2. ([\\S&&[^_]]+) group 2, the part of speech tag
  3. (\\S*) group 3, the list of stems, which may be empty
  4. (\\d+) group 4, the token number
  5. (\\d+) group 5, the part number

Since:
jMWE 1.0.0
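
The group layout above can be checked without jMWE on the classpath. The following is a minimal sketch that recompiles the documented expression locally and reads back the five groups. The class name SemcorTokenPatternDemo and the helper groups are hypothetical, and the use of Matcher.matches() (whole-string match) is an assumption; this page does not say how the library applies the pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of jMWE.
final class SemcorTokenPatternDemo {

    // The expression documented for semcorTokenPattern, recompiled locally.
    static final Pattern TOKEN = Pattern.compile(
            "([\\S&&[^_]]+)_([\\S&&[^_]]+)_(\\S*)_(\\d+)_(\\d+)");

    // Returns the five captured groups, or null if the whole input does not match.
    // Assumption: a whole-string match is intended.
    static String[] groups(String s) {
        Matcher m = TOKEN.matcher(s);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[5];
        for (int i = 0; i < 5; i++) {
            out[i] = m.group(i + 1);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints: test | NN | stem1_stem2 | 1 | 0
        System.out.println(String.join(" | ", groups("test_NN_stem1_stem2_1_0")));
    }
}
```

Because group 3 is (\\S*), this expression also accepts an empty stem field, as in "test_NN__1_0". The delimiting underscores on both sides of the field are still required.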

whitespaceDelimited

public static final Pattern whitespaceDelimited
A compiled regular expression that matches a non-empty run of whitespace.

Since:
jMWE 1.0.0
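
As an illustration of how such a pattern can separate concatenated token strings, here is a minimal sketch. It assumes the pattern is equivalent to \\s+, since the actual expression is not shown on this page, and the class WhitespaceSplitDemo and its split helper are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of jMWE.
final class WhitespaceSplitDemo {

    // Assumption: "a non-empty run of whitespace" corresponds to \s+.
    static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Splits the input on runs of whitespace, ignoring leading and trailing whitespace.
    static List<String> split(String s) {
        String trimmed = s.trim();
        if (trimmed.isEmpty()) {
            // Pattern.split would return a single empty string here.
            return Collections.emptyList();
        }
        return Arrays.asList(WHITESPACE.split(trimmed));
    }

    public static void main(String[] args) {
        System.out.println(split("test_NN_test_1_0   run_VB_run_2_0\n"));
    }
}
```

Trimming first avoids the empty leading element that Pattern.split produces when the input begins with whitespace.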
Constructor Detail

ConcordanceToken

public ConcordanceToken(String text,
                        String tag,
                        int tokenNum,
                        int partNum,
                        String... stems)
Constructs a new semcor token object with the specified text, tag, token number, part number and stems. The stem array may be null or empty. If null, no stems have been assigned. If empty, the token is unstemmable.

Parameters:
text - the surface form of the token as it appears in the sentence, capitalization intact
tag - the tag of the token, if assigned, otherwise null
tokenNum - the token number. Must be greater than or equal to 0.
partNum - the part number representing the index of the token in a multi-word expression, 0 if it is not part of one. Must be greater than or equal to 0.
stems - the list of stems, possibly empty or null
Throws:
NullPointerException - if the text is null
IllegalArgumentException - if the text is empty or all whitespace or if the token number or part number is less than 0.
Since:
jMWE 1.0.0
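
The argument contract above can be summarized in code. This is a sketch of the documented checks only, written as a hypothetical stand-in class (ConcordanceTokenArgsSketch) so that it runs without jMWE; it is not the library's implementation, and the exception messages are invented for illustration.

```java
// Hypothetical stand-in that mirrors the documented constructor contract; not part of jMWE.
final class ConcordanceTokenArgsSketch {

    // Applies the documented checks on text, token number and part number.
    static void check(String text, int tokenNum, int partNum) {
        if (text == null) {
            throw new NullPointerException("text is null");
        }
        if (text.trim().isEmpty()) {
            throw new IllegalArgumentException("text is empty or all whitespace");
        }
        if (tokenNum < 0) {
            throw new IllegalArgumentException("token number is less than 0");
        }
        if (partNum < 0) {
            throw new IllegalArgumentException("part number is less than 0");
        }
    }

    // Distinguishes the three documented states of the stems argument.
    static String describeStems(String... stems) {
        if (stems == null) {
            return "no stems assigned";
        }
        if (stems.length == 0) {
            return "unstemmable";
        }
        return stems.length + " stem(s)";
    }

    public static void main(String[] args) {
        check("test", 1, 0);
        System.out.println(describeStems((String[]) null));
        System.out.println(describeStems());
        System.out.println(describeStems("test"));
    }
}
```

With a varargs parameter, omitting the stems yields an empty array (unstemmable), while a null array must be passed explicitly with a cast.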
Method Detail

getTokenNumber

public int getTokenNumber()
Description copied from interface: IConcordanceToken
Returns the index of the token in the Semcor sentence from which it was extracted.

Specified by:
getTokenNumber in interface IConcordanceToken
Returns:
the token number of the token in the Semcor sentence from which it was extracted.

getPartNumber

public int getPartNumber()
Description copied from interface: IConcordanceToken
Returns the index of the part in the Semcor token from which this part was extracted. This number will be greater than zero only if the token was originally a part of a multi-word expression in Semcor.

Specified by:
getPartNumber in interface IConcordanceToken
Returns:
the index of the part in the Semcor token from which this part was extracted.

toString

public String toString()
Overrides:
toString in class Token

hashCode

public int hashCode()
Overrides:
hashCode in class Object

equals

public boolean equals(Object obj)
Overrides:
equals in class Object

toString

public static String toString(IConcordanceToken token)
Returns the String representation of the given token. This has the form: form_tag_stem[1]_stem[2]_..._stem[n]_tokenNumber_partNumber

Parameters:
token - the token to be represented as a string
Returns:
the String representation of the given token.
Since:
jMWE 1.0.0
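
To make the layout concrete, the following sketch assembles a string in the documented form from plain values. The class TokenFormatSketch and its format helper are hypothetical, and the sketch assumes a non-null tag and a non-null stem array; how the library itself renders a null tag or null stems is not stated on this page.

```java
// Hypothetical helper illustrating the documented layout; not part of jMWE.
final class TokenFormatSketch {

    // Builds form_tag_stem[1]_..._stem[n]_tokenNumber_partNumber.
    // Assumption: tag and stems are non-null.
    static String format(String form, String tag, int tokenNum, int partNum, String... stems) {
        return form + "_" + tag + "_" + String.join("_", stems) + "_" + tokenNum + "_" + partNum;
    }

    public static void main(String[] args) {
        // prints: test_NN_stem1_stem2_1_0
        System.out.println(format("test", "NN", 1, 0, "stem1", "stem2"));
    }
}
```

In this sketch an empty stem list leaves an empty field between two underscores ("test_NN__1_0"), which the expression documented for semcorTokenPattern still matches because group 3 may be empty.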

parse

public static ConcordanceToken parse(String str)
Parses a string of the form "test_NN_stem1_stem2_..._stemN_1_0" into a ConcordanceToken instance.

Parameters:
str - the string representing the tagged token
Returns:
a ConcordanceToken instance
Throws:
NullPointerException - if the specified string is null
IllegalArgumentException - if the specified string does not match the expected format
Since:
jMWE 1.0.0
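
The parsing contract can be sketched with the standard regex API alone. The code below is an approximation under stated assumptions, not the jMWE source: the classes ParseSketch and Parsed are hypothetical, the stem field is assumed to be split on underscores, and a whole-string match is assumed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical re-creation of the documented parse contract; not part of jMWE.
final class ParseSketch {

    // Plain value holder standing in for a ConcordanceToken.
    static final class Parsed {
        final String form;
        final String tag;
        final String[] stems;
        final int tokenNum;
        final int partNum;

        Parsed(String form, String tag, String[] stems, int tokenNum, int partNum) {
            this.form = form;
            this.tag = tag;
            this.stems = stems;
            this.tokenNum = tokenNum;
            this.partNum = partNum;
        }
    }

    // The expression documented for semcorTokenPattern, recompiled locally.
    static final Pattern TOKEN = Pattern.compile(
            "([\\S&&[^_]]+)_([\\S&&[^_]]+)_(\\S*)_(\\d+)_(\\d+)");

    static Parsed parse(String str) {
        if (str == null) {
            throw new NullPointerException("str is null");
        }
        Matcher m = TOKEN.matcher(str);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a tagged token string: " + str);
        }
        // Assumption: stems inside group 3 are separated by underscores.
        String stemField = m.group(3);
        String[] stems = stemField.isEmpty() ? new String[0] : stemField.split("_");
        return new Parsed(m.group(1), m.group(2), stems,
                Integer.parseInt(m.group(4)), Integer.parseInt(m.group(5)));
    }

    public static void main(String[] args) {
        Parsed p = parse("test_NN_stem1_stem2_1_0");
        System.out.println(p.form + " " + p.tag + " " + p.stems.length
                + " " + p.tokenNum + " " + p.partNum);
    }
}
```

In the documented example string, the trailing 1 is read as the token number and the trailing 0 as the part number, following the group order of the pattern.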

parseList

public static List<ConcordanceToken> parseList(String str)
Parses a string formed from the whitespace-separated concatenation of strings of the form "test_NN_stem1_stem2_..._stemN_1_0" into a list of corresponding ConcordanceToken instances.

Parameters:
str - the concatenated string representing the tagged token
Returns:
a list of ConcordanceToken instances, or an empty list if the specified string does not contain any well-formed tagged token strings
Throws:
NullPointerException - if the specified string is null
IllegalArgumentException - if the specified string does not conform to the expected format
Since:
jMWE 1.0.0

toTokens

public static List<ConcordanceToken> toTokens(edu.mit.jsemcor.element.IToken t,
                                              int tokenNum,
                                              edu.mit.jsemcor.element.ISentence sent)
Returns a list of ConcordanceToken objects, one per part, if the token specified by the token number in the sentence is a continuous MWE. Otherwise, returns a singleton list.

Parameters:
t - the token specified by the token number in the given sentence
tokenNum - the token number of the token to be translated into a concordance token object
sent - the sentence
Returns:
a list of concordance tokens constructed from the specified token
Since:
jMWE 1.0.0

toToken

public static ConcordanceToken toToken(int tokenNum,
                                       int partNum,
                                       edu.mit.jsemcor.element.IWordform wf,
                                       edu.mit.jsemcor.element.ISentence sent)
Constructs a Semcor token object from the given token number, part number, IWordform, and sentence drawn from the Semcor corpus. The lemma is obtained from the wordform's semantic tag. If the wordform is part of a discontinuous MWE, the semantic tag of the first wordform in the MWE is used instead.

Parameters:
tokenNum - the token number
partNum - the part number
wf - the word form
sent - the JSemcor sentence
Returns:
a new semcor token constructed out of the specified information
Since:
jMWE 1.0.0


Copyright © 2011 Massachusetts Institute of Technology. All Rights Reserved.