edu.mit.jmwe.data.concordance
Class ConcordanceTagger

java.lang.Object
  extended by edu.mit.jmwe.util.AbstractFileSelector
      extended by edu.mit.jmwe.data.concordance.ConcordanceTagger
All Implemented Interfaces:
Runnable

public class ConcordanceTagger
extends AbstractFileSelector
implements Runnable

Tags all words in all contexts of a given concordance set with their parts of speech.

This class depends on three external libraries: JWI, JSemcor, and the Stanford POS Tagger.

Use the main method of this class for its default functionality.

Since:
jMWE 1.0.0
Version:
$Id: ConcordanceTagger.java 620 2011-05-08 21:13:58Z markaf $
Author:
M.A. Finlayson, N. Kulkarni
See Also:
TaggedConcordanceIterator

Nested Class Summary
protected static class ConcordanceTagger.TaggerToken
          Represents a semcor token that is not yet tagged.
 
Constructor Summary
ConcordanceTagger()
           
 
Method Summary
protected  void addWords(edu.mit.jsemcor.element.IWordform wf, int tokenNum, List<ConcordanceTagger.TaggerToken> result, edu.mit.jwi.morph.IStemmer stemmer)
          Stems each of the words in the provided wordform, adding the tagger tokens created from these stems, words and token number to the given results list.
protected  File getLocation(Class<?> key)
          Utility method for getting a location that has a default stored in the Java preferences.
protected  edu.stanford.nlp.ling.SentenceProcessor<edu.stanford.nlp.ling.HasWord,? extends edu.stanford.nlp.ling.TaggedWord> getPOSTagger()
          Returns a maximum entropy tagger using a Stanford NLP tagging model selected by the user.
protected  edu.mit.jsemcor.main.IConcordanceSet getSemcor()
          Returns the Semcor concordance set or null if the directory cannot be found.
protected  edu.mit.jwi.morph.IStemmer getStemmer()
          Returns a stemmer that requires Wordnet or null if the Wordnet directory cannot be found.
protected  Writer getWriter()
          Returns a writer for the file to which the tagged concordance will be written.
static void main(String[] args)
          Tags the Semcor corpus.
protected  ArrayList<edu.stanford.nlp.ling.HasWord> makeSentence(edu.mit.jsemcor.element.ISentence s, edu.mit.jwi.morph.IStemmer stemmer)
          Returns a Stanford parser sentence that contains all the tokens from the specified JSemcor sentence, with MWEs broken into their constituent tokens.
 void process(edu.mit.jsemcor.element.IContextID startContext, int startSent, Iterable<? extends edu.mit.jsemcor.main.IConcordance> cs, edu.stanford.nlp.ling.SentenceProcessor<edu.stanford.nlp.ling.HasWord,? extends edu.stanford.nlp.ling.TaggedWord> posTagger, edu.mit.jwi.morph.IStemmer stemmer, Writer writer, IProgressBar pb)
          Tags all the contexts provided by the concordance set, using the specified tagger, writing the data to the specified writer.
protected  void process(edu.mit.jsemcor.element.IContextID cid, edu.mit.jsemcor.element.ISentence s, edu.stanford.nlp.ling.SentenceProcessor<edu.stanford.nlp.ling.HasWord,? extends edu.stanford.nlp.ling.TaggedWord> posTagger, edu.mit.jwi.morph.IStemmer stemmer, Writer writer)
          Tags the provided sentence, using the specified tagger, writing the data to the specified writer.
 void process(Iterable<? extends edu.mit.jsemcor.main.IConcordance> cs, edu.stanford.nlp.ling.SentenceProcessor<edu.stanford.nlp.ling.HasWord,? extends edu.stanford.nlp.ling.TaggedWord> posTagger, edu.mit.jwi.morph.IStemmer stemmer, Writer writer, IProgressBar pb)
          Tags all the contexts provided by the concordance set, using the specified tagger, writing the data to the specified writer.
 void run()
           
protected  void setLocation(Class<?> key, File loc)
          Sets a default location into the Java Preferences.
protected  List<String> stem(String token, edu.mit.jsemcor.element.IWordform wf, edu.mit.jwi.morph.IStemmer stemmer)
          Stems the given token.
 
Methods inherited from class edu.mit.jmwe.util.AbstractFileSelector
choose, chooseDirectory, chooseFile, chooseFileForWriting, getFileChooser
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ConcordanceTagger

public ConcordanceTagger()
Method Detail

main

public static void main(String[] args)
Tags the Semcor corpus. Running this method will prompt the user for the following locations:
  1. a directory containing JSemcor-compatible concordance data files, e.g., SemCor
  2. a directory containing JWI-compatible electronic dictionary files, e.g., Wordnet
  3. a Stanford-POS-Tagger-compatible tagging model, e.g., left3words-wsj-0-18.tagger
  4. a file to which the tagged data should be written
The resulting file in (4) can be used via the TaggedConcordanceIterator class.

Parameters:
args - standard main method arguments; ignored
Since:
jMWE 1.0.0

run

public void run()
Specified by:
run in interface Runnable

getSemcor

protected edu.mit.jsemcor.main.IConcordanceSet getSemcor()
Returns the Semcor concordance set or null if the directory cannot be found.

Returns:
the Semcor concordance set or null if the directory cannot be found.
Since:
jMWE 1.0.0

getStemmer

protected edu.mit.jwi.morph.IStemmer getStemmer()
Returns a stemmer that requires Wordnet or null if the Wordnet directory cannot be found.

Returns:
a stemmer that requires Wordnet or null if the Wordnet directory cannot be found.
Since:
jMWE 1.0.0

getPOSTagger

protected edu.stanford.nlp.ling.SentenceProcessor<edu.stanford.nlp.ling.HasWord,? extends edu.stanford.nlp.ling.TaggedWord> getPOSTagger()
                                                                                                                                  throws Exception
Returns a maximum entropy tagger using a Stanford NLP tagging model selected by the user. Will return null if no model is selected or found.

Returns:
a MaxentTagger using a Stanford NLP tagging model selected by the user. Will return null if no model is selected or found.
Throws:
Exception
Since:
jMWE 1.0.0

getWriter

protected Writer getWriter()
                    throws IOException
Returns a writer for the file to which the tagged concordance will be written. The file is selected by the user. Will return null if no output file is selected.

Returns:
a writer for the file to which the tagged concordance will be written. Will return null if no output file is selected.
Throws:
IOException - if an exception occurs when constructing the file writer
Since:
jMWE 1.0.0

getLocation

protected File getLocation(Class<?> key)
Utility method for getting a location that has a default stored in the Java preferences.

Overrides:
getLocation in class AbstractFileSelector
Parameters:
key - the class that serves as key for this location
Returns:
the path to the stored location, or null if none
Since:
jMWE 1.0.0

setLocation

protected void setLocation(Class<?> key,
                           File loc)
Sets a default location into the Java Preferences.

Overrides:
setLocation in class AbstractFileSelector
Parameters:
key - the class that serves as key for this location
loc - the location to be saved to the preferences
Since:
jMWE 1.0.0
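
The getLocation/setLocation pair above describes defaults kept in the Java Preferences. The following is a minimal sketch of that idea using only java.util.prefs; it is not the jMWE implementation. The class name LocationPrefsSketch, the node path "jmwe-sketch/locations", and the choice of the key class's fully qualified name as the preference key are all assumptions made for illustration.

```java
import java.io.File;
import java.util.prefs.Preferences;

// Hypothetical helper, not part of jMWE: remembers one default location per key class.
final class LocationPrefsSketch {

    // Assumption: a single user-level preferences node holds all remembered locations.
    private static final Preferences NODE =
            Preferences.userRoot().node("jmwe-sketch/locations");

    // Returns the remembered location for the key class, or null if none is stored.
    static File getLocation(Class<?> key) {
        String path = NODE.get(key.getName(), null);
        return path == null ? null : new File(path);
    }

    // Stores the location under the key class's name; a null location clears the entry.
    static void setLocation(Class<?> key, File loc) {
        if (loc == null) {
            NODE.remove(key.getName());
        } else {
            NODE.put(key.getName(), loc.getAbsolutePath());
        }
    }

    public static void main(String[] args) {
        setLocation(String.class, new File("semcor"));
        System.out.println(getLocation(String.class));
        setLocation(String.class, null); // remove the demo entry again
    }
}
```

Keying on a class keeps each kind of resource (dictionary, corpus, tagging model) in its own slot, so one stored default does not overwrite another.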

process

public void process(Iterable<? extends edu.mit.jsemcor.main.IConcordance> cs,
                    edu.stanford.nlp.ling.SentenceProcessor<edu.stanford.nlp.ling.HasWord,? extends edu.stanford.nlp.ling.TaggedWord> posTagger,
                    edu.mit.jwi.morph.IStemmer stemmer,
                    Writer writer,
                    IProgressBar pb)
             throws IOException
Tags all the contexts provided by the concordance set, using the specified tagger, writing the data to the specified writer. This method does not close the writer when finished.

Parameters:
cs - the concordance set from which contexts should be drawn, may not be null
posTagger - the part of speech tagger to be used to tag the sentences, may not be null
stemmer - a stemmer used to stem words
writer - the writer to which results should be written, may not be null
pb - the progress bar to which progress is to be reported; may be null
Throws:
IOException - if there is a problem writing to the provided writer
NullPointerException - if the concordance set, tagger, or writer is null
Since:
jMWE 1.0.0
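
Because this method does not close the writer, the caller owns the writer's lifetime. The sketch below illustrates that convention with plain java.io types only. The method writeAll is a hypothetical stand-in for a process(...) call and does not reproduce the tagger's output format.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.List;

// Hypothetical illustration of a caller-owned writer; not jMWE code.
final class WriterOwnershipSketch {

    // Stand-in for process(...): writes one line per item and leaves the writer open.
    static void writeAll(Iterable<String> lines, Writer writer) throws IOException {
        if (lines == null || writer == null) {
            throw new NullPointerException();
        }
        for (String line : lines) {
            writer.write(line);
            writer.write('\n');
        }
    }

    public static void main(String[] args) throws IOException {
        // The caller opens the writer, may pass it to several calls, and closes it once.
        try (Writer out = new StringWriter()) {
            writeAll(List.of("first batch"), out);
            writeAll(List.of("second batch"), out);
            System.out.print(out);
        }
    }
}
```

Leaving the writer open lets several calls append to the same output; the try-with-resources block in the caller closes it exactly once.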

process

public void process(edu.mit.jsemcor.element.IContextID startContext,
                    int startSent,
                    Iterable<? extends edu.mit.jsemcor.main.IConcordance> cs,
                    edu.stanford.nlp.ling.SentenceProcessor<edu.stanford.nlp.ling.HasWord,? extends edu.stanford.nlp.ling.TaggedWord> posTagger,
                    edu.mit.jwi.morph.IStemmer stemmer,
                    Writer writer,
                    IProgressBar pb)
             throws IOException
Tags all the contexts provided by the concordance set, using the specified tagger, writing the data to the specified writer. The method skips ahead past the sentence number in the specified context. If the context does not occur in the specified concordance set, this method will do nothing. This method does not close the writer.

Parameters:
startContext - the context where the tagging should begin. If null, the tagging will begin with the first context.
startSent - the sentence number past which tagging should begin. If the number is non-positive, no sentences in the specified context are skipped.
cs - the concordance set from which contexts should be drawn, may not be null
posTagger - the part of speech tagger to be used to tag the sentences, may not be null
stemmer - a stemmer used to stem words
writer - the writer to which results should be written, may not be null
pb - the progress bar to which progress is to be reported; may be null
Throws:
IOException - if there is a problem writing to the provided writer
NullPointerException - if the concordance set, tagger, or writer is null
Since:
jMWE 1.0.0
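
The skip-ahead rules for startContext and startSent can be sketched with plain collections. This is an illustration under stated assumptions, not the jMWE code: contexts are modelled as an ordered map from a string ID to a list of sentences, sentence numbers are taken to be 1-based, and startSent is ignored when startContext is null. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the resume logic; the real method works on JSemcor objects.
final class ResumeSketch {

    // Returns "contextID:sentenceNumber text" for every sentence that would be processed.
    static List<String> select(String startContext, int startSent,
                               Map<String, List<String>> contexts) {
        List<String> out = new ArrayList<>();
        boolean started = (startContext == null); // null: begin with the first context
        for (Map.Entry<String, List<String>> e : contexts.entrySet()) {
            int skip = 0;
            if (!started) {
                if (!e.getKey().equals(startContext)) {
                    continue; // still before the start context
                }
                started = true;
                skip = Math.max(startSent, 0); // non-positive: skip nothing
            }
            List<String> sents = e.getValue();
            for (int i = skip; i < sents.size(); i++) {
                out.add(e.getKey() + ":" + (i + 1) + " " + sents.get(i));
            }
        }
        return out; // stays empty when startContext never occurs
    }

    public static void main(String[] args) {
        Map<String, List<String>> contexts = new LinkedHashMap<>();
        contexts.put("ctx-1", List.of("s1", "s2", "s3"));
        contexts.put("ctx-2", List.of("t1", "t2"));
        // Resume after sentence 2 of ctx-1.
        select("ctx-1", 2, contexts).forEach(System.out::println);
    }
}
```

A resume point of this kind lets an interrupted tagging run continue without reprocessing sentences that were already written.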

process

protected void process(edu.mit.jsemcor.element.IContextID cid,
                       edu.mit.jsemcor.element.ISentence s,
                       edu.stanford.nlp.ling.SentenceProcessor<edu.stanford.nlp.ling.HasWord,? extends edu.stanford.nlp.ling.TaggedWord> posTagger,
                       edu.mit.jwi.morph.IStemmer stemmer,
                       Writer writer)
                throws IOException
Tags the provided sentence, using the specified tagger, writing the data to the specified writer. This method does not close the writer.

Parameters:
cid - the context containing the sentence
s - the sentence being tagged
posTagger - the part of speech tagger to be used to tag the sentences, may not be null
stemmer - the stemmer used to stem the tokens, may not be null
writer - the writer to which results should be written, may not be null
Throws:
IOException - if there is a problem writing to the provided writer
NullPointerException - if the sentence, tagger, or writer is null
Since:
jMWE 1.0.0

makeSentence

protected ArrayList<edu.stanford.nlp.ling.HasWord> makeSentence(edu.mit.jsemcor.element.ISentence s,
                                                                edu.mit.jwi.morph.IStemmer stemmer)
Returns a Stanford parser sentence that contains all the tokens from the specified JSemcor sentence, with MWEs broken into their constituent tokens. Each new token is marked with a token number and a part number that together identify its source IToken object in the original Semcor sentence.

Parameters:
s - a JSemcor ISentence object to be transformed
stemmer - the stemmer used to stem the tokens
Returns:
a sentence object consisting of just the tokens, with MWEs split into their constituent tokens
Throws:
NullPointerException - if the specified sentence is null
Since:
jMWE 1.0.0
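
The token-number and part-number bookkeeping described for makeSentence can be sketched on plain strings. This is not the jMWE implementation: it assumes that the constituents of a multi-word expression are joined by underscores (for example "looked_up") and that both numbers are 0-based. The class MweSplitSketch and its Part type are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of splitting MWEs while remembering each part's origin.
final class MweSplitSketch {

    // One constituent word plus the indices that point back to its source token.
    static final class Part {
        final String text;
        final int tokenNum; // index of the source token in the sentence
        final int partNum;  // index of this word inside the source token

        Part(String text, int tokenNum, int partNum) {
            this.text = text;
            this.tokenNum = tokenNum;
            this.partNum = partNum;
        }

        @Override
        public String toString() {
            return text + "[" + tokenNum + "." + partNum + "]";
        }
    }

    // Splits every token on underscores and labels each piece with its origin.
    static List<Part> split(List<String> tokens) {
        List<Part> out = new ArrayList<>();
        for (int t = 0; t < tokens.size(); t++) {
            String[] pieces = tokens.get(t).split("_");
            for (int p = 0; p < pieces.length; p++) {
                out.add(new Part(pieces[p], t, p));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split(List.of("She", "looked_up", "the", "word")));
    }
}
```

Keeping both indices on every piece makes it possible to map the tagger's per-word output back onto the original tokens after tagging.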

addWords

protected void addWords(edu.mit.jsemcor.element.IWordform wf,
                        int tokenNum,
                        List<ConcordanceTagger.TaggerToken> result,
                        edu.mit.jwi.morph.IStemmer stemmer)
Stems each of the words in the provided wordform, adding the tagger tokens created from these stems, words and token number to the given results list.

Parameters:
wf - the wordform whose constituent words are to be stemmed
tokenNum - the number of the token to be tagged, inside the wordform
result - the list to which the tagger tokens will be added
stemmer - the stemmer used to stem the tokens, may not be null
Since:
jMWE 1.0.0

stem

protected List<String> stem(String token,
                            edu.mit.jsemcor.element.IWordform wf,
                            edu.mit.jwi.morph.IStemmer stemmer)
Stems the given token.

Parameters:
token - the token to be stemmed
wf - the wordform from which the token is drawn
stemmer - the stemmer used to stem the tokens, may not be null
Returns:
a list of the stems of the given token if the given wordform has no specified/recognizable part of speech or has more than one constituent token. Otherwise, returns null.
Since:
jMWE 1.0.0


Copyright © 2011 Massachusetts Institute of Technology. All Rights Reserved.