edu.mit.jmwe.index
Class IndexBuilder

java.lang.Object
  extended by edu.mit.jmwe.util.AbstractFileSelector
      extended by edu.mit.jmwe.index.IndexBuilder
All Implemented Interfaces:
Runnable

public class IndexBuilder
extends AbstractFileSelector
implements Runnable

Builds an MWE index that can be loaded into memory. The index is constructed from Wordnet, using Semcor as the reference concordance for obtaining the frequencies with which each MWE occurs as marked, unmarked, etc.

This class requires JWI and JSemcor to be on the classpath.

Since:
jMWE 1.0.0
Version:
$Id: IndexBuilder.java 639 2011-09-26 21:03:51Z markaf $
Author:
M.A. Finlayson

Nested Class Summary
static interface IndexBuilder.FileGetter
          Wouldn't it be nice to have first-class functions in Java?
static interface IndexBuilder.IMutableMWEDesc
           
static class IndexBuilder.MutableInfMWEDesc
           
static class IndexBuilder.MutableRootMWEDesc
          A root MWE description object whose counts can be incremented.
 
Constructor Summary
IndexBuilder()
           
 
Method Summary
protected  boolean contains(List<IMWE<IConcordanceToken>> list, IMWE<IConcordanceToken> mwe)
          Returns whether the specified MWE is contained in the specified list.
 void countMarked(List<IMWE<IConcordanceToken>> answers, Map<IMWEDescID,IndexBuilder.MutableRootMWEDesc> index)
          Counts instances of marked MWEs.
 void countUnmarked(IMWEDetector detector, IConcordanceSentence sent, List<IMWE<IConcordanceToken>> answers)
          Counts the number of MWEs that are detected by the specified detector, but not marked in the answer set as being MWEs.
static File deleteFile(File file, IndexBuilder.FileGetter fg)
          Gets a pointer to a file that does not exist.
 Map<IMWEDescID,IndexBuilder.MutableRootMWEDesc> extractMWEs(edu.mit.jwi.IDictionary dict)
          Retrieves multi-word expressions from the specified IDictionary object and returns them as a map.
<T extends IToken>
void
findMissingMWEs(List<IMWE<T>> mwes, Map<IMWEDescID,IndexBuilder.MutableRootMWEDesc> index, Set<IndexBuilder.MutableRootMWEDesc> missing)
          Finds MWEs that are marked in the specified list, but not in the index.
protected  edu.mit.jsemcor.main.IConcordanceSet getConcordance()
          Gets the concordance set that will be used to interface with Semcor.
protected  File getConcordanceDir()
          Returns the location of the concordance data; may be overridden by subclasses to provide a different manner of locating the concordance.
protected  File getDataFile()
          Prompts the user to select the file to which the MWE descriptions will be written along with their counts relating to their occurrences in the reference concordance.
protected  List<String> getDataHeaderLines()
          Returns the set of lines to be included as a header in the data file.
protected  edu.mit.jwi.IDictionary getDictionary()
          Gets the IDictionary that will be used to interface with Wordnet.
protected  File getDictionaryDir()
          Returns the location of the wordnet dictionary; may be overridden by subclasses to provide a different manner of locating the dictionary.
protected  int getEstimatedSentenceCount()
          Returns the estimated number of sentences being used from the reference concordance (Semcor).
protected  File getIndexFile()
          Prompts the user to select the file to which the index will be written.
protected  List<String> getIndexHeaderLines()
          Returns the set of lines to be included as a header in the index file.
protected  IndexBuilder.MutableInfMWEDesc getInflectedForm(IndexBuilder.MutableRootMWEDesc root, String form)
          Returns an inflected form that matches the specified surface form, attached to the root description.
protected  File getTaggedConcordanceFile()
          Returns the location of the tagged concordance file; may be overridden by subclasses to provide a different manner of locating the concordance.
protected  Iterable<IConcordanceSentence> getTaggedIterator()
          Gets an iterator over the tagged semcor sentences.
protected  IMWEDetector getUmarkedDetector(IMWEIndex index)
          Creates a detector that can be used to find sequences of tokens (inflected or not) that match an MWE description, but are not marked as an MWE.
protected  boolean isMWE(edu.mit.jwi.item.IIndexWord idxWord)
          Returns true if the given word is an MWE.
protected
<T extends IConcordanceToken>
boolean
isSplit(IMWE<T> mwe)
          Returns true if this MWE is not continuous - if it has interstitial tokens that are not a part of it; false otherwise.
static void main(String[] args)
          Constructs the MWE index from Wordnet and Semcor and writes it to a file.
 void printTotals(Map<IMWEDescID,IndexBuilder.MutableRootMWEDesc> entries)
          Sums all the counts of the MWEs in the given map and prints the totals.
 void process(edu.mit.jwi.IDictionary dict, Iterable<? extends IConcordanceSentence> itr, edu.mit.jsemcor.main.IConcordanceSet cs, File dataFile, File indexFile)
          Constructs the index in five steps:
 void run()
           
static MWEPOS toMWEPOS(edu.mit.jsemcor.element.ISemanticTag tag)
          Translates the JSemcor ISemanticTag to a jMWE MWEPOS object.
static void writeDataFile(IMWEIndex index, OutputStream out, Iterable<String> headerLines)
          Writes the MWEIndex data to the specified output stream.
static void writeIndexFile(IMWEIndex index, OutputStream out, Iterable<String> headerLines)
          Writes the MWEIndex index to the specified output stream.
 
Methods inherited from class edu.mit.jmwe.util.AbstractFileSelector
choose, chooseDirectory, chooseFile, chooseFileForWriting, getFileChooser, getLocation, setLocation
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

IndexBuilder

public IndexBuilder()
Method Detail

main

public static void main(String[] args)
Constructs the MWE index from Wordnet and Semcor and writes it to a file.

Parameters:
args - standard main arguments; ignored
Since:
jMWE 1.0.0

run

public void run()
Specified by:
run in interface Runnable

getDictionary

protected edu.mit.jwi.IDictionary getDictionary()
Gets the IDictionary that will be used to interface with Wordnet.

Returns:
an electronic dictionary from which the MWEs may be extracted, or null if the wordnet directory cannot be found.
Since:
jMWE 1.0.0

getDictionaryDir

protected File getDictionaryDir()
Returns the location of the wordnet dictionary; may be overridden by subclasses to provide a different manner of locating the dictionary.

Returns:
the directory containing the wordnet data files
Since:
jMWE 1.0.0

getConcordance

protected edu.mit.jsemcor.main.IConcordanceSet getConcordance()
Gets the concordance set that will be used to interface with Semcor.

Returns:
the concordance set containing the concordance information, or null if none
Since:
jMWE 1.0.0

getConcordanceDir

protected File getConcordanceDir()
Returns the location of the concordance data; may be overridden by subclasses to provide a different manner of locating the concordance.

Returns:
a directory containing a concordance set in Semcor format, or null if the concordance directory cannot be found.
Since:
jMWE 1.0.0

getTaggedIterator

protected Iterable<IConcordanceSentence> getTaggedIterator()
Gets an iterator over the tagged semcor sentences.

Returns:
the iterator over tagged semcor sentences. Will return null if the tagged concordance file cannot be found or if the user chooses to construct this index with no inflected forms or counts.
Since:
jMWE 1.0.0

getTaggedConcordanceFile

protected File getTaggedConcordanceFile()
Returns the location of the tagged concordance file; may be overridden by subclasses to provide a different manner of locating the concordance.

Returns:
a file containing the tagged concordance data, or null if none
Since:
jMWE 1.0.0

getDataFile

protected File getDataFile()
Prompts the user to select the file to which the MWE descriptions will be written along with their counts relating to their occurrences in the reference concordance. This file will contain lines of the form:
 account_for_V 14,0,0,0,0 accounted_for 5,0,0,0,5 accounting_for 1,0,0,0,0 accounts_for 2,0,0,0,1
 

Returns:
the file to which the MWE descriptions and counts will be written.
Since:
jMWE 1.0.0
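
As an illustration of the line format described for the data file, the following sketch parses one line into surface forms and their count lists. It is not part of jMWE: DataLineSketch and parse are hypothetical names, and the sketch assumes that fields are whitespace-separated, that they alternate between a form and a comma-separated list of integers, and that the documented example is a single logical line. The meaning of the individual counts is not stated here, so the sketch keeps them as an unlabeled array.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

class DataLineSketch {

    // Parses "form counts form counts ..." into an insertion-ordered map.
    // Assumption: whitespace separates fields, commas separate counts.
    static Map<String, int[]> parse(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length % 2 != 0) {
            throw new IllegalArgumentException("expected form/count pairs: " + line);
        }
        Map<String, int[]> result = new LinkedHashMap<>();
        for (int i = 0; i < fields.length; i += 2) {
            String[] parts = fields[i + 1].split(",");
            int[] counts = new int[parts.length];
            for (int j = 0; j < parts.length; j++) {
                counts[j] = Integer.parseInt(parts[j]);
            }
            result.put(fields[i], counts);
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "account_for_V 14,0,0,0,0 accounted_for 5,0,0,0,5 "
                + "accounting_for 1,0,0,0,0 accounts_for 2,0,0,0,1";
        for (Map.Entry<String, int[]> e : parse(line).entrySet()) {
            System.out.println(e.getKey() + " -> " + Arrays.toString(e.getValue()));
        }
    }
}
```

The first entry in the returned map is the root description (account_for_V in the example); the remaining entries are its inflected forms.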

getIndexFile

protected File getIndexFile()
Prompts the user to select the file to which the index will be written. This file will contain lines of the form:

aberration chromatic_aberration_N chromosomal_aberration_N optical_aberration_N spherical_aberration_N

Returns:
the file to which the index will be written.
Since:
jMWE 1.0.0

getDataHeaderLines

protected List<String> getDataHeaderLines()
Returns the set of lines to be included as a header in the data file.

Returns:
the set of lines to be included as a header in the data file.
Since:
jMWE 1.0.0

getIndexHeaderLines

protected List<String> getIndexHeaderLines()
Returns the set of lines to be included as a header in the index file.

Returns:
the set of lines to be included as a header in the index file.
Since:
jMWE 1.0.0

process

public void process(edu.mit.jwi.IDictionary dict,
                    Iterable<? extends IConcordanceSentence> itr,
                    edu.mit.jsemcor.main.IConcordanceSet cs,
                    File dataFile,
                    File indexFile)
             throws IOException
Constructs the index in five steps:

1. Extracts the MWEs from the given dictionary.

2. Finds the MWEs in the concordance that are missing from the dictionary.

3. Counts the number of times each MWE was marked as a continuous run of tokens, was marked as a non-continuous run, appeared with a known inflection pattern, etc.

4. Records the counts for unmarked sequences of MWE parts.

5. Writes the index to the data and index files.

If the concordance set provided is null, skips steps 2-4.

Parameters:
dict - the dictionary containing the MWEs
itr - the iterator over the sentences in the reference concordance
cs - the possibly null reference concordance set.
dataFile - the file to which the descriptions and counts will be written
indexFile - the file to which the index will be written
Throws:
IOException
Since:
jMWE 1.0.0

getEstimatedSentenceCount

protected int getEstimatedSentenceCount()
Returns the estimated number of sentences being used from the reference concordance (Semcor).

Returns:
the estimated number of sentences being used from Semcor.
Since:
jMWE 1.0.0

extractMWEs

public Map<IMWEDescID,IndexBuilder.MutableRootMWEDesc> extractMWEs(edu.mit.jwi.IDictionary dict)
Retrieves multi-word expressions from the specified IDictionary object and returns them as a map. Multi-word expressions are indexed in the map according to their collocates. That is, the keys of the map are parts of multi-word expressions (String objects that do not contain an underscore), and the set associated with that key contains all multi-word expressions in the dictionary that contain that part. This method returns a map whose keys and value sets are sorted in their natural order.

Parameters:
dict - a JWI IDictionary object
Returns:
a Map with collocates as keys and a set of the multi-word expressions that they are a part of as values.
Throws:
NullPointerException - if the specified dictionary is null
Since:
jMWE 1.0.0
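
The collocate-keyed organization described for this method can be sketched with sorted collections. This is only an illustration under stated assumptions: CollocateIndexSketch and index are hypothetical names, plain String lemmas stand in for the jMWE description objects, and the JWI dictionary lookup is replaced by a list of lemmas supplied by the caller.

```java
import java.util.Arrays;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class CollocateIndexSketch {

    // Maps each underscore-free part to the sorted set of lemmas containing it.
    static SortedMap<String, SortedSet<String>> index(Iterable<String> mweLemmas) {
        SortedMap<String, SortedSet<String>> idx = new TreeMap<>();
        for (String lemma : mweLemmas) {
            for (String part : lemma.split("_")) {
                if (part.isEmpty()) {
                    continue; // skip empty parts from doubled underscores
                }
                idx.computeIfAbsent(part, k -> new TreeSet<>()).add(lemma);
            }
        }
        return idx;
    }

    public static void main(String[] args) {
        SortedMap<String, SortedSet<String>> idx = index(Arrays.asList(
                "spherical_aberration", "chromatic_aberration", "account_for"));
        System.out.println(idx.get("aberration"));
    }
}
```

TreeMap and TreeSet keep keys and values in their natural order, which matches the ordering the description calls for.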

isMWE

protected boolean isMWE(edu.mit.jwi.item.IIndexWord idxWord)
Returns true if the given word is an MWE. Tests this by checking whether its lemma contains an underscore.

Parameters:
idxWord - the word to be checked.
Returns:
true if the given word is an MWE
Since:
jMWE 1.0.0
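
The underscore test can be sketched on a plain lemma string. MweLemmaCheckSketch and isMweLemma are hypothetical names; the real method takes a JWI IIndexWord, from which the lemma would first have to be obtained.

```java
class MweLemmaCheckSketch {

    // A lemma is treated as an MWE if it contains at least one underscore.
    static boolean isMweLemma(String lemma) {
        return lemma.indexOf('_') >= 0;
    }

    public static void main(String[] args) {
        System.out.println(isMweLemma("account_for")); // prints true
        System.out.println(isMweLemma("aberration"));  // prints false
    }
}
```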

countMarked

public void countMarked(List<IMWE<IConcordanceToken>> answers,
                        Map<IMWEDescID,IndexBuilder.MutableRootMWEDesc> index)
Counts instances of marked MWEs.

Parameters:
answers - the list of answers for a sentence, may not be null
index - the index map, may not be null
Throws:
NullPointerException - if either argument is null
Since:
jMWE 1.0.0

getInflectedForm

protected IndexBuilder.MutableInfMWEDesc getInflectedForm(IndexBuilder.MutableRootMWEDesc root,
                                                          String form)
Returns an inflected form that matches the specified surface form, attached to the root description. If no such inflected form exists, one is created.

Parameters:
root - the root form on which the inflected form is to be created
form - the inflected form
Returns:
a MWE description object corresponding to the inflected form
Since:
jMWE 1.0.0
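
The get-or-create behavior described for this method can be sketched with Map.computeIfAbsent. The Root and Inflected classes below are hypothetical stand-ins for IndexBuilder.MutableRootMWEDesc and IndexBuilder.MutableInfMWEDesc, whose actual fields and constructors are not documented on this page.

```java
import java.util.Map;
import java.util.TreeMap;

class InflectedFormSketch {

    // Hypothetical stand-in for an inflected-form description with a mutable count.
    static final class Inflected {
        final String form;
        int count;
        Inflected(String form) { this.form = form; }
    }

    // Hypothetical stand-in for a root description that owns its inflected forms.
    static final class Root {
        final String id;
        final Map<String, Inflected> forms = new TreeMap<>();
        Root(String id) { this.id = id; }
    }

    // Returns the existing inflected form, creating and attaching it if absent.
    static Inflected getInflectedForm(Root root, String form) {
        return root.forms.computeIfAbsent(form, Inflected::new);
    }

    public static void main(String[] args) {
        Root root = new Root("account_for_V");
        getInflectedForm(root, "accounted_for").count++;
        getInflectedForm(root, "accounted_for").count++;
        System.out.println(root.forms.get("accounted_for").count); // prints 2
    }
}
```

Because the same object is returned on every call for a given surface form, counts accumulate on a single description per form.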

isSplit

protected <T extends IConcordanceToken> boolean isSplit(IMWE<T> mwe)
Returns true if this MWE is not continuous - if it has interstitial tokens that are not a part of it; false otherwise.

Parameters:
mwe - the MWE to test; may not be null
Returns:
true if the MWE is split; false otherwise
Throws:
NullPointerException - if the specified MWE is null
Since:
jMWE 1.0.0
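
One way to picture the continuity test is on token positions within a sentence. The sketch below assumes the MWE's tokens are given as ascending sentence positions; SplitCheckSketch and its int[] input are hypothetical and do not reflect how IMWE exposes its tokens.

```java
class SplitCheckSketch {

    // An MWE is split if any two consecutive parts are not adjacent in the sentence.
    // Assumption: positions are sorted in ascending order.
    static boolean isSplit(int[] positions) {
        for (int i = 1; i < positions.length; i++) {
            if (positions[i] - positions[i - 1] != 1) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "look it up": look=0, up=2, with "it" in between
        System.out.println(isSplit(new int[] {0, 2})); // prints true
        // "account for": account=4, for=5
        System.out.println(isSplit(new int[] {4, 5})); // prints false
    }
}
```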

findMissingMWEs

public <T extends IToken> void findMissingMWEs(List<IMWE<T>> mwes,
                                               Map<IMWEDescID,IndexBuilder.MutableRootMWEDesc> index,
                                               Set<IndexBuilder.MutableRootMWEDesc> missing)
Finds MWEs that are marked in the specified list, but not in the index.

Type Parameters:
T - the token type
Parameters:
mwes - the marked MWEs to be checked against the index
index - the MWE index
missing - the set to which missing MWEs should be added
Since:
jMWE 1.0.0

getUmarkedDetector

protected IMWEDetector getUmarkedDetector(IMWEIndex index)
Creates a detector that can be used to find sequences of tokens (inflected or not) that match an MWE description, but are not marked as an MWE.

Parameters:
index - the index, may not be null
Returns:
the detector
Throws:
NullPointerException - if the specified index is null
Since:
jMWE 1.0.0

countUnmarked

public void countUnmarked(IMWEDetector detector,
                          IConcordanceSentence sent,
                          List<IMWE<IConcordanceToken>> answers)
Counts the number of MWEs that are detected by the specified detector, but not marked in the answer set as being MWEs.

Parameters:
detector - the detector to be used; may not be null
sent - the sentence in which MWEs should be detected
answers - the actual set of MWEs for the sentence
Throws:
NullPointerException - if any argument is null
Since:
jMWE 1.0.0
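
The comparison between detected and marked MWEs amounts to counting the detections that are absent from the answer list. The sketch below shows that idea on plain strings. UnmarkedCountSketch is a hypothetical name, and unlike the real method, which returns void, this version returns the count so that it can be inspected.

```java
import java.util.Arrays;
import java.util.List;

class UnmarkedCountSketch {

    // Counts detected items that do not appear in the answer list.
    static <T> int countUnmarked(List<T> detected, List<T> answers) {
        int unmarked = 0;
        for (T candidate : detected) {
            if (!answers.contains(candidate)) {
                unmarked++;
            }
        }
        return unmarked;
    }

    public static void main(String[] args) {
        // Strings of the form "lemma@start-end" stand in for MWE objects.
        List<String> detected = Arrays.asList("look_up@2-3", "account_for@5-6");
        List<String> answers = Arrays.asList("look_up@2-3");
        System.out.println(countUnmarked(detected, answers)); // prints 1
    }
}
```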

contains

protected boolean contains(List<IMWE<IConcordanceToken>> list,
                           IMWE<IConcordanceToken> mwe)
Returns whether the specified MWE is contained in the specified list.

Parameters:
list - the list to be searched
mwe - the MWE to look for
Returns:
true if the list contains the specified MWE; false otherwise
Since:
jMWE 1.0.0

printTotals

public void printTotals(Map<IMWEDescID,IndexBuilder.MutableRootMWEDesc> entries)
Sums all the counts of the MWEs in the given map and prints the totals.

Parameters:
entries - a map of description IDs to root descriptions whose counts will be summed and printed
Since:
jMWE 1.0.0

writeDataFile

public static void writeDataFile(IMWEIndex index,
                                 OutputStream out,
                                 Iterable<String> headerLines)
                          throws IOException
Writes the MWEIndex data to the specified output stream.

Parameters:
index - the MWE index whose data should be written
out - the output stream to which the data should be written
headerLines - comment lines that should be inserted at the beginning of the file. The lines may not contain linebreak (\n or \r) characters. Comment characters are not needed at the beginning of the lines; these are inserted by the method. This object may be null.
Throws:
IOException - if there is an error writing to the file
NullPointerException - if either of the first two arguments is null
Since:
jMWE 1.0.0
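
The header-line contract described for the headerLines parameter (null allowed, no linebreaks, comment marker added by the writer) can be sketched as follows. This page does not state which comment marker jMWE uses, so the "// " prefix is purely an assumption, as are the names HeaderLinesSketch, writeHeader and render and the choice to reject linebreaks with an IllegalArgumentException.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class HeaderLinesSketch {

    // Assumption: the real comment marker is not documented on this page.
    static final String COMMENT_PREFIX = "// ";

    // Writes each header line as a comment; a null header is silently skipped.
    static void writeHeader(OutputStream out, Iterable<String> headerLines) throws IOException {
        if (out == null) {
            throw new NullPointerException("out");
        }
        if (headerLines == null) {
            return;
        }
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        for (String line : headerLines) {
            if (line.indexOf('\n') >= 0 || line.indexOf('\r') >= 0) {
                throw new IllegalArgumentException("header line contains a linebreak");
            }
            writer.write(COMMENT_PREFIX);
            writer.write(line);
            writer.write('\n');
        }
        writer.flush();
    }

    // Convenience wrapper that renders the header to a String for inspection.
    static String render(Iterable<String> headerLines) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            writeHeader(buffer, headerLines);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.print(render(Arrays.asList("jMWE index data", "built from Semcor")));
    }
}
```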

writeIndexFile

public static void writeIndexFile(IMWEIndex index,
                                  OutputStream out,
                                  Iterable<String> headerLines)
                           throws IOException
Writes the MWEIndex index to the specified output stream.

Parameters:
index - the MWE index whose index should be written
out - the output stream to which the index should be written
headerLines - comment lines that should be inserted at the beginning of the file. The lines may not contain linebreak (\n or \r) characters. Comment characters are not needed at the beginning of the lines; these are inserted by the method. This object may be null.
Throws:
IOException - if there is an error writing to the file
NullPointerException - if either of the first two arguments is null

deleteFile

public static File deleteFile(File file,
                              IndexBuilder.FileGetter fg)
Gets a pointer to a file that does not exist. If the specified file does not exist, this is returned. Otherwise, the file is deleted. If that fails, the file getter is queried for an alternative file and this method is called again with that new file.

Parameters:
file - the file to be deleted
fg - the file getter that supplies an alternative file in case the specified file is not suitable
Returns:
the nonexistent file finally selected
Since:
jMWE 1.0.0
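
The retry logic described for this method can be sketched with java.io.File alone. The AlternativeFileSource interface and its next() method are hypothetical stand-ins for IndexBuilder.FileGetter, whose method signature is not shown on this page; ensureAbsent and newTempFile are likewise illustrative names.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

class DeleteFileSketch {

    // Hypothetical stand-in for IndexBuilder.FileGetter.
    interface AlternativeFileSource {
        File next();
    }

    // Returns a file that does not exist, deleting it or asking for another as needed.
    static File ensureAbsent(File file, AlternativeFileSource source) {
        if (!file.exists()) {
            return file;              // nothing to delete
        }
        if (file.delete()) {
            return file;              // deleted successfully
        }
        return ensureAbsent(source.next(), source); // try an alternative
    }

    // Creates a temporary file for demonstration, without a checked exception.
    static File newTempFile() {
        try {
            return File.createTempFile("mwe-sketch", ".tmp");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File existing = newTempFile();
        File result = ensureAbsent(existing, () -> {
            throw new IllegalStateException("no alternative should be needed");
        });
        System.out.println(result + " exists: " + result.exists());
    }
}
```

If the source keeps returning files that cannot be deleted, this recursion never terminates; a caller that prompts a user, as the file getter presumably does, would normally allow cancelling.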

toMWEPOS

public static MWEPOS toMWEPOS(edu.mit.jsemcor.element.ISemanticTag tag)
Translates the JSemcor ISemanticTag to a jMWE MWEPOS object.

Parameters:
tag - the semantic tag to be translated
Returns:
the equivalent MWEPOS object.
Since:
jMWE 1.0.0


Copyright © 2011 Massachusetts Institute of Technology. All Rights Reserved.