public class IndexBuilder extends AbstractFileSelector implements java.lang.Runnable
This class requires JWI and JSemcor to be on the classpath.
Modifier and Type | Class and Description |
---|---|
static interface |
IndexBuilder.FileGetter
Wouldn't it be nice to have first-class functions in Java?
|
static interface |
IndexBuilder.IMutableMWEDesc |
static class |
IndexBuilder.MutableInfMWEDesc |
static class |
IndexBuilder.MutableRootMWEDesc
A root MWE description object whose counts can be incremented.
|
Constructor and Description |
---|
IndexBuilder() |
Modifier and Type | Method and Description |
---|---|
protected boolean |
contains(java.util.List<IMWE<IConcordanceToken>> list,
IMWE<IConcordanceToken> mwe)
Whether the specified MWE is contained in the specified list
|
void |
countMarked(java.util.List<IMWE<IConcordanceToken>> answers,
java.util.Map<IMWEDescID,IndexBuilder.MutableRootMWEDesc> index)
Counts instances of marked MWEs
|
void |
countUnmarked(IMWEDetector detector,
IConcordanceSentence sent,
java.util.List<IMWE<IConcordanceToken>> answers)
Counts the number of MWEs that are detected by the specified detector,
but not marked in the answer set as being MWEs.
|
static java.io.File |
deleteFile(java.io.File file,
IndexBuilder.FileGetter fg)
Gets a pointer to a file that does not exist.
|
java.util.Map<IMWEDescID,IndexBuilder.MutableRootMWEDesc> |
extractMWEs(edu.mit.jwi.IDictionary dict)
Retrieves multi-word expressions from the specified
IDictionary
object and returns them as a map. |
<T extends IToken> |
findMissingMWEs(java.util.List<IMWE<T>> mwes,
java.util.Map<IMWEDescID,IndexBuilder.MutableRootMWEDesc> index,
java.util.Set<IndexBuilder.MutableRootMWEDesc> missing)
Finds MWEs that are marked in the specified list, but not in the
index.
|
protected edu.mit.jsemcor.main.IConcordanceSet |
getConcordance()
Gets the concordance set that will be used to interface with Semcor.
|
protected java.io.File |
getConcordanceDir()
Returns the location of the concordance data; may be overridden by
subclasses to provide a different manner of locating the concordance.
|
protected java.io.File |
getDataFile()
Prompts the user to select the file to which the MWE descriptions will be
written along with their counts relating to their occurrences in the
reference concordance.
|
protected java.util.List<java.lang.String> |
getDataHeaderLines()
Returns the set of lines to be included as a header in the data file.
|
protected edu.mit.jwi.IDictionary |
getDictionary()
Gets the
IDictionary that will be used to interface with Wordnet. |
protected java.io.File |
getDictionaryDir()
Returns the location of the wordnet dictionary; may be overridden by
subclasses to provide a different manner of locating the dictionary.
|
protected int |
getEstimatedSentenceCount()
Returns the estimated number of sentences being used from the reference concordance (Semcor).
|
protected java.io.File |
getIndexFile()
Prompts the user to select the file to which the index will be written.
|
protected java.util.List<java.lang.String> |
getIndexHeaderLines()
Returns the set of lines to be included as a header in the index file.
|
protected IndexBuilder.MutableInfMWEDesc |
getInflectedForm(IndexBuilder.MutableRootMWEDesc root,
java.lang.String form)
Returns an inflected form that matches the specified surface form,
attached to the root description.
|
protected java.io.File |
getTaggedConcordanceFile()
Returns the location of the tagged concordance file; may be overridden by
subclasses to provide a different manner of locating the concordance.
|
protected java.lang.Iterable<IConcordanceSentence> |
getTaggedIterator()
Gets an iterator over the tagged semcor sentences.
|
protected IMWEDetector |
getUmarkedDetector(IMWEIndex index)
Creates a detector that can be used to find sequences of tokens
(inflected or not) that match an MWE description, but are not marked as
an MWE.
|
protected boolean |
isMWE(edu.mit.jwi.item.IIndexWord idxWord)
Returns true if the given word is an MWE.
|
protected <T extends IConcordanceToken> |
isSplit(IMWE<T> mwe)
Returns
true if this MWE is not continuous - if it has
interstitial tokens that are not a part of it; false
otherwise. |
static void |
main(java.lang.String[] args)
Constructs the MWE index from Wordnet and Semcor and writes it to a file.
|
void |
printTotals(java.util.Map<IMWEDescID,IndexBuilder.MutableRootMWEDesc> entries)
Sums all the counts of the MWEs in the given map and prints the totals.
|
void |
process(edu.mit.jwi.IDictionary dict,
java.lang.Iterable<? extends IConcordanceSentence> itr,
edu.mit.jsemcor.main.IConcordanceSet cs,
java.io.File dataFile,
java.io.File indexFile)
Constructs the index in five steps:
|
void |
run() |
static MWEPOS |
toMWEPOS(edu.mit.jsemcor.element.ISemanticTag tag)
Translates the JSemcor
ISemanticTag to a jMWE MWEPOS
object. |
static void |
writeDataFile(IMWEIndex index,
java.io.OutputStream out,
java.lang.Iterable<java.lang.String> headerLines)
Writes the MWEIndex data to the specified file.
|
static void |
writeIndexFile(IMWEIndex index,
java.io.OutputStream out,
java.lang.Iterable<java.lang.String> headerLines)
Writes the MWEIndex index to the specified file.
|
choose, chooseDirectory, chooseFile, chooseFileForWriting, getFileChooser, getLocation, setLocation
public static void main(java.lang.String[] args)
args - standard main arguments; ignored
public void run()
run in interface java.lang.Runnable
protected edu.mit.jwi.IDictionary getDictionary()
the IDictionary that will be used to interface with Wordnet.
protected java.io.File getDictionaryDir()
protected edu.mit.jsemcor.main.IConcordanceSet getConcordance()
null if none
protected java.io.File getConcordanceDir()
null if the concordance directory cannot be found.
protected java.lang.Iterable<IConcordanceSentence> getTaggedIterator()
null if the tagged concordance file cannot be found or if the user chooses to construct this index with no inflected forms or counts.
protected java.io.File getTaggedConcordanceFile()
null if none
protected java.io.File getDataFile()
account_for_V 14,0,0,0,0 accounted_for 5,0,0,0,5 accounting_for 1,0,0,0,0 accounts_for 2,0,0,0,1
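The sample line above suggests a layout of whitespace-separated pairs, each an identifier followed by five comma-separated counts. A minimal sketch of reading such a line follows, assuming that layout holds for every line. `DataLineSketch` and `Entry` are hypothetical names, not part of jMWE, and the meaning of the individual count fields is not stated here, so the sketch keeps them as an uninterpreted array.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical reader for one line of the data file; not part of jMWE. */
final class DataLineSketch {

    /** One identifier together with the comma-separated counts that follow it. */
    static final class Entry {
        final String id;
        final int[] counts; // field meanings are not documented here; kept as raw numbers

        Entry(String id, int[] counts) {
            this.id = id;
            this.counts = counts;
        }
    }

    /** Splits a line into (identifier, counts) pairs; assumes whitespace-separated fields. */
    static List<Entry> parse(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length % 2 != 0) {
            throw new IllegalArgumentException("expected identifier/count pairs: " + line);
        }
        List<Entry> entries = new ArrayList<>();
        for (int i = 0; i < fields.length; i += 2) {
            int[] counts = Arrays.stream(fields[i + 1].split(","))
                                 .mapToInt(Integer::parseInt)
                                 .toArray();
            entries.add(new Entry(fields[i], counts));
        }
        return entries;
    }

    public static void main(String[] args) {
        String line = "account_for_V 14,0,0,0,0 accounted_for 5,0,0,0,5 "
                    + "accounting_for 1,0,0,0,0 accounts_for 2,0,0,0,1";
        for (Entry e : parse(line)) {
            System.out.println(e.id + " -> " + Arrays.toString(e.counts));
        }
    }
}
```

In the sample, the first pair appears to be the root description and the remaining pairs its inflected forms; the sketch does not rely on that reading.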
protected java.io.File getIndexFile()
aberration chromatic_aberration_N chromosomal_aberration_N optical_aberration_N spherical_aberration_N
protected java.util.List<java.lang.String> getDataHeaderLines()
protected java.util.List<java.lang.String> getIndexHeaderLines()
public void process(edu.mit.jwi.IDictionary dict, java.lang.Iterable<? extends IConcordanceSentence> itr, edu.mit.jsemcor.main.IConcordanceSet cs, java.io.File dataFile, java.io.File indexFile) throws java.io.IOException
1. Extracts the MWEs from the given dictionary
2. Finds the MWEs in the concordance that are missing from the dictionary.
3. Counts the number of times this MWE was marked as a continuous run of tokens, non-continuous run, appeared with a known inflection pattern, etc.
4. Records the counts for unmarked sequences of MWE parts
5. Writes the index to the data and index files
If the concordance set provided is null, skips steps 2-4.
dict - the dictionary containing the MWEs
itr - the iterator over the sentences in the reference concordance
cs - the possibly null reference concordance set.
dataFile - the file to which the descriptions and counts will be written
indexFile - the file to which the index will be written
java.io.IOException - if there is a problem when accessing the specified files or dictionaries
protected int getEstimatedSentenceCount()
public java.util.Map<IMWEDescID,IndexBuilder.MutableRootMWEDesc> extractMWEs(edu.mit.jwi.IDictionary dict)
Retrieves multi-word expressions from the specified IDictionary
object and returns them as a map. Multi-word expressions are indexed in
the map according to their collocates. That is, the keys of the map are
parts of multi-word expressions (String objects that do not
contain an underscore), and the set associated with that key contains all
multi-word expressions in the dictionary that contain that part.
This method returns a map whose keys and value sets are sorted in their
natural order.
dict - a JWI IDictionary object
Map with collocates as keys and a set of the multi-word expressions that they are a part of as values.
java.lang.NullPointerException - if the specified dictionary is null
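The collocate indexing described for this method can be illustrated with plain strings. The sketch below is not the jMWE implementation: it assumes that an MWE is any lemma containing an underscore and that its parts are the underscore-separated pieces, and it uses `String` keys and values rather than the `IMWEDescID` and `MutableRootMWEDesc` types in the signature. `CollocateIndexSketch` and `indexByPart` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical illustration of indexing MWEs by their parts; not part of jMWE. */
final class CollocateIndexSketch {

    /** Maps each part to the sorted set of MWEs containing it; keys are sorted too. */
    static SortedMap<String, SortedSet<String>> indexByPart(Iterable<String> lemmas) {
        SortedMap<String, SortedSet<String>> index = new TreeMap<>();
        for (String lemma : lemmas) {
            if (!lemma.contains("_")) {
                continue; // assumption: a lemma without an underscore is a single word
            }
            for (String part : lemma.split("_")) {
                if (part.isEmpty()) {
                    continue; // skip empty pieces produced by doubled underscores
                }
                index.computeIfAbsent(part, k -> new TreeSet<>()).add(lemma);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> lemmas = Arrays.asList(
                "spherical_aberration", "aberration", "chromatic_aberration", "account_for");
        indexByPart(lemmas).forEach((part, mwes) -> System.out.println(part + " " + mwes));
    }
}
```

`TreeMap` and `TreeSet` give the natural-order sorting of keys and value sets that the description calls for.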
protected boolean isMWE(edu.mit.jwi.item.IIndexWord idxWord)
idxWord - the word to be checked.
public void countMarked(java.util.List<IMWE<IConcordanceToken>> answers, java.util.Map<IMWEDescID,IndexBuilder.MutableRootMWEDesc> index)
answers - the list of answers for a sentence, may not be null
index - the index map, may not be null
java.lang.NullPointerException - if either argument is null
protected IndexBuilder.MutableInfMWEDesc getInflectedForm(IndexBuilder.MutableRootMWEDesc root, java.lang.String form)
root - the root form on which the inflected form is to be created
form - the inflected form
protected <T extends IConcordanceToken> boolean isSplit(IMWE<T> mwe)
Returns true if this MWE is not continuous - if it has interstitial tokens that are not a part of it; false otherwise.
T - the type of token used by the mwe
mwe - the MWE to test; may not be null
true if the MWE is split; false otherwise
java.lang.NullPointerException - if the specified MWE is null
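The continuity test described for isSplit can be sketched on token positions alone. This is an illustration rather than the library's code: it assumes each token of the MWE is represented by its offset in the sentence and that the offsets are given in ascending order. `SplitCheckSketch` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical continuity check on sentence offsets; not part of jMWE. */
final class SplitCheckSketch {

    /** Returns true if any two neighbouring offsets leave a gap between them. */
    static boolean isSplit(List<Integer> positions) {
        if (positions == null) {
            throw new NullPointerException("positions");
        }
        for (int i = 1; i < positions.size(); i++) {
            if (positions.get(i) - positions.get(i - 1) > 1) {
                return true; // at least one interstitial token lies in between
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isSplit(Arrays.asList(2, 3, 4))); // contiguous run of tokens
        System.out.println(isSplit(Arrays.asList(2, 4)));    // token 3 is interstitial
    }
}
```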
public <T extends IToken> void findMissingMWEs(java.util.List<IMWE<T>> mwes, java.util.Map<IMWEDescID,IndexBuilder.MutableRootMWEDesc> index, java.util.Set<IndexBuilder.MutableRootMWEDesc> missing)
T - the token type
mwes - the MWEs that may be unmarked
index - the MWE index
missing - the set to which missing MWEs should be added
protected IMWEDetector getUmarkedDetector(IMWEIndex index)
index - the index, may not be null
java.lang.NullPointerException - if the specified index is null
public void countUnmarked(IMWEDetector detector, IConcordanceSentence sent, java.util.List<IMWE<IConcordanceToken>> answers)
detector - the detector to be used; may not be null
sent - the sentence in which MWEs should be detected
answers - the actual set of MWEs for the sentence
java.lang.NullPointerException - if any argument is null
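The idea behind countUnmarked, counting detections that are absent from the answer set, can be shown with a generic helper. The sketch below is a simplification under stated assumptions: it returns the count, whereas the documented method returns void and presumably records its counts elsewhere, and it compares items with `equals` instead of whatever comparison jMWE applies to MWEs. `UnmarkedCountSketch` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

/** Hypothetical "detected but not marked" counter; not part of jMWE. */
final class UnmarkedCountSketch {

    /** Counts the detected items that do not appear in the answer list. */
    static <T> int countUnmarked(List<T> detected, List<T> answers) {
        Objects.requireNonNull(detected, "detected");
        Objects.requireNonNull(answers, "answers");
        int unmarked = 0;
        for (T candidate : detected) {
            if (!answers.contains(candidate)) { // linear search using equals()
                unmarked++;
            }
        }
        return unmarked;
    }

    public static void main(String[] args) {
        List<String> detected = Arrays.asList("account_for", "look_up");
        List<String> answers = Arrays.asList("look_up");
        System.out.println(countUnmarked(detected, answers));
    }
}
```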
protected boolean contains(java.util.List<IMWE<IConcordanceToken>> list, IMWE<IConcordanceToken> mwe)
list - the list to be searched
mwe - the MWE to look for
true if the list contains the specified MWE; false otherwise
public void printTotals(java.util.Map<IMWEDescID,IndexBuilder.MutableRootMWEDesc> entries)
entries - a map of description IDs to root descriptions whose counts will be summed and printed
public static void writeDataFile(IMWEIndex index, java.io.OutputStream out, java.lang.Iterable<java.lang.String> headerLines) throws java.io.IOException
index - the MWE index whose data should be written
out - the output stream to which the data should be written
headerLines - comment lines that should be inserted at the beginning of the
file. The lines may not contain linebreak (\n or \r) characters. Comment
characters are not needed at the beginning of the lines; these
are inserted by the method. This object may be null.
java.io.IOException - if there is an error writing to the file
java.lang.NullPointerException - if either of the first two arguments is null
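The headerLines contract (no line breaks, comment characters added by the method, null allowed) might be honored along the following lines. This is a sketch, not the library's code: the comment prefix `// ` is an assumption, since the actual comment characters are not stated here, and rejecting a line that contains a line break is a choice made for the sketch. `HeaderWriterSketch`, `formatHeader` and `writeHeader` are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical header writer; not part of jMWE. */
final class HeaderWriterSketch {

    /** Assumed comment prefix; the real file format may use a different one. */
    static final String COMMENT_PREFIX = "// ";

    /** Renders the header lines as comment lines; a null argument yields an empty header. */
    static String formatHeader(Iterable<String> headerLines) {
        StringBuilder sb = new StringBuilder();
        if (headerLines == null) {
            return sb.toString();
        }
        for (String line : headerLines) {
            if (line.indexOf('\n') >= 0 || line.indexOf('\r') >= 0) {
                throw new IllegalArgumentException("header line contains a line break");
            }
            sb.append(COMMENT_PREFIX).append(line).append('\n');
        }
        return sb.toString();
    }

    /** Writes the rendered header to the stream and leaves the stream open for the caller. */
    static void writeHeader(OutputStream out, Iterable<String> headerLines) throws IOException {
        if (out == null) {
            throw new NullPointerException("out");
        }
        out.write(formatHeader(headerLines).getBytes(StandardCharsets.UTF_8));
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        writeHeader(buffer, Arrays.asList("first header line", "second header line"));
        System.out.print(new String(buffer.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

Keeping the formatting in a pure function separates the line-break validation from the I/O, so only the stream write can raise an IOException.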
public static void writeIndexFile(IMWEIndex index, java.io.OutputStream out, java.lang.Iterable<java.lang.String> headerLines) throws java.io.IOException
index - the MWE index whose index should be written
out - the output stream to which the index should be written
headerLines - comment lines that should be inserted at the beginning of the
file. The lines may not contain linebreak (\n or \r) characters. Comment
characters are not needed at the beginning of the lines; these
are inserted by the method. This object may be null.
java.io.IOException - if there is an error writing to the file
java.lang.NullPointerException - if either of the first two arguments is null
public static java.io.File deleteFile(java.io.File file, IndexBuilder.FileGetter fg)
file - the file to be deleted
fg - the file getter that supplies an alternative file in case the specified file is not suitable
Copyright © 2011 Massachusetts Institute of Technology. All Rights Reserved.