java.lang.Object
  edu.mit.jmwe.util.AbstractFileSelector
    edu.mit.jmwe.index.IndexBuilder
public class IndexBuilder
Builds an MWE index that can be loaded into memory from Wordnet, using Semcor as the reference concordance for obtaining frequencies relating to an MWE's occurrence as marked, unmarked, etc.
This class requires JWI and JSemcor to be on the classpath.
Nested Class Summary | |
---|---|
static interface |
IndexBuilder.FileGetter
Wouldn't it be nice to have first-class functions in Java? |
static interface |
IndexBuilder.IMutableMWEDesc
|
static class |
IndexBuilder.MutableInfMWEDesc
|
static class |
IndexBuilder.MutableRootMWEDesc
A root MWE description object whose counts can be incremented. |
Constructor Summary | |
---|---|
IndexBuilder()
|
Method Summary | ||
---|---|---|
protected boolean |
contains(List<IMWE<IConcordanceToken>> list,
IMWE<IConcordanceToken> mwe)
Whether the specified MWE is contained in the specified list |
|
void |
countMarked(List<IMWE<IConcordanceToken>> answers,
Map<IMWEDescID,IndexBuilder.MutableRootMWEDesc> index)
Counts instances of marked MWEs |
|
void |
countUnmarked(IMWEDetector detector,
IConcordanceSentence sent,
List<IMWE<IConcordanceToken>> answers)
Counts the number of MWEs that are detected by the specified detector, but not marked in the answer set as being MWEs. |
|
static File |
deleteFile(File file,
IndexBuilder.FileGetter fg)
Gets a pointer to a file that does not exist. |
|
Map<IMWEDescID,IndexBuilder.MutableRootMWEDesc> |
extractMWEs(edu.mit.jwi.IDictionary dict)
Retrieves multi-word expressions from the specified IDictionary
object and returns them as a map. |
|
<T extends IToken> void |
findMissingMWEs(List<IMWE<T>> mwes,
Map<IMWEDescID,IndexBuilder.MutableRootMWEDesc> index,
Set<IndexBuilder.MutableRootMWEDesc> missing)
Finds MWEs that are marked in the specified list, but not in the index. |
|
protected edu.mit.jsemcor.main.IConcordanceSet |
getConcordance()
Gets the concordance set that will be used to interface with Semcor. |
|
protected File |
getConcordanceDir()
Returns the location of the concordance data; may be overridden by subclasses to provide a different manner of locating the concordance. |
|
protected File |
getDataFile()
Prompts the user to select the file to which the MWE descriptions will be written along with their counts relating to their occurrences in the reference concordance. |
|
protected List<String> |
getDataHeaderLines()
Returns the set of lines to be included as a header in the data file. |
|
protected edu.mit.jwi.IDictionary |
getDictionary()
Gets the IDictionary that will be used to interface with Wordnet. |
|
protected File |
getDictionaryDir()
Returns the location of the wordnet dictionary; may be overridden by subclasses to provide a different manner of locating the dictionary. |
|
protected int |
getEstimatedSentenceCount()
Returns the estimated number of sentences being used from the reference concordance (Semcor). |
|
protected File |
getIndexFile()
Prompts the user to select the file to which the index will be written. |
|
protected List<String> |
getIndexHeaderLines()
Returns the set of lines to be included as a header in the index file. |
|
protected IndexBuilder.MutableInfMWEDesc |
getInflectedForm(IndexBuilder.MutableRootMWEDesc root,
String form)
Returns an inflected form that matches the specified surface form, attached to the root description. |
|
protected File |
getTaggedConcordanceFile()
Returns the location of the tagged concordance file; may be overridden by subclasses to provide a different manner of locating the concordance. |
|
protected Iterable<IConcordanceSentence> |
getTaggedIterator()
Gets an iterator over the tagged semcor sentences. |
|
protected IMWEDetector |
getUmarkedDetector(IMWEIndex index)
Creates a detector that can be used to find sequences of tokens (inflected or not) that match an MWE description, but are not marked as an MWE. |
|
protected boolean |
isMWE(edu.mit.jwi.item.IIndexWord idxWord)
Returns true if the given word is an MWE. |
|
protected <T extends IConcordanceToken> boolean |
isSplit(IMWE<T> mwe)
Returns true if this MWE is not continuous - if it has
interstitial tokens that are not a part of it; false
otherwise. |
|
static void |
main(String[] args)
Constructs the MWE index from Wordnet and Semcor and writes it to a file. |
|
void |
printTotals(Map<IMWEDescID,IndexBuilder.MutableRootMWEDesc> entries)
Sums all the counts of the MWEs in the given map and prints the totals. |
|
void |
process(edu.mit.jwi.IDictionary dict,
Iterable<? extends IConcordanceSentence> itr,
edu.mit.jsemcor.main.IConcordanceSet cs,
File dataFile,
File indexFile)
Constructs the index in five steps: |
|
void |
run()
|
|
static MWEPOS |
toMWEPOS(edu.mit.jsemcor.element.ISemanticTag tag)
Translates the JSemcor ISemanticTag to a jMWE MWEPOS
object. |
|
static void |
writeDataFile(IMWEIndex index,
OutputStream out,
Iterable<String> headerLines)
Writes the MWEIndex data to the specified file. |
|
static void |
writeIndexFile(IMWEIndex index,
OutputStream out,
Iterable<String> headerLines)
Writes the MWEIndex index to the specified file. |
Methods inherited from class edu.mit.jmwe.util.AbstractFileSelector |
---|
choose, chooseDirectory, chooseFile, chooseFileForWriting, getFileChooser, getLocation, setLocation |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public IndexBuilder()
Method Detail |
---|
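public static void main(String[] args)
Constructs the MWE index from Wordnet and Semcor and writes it to a file.
Parameters:
args - standard main arguments; ignored
public void run()
Specified by:
run in interface Runnable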
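protected edu.mit.jwi.IDictionary getDictionary()
Gets the IDictionary that will be used to interface with Wordnet.
protected File getDictionaryDir()
Returns the location of the wordnet dictionary; may be overridden by subclasses to provide a different manner of locating the dictionary.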
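protected edu.mit.jsemcor.main.IConcordanceSet getConcordance()
Gets the concordance set that will be used to interface with Semcor.
Returns:
null if none
protected File getConcordanceDir()
Returns the location of the concordance data; may be overridden by subclasses to provide a different manner of locating the concordance.
Returns:
null if the concordance directory cannot be found.
protected Iterable<IConcordanceSentence> getTaggedIterator()
Gets an iterator over the tagged semcor sentences.
Returns:
null if the tagged concordance file cannot be found or if the user chooses to construct this index with no inflected forms or counts.
protected File getTaggedConcordanceFile()
Returns the location of the tagged concordance file; may be overridden by subclasses to provide a different manner of locating the concordance.
Returns:
null if none
protected File getDataFile()
Prompts the user to select the file to which the MWE descriptions will be written along with their counts relating to their occurrences in the reference concordance.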
account_for_V 14,0,0,0,0 accounted_for 5,0,0,0,5 accounting_for 1,0,0,0,0 accounts_for 2,0,0,0,1
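The sample record above can be read back with a few lines of code. The following is a minimal sketch, not the jMWE reader: it assumes a record is a whitespace-separated sequence of label and count pairs, with the root description first and its inflected forms after it, and it makes no claim about what each of the five comma-separated counts means. The class name DataLineSketch and its method are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; not part of jMWE.
final class DataLineSketch {

    // Parses alternating "label counts" tokens into an insertion-ordered map.
    // Assumption: every label is followed by one comma-separated count list.
    static Map<String, int[]> parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        if (tokens.length % 2 != 0) {
            throw new IllegalArgumentException("expected label/count pairs: " + line);
        }
        Map<String, int[]> result = new LinkedHashMap<>();
        for (int i = 0; i < tokens.length; i += 2) {
            String[] fields = tokens[i + 1].split(",");
            int[] counts = new int[fields.length];
            for (int j = 0; j < fields.length; j++) {
                counts[j] = Integer.parseInt(fields[j]);
            }
            result.put(tokens[i], counts);
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "account_for_V 14,0,0,0,0 accounted_for 5,0,0,0,5 "
                + "accounting_for 1,0,0,0,0 accounts_for 2,0,0,0,1";
        parse(sample).forEach((label, counts) ->
                System.out.println(label + " -> " + Arrays.toString(counts)));
    }
}
```

The first entry of the returned map is the root description (here account_for_V); the remaining entries are the surface forms listed after it.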
protected File getIndexFile()
Prompts the user to select the file to which the index will be written.
aberration chromatic_aberration_N chromosomal_aberration_N optical_aberration_N spherical_aberration_N
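The index sample above appears to list a single word followed by the MWE identifiers that contain it. The sketch below reads such a line under that assumption; it further assumes that the text after the last underscore of an identifier is a part-of-speech tag, which the sample suggests but this page does not state. IndexLineSketch and its methods are hypothetical names, not jMWE API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; not part of jMWE.
final class IndexLineSketch {

    // Returns the first token of the line (assumed to be the MWE part).
    static String key(String line) {
        return line.trim().split("\\s+")[0];
    }

    // Returns every token after the first (assumed to be MWE identifiers).
    static List<String> entries(String line) {
        String[] tokens = line.trim().split("\\s+");
        return new ArrayList<>(Arrays.asList(tokens).subList(1, tokens.length));
    }

    // Splits an identifier at its last underscore into { form, tag }.
    static String[] splitId(String id) {
        int cut = id.lastIndexOf('_');
        if (cut <= 0 || cut == id.length() - 1) {
            throw new IllegalArgumentException("no trailing tag in: " + id);
        }
        return new String[] { id.substring(0, cut), id.substring(cut + 1) };
    }

    public static void main(String[] args) {
        String sample = "aberration chromatic_aberration_N chromosomal_aberration_N "
                + "optical_aberration_N spherical_aberration_N";
        System.out.println(key(sample) + " -> " + entries(sample));
    }
}
```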
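protected List<String> getDataHeaderLines()
Returns the set of lines to be included as a header in the data file.
protected List<String> getIndexHeaderLines()
Returns the set of lines to be included as a header in the index file.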
public void process(edu.mit.jwi.IDictionary dict, Iterable<? extends IConcordanceSentence> itr, edu.mit.jsemcor.main.IConcordanceSet cs, File dataFile, File indexFile) throws IOException
Constructs the index in five steps:
1. Extracts the MWEs from the given dictionary.
2. Finds the MWEs in the concordance that are missing from the dictionary.
3. Counts the number of times each MWE was marked as a continuous run of tokens, as a non-continuous run, with a known inflection pattern, etc.
4. Records the counts for unmarked sequences of MWE parts.
5. Writes the index to the data and index files.
If the concordance set provided is null, skips steps 2-4.
Parameters:
dict - the dictionary containing the MWEs
itr - the iterator over the sentences in the reference concordance
cs - the possibly null reference concordance set
dataFile - the file to which the descriptions and counts will be written
indexFile - the file to which the index will be written
Throws:
IOException
protected int getEstimatedSentenceCount()
Returns the estimated number of sentences being used from the reference concordance (Semcor).
public Map<IMWEDescID,IndexBuilder.MutableRootMWEDesc> extractMWEs(edu.mit.jwi.IDictionary dict)
Retrieves multi-word expressions from the specified IDictionary object and returns them as a map. Multi-word expressions are indexed in the map according to their collocates. That is, the keys of the map are parts of multi-word expressions (String objects that do not contain an underscore), and the set associated with that key contains all multi-word expressions in the dictionary that contain that part.
This method returns a map whose keys and value sets are sorted in their natural order.
Parameters:
dict - a JWI IDictionary object
Returns:
Map with collocates as keys and a set of the multi-word expressions that they are a part of as values.
Throws:
NullPointerException - if the specified dictionary is null
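The collocate indexing described for extractMWEs can be illustrated with plain strings. This is a sketch of the described idea only: it uses String lemmas in place of JWI and jMWE types, it assumes that a lemma is a multi-word expression exactly when it contains an underscore, and it does not reproduce the declared return type, which is keyed by IMWEDescID. CollocateIndexSketch is a hypothetical name.

```java
import java.util.Arrays;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper; not part of jMWE.
final class CollocateIndexSketch {

    // Maps each underscore-free part to the sorted set of MWEs containing it.
    static SortedMap<String, SortedSet<String>> indexByPart(Iterable<String> lemmas) {
        if (lemmas == null) {
            throw new NullPointerException("lemmas");
        }
        // TreeMap and TreeSet keep keys and values in natural order.
        SortedMap<String, SortedSet<String>> index = new TreeMap<>();
        for (String lemma : lemmas) {
            if (lemma.indexOf('_') < 0) {
                continue; // assumption: no underscore means a single word
            }
            for (String part : lemma.split("_")) {
                if (!part.isEmpty()) {
                    index.computeIfAbsent(part, k -> new TreeSet<>()).add(lemma);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println(indexByPart(Arrays.asList(
                "spherical_aberration", "optical_aberration", "aberration")));
    }
}
```

Sorted collections are used here because the description promises natural ordering of both keys and value sets.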
protected boolean isMWE(edu.mit.jwi.item.IIndexWord idxWord)
Returns true if the given word is an MWE.
Parameters:
idxWord - the word to be checked.
public void countMarked(List<IMWE<IConcordanceToken>> answers, Map<IMWEDescID,IndexBuilder.MutableRootMWEDesc> index)
Counts instances of marked MWEs
Parameters:
answers - the list of answers for a sentence; may not be null
index - the index map; may not be null
Throws:
NullPointerException - if either argument is null
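protected IndexBuilder.MutableInfMWEDesc getInflectedForm(IndexBuilder.MutableRootMWEDesc root, String form)
Returns an inflected form that matches the specified surface form, attached to the root description.
Parameters:
root - the root form on which the inflected form is to be created
form - the inflected form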
protected <T extends IConcordanceToken> boolean isSplit(IMWE<T> mwe)
Returns true if this MWE is not continuous - if it has interstitial tokens that are not a part of it; false otherwise.
Parameters:
mwe - the MWE to test; may not be null
Returns:
true if the MWE is split; false otherwise
Throws:
NullPointerException - if the specified MWE is null
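The continuity test described for isSplit reduces to looking for a gap between consecutive token positions. The sketch below shows that check on a hypothetical representation, an ascending array of the sentence positions occupied by the tokens of one MWE; it does not use IMWE or IConcordanceToken and is not the jMWE implementation.

```java
// Hypothetical helper; not part of jMWE.
final class SplitCheckSketch {

    // Returns true if any two neighbouring positions are more than one apart,
    // meaning at least one interstitial token sits between parts of the MWE.
    // Assumption: positions are given in ascending order.
    static boolean isSplit(int[] positions) {
        if (positions == null) {
            throw new NullPointerException("positions");
        }
        for (int i = 1; i < positions.length; i++) {
            if (positions[i] - positions[i - 1] > 1) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isSplit(new int[] { 3, 4, 5 })); // contiguous run
        System.out.println(isSplit(new int[] { 3, 5 }));    // one token in between
    }
}
```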
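public <T extends IToken> void findMissingMWEs(List<IMWE<T>> mwes, Map<IMWEDescID,IndexBuilder.MutableRootMWEDesc> index, Set<IndexBuilder.MutableRootMWEDesc> missing)
Finds MWEs that are marked in the specified list, but not in the index.
Type Parameters:
T - the token type
Parameters:
mwes - the MWEs that may be unmarked
index - the MWE index
missing - the set to which missing MWEs should be added
protected IMWEDetector getUmarkedDetector(IMWEIndex index)
Creates a detector that can be used to find sequences of tokens (inflected or not) that match an MWE description, but are not marked as an MWE.
Parameters:
index - the index; may not be null
Throws:
NullPointerException - if the specified index is null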
public void countUnmarked(IMWEDetector detector, IConcordanceSentence sent, List<IMWE<IConcordanceToken>> answers)
Counts the number of MWEs that are detected by the specified detector, but not marked in the answer set as being MWEs.
Parameters:
detector - the detector to be used; may not be null
sent - the sentence in which MWEs should be detected
answers - the actual set of MWEs for the sentence
Throws:
NullPointerException - if any argument is null
protected boolean contains(List<IMWE<IConcordanceToken>> list, IMWE<IConcordanceToken> mwe)
Whether the specified MWE is contained in the specified list
Parameters:
list - the list to be searched
mwe - the MWE to look for
Returns:
true if the list contains the specified MWE; false otherwise
public void printTotals(Map<IMWEDescID,IndexBuilder.MutableRootMWEDesc> entries)
Sums all the counts of the MWEs in the given map and prints the totals.
Parameters:
entries - a map of description IDs to root descriptions whose counts will be summed and printed
public static void writeDataFile(IMWEIndex index, OutputStream out, Iterable<String> headerLines) throws IOException
Writes the MWEIndex data to the specified file.
Parameters:
index - the MWE index whose data should be written
out - the output stream to which the data should be written
headerLines - comment lines that should be inserted at the beginning of the file. The lines may not contain linebreak (\n or \r) characters. Comment characters are not needed at the beginning of the lines; these are inserted by the method. This object may be null.
Throws:
IOException - if there is an error writing to the file
NullPointerException - if either of the first two arguments is null
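The header contract described for writeDataFile (an optional set of lines, no linebreak characters, comment characters added by the writer) can be sketched as follows. This is an illustration under stated assumptions, not the jMWE code: the comment prefix "// " is a placeholder chosen for the example, since this page does not say which comment characters the files use, and HeaderWriterSketch is a hypothetical name.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper; not part of jMWE.
final class HeaderWriterSketch {

    // Assumption: placeholder prefix; the real comment characters are not stated.
    static final String COMMENT_PREFIX = "// ";

    // Writes each header line with the comment prefix; a null header is allowed.
    static void writeHeader(OutputStream out, Iterable<String> headerLines) throws IOException {
        if (out == null) {
            throw new NullPointerException("out");
        }
        if (headerLines == null) {
            return; // nothing to write
        }
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        for (String line : headerLines) {
            if (line.indexOf('\n') >= 0 || line.indexOf('\r') >= 0) {
                throw new IllegalArgumentException("header line contains a linebreak");
            }
            writer.write(COMMENT_PREFIX);
            writer.write(line);
            writer.write('\n');
        }
        writer.flush(); // flush, but leave the stream open for the caller
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeHeader(out, Arrays.asList("generated by a sketch", "not a real index"));
        System.out.print(new String(out.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

Leaving the stream open lets the caller continue writing the body of the file to the same stream after the header.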
public static void writeIndexFile(IMWEIndex index, OutputStream out, Iterable<String> headerLines) throws IOException
Writes the MWEIndex index to the specified file.
Parameters:
index - the MWE index whose index should be written
out - the output stream to which the index should be written
headerLines - comment lines that should be inserted at the beginning of the file. The lines may not contain linebreak (\n or \r) characters. Comment characters are not needed at the beginning of the lines; these are inserted by the method. This object may be null.
Throws:
IOException - if there is an error writing to the file
NullPointerException - if either of the first two arguments is null
public static File deleteFile(File file, IndexBuilder.FileGetter fg)
Gets a pointer to a file that does not exist.
Parameters:
file - the file to be deleted
fg - the file getter that supplies an alternative file in case the specified file is not suitable
public static MWEPOS toMWEPOS(edu.mit.jsemcor.element.ISemanticTag tag)
Translates the JSemcor ISemanticTag to a jMWE MWEPOS object.
Parameters:
tag - the semantic tag to be translated