Diversity Measure for Text Summarization

About

The project's main aim is to create a summary of at most 665 bytes for a given set of documents.
The main inspiration for the work was the paper "A Class of Submodular Functions for Document Summarization" by Hui Lin and Jeff Bilmes.
Submodular functions were chosen for their inherent diminishing-returns property: adding a sentence to a smaller summary increases the function value at least as much as adding the same sentence to a larger summary that contains the smaller one.
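
The diminishing-returns property can be illustrated with a small sketch. The word-coverage function below is only a stand-in chosen because it is easy to verify by hand; it is not the objective used in the paper or in this project, and the names coverage and gain are hypothetical.

```python
def coverage(sentences):
    """Count the distinct words covered by a list of sentences.

    Word coverage is a simple submodular function, used here only
    as an illustration (assumption: not the project's objective).
    """
    words = set()
    for sentence in sentences:
        words.update(sentence.lower().split())
    return len(words)


def gain(summary, sentence):
    """Marginal gain f(A + v) - f(A) of adding `sentence` to `summary`."""
    return coverage(summary + [sentence]) - coverage(summary)


small = ["the cat sat"]                    # A
big = ["the cat sat", "a dog sat down"]    # B, which contains A
new = "the dog ran"                        # v

gain_small = gain(small, new)  # adds "dog" and "ran"
gain_big = gain(big, new)      # adds only "ran"; "dog" is already covered

print(gain_small, gain_big)
```

The same sentence contributes less once the summary already covers part of its content, which is why a submodular objective discourages redundant sentences.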

Challenges Faced

The first challenge was understanding the output of the CLUTO tool; once that was clear, finding the summary of the data folder was a simple implementation of the formula suggested in the paper.
The time taken to summarize the documents was quite large. It could be reduced, at the cost of accuracy, by not enforcing the submodularity condition, i.e. f(A+v) - f(A) >= f(B+v) - f(B) whenever A is a subset of B and v is not in B.

Data Set Used:

We use the standard DUC-2004 dataset, which is used to evaluate generic multi-document summarization.

Approaches

The first approach follows the method suggested in the paper by Jeff Bilmes.
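
As a rough illustration of this approach, the sketch below implements a diversity reward of the form R(S) = sum over clusters of sqrt(sum of the rewards of the selected sentences in that cluster), together with a greedy selection under a byte budget. It is a simplified sketch under stated assumptions, not the project's code: the cluster labels and per-sentence rewards are made-up inputs, the function names are hypothetical, and the gain-per-byte rule is a simplification of the cost-scaled greedy selection.

```python
import math


def diversity_reward(selected, clusters, rewards):
    """R(S): sum over clusters of sqrt(total reward selected from that cluster)."""
    per_cluster = {}
    for i in selected:
        per_cluster[clusters[i]] = per_cluster.get(clusters[i], 0.0) + rewards[i]
    return sum(math.sqrt(total) for total in per_cluster.values())


def greedy_summary(sentences, clusters, rewards, budget_bytes=665):
    """Repeatedly add the sentence with the best gain per byte that still fits."""
    selected, used = [], 0
    remaining = set(range(len(sentences)))
    while remaining:
        base = diversity_reward(selected, clusters, rewards)
        best, best_score = None, 0.0
        for i in sorted(remaining):
            cost = len(sentences[i].encode("utf-8"))
            if used + cost > budget_bytes:
                continue  # sentence does not fit in the remaining budget
            new_value = diversity_reward(selected + [i], clusters, rewards)
            score = (new_value - base) / max(cost, 1)
            if score > best_score:
                best, best_score = i, score
        if best is None:
            break  # nothing fits, or nothing improves the objective
        selected.append(best)
        used += len(sentences[best].encode("utf-8"))
        remaining.remove(best)
    return [sentences[i] for i in selected]


# Toy input (assumption): cluster labels and rewards would normally come
# from the clustering step and from sentence similarities.
sentences = ["Storm hits coast.", "Storm batters the coast.", "Aid arrives quickly."]
clusters = [0, 0, 1]
rewards = [0.9, 0.8, 0.5]

summary = greedy_summary(sentences, clusters, rewards, budget_bytes=40)
print(summary)
```

Because of the square root, a second sentence from an already represented cluster adds less than a first sentence from a new cluster, so the greedy step prefers the sentence about aid over the second storm sentence.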
In the second approach, we handled the problem a bit differently. Instead of K-means clustering, we used agglomerative clustering, since it seemed more intuitive that the inherent distribution/shape of the sentences should not be disturbed by partition-based clustering.
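
To make the idea of agglomerative clustering concrete, here is a minimal pure-Python sketch of bottom-up, average-linkage clustering over sentences. It is illustrative only: the Jaccard word-overlap distance, the average linkage, and the function names are assumptions made for this example, and this is not the clustering implementation the project relied on.

```python
def jaccard_distance(a, b):
    """1 - |words(a) & words(b)| / |words(a) | words(b)|."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(union)


def agglomerative(sentences, n_clusters):
    """Merge the two closest clusters until `n_clusters` remain.

    Returns one cluster label per sentence.
    """
    clusters = [[i] for i in range(len(sentences))]  # start with singletons
    while len(clusters) > max(n_clusters, 1):
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Average linkage: mean distance over all cross-cluster pairs.
                pairs = [(i, j) for i in clusters[x] for j in clusters[y]]
                dist = sum(
                    jaccard_distance(sentences[i], sentences[j]) for i, j in pairs
                ) / len(pairs)
                if best is None or dist < best[0]:
                    best = (dist, x, y)
        _, x, y = best
        clusters[x].extend(clusters[y])
        del clusters[y]
    labels = [0] * len(sentences)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


sentences = [
    "storm hits the coast",
    "storm batters the coast",
    "aid arrives by air",
    "aid arrives by sea",
]
labels = agglomerative(sentences, n_clusters=2)
print(labels)
```

Unlike K-means, nothing here assumes a centroid or a particular cluster shape; clusters grow only by merging the groups that are already closest to each other.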

How to Run the Code

Run python filename with all the dependencies installed and the uploaded folders in the same directory.

Tools Used:

Python libraries and dependencies: sklearn, PyStemmer, deepcopy (from the standard copy module), etc.
CLUTO for clustering the documents.
ROUGE to check the quality of the summaries.
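
For intuition about what ROUGE measures, the sketch below computes a simplified ROUGE-1 recall: the fraction of reference unigrams that also appear in the candidate summary, with clipped counts. This is an assumption-laden illustration, not the official ROUGE toolkit; it does no stemming or stopword handling, and the function name rouge1_recall is hypothetical.

```python
from collections import Counter


def rouge1_recall(candidate, reference):
    """Clipped unigram overlap divided by the number of reference unigrams."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    if not ref_counts:
        return 0.0
    # Each reference word is matched at most as often as it occurs in the candidate.
    overlap = sum(min(count, cand_counts[word]) for word, count in ref_counts.items())
    return overlap / sum(ref_counts.values())


score = rouge1_recall("the storm hit the coast", "the storm battered the coast")
print(score)
```

Here four of the five reference words are matched, so the recall is 0.8.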