# Diversity Measure for Text Summarization

## About

- Our project's main aim is to create a summary of at most 665 bytes for a given set of documents.
- The main inspiration for this work was the paper "A Class of Submodular Functions for Document Summarization" by Hui Lin and Jeff Bilmes.
- Submodular functions were chosen for their inherent diminishing-returns property: adding a sentence to a smaller summary increases the function value at least as much as adding it to a larger summary that contains the smaller one.
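The diminishing-returns property can be checked on a toy example. The sketch below uses word coverage (the number of distinct words in the selected sentences) as a stand-in submodular function; the function names and sentences are illustrative assumptions, not the project's actual objective.

```python
def coverage(sentences):
    """Number of distinct words covered by a list of sentences (a submodular function)."""
    words = set()
    for s in sentences:
        words.update(s.lower().split())
    return len(words)


def marginal_gain(f, subset, item):
    """Gain from adding `item` to `subset`: f(subset + item) - f(subset)."""
    return f(subset + [item]) - f(subset)


# `small` is contained in `large`, so the gain on `small` must be at least as big.
small = ["the cat sat on the mat"]
large = small + ["the dog sat on the rug"]
new_sentence = "a cat and a dog"

gain_small = marginal_gain(coverage, small, new_sentence)
gain_large = marginal_gain(coverage, large, new_sentence)
print(gain_small, gain_large)  # 3 2
```

The new sentence contributes three unseen words to the small summary but only two to the large one, because "dog" is already covered there.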

## Challenges Faced

- Understanding the output of the CLUTO tool was the main hurdle. Once that was clear, finding the summary of the data folder was a simple implementation of the formula suggested in the paper.
- The time taken to summarize the documents was quite large. It could be reduced by ignoring the submodularity condition, i.e. f(A+v) - f(A) >= f(B+v) - f(B) for A a subset of B and v not in B, at the cost of accuracy.
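As a rough illustration of where the running time goes, here is a minimal sketch of budgeted greedy selection with a cluster-based diversity reward (the sum over clusters of the square root of the selected sentences' rewards). The names `diversity_reward` and `greedy_summary`, the per-sentence `rewards`, the cluster ids, and the gain-per-byte ranking are all assumptions made for this example; the project's actual code may differ.

```python
import math


def diversity_reward(selected, clusters, rewards):
    """Sum over clusters of sqrt(total reward of the selected sentences in that cluster)."""
    per_cluster = {}
    for i in selected:
        per_cluster[clusters[i]] = per_cluster.get(clusters[i], 0.0) + rewards[i]
    return sum(math.sqrt(v) for v in per_cluster.values())


def greedy_summary(sentences, clusters, rewards, budget_bytes=665):
    """Repeatedly add the sentence with the best gain per byte that still fits the budget."""
    selected, used = [], 0
    remaining = set(range(len(sentences)))
    while remaining:
        current = diversity_reward(selected, clusters, rewards)
        best, best_ratio = None, 0.0
        for i in sorted(remaining):
            cost = len(sentences[i].encode("utf-8"))
            if cost == 0 or used + cost > budget_bytes:
                continue  # skip empty sentences and ones that would exceed the budget
            gain = diversity_reward(selected + [i], clusters, rewards) - current
            if gain / cost > best_ratio:
                best, best_ratio = i, gain / cost
        if best is None:
            break  # nothing fits or nothing improves the objective
        selected.append(best)
        used += len(sentences[best].encode("utf-8"))
        remaining.remove(best)
    return [sentences[i] for i in selected]


docs = ["Storm hits the coast.", "Storm strikes coastal towns.", "Markets fell on Monday."]
summary = greedy_summary(docs, clusters=[0, 0, 1], rewards=[0.9, 0.8, 0.5], budget_bytes=50)
print(summary)
```

Every round re-evaluates the objective for each remaining sentence, which is why the cost grows quickly with the number of sentences. With a 50-byte budget the sketch picks one sentence from each cluster, since the square root discounts a second sentence from an already represented cluster.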

## Data Set Used:

- We use the standard DUC-2004 dataset, which is used to test generic multi-document summarization.

## Approaches

- The first approach follows exactly what was suggested in the paper by Jeff Bilmes.
- In the second approach, we handled the problem a bit differently. Instead of K-means clustering, we used agglomerative clustering, on the intuition that partition-based clustering could disturb the inherent distribution/shape that the sentences follow.
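To make the bottom-up idea concrete, the following is a minimal standard-library sketch of average-linkage agglomerative clustering over sentences. It is not CLUTO and not the project's code: the Jaccard word-overlap similarity and the function names are assumptions chosen to keep the example self-contained.

```python
def jaccard(a, b):
    """Word-overlap similarity between two sentences, in [0, 1]."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def agglomerative(sentences, n_clusters):
    """Average-linkage agglomerative clustering; returns lists of sentence indices."""
    # Start with every sentence in its own cluster.
    clusters = [[i] for i in range(len(sentences))]

    def linkage(c1, c2):
        # Average pairwise similarity between two clusters.
        pairs = [(i, j) for i in c1 for j in c2]
        return sum(jaccard(sentences[i], sentences[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        # Find the most similar pair of clusters and merge it.
        a, b = max(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


sentences = [
    "storm hits the coast",
    "storm hits coastal towns",
    "markets fell on monday",
    "stock markets fell sharply",
]
groups = agglomerative(sentences, n_clusters=2)
print(groups)  # [[0, 1], [2, 3]]
```

Unlike K-means, nothing here assumes cluster centroids: clusters grow only by merging the closest existing groups, so the grouping follows whatever shape the pairwise similarities imply.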

## How To Run the Code

- Run python *filename* with all the dependencies installed and the uploaded folders in the same directory.

## Tools Used:

- Python libraries and dependencies: sklearn, PyStemmer, deepcopy (from the standard copy module), etc.
- CLUTO for clustering the documents.
- ROUGE to check the quality of the summaries.
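For intuition about what ROUGE measures, here is a simplified sketch of ROUGE-1 recall: the fraction of reference unigrams that also appear in the candidate summary, with counts clipped. It is not the official ROUGE toolkit (no stemming, stop-word handling, or multiple references), and the function name and example strings are assumptions.

```python
from collections import Counter


def rouge1_recall(candidate, reference):
    """Clipped unigram overlap divided by the number of unigrams in the reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0


score = rouge1_recall("the storm hit the coast", "a storm hit the coast on monday")
print(round(score, 3))  # 0.571
```

Four of the seven reference words ("storm", "hit", "the", "coast") are matched, giving a recall of 4/7.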