Diversity Measure for Text Summarization
About
- Our project's main aim is to create a summary of at most 665 bytes for a given set of documents.
- The main inspiration for this work was the paper "A Class of Submodular Functions for Document Summarization" by Hui Lin and Jeff Bilmes.
- Submodular functions were chosen for their inherent diminishing-returns property: adding a sentence to a smaller summary increases the function value at least as much as adding the same sentence to a larger summary that contains the smaller one.
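The diminishing-returns property can be checked numerically on a toy example. The sketch below assumes a diversity reward of the square-root-over-clusters form described in the paper; the function name `diversity_reward`, the cluster assignment, and the relevance scores are all hypothetical and are not taken from this project's code.

```python
import math

def diversity_reward(selected, clusters, relevance):
    """Sum, over clusters, of sqrt(total relevance of the selected sentences in that cluster)."""
    total = 0.0
    for members in clusters:
        total += math.sqrt(sum(relevance[j] for j in members if j in selected))
    return total

def gain(candidate, selected, clusters, relevance):
    """Marginal gain of adding `candidate` to the set `selected`."""
    before = diversity_reward(selected, clusters, relevance)
    after = diversity_reward(selected | {candidate}, clusters, relevance)
    return after - before

# Hypothetical toy data: sentences 0-2 share a cluster, sentence 3 is alone.
clusters = [{0, 1, 2}, {3}]
relevance = {0: 0.9, 1: 0.8, 2: 0.4, 3: 0.3}

small = {0}        # A
large = {0, 1}     # B, a superset of A

gain_small = gain(2, small, clusters, relevance)        # f(A+v) - f(A)
gain_large = gain(2, large, clusters, relevance)        # f(B+v) - f(B)
gain_new_cluster = gain(3, small, clusters, relevance)  # sentence from an unused cluster

print(gain_small >= gain_large)       # diminishing returns holds on this example
print(gain_new_cluster > gain_small)  # the unused cluster is rewarded more
```

On this data, sentence 3 has a lower relevance score than sentence 2, yet it yields the larger gain because its cluster is not yet represented in the summary. That is the sense in which the square root rewards diversity.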
Challenges Faced
- Understanding the output of the CLUTO tool. Once that was clear, producing the summary of the given documents from the clustering was a straightforward implementation of the formula suggested in the paper.
- Summarizing the documents took a long time. The running time could be reduced, at the cost of accuracy, by ignoring the submodularity condition, i.e. f(A+v) - f(A) >= f(B+v) - f(B) whenever A is a subset of B.
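To illustrate where the running time goes, here is a minimal sketch of greedy sentence selection under a byte budget. It is not this project's implementation: the function `greedy_summary`, the word-coverage objective, and the example sentences are assumptions made for illustration. Every round re-evaluates the objective for each remaining sentence, which is what makes the plain greedy loop slow on large document sets.

```python
def greedy_summary(sentences, objective, budget_bytes=665):
    """Repeatedly add the sentence with the largest positive gain that still fits the budget."""
    selected = []                      # indices of the chosen sentences, in pick order
    used = 0                           # bytes used so far
    remaining = set(range(len(sentences)))
    while remaining:
        current = objective(selected)
        best, best_gain = None, 0.0
        for i in sorted(remaining):    # sorted for deterministic tie-breaking
            cost = len(sentences[i].encode("utf-8"))
            if used + cost > budget_bytes:
                continue               # sentence does not fit in the remaining budget
            candidate_gain = objective(selected + [i]) - current
            if candidate_gain > best_gain:
                best, best_gain = i, candidate_gain
        if best is None:               # nothing fits or nothing improves the objective
            break
        selected.append(best)
        used += len(sentences[best].encode("utf-8"))
        remaining.discard(best)
    return [sentences[i] for i in selected]

# Hypothetical example data and a simple stand-in objective (distinct words covered).
sentences = [
    "the cat sat on the mat",
    "the cat sat",
    "dogs bark loudly at night",
]

def word_coverage(indices):
    return len({w for i in indices for w in sentences[i].split()})

summary = greedy_summary(sentences, word_coverage, budget_bytes=60)
print(summary)
```

The second sentence is never chosen here because it adds no new words once the first one is selected, so its marginal gain is zero.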
Data Set Used:
- We use the standard DUC-2004 dataset, which is commonly used to evaluate generic multi-document summarization.
Approaches
- The first approach follows the method suggested in the paper by Jeff Bilmes.
- In the second approach, we handled the problem a bit differently. Instead of K-means clustering, we used agglomerative clustering, on the intuition that the inherent distribution/shape of the sentences should not be disturbed by a partition-based method.
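The bottom-up merging behind agglomerative clustering can be sketched in a few lines of pure Python. This project used CLUTO for clustering, so the code below is only an illustration under assumptions: bag-of-words vectors, cosine similarity, and average linkage are choices made for this sketch, and the function names and example sentences are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(x * x for x in a.values())) * math.sqrt(sum(x * x for x in b.values()))
    return dot / norm if norm else 0.0

def agglomerative(sentences, n_clusters):
    """Start with one cluster per sentence and merge the most similar pair until n_clusters remain."""
    vectors = [Counter(s.lower().split()) for s in sentences]
    clusters = [[i] for i in range(len(sentences))]
    while len(clusters) > max(n_clusters, 1):
        best_pair, best_sim = None, -1.0
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean similarity over all cross-cluster sentence pairs.
                sims = [cosine(vectors[i], vectors[j]) for i in clusters[a] for j in clusters[b]]
                avg = sum(sims) / len(sims)
                if avg > best_sim:
                    best_pair, best_sim = (a, b), avg
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical example sentences on two topics.
sentences = [
    "stocks fell sharply today",
    "stocks fell again on friday",
    "the team won the match",
    "the team lost the final match",
]
result = agglomerative(sentences, n_clusters=2)
print(result)
```

Unlike K-means, no centroids are chosen up front; the grouping emerges from pairwise similarities alone, which matches the intuition stated above.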
How To run the code
- Run python filename with all the dependencies installed and the uploaded folders in the same directory as the script.
Tools Used:
- Python libraries and dependencies: sklearn, PyStemmer, copy (for deepcopy), etc.
- CLUTO for clustering of the documents.
- ROUGE to check the quality of the summaries.