
\documentclass[conference]{IEEEtran}

% *** CITATION PACKAGES ***
%
\usepackage{cite}
\usepackage{verbatimbox}

% *** GRAPHICS RELATED PACKAGES ***
%
\ifCLASSINFOpdf
\usepackage[pdftex]{graphicx}
\else
\fi

\begin{document}

\title{Measuring Similarity Between Sentences}

\author{\IEEEauthorblockN{Parag Pravin Dakle}
\IEEEauthorblockA{Department of Computer Science\\
University of Texas at Dallas\\
Richardson, TX\\
Email: [email protected]}
\and
\IEEEauthorblockN{Dr. Dan Moldovan}
\IEEEauthorblockA{Department of Computer Science\\
University of Texas at Dallas\\
Richardson, TX\\
Email: [email protected]}}

\maketitle

\begin{abstract}
The use of Q\&A websites has risen in recent years. Considering the uniqueness of the domain, this project measures sentence (question) similarity on a dataset of questions from Quora. We analyze how effective syntactic and semantic features generated from the dataset are when used with SVM and Logistic Regression classifiers. The project proposes a new syntactic feature called Lemma Jaccard Distance; the semantic features are evaluated in a way similar to \cite{han2013umbc_ebiquity}.
\end{abstract}
\IEEEpeerreviewmaketitle

\section{Introduction}
The use of the internet for expressing views in short pieces of text has increased significantly. Using natural language processing techniques, we can analyze such short texts on a large scale and apply the analysis in various key fields. Among these techniques, measuring text similarity has long been a research subject not only in natural language processing but also in machine learning and artificial intelligence \cite{han2013umbc_ebiquity}. It has found applications in text classification, word sense disambiguation, summarization, duplicate text detection and automatic evaluation of machine translation.

One of the earliest works on text similarity used the vectorial model in information retrieval \cite{salton_lesk}. Most work on text similarity uses one of three basic feature types: syntactic, semantic or a mixture of both. Among the (limited) literature covered here, \cite{han2013umbc_ebiquity} and \cite{microsoft_paper} explore semantic features to a great extent, whereas \cite{paper_2} explores a mixture of syntactic and semantic features.

In this project, a hybrid feature-based approach is used. Syntactic features are taken from \cite{paper_2} and semantic features from \cite{han2013umbc_ebiquity}. In addition, we propose a new syntactic feature, Lemma Jaccard Distance, and add word sense disambiguation to the semantic feature generation procedure.

The rest of this report is structured as follows. Section 2 gives background information about Quora, WordNet and various metrics for measuring vector similarity and degree of synonymy. Section 3 elaborates on the approach followed in the project. Section 4 describes the implementation details and the components of the system. Section 5 reports and discusses the results of the approach. Sections 6 and 7 present conclusions and future work.

\section{Background}
This section discusses some literature related to Quora, WordNet, and word and vector similarity.

\subsection{Quora}
Quora is a question-and-answer site where questions are asked, answered, edited and organized by its community of users \cite{quora_wiki}. As per \cite{quora_ceo}, Quora claimed to have 200 million monthly unique visitors. Although there are no exact statistics for the number of questions asked on Quora, the high number of unique visitors makes measuring similarity between its questions an interesting problem. What makes this dataset more challenging is that the questions are not always syntactically correct. Since a question does not undergo any syntactic check when it is posted, many traditional NLP techniques are not directly applicable or have to be modified to suit the problem.
\subsection{WordNet}
WordNet is a project developed by Princeton University \cite{about_wordnet} and is one of the most widely used programmable databases among researchers in Natural Language Processing. The fundamental unit of WordNet is the synset, which can be thought of as a mechanism for grouping semantically related words together. Using synsets, WordNet represents many other relations such as hyponymy and hypernymy, and it is therefore used in many natural language tasks such as language generation, summarization and machine translation. In this project, WordNet is used for word sense disambiguation and for calculating the similarity between two synsets using one of the distance metrics available in WordNet.
\subsection{Vector Similarity}
Representing words as vectors and using the vector representation to measure the distance between two words or clauses is a very common practice in information retrieval. In this project, some of these metrics are used for calculating syntactic similarity between two questions \cite{vec_sim_ref_1}. The basic idea is to view a sentence as a bag of words \cite{textbook} and then use a distance metric to measure the similarity between the two sentences.
Let the vectors to be compared be $A$ and $B$, each of length $L$. The two vector measures used in this project are:
\begin{itemize}
\item Jaccard Distance:
\begin{equation}
d = \frac{\sum_{i=1}^{L}(A_i - B_i)^2}{\sum_{i=1}^{L}A_i^2 + \sum_{i=1}^{L}B_i^2 - \sum_{i=1}^{L}A_iB_i}
\end{equation}

\item Cosine Similarity:
\begin{equation}
d = \frac{\sum_{i=1}^{L}A_iB_i}{\|A\|_2 \, \|B\|_2}
\end{equation}

\end{itemize}
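
As an illustration only, the two measures can be sketched in Python on bag-of-words count vectors. The sketch assumes the usual forms of the measures: the Jaccard distance divides the squared difference by the sum of the squared norms minus the dot product, and the cosine similarity divides the dot product by the product of the two norms. The function names and the toy token lists are hypothetical and are not taken from the project code.

```python
import math
from collections import Counter


def to_vectors(tokens_a, tokens_b):
    """Map two token lists onto aligned count vectors over their joint vocabulary."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vocab = sorted(set(counts_a) | set(counts_b))
    return [counts_a[w] for w in vocab], [counts_b[w] for w in vocab]


def jaccard_distance(a, b):
    """Squared difference divided by (|A|^2 + |B|^2 - A.B)."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    if denom == 0:  # both vectors are all zeros
        return 0.0
    return sum((x - y) ** 2 for x, y in zip(a, b)) / denom


def cosine_similarity(a, b):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical, already tokenized and stop-word-filtered questions.
vec_a, vec_b = to_vectors(["learn", "python", "fast"],
                          ["learn", "python", "quickly"])
print(jaccard_distance(vec_a, vec_b))   # 2 / 4 = 0.5
print(cosine_similarity(vec_a, vec_b))  # 2 / 3, about 0.667
```

For binary vectors the Jaccard distance above reduces to the size of the symmetric difference of the two word sets divided by the size of their union.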
\subsection{Degree of Synonymy}
Many metrics have been developed over time for measuring semantic similarity between two words \cite{textbook}. This project uses metrics based on taxonomy trees and the distances between words in those trees. The metrics, as described in \cite{wordnet_doc}, are as follows:
\begin{itemize}
\item Path similarity (PS): Returns a score denoting how similar two word concepts are, based on the shortest path that connects the concepts in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1.
\item Leacock-Chodorow Similarity (LCS): Returns a score denoting how similar two concepts are, based on the shortest path that connects the concepts and the maximum depth of the taxonomy in which the concepts occur. With $p$ the shortest path length and $d$ the taxonomy depth, the score is
\begin{equation}
S_{lcs} = -\log\left(\frac{p}{2d}\right)
\end{equation}
\item Wu-Palmer Similarity (WPS): Returns a score denoting how similar two concepts are, based on the depth of the two concepts in the taxonomy and that of their Least Common Subsumer (LCS), i.e. their most specific common ancestor node.
\begin{equation}
S_{wup} = \frac{2 \cdot depth(LCS)}{depth(concept_1) + depth(concept_2)}
\end{equation}
\end{itemize}
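
To make the three metrics concrete, the following Python sketch evaluates them on a small hand-made is-a tree. The tree, the function names and the depth convention (the root has depth 1) are assumptions made for illustration, and WordNet itself is not used. Path similarity is taken as $1/(1+e)$ with $e$ the number of edges on the shortest path, and $p$ in the Leacock-Chodorow score is counted in nodes ($e+1$) so that the logarithm stays defined for identical concepts; both are conventions assumed here, not stated in the text above.

```python
import math

# Hypothetical toy is-a taxonomy (child -> parent); not taken from WordNet.
PARENT = {
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "puppy": "dog",
}


def ancestors(node):
    """Return the chain from the node up to the root, node first."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def depth(node):
    """Depth counted in nodes, so the root has depth 1 (an assumption)."""
    return len(ancestors(node))


def least_common_subsumer(a, b):
    """Most specific node that is an ancestor of (or equal to) both concepts."""
    ancestors_b = set(ancestors(b))
    for node in ancestors(a):
        if node in ancestors_b:
            return node
    raise ValueError("concepts do not share a root")


def path_edges(a, b):
    """Number of edges on the shortest path between two concepts."""
    lcs_depth = depth(least_common_subsumer(a, b))
    return (depth(a) - lcs_depth) + (depth(b) - lcs_depth)


def path_similarity(a, b):
    return 1.0 / (1 + path_edges(a, b))


def lch_similarity(a, b, max_depth):
    p = path_edges(a, b) + 1  # path length counted in nodes
    return -math.log(p / (2.0 * max_depth))


def wup_similarity(a, b):
    lcs = least_common_subsumer(a, b)
    return 2.0 * depth(lcs) / (depth(a) + depth(b))


MAX_DEPTH = max(depth(n) for n in list(PARENT) + ["entity"])  # 4 in this tree
print(path_similarity("dog", "cat"))            # 1 / (1 + 2)
print(lch_similarity("dog", "cat", MAX_DEPTH))  # -log(3 / 8)
print(wup_similarity("dog", "cat"))             # 2 * 2 / (3 + 3)
```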
\section{Approach}
Building on the previous work of \cite{han2013umbc_ebiquity} and \cite{paper_2}, a subset of the syntactic and semantic features described in them is first evaluated. The overall approach for the project is described in Figure 1.
\begin{center}
\includegraphics[scale=0.45]{Figure1}
Figure 1
\end{center}

For generating syntactic features, we use the vector formulation and the bag-of-words model discussed previously. The following syntactic features are used in the project:
\begin{itemize}
\item Cosine Similarity (CS): We measure the cosine similarity of the two questions after tokenization and stop-word filtering, using the equation for cosine similarity given in Section 2.
\item Normal Jaccard Distance (NJD): This distance is computed using the equation given in Section 2, after tokenization and stop-word filtering.
\item Lemma Jaccard Distance (LJD): In addition to the normal Jaccard distance, a second one is computed after lemmatization. We propose this metric because, in our opinion, it helps capture the similarity of the lemmas or concepts mentioned in the questions rather than the specific words used.
\end{itemize}
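
The difference between NJD and LJD can be illustrated with a minimal Python sketch. It uses the set form of the Jaccard distance (one minus intersection over union), a tiny hypothetical stop-word list, and a toy lookup table standing in for a real lemmatizer; none of these names or word lists come from the project itself.

```python
import re

# Hypothetical stop-word list and lemma table, for illustration only.
STOP_WORDS = {"how", "do", "i", "is", "a", "the", "what", "to"}
LEMMAS = {"languages": "language", "learned": "learn", "learning": "learn"}


def tokenize(question):
    """Lowercase, split on non-alphanumerics and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def jaccard_distance(tokens_a, tokens_b):
    """Set form of the Jaccard distance: 1 - |A & B| / |A | B|."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def normal_jaccard_distance(q1, q2):
    return jaccard_distance(tokenize(q1), tokenize(q2))


def lemma_jaccard_distance(q1, q2):
    """Same distance, computed on lemmas instead of surface forms."""
    lemmatize = lambda tokens: [LEMMAS.get(t, t) for t in tokens]
    return jaccard_distance(lemmatize(tokenize(q1)), lemmatize(tokenize(q2)))


Q1 = "How do I learn languages quickly?"
Q2 = "How is a language learned quickly?"
print(normal_jaccard_distance(Q1, Q2))  # only "quickly" is shared: 1 - 1/5
print(lemma_jaccard_distance(Q1, Q2))   # all three lemmas are shared: 0.0
```

In this toy pair the surface forms share a single word, while the lemmas coincide completely, which is the effect the LJD feature is meant to capture.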

For generating the semantic features we use the method described in \cite{han2013umbc_ebiquity}. The basic idea behind these semantic features is that if two texts are semantically similar, then the corresponding parts of speech used in the two texts, specifically nouns and verbs, are identical or semantically close to each other.

\indent From the captured lemmas, we generate sets of nouns, verbs, adverbs and adjectives for each question. Using a chosen metric for degree of synonymy, we compute the directional measure of similarity from \cite{han2013umbc_ebiquity}, which expresses how close a part-of-speech list of one question is to that of the other. For each noun (verb) in the set of nouns (verbs) belonging to one of the text segments, we identify the noun (verb) in the other text segment that has the highest semantic similarity (maxSim) according to the selected metric. If the metric returns a positive score, we add the pair with the highest score to a list $SW_{pos}$. For the other parts of speech, adjectives and adverbs, we do a simple lexical match and add any matching words to the list with a score of 1.0.

\indent If $Q_i$ and $Q_j$ are the two questions whose similarity is being measured, the directional measure is given by the equation:
\begin{equation}
sim(Q_i, Q_j)_{(Q_i)} = \frac{\sum_{pos}\left(\sum_{w_k \in SW_{pos}}\left(maxSim(w_k) \cdot idf_{w_k}\right)\right)}{\sum_{w_k \in Q_{i_{pos}}} idf_{w_k}}
\end{equation}
\indent This measure is directional, and the bidirectional score can be computed using the following equation:
\begin{equation}
sim(Q_i, Q_j) = \frac{sim(Q_i, Q_j)_{(Q_i)} + sim(Q_i, Q_j)_{(Q_j)}}{2}
\end{equation}
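
The two equations above can be sketched in Python as follows. This is a simplified illustration, not the project's implementation: the word-similarity table, the constant idf of 1.0 and all function names are hypothetical, and each question is assumed to be already reduced to lists of lemmas grouped by part of speech.

```python
# Hypothetical word-to-word similarity scores standing in for a WordNet metric.
TOY_SIMILARITY = {
    ("car", "automobile"): 0.9,
    ("car", "price"): 0.2,
    ("buy", "purchase"): 0.8,
}


def toy_word_similarity(w1, w2):
    """Symmetric lookup; identical words score 1.0, unknown pairs 0.0."""
    if w1 == w2:
        return 1.0
    return TOY_SIMILARITY.get((w1, w2), TOY_SIMILARITY.get((w2, w1), 0.0))


def constant_idf(word):
    """Placeholder idf weight; a real system would estimate this from data."""
    return 1.0


def directional_similarity(q_from, q_to, word_sim=toy_word_similarity,
                           idf=constant_idf):
    """Similarity of q_from with respect to q_to (dicts: pos -> list of lemmas)."""
    numerator, denominator = 0.0, 0.0
    for pos, words in q_from.items():
        candidates = q_to.get(pos, [])
        for word in words:
            denominator += idf(word)
            if not candidates:
                continue
            if pos in ("noun", "verb"):
                best = max(word_sim(word, other) for other in candidates)
            else:  # adjectives and adverbs: simple lexical match
                best = 1.0 if word in candidates else 0.0
            if best > 0:
                numerator += best * idf(word)
    return numerator / denominator if denominator else 0.0


def bidirectional_similarity(q_i, q_j):
    """Average of the two directional scores."""
    return (directional_similarity(q_i, q_j)
            + directional_similarity(q_j, q_i)) / 2.0


Q_I = {"noun": ["car"], "verb": ["buy"]}
Q_J = {"noun": ["automobile", "price"], "verb": ["purchase"]}
print(directional_similarity(Q_I, Q_J))   # (0.9 + 0.8) / 2
print(directional_similarity(Q_J, Q_I))   # (0.9 + 0.2 + 0.8) / 3
print(bidirectional_similarity(Q_I, Q_J))
```

The example shows why the measure is directional: the extra noun in the second question lowers its score with respect to the first, while the first question is matched almost completely.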
\indent Since syntactic feature generation does not involve any additional processing, we describe semantic feature generation in more detail. Figure 2 shows the flow of semantic feature generation. The additional step required for semantic features is word sense disambiguation, for which the Simplified Lesk algorithm is used.
\begin{center}
\includegraphics[scale=0.5]{Figure2}
Figure 2
\end{center}
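
The idea behind Simplified Lesk, choosing the sense whose dictionary gloss shares the most words with the sentence containing the target word, can be sketched in Python as below. The sense inventory, glosses, sense identifiers and stop-word list are invented for illustration; a real system would take senses and glosses from WordNet.

```python
import re

# Hypothetical stop words and sense inventory; not WordNet data.
STOP_WORDS = {"a", "an", "the", "of", "in", "to", "that", "and", "or",
              "for", "i", "my", "at", "is"}
SENSES = {
    "bank": {
        "bank.financial": "an institution that accepts deposits and lends money",
        "bank.river": "sloping land beside a body of water such as a river",
    }
}


def content_words(text):
    """Lowercased word set with stop words removed."""
    return {t for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOP_WORDS}


def simplified_lesk(word, sentence, senses=SENSES):
    """Return the sense whose gloss overlaps most with the sentence context.

    Ties go to the first listed sense; unknown words return None.
    """
    context = content_words(sentence) - {word}
    best_sense, best_overlap = None, -1
    for sense_id, gloss in senses.get(word, {}).items():
        overlap = len(context & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense_id, overlap
    return best_sense


print(simplified_lesk("bank", "How do I deposit money at a bank?"))
print(simplified_lesk("bank", "Can I fish from the river bank?"))
```

Because the overlap is computed on exact word forms, "deposit" in the first sentence does not match "deposits" in the gloss; comparing lemmas instead of surface forms would make the overlap more robust.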
\section{Implementation}

\subsection{Data}
The dataset of questions from Quora has been taken from Kaggle, where Quora held a competition to solve the same problem. The solution is, however, not openly available. As per the rules, the dataset cannot be publicly distributed but can be used for private academic projects. The dataset also contains questions with non-English words. For this project, such questions are kept but the non-English words are dropped.

\subsection{Project Components}
The components of the system, with a brief description of each, are as follows:
\begin{itemize}
\item Data loader: This component reads data from the given CSV file and loads it into a list of samples.
\item Feature generator: Using the data from the data loader as input, this component generates syntactic and semantic features. Which features are generated depends on the provided configuration parameters.
\item Classifier trainer: This component is responsible for training the classifier selected via configuration. The features generated by the feature generator are used as input along with the class labels.
\end{itemize}

The first two components are used again during testing, and the classifier trained by the third component is then used for measuring similarity between the two sentences.
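
A minimal Python sketch of the first two components is given below. The column names (question1, question2 and is\_duplicate), the sample type, the in-memory demo data and the single word-count feature are assumptions made for illustration and may differ from the project code; the classifier trainer is omitted because it would depend on a third-party machine learning library.

```python
import csv
import io
from typing import NamedTuple


class Sample(NamedTuple):
    question1: str
    question2: str
    label: int


def load_samples(handle):
    """Data loader: read CSV rows into a list of samples."""
    return [
        Sample(row["question1"], row["question2"], int(row["is_duplicate"]))
        for row in csv.DictReader(handle)
    ]


def length_difference(q1, q2):
    """A deliberately naive feature: difference in word counts."""
    return float(abs(len(q1.split()) - len(q2.split())))


def generate_features(samples, feature_functions):
    """Feature generator: one row of feature values per sample.

    The list of feature functions plays the role of the configuration.
    """
    return [[fn(s.question1, s.question2) for fn in feature_functions]
            for s in samples]


# Hypothetical in-memory stand-in for the CSV file.
DEMO_CSV = (
    "question1,question2,is_duplicate\n"
    '"How do I learn Python?","What is the best way to learn Python?",1\n'
    '"What is WordNet?","Who founded Quora?",0\n'
)

samples = load_samples(io.StringIO(DEMO_CSV))
features = generate_features(samples, [length_difference])
labels = [s.label for s in samples]
print(features, labels)
```

The resulting feature rows and labels are what a classifier trainer would consume; at test time the same loader and feature generator would be reused on the held-out samples.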

\subsection{Machine Learning Classifiers}
To evaluate how well the generated features work, two classifiers are used: a Logistic Regression classifier and a Support Vector Machine classifier with an RBF kernel. The motivation for choosing these two classifiers is as follows:
\begin{itemize}
\item Logistic Regression takes a probabilistic approach to classification, whereas SVMs do not.
\item \cite{textbook} discusses Logistic Regression, whereas some of the NLP literature uses SVMs.
\end{itemize}

\section{Experimentation and Results}
We evaluate each of the syntactic and semantic features separately and then combine them to automatically identify whether two questions are similar or not. The dataset consists of 404289 training samples. Since the test samples are not labelled, the training data, owing to its large size, is split 80-20 and used for training and testing.

Initially, stopwords were removed from the questions before part-of-speech tagging. However, this resulted in many words being tagged incorrectly. For example, in the question \textit{``How do I read and find my YouTube comments?''}, \textit{YouTube} is an NNP, but converting the question to lowercase and removing stopwords results in \textit{youtube} being tagged as a CD. After this problem was observed for several questions, stopword filtering was moved to after part-of-speech tagging.

Intuitively, individual semantic features were expected to work better than the syntactic features. However, the results show that syntactic features perform better. Among the syntactic features, the proposed Lemma Jaccard Distance performs better than all the semantic features and sometimes better than the other syntactic features too.

Table 1 contains the results of the experiments for the Logistic Regression classifier. To get a broader understanding of the effect of the different features, we consider the following metrics: accuracy, precision, recall and F-score. Each of the last three metrics is computed for both classes, i.e. for similar and for non-similar questions. Table 2 contains the corresponding results for SVM.
\begin{center}
\addvbuffer[6pt]{\begin{tabular}{||p{1.4cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}||}
\hline
Feature & Acc. & Prec. & Rec. & F \\ [0.5ex]
\hline\hline
\multicolumn{5}{||c||}{Syntactic Features} \\
\hline
CS & 0.6456 & (0.6456, -) & (-, -) & (0.7846, -) \\
\hline
NJD & 0.6632 & (0.7061, 0.535) & (0.819, 0.3792) & (0.7584, 0.4438) \\
\hline
LJD & \textbf{0.6684} & \textbf{(0.7082, 0.5463)} & (\textbf{0.8271}, 0.3792) & \textbf{(0.7630, 0.4477)} \\
\hline
CS+NJD +LJD & 0.67 & (0.7149, 0.5458) & (0.8128, 0.4097) & (0.7608, 0.468) \\
\hline
\multicolumn{5}{||c||}{Semantic Features} \\
\hline
PS & 0.6408 & (0.6896, 0.4901) & (0.8066, 0.3386) & (0.7435, 0.4005) \\
\hline
LCS & 0.6496 & (0.6594, 0.5268) & (0.9454, 0.1106) & (0.7769, 0.1828) \\
\hline
WPS & 0.644 & (0.6799, 0.4959) & (0.8475, 0.2731) & (0.7545, 0.3522) \\
\hline\hline
Combined & 0.67 & (0.7149, 0.5458) & (0.8128, 0.4097) & (0.7608, 0.468) \\
\hline
\end{tabular}}
Table 1
\end{center}

\begin{center}
\addvbuffer[6pt]{\begin{tabular}{||p{1.4cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}||}
\hline
Feature & Acc. & Prec. & Rec. & F \\ [0.5ex]
\hline\hline
\multicolumn{5}{||c||}{Syntactic Features} \\
\hline
CS & 0.6456 & (0.6456, -) & (-, -) & (0.7846, -) \\
\hline
NJD & \textbf{0.6756} & (\textbf{0.8002}, 0.5322) & (0.6629, \textbf{0.6986}) & (0.7251, \textbf{0.6041}) \\
\hline
LJD & 0.6684 & (0.7727, 0.5286) & (0.6889, 0.6309) & (0.7284, 0.5784) \\
\hline
CS+NJD +LJD & \textbf{0.6768} & (0.7708, \textbf{0.5385}) & (\textbf{0.7106}, 0.6151) & (\textbf{0.7395}, 0.5742) \\
\hline
\multicolumn{5}{||c||}{Semantic Features} \\
\hline
PS & 0.6456 & (0.7061, 0.535) & (0.819, 0.3792) & (0.7584, 0.4438) \\
\hline
LCS & 0.6456 & (0.7061, 0.535) & (0.819, 0.3792) & (0.7584, 0.4438) \\
\hline
WPS & 0.6632 & (0.7061, 0.535) & (0.819, 0.3792) & (0.7584, 0.4438) \\
\hline\hline
Combined & 0.6732 & (0.7556, 0.5366) & (0.7298, 0.5699) & (0.7425, 0.5528) \\
\hline
\end{tabular}}
Table 2
\end{center}
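
For reference, the per-class metrics reported in the tables can be computed as in the following Python sketch. The label vectors are made-up toy data, not results from the project, and the function names are hypothetical. A metric is returned as None when it is undefined, for example the precision of a class that the classifier never predicts.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def per_class_metrics(y_true, y_pred, label):
    """Precision, recall and F-score, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    if precision is None or recall is None or precision + recall == 0:
        f_score = None
    else:
        f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy labels: 1 = similar, 0 = non-similar (illustrative only).
Y_TRUE = [0, 0, 0, 1, 1, 0, 1, 0]
Y_PRED = [0, 0, 1, 1, 0, 0, 0, 0]

print(accuracy(Y_TRUE, Y_PRED))              # 5 / 8
print(per_class_metrics(Y_TRUE, Y_PRED, 0))  # precision 4/6, recall 4/5
print(per_class_metrics(Y_TRUE, Y_PRED, 1))  # precision 1/2, recall 1/3
```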

Since the performance of the semantic features was not as expected, the feature values were investigated further to find the reason. We plot the values of a semantic feature on the X axis against the similarity class (0 or 1) on the Y axis to see whether there is a clear vertical separation between the classes. As the dataset is very large, we randomly select 5000 samples from the data and plot them. The plot in Figure 3 shows that the chosen semantic feature fails to produce any such vertical separation between the data items.

\begin{center}
\includegraphics[scale=0.28]{Figure3}
Figure 3
\end{center}

\indent It is also worth listing the features and techniques that were tried but later replaced by better ones:
\begin{itemize}
\item Length of the two questions: A very naive feature comparing the lengths of the two questions.
\item Stemming: Stemming was initially used in place of lemmatization. However, lemmatization not only gives better accuracy but is also useful for semantic feature generation, where word sense disambiguation needs to be done.
\item Common lemma senses: After lemmatization and word sense disambiguation of all the nouns, verbs, adjectives and adverbs, the senses common to the two questions were found. This metric was designed to capture how many word senses the two questions share.
\item Distance between senses: A set of senses for all nouns, verbs, adjectives and adverbs was created for each question, and the semantic distance between the two sets was computed using some metric. Unlike in the approach finally used, the words (senses) are not split by part of speech here but are all considered in a single set.
\end{itemize}

\section{Conclusion}
We described two feature sets (syntactic and semantic) for measuring sentence similarity. Three techniques were used for syntactic feature generation, and one approach with three distance metrics was used for semantic feature generation. The results obtained are close to the ones seen in \cite{han2013umbc_ebiquity}; however, the semantic features generated failed to capture the context effectively and did not increase the accuracy of prediction.

For both Logistic Regression and SVM, the highest accuracy was obtained by using all the syntactic features together, with a maximum of 67.68\% for SVM. Neither classifier outperformed the other for all features. The distribution of the semantic feature values suggests that additional domain knowledge, as discussed in future work, might be needed.

Finally, the specific task of measuring question similarity led to the evaluation of various features that can also be used for other sentence similarity measurement tasks.

\section{Future Work}
Although features from \cite{han2013umbc_ebiquity}, \cite{microsoft_paper} and \cite{paper_2} are used in this project, the semantic features do not bring a significant improvement in accuracy. Surprisingly, the syntactic features work better individually than the semantic features. Work in the following areas could improve the accuracy:
\begin{itemize}
\item K-Beam search for WSD: In place of the Simplified Lesk algorithm for word sense disambiguation, the K-Beam search approach mentioned in \cite{microsoft_paper} can be used. In this approach, the sense that matches the senses of K nearby words is chosen, so context is considered in terms of nearby word senses when choosing the correct sense. However, since this takes additional computation time depending on the value of K, it was not implemented in this project.
\item Corpus-based TF-IDF: Currently the TF-IDF measure for a pair of words is calculated from the two questions themselves and not from a corpus. A corpus was not chosen for this project because the questions come from numerous domains, so a large and exhaustive corpus would be needed. Adding such a corpus would also add to the computation time.
\item Alternative semantic features: The majority of the approach in this project is from \cite{han2013umbc_ebiquity} and \cite{paper_2}; experiments can be done with the semantic feature generation approach of \cite{microsoft_paper} to see whether different results are achieved.
\item Information Content metrics: Metrics for measuring degree of synonymy that use Information Content (IC), such as the Resnik, Lin or Jiang-Conrath distance, are not used in this project. Calculating IC requires a corpus, and the use of an external exhaustive corpus has been avoided. However, these metrics can also be evaluated and used along with the syntactic features.
\item Other classifiers: The performance of the features with other classifiers such as Neural Networks, or with aggregation techniques such as Boosting, can also be evaluated.
\end{itemize}

% conference papers do not normally have an appendix

% trigger a \newpage just before the given reference
% number – used to balance the columns on the last page
% adjust value as needed – may need to be readjusted if
% the document is modified later
%IEEEtriggeratref{8}
% The “triggered” command can be changed if desired:
%IEEEtriggercmd{enlargethispage{-5in}}

% references section

% can use a bibliography generated by BibTeX as a .bbl file
% BibTeX documentation can be easily obtained at:
% http://www.ctan.org/tex-archive/biblio/bibtex/contrib/doc/
% The IEEEtran BibTeX style support page is at:
% http://www.michaelshell.org/tex/ieeetran/bibtex/
%\bibliographystyle{IEEEtran}
% argument is your BibTeX string definitions and bibliography database(s)
%\bibliography{IEEEabrv,../bib/paper}
%
% manually copy in the resultant .bbl file
% set second argument of \begin to the number of references
% (used to reserve space for the reference number labels box)
\begin{thebibliography}{10}
\bibitem{han2013umbc_ebiquity}
Han, L., Kashyap, A. L., Finin, T., Mayfield, J., and Weese, J.,
\textit{UMBC\_EBIQUITY-CORE: Semantic Textual Similarity Systems},
NAACL-HLT, 44-52, 2013.
\bibitem{microsoft_paper}
Dao, T. N. and Simpson, T.,
\textit{Measuring similarity between sentences},
WordNet.Net, Tech. Rep., 2005.
\bibitem{paper_2}
Sravanthi, P. and Srinivase, D.,
\textit{Semantic Similarity Between Sentences},
2017.
\bibitem{salton_lesk}
Salton, G. and Lesk, M. E.,
\textit{Computer evaluation of indexing and text processing},
Journal of the ACM (JACM), 15(1), 8-36, 1968.
\bibitem{quora_wiki}
Wikipedia, \textit{Quora},
https://en.wikipedia.org/wiki/Quora
\bibitem{quora_ceo}
https://www.quora.com/How-many-people-use-Quora-7/answer/Adam-DAngelo
\bibitem{about_wordnet}
Miller, G. A.,
\textit{WordNet: A Lexical Database for English},
Communications of the ACM, 38(11), 39-41, 1995.
\bibitem{vec_sim_ref_1}
Choi, S. S., Cha, S. H., and Tappert, C. C.,
\textit{A survey of binary similarity and distance measures},
Journal of Systemics, Cybernetics and Informatics, 8(1), 43-48, 2010.
\bibitem{textbook}
Jurafsky, D. and Martin, J. H.,
\textit{Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition},
2000.
\bibitem{wordnet_doc}
WordNet NLTK documentation, http://www.nltk.org/howto/wordnet.html
\end{thebibliography}

% that’s all folks
\end{document}
