Introduction and Doubts

nirmal singhania nirmal.singhania at st.niituniversity.in
Wed Mar 9 06:42:33 GMT 2016


And yes, the similarity measure used for document similarity is cosine
similarity. For the algorithm I proposed in my earlier mail (quoted below),
I would have to implement a Euclidean distance measure and tweak it so that
it works well with the algorithm.
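
To make the relation between the two measures concrete, here is a minimal
Python sketch. It is not Xapian code: the function names and the toy
term-weight vectors are only my own illustration, and it assumes documents
are represented as dense term-weight vectors. For vectors normalised to unit
length, the squared Euclidean distance equals 2 * (1 - cosine similarity), so
on normalised vectors both measures rank document pairs the same way.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two term-weight vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalise(a):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in a))
    return [x / norm for x in a] if norm else list(a)

# Hypothetical term weights for two documents over a three-term vocabulary.
doc1 = [1.0, 2.0, 0.0]
doc2 = [2.0, 1.0, 1.0]

cos = cosine_similarity(doc1, doc2)
dist = euclidean_distance(normalise(doc1), normalise(doc2))
# On unit vectors, dist ** 2 equals 2 * (1 - cos) up to rounding error.
```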

Waiting for your suggestions.

Regards,
Nirmal Singhania
B.tech III Yr

On Wed, Mar 9, 2016 at 10:27 AM, nirmal singhania <
nirmal.singhania at st.niituniversity.in> wrote:

> Hello all, I am Nirmal Singhania from NIIT University, India.
> I am interested in the "Clustering of search results" topic.
>
> I have been working in practical machine learning and information
> retrieval for quite some time.
> I have taken various courses/MOOCs on information retrieval and text mining
> and have been working on real-life datasets (KDD99, AWID, MovieLens),
> because the problems you face in real-life ML/IR scenarios are different
> from what is taught in theory. I am also doing R&D on "Hybrid
> Techniques for Intrusion Detection using Data Mining and Clustering on
> Newer Datasets".
>
> I took an initial look at the docsim folder in xapian-core.
> These are my insights:
> The clustering used is single-link agglomerative hierarchical clustering.
> Its time complexity is O(n^2), where n is the number of documents.
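
As an illustration of where the O(n^2) bound comes from, here is a rough
Python sketch. It is not the docsim code; the function name, the use of
Euclidean distance and the toy 2-D points are my own assumptions (for
documents, a cosine-based distance would take the place of math.dist). It
relies on the known equivalence between single-link clustering and the
minimum spanning tree: build the tree with Prim's algorithm in O(n^2)
distance evaluations, then drop the k-1 longest edges.

```python
import math

def single_link_clusters(points, k):
    """Return a cluster label per point for k single-link clusters."""
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [math.inf] * n   # shortest known edge from the tree to each point
    parent = [-1] * n
    best[0] = 0.0
    edges = []              # minimum spanning tree edges as (length, u, v)
    for _ in range(n):
        # Pick the point outside the tree that is closest to it: O(n).
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] != -1:
            edges.append((best[u], parent[u], u))
        # Relax the distances of the remaining points: O(n).
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    # Keeping the n-k shortest edges is the same as cutting the k-1 longest.
    edges.sort()
    kept = edges[:max(n - k, 0)]
    # Label the connected components with a small union-find.
    label = list(range(n))

    def find(i):
        while label[i] != i:
            label[i] = label[label[i]]
            i = label[i]
        return i

    for _, u, v in kept:
        label[find(u)] = find(v)
    return [find(i) for i in range(n)]

# Two well-separated groups of hypothetical points.
labels = single_link_clusters([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)], 2)
```

Points that end up with the same label belong to the same cluster; the
label values themselves carry no meaning.
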
> At first, k-means seems to be a viable alternative, as it has O(n) time
> complexity per iteration for a fixed number of clusters.
> But it has various shortcomings:
> 1) The algorithm requires a priori specification of the number of
> cluster centers.
> 2) Different initial partitions can result in different final clusters.
> 3) It does not work well with clusters of different sizes and different
> densities.
> After that, we can think of k-means++.
> The k-means++ algorithm addresses the second of these obstacles by
> specifying a procedure to initialize the cluster centers before proceeding
> with the standard k-means optimization iterations.
> But it is a little slower because of the cluster initialization.
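
For reference, the k-means++ seeding step can be sketched as below. This is
only an illustration in Python, not Xapian code; the function names and the
toy points are assumptions of mine. Each new center is drawn with probability
proportional to the squared distance from a point to its nearest already
chosen center. Every draw needs a pass over all n points, which is where the
extra initialization cost of roughly O(n*k) comes from.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_seeds(points, k, rng=None):
    """Choose k initial centers from points using k-means++ seeding."""
    rng = rng or random.Random()
    # The first center is picked uniformly at random.
    centers = [rng.choice(points)]
    # Squared distance from each point to its nearest chosen center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0.0:
            # Every point coincides with a center; fall back to uniform choice.
            centers.append(rng.choice(points))
        else:
            # Points far from all current centers are more likely to be drawn.
            centers.append(rng.choices(points, weights=d2, k=1)[0])
        d2 = [min(old, sq_dist(p, centers[-1])) for p, old in zip(points, d2)]
    return centers

# Hypothetical data: two coincident points and one distant point.
seeds = kmeans_pp_seeds([(0, 0), (0, 0), (10, 10)], 2, random.Random(0))
```

The returned centers would then be handed to the standard k-means
iterations as the starting point.
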
> Then we can think of bisecting k-means, which is better than k-means.
> Bisecting k-means is a divisive hierarchical clustering algorithm.
> It is a little faster than the original k-means, but its clustering
> results are poorer than those of hierarchical agglomerative clustering
> on various metrics of cluster quality, such as entropy, F-measure,
> overall similarity, relative margin and variance ratio.
>
> Based on some time spent on research, I have in mind a clustering algorithm
> that can overcome the quality issues of k-means (and its variants) and the
> speed issues of hierarchical agglomerative clustering.
> Theoretically it can run in O(n) time and can produce better results than
> HAC on various metrics.
> I can't discuss it on the mailing list, but if you agree, we can discuss it
> and its implementation in Xapian further in PM.
>
> Thank you for your time.
>
> Regards,
> Nirmal Singhania
> B.tech III Yr
>