http://curtis.ml.cmu.edu/w/courses/api.php?action=feedcontributions&user=Reyyan&feedformat=atomCohen Courses - User contributions [en]2024-03-29T08:38:14ZUser contributionsMediaWiki 1.33.1http://curtis.ml.cmu.edu/w/courses/index.php?title=User:Reyyan&diff=5411User:Reyyan2011-04-02T02:33:38Z<p>Reyyan: </p>
<hr />
<div>== Reyyan Yeniterzi ==<br />
<br />
[[File:DSC01879.jpg]]<br />
<br />
http://www.cs.cmu.edu/~reyyan/<br />
<br />
Hi, I am Reyyan. I am a second year PhD student in LTI. I am currently working with Jamie Callan on Information Retrieval on Blogs. <br />
<br />
I am interested in social media, especially how and why people interact with it. I am also interested in the behavior of social networks. Through this course I am hoping to learn more about these topics and hopefully apply them in a cool project. <br />
<br />
In addition to IR, I am also working on Statistical Machine Translation as my 20% project. I am from Turkey, so I focus mainly on SMT between English and Turkish. In previous years, I worked on projects related to Computational Biology, Medical Informatics and Knowledge Representation.<br />
<br />
Project <br />
* [[Project Ideas - Derry, Reyyan]]<br />
* [[Project 2nd draft Derry Reyyan]]<br />
<br />
Paper Summaries<br />
* [[Akcora et al, SOMA 2010]]<br />
* [[Hassan et al, ICWSM 2009]]<br />
* [[Zhang et all, WWW 2007]]<br />
<br />
Algorithms<br />
* [[Jaccard similarity]]<br />
* [[Linear Threshold Models - Diffusion models]]<br />
<br />
Data Sets<br />
*[[Emotion Corpus (Upinion)]]<br />
*[[TREC BLOG06]]<br />
*[[UCLA Blogocenter]]<br />
<br />
== Related to Information Extraction ==<br />
If you are interested in Information Extraction, below are some links to paper summaries and data sets. Enjoy :) <br />
<br />
Paper Summaries<br />
* [[Borkar et al, SIGMOD 2001]]<br />
* [[Kucuk and Yazici, FQAS 2009]]<br />
* [[Tur et al, NLEJ 2003]]<br />
* [[Cucerzan and Yarowsky, SIGDAT 1999]]<br />
* [[Mota and Grishman, ACL-IJCNLP 2009]]<br />
* [[Pasca, WWW 2007]]<br />
* [[Benajiba and Rosso, LREC 2008]]<br />
* [[Klein et al, CONLL 2003]]<br />
<br />
Paper Presentation<br />
* [[Pasca, CIKM 2007]]<br />
<br />
Data Sets<br />
* Web query data sets<br />
**[[Google Web Queries (Pasca)]]<br />
* Arabic NER data sets<br />
** [[ANERcorp]]<br />
** [[ANERgazet]]</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=Linear_Threshold_Models_-_Diffusion_models&diff=5410Linear Threshold Models - Diffusion models2011-04-02T02:32:33Z<p>Reyyan: Created page with '== Diffusion Models == Diffusion models were originally used in social networks to model the spread of influence in a network. In these models each node is either active or inac…'</p>
<hr />
<div>== Diffusion Models ==<br />
<br />
Diffusion models were originally used in social networks to model the spread of influence through a network. In these models, each node is either active or inactive. Over successive iterations, an inactive node becomes active as more of its neighbors become active. <br />
<br />
== Linear Threshold Model ==<br />
<br />
The Linear Threshold Model is one of the most popular diffusion models. <br />
<br />
Given <br />
* a set of active nodes as seeds <br />
* a threshold θ for each node, selected uniformly at random from [0, 1]<br />
<br />
At each step, an inactive node becomes active if the sum of the weights of its edges to active neighbors exceeds its threshold θ.<br />
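The update rule above can be sketched as a minimal simulation. This is an illustrative sketch, not code from the original model description; the graph representation and parameter names are my own choices, and thresholds can be passed in explicitly or drawn uniformly at random from [0, 1] as the model specifies.<br />

```python
import random

def linear_threshold(in_neighbors, weight, seeds, thresholds=None, seed=0):
    """Minimal Linear Threshold Model simulation.

    in_neighbors: dict node -> list of influencing neighbors
    weight: dict (neighbor, node) -> edge weight (incoming weights
            per node should sum to at most 1)
    seeds: set of initially active nodes
    thresholds: optional dict node -> threshold; if omitted, each
                threshold is drawn uniformly at random from [0, 1]
    """
    rng = random.Random(seed)
    if thresholds is None:
        thresholds = {v: rng.random() for v in in_neighbors}
    active = set(seeds)
    changed = True
    while changed:  # iterate until no new node activates
        changed = False
        for v in in_neighbors:
            if v in active:
                continue
            influence = sum(weight.get((u, v), 0.0)
                            for u in in_neighbors[v] if u in active)
            if influence >= thresholds[v]:
                active.add(v)
                changed = True
    return active
```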
<br />
[[File:Ltm.jpg]]</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=Zhang_et_all,_WWW_2007&diff=5408Zhang et all, WWW 20072011-04-02T02:30:02Z<p>Reyyan: </p>
<hr />
<div>== Citation ==<br />
Jun Zhang, Mark S. Ackerman, and Lada Adamic. 2007. Expertise networks in online communities: structure and algorithms. In Proceedings of the 16th international conference on World Wide Web (WWW '07). ACM, New York, NY, USA, 221-230. <br />
<br />
== Online version ==<br />
<br />
[http://portal.acm.org/citation.cfm?id=1242603 ACM]<br />
<br />
== Summary ==<br />
<br />
The aim of this [[Category::paper]] is to identify users with high expertise within online expertise-sharing communities. This [[AddressesProblem::expertise finding]] system uses graph-based algorithms on social networks within the community. <br />
<br />
They created a post-reply network in which each user is represented as a node and a directed edge points from each user who started a post to every user who replied to it. Because of the way the network is constructed, a user's prestige in this network is highly correlated with his or her expertise; the network is therefore called a ''community expertise network (CEN)''.<br />
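The construction can be sketched as follows; the thread representation and function name are illustrative assumptions, not taken from the paper.<br />

```python
from collections import defaultdict

def build_cen(threads):
    """Build a community expertise network from forum threads.

    threads: iterable of (asker, repliers) pairs, one per thread.
    Returns edge counts: cen[asker][replier] = number of threads in
    which that replier answered that asker (edges point from the user
    who needs help to the users who provide it).
    """
    cen = defaultdict(lambda: defaultdict(int))
    for asker, repliers in threads:
        for replier in set(repliers):  # count each replier once per thread
            if replier != asker:       # ignore self-replies
                cen[asker][replier] += 1
    return cen
```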
<br />
'''Network Characteristics'''<br />
<br />
The authors experimented on the [[UsesDataset::Java Forum]], a large online help-seeking community. Before testing the algorithms, they performed several analyses to characterize the network. The analyses and their results are listed below:<br />
* Bow tie structure analysis: More than half of the users only ask questions; 13% only answer, and 12% both ask and answer.<br />
* Degree distribution analysis: The majority of users answer only a few questions, while a few active users answer a great many. <br />
* Degree correlation analysis: Top repliers answer questions for everyone, but less expert users do not reply to highly expert users.<br />
<br />
It is important to note that these characteristics are different from those of WWW graphs. <br />
<br />
'''Expertise Ranking Algorithms'''<br />
<br />
* Simple statistical measures: simply counting the number of replies a user writes, or the number of distinct users helped, as the user's expertise score. <br />
* Z-score: a measure that combines one's asking and replying patterns.<br />
* Expertise Rank algorithm: a [[UsesMethod::PageRank]]-like algorithm which uses not only the number of users one helped but also whom one helped.<br />
* HITS Authority: similar to the [[UsesMethod::HITS]] algorithm, where a good hub is a user who is helped by many experts and a good authority is a user who helps many good hubs.<br />
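The z-score measure can be sketched as below. The (a − q)/√(a + q) form is the commonly cited formulation of this measure; it is included here as an illustration rather than as the paper's exact definition.<br />

```python
import math

def z_score(answers, questions):
    """Combine a user's asking and replying patterns into one score.

    Heavy answerers get large positive scores, heavy askers get large
    negative ones, and balanced users stay near zero.
    """
    a, q = answers, questions
    if a + q == 0:
        return 0.0  # no activity, no evidence of expertise
    return (a - q) / math.sqrt(a + q)
```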
<br />
In the experiments, the authors used Spearman's Rho and Kendall's Tau to measure the correlation between these ranking algorithms and human-assigned ratings. The rankings turned out to be highly correlated with the human ratings, which means that structural information can be used to identify experts in online communities. <br />
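For rankings without ties, Spearman's Rho reduces to a standard closed form; the sketch below is an illustration of the measure, not the paper's evaluation code.<br />

```python
def spearman_rho(rank_x, rank_y):
    """Spearman rank correlation for two tie-free rankings.

    rank_x, rank_y: lists giving each item's rank under two rankings
    (at least two items). Uses the closed form for tie-free data:
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)).
    """
    n = len(rank_x)
    d2 = sum((x - y) ** 2 for x, y in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

A rho of 1 means the two rankings agree perfectly, and -1 means one is the exact reverse of the other.<br />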
<br />
It was also observed that algorithms like PageRank and HITS, which work very well on the WWW, do not outperform the simpler measures in this online community, which confirms that structural differences may be the reason complex algorithms do not always work well on other network structures. <br />
<br />
A simulated network model was also created, in which users make the best use of their time by being more selective, choosing questions that are challenging to them yet that they are still capable of answering. Analysis of this network showed that ExpertiseRank and the Z-score outperform the others, especially HITS. This shows that the performance of expertise ranking algorithms depends highly on the dynamics of the community. <br />
<br />
Expertise ranking algorithms may perform differently on differently structured networks; therefore, understanding the structural characteristics of a network makes a significant difference in the performance of these algorithms.<br />
<br />
A similar work is [[RelatedPaper::Littlepage et al]] and another one that works on emails is [[RelatedPaper::Dom et al, DMKD 2003]].</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=Zhang_et_all,_WWW_2007&diff=5407Zhang et all, WWW 20072011-04-02T01:43:33Z<p>Reyyan: </p>
<hr />
<div>== Citation ==<br />
<br />
<br />
<br />
== Online version ==<br />
<br />
[http://www-personal.umich.edu/~hassanam/my_publications/icwsm09.pdf ICWSM09]<br />
<br />
== Summary ==<br />
<br />
The aim of this [[Category::paper]] is to identify users with high expertise within online expertise-sharing communities. This [[AddressesProblem::expertise finding]] system uses graph-based algorithms on social networks within the community. <br />
<br />
They treat expertise as a relative concept. <br />
<br />
network based algorithms such as PageRank, HITS<br />
<br />
They created a post-reply network in which each user is represented as a node and a directed edge is created from each user who started the post to other users who replied to it. The prestige measure of this network is highly correlated with a user's expertise due to the way the network is constructed. Therefore this network is called ''community expertise network (CEN)''.<br />
<br />
'''Network Characteristics'''<br />
<br />
The authors experimented on the [[UsesDataset::Java Forum]] which is a large online help-seeking community. Before testing the algorithms they did several analysis to characterize the network. Below are the performed analysis and their results<br />
* The Bow tie structure analysis : More than half of the users only asks questions. 13% only answers and 12% both answers and asks.<br />
* Degree distribution analysis : The majority of users answers only a few questions but few active users answers a lot of questions. <br />
* Degree correlation analysis : Top repliers answer questions for everyone but less expert users do not reply to high expert users.<br />
<br />
It is important to note that these characteristics are different from WWW graphs. <br />
<br />
'''Expertise Ranking Algorithms'''<br />
<br />
* Simple statistical measures : Just counting the number of replies or counting the number of users helped to calculate the score of expertise of a user. <br />
* Z-score : A measure that combines one's asking and replying patterns.<br />
* Expertise Rank Algorithm : [[UsesMethod::PageRank]] like algorithm which uses not only count of users helped but also whom one helped.<br />
* HITS Authority : Similar to [[UsesMethod::HITS]] algorithm where good hub is a user who is helped by many experts and good authority is a user who helps many good hubs.<br />
<br />
In experiments the authors used Spearman's Rho and Kendall's Tau measures to understand the correlations between these ranking algorithms and the human-assigned ratings. It has been observed that they are highly correlated which means that structural information can be used to identify experts in online communities. <br />
<br />
It has been also observed that algorithms like PageRank and HITS which works really well in WWW, does not outperform simpler algorithms used in this online community which confirms that structural differences may be the reason why complex algorithms may not work well in other network structures. <br />
<br />
Therefore understanding the characteristics<br />
<br />
<br />
<br />
<br />
[[RelatedPaper::Arguello et al, ICWSM 2008]] and [[RelatedPaper::Elsas et al, TREC 2007]].</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=Zhang_et_all,_WWW_2007&diff=5406Zhang et all, WWW 20072011-04-02T01:32:00Z<p>Reyyan: </p>
<hr />
<div>== Citation ==<br />
<br />
<br />
<br />
== Online version ==<br />
<br />
[http://www-personal.umich.edu/~hassanam/my_publications/icwsm09.pdf ICWSM09]<br />
<br />
== Summary ==<br />
<br />
The aim of this [[Category::paper]] is to identify users with high expertise within online expertise-sharing communities. This [[AddressesProblem::expertise finding]] system uses graph-based algorithms on social networks within the community. <br />
<br />
They treat expertise as a relative concept. <br />
<br />
network based algorithms such as PageRank, HITS<br />
<br />
They created a post-reply network in which each user is represented as a node and a directed edge is created from each user who started the post to other users who replied to it. The prestige measure of this network is highly correlated with a user's expertise due to the way the network is constructed. Therefore this network is called ''community expertise network (CEN)''.<br />
<br />
'''Network Characteristics'''<br />
<br />
The authors experimented on the [[UsesDataset::Java Forum]] which is a large online help-seeking community. Before testing the algorithms they did several analysis to characterize the network. Below are the performed analysis and their results<br />
* The Bow tie structure analysis : More than half of the users only asks questions. 13% only answers and 12% both answers and asks.<br />
* Degree distribution analysis : The majority of users answers only a few questions but few active users answers a lot of questions. <br />
* Degree correlation analysis : Top repliers answer questions for everyone but less expert users do not reply to high expert users.<br />
<br />
It is important to note that these characteristics are different from WWW graphs. <br />
<br />
'''Expertise Ranking Algorithms'''<br />
<br />
* Simple statistical measures : Just counting the number of replies or counting the number of users helped to calculate the score of expertise of a user. <br />
* Z-score : A measure that combines one's asking and replying patterns.<br />
* Expertise Rank Algorithm : [[UsesMethod::PageRank]] like algorithm which uses not only count of users helped but also whom one helped.<br />
* HITS Authority : Similar to [[UsesMethod::HITS]] algorithm where good hub is a user who is helped by many experts and good authority is a user who helps many good hubs.<br />
<br />
In experiments the authors used Spearman's Rho and Kendall's Tau measures to understand the correlations between these ranking algorithms and the human-assigned ratings. It has been observed that they are highly correlated which means that structural information can be used to identify experts in online communities. <br />
<br />
It has been also observed that algorithms like PageRank and HITS which works really well in WWW, does not outperform simpler algorithms used in this online community. <br />
<br />
Therefore understanding the characteristics<br />
<br />
<br />
<br />
<br />
[[RelatedPaper::Arguello et al, ICWSM 2008]] and [[RelatedPaper::Elsas et al, TREC 2007]].</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=Zhang_et_all,_WWW_2007&diff=5405Zhang et all, WWW 20072011-04-02T01:22:52Z<p>Reyyan: </p>
<hr />
<div>== Citation ==<br />
<br />
<br />
<br />
== Online version ==<br />
<br />
[http://www-personal.umich.edu/~hassanam/my_publications/icwsm09.pdf ICWSM09]<br />
<br />
== Summary ==<br />
<br />
The aim of this [[Category::paper]] is to identify users with high expertise within online expertise-sharing communities. This [[AddressesProblem::expertise finding]] system uses graph-based algorithms on social networks within the community. <br />
<br />
They treat expertise as a relative concept. <br />
<br />
network based algorithms such as PageRank, HITS<br />
<br />
They created a post-reply network in which each user is represented as a node and a directed edge is created from each user who started the post to other users who replied to it. The prestige measure of this network is highly correlated with a user's expertise due to the way the network is constructed. Therefore this network is called ''community expertise network (CEN)''.<br />
<br />
'''Network Characteristics'''<br />
<br />
The authors experimented on the [[UsesDataset::Java Forum]] which is a large online help-seeking community. Before testing the algorithms they did several analysis to characterize the network. Below are the performed analysis and their results<br />
* The Bow tie structure analysis : More than half of the users only asks questions. 13% only answers and 12% both answers and asks.<br />
* Degree distribution analysis : The majority of users answers only a few questions but few active users answers a lot of questions. <br />
* Degree correlation analysis : Top repliers answer questions for everyone but less expert users do not reply to high expert users.<br />
<br />
It is important to note that these characteristics are different from WWW graphs. <br />
<br />
'''Expertise Ranking Algorithms'''<br />
<br />
* Simple statistical measures : Just counting the number of replies or counting the number of users helped to calculate the score of expertise of a user. <br />
* Z-score : A measure that combines one's asking and replying patterns.<br />
* Expertise Rank Algorithm : [[UsesMethod::PageRank]] like algorithm which uses not only count of users helped but also whom one helped.<br />
* HITS Authority : Similar to [[UsesMethod::HITS]] algorithm where good hub is a user who is helped by many experts and good authority is a user who helps many good hubs.<br />
<br />
<br />
<br />
<br />
<br />
[[RelatedPaper::Arguello et al, ICWSM 2008]] and [[RelatedPaper::Elsas et al, TREC 2007]].</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=Zhang_et_all,_WWW_2007&diff=5404Zhang et all, WWW 20072011-04-01T23:50:15Z<p>Reyyan: </p>
<hr />
<div>== Citation ==<br />
<br />
<br />
<br />
== Online version ==<br />
<br />
[http://www-personal.umich.edu/~hassanam/my_publications/icwsm09.pdf ICWSM09]<br />
<br />
== Summary ==<br />
<br />
The aim of this [[Category::paper]] is to identify users with high expertise within online expertise-sharing communities. This [[AddressesProblem::expertise finding]] system uses graph-based algorithms on social networks within the community. <br />
<br />
They treat expertise as a relative concept. <br />
<br />
network based algorithms such as PageRank, HITS<br />
<br />
They created a post-reply network in which each user is represented as a node and a directed edge is created from each user who started the post to other users who replied to it. The prestige measure of this network is highly correlated with a user's expertise due to the way the network is constructed. Therefore this network is called ''community expertise network (CEN)''.<br />
<br />
'''Network Characteristics'''<br />
<br />
The authors experimented on the [[UsesDataset::Java Forum]] which is a large online help-seeking community. Before testing the algorithms they did several analysis to characterize the network. Below are the performed analysis and their results<br />
* The Bow tie structure analysis : More than half of the users only asks questions. 13% only answers and 12% both answers and asks.<br />
* Degree distribution analysis : The majority of users answers only a few questions but few active users answers a lot of questions. <br />
* Degree correlation analysis : Top repliers answer questions for everyone but less expert users do not reply to high expert users.<br />
<br />
It is important to note that these characteristics are different from WWW graphs. <br />
<br />
'''Expertise Ranking Algorithms'''<br />
<br />
* Simple statistical measures : Just counting the number of replies or counting the number of users helped to calculate the score of expertise of a user. <br />
* Z-score : A measure that combines one's asking and replying patterns.<br />
* Expertise Rank Algorithm : [[UsesMethod::PageRank]] like algorithm which <br />
<br />
<br />
<br />
[[RelatedPaper::Arguello et al, ICWSM 2008]] and [[RelatedPaper::Elsas et al, TREC 2007]].</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=Zhang_et_all,_WWW_2007&diff=5344Zhang et all, WWW 20072011-04-01T21:00:54Z<p>Reyyan: </p>
<hr />
<div>== Citation ==<br />
<br />
<br />
<br />
== Online version ==<br />
<br />
[http://www-personal.umich.edu/~hassanam/my_publications/icwsm09.pdf ICWSM09]<br />
<br />
== Summary ==<br />
<br />
The aim of this [[Category::paper]] is to identify users with high expertise within online expertise-sharing communities. This [[AddressesProblem::expertise finding]] system uses graph-based algorithms on social networks within the community. <br />
<br />
They treat expertise as a relative concept. <br />
<br />
network based algorithms such as PageRank, HITS<br />
<br />
They created a post-reply network in which each user is represented as a node and a directed edge is created from each user who started the post to other users who replied to it. The prestige measure of this network is highly correlated with a user's expertise due to the way the network is constructed. Therefore this network is called ''community expertise network (CEN)''.<br />
<br />
'''Network Characteristics'''<br />
<br />
The authors experimented on the Java Forum which is a large online help-seeking community. Before testing the algorithms they did several analysis to characterize the network. Below are the performed analysis and their results<br />
* The Bow tie structure analysis : More than half of the users only asks questions. 13% only answers and 12% both answers and asks.<br />
* Degree distribution analysis : The majority of users answers only a few questions but few active users answers a lot of questions. <br />
* Degree correlation analysis : Top repliers answer questions for everyone but less expert users do not reply to high expert users.<br />
<br />
It is important to note that these characteristics are different from WWW graphs. <br />
<br />
'''Expertise Ranking Algorithms'''<br />
<br />
* Simple statistical measures : Just counting the number of replies or counting the number of users helped to calculate the score of expertise of a user. <br />
<br />
<br />
the important and influential blogs with recurring interest in a specific topic. Given a set of blogs related to a particular topic, the authors are trying to find a subset of blogs that represents the larger set by using a stochastic graph based method. <br />
<br />
The authors approached to this [[AddressesProblem::blog retrieval]] problem with the assumption that important and representative blogs tend to be lexically similar to other important and representative blogs. Therefore they used textual similarity between posts as a way to understand which blog is affecting the others and so to determine the authorities.<br />
<br />
The authors used a [[UsesMethod::PageRank]] like algorithm, called BlogRank, to rank the blogs by their popularity. In their algorithm they represented each blog with a node and put an edge between two nodes if they are lexically similar. Iterations over this graph calculates the importance score of a blog by using the scores of its neighbors. <br />
<br />
<br />
<br />
[[UsesMethod::Cosine similarity]] between tf-idf vector representations of posts are used the calculate the text similarity between posts. The authors also used blog related attributes such as number of posts, average length of posts etc. as priors. BlogRank algorithm takes diversity into account and penalize blogs that are quite similar to already selected blogs.<br />
<br />
[[UsesDataset::TREC BLOG06]] and [[UsesDataset::UCLA Blogocenter]] datasets had been used in the experiments. They used [[UsesMethod::diffusion models]] to measure the performance of their algorithm. Initially they marked the selected nodes as active and then applied the diffusion model and counted the number of activated nodes at the end. <br />
<br />
The authors tried several other algorithms to compare with their ranking algorithm. The experiments showed that BlogRank outperforms other methods both in coverage and in running time. They also performed experiments in order to see whether BlogRank algorithm can be used in predicting. The results indicated that BlogRank method generalizes well for the future. <br />
<br />
This work is similar to the Blog Distillation task in the TREC Blog Track. However in blog distillation task, given a query the aim is to return all relevant blogs. In this paper, given set of blogs related to topic, the aim is to select smaller set of blogs. Some related works are [[RelatedPaper::Arguello et al, ICWSM 2008]] and [[RelatedPaper::Elsas et al, TREC 2007]].</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=Zhang_et_all,_WWW_2007&diff=5343Zhang et all, WWW 20072011-04-01T20:56:23Z<p>Reyyan: /* Summary */</p>
<hr />
<div>== Citation ==<br />
<br />
<br />
<br />
== Online version ==<br />
<br />
[http://www-personal.umich.edu/~hassanam/my_publications/icwsm09.pdf ICWSM09]<br />
<br />
== Summary ==<br />
<br />
The aim of this [[Category::paper]] is to identify users with high expertise within online expertise-sharing communities. This [[AddressesProblem::expertise finding]] system uses graph-based algorithms on social networks within the community. <br />
<br />
They treat expertise as a relative concept. <br />
<br />
network based algorithms such as PageRank, HITS<br />
<br />
They created a post-reply network in which each user is represented as a node and a directed edge is created from each user who started the post to other users who replied to it. The prestige measure of this network is highly correlated with a user's expertise due to the way the network is constructed. Therefore this network is called ''community expertise network (CEN)''.<br />
<br />
'''Network Characteristics'''<br />
<br />
The authors experimented on the Java Forum which is a large online help-seeking community. Before testing the algorithms they did several analysis to characterize the network. Below are the performed analysis and their results<br />
* The Bow tie structure analysis : More than half of the users only asks questions. 13% only answers and 12% both answers and asks.<br />
* Degree distribution analysis : The majority of users answers only a few questions but few active users answers a lot of questions. <br />
* Degree correlation analysis : Top repliers answer questions for everyone but less expert users do not reply to high expert users.<br />
<br />
<br />
<br />
<br />
the important and influential blogs with recurring interest in a specific topic. Given a set of blogs related to a particular topic, the authors are trying to find a subset of blogs that represents the larger set by using a stochastic graph based method. <br />
<br />
The authors approached to this [[AddressesProblem::blog retrieval]] problem with the assumption that important and representative blogs tend to be lexically similar to other important and representative blogs. Therefore they used textual similarity between posts as a way to understand which blog is affecting the others and so to determine the authorities.<br />
<br />
The authors used a [[UsesMethod::PageRank]] like algorithm, called BlogRank, to rank the blogs by their popularity. In their algorithm they represented each blog with a node and put an edge between two nodes if they are lexically similar. Iterations over this graph calculates the importance score of a blog by using the scores of its neighbors. <br />
<br />
[[File:BlogRank.jpg]]<br />
<br />
[[UsesMethod::Cosine similarity]] between tf-idf vector representations of posts are used the calculate the text similarity between posts. The authors also used blog related attributes such as number of posts, average length of posts etc. as priors. BlogRank algorithm takes diversity into account and penalize blogs that are quite similar to already selected blogs.<br />
<br />
[[UsesDataset::TREC BLOG06]] and [[UsesDataset::UCLA Blogocenter]] datasets had been used in the experiments. They used [[UsesMethod::diffusion models]] to measure the performance of their algorithm. Initially they marked the selected nodes as active and then applied the diffusion model and counted the number of activated nodes at the end. <br />
<br />
The authors tried several other algorithms to compare with their ranking algorithm. The experiments showed that BlogRank outperforms other methods both in coverage and in running time. They also performed experiments in order to see whether BlogRank algorithm can be used in predicting. The results indicated that BlogRank method generalizes well for the future. <br />
<br />
This work is similar to the Blog Distillation task in the TREC Blog Track. However in blog distillation task, given a query the aim is to return all relevant blogs. In this paper, given set of blogs related to topic, the aim is to select smaller set of blogs. Some related works are [[RelatedPaper::Arguello et al, ICWSM 2008]] and [[RelatedPaper::Elsas et al, TREC 2007]].</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=Zhang_et_all,_WWW_2007&diff=5337Zhang et all, WWW 20072011-04-01T20:33:16Z<p>Reyyan: /* Summary */</p>
<hr />
<div>== Citation ==<br />
<br />
<br />
<br />
== Online version ==<br />
<br />
[http://www-personal.umich.edu/~hassanam/my_publications/icwsm09.pdf ICWSM09]<br />
<br />
== Summary ==<br />
<br />
The aim of this [[Category::paper]] is to identify users with high expertise within online expertise-sharing communities. This [[AddressesProblem::expertise finding]] system uses graph-based algorithms on social networks within the community. <br />
<br />
They treat expertise as a relative concept. <br />
<br />
network based algorithms such as PageRank, HITS<br />
<br />
They created a post-reply network in which each user is represented as a node and a directed edge is created from each user who started the post to other users who replied to it. This network which is called ''community expertise network (CEN)'' reflects the shared interests of the community members. <br />
<br />
They experimented on the Java Forum which is a large online help-seeking community<br />
<br />
<br />
the important and influential blogs with recurring interest in a specific topic. Given a set of blogs related to a particular topic, the authors are trying to find a subset of blogs that represents the larger set by using a stochastic graph based method. <br />
<br />
The authors approached to this [[AddressesProblem::blog retrieval]] problem with the assumption that important and representative blogs tend to be lexically similar to other important and representative blogs. Therefore they used textual similarity between posts as a way to understand which blog is affecting the others and so to determine the authorities.<br />
<br />
The authors used a [[UsesMethod::PageRank]]-like algorithm, called BlogRank, to rank the blogs by their popularity. In their algorithm they represented each blog with a node and put an edge between two nodes if they are lexically similar. Iterations over this graph calculate the importance score of a blog by using the scores of its neighbors. <br />
<br />
[[File:BlogRank.jpg]]<br />
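A minimal sketch of such a PageRank-style iteration over a lexical-similarity graph might look as follows. The graph, damping factor, and similarity weights here are invented for illustration; this is not the authors' exact BlogRank formula:

```python
# PageRank-style iteration over a lexical-similarity graph.
# sim[i][j] is an assumed similarity weight between blog i and blog j.

def blogrank(sim, damping=0.85, iters=50):
    n = len(sim)
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            # score flows to blog i from every neighbour j, proportional
            # to the share of j's total similarity mass that points at i
            s = 0.0
            for j in range(n):
                out = sum(sim[j])
                if out > 0:
                    s += rank[j] * sim[j][i] / out
            new.append((1 - damping) / n + damping * s)
        rank = new
    return rank

# toy graph: blogs 0 and 1 are highly similar, blog 2 is a weak neighbour
sim = [[0.0, 0.9, 0.1],
       [0.9, 0.0, 0.1],
       [0.1, 0.1, 0.0]]
scores = blogrank(sim)
```

After iteration the two mutually similar blogs end up with higher scores than the weakly connected one, which is the intuition behind ranking authorities by lexical similarity.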
<br />
[[UsesMethod::Cosine similarity]] between tf-idf vector representations of posts is used to calculate the text similarity between posts. The authors also used blog-related attributes, such as the number of posts and the average length of posts, as priors. The BlogRank algorithm takes diversity into account and penalizes blogs that are quite similar to already selected blogs.<br />
<br />
The [[UsesDataset::TREC BLOG06]] and [[UsesDataset::UCLA Blogocenter]] datasets were used in the experiments. They used [[UsesMethod::diffusion models]] to measure the performance of their algorithm: initially they marked the selected nodes as active, then applied the diffusion model and counted the number of activated nodes at the end. <br />
<br />
The authors tried several other algorithms to compare with their ranking algorithm. The experiments showed that BlogRank outperforms other methods both in coverage and in running time. They also performed experiments to see whether the BlogRank algorithm can be used for prediction. The results indicated that the BlogRank method generalizes well to future data. <br />
<br />
This work is similar to the Blog Distillation task in the TREC Blog Track. However, in the blog distillation task the aim is to return all blogs relevant to a given query, whereas in this paper, given a set of blogs related to a topic, the aim is to select a smaller representative set of blogs. Some related works are [[RelatedPaper::Arguello et al, ICWSM 2008]] and [[RelatedPaper::Elsas et al, TREC 2007]].</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=Zhang_et_all,_WWW_2007&diff=5333Zhang et all, WWW 20072011-04-01T20:27:24Z<p>Reyyan: Created page with '== Citation == == Online version == [http://www-personal.umich.edu/~hassanam/my_publications/icwsm09.pdf ICWSM09] == Summary == The aim of this Category::paper is to id…'</p>
<hr />
<div>== Citation ==<br />
<br />
<br />
<br />
== Online version ==<br />
<br />
[http://www-personal.umich.edu/~hassanam/my_publications/icwsm09.pdf ICWSM09]<br />
<br />
== Summary ==<br />
<br />
The aim of this [[Category::paper]] is to identify users with high expertise within online expertise-sharing communities. This [[AddressesProblem::expertise finding]] system uses graph-based algorithms on social networks within the community. <br />
<br />
They treat expertise as a relative concept. <br />
<br />
They applied network-based algorithms such as PageRank and HITS.<br />
<br />
They created a post-reply network in which each user is represented as a node and a directed edge is drawn from the user who started a post to the users who replied to it. <br />
<br />
They experimented on the Java Forum, which is a large online help-seeking community.<br />
<br />
<br />
The aim is to find the important and influential blogs with recurring interest in a specific topic. Given a set of blogs related to a particular topic, the authors are trying to find a subset of blogs that represents the larger set by using a stochastic graph-based method. <br />
<br />
The authors approached this [[AddressesProblem::blog retrieval]] problem with the assumption that important and representative blogs tend to be lexically similar to other important and representative blogs. Therefore they used textual similarity between posts as a way to understand which blog is affecting the others, and so to determine the authorities.<br />
<br />
The authors used a [[UsesMethod::PageRank]]-like algorithm, called BlogRank, to rank the blogs by their popularity. In their algorithm they represented each blog with a node and put an edge between two nodes if they are lexically similar. Iterations over this graph calculate the importance score of a blog by using the scores of its neighbors. <br />
<br />
[[File:BlogRank.jpg]]<br />
<br />
[[UsesMethod::Cosine similarity]] between tf-idf vector representations of posts is used to calculate the text similarity between posts. The authors also used blog-related attributes, such as the number of posts and the average length of posts, as priors. The BlogRank algorithm takes diversity into account and penalizes blogs that are quite similar to already selected blogs.<br />
<br />
The [[UsesDataset::TREC BLOG06]] and [[UsesDataset::UCLA Blogocenter]] datasets were used in the experiments. They used [[UsesMethod::diffusion models]] to measure the performance of their algorithm: initially they marked the selected nodes as active, then applied the diffusion model and counted the number of activated nodes at the end. <br />
<br />
The authors tried several other algorithms to compare with their ranking algorithm. The experiments showed that BlogRank outperforms other methods both in coverage and in running time. They also performed experiments to see whether the BlogRank algorithm can be used for prediction. The results indicated that the BlogRank method generalizes well to future data. <br />
<br />
This work is similar to the Blog Distillation task in the TREC Blog Track. However, in the blog distillation task the aim is to return all blogs relevant to a given query, whereas in this paper, given a set of blogs related to a topic, the aim is to select a smaller representative set of blogs. Some related works are [[RelatedPaper::Arguello et al, ICWSM 2008]] and [[RelatedPaper::Elsas et al, TREC 2007]].</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=User:Reyyan&diff=5330User:Reyyan2011-04-01T20:09:18Z<p>Reyyan: </p>
<hr />
<div>== Reyyan Yeniterzi ==<br />
<br />
[[File:DSC01879.jpg]]<br />
<br />
http://www.cs.cmu.edu/~reyyan/<br />
<br />
Hi, I am Reyyan. I am a second year PhD student in LTI. I am currently working with Jamie Callan on Information Retrieval on Blogs. <br />
<br />
I am interested in social media, especially how and why people interact with it. I am also interested in the behavior of social networks. With this course I am hoping to learn more about these topics and hopefully apply them in a cool project. <br />
<br />
In addition to IR, I am also working on Statistical Machine Translation as my 20% project. I am from Turkey, so I focus mainly on SMT between English and Turkish. In previous years, I worked on projects related to Computational Biology, Medical Informatics and Knowledge Representation.<br />
<br />
Project <br />
* [[Project Ideas - Derry, Reyyan]]<br />
* [[Project 2nd draft Derry Reyyan]]<br />
<br />
Paper Summaries<br />
* [[Akcora et al, SOMA 2010]]<br />
* [[Hassan et al, ICWSM 2009]]<br />
* [[Zhang et all, WWW 2007]]<br />
<br />
Algorithms<br />
* [[Jaccard similarity]]<br />
* [[Diffusion models]]<br />
<br />
Data Sets<br />
*[[Emotion Corpus (Upinion)]]<br />
*[[TREC BLOG06]]<br />
*[[UCLA Blogocenter]]<br />
<br />
== Related to Information Extraction ==<br />
If you are interested in Information Extraction, below are some links to paper summaries and data sets. Enjoy :) <br />
<br />
Paper Summaries<br />
* [[Borkar et al, SIGMOD 2001]]<br />
* [[Kucuk and Yazici, FQAS 2009]]<br />
* [[Tur et al, NLEJ 2003]]<br />
* [[Cucerzan and Yarowsky, SIGDAT 1999]]<br />
* [[Mota and Grishman, ACL-IJCNLP 2009]]<br />
* [[Pasca, WWW 2007]]<br />
* [[Benajiba and Rosso, LREC 2008]]<br />
* [[Klein et al, CONLL 2003]]<br />
<br />
Paper Presentation<br />
* [[Pasca, CIKM 2007]]<br />
<br />
Data Sets<br />
* Web query data sets<br />
**[[Google Web Queries (Pasca)]]<br />
* Arabic NER data sets<br />
** [[ANERcorp]]<br />
** [[ANERgazet]]</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=User:Reyyan&diff=5010User:Reyyan2011-03-31T07:04:13Z<p>Reyyan: </p>
<hr />
<div>== Reyyan Yeniterzi ==<br />
<br />
[[File:DSC01879.jpg]]<br />
<br />
http://www.cs.cmu.edu/~reyyan/<br />
<br />
Hi, I am Reyyan. I am a second year PhD student in LTI. I am currently working with Jamie Callan on Information Retrieval on Blogs. <br />
<br />
I am interested in social media, especially how and why people interact with it. I am also interested in the behavior of social networks. With this course I am hoping to learn more about these topics and hopefully apply them in a cool project. <br />
<br />
In addition to IR, I am also working on Statistical Machine Translation as my 20% project. I am from Turkey, so I focus mainly on SMT between English and Turkish. In previous years, I worked on projects related to Computational Biology, Medical Informatics and Knowledge Representation.<br />
<br />
Project <br />
* [[Project Ideas - Derry, Reyyan]]<br />
* [[Project 2nd draft Derry Reyyan]]<br />
<br />
Paper Summaries<br />
* [[Akcora et al, SOMA 2010]]<br />
* [[Hassan et al, ICWSM 2009]]<br />
<br />
Algorithms<br />
* [[Jaccard similarity]]<br />
* [[Diffusion models]]<br />
<br />
Data Sets<br />
*[[Emotion Corpus (Upinion)]]<br />
*[[TREC BLOG06]]<br />
*[[UCLA Blogocenter]]<br />
<br />
== Related to Information Extraction ==<br />
If you are interested in Information Extraction, below are some links to paper summaries and data sets. Enjoy :) <br />
<br />
Paper Summaries<br />
* [[Borkar et al, SIGMOD 2001]]<br />
* [[Kucuk and Yazici, FQAS 2009]]<br />
* [[Tur et al, NLEJ 2003]]<br />
* [[Cucerzan and Yarowsky, SIGDAT 1999]]<br />
* [[Mota and Grishman, ACL-IJCNLP 2009]]<br />
* [[Pasca, WWW 2007]]<br />
* [[Benajiba and Rosso, LREC 2008]]<br />
* [[Klein et al, CONLL 2003]]<br />
<br />
Paper Presentation<br />
* [[Pasca, CIKM 2007]]<br />
<br />
Data Sets<br />
* Web query data sets<br />
**[[Google Web Queries (Pasca)]]<br />
* Arabic NER data sets<br />
** [[ANERcorp]]<br />
** [[ANERgazet]]</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=UCLA_Blogocenter&diff=5009UCLA Blogocenter2011-03-31T07:03:53Z<p>Reyyan: </p>
<hr />
<div>UCLA Blogocenter dataset was built by the The Blogocenter group at UCLA. The dataset contains RSS feeds from the Bloglines, Blogspot, Microsoft Live Spaces, and syndic8 aggregators covering the past several years. The dataset contains over 192 million blog posts. More information about the dataset can be found at [[RelatedPaper::Sia et all, KDD 2008]].</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=UCLA_Blogocenter&diff=5008UCLA Blogocenter2011-03-31T07:03:24Z<p>Reyyan: Created page with 'UCLA Blogocenter dataset was built by the The Blogocenter group at UCLA. The dataset contains RSS feeds from the Bloglines, Blogspot, Microsoft Live Spaces, and syndic8 aggregato…'</p>
<hr />
<div>UCLA Blogocenter dataset was built by the The Blogocenter group at UCLA. The dataset contains RSS feeds from the Bloglines, Blogspot, Microsoft Live Spaces, and syndic8 aggregators covering the past several years. The dataset contains over 192 million blog posts. More information about the dataset can be reached from [[RelatedPaper::Sia et all, KDD 2008]].</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=Hassan_et_al,_ICWSM_2009&diff=5004Hassan et al, ICWSM 20092011-03-31T06:59:10Z<p>Reyyan: </p>
<hr />
<div>== Citation ==<br />
<br />
Ahmed Hassan, Dragomir R. Radev, Junghoo Cho, Amruta Joshi. 2009. Content Based Recommendation and Summarization in the Blogosphere. The International Conference on Weblogs and Social Media (ICWSM 2009). <br />
<br />
== Online version ==<br />
<br />
[http://www-personal.umich.edu/~hassanam/my_publications/icwsm09.pdf ICWSM09]<br />
<br />
== Summary ==<br />
<br />
The aim of this [[Category::paper]] is to find the important and influential blogs with recurring interest in a specific topic. Given a set of blogs related to a particular topic, the authors are trying to find a subset of blogs that represents the larger set by using a stochastic graph based method. <br />
<br />
The authors approached this [[AddressesProblem::blog retrieval]] problem with the assumption that important and representative blogs tend to be lexically similar to other important and representative blogs. Therefore they used textual similarity between posts as a way to understand which blog is affecting the others, and so to determine the authorities.<br />
<br />
The authors used a [[UsesMethod::PageRank]]-like algorithm, called BlogRank, to rank the blogs by their popularity. In their algorithm they represented each blog with a node and put an edge between two nodes if they are lexically similar. Iterations over this graph calculate the importance score of a blog by using the scores of its neighbors. <br />
<br />
[[File:BlogRank.jpg]]<br />
<br />
[[UsesMethod::Cosine similarity]] between tf-idf vector representations of posts is used to calculate the text similarity between posts. The authors also used blog-related attributes, such as the number of posts and the average length of posts, as priors. The BlogRank algorithm takes diversity into account and penalizes blogs that are quite similar to already selected blogs.<br />
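As a concrete sketch, cosine similarity over tf-idf vectors can be computed along the following lines. The toy posts and the exact tf-idf weighting are assumptions for illustration; the paper's implementation may differ:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists -> list of {term: tf-idf weight} dicts."""
    n = len(docs)
    df = Counter()                     # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vecs = []
    for doc in docs:
        tf = Counter(doc)              # raw term frequency in this document
        vecs.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vecs

def cosine(u, v):
    # cosine similarity of two sparse vectors stored as dicts
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# hypothetical blog posts, already tokenized
docs = [["blog", "retrieval", "rank"],
        ["blog", "rank", "graph"],
        ["cooking", "recipes"]]
vecs = tfidf_vectors(docs)
```

Posts sharing topical vocabulary ("blog", "rank") score higher than unrelated posts, which is the signal the lexical-similarity graph is built on.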
<br />
The [[UsesDataset::TREC BLOG06]] and [[UsesDataset::UCLA Blogocenter]] datasets were used in the experiments. They used [[UsesMethod::diffusion models]] to measure the performance of their algorithm: initially they marked the selected nodes as active, then applied the diffusion model and counted the number of activated nodes at the end. <br />
<br />
The authors tried several other algorithms to compare with their ranking algorithm. The experiments showed that BlogRank outperforms other methods both in coverage and in running time. They also performed experiments to see whether the BlogRank algorithm can be used for prediction. The results indicated that the BlogRank method generalizes well to future data. <br />
<br />
This work is similar to the Blog Distillation task in the TREC Blog Track. However, in the blog distillation task the aim is to return all blogs relevant to a given query, whereas in this paper, given a set of blogs related to a topic, the aim is to select a smaller representative set of blogs. Some related works are [[RelatedPaper::Arguello et al, ICWSM 2008]] and [[RelatedPaper::Elsas et al, TREC 2007]].</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=User:Reyyan&diff=4999User:Reyyan2011-03-31T06:57:39Z<p>Reyyan: </p>
<hr />
<div>== Reyyan Yeniterzi ==<br />
<br />
[[File:DSC01879.jpg]]<br />
<br />
http://www.cs.cmu.edu/~reyyan/<br />
<br />
Hi, I am Reyyan. I am a second year PhD student in LTI. I am currently working with Jamie Callan on Information Retrieval on Blogs. <br />
<br />
I am interested in social media, especially how and why people interact with it. I am also interested in the behavior of social networks. With this course I am hoping to learn more about these topics and hopefully apply them in a cool project. <br />
<br />
In addition to IR, I am also working on Statistical Machine Translation as my 20% project. I am from Turkey, so I focus mainly on SMT between English and Turkish. In previous years, I worked on projects related to Computational Biology, Medical Informatics and Knowledge Representation.<br />
<br />
Project <br />
* [[Project Ideas - Derry, Reyyan]]<br />
* [[Project 2nd draft Derry Reyyan]]<br />
<br />
Paper Summaries<br />
* [[Akcora et al, SOMA 2010]]<br />
* [[Hassan et al, ICWSM 2009]]<br />
<br />
Algorithms<br />
* [[Jaccard similarity]]<br />
* [[Diffusion models]]<br />
<br />
Data Sets<br />
*[[Emotion Corpus (Upinion)]]<br />
*[[TREC BLOG06]]<br />
<br />
== Related to Information Extraction ==<br />
If you are interested in Information Extraction, below are some links to paper summaries and data sets. Enjoy :) <br />
<br />
Paper Summaries<br />
* [[Borkar et al, SIGMOD 2001]]<br />
* [[Kucuk and Yazici, FQAS 2009]]<br />
* [[Tur et al, NLEJ 2003]]<br />
* [[Cucerzan and Yarowsky, SIGDAT 1999]]<br />
* [[Mota and Grishman, ACL-IJCNLP 2009]]<br />
* [[Pasca, WWW 2007]]<br />
* [[Benajiba and Rosso, LREC 2008]]<br />
* [[Klein et al, CONLL 2003]]<br />
<br />
Paper Presentation<br />
* [[Pasca, CIKM 2007]]<br />
<br />
Data Sets<br />
* Web query data sets<br />
**[[Google Web Queries (Pasca)]]<br />
* Arabic NER data sets<br />
** [[ANERcorp]]<br />
** [[ANERgazet]]</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=TREC_BLOG06&diff=4998TREC BLOG062011-03-31T06:57:15Z<p>Reyyan: Created page with 'BLOG06 is a TREC test collection which has been created and distributed by the University of Glasgow. The dataset contains feeds, permalinks and homepages over an 11 weeks peri…'</p>
<hr />
<div>BLOG06 is a TREC test collection which has been created and distributed by the University of Glasgow. <br />
<br />
The dataset contains feeds, permalinks and homepages over an 11-week period. <br />
* 100,649 feeds<br />
* 3,215,171 permalinks<br />
* 324,880 homepages<br />
<br />
17,969 spam blogs were added to the corpus in order to make it more realistic.<br />
<br />
More information about the dataset can be found in [[RelatedPaper::Macdonald and Ounis 2006]].</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=Hassan_et_al,_ICWSM_2009&diff=4996Hassan et al, ICWSM 20092011-03-31T06:51:44Z<p>Reyyan: </p>
<hr />
<div>== Citation ==<br />
<br />
Ahmed Hassan, Dragomir R. Radev, Junghoo Cho, Amruta Joshi. 2009. Content Based Recommendation and Summarization in the Blogosphere. The International Conference on Weblogs and Social Media (ICWSM 2009). <br />
<br />
== Online version ==<br />
<br />
[http://www-personal.umich.edu/~hassanam/my_publications/icwsm09.pdf ICWSM09]<br />
<br />
== Summary ==<br />
<br />
The aim of this [[Category::paper]] is to find the important and influential blogs with recurring interest in a specific topic. Given a set of blogs related to a particular topic, the authors are trying to find a subset of blogs that represents the larger set by using a stochastic graph based method. <br />
<br />
The authors approached this [[AddressesProblem::blog retrieval]] problem with the assumption that important and representative blogs tend to be lexically similar to other important and representative blogs. Therefore they used textual similarity between posts as a way to understand which blog is affecting the others, and so to determine the authorities.<br />
<br />
The authors used a [[UsesMethod::PageRank]]-like algorithm, called BlogRank, to rank the blogs by their popularity. In their algorithm they represented each blog with a node and put an edge between two nodes if they are lexically similar. Iterations over this graph calculate the importance score of a blog by using the scores of its neighbors. <br />
<br />
[[File:BlogRank.jpg]]<br />
<br />
[[UsesMethod::Cosine similarity]] between tf-idf vector representations of posts is used to calculate the text similarity between posts. The authors also used blog-related attributes, such as the number of posts and the average length of posts, as priors. The BlogRank algorithm takes diversity into account and penalizes blogs that are quite similar to already selected blogs.<br />
<br />
[[UsesDataset::TREC BLOG06]] dataset has been used in the experiments. They used [[UsesMethod::diffusion models]] to measure the performance of their algorithm. Initially they marked the selected nodes as active and then applied the diffusion model and counted the number of activated nodes at the end. <br />
<br />
The authors tried several other algorithms to compare with their ranking algorithm. The experiments showed that BlogRank outperforms other methods both in coverage and in running time. They also performed experiments to see whether the BlogRank algorithm can be used for prediction. The results indicated that the BlogRank method generalizes well to future data. <br />
<br />
This work is similar to the Blog Distillation task in the TREC Blog Track. However, in the blog distillation task the aim is to return all blogs relevant to a given query, whereas in this paper, given a set of blogs related to a topic, the aim is to select a smaller representative set of blogs. Some related works are [[RelatedPaper::Arguello et al, ICWSM 2008]] and [[RelatedPaper::Elsas et al, TREC 2007]].</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=File:BlogRank.jpg&diff=4964File:BlogRank.jpg2011-03-31T05:26:06Z<p>Reyyan: uploaded a new version of "File:BlogRank.jpg"</p>
<hr />
<div></div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=Hassan_et_al,_ICWSM_2009&diff=4963Hassan et al, ICWSM 20092011-03-31T05:25:04Z<p>Reyyan: </p>
<hr />
<div>== Citation ==<br />
<br />
Ahmed Hassan, Dragomir R. Radev, Junghoo Cho, Amruta Joshi. 2009. Content Based Recommendation and Summarization in the Blogosphere. The International Conference on Weblogs and Social Media (ICWSM 2009). <br />
<br />
== Online version ==<br />
<br />
[http://www-personal.umich.edu/~hassanam/my_publications/icwsm09.pdf ICWSM09]<br />
<br />
== Summary ==<br />
<br />
The aim of this [[Category::paper]] is to find the important and influential blogs with recurring interest in a specific topic. Given a set of blogs related to a particular topic, the authors are trying to find a subset of blogs that represents the larger set. <br />
<br />
The authors approach this [[AddressesProblem::blog retrieval]] problem with the assumption that important and representative blogs tend to be lexically similar to other important and representative blogs. Therefore they used textual similarity between posts as a way to understand which blog is affecting the others, and so to determine the authorities.<br />
<br />
The authors used a [[UsesMethod::PageRank]]-like algorithm, called BlogRank, to rank the blogs by their popularity. In their algorithm they represented each blog with a node and put an edge between two nodes if they are lexically similar. Iterations over this graph calculate the importance score of a blog by using the scores of its neighbors. <br />
<br />
[[File:BlogRank.jpg]]<br />
<br />
[[UsesMethod::Cosine similarity]] between tf-idf vector representations of posts is used to calculate the text similarity between posts. <br />
<br />
The [[UsesDataset::TREC BLOG06]] dataset has been used in the experiments. They used [[UsesMethod::diffusion models]] to measure the performance of their algorithm. Initially they marked the selected nodes as active and then applied the diffusion model and counted the number of activated nodes at the end. <br />
<br />
The authors tried several other algorithms to compare with their ranking algorithm. The experiments showed that BlogRank outperforms other methods both in coverage and in running time. They also performed experiments to see whether the BlogRank algorithm can be used for prediction. The results indicated that the BlogRank method generalizes well to future data. <br />
experimented by splitting the data i<br />
<br />
<br />
In addition to detecting these changes, the authors also propose a tf-idf based scoring method to represent the breakpoints. They find the keywords by looking at the tf-idf of the words while making sure that a word from the current time does not increase the prominence of the same word from an older time period.<br />
<br />
The authors report the analysis of the Tiger Woods car accident topic in 2009. They found several possible breaks within the tweets, some of which were related to events from reported news. They were also able to produce prominent words that describe the breakpoint.<br />
<br />
Related to the paper, the authors produced a [http://upinion.cse.buffalo.edu/beta/index.php news tracking application] on Twitter where a user can click on a period to see the events of the period with related prominent words. <br />
<br />
A related work, [[RelatedPaper::Ku et al, AAAI 2006]], also focused on identifying temporal changes in opinion by using language characteristics of Chinese.</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=File:BlogRank.jpg&diff=4962File:BlogRank.jpg2011-03-31T05:24:47Z<p>Reyyan: </p>
<hr />
<div></div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=Diffusion_models&diff=4944Diffusion models2011-03-31T04:52:30Z<p>Reyyan: </p>
<hr />
<div>Diffusion models were originally used in social networks to model the spread of influence in a network. In these models each node is either active or inactive. Over iterations an inactive nodes becomes active as more of its neighbors become active. <br />
<br />
== Linear Threshold Model ==<br />
<br />
The Linear Threshold Model is one of the most popular diffusion models. <br />
<br />
Given <br />
* a set of active nodes as seeds <br />
* a threshold θ for each node selected uniformly at random<br />
<br />
At each step, an inactive node becomes active if the sum of the weights of the edges with active neighbors exceeds the threshold θ.<br />
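The activation rule above can be sketched in a few lines of code. This is a toy network with invented edge weights, not tied to any dataset in these notes; in the standard model the incoming edge weights of each node sum to at most 1:

```python
import random

def linear_threshold(weights, seeds, seed=0):
    """weights[u][v]: influence weight of edge u -> v; seeds: initially active nodes."""
    rng = random.Random(seed)
    nodes = set(weights)
    # each node draws its threshold uniformly at random
    theta = {v: rng.random() for v in sorted(nodes)}
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for v in sorted(nodes - active):
            # total weight of edges coming in from active neighbours
            influence = sum(weights[u].get(v, 0.0) for u in active)
            if influence >= theta[v]:
                active.add(v)
                changed = True
    return active

# toy graph: node 0 influences 1 and 2, node 1 influences 2
weights = {0: {1: 0.6, 2: 0.3},
           1: {2: 0.5},
           2: {}}
final = linear_threshold(weights, seeds={0})
```

Counting the activated nodes at the end of such a cascade gives a coverage-style score for how influential a chosen seed set is.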
<br />
[[File:Ltm.jpg]]</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=User:Reyyan&diff=4942User:Reyyan2011-03-31T04:51:48Z<p>Reyyan: </p>
<hr />
<div>== Reyyan Yeniterzi ==<br />
<br />
[[File:DSC01879.jpg]]<br />
<br />
http://www.cs.cmu.edu/~reyyan/<br />
<br />
Hi, I am Reyyan. I am a second year PhD student in LTI. I am currently working with Jamie Callan on Information Retrieval on Blogs. <br />
<br />
I am interested in social media, especially how and why people interact with it. I am also interested in the behavior of social networks. With this course I am hoping to learn more about these topics and hopefully apply them in a cool project. <br />
<br />
In addition to IR, I am also working on Statistical Machine Translation as my 20% project. I am from Turkey, so I focus mainly on SMT between English and Turkish. In previous years, I worked on projects related to Computational Biology, Medical Informatics and Knowledge Representation.<br />
<br />
Project <br />
* [[Project Ideas - Derry, Reyyan]]<br />
* [[Project 2nd draft Derry Reyyan]]<br />
<br />
Paper Summaries<br />
* [[Akcora et al, SOMA 2010]]<br />
* [[Hassan et al, ICWSM 2009]]<br />
<br />
Algorithms<br />
* [[Jaccard similarity]]<br />
* [[Diffusion models]]<br />
<br />
Data Sets<br />
*[[Emotion Corpus (Upinion)]]<br />
<br />
== Related to Information Extraction ==<br />
If you are interested in Information Extraction, below are some links to paper summaries and data sets. Enjoy :) <br />
<br />
Paper Summaries<br />
* [[Borkar et al, SIGMOD 2001]]<br />
* [[Kucuk and Yazici, FQAS 2009]]<br />
* [[Tur et al, NLEJ 2003]]<br />
* [[Cucerzan and Yarowsky, SIGDAT 1999]]<br />
* [[Mota and Grishman, ACL-IJCNLP 2009]]<br />
* [[Pasca, WWW 2007]]<br />
* [[Benajiba and Rosso, LREC 2008]]<br />
* [[Klein et al, CONLL 2003]]<br />
<br />
Paper Presentation<br />
* [[Pasca, CIKM 2007]]<br />
<br />
Data Sets<br />
* Web query data sets<br />
**[[Google Web Queries (Pasca)]]<br />
* Arabic NER data sets<br />
** [[ANERcorp]]<br />
** [[ANERgazet]]</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=Diffusion_models&diff=4941Diffusion models2011-03-31T04:51:08Z<p>Reyyan: </p>
<hr />
<div>Diffusion models were originally used in social networks to model the spread of influence in a network. In these models each node is either active or inactive. Over iterations an inactive nodes becomes active as more of its neighbors become active. <br />
<br />
== Linear Threshold Model ==<br />
<br />
The Linear Threshold Model is one of the most popular diffusion models. <br />
<br />
Given <br />
* a set of active nodes as seeds <br />
* a threshold θ for each node selected uniformly at random<br />
<br />
At each step, an inactive node becomes active if the sum of the weights of the edges with active neighbors exceeds the threshold θ.<br />
<br />
[[File:Ltm.jpg]]</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=Diffusion_models&diff=4940Diffusion models2011-03-31T04:48:54Z<p>Reyyan: </p>
<hr />
<div>Diffusion models were originally used in social networks to model the spread of influence in a network. In these models each node is either active or inactive. Over iterations an inactive nodes becomes active as more of its neighbors become active. <br />
<br />
== Linear Threshold Model ==<br />
<br />
The Linear Threshold Model is one of the most popular diffusion models. <br />
<br />
Given <br />
* a set of active nodes as seeds <br />
* a threshold θ for each node selected uniformly at random<br />
<br />
At each step, an inactive node becomes active if the sum of the weights of the edges with active neighbors exceeds the threshold θ.<br />
<br />
[[File:Ltm.jpg]]</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=Diffusion_models&diff=4939Diffusion models2011-03-31T04:48:44Z<p>Reyyan: </p>
<hr />
<div>Diffusion models were originally used in social networks to model the spread of influence in a network. In these models each node is either active or inactive. Over iterations an inactive nodes becomes active as more of its neighbors become active. <br />
<br />
== Linear Threshold Model ==<br />
<br />
The Linear Threshold Model is one of the most popular diffusion models. <br />
<br />
Given <br />
* a set of active nodes as seeds <br />
* a threshold θ for each node selected uniformly at random<br />
<br />
At each step, an inactive node becomes active if the sum of the weights of the edges with active neighbors exceeds the threshold θ.<br />
[[File:Ltm.jpg]]</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=File:Ltm.jpg&diff=4936File:Ltm.jpg2011-03-31T04:48:13Z<p>Reyyan: </p>
<hr />
<div></div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=Diffusion_models&diff=4935Diffusion models2011-03-31T04:47:49Z<p>Reyyan: /* Linear Threshold Model */</p>
<hr />
<div>Diffusion models were originally used in social networks to model the spread of influence in a network. In these models each node is either active or inactive. Over iterations an inactive nodes becomes active as more of its neighbors become active. <br />
<br />
== Linear Threshold Model ==<br />
<br />
The Linear Threshold Model is one of the most popular diffusion models. <br />
<br />
Given <br />
* a set of active nodes as seeds <br />
* a threshold θ for each node selected uniformly at random<br />
<br />
At each step, an inactive node becomes active if the sum of the weights of the edges with active neighbors exceeds the threshold θ.<br />
[[File:DSC01879.jpg]]</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=Diffusion_models&diff=4934Diffusion models2011-03-31T04:44:38Z<p>Reyyan: Created page with 'Diffusion models were originally used in social networks to model the spread of influence in a network. In these models each node is either active or inactive. Over iterations an…'</p>
<hr />
<div>Diffusion models were originally used in social networks to model the spread of influence in a network. In these models each node is either active or inactive. Over iterations an inactive nodes becomes active as more of its neighbors become active. <br />
<br />
== Linear Threshold Model ==<br />
<br />
The Linear Threshold Model is one of the most popular diffusion models. <br />
<br />
Given <br />
* a set of active nodes as seeds <br />
* a threshold θ for each node selected uniformly at random<br />
<br />
At each step, an inactive node becomes active if the sum of the weights of the edges with active neighbors exceeds the threshold θ.</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=Hassan_et_al,_ICWSM_2009&diff=4919Hassan et al, ICWSM 20092011-03-31T04:20:13Z<p>Reyyan: </p>
<hr />
<div>== Citation ==<br />
<br />
Ahmed Hassan, Dragomir R. Radev, Junghoo Cho, Amruta Joshi. 2009. Content Based Recommendation and Summarization in the Blogosphere. The International Conference on Weblogs and Social Media (ICWSM 2009). <br />
<br />
== Online version ==<br />
<br />
[http://www-personal.umich.edu/~hassanam/my_publications/icwsm09.pdf ICWSM09]<br />
<br />
== Summary ==<br />
<br />
The aim of this [[Category::paper]] is to find the important and influential blogs with recurring interest in a specific topic. Given a set of blogs related to a particular topic, the authors are trying to find a subset of blogs that represents the larger set. <br />
<br />
The authors approach this [[AddressesProblem::blog retrieval]] problem with the assumption that important and representative blogs tend to be lexically similar to other important and representative blogs. Therefore, they use textual similarity between posts to infer which blog is influencing the others and thereby to determine the authorities.<br />
<br />
The authors use a [[UsesMethod::PageRank]]-like algorithm to rank the blogs by their popularity. Each blog is represented by a node, and an edge connects two nodes if the blogs are lexically similar. Iterating over this graph computes the importance score of a blog from the scores of its neighbors. <br />
<br />
[[UsesMethod::Cosine similarity]] between tf-idf vector representations of posts is used to calculate the textual similarity between posts. <br />
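A toy sketch of this ranking scheme, assuming plain tf-idf weights, an illustrative similarity threshold, and a damped PageRank-style iteration (the constants are not from the paper):<br />

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute a tf-idf vector (as a dict) for each tokenized document."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vecs

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def lexrank(docs, threshold=0.1, d=0.85, iters=50):
    """Rank documents by a PageRank-style walk on the lexical-similarity graph."""
    vecs = tfidf_vectors(docs)
    n = len(docs)
    # Put an edge (i, j) whenever two posts are lexically similar enough.
    adj = [[j for j in range(n) if j != i and cosine(vecs[i], vecs[j]) > threshold]
           for i in range(n)]
    score = [1.0 / n] * n
    for _ in range(iters):
        # Each node's score is built from the scores of its graph neighbors.
        score = [(1 - d) / n +
                 d * sum(score[j] / max(len(adj[j]), 1) for j in adj[i])
                 for i in range(n)]
    return score
```

With three short posts, two lexically similar posts reinforce each other and outrank an unrelated one.<br />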
<br />
The [[UsesDataset::TREC BLOG06]] dataset is used in the experiments, and [[UsesMethod::diffusion models]] are used to measure the performance of the algorithm. <br />
<br />
<br />
<br />
In addition to detecting these changes, the authors also propose a tf-idf based scoring method to represent the breakpoints. They find the keywords by looking at the tf-idf scores of the words, while making sure that a word from the current time period does not increase the prominence of the same word from an older time period.<br />
<br />
The authors report an analysis of the Tiger Woods car accident topic in 2009. They found several possible breakpoints within the tweets, some of which correspond to events from reported news. They were also able to produce prominent words that describe each breakpoint.<br />
<br />
Related to the paper, the authors produced a [http://upinion.cse.buffalo.edu/beta/index.php news tracking application] on Twitter where a user can click on a time period to see the events of that period with its prominent words. <br />
<br />
A related work, [[RelatedPaper::Ku et al, AAAI 2006]], also focused on identifying temporal changes in opinion by using language characteristics of Chinese.</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=Hassan_et_al,_ICWSM_2009&diff=4918Hassan et al, ICWSM 20092011-03-31T04:18:36Z<p>Reyyan: /* Summary */</p>
<hr />
<div>== Citation ==<br />
<br />
Ahmed Hassan, Dragomir R. Radev, Junghoo Cho, Amruta Joshi. 2009. Content Based Recommendation and Summarization in the Blogosphere. The International Conference on Weblogs and Social Media (ICWSM 2009). <br />
<br />
== Online version ==<br />
<br />
[http://www-personal.umich.edu/~hassanam/my_publications/icwsm09.pdf ICWSM09]<br />
<br />
== Summary ==<br />
<br />
The aim of this [[Category::paper]] is to find the important and influential blogs with recurring interest in a specific topic. Given a set of blogs related to a particular topic, the authors are trying to find a subset of blogs that represents the larger set. <br />
<br />
The authors approach this [[AddressesProblem::blog retrieval]] problem with the assumption that important and representative blogs tend to be lexically similar to other important and representative blogs. Therefore, they use textual similarity between posts to infer which blog is influencing the others and thereby to determine the authorities.<br />
<br />
The authors use a [[UsesMethod::PageRank]]-like algorithm to rank the blogs by their popularity. Each blog is represented by a node, and an edge connects two nodes if the blogs are lexically similar. Iterating over this graph computes the importance score of a blog from the scores of its neighbors. <br />
<br />
[[UsesMethod::Cosine similarity]] between tf-idf vector representations of posts is used to calculate the textual similarity between posts. <br />
<br />
The [[UsesDataset::TREC BLOG06]] dataset is used in the experiments. <br />
<br />
<br />
<br />
In addition to detecting these changes, the authors also propose a tf-idf based scoring method to represent the breakpoints. They find the keywords by looking at the tf-idf scores of the words, while making sure that a word from the current time period does not increase the prominence of the same word from an older time period.<br />
<br />
The authors report an analysis of the Tiger Woods car accident topic in 2009. They found several possible breakpoints within the tweets, some of which correspond to events from reported news. They were also able to produce prominent words that describe each breakpoint.<br />
<br />
Related to the paper, the authors produced a [http://upinion.cse.buffalo.edu/beta/index.php news tracking application] on Twitter where a user can click on a time period to see the events of that period with its prominent words. <br />
<br />
A related work, [[RelatedPaper::Ku et al, AAAI 2006]], also focused on identifying temporal changes in opinion by using language characteristics of Chinese.</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=Hassan_et_al,_ICWSM_2009&diff=4916Hassan et al, ICWSM 20092011-03-31T04:14:53Z<p>Reyyan: /* Summary */</p>
<hr />
<div>== Citation ==<br />
<br />
Ahmed Hassan, Dragomir R. Radev, Junghoo Cho, Amruta Joshi. 2009. Content Based Recommendation and Summarization in the Blogosphere. The International Conference on Weblogs and Social Media (ICWSM 2009). <br />
<br />
== Online version ==<br />
<br />
[http://www-personal.umich.edu/~hassanam/my_publications/icwsm09.pdf ICWSM09]<br />
<br />
== Summary ==<br />
<br />
The aim of this [[Category::paper]] is to find the important and influential blogs with recurring interest in a specific topic. Given a set of blogs related to a particular topic, the authors are trying to find a subset of blogs that represents the larger set. <br />
<br />
The authors approach this [[AddressesProblem::blog retrieval]] problem with the assumption that important and representative blogs tend to be lexically similar to other important and representative blogs. Therefore, they use textual similarity between posts to infer which blog is influencing the others and thereby to determine the authorities.<br />
<br />
The authors use a [[UsesMethod::PageRank]]-like algorithm to rank the blogs by their popularity. Each blog is represented by a node, and an edge connects two nodes if the blogs are lexically similar. Iterating over this graph computes the importance score of a blog from the scores of its neighbors. <br />
<br />
After some observations, the authors claim that the emotion patterns and word patterns of tweets change as a result of a change in public opinion. With this in mind, the authors developed an [[UsesDataset::Emotion Corpus (Upinion)]] to detect emotions in tweets.<br />
<br />
Two methods are used to detect opinion changes:<br />
<br />
* Vector Space Model: A binary vector is created for each tweet. Each class of the Emotion Corpus is represented as a dimension of the vector, and a dimension is set to 1 if the tweet contains any emotion word from the corresponding class. The centroid of the vectors in a time interval represents that interval, and [[UsesMethod::Cosine similarity]] between centroid vectors gives the opinion similarity of two intervals. <br />
<br />
* Set Space Model : Each time interval is represented by a single document which is the union of tweets posted in that particular time interval. [[UsesMethod::Jaccard similarity]] is used to find the similarity between two intervals.<br />
<br />
The authors combine these two methods to detect a change and report a breakpoint.<br />
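The two interval-similarity models can be sketched as follows; the emotion classes and lexicon shown are hypothetical stand-ins for the Emotion Corpus:<br />

```python
import math

# Hypothetical emotion lexicon standing in for the Emotion Corpus.
LEXICON = {
    "joy": {"happy", "great"},
    "anger": {"mad", "furious"},
    "sadness": {"sad", "gloomy"},
    "fear": {"scared", "worried"},
}
CLASSES = sorted(LEXICON)

def emotion_vector(tweet_words):
    """Binary vector: one dimension per emotion class, 1 if any class word occurs."""
    return [1 if tweet_words & LEXICON[c] else 0 for c in CLASSES]

def centroid(vectors):
    """Component-wise mean of the tweet vectors in one time interval."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def vsm_similarity(interval_a, interval_b):
    """Vector space model: cosine between the centroids of two intervals."""
    return cosine(centroid([emotion_vector(t) for t in interval_a]),
                  centroid([emotion_vector(t) for t in interval_b]))

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def ssm_similarity(interval_a, interval_b):
    """Set space model: Jaccard overlap of each interval's pooled words."""
    return jaccard(set().union(*interval_a), set().union(*interval_b))
```

A sharp drop in either similarity between consecutive intervals is the signal the combined method looks for when reporting a breakpoint.<br />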
<br />
In addition to detecting these changes, the authors also propose a tf-idf based scoring method to represent the breakpoints. They find the keywords by looking at the tf-idf scores of the words, while making sure that a word from the current time period does not increase the prominence of the same word from an older time period.<br />
<br />
The authors report an analysis of the Tiger Woods car accident topic in 2009. They found several possible breakpoints within the tweets, some of which correspond to events from reported news. They were also able to produce prominent words that describe each breakpoint.<br />
<br />
Related to the paper, the authors produced a [http://upinion.cse.buffalo.edu/beta/index.php news tracking application] on Twitter where a user can click on a time period to see the events of that period with its prominent words. <br />
<br />
A related work, [[RelatedPaper::Ku et al, AAAI 2006]], also focused on identifying temporal changes in opinion by using language characteristics of Chinese.</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=Hassan_et_al,_ICWSM_2009&diff=4915Hassan et al, ICWSM 20092011-03-31T04:14:07Z<p>Reyyan: /* Summary */</p>
<hr />
<div>== Citation ==<br />
<br />
Ahmed Hassan, Dragomir R. Radev, Junghoo Cho, Amruta Joshi. 2009. Content Based Recommendation and Summarization in the Blogosphere. The International Conference on Weblogs and Social Media (ICWSM 2009). <br />
<br />
== Online version ==<br />
<br />
[http://www-personal.umich.edu/~hassanam/my_publications/icwsm09.pdf ICWSM09]<br />
<br />
== Summary ==<br />
<br />
The aim of this [[Category::paper]] is to find the important and influential blogs with recurring interest in a specific topic. Given a set of blogs related to a particular topic, the authors are trying to find a subset of blogs that represents the larger set. <br />
<br />
The authors approach this [[AddressesProblem::blog retrieval]] problem with the assumption that important and representative blogs tend to be lexically similar to other important and representative blogs. Therefore, they use textual similarity between posts to infer which blog is influencing the others and thereby to determine the authorities.<br />
<br />
The authors use a [[UsesMethod::Pagerank]]-like algorithm to rank the blogs by their popularity. Each blog is represented by a node, and an edge connects two nodes if the blogs are lexically similar. Iterating over this graph computes the importance score of a blog from the scores of its neighbors. <br />
<br />
After some observations, the authors claim that the emotion pattern and word pattern of tweets change as a result of a change in public opinion. With this aim in mind, authors developed an [[UsesDataset::Emotion Corpus (Upinion)]] to detect emotions in tweets.<br />
<br />
Two methods are used to detect opinions<br />
<br />
* Vector Space Model : A binary vector has been created for each tweet. Each class of the Emotion Corpus is represented as a dimension in the vector and the value of each dimension is determined by the existence of any emotion word from the related class in the tweet. Centroid of vectors are calculated to represent an interval. [[UsesMethod::Cosine similarity]] is applied to centroid vectors to find the opinion similarity between two intervals. <br />
<br />
* Set Space Model : Each time interval is represented by a single document which is the union of tweets posted in that particular time interval. [[UsesMethod::Jaccard similarity]] is used to find the similarity between two intervals.<br />
<br />
The authors combine these two methods to detect a change and report a breakpoint.<br />
<br />
In addition to detecting these changes, the authors also propose a tf-idf based scoring method to represent the breakpoints. They find the keywords by looking at the tf-idf scores of the words, while making sure that a word from the current time period does not increase the prominence of the same word from an older time period.<br />
<br />
The authors report an analysis of the Tiger Woods car accident topic in 2009. They found several possible breakpoints within the tweets, some of which correspond to events from reported news. They were also able to produce prominent words that describe each breakpoint.<br />
<br />
Related to the paper, the authors produced a [http://upinion.cse.buffalo.edu/beta/index.php news tracking application] on Twitter where a user can click on a time period to see the events of that period with its prominent words. <br />
<br />
A related work, [[RelatedPaper::Ku et al, AAAI 2006]], also focused on identifying temporal changes in opinion by using language characteristics of Chinese.</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=Hassan_et_al,_ACL_2010&diff=4908Hassan et al, ACL 20102011-03-31T03:57:53Z<p>Reyyan: Blanked the page</p>
<hr />
<div></div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=Hassan_et_al,_ACL_2010&diff=4907Hassan et al, ACL 20102011-03-31T03:57:47Z<p>Reyyan: Replaced content with '== Citation ==
Ahmed Hassan, Dragomir R. Radev, Junghoo Cho, Amruta Joshi. 2009. Content Based Recommendation and Summarization in the Blogosphere. The International Confere…'</p>
<hr />
<div>== Citation ==<br />
<br />
Ahmed Hassan, Dragomir R. Radev, Junghoo Cho, Amruta Joshi. 2009. Content Based Recommendation and Summarization in the Blogosphere. The International Conference on Weblogs and Social Media (ICWSM 2009).</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=Hassan_et_al,_ICWSM_2009&diff=4906Hassan et al, ICWSM 20092011-03-31T03:57:27Z<p>Reyyan: Created page with '== Citation == Ahmed Hassan, Dragomir R. Radev, Junghoo Cho, Amruta Joshi. 2009. Content Based Recommendation and Summarization in the Blogosphere. The International Conference …'</p>
<hr />
<div>== Citation ==<br />
<br />
Ahmed Hassan, Dragomir R. Radev, Junghoo Cho, Amruta Joshi. 2009. Content Based Recommendation and Summarization in the Blogosphere. The International Conference on Weblogs and Social Media (ICWSM 2009). <br />
<br />
== Online version ==<br />
<br />
[http://www-personal.umich.edu/~hassanam/my_publications/icwsm09.pdf ICWSM09]<br />
<br />
== Summary ==<br />
<br />
The aim of this [[Category::paper]] is to find the important and influential blogs with recurring interest in a specific topic. Given a set of blogs related to a particular topic, the authors are trying to find a subset of blogs that represents the larger set. <br />
<br />
The authors approach this [[AddressesProblem::blog retrieval]] problem by using [[UsesMethod::vector space models]].<br />
<br />
After some observations, the authors claim that the emotion pattern and word pattern of tweets change as a result of a change in public opinion. With this aim in mind, authors developed an [[UsesDataset::Emotion Corpus (Upinion)]] to detect emotions in tweets.<br />
<br />
Two methods are used to detect opinions<br />
<br />
* Vector Space Model : A binary vector has been created for each tweet. Each class of the Emotion Corpus is represented as a dimension in the vector and the value of each dimension is determined by the existence of any emotion word from the related class in the tweet. Centroid of vectors are calculated to represent an interval. [[UsesMethod::Cosine similarity]] is applied to centroid vectors to find the opinion similarity between two intervals. <br />
<br />
* Set Space Model : Each time interval is represented by a single document which is the union of tweets posted in that particular time interval. [[UsesMethod::Jaccard similarity]] is used to find the similarity between two intervals.<br />
<br />
The authors combine these two methods to detect a change and report a breakpoint.<br />
<br />
In addition to detecting these changes, the authors also propose a tf-idf based scoring method to represent the breakpoints. They find the keywords by looking at the tf-idf scores of the words, while making sure that a word from the current time period does not increase the prominence of the same word from an older time period.<br />
<br />
The authors report an analysis of the Tiger Woods car accident topic in 2009. They found several possible breakpoints within the tweets, some of which correspond to events from reported news. They were also able to produce prominent words that describe each breakpoint.<br />
<br />
Related to the paper, the authors produced a [http://upinion.cse.buffalo.edu/beta/index.php news tracking application] on Twitter where a user can click on a time period to see the events of that period with its prominent words. <br />
<br />
A related work, [[RelatedPaper::Ku et al, AAAI 2006]], also focused on identifying temporal changes in opinion by using language characteristics of Chinese.</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=User:Reyyan&diff=4905User:Reyyan2011-03-31T03:57:20Z<p>Reyyan: </p>
<hr />
<div>== Reyyan Yeniterzi ==<br />
<br />
[[File:DSC01879.jpg]]<br />
<br />
http://www.cs.cmu.edu/~reyyan/<br />
<br />
Hi, I am Reyyan. I am a second year PhD student in LTI. I am currently working with Jamie Callan on Information Retrieval on Blogs. <br />
<br />
I am interested in social media, especially in how and why people interact with it. I am also interested in the behavior of social networks. Through this course I hope to learn more about these topics and, hopefully, apply them in a cool project. <br />
<br />
In addition to IR, I am also working on Statistical Machine Translation as my 20% project. I am from Turkey, so I focus mainly on SMT between English and Turkish. In previous years, I worked on projects related to Computational Biology, Medical Informatics, and Knowledge Representation.<br />
<br />
Project <br />
* [[Project Ideas - Derry, Reyyan]]<br />
* [[Project 2nd draft Derry Reyyan]]<br />
<br />
Paper Summaries<br />
* [[Akcora et al, SOMA 2010]]<br />
* [[Hassan et al, ICWSM 2009]]<br />
<br />
Algorithms<br />
* [[Jaccard similarity]]<br />
<br />
Data Sets<br />
*[[Emotion Corpus (Upinion)]]<br />
<br />
== Related to Information Extraction ==<br />
If you are interested in Information Extraction below are some links to paper summaries and data sets. Enjoy :) <br />
<br />
Paper Summaries<br />
* [[Borkar et al, SIGMOD 2001]]<br />
* [[Kucuk and Yazici, FQAS 2009]]<br />
* [[Tur et al, NLEJ 2003]]<br />
* [[Cucerzan and Yarowsky, SIGDAT 1999]]<br />
* [[Mota and Grishman, ACL-IJCNLP 2009]]<br />
* [[Pasca, WWW 2007]]<br />
* [[Benajiba and Rosso, LREC 2008]]<br />
* [[Klein et al, CONLL 2003]]<br />
<br />
Paper Presentation<br />
* [[Pasca, CIKM 2007]]<br />
<br />
Data Sets<br />
* Web query data sets<br />
**[[Google Web Queries (Pasca)]]<br />
* Arabic NER data sets<br />
** [[ANERcorp]]<br />
** [[ANERgazet]]</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=Hassan_et_al,_ACL_2010&diff=4902Hassan et al, ACL 20102011-03-31T03:56:19Z<p>Reyyan: </p>
<hr />
<div>== Citation ==<br />
<br />
Ahmed Hassan, Dragomir R. Radev, Junghoo Cho, Amruta Joshi. 2009. Content Based Recommendation and Summarization in the Blogosphere. The International Conference on Weblogs and Social Media (ICWSM 2009). <br />
<br />
== Online version ==<br />
<br />
[http://www-personal.umich.edu/~hassanam/my_publications/icwsm09.pdf ICWSM09]<br />
<br />
== Summary ==<br />
<br />
The aim of this [[Category::paper]] is to find the important and influential blogs with recurring interest in a specific topic. Given a set of blogs related to a particular topic, the authors are trying to find a subset of blogs that represents the larger set. <br />
<br />
The authors approach this [[AddressesProblem::blog retrieval]] problem by using [[UsesMethod::vector space models]].<br />
<br />
After some observations, the authors claim that the emotion pattern and word pattern of tweets change as a result of a change in public opinion. With this aim in mind, authors developed an [[UsesDataset::Emotion Corpus (Upinion)]] to detect emotions in tweets.<br />
<br />
Two methods are used to detect opinions<br />
<br />
* Vector Space Model : A binary vector has been created for each tweet. Each class of the Emotion Corpus is represented as a dimension in the vector and the value of each dimension is determined by the existence of any emotion word from the related class in the tweet. Centroid of vectors are calculated to represent an interval. [[UsesMethod::Cosine similarity]] is applied to centroid vectors to find the opinion similarity between two intervals. <br />
<br />
* Set Space Model : Each time interval is represented by a single document which is the union of tweets posted in that particular time interval. [[UsesMethod::Jaccard similarity]] is used to find the similarity between two intervals.<br />
<br />
The authors combine these two methods to detect a change and report a breakpoint.<br />
<br />
In addition to detecting these changes, the authors also propose a tf-idf based scoring method to represent the breakpoints. They find the keywords by looking at the tf-idf scores of the words, while making sure that a word from the current time period does not increase the prominence of the same word from an older time period.<br />
<br />
The authors report an analysis of the Tiger Woods car accident topic in 2009. They found several possible breakpoints within the tweets, some of which correspond to events from reported news. They were also able to produce prominent words that describe each breakpoint.<br />
<br />
Related to the paper, the authors produced a [http://upinion.cse.buffalo.edu/beta/index.php news tracking application] on Twitter where a user can click on a time period to see the events of that period with its prominent words. <br />
<br />
A related work, [[RelatedPaper::Ku et al, AAAI 2006]], also focused on identifying temporal changes in opinion by using language characteristics of Chinese.</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=Hassan_et_al,_ACL_2010&diff=4887Hassan et al, ACL 20102011-03-31T03:01:10Z<p>Reyyan: </p>
<hr />
<div>== Citation ==<br />
<br />
Hassan, A., and D. Radev. 2010. Identifying text polarity using random walks. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 395–403. <br />
<br />
== Online version ==<br />
<br />
[http://www.aclweb.org/anthology/P/P10/P10-1041.pdf ACL]<br />
<br />
== Summary ==<br />
<br />
The aim of this [[Category::paper]] is to identify breakpoints in public opinion by capturing trends of public opinion from Twitter data. The authors approach this [[AddressesProblem::opinion detection]] problem by using [[UsesMethod::vector space models]].<br />
<br />
After some observations, the authors claim that the emotion patterns and word patterns of tweets change as a result of a change in public opinion. With this in mind, the authors developed an [[UsesDataset::Emotion Corpus (Upinion)]] to detect emotions in tweets.<br />
<br />
Two methods are used to detect opinion changes:<br />
<br />
* Vector Space Model: A binary vector is created for each tweet. Each class of the Emotion Corpus is represented as a dimension of the vector, and a dimension is set to 1 if the tweet contains any emotion word from the corresponding class. The centroid of the vectors in a time interval represents that interval, and [[UsesMethod::Cosine similarity]] between centroid vectors gives the opinion similarity of two intervals. <br />
<br />
* Set Space Model : Each time interval is represented by a single document which is the union of tweets posted in that particular time interval. [[UsesMethod::Jaccard similarity]] is used to find the similarity between two intervals.<br />
<br />
The authors combine these two methods to detect a change and report a breakpoint.<br />
<br />
In addition to detecting these changes, the authors also propose a tf-idf based scoring method to represent the breakpoints. They find the keywords by looking at the tf-idf scores of the words, while making sure that a word from the current time period does not increase the prominence of the same word from an older time period.<br />
<br />
The authors report an analysis of the Tiger Woods car accident topic in 2009. They found several possible breakpoints within the tweets, some of which correspond to events from reported news. They were also able to produce prominent words that describe each breakpoint.<br />
<br />
Related to the paper, the authors produced a [http://upinion.cse.buffalo.edu/beta/index.php news tracking application] on Twitter where a user can click on a time period to see the events of that period with its prominent words. <br />
<br />
A related work, [[RelatedPaper::Ku et al, AAAI 2006]], also focused on identifying temporal changes in opinion by using language characteristics of Chinese.</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=Hassan_et_al,_ACL_2010&diff=4886Hassan et al, ACL 20102011-03-31T02:58:19Z<p>Reyyan: Created page with '== Citation ==  Cuneyt Gurcan Akcora, Murat Ali Bayir, Murat Demirbas, Hakan Ferhatosmanoglu, "Identifying BreakPoints in Public Opinion", SOMA 2010, SIGKDD Workshop on Social Me…'</p>
<hr />
<div>== Citation ==<br />
<br />
Cuneyt Gurcan Akcora, Murat Ali Bayir, Murat Demirbas, Hakan Ferhatosmanoglu, "Identifying BreakPoints in Public Opinion", SOMA 2010, SIGKDD Workshop on Social Media Analytics, Washington DC, USA.<br />
<br />
== Online version ==<br />
<br />
[http://upinion.cse.buffalo.edu/beta/SOMApaper.pdf Upinion]<br />
<br />
== Summary ==<br />
<br />
The aim of this [[Category::paper]] is to identify breakpoints in public opinion by capturing trends of public opinion from Twitter data. The authors approach this [[AddressesProblem::opinion detection]] problem by using [[UsesMethod::vector space models]].<br />
<br />
After some observations, the authors claim that the emotion patterns and word patterns of tweets change as a result of a change in public opinion. With this in mind, the authors developed an [[UsesDataset::Emotion Corpus (Upinion)]] to detect emotions in tweets.<br />
<br />
Two methods are used to detect opinion changes:<br />
<br />
* Vector Space Model: A binary vector is created for each tweet. Each class of the Emotion Corpus is represented as a dimension of the vector, and a dimension is set to 1 if the tweet contains any emotion word from the corresponding class. The centroid of the vectors in a time interval represents that interval, and [[UsesMethod::Cosine similarity]] between centroid vectors gives the opinion similarity of two intervals. <br />
<br />
* Set Space Model : Each time interval is represented by a single document which is the union of tweets posted in that particular time interval. [[UsesMethod::Jaccard similarity]] is used to find the similarity between two intervals.<br />
<br />
The authors combine these two methods to detect a change and report a breakpoint.<br />
<br />
In addition to detecting these changes, the authors also propose a tf-idf based scoring method to represent the breakpoints. They find the keywords by looking at the tf-idf scores of the words, while making sure that a word from the current time period does not increase the prominence of the same word from an older time period.<br />
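One way to sketch such a scoring scheme is to treat each time interval as a document for tf-idf; the discount applied to previously prominent words is an illustrative choice, not the paper's exact formula:<br />

```python
import math
from collections import Counter

def interval_tfidf(interval, intervals):
    """tf-idf of each word in one interval, treating each interval as a document."""
    n = len(intervals)
    df = Counter()
    for words in intervals:
        df.update(set(words))
    tf = Counter(interval)
    return {w: tf[w] * math.log(n / df[w]) for w in tf}

def breakpoint_keywords(intervals, t, k=5):
    """Top-k keywords describing interval t, discounting words that were
    already prominent in older intervals."""
    scores = interval_tfidf(intervals[t], intervals)
    for past in intervals[:t]:
        past_scores = interval_tfidf(past, intervals)
        for w in scores:
            # A word that was prominent earlier should not look novel now.
            scores[w] -= past_scores.get(w, 0.0)
    return [w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]
```

For example, with intervals `[["golf", "golf", "tour"], ["golf", "crash", "crash", "news"]]`, the second interval's top keyword is `crash`: `golf` occurs in both intervals, so its idf (and hence its prominence) is zero.<br />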
<br />
The authors report an analysis of the Tiger Woods car accident topic in 2009. They found several possible breakpoints within the tweets, some of which correspond to events from reported news. They were also able to produce prominent words that describe each breakpoint.<br />
<br />
Related to the paper, the authors produced a [http://upinion.cse.buffalo.edu/beta/index.php news tracking application] on Twitter where a user can click on a time period to see the events of that period with its prominent words. <br />
<br />
A related work, [[RelatedPaper::Ku et al, AAAI 2006]], also focused on identifying temporal changes in opinion by using language characteristics of Chinese.</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=User:Reyyan&diff=4885User:Reyyan2011-03-31T02:57:56Z<p>Reyyan: </p>
<hr />
<div>== Reyyan Yeniterzi ==<br />
<br />
[[File:DSC01879.jpg]]<br />
<br />
http://www.cs.cmu.edu/~reyyan/<br />
<br />
Hi, I am Reyyan. I am a second year PhD student in LTI. I am currently working with Jamie Callan on Information Retrieval on Blogs. <br />
<br />
I am interested in social media, especially in how and why people interact with it. I am also interested in the behavior of social networks. Through this course I hope to learn more about these topics and, hopefully, apply them in a cool project. <br />
<br />
In addition to IR, I am also working on Statistical Machine Translation as my 20% project. I am from Turkey, so I focus mainly on SMT between English and Turkish. In previous years, I worked on projects related to Computational Biology, Medical Informatics, and Knowledge Representation.<br />
<br />
Project <br />
* [[Project Ideas - Derry, Reyyan]]<br />
* [[Project 2nd draft Derry Reyyan]]<br />
<br />
Paper Summaries<br />
* [[Akcora et al, SOMA 2010]]<br />
* [[Hassan et al, ACL 2010]]<br />
<br />
Algorithms<br />
* [[Jaccard similarity]]<br />
<br />
Data Sets<br />
*[[Emotion Corpus (Upinion)]]<br />
<br />
== Related to Information Extraction ==<br />
If you are interested in Information Extraction below are some links to paper summaries and data sets. Enjoy :) <br />
<br />
Paper Summaries<br />
* [[Borkar et al, SIGMOD 2001]]<br />
* [[Kucuk and Yazici, FQAS 2009]]<br />
* [[Tur et al, NLEJ 2003]]<br />
* [[Cucerzan and Yarowsky, SIGDAT 1999]]<br />
* [[Mota and Grishman, ACL-IJCNLP 2009]]<br />
* [[Pasca, WWW 2007]]<br />
* [[Benajiba and Rosso, LREC 2008]]<br />
* [[Klein et al, CONLL 2003]]<br />
<br />
Paper Presentation<br />
* [[Pasca, CIKM 2007]]<br />
<br />
Data Sets<br />
* Web query data sets<br />
**[[Google Web Queries (Pasca)]]<br />
* Arabic NER data sets<br />
** [[ANERcorp]]<br />
** [[ANERgazet]]</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=User:Reyyan&diff=4879User:Reyyan2011-03-31T02:37:20Z<p>Reyyan: </p>
<hr />
<div>== Reyyan Yeniterzi ==<br />
<br />
[[File:DSC01879.jpg]]<br />
<br />
http://www.cs.cmu.edu/~reyyan/<br />
<br />
Hi, I am Reyyan. I am a second year PhD student in LTI. I am currently working with Jamie Callan on Information Retrieval on Blogs. <br />
<br />
I am interested in social media, especially in how and why people interact with it. I am also interested in the behavior of social networks. Through this course I hope to learn more about these topics and, hopefully, apply them in a cool project. <br />
<br />
In addition to IR, I am also working on Statistical Machine Translation as my 20% project. I am from Turkey, so I focus mainly on SMT between English and Turkish. In previous years, I worked on projects related to Computational Biology, Medical Informatics, and Knowledge Representation.<br />
<br />
Project <br />
* [[Project Ideas - Derry, Reyyan]]<br />
* [[Project 2nd draft Derry Reyyan]]<br />
<br />
Paper Summaries<br />
* [[Akcora et al, SOMA 2010]]<br />
<br />
Algorithms<br />
* [[Jaccard similarity]]<br />
<br />
Data Sets<br />
*[[Emotion Corpus (Upinion)]]<br />
<br />
== Related to Information Extraction ==<br />
If you are interested in Information Extraction below are some links to paper summaries and data sets. Enjoy :) <br />
<br />
Paper Summaries<br />
* [[Borkar et al, SIGMOD 2001]]<br />
* [[Kucuk and Yazici, FQAS 2009]]<br />
* [[Tur et al, NLEJ 2003]]<br />
* [[Cucerzan and Yarowsky, SIGDAT 1999]]<br />
* [[Mota and Grishman, ACL-IJCNLP 2009]]<br />
* [[Pasca, WWW 2007]]<br />
* [[Benajiba and Rosso, LREC 2008]]<br />
* [[Klein et al, CONLL 2003]]<br />
<br />
Paper Presentation<br />
* [[Pasca, CIKM 2007]]<br />
<br />
Data Sets<br />
* Web query data sets<br />
**[[Google Web Queries (Pasca)]]<br />
* Arabic NER data sets<br />
** [[ANERcorp]]<br />
** [[ANERgazet]]</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=User:Reyyan&diff=4878User:Reyyan2011-03-31T02:33:36Z<p>Reyyan: </p>
<hr />
<div>== Reyyan Yeniterzi ==<br />
<br />
[[File:DSC01879.jpg]]<br />
<br />
http://www.cs.cmu.edu/~reyyan/<br />
<br />
Hi, I am Reyyan. I am a second year PhD student in LTI. I am currently working with Jamie Callan on Information Retrieval on Blogs. <br />
<br />
I am interested in social media, especially in how and why people interact with it. I am also interested in the behavior of social networks. Through this course I hope to learn more about these topics and, hopefully, apply them in a cool project. <br />
<br />
In addition to IR, I am also working on Statistical Machine Translation as my 20% project. I am from Turkey, so I focus mainly on SMT between English and Turkish. In previous years, I worked on projects related to Computational Biology, Medical Informatics, and Knowledge Representation.<br />
<br />
Paper Summaries<br />
* [[Akcora et al, SOMA 2010]]<br />
<br />
Algorithms<br />
* [[Jaccard similarity]]<br />
<br />
Data Sets<br />
*[[Emotion Corpus (Upinion)]]<br />
<br />
== Related to Information Extraction ==<br />
If you are interested in Information Extraction below are some links to paper summaries and data sets. Enjoy :) <br />
<br />
Paper Summaries<br />
* [[Borkar et al, SIGMOD 2001]]<br />
* [[Kucuk and Yazici, FQAS 2009]]<br />
* [[Tur et al, NLEJ 2003]]<br />
* [[Cucerzan and Yarowsky, SIGDAT 1999]]<br />
* [[Mota and Grishman, ACL-IJCNLP 2009]]<br />
* [[Pasca, WWW 2007]]<br />
* [[Benajiba and Rosso, LREC 2008]]<br />
* [[Klein et al, CONLL 2003]]<br />
<br />
Paper Presentation<br />
* [[Pasca, CIKM 2007]]<br />
<br />
Data Sets<br />
* Web query data sets<br />
**[[Google Web Queries (Pasca)]]<br />
* Arabic NER data sets<br />
** [[ANERcorp]]<br />
** [[ANERgazet]]</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=User:Reyyan&diff=4877User:Reyyan2011-03-31T02:32:59Z<p>Reyyan: </p>
<hr />
<div>== Reyyan Yeniterzi ==<br />
<br />
[[File:DSC01879.jpg]]<br />
<br />
http://www.cs.cmu.edu/~reyyan/<br />
<br />
Hi, I am Reyyan. I am a second year PhD student in LTI. I am currently working with Jamie Callan on Information Retrieval on Blogs. <br />
<br />
I am interested in social media especially how and why people interact with it. I am also interested in behaviors of social networks. With this course I am hoping to learn more about these and hopefully apply them in a cool project. <br />
<br />
In addition to IR, I am also working on Statistical Machine Translation as my 20% project. I am from Turkey therefore I focus mainly on SMT between English and Turkish. In my previous years, I worked on projects that are related to Computational Biology, Medical Informatics and Knowledge Representation.<br />
<br />
Paper Summaries<br />
* [[Akcora et al, SOMA 2010]]<br />
<br />
Algorithms<br />
* [[Jaccard Similarity]]<br />
<br />
Data Sets<br />
*[[Emotion Corpus (Upinion)]]<br />
<br />
== Related to Information Extraction ==<br />
If you are interested in Information Extraction below are some links to paper summaries and data sets. Enjoy :) <br />
<br />
Paper Summaries<br />
* [[Borkar et al, SIGMOD 2001]]<br />
* [[Kucuk and Yazici, FQAS 2009]]<br />
* [[Tur et al, NLEJ 2003]]<br />
* [[Cucerzan and Yarowsky, SIGDAT 1999]]<br />
* [[Mota and Grishman, ACL-IJCNLP 2009]]<br />
* [[Pasca, WWW 2007]]<br />
* [[Benajiba and Rosso, LREC 2008]]<br />
* [[Klein et al, CONLL 2003]]<br />
<br />
Paper Presentation<br />
* [[Pasca, CIKM 2007]]<br />
<br />
Data Sets<br />
* Web query data sets<br />
**[[Google Web Queries (Pasca)]]<br />
* Arabic NER data sets<br />
** [[ANERcorp]]<br />
** [[ANERgazet]]</div>Reyyanhttp://curtis.ml.cmu.edu/w/courses/index.php?title=Jaccard_similarity&diff=4876Jaccard similarity2011-03-31T02:21:44Z<p>Reyyan: </p>
<hr />
<div>Jaccard similarity measures the similarity between two sample sets. In this form it applies to binary attribute vectors; an extended version that handles attributes with counts or continuous values is the [[UsesMethod::Tanimoto coefficient]].<br />
<br />
== Algorithm ==<br />
<br />
* Input <br />
:<math> \mathbf{A} : \text{Binary Vector 1}</math><br />
:<math> \mathbf{B} : \text{Binary Vector 2}</math><br />
<br />
A and B have the same size. <br />
<br />
* Computation <br />
<br />
:<math> \mathbf{M_{11}} : \text{the number of attributes where A is 1 and B is 1}</math> <br />
:<math> \mathbf{M_{01}} : \text{the number of attributes where A is 0 and B is 1}</math><br />
:<math> \mathbf{M_{10}} : \text{the number of attributes where A is 1 and B is 0}</math><br />
:<math> \mathbf{M_{00}} : \text{the number of attributes where A is 0 and B is 0}</math><br />
<br />
:<math> \text{Jaccard similarity} = \mathbf{J} = \frac{ M_{11} }{ M_{01} + M_{10} + M_{11} }</math><br />
<br />
:<math> \text{Jaccard dissimilarity} = 1 - \mathbf{J} </math><br />
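As a minimal sketch of the computation above (in Python; the function name is my own, not from any standard library), the counts <math>M_{11}, M_{01}, M_{10}</math> are tallied from the two binary vectors and combined into <math>\mathbf{J}</math>; note that <math>M_{00}</math> does not appear in the formula:

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of two equal-length binary vectors.

    Tallies M11 (both 1), M01 (A=0, B=1), and M10 (A=1, B=0);
    M00 (both 0) is not part of the formula.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    m11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    m01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    m10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    denom = m11 + m01 + m10
    # Two all-zero vectors are identical; treat J as 1 in that edge case.
    return m11 / denom if denom else 1.0

a = [1, 1, 0, 0, 1]
b = [1, 0, 0, 1, 1]
print(jaccard_similarity(a, b))      # M11=2, M01=1, M10=1 -> 2/4 = 0.5
print(1 - jaccard_similarity(a, b))  # Jaccard dissimilarity = 0.5
```

The dissimilarity is simply one minus the similarity, as in the formula above.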
<br />
== Relevant Papers ==<br />
<br />
{{#ask: [[UsesMethod::Jaccard_similarity]]<br />
| ?AddressesProblem<br />
| ?UsesDataset<br />
}}</div>Reyyan