Abstract
It is easy to find expert knowledge on the Internet on almost any topic, but obtaining a complete overview of a given topic is not always easy: information can be scattered across many sources and must be aggregated to be useful. We introduce a method for intelligently growing a list of relevant items, starting from a small seed of examples. Our algorithm takes advantage of the wisdom of the crowd, in the sense that there are many experts who post lists of things on the Internet. We use a collection of simple machine learning components to find these experts and aggregate their lists to produce a single complete and meaningful list. We use experiments with gold standards and open-ended experiments without gold standards to show that our method significantly outperforms the state of the art. Our method uses the ranking algorithm Bayesian Sets even when its underlying independence assumption is violated, and we provide a theoretical generalization bound to motivate its use.



Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.References
Beg MMS, Ahmad N (2003) Soft computing techniques for rank aggregation on the world wide web. World Wide Web 6(1):5–22
Bousquet O, Elisseeff A (2002) Stability and generalization. J Mach Learn Res 2:499–526
Carlson A, Betteridge J, Kisiel B, Settles B, Hruschka ER, Mitchell TM (2010a) Toward an architecture for never-ending language learning. In: Proceedings of the 24th conference on artificial intelligence, AAAI ’10.
Carlson A, Betteridge J, Wang RC, Hruschka ER, Mitchell TM (2010b) Coupled semi-supervised learning for information extraction. In: Proceedings of the 3rd ACM international conference on web search and data mining, WSDM ’10, pp 101–110.
Chang CH, Lui SC (2001) IEPAD: Information extraction based on pattern discovery. In: Proceedings of the 10th international conference on world wide web, WWW ’01, pp 681–688.
Dwork C, Kumar R, Naor M, Sivakumar D (2001) Rank aggregation methods for the web. In: Proceedings of the 10th international conference on world wide web, WWW ’01, pp 613–622.
Etzioni O, Cafarella M, Downey D, Popescu AM, Shaked T, Soderland S, Weld DS, Yates A (2005) Unsupervised named-entity extraction from the web: an experimental study. Artif Intell 165(1):91–134
Freitag D (1998) Information extraction from HTML: application of a general machine learning approach. In: Proceedings of the 15th national conference on artificial intelligence, AAAI ’98, pp 517–523.
Ghahramani Z, Heller KA (2005) Bayesian sets. In: Advances in neural information processing systems 18, NIPS ’05, pp 435–442.
Gupta R, Sarawagi S (2009) Answering table augmentation queries from unstructured lists on the web. Proceedings of the VLDB Endowment 2:289–300
Heller KA, Ghahramani Z (2006) A simple Bayesian framework for content-based image retrieval. In: Proceedings of the IEEE conference on computer vision and pattern recognition, CVPR ’06, pp 2110–2117.
Hsu DF, Taksa I (2005) Comparing rank and score combination methods for data fusion in information retrieval. Inf Retr 8(3):449–480
Jindal P, Roth D (2011) Learning from negative examples in set-expansion. In: Proceedings of the 2011 11th IEEE international conference on data mining, ICDM ’11, pp 1110–1115.
Kozareva Z, Riloff E, Hovy E (2008) Semantic class learning from the web with hyponym pattern linkage graphs. In: Proceedings of the 46th annual meeting of the association for computational linguistics: human language technologies, ACL ’08, pp 1048–1056.
Kushmerick N (1997) Wrapper induction for information extraction. PhD thesis, University of Washington.
Lalmas M (2011) Aggregated search. In: Melucci M, Baeza-Yates R (eds) Advanced topics on information retrieval. Springer, Berlin
Liu B, Grossman R, Zhai Y (2003) Mining data records in web pages. In: Proceedings of the 9th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’03, pp 601–606.
Paşca M (2007a) Organizing and searching the world wide web of facts—step two: harnessing the wisdom of the crowds. In: Proceedings of the 16th international conference on world wide web, WWW ’07, pp 101–110.
Paşca M (2007b) Weakly-supervised discovery of named entities using web search queries. In: Proceedings of the 16th ACM conference on information and knowledge management, CIKM ’07, pp 683–690.
Pantel P, Crestan E, Borkovsky A, Popescu AM, Vyas V (2009) Web-scale distributional similarity and entity set expansion. In: Proceedings of the 2009 conference on empirical methods in natural language processing, EMNLP ’09, pp 938–947.
Renda ME, Straccia U (2003) Web metasearch: rank versus score based rank aggregation methods. In: Proceedings of the 2003 ACM symposium on applied computing, SAC ’03, pp 841–846.
Sadamitsu K, Saito K, Imamura K, Kikui G (2011) Entity set expansion using topic information. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, ACL ’11, vol 2, pp 726–731.
Sarmento L, Jijkoun V, de Rijke M, Oliveira E (2007) “More like these” : growing entity classes from seeds. In: Proceedings of the 16th ACM conference on information and knowledge management, CIKM ’07, pp 959–962.
Soderland S, Cardie C, Mooney R (1999) Learning information extraction rules for semi-structured and free text. Mach Learn 34(1–3):233–272
Tran MV, Nguyen TT, Nguyen TS, Le HQ (2010) Automatic named entity set expansion using semantic rules and wrappers for unary relations. In: Proceedings of the 2010 international conference on Asian language processing, IALP ’10, pp 170–173.
Verma S, Hruschka ER (2012) Coupled bayesian sets algorithm for semi-supervised learning and information extraction. In: Proceedings of the 2012 European conference on machine learning and knowledge discovery in databases, ECML PKDD ’12, pp 307–322.
Wang J, Lochovsky FH (2003) Data extraction and label assignment for web databases. In: Proceedings of the 12th international conference on world wide web, WWW ’03, pp 187–196.
Wang RC, Cohen WW (2007) Language-independent set expansion of named entities using the web. In: Proceedings of the 2007 7th IEEE international conference on data mining, ICDM ’07, pp 342–350.
Wang RC, Cohen WW (2008) Iterative set expansion of named entities using the web. In: Proceedings of the 2008 8th IEEE international conference on data mining, ICDM ’08, pp 1091–1096.
Zhai Y, Liu B (2005) Web data extraction based on partial tree alignment. In: Proceedings of the 14th international conference on world wide web, WWW ’05, pp 76–85.
Zhang L, Liu B (2011) Entity set expansion in opinion documents. In: Proceedings of the 22nd ACM conference on hypertext and hypermedia, HT ’11, pp 281–290.
Acknowledgments
Cynthia Rudin acknowledges funding from MIT Lincoln Laboratory and from NSF IIS-1053407. Katherine Heller acknowledges funding from a NSF postdoctoral fellowship and NIH P30 DA028803.
Author information
Authors and Affiliations
Corresponding author
Additional information
Responsible editor: Hendrik Blockeel, Kristian Kersting, Siegfried Nijssen, Filip Zelezny.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Appendix A implementation details
Appendix A implementation details
Source discovery: This step requires submitting the query “term1” “term2” to a search engine. In our experiments we used Google as the search engine, but any index would suffice. We retrieved the top 100 results.
List extraction: For each site found with the combinatorial search, we look for HTML tags around the seed items. We use the following lines of HTML to illustrate:
\(<\) \(\mathtt{h}2\) \(>\) \(<\) \(\mathtt{b}\) \(>\) \(<\)a href=“example1.com”\(>\) Boston Harborfest \(<\) \(\mathtt{/a}\) \(>\) \(<\) \(\mathtt{/b}\) \(>\) \(<\) \(\mathtt{/h}2\) \(>\)
\(<\) \(\mathtt{b}\) \(>\) \(<\)a href=“example2.com”\(>\) Jimmy fund scooper bowl \(<\) \(\mathtt{/a}\) \(>\) \(<\) \(\mathtt{/b}\) \(>\)
\(<\) \(\mathtt{b}\) \(>\) \(<\)a href =“example3.com”\(>\) the Boston Arts Festival 2012\(<\) \(\mathtt{/a}\) \(>\) \(<\) \(\mathtt{/b}\) \(>\)
\(<\) \(\mathtt{h3}\) \(>\) \(<\) \(\mathtt{b}\) \(>\) \(<\)a href=“example4.com”\(>\) Boston bacon takedown \(<\) \(\mathtt{/a}\) \(>\) \(<\) \(\mathtt{/b}\) \(>\) \(<\mathtt{/h3}\) \(>\)
\(<\)a href=“example5.com”\(>\) Just a url \(<\) \(\mathtt{/a}\) \(>\)
For each of the two seed items used to discover this source, we search the HTML for the pattern:
In the above example, if the first seed item is “Boston arts festival,” then it matches the pattern with the HTML tags: \(<\) \(\mathtt{b}\) \(>\) \(<\) \(\mathtt{a}\) \(>\). If the second seed item is “Boston harborfest,” it matches the pattern with HTML tags: \(<\) \(\mathtt{h2}\) \(>\) \(<\) \(\mathtt{b}\) \(>\) \(<\) \(\mathtt{a}\) \(>\). We then find the largest set of HTML tags that are common to both seed items, for this site. In this example, “Boston arts festival” does not have the \(<\) \(\mathtt{h2}\) \(>\) tag, so the largest set of common tags is: \(<\) \(\mathtt{b}\) \(>\) \(<\) \(\mathtt{a}\) \(>\). If there are no HTML tags common to both seed items, we discard the site. Otherwise, we extract all items on the page that use the same HTML tags. In this example, we extract everything with both a \(<\) \(\mathtt{b}\) \(>\) and an \(<\) \(\mathtt{a}\) \(>\) tag, which means “Jimmy fund scooper bowl” and “Boston bacon takedown,” but not “Just a url.”
In our experiments, to avoid search spam sites with extremely long lists of unrelated keywords, we reject sources that return more than 300 items. We additionally applied a basic filter rejecting items of more than 60 characters or items consisting of only numbers and punctuation. No other processing was done.
Feature space: We do separate Google searches for each item we have extracted to find the set of webpages containing it. We use quotes around the query term, discard results when Google’s spelling correction system modifies the query, and retrieve the top 300 search results.
Ranking: In all of our experiments we took \(\kappa _2=5\) and \(\kappa _1=2\), similarly to that done in Heller and Ghahramani (2006).
Feedback: To avoid filling the seed with duplicate items like “Boston arts festival” and “The boston arts festival 2012,” in our implicit feedback we do not add items to the seed if they are a sub- or super-string of a current seed item.
Access of online tools: The results for Boo!Wa! and Google Sets used in Sect.4.1 were retrieved from their respective online tools on November 14, 2012, and those used in Sect. 4.3 were retrieved on December 13, 2012. Results for our algorithm, which depend on Google search results and the content of the source webpages, were obtained in the period March 14–21, 2013 for the Wikipedia gold-standard experiments in Sects. 4.1 and 4.2, and in the period May 28–30, 2012 for the open-ended experiments in Sect. 4.3.
Rights and permissions
About this article
Cite this article
Letham, B., Rudin, C. & Heller, K.A. Growing a list. Data Min Knowl Disc 27, 372–395 (2013). https://doi.org/10.1007/s10618-013-0329-7
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10618-013-0329-7