Abstract
Data is one of the most important digital assets in the world and its availability on the web is increasing. To use it effectively, we need tools that can retrieve the most relevant datasets to match our information needs. Web search engines are not well suited for this task, as they are designed primarily for documents, not data. In this paper, we present the first query log analysis for dataset search, based on logs of four national open data portals. Our aim is to gain a better understanding of the typical users of these portals and the types of queries they issue, and frame the findings in the broader context of dataset search. The logs suggest that queries issued on data portals differ from those issued to web search engines in their length and structure. From the analysis we could also infer that the portals are used exploratively, rather than to answer focused questions. These insights can inform the design of more effective dataset retrieval technology, and improve the user experience of data portals.
1 Introduction
Data has become one of the most important digital assets in the world, with potential for generating business value and social impact. As we advance in the digital age, more and more of the data we generate can be accessed (or purchased) online. Cafarella et al. estimated more than one billion sources of data on the web as of February 2011, counting structured data extracted from Web pages [3], and the Web Data Commons project recently extracted 233 million data tables from the Common Crawl [12].
A growing number of organisations, mostly in the public sector, have set up their own data portals to publish datasets related to their activities. The European Data Portal indexes to date 638,817 datasets published by regional and national authorities in 28 EU countries. Similar trends can be observed in the commercial sector and in science, as open access and reproducibility become mainstream across subjects and research communities. Data is used in a variety of professional roles: whether it is a business analyst searching for evidence to substantiate their report or a scientist replicating an experiment, the first step is to find the most relevant datasets for their needs. In our previous work we asked people to share their experiences when carrying out this task [10]. Using conventional search engines is not ideal, as these have been designed primarily for documents, not data [3]. Kunze et al. have recently introduced the concept of dataset retrieval as a branch of information retrieval applied to data instead of documents, focused on determining the most relevant datasets for a user query [11]. Additionally, there have been proposals for special-purpose search systems tailored to datasets in a given domain, including hydrology [1], earth sciences [4], and datasets from scientific experiments [15]. However, they are mostly restricted to their specific domains and rely on metadata that is created manually.
We advocate for a user-driven approach, where the needs of dataset searchers are analysed and understood, serving as a basis for the research and development of search functionalities for data. Through the analysis of query logs we explore how people use existing data portals to look for data. We analysed query logs from four open data portals: three belong to the national governments of the UK, Australia, and Canada, and the fourth to the UK’s Office for National Statistics. Together, the logs include more than 2.2 million queries (of which 1.3 million are unique queries), issued between 2013 and 2016.
Our study considered the following questions: (i) What are the main characteristics of queries in terms of length, distribution, and structure? (ii) Can we identify types of queries? Which ones are the most common? (iii) How do these metrics differ between dataset and web search?
2 Background and Related Work
Analysing query logs serves as a proxy for analysing the search behaviour of users [9]. The first query log analysis on the web was carried out for the Altavista search engine [14], and the technique has since been used to study several aspects of web search (see [7] for a survey). As shown in [6], which reports a transaction log analysis of nine search engines, the results of different search log analyses are not directly comparable. Query log analysis has also been applied to vertical search engines, for example people search engines [18] and digital libraries [8]. To the best of our knowledge, this is the first query log analysis study for dataset search.
Various metrics for analysing query logs were developed in the area of general web search; a summary of the ones used in this study is shown in Table 1.
3 Data and Methods
Query logs. Four well-known data portals provided their query logs to us. All portals collect log data using Google Analytics, but with different settings. As a consequence, the collected information and time frame vary per portal.
In Table 2 we present the number of queries for each portal and the time frame in which they were collected. We had data on three types of information objects available for analysis: queries, sessions and users. A query object comprises the following fields: search terms and total unique searches. The search terms of a query consist of the string typed into the search box of the portal. None of the sites had any additional event tracking, e.g., click-through data, configured.
Pre-processing. We first merged potentially identical queries, for example, those that differ only in letter case (LIDAR vs. lidar) or because of spelling mistakes (London vs. Londno). To do so, we applied two clustering methods to the raw data, Fingerprint and N-Gram Fingerprint, using the OpenRefine framework. Next, we discarded outliers in terms of length. \(99.9\%\) of all queries had fewer than 19 words. We considered longer queries likely to be the result of accidentally pasting text into the search box and discarded them from our analysis.
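To make the pre-processing step concrete, the following is a minimal Python sketch of the two key-collision ideas and of the length filter. It is not the OpenRefine implementation: the function names, the simplified normalisation rules, and the choice of n are assumptions made for illustration only.

```python
import re
import unicodedata
from collections import defaultdict

# Assumption: "punctuation" is anything that is neither a word character nor whitespace.
_PUNCT = re.compile(r"[^\w\s]")


def _to_ascii(text):
    # Drop diacritics so that accented and unaccented spellings share a key.
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


def fingerprint(query):
    # Lowercase, strip punctuation, then sort and de-duplicate the tokens.
    cleaned = _PUNCT.sub("", _to_ascii(query.strip().lower()))
    return " ".join(sorted(set(cleaned.split())))


def ngram_fingerprint(query, n=2):
    # Same cleaning, but whitespace is removed and character n-grams are used
    # instead of word tokens; small n tolerates transposed letters.
    cleaned = re.sub(r"\s+", "", _PUNCT.sub("", _to_ascii(query.lower())))
    grams = {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}
    return "".join(sorted(grams))


def cluster(queries, key=fingerprint):
    # Group queries sharing a key; only groups with more than one member are candidates for merging.
    groups = defaultdict(list)
    for q in queries:
        groups[key(q)].append(q)
    return [g for g in groups.values() if len(g) > 1]


def drop_long_queries(queries, max_words=18):
    # Keep queries with fewer than 19 words, mirroring the length cut-off described above.
    return [q for q in queries if len(q.split()) <= max_words]
```

In this sketch, LIDAR and lidar collide under `fingerprint`, while London and Londno collide under `ngram_fingerprint` with `n=1` but not with `n=2`, which illustrates why the n-gram size matters when merging misspellings.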
Analysis. The statistics computed in this analysis include: query length, including the average query length and the length distribution for all and unique queries; query characteristics, i.e., queries containing keywords describing location, time frame, file and dataset type, numbers, and abbreviations (described in detail in Table 4); and question queries. To recognise question queries we counted queries containing the words: what, who, where, when, why, how, which, whom, whose, whether, did, do, does, am, are, is, will, have, has, as done in [2].
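The question-query criterion above can be sketched as a simple word-list check. The tokenisation rule and the function names below are assumptions; the word list is the one given in the text.

```python
import re

# Word list taken from the description above.
QUESTION_WORDS = frozenset(
    "what who where when why how which whom whose whether "
    "did do does am are is will have has".split()
)


def is_question_query(query):
    # Assumption: tokens are maximal runs of ASCII letters, compared case-insensitively.
    tokens = re.findall(r"[a-z]+", query.lower())
    return any(token in QUESTION_WORDS for token in tokens)


def question_share(queries):
    # Fraction of queries flagged as question queries (0.0 for an empty log).
    if not queries:
        return 0.0
    return sum(is_question_query(q) for q in queries) / len(queries)
```

A check of this kind is deliberately crude: a query for the organisation WHO, for instance, would also be flagged, so the resulting share is best read as an upper-bound estimate.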
4 Results
This section describes the results of our analysis, reporting on query length and query types. Where applicable we provide related results from web search log analyses. However, as detailed in Sect. 6 on limitations, our results cannot be directly compared with web search - due to the different nature of data - but only serve as an indication of differences for the reader.
Query Length. Table 3 shows the average query length. Almost 90% of all queries have between one and three words, with an average of 2.03 words for all queries and 2.67 for the unique ones. Figure 1 shows the average percentage of queries according to their length, across all portals, for both all and unique internal queries. When considering all queries, single-word queries represent almost half of the entire corpus. When focusing just on unique queries, this share falls to \(25\%\). The distribution for unique queries is very similar to the results reported for web search engines in 2001 by Spink [16]. It could be that advances in dataset search will lead to behaviour patterns similar to those observed in web search today, where queries have been steadily growing in length over time [17]. This would mean longer queries, closer to natural language, as technologies and tools improve.
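The difference between the "all queries" and "unique queries" figures can be illustrated with a short Python sketch. It assumes each log entry is a pair of search terms and the number of times they were issued; the function names and the toy log are hypothetical and not taken from the portals' data.

```python
from collections import Counter


def word_count(query):
    # Words are whitespace-separated tokens.
    return len(query.split())


def average_lengths(log):
    # log: list of (search_terms, count) pairs, one pair per unique query.
    total = sum(count for _, count in log)
    avg_all = sum(word_count(q) * count for q, count in log) / total
    avg_unique = sum(word_count(q) for q, _ in log) / len(log)
    return avg_all, avg_unique


def length_distribution(log, weighted=True):
    # Share of queries per length; weighted=True counts every issued query,
    # weighted=False counts each unique query once.
    counts = Counter()
    for q, count in log:
        counts[word_count(q)] += count if weighted else 1
    total = sum(counts.values())
    return {length: n / total for length, n in sorted(counts.items())}


# Hypothetical toy log, for illustration only.
toy_log = [("lidar", 6), ("house prices", 3), ("road traffic accidents 2015", 1)]
avg_all, avg_unique = average_lengths(toy_log)
```

In the toy log the frequently repeated queries are the short ones, so the average over all queries (1.6 words) is lower than the average over unique queries (about 2.33 words). A similar effect would be consistent with the gap between 2.03 and 2.67 reported above.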
Query Types. Having learned that people seem to be interested in the provenance of the datasets they are looking for, we explored whether location, time, or specific publishing formats are in their focus.
Location, temporal, filetype, numerical, and acronym queries. Table 4 summarises the percentage of queries for the individual metrics. \(5.44\%\) of queries contain location keywords. A previous study on general web search [5] reports this figure at \(12.01\%\). This difference could be caused by the fact that data portals are already area-bound and users do not need to specify the location as frequently as in web search.
\(7.29\%\) of the queries contain temporal information. This number is much higher than the \(1.5\%\) reported for general web search [13]. This may mean that, for datasets, users have more interest in the time frame in which the data was created or the frequency of data releases and updates.
\(6.25\%\) of queries included common filetypes. We note that the governmental portals represented in this study offer filtering options for file types that are not reflected in our data - the actual number of queries filtering by a file format could be higher. From an interface point of view, this suggests the filters might not be prominent enough. From a data point of view, this figure is an indication that users search for particular filetypes and formats and that publishers need to be able to support different and popular formats for their data.
\(5.23\%\) of queries contain numerical values other than temporal information. Notably, we discovered searches for age and distance intervals associated with datasets (e.g., wage age 18–24 or LIDAR 25 meters), which might indicate a need for retrieving subsets or slices of datasets.
In our analysis, we also identified that users frequently use abbreviations in their queries, as many datasets use acronyms like rpi for Retail Price Index. \(5.11\%\) of queries contained at least one acronym. However, the full expansion of those acronyms is also used in queries.
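As an illustration of how queries might be tagged with such characteristics, the sketch below flags temporal, file type, and numerical queries. The keyword lists and patterns are hypothetical stand-ins, since the actual definitions are those in Table 4; location and acronym detection are left out because they would require a gazetteer and an acronym list.

```python
import re

# Hypothetical keyword lists; the study's actual definitions are given in Table 4.
FILETYPES = frozenset({"csv", "xls", "xlsx", "json", "xml", "pdf", "shp", "kml"})
TEMPORAL_WORDS = frozenset({
    "january", "february", "march", "april", "june", "july", "august",
    "september", "october", "november", "december",
    "weekly", "monthly", "quarterly", "annual", "yearly",
})
YEAR = re.compile(r"(?:19|20)\d{2}")  # assumption: four-digit years from 1900 to 2099


def tag_query(query):
    # Split into runs of letters and runs of digits, so "25m" yields "25" and "m".
    tokens = re.findall(r"[a-z]+|\d+", query.lower())
    temporal = {t for t in tokens if YEAR.fullmatch(t) or t in TEMPORAL_WORDS}
    tags = set()
    if temporal:
        tags.add("temporal")
    if any(t in FILETYPES for t in tokens):
        tags.add("filetype")
    # Numbers already counted as temporal are not counted again as numerical.
    if any(t.isdigit() and t not in temporal for t in tokens):
        tags.add("numeric")
    return tags
```

With these assumed rules, a query such as wage age 18–24 or LIDAR 25 meters is tagged as numerical only, whereas a query containing a year and a file extension receives both the temporal and the file type tag.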
Question queries. These are increasingly common in web search queries, thanks to advances in speech recognition and conversational search interfaces. This is not yet the case for dataset search - less than \(1\%\) of queries in our logs are question queries, significantly below the \(7.49\%\) reported by [2] for web search. We believe this is mostly due to the way portals are used: as a source of data to be downloaded for further use, and not as a question-answering engine.
5 Discussion and Implications
Our analysis shows that the ratio of question queries differs between dataset search and general web search, and that dataset queries are generally short: on average one word shorter than web search queries, as per the 2011 report by [17]. We believe short queries potentially indicate a lack of trust that the search functionality will be able to provide relevant data for longer queries. It appears that users currently tend to treat the search field of a data portal as a starting point for further exploration. The categories and metadata attributes used in data portals, and the ability to link from one dataset to another, become key in enabling users to find what they are looking for. Our analysis of query characteristics offers a starting point for data portal designers to refine their schemas and vocabularies.
We believe both temporal search, which is more prevalent than in document search, and geospatial search require better support. In both cases, relevant keywords occur at different levels of granularity (e.g., months vs. years, cities vs. regions or countries), which are not always matched by the publishing practices of the data owners. While the data portals we analysed are location-bound to a country and most datasets hold national data, supporting question-answering and dataset search scenarios will require more advanced geospatial indexing and reasoning features.
Queries including some indication of time were almost five times more frequent than in web search (\(7.29\%\) vs. \(1.5\%\)), suggesting that datasets have a stronger relationship to time than documents. This could be due, for instance, to periodic updates or to the temporal coverage of the data. Currently, time references are recorded in the title of a dataset or as metadata attributes. This motivates indexing data along time dimensions and generating temporal information for datasets that do not have it. The same is true of queries using numbers, abbreviations and named entities such as economic indicators. These kinds of information are likely to be found either in well-maintained, rich metadata descriptors or, if data portals expanded their search capabilities, in the content of datasets.
To get more accurate results for users’ information needs, portals could encourage users to issue more specific or longer queries. One solution might be to implement query recommendations based on the strongest co-occurrences of words in the datasets or in past search queries. However, to achieve better results in dataset search, we also need to improve the way datasets are indexed. We believe that automatically generated descriptions of the content of datasets, together with encouraging users to issue longer queries, could improve this process.
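The idea of recommending query terms from co-occurrences in past search queries could be prototyped along the following lines. This is only a sketch under simplifying assumptions (whitespace tokenisation, raw co-occurrence counts as the score); the function names and example queries are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(queries):
    # Count how often two distinct terms appear together in the same past query.
    cooccurrence = defaultdict(Counter)
    for query in queries:
        terms = sorted(set(query.lower().split()))
        for a, b in combinations(terms, 2):
            cooccurrence[a][b] += 1
            cooccurrence[b][a] += 1
    return cooccurrence


def suggest_terms(query, cooccurrence, k=3):
    # Score candidate terms by their summed co-occurrence with the terms already typed.
    typed = set(query.lower().split())
    scores = Counter()
    for term in typed:
        for other, count in cooccurrence.get(term, {}).items():
            if other not in typed:
                scores[other] += count
    return [term for term, _ in scores.most_common(k)]


# Hypothetical past queries, for illustration only.
past_queries = ["house prices london", "house prices 2015", "house prices", "crime london"]
cooc = build_cooccurrence(past_queries)
```

With these example queries, a user typing house would be offered prices first. A production system would likely normalise counts (for instance with pointwise mutual information) so that very frequent terms do not dominate every suggestion.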
6 Limitations of Our Study
Comparing search log analyses is difficult, as concluded by [6] in their study comparing nine search engines by their transaction logs (over 1 billion queries in total). Even within web search, findings resulting from the analysis of one search engine cannot be applied to all web search engines. The comparison of our results with web search therefore needs to be treated with even more caution, due to the different nature of the collected data. However, we believe that including data from several countries and different audiences increases the generalisability of our analysis.
Our study is based on dataset search engines that are part of governmental open data portals. Further studies with other kinds of dataset search engines are required before drawing general conclusions.
As we did not have control over the analytics being collected by each data portal, the time frames were different for each. In cases where all queries were considered, there is a bias towards DGU, as the portal with the most available data.
7 Conclusions and Future Work
We have presented the first analysis of query log data for the search vertical of dataset retrieval, based on query logs of four national data portals. Our findings can be summarised as: (i) Dataset queries are generally short. (ii) The portals are used exploratively, rather than to answer focused questions. (iii) There is a difference in topics between dataset queries issued directly to data portals and general web search queries.
As future work, we would like to (i) Analyse query log data from commercial dataset search engines, to identify differences and similarities with this study. (ii) Extend our study to click-through data: knowing which dataset pages users visited after performing a search and if a user downloaded them can prove invaluable to evaluate the effectiveness of the dataset search. (iii) Create a dataset search corpus in order to evaluate dataset search engines. (iv) Develop metrics specifically tailored for the analysis of dataset search logs, due to the unique characteristics of this vertical.
References
1. Ames, D.P., Horsburgh, J.S., Cao, Y., Kadlec, J., Whiteaker, T.L., Valentine, D.: HydroDesktop: web services-based software for hydrologic data discovery, download, visualization, and analysis. Environ. Model. Softw. 37, 146–156 (2012)
2. Bendersky, M., Croft, W.B.: Analysis of long queries in a large scale search log. In: Proceedings of the 2009 Workshop on Web Search Click Data, pp. 8–14. ACM (2009)
3. Cafarella, M.J., Halevy, A., Madhavan, J.: Structured data on the web. Commun. ACM 54(2), 72–79 (2011)
4. Devarakonda, R., Palanisamy, G., Wilson, B.E., Green, J.M.: Mercury: reusable metadata management, data discovery and access system. Earth Sci. Inform. 3(1), 87–94 (2010)
5. Gan, Q., Attenberg, J., Markowetz, A., Suel, T.: Analysis of geographic queries in a search engine log. In: Proceedings of the First International Workshop on Location and the Web, pp. 49–56. ACM (2008)
6. Jansen, B.J., Spink, A.: How are we searching the world wide web? A comparison of nine search engine transaction logs. Inf. Process. Manag. 42(1), 248–263 (2006)
7. Jiang, D., Pei, J., Li, H.: Mining search and browse logs for web search: a survey. ACM Trans. Intell. Syst. Technol. 4(4), 57:1–57:37 (2013)
8. Jones, S., Cunningham, S.J., McNab, R., Boddie, S.: A transaction log analysis of a digital library. Int. J. Digit. Libr. 3(2), 152–169 (2000)
9. Kelly, D.: Methods for evaluating interactive information retrieval systems with users. Found. Trends Inf. Retrieval 3(1–2), 1–224 (2009)
10. Koesten, L.M., Kacprzak, E., Tennison, J., Simperl, E.: The trials and tribulations of working with structured data - a study on information seeking behaviour. In: Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, CHI 2017. ACM (2017, to appear)
11. Kunze, S.R., Auer, S.: Dataset retrieval. In: 2013 IEEE Seventh International Conference on Semantic Computing, September 2013
12. Lehmberg, O., Ritze, D., Meusel, R., Bizer, C.: A large public corpus of web tables containing time and context metadata. In: Proceedings of the 25th International Conference Companion on World Wide Web, pp. 75–76 (2016)
13. Nunes, S., Ribeiro, C., David, G.: Use of temporal expressions in web search. In: Macdonald, C., Ounis, I., Plachouras, V., Ruthven, I., White, R.W. (eds.) ECIR 2008. LNCS, vol. 4956, pp. 580–584. Springer, Heidelberg (2008). doi:10.1007/978-3-540-78646-7_59
14. Silverstein, C., Marais, H., Henzinger, M., Moricz, M.: Analysis of a very large web search engine query log. ACM SIGIR Forum 33(1), 6–12 (1999)
15. Singhal, A., Kasturi, R., Sivakumar, V., Srivastava, J.: Leveraging web intelligence for finding interesting research datasets. In: IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technologies (2013)
16. Spink, A., Wolfram, D., Jansen, M.B., Saracevic, T.: Searching the web: the public and their queries. J. Am. Soc. Inf. Sci. Technol. 52(3), 226–234 (2001)
17. Taghavi, M., Patel, A., Schmidt, N., Wills, C., Tew, Y.: An analysis of web proxy logs with query distribution pattern approach for search engines. Comput. Stand. Interfaces 34(1), 162–170 (2012)
18. Weerkamp, W., Berendsen, R., Kovachev, B., Meij, E., Balog, K., de Rijke, M.: People searching for people: analysis of a people search engine log. In: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (2011)
Acknowledgement
This project is supported by the European Union Horizon 2020 program under the Marie Sklodowska-Curie grant agreement No. 642795.
Copyright information
© 2017 Springer International Publishing AG
Cite this paper
Kacprzak, E., Koesten, L.M., Ibáñez, LD., Simperl, E., Tennison, J. (2017). A Query Log Analysis of Dataset Search. In: Cabot, J., De Virgilio, R., Torlone, R. (eds) Web Engineering. ICWE 2017. Lecture Notes in Computer Science(), vol 10360. Springer, Cham. https://doi.org/10.1007/978-3-319-60131-1_29
DOI: https://doi.org/10.1007/978-3-319-60131-1_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-60130-4
Online ISBN: 978-3-319-60131-1
eBook Packages: Computer Science, Computer Science (R0)