
1 Introduction

Data has become one of the most important digital assets in the world, with its potential for generating business value and social impact. As we advance in the digital age, more and more of the data we generate can be accessed (or purchased) online. Cafarella et al. estimate more than one billion sources of data on the web as of February 2011, counting structured data extracted from Web pages [3], and the Web Data Commons project recently extracted 233 million data tables from the Common Crawl [12].

A growing number of organisations, mostly in the public sector, have set up their own data portals to publish datasets related to their activities. The European Data Portal (Footnote 1) indexes to date 638,817 datasets published by regional and national authorities in 28 EU countries. Similar trends can be observed in the commercial sector and in science, as open access and reproducibility become mainstream across subjects and research communities. Data is used in a variety of professional roles: whether it is a business analyst searching for evidence to substantiate their report, or a scientist replicating an experiment, the first and foremost step is to find and retrieve the datasets most relevant to their needs. In our previous work we asked people to share their experiences when carrying out this task [10]. Using conventional search engines is not ideal, as these have been designed primarily for documents, not data [3]. Kunze et al. have recently introduced the concept of dataset retrieval as a branch of information retrieval that is applied to data instead of documents and focuses on determining the most relevant datasets for a user query [11]. Additionally, there have been proposals for special-purpose search systems tailored for datasets in a given domain, including hydrology [1], earth sciences [4], and datasets from scientific experiments [15]. However, these are mostly restricted to their specific domains and rely on manually created metadata.

We advocate a user-driven approach, in which the needs of dataset searchers are analysed and understood, serving as a basis for the research and development of search functionalities for data. Through the analysis of query logs we explore how people use existing data portals to look for data. We analysed query logs from four open data portals: three run by the national governments of the UK, Australia, and Canada, and a fourth by the UK's Office for National Statistics. Together, the logs include more than 2.2 million queries (of which 1.3 million are unique), issued between 2013 and 2016.

Our study considered the following questions: (i) What are the main characteristics of queries in terms of length, distribution, and structure? (ii) Can we identify types of queries, and which ones are the most common? (iii) How do these metrics differ between dataset search and web search?

2 Background and Related Work

Analysing query logs serves as a proxy for analysing the search behaviour of users [9]. The first query log analysis on the web was carried out for the AltaVista search engine [14], and the technique has since been used to study several aspects of web search (see [7] for a survey). As shown in [6], which reports a transaction log analysis of nine search engines, the results of different search log analyses are not directly comparable. Query log analysis has also been applied to vertical search engines, for example people search engines [18] and digital libraries [8]. To the best of our knowledge, this is the first query log analysis study for dataset search.

Various metrics for analysing query logs were developed in the area of general web search; a summary of the ones used in this study is shown in Table 1.

Table 1. Metrics from web search studies used in this study

3 Data and Methods

Query logs. Four well-known data portals provided their query logs to us. All portals collect log data using Google Analytics, but with different settings; as a consequence, the collected information and the covered time frame vary per portal.

In Table 2 we present the number of queries for each portal and the time frame in which they were collected. Three types of information objects were available for analysis: queries, sessions, and users. A query object consists of the following fields: search terms and total unique searches. The search terms of a query are the string typed into the search box of the portal. None of the portals had any additional event tracking, e.g., click-through data, configured.

Table 2. Summary of search log data after pre-processing. Column all queries refers to the total number of queries per portal, while column unique queries refers to the number of unique queries.

Pre-processing. We merged potentially identical queries, for example those that differ only in letter case (LIDAR vs. lidar) or contain spelling mistakes (London vs. Londno). To this end we applied two clustering methods to the raw data, Fingerprint and N-Gram Fingerprint, using the OpenRefine framework (Footnote 2). Next, we discarded outliers in terms of length: \(99.9\%\) of all queries had fewer than 19 words, and we considered longer queries likely the result of text accidentally pasted into the search box, so we discarded them from our analysis.
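The two keying methods and the length filter can be sketched roughly as follows. This is a simplified re-implementation, not OpenRefine's exact code: the real fingerprint functions apply additional normalisation (e.g. accent stripping), and all function names here are illustrative.

```python
import re
from collections import defaultdict

def fingerprint(query: str) -> str:
    """Fingerprint key: lowercase, strip punctuation, then sort and
    de-duplicate the word tokens. Merges case and word-order variants."""
    tokens = re.sub(r"[^\w\s]", " ", query.lower()).split()
    return " ".join(sorted(set(tokens)))

def ngram_fingerprint(query: str, n: int = 2) -> str:
    """N-gram fingerprint key: character n-grams of the normalised string,
    sorted and de-duplicated. Smaller n tolerates more spelling variation."""
    s = re.sub(r"[^\w]", "", query.lower())
    grams = sorted({s[i:i + n] for i in range(len(s) - n + 1)})
    return "".join(grams)

def cluster_queries(queries):
    """Group raw queries that share the same fingerprint key."""
    clusters = defaultdict(list)
    for q in queries:
        clusters[fingerprint(q)].append(q)
    return clusters

def keep_short(queries, max_words=18):
    """Discard length outliers: queries of 19 words or more
    (likely the result of accidentally pasted text)."""
    return [q for q in queries if len(q.split()) <= max_words]
```

Note that a transposition such as London/Londno is only merged by the n-gram key at small n (the two strings share the same character set but not the same bigrams), which is why running both methods is useful.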

Analysis. Statistics computed in this analysis include: query length (average query length and length distribution, for all and for unique queries); query characteristics (queries containing keywords describing location, time frame, file and dataset type, numbers, or abbreviations, described in detail in Table 4); and question queries. To recognise question queries we counted queries containing the words what, who, where, when, why, how, which, whom, whose, whether, did, do, does, am, are, is, will, have, or has, as done in [2].
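The question-query detector is a simple membership test over the word list above; a minimal sketch (function name is ours):

```python
# Question words from [2]; a query counts as a question query
# if it contains any of them, at any position.
QUESTION_WORDS = {
    "what", "who", "where", "when", "why", "how", "which", "whom",
    "whose", "whether", "did", "do", "does", "am", "are", "is",
    "will", "have", "has",
}

def is_question_query(query: str) -> bool:
    """True if any whitespace-separated token is a question word."""
    return any(t in QUESTION_WORDS for t in query.lower().split())
```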

4 Results

This section describes the results of our analysis, reporting on query length and query types. Where applicable we provide related results from web search log analyses. However, as detailed in Sect. 6 on limitations, our results cannot be directly compared with web search, due to the different nature of the data, but only serve as an indication of differences for the reader.

Query Length. Table 3 shows the average query length. Almost 90% of all queries have between one and three words, with an average of 2.03 words for all queries and 2.67 for unique ones. Figure 1 shows the average percentage of queries according to their length, for all portals, for both all and unique internal queries. When considering all queries, single-word queries represent almost half of the entire corpus; when focusing just on unique queries, this share drops to \(25\%\). The distribution for unique queries is very similar to the results reported for web search engines in 2001 by Spink [16]. It could be that advances in dataset search will lead to behaviour patterns similar to those observed in web search today (as reported in [17], web queries have steadily grown in length over time), which would mean longer queries, closer to natural language, as technologies and tools improve.
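The length statistics behind Table 3 and Fig. 1 amount to counting whitespace-separated words per query; a sketch of that computation (function name is ours):

```python
from collections import Counter

def length_distribution(queries):
    """Return (average word length, {length: share of queries})
    for a list of query strings."""
    lengths = [len(q.split()) for q in queries]
    counts = Counter(lengths)
    total = len(lengths)
    dist = {n: counts[n] / total for n in sorted(counts)}
    avg = sum(lengths) / total
    return avg, dist
```

Running this once over all queries and once over the de-duplicated set yields the two curves shown in Fig. 1.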

Fig. 1. Percentage of internal queries according to average number of words (all and unique queries, all portals)

Table 3. Average number of words per query

Query Types. Having learned that people seem to be interested in the provenance of the datasets they are looking for, we explored whether they focus on location, time, or specific publishing formats.

Table 4. Definition of query characteristic metrics. Percentage of queries computed for four portals

Location, temporal, filetype, numerical, and acronym queries. Table 4 summarises the percentage of queries for the individual metrics. \(5.44\%\) of queries contain location keywords. A previous study on general web search [5] reports this figure at \(12.01\%\). This difference could be caused by the fact that data portals are already area-bound and users do not need to specify the location as frequently as in web search.

\(7.29\%\) of the queries contain temporal information. This number is much higher than the \(1.5\%\) reported for general web search [13]. This may mean that, for datasets, users have more interest in the time frame in which the data was created or the frequency of data releases and updates.

\(6.25\%\) of queries included common filetypes. We note that the governmental portals represented in this study offer filtering options for file types that are not reflected in our data; the actual number of queries filtering by a file format could therefore be higher. From an interface point of view, this suggests the filters might not be prominent enough. From a data point of view, this figure is an indication that users search for particular filetypes and formats, and that publishers need to be able to support different and popular formats for their data.

\(5.23\%\) of queries contain numerical values other than temporal information. Notably, we discovered searches for age and distance intervals associated with datasets (e.g., wage age 18–24 or LIDAR 25 meters), which might indicate a need for retrieving subsets or slices of datasets.

In our analysis we also identified that users frequently use abbreviations in their queries, as many datasets use acronyms such as rpi for Retail Price Index: \(5.11\%\) of queries contained at least one acronym. However, the full expansions of those acronyms also appear in queries.
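Detectors for these query characteristics can be sketched as follows. The regular expressions and the filetype list are illustrative assumptions on our part, not the exact keyword definitions of Table 4 (which is not reproduced here):

```python
import re

YEAR_RE = re.compile(r"\b(19|20)\d{2}\b")      # four-digit years, e.g. 2014
NUMBER_RE = re.compile(r"\b\d+(?:\.\d+)?\b")   # any numeric token
ACRONYM_RE = re.compile(r"\b[A-Z]{2,}\b")      # runs of capitals, e.g. LIDAR
                                               # (misses lowercased acronyms like rpi)
FILETYPES = {"csv", "xls", "xlsx", "json", "xml", "pdf", "shp"}  # illustrative list

def has_temporal(q):
    """Query mentions a year."""
    return bool(YEAR_RE.search(q))

def has_numeric(q):
    """Query contains a number that is not part of a temporal expression."""
    return bool(NUMBER_RE.search(YEAR_RE.sub(" ", q)))

def has_acronym(q):
    """Query contains an all-caps token of two or more letters."""
    return bool(ACRONYM_RE.search(q))

def has_filetype(q):
    """Query mentions a common file format."""
    return any(t in FILETYPES for t in q.lower().split())
```

The share reported for each characteristic is then simply the fraction of queries for which the corresponding detector fires.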

Question queries. These are increasingly common in web search queries, thanks to advances in speech recognition and conversational search interfaces. This is not yet the case for dataset search - less than \(1\%\) of queries in our logs are question queries, significantly below the \(7.49\%\) reported by [2] for web search. We believe this is mostly due to the way portals are used: as a source of data to be downloaded for further use, and not as a question-answering engine.

5 Discussion and Implications

Our analysis shows differences in the ratio of question queries between dataset search and general web search, and that dataset queries are generally short: on average one word shorter than web search queries, as per the 2011 report in [17]. We believe short queries potentially indicate a lack of trust that the search functionality will be able to return relevant data for longer queries. It appears that users currently tend to treat the search field of a data portal as a starting point for further exploration. The categories and metadata attributes used in data portals, and the ability to link from one dataset to another, become key in enabling users to find what they are looking for. Our analysis of query characteristics offers a starting point for data portal designers to refine their schemas and vocabularies.

We believe both temporal search, which is more prevalent than in document search, and geospatial search require better support. In both cases, relevant keywords occur at different levels of granularity (e.g., months vs. years, cities vs. regions or countries), which is not always matched by the publishing practices of the data owners. While the data portals we analysed are location-bound to a country and most datasets hold national data, supporting question-answering and dataset search scenarios will require more advanced geospatial indexing and reasoning features. Queries including some indication of time were almost five times more frequent than in web search, suggesting that datasets have a stronger relationship to time than documents. This could be due, for instance, to periodic updates or to the temporal coverage of the data. Currently, time references are recorded in the title of a dataset or as metadata attributes. This motivates the indexing of data along time dimensions, and the generation of temporal information for datasets that do not have it.

The same holds for queries using numbers, abbreviations, and named entities such as economic indicators. These kinds of information are likely to be found either in well-maintained, rich metadata descriptors or, if data portals expanded their search capabilities, in the content of the datasets themselves. To match users' information needs more accurately, portals could encourage users to issue more specific or longer queries. One solution might be to implement query recommendations based on the strongest co-occurrences of words in the datasets or in past search queries. However, to achieve better results in dataset search, we also need to improve the way datasets are indexed. We believe that automatically generated descriptions of the content of datasets, together with encouraging users to issue longer queries, could improve this process.
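One such recommendation scheme, sketched here under the assumption that suggestions are mined from past queries (all names are ours, and a production system would want smoothed association scores rather than raw counts):

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence(past_queries):
    """Count how often each unordered pair of words appears
    together in a past query."""
    pairs = Counter()
    for q in past_queries:
        for a, b in combinations(sorted(set(q.lower().split())), 2):
            pairs[(a, b)] += 1
    return pairs

def suggest(word, pairs, k=3):
    """Suggest the k words most strongly co-occurring with `word`,
    to offer as query expansions."""
    scores = Counter()
    for (a, b), n in pairs.items():
        if a == word:
            scores[b] += n
        elif b == word:
            scores[a] += n
    return [w for w, _ in scores.most_common(k)]
```

Offering such expansions as the user types is one lightweight way to nudge short queries towards the longer, more specific formulations discussed above.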

6 Limitations of Our Study

Comparisons between search log analyses are difficult, as concluded by [6] in their study comparing nine search engines through their transaction logs (over 1 billion queries in total). Even within web search, findings resulting from the analysis of one search engine cannot be applied to all web search engines. Even more so, the comparison of our results with web search needs to be treated with caution, due to the different nature of the collected data. However, we believe that including data from several countries and different audiences improves the generalisability of our analysis.

Our study is based on dataset search engines that are part of governmental open data portals. Further studies with other kinds of dataset search engines are required before drawing general conclusions.

As we did not have control over the analytics collected by each data portal, the time frames differ between portals. In cases where all queries were considered, there is a bias towards DGU, the portal with the most available data.

7 Conclusions and Future Work

We have presented the first analysis of query log data for the search vertical of dataset retrieval, based on query logs of four national data portals. Our findings can be summarised as: (i) Dataset queries are generally short. (ii) The portals are used exploratively, rather than to answer focused questions. (iii) There is a difference in topics between dataset queries issued directly to data portals and general web search queries.

As future work, we would like to (i) Analyse query log data from commercial dataset search engines, to identify differences and similarities with this study. (ii) Extend our study to click-through data: knowing which dataset pages users visited after performing a search, and whether they downloaded a dataset, can prove invaluable for evaluating the effectiveness of dataset search. (iii) Create a dataset search corpus in order to evaluate dataset search engines. (iv) Develop metrics specifically tailored to the analysis of dataset search logs, given the unique characteristics of this vertical.