General Questions


Where can I read the paper?


Our paper is available here through the Hydrology and Earth System Sciences journal.


How was the corpus collected?


The English papers used in the model were collected from journals in the Web of Science and Scopus databases, as well as from documents in an assembled EndNote reference file. The Spanish and Portuguese papers were downloaded manually. For the first time, there is a comprehensive collection of peer-reviewed texts related to water research from Latin American and Caribbean countries. This corpus is the input used to build our topic models. More information can be found in the supplemental materials of the research paper.


Can I access the corpus?
Access agreements prevent us from hosting the entire corpus. However, we do provide metadata for every article in the corpus, and the paper’s R package also contains extracted features from the corpus.

Topic Models


What is a topic model?


A topic model is a statistical model that discovers topics in a collection of documents and estimates the probability that each document relates to each topic. Ours uses latent Dirichlet allocation (LDA), which generates topics on the basis of word co-occurrence. After running the model, we categorized topics into four categories: general research topics (using NSF categories), specific subfield topics, water budget topics, and method topics. These topics allow us to interpret which fields of water science have the most and the least comprehensive research, or in other words, which are the “bright” and “blind” spots of water science in Latin America.


What is a topic?


A topic is a cluster of statistically relevant, co-occurring words that together describe a common subject. Each topic is labeled with a number, a theme, a National Science Foundation (NSF) specific topic, and an NSF general topic, and has a brief description.


What is a topic label?


A topic label is the set of descriptors assigned to a topic. Each topic is labeled with a number, a theme, a National Science Foundation (NSF) specific topic, and an NSF general topic and, where appropriate, a brief description of the water budget and methods themes that define the research landscape of water management in Latin American countries.

More information about the NSF specific topics may be found at the NSF Field of Study List.

Breakdown of the topic label categories:

How do I use the topic model?


To navigate the topic model, you must first find the numbers corresponding to the topic labels related to your query. Refer to the Topic Labels at the bottom of the Topic Overview page to find the number representing the topic you want to research. By default, the table is organized by topic number (hint: you can use the Search tool to quickly find a topic that you are interested in). Refer to the Topic Label instructions for further guidance on navigating the Topic Labels tab.

Once you have your topic number, there are a few ways to search it within the topic model. The most straightforward option is to insert a number into the Selected Topic search bar. You can also navigate to the topic by selecting the Next Topic and Previous Topic buttons, or by clicking on the circle with your topic number. A topic will highlight red when selected. To reset the model, select Clear Topic.

Now that your topic is projected on the page, let’s go over how to interpret the data. The page is organized in two panels. On the left side is a two-dimensional plane, titled Intertopic Distance Map (via multidimensional scaling), which projects the relative distance between topics. Each topic is represented by a circle whose center is determined by the Jensen-Shannon divergence between topics and whose area is determined by the prevalence of the topic.

The right side, Top-30 Most Relevant Terms for Topic –, is a horizontal bar chart that displays the 30 terms most relevant to the selected topic. For each word, the blue bar represents its corpus-wide frequency while the red bar depicts its topic-specific frequency. At the top, a λ slider lets you adjust the weight given to the topic-specific probability, which reorders the Top-30 terms by relevance.
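
The effect of the λ slider can be sketched with the relevance measure commonly used by LDAvis-style visualizations: relevance = λ · log p(term | topic) + (1 − λ) · log(p(term | topic) / p(term)). That this page uses exactly this definition is an assumption, and the terms and probabilities below are made-up illustration values, not results from the corpus.

```python
import math

def relevance(p_term_given_topic, p_term, lam):
    """Relevance of a term for a topic (assumed LDAvis-style definition).

    lam = 1 ranks terms by topic-specific probability alone;
    lam = 0 ranks them by lift (topic probability / corpus probability).
    """
    lift = p_term_given_topic / p_term
    return lam * math.log(p_term_given_topic) + (1 - lam) * math.log(lift)

def rank_terms(topic_probs, corpus_probs, lam):
    """Return terms sorted from most to least relevant for the given lambda."""
    return sorted(
        topic_probs,
        key=lambda term: relevance(topic_probs[term], corpus_probs[term], lam),
        reverse=True,
    )

# Hypothetical values: "water" is common everywhere, "aquifer" is rare but topic-specific.
topic_probs = {"water": 0.10, "aquifer": 0.05}
corpus_probs = {"water": 0.08, "aquifer": 0.005}

ranking_high_lambda = rank_terms(topic_probs, corpus_probs, lam=1.0)
ranking_low_lambda = rank_terms(topic_probs, corpus_probs, lam=0.0)
```

With λ near 1, the common word ranks first; with λ near 0, the rarer but more topic-specific word moves to the top, which is why moving the slider reorders the bar chart.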

When you hover over a word in the Relevant Terms panel, the circles on the left resize to show the conditional probability of that term across topics. This is essentially the predicted frequency with which the term appears in each topic.


How were the topic labels created?


We used latent Dirichlet allocation (LDA) topic modeling to identify 105 topics in the English corpus and 65 topics in the Spanish and Portuguese corpora. Our interdisciplinary research team labeled the topics by reading a random subset of 2,000 papers and manually identifying single-word tags based on keywords and main research topics. Topics were labeled “irrelevant” if multiple team members could not determine a coherent theme from the most frequently occurring words of a topic. The remaining relevant topics were tagged with five labels (see below) independently by multiple reviewers.

Topics were assigned labels for each of the following levels: (i) specific topic name; (ii) theme; (iii) specific and (iv) broad categories of scientific research as defined by the US National Science Foundation (NSF); and (v) description: spatial scale, water budget, or methods. These labels were then consolidated into four topic categories: general, specific, methods, and water budget.


What are irrelevant topics?


Irrelevant topics are topics that have no relevance to the research question. They do not fall under any of the NSF topics or locations studied. Generally, they are hard to interpret because they either have no discernible meaning or are too general. For example, topic 59 is composed of common Spanish and Portuguese words that do not relate to any of the existing topics.


How did we define the water budget and methods?


The interdisciplinary research panel of eight water experts assigned labels to each topic while evaluating its significance, and described each topic on the basis of whether it was related to a spatial scale or water balance. There are 17 topics described as water budget and 13 topics described as methods.


How were the country predictions made?


We used semi-supervised text classification to predict the country of study of each paper in the English corpus. The training labels came from manual reading of randomly chosen articles from the corpus (1,428 human-derived labels) and from text mining of article metadata (2,663 text-mined labels). Our team read a random subset of 2,000 papers from the corpus to validate the results from the topic model and to identify the location of study.


Why is there a download button for the topic labels table?


The user can download the topic labels as well as the articles in the Articles Listing section because together they form a set of associated data that can be used for further corpus exploration in R. The table also shows the four levels of topic relatedness, which aim to explain the scientific, political, and sociological landscape surrounding water research in Latin American and Caribbean countries. The data also allow the user to add other labels that were not included in the study.


What data is in the article listings?


The article listings table provides metadata for all the articles that make up each of the three corpora on the platform (English, Portuguese, and Spanish). This metadata includes information such as the author(s), title, year, source, and DOI for every article; the English listings also supply a predicted country of study, a top topic, and a topic label. Users can search or filter for specific articles of interest, using either publishing information or the topic associations (where available). In the English model, topic associations may be further explored in the Topic Overview tab. The article listings table also provides a method for exploring how articles relate to NSF general and specific topics, both in the context of the topic models and other visualizations on the wateReview site, including:


How do the topic models inform other aspects of the wateReview platform?


Neither of the non-English corpora contained enough papers to create meaningful and generalizable conclusions on the basis of the Portuguese and Spanish topic models alone. For a handful of specific topics, we compared the topic model output from the English corpus to the output from the other two corpora. Despite the relative lack of data, we determined that there was significant alignment between topics present across models of all three corpora, which allowed us to conclude that the English corpus could be representative of the Portuguese and Spanish corpora. Therefore we decided to use the results from the English topic model as the basis of other facets of the study, including our analyses of research spread and connectivity. For more details, refer to Figure 1 and Tables S4-6 in our paper.



Country Groups


How were the country groups created?


Countries are grouped according to their social and hydrological features, and the grouping is checked with validation metrics. The clustering is performed with Euclidean distances and follows Ward’s criterion. Both the total within sum of squares and the average silhouette width indicate that the optimal number of clusters is two. However, further inspection in PCA (principal component analysis) dimensions indicates that the cluster with Mexico and Brazil is significantly distinct from all other countries.
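
As an illustration of one of the criteria named above, the average silhouette width compares how close each observation is to its own cluster versus the nearest other cluster. The sketch below is a minimal pure-Python version using Euclidean distances; the four points and both labelings are made-up toy data, not the country variables used in the study.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_silhouette_width(points, labels):
    """Mean silhouette width over all points (requires at least two clusters)."""
    widths = []
    for i, point in enumerate(points):
        # Sum and count the distances from this point to every cluster.
        sums, counts = {}, {}
        for j, other in enumerate(points):
            if i == j:
                continue
            sums[labels[j]] = sums.get(labels[j], 0.0) + euclidean(point, other)
            counts[labels[j]] = counts.get(labels[j], 0) + 1
        own = labels[i]
        if own not in counts:
            # A point alone in its cluster gets a silhouette of 0 by convention.
            widths.append(0.0)
            continue
        a = sums[own] / counts[own]  # mean distance within the point's own cluster
        b = min(sums[c] / counts[c] for c in sums if c != own)  # nearest other cluster
        widths.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(widths) / len(widths)

# Toy data: two tight groups far apart from each other.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_split = average_silhouette_width(points, [0, 0, 1, 1])
poor_split = average_silhouette_width(points, [0, 1, 0, 1])
```

A value near 1 indicates compact, well-separated clusters, while a negative value indicates that observations sit closer to another cluster than to their own. Comparing the average width across candidate numbers of clusters is one way to choose among them.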

In addition, four validation metrics are used to assess the stability of the clustering. Each compares the result obtained with the complete set of clustering variables against the results of an iterative procedure in which one variable at a time is removed from the set:

  1. The average proportion of non-overlap (APN) measures the proportion of observations not placed in the same cluster under both cases and evaluates how robust the clusters are under cross-validation.
  2. The average distance between means (ADM) measures the variation of the cluster center and evaluates the stability of the localization of the cluster in the multi-dimensional clustering variable space.
  3. The average distance (AD) measures the distance between observations placed in the same cluster and evaluates within-cluster stability.
  4. The figure of merit (FOM) estimates the predictive power of the clustering algorithm by measuring the within-cluster variance of the removed variable.
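
To make the first metric concrete, here is a minimal sketch of APN, assuming the usual definition of the average proportion of non-overlap: for every observation and every removed variable, take one minus the share of the observation's original cluster that is still grouped with it. The function name and the toy labelings are hypothetical and are not taken from the wateReview package.

```python
def average_proportion_of_non_overlap(full_labels, reduced_labelings):
    """APN between a clustering on all variables and leave-one-variable-out clusterings.

    full_labels: cluster label of each observation using every variable.
    reduced_labelings: one list of labels per removed variable.
    """
    n = len(full_labels)
    total = 0.0
    for reduced in reduced_labelings:
        for i in range(n):
            # Members of observation i's cluster in each clustering.
            full_cluster = {j for j in range(n) if full_labels[j] == full_labels[i]}
            reduced_cluster = {j for j in range(n) if reduced[j] == reduced[i]}
            overlap = len(full_cluster & reduced_cluster) / len(full_cluster)
            total += 1 - overlap
    return total / (n * len(reduced_labelings))

# Toy example with four observations and two clusters.
full = [0, 0, 1, 1]
apn_stable = average_proportion_of_non_overlap(full, [[1, 1, 0, 0]])  # same groups, renamed
apn_mixed = average_proportion_of_non_overlap(full, [[1, 1, 0, 0], [0, 1, 1, 1]])
```

APN ranges from 0 to 1. A value of 0 means that removing a variable never changes which observations are grouped together, so smaller values indicate a more stable clustering.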

The validation metrics reach their optimal value of zero for APN and ADM with two or three clusters. In addition, AD and FOM are lower for three clusters than for two. Based on these results, we chose three clusters to describe the grouping of countries based on their socio-hydrologic variables.

For more information about the clustering process, refer to the paper or the section on clustering in the wateReview R package documentation.


Why are some countries (e.g., Caribbean nations) not included in the clustering?


There was not enough information about the hydrological and social features to cluster those countries with the other country groups. However, those countries still represent blind spots in terms of water research. More specifically, countries with fewer than 30 articles are not included.


Why do some country cards have missing data?


Some countries do not have a complete country card because there was not enough research output to draw meaningful conclusions within our study. If the article listing for a country contains fewer than 30 articles, the card will appear with no data.



Bright Spots and Blind Spots


What is a bright spot?


Bright spots are topics and locations where water issues are better understood due to a high volume of research. Examples include countries such as Brazil and Mexico and topics within physical science and life science.


What is a blind spot?


Blind spots are topics and locations where water issues are less understood due to limited research. These include locations such as the Caribbean nations and Central America, and topics such as reservoirs and risk assessment. Future researchers may want to focus on these regions or topics in order to contribute to a more comprehensive understanding of water research in Latin American and Caribbean countries.


How were the bright and blind spots determined?


The bright and blind spots were determined by abundance, spread, and connectivity. Abundance was measured as research volume between countries and topics by using a weighted bipartite network. Spread was estimated by topic normality across countries and articles, which describes how close a topic’s probability distribution is to the standard normal distribution. Connectivity was determined with a weighted citation network across countries and topics, describing the probability that a specific node (country or topic) is cited by other nodes.

For more information on topic normality, refer to the normality section below.


What do the values and colors represent in the overview heat map?


The values are indicators of the volume of research based on the country and topic. The larger the value, the more research there is on the topic in that location. These values remain the same in every heat map and can be downloaded as an Excel file from the page. Meanwhile, the colors are scaled by row and vary by country. This means that by looking at the colors of a single column, it is possible to see how much a topic is studied in that country, or essentially, that country’s bright and blind spots.


What is a Sankey diagram?


A Sankey diagram is a graphical visualization of the predicted volume of water research in Latin American and Caribbean countries based on the top or bottom 25% of studied topics in a country (depending on whether the diagram displays bright or blind spots). Each nation is represented by its socio-hydrologic group color, and the size of the colored bar is relative to the predicted amount of research about the country in the entire corpus. The bars on the right represent specific topics, and their size is proportional to the amount of research on each topic. The individual lines represent the proportion of a topic in the top or bottom 25% of topics for a given country.


How do I interpret a Sankey diagram?


The colors in the diagram correspond to the socio-hydrologic groups of Latin American countries. On the left are the NSF general categories that are subdivided into specific topics. The widths of the countries and categories are proportional to the predicted amount of research about these locations and topics.

For example, countries such as Brazil, Mexico, and Chile have produced a majority of the water research, and physical and life sciences dominate on the topic side. On the other hand, the social sciences are far less studied, and there is much less research about the Caribbean region and Central America.


What is hydro-social group #1?


Hydro-social group 1 comprises Brazil and Mexico, the countries in Latin America with the greatest overall representation in research.


How were hydro-social groups defined?


Socio-hydrological groups were defined based on 37 shared characteristics from databases which indicate environmental health, resource availability, political trends, quality of life standards, and risk assessments for each country. A comparison of clustering methods and validation metrics showed that three was the optimal number of groups. For more details on our clustering and validation methods, refer to these supplemental materials:


What is the list of Not Researched general categories?


The Not Researched list indicates which general categories have no data within a specified socio-hydrological group and within the scope of our study. This means the probability that each listed topic is researched in relation to water is very low. For example, we cannot conclude that there is no research relating animal science and water in Group 1, but it is likely very limited.


How was normality calculated?


Normality was calculated across topics and across countries using the Jensen-Shannon distance. Please refer to Appendix A in the paper for a detailed derivation of the Jensen-Shannon distance.
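
For readers unfamiliar with the measure, the sketch below computes the Jensen-Shannon distance between two discrete probability distributions. It is a generic illustration only: the paper's appendix remains the reference for which distributions are compared and how the result is turned into a normality score, and the example distributions here are made up.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base 2, so the result lies in [0, 1])."""
    # The mixture m is positive wherever p or q is, so the KL terms are always defined.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))

identical = jensen_shannon_distance([0.5, 0.5], [0.5, 0.5])
disjoint = jensen_shannon_distance([1.0, 0.0], [0.0, 1.0])
partial = jensen_shannon_distance([0.7, 0.3], [0.5, 0.5])
```

Identical distributions give a distance of 0 and distributions with no overlap give a distance of 1, so the measure provides a bounded scale for comparing how similar two distributions are.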


How do I interpret the normality measures?


When the normality across countries (y-axis) and the normality across documents (x-axis) are closer to 1, the topic’s standardized probability distribution across countries and documents is closer to a standard distribution, meaning the topic is evenly spread. For example, the National Science Foundation Specific Topics plot shows a point in the top right corner with values close to 1, which suggests that hydrology and water resources has a distribution close to normal across countries and documents. In the Themes plot, the cluster of points in the top left corner suggests that the majority of themes are close to normally distributed across countries but far from normal across documents.


Why are some countries excluded from the spread analysis?


Specific countries in Central America and the Caribbean are excluded due to a shortage of research in these regions.



Research Connectivity


How do I interpret the network diagrams?


The network diagrams show the connection between countries and/or topics based on the estimated degree of connectivity, which is the probability that one node is cited by other nodes. It is measured by directional citations between articles’ characteristics. The direction of the edges can be viewed by hovering over the edges that link one topic to another.

Nodes represent countries or topics depending on the graph, and edges are the lines that link each country or topic (node). The size of each node is proportional to the volume of research. Edge thickness is relative to the strength of the connection, or the citation proportion, which can be viewed by hovering over the edge.

To explore the connectivity, set the citation volume to a range of your interest. The ranges for weak, medium, and strong citation proportions are listed in the info button near the top of the page.


What do the components of the network diagram indicate?


The node size represents the volume of research related to each country or topic (depending on the diagram).

Edges represent a connection between nodes. Edge thickness is based on the weight of the connection between nodes. The thicker the edge, the stronger the connection.


What is a self-citation?


A self-citation occurs when a country or topic (node) cites its own articles. It is represented in the citation networks as an edge that starts and ends at the same node. A relatively large self-citation value may indicate that the same groups of people are studying those topics without interdisciplinary collaboration.


What do the weighted in-degree and weighted out-degree represent?


For each node (country or topic), the weighted in-degree is the sum of the weights of its incoming edges (connections) and the weighted out-degree is the sum of the weights of its outgoing edges.
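
In network analysis, a weighted degree is usually the sum of edge weights rather than a plain count of edges. The sketch below computes both degrees from a list of directed, weighted edges. The node names and weights are made up, and the convention that an edge points from the citing node to the cited node is an assumption for illustration.

```python
from collections import defaultdict

def weighted_degrees(edges):
    """Return (in_degree, out_degree) dicts from (source, target, weight) edges.

    Assumed convention: the source node cites the target node. A self-citation
    (source == target) contributes to both the in-degree and the out-degree.
    """
    in_degree = defaultdict(float)
    out_degree = defaultdict(float)
    for source, target, weight in edges:
        out_degree[source] += weight
        in_degree[target] += weight
    return dict(in_degree), dict(out_degree)

# Hypothetical citation proportions between three nodes, including one self-citation.
edges = [
    ("A", "B", 0.2),
    ("B", "A", 0.1),
    ("C", "A", 0.3),
    ("A", "A", 0.5),
]
in_degree, out_degree = weighted_degrees(edges)
```

Here node A has the largest weighted in-degree because it is cited by every node, including itself, while node C is never cited and so does not appear in `in_degree` at all.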


What are the US National Science Foundation (NSF) categories?


See the NSF Field of Study List for a classification of the categories.