Machine Learning Methods to Predict Fecal Coliforms in Florida's Coastal Waters
Monitoring water quality is essential for tracking estuarine health and maintaining public safety standards. In particular, we are concerned about pathogens, whose presence is gauged through fecal indicator bacteria loads. Fecal indicator bacteria are not harmful on their own, but they indicate that more harmful pathogens may be in the water, which can lead to public health issues if not measured and managed. In these systems we look specifically at fecal coliforms, which mainly originate from human wastewater, septic systems, and waterfowl and livestock manure, and are washed from these terrestrial sources into coastal waters by stormwater runoff.
So, what do we do to manage this type of contamination in terms of bivalve shellfish sanitation? Because contamination is tied so directly to rainfall events, regulators generally mitigate the risk by splitting shellfish growing waters into areas that can each be opened or closed to harvesting. Whether an area is open or closed on a given day is generally determined by 24-hour rainfall totals and, in some systems, freshwater input or river stage. Rainfall and river stage thresholds are set for each growing area, and meeting a threshold triggers an automatic closure of that area. The issue is that this immediate closure response leaves shellfish growers with little to no time to make good management decisions for their businesses, which can lead to wasted time and resources.
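The threshold-based closure rule described above can be sketched in a few lines. This is only an illustration: the area names, units, and threshold values are hypothetical, and real closure rules may involve additional conditions that are not covered here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClosureRule:
    """Per-area thresholds; all values used below are hypothetical."""
    area: str
    rain_24h: float                      # 24-hour rainfall threshold
    river_stage: Optional[float] = None  # None if the area has no river stage rule


def is_closed(rule: ClosureRule, rain_24h: float,
              river_stage: Optional[float] = None) -> bool:
    """Return True if either threshold is met for this area."""
    if rain_24h >= rule.rain_24h:
        return True
    if rule.river_stage is not None and river_stage is not None:
        return river_stage >= rule.river_stage
    return False


# Hypothetical areas and thresholds, for illustration only.
area_a = ClosureRule(area="Area A", rain_24h=2.0)
area_b = ClosureRule(area="Area B", rain_24h=3.0, river_stage=5.0)

print(is_closed(area_a, rain_24h=2.5))                   # rainfall threshold met
print(is_closed(area_b, rain_24h=1.0, river_stage=5.5))  # river stage threshold met
print(is_closed(area_b, rain_24h=1.0, river_stage=4.0))  # neither threshold met
```

Because the rule reacts to conditions that have already occurred, the closure is only known once the threshold has been crossed, which is the timing problem for growers.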
This study aims to address that problem by developing predictive models of fecal coliform concentrations in shellfish growing waters, giving regulators and growers a day-to-day forecasting tool for shellfish growing area closures. To do this, we aim to identify the key drivers of fecal coliform dynamics, estimate the concentrations themselves, and use those predicted concentrations to determine when and where shellfish harvesting closures will occur based on area-specific thresholds.
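The last step, turning predicted concentrations into a closure forecast, might look something like the sketch below. The function name, area names, concentration values, and thresholds are all hypothetical placeholders, and treating "above the threshold" as a strict comparison is an assumption.

```python
from typing import Dict, List


def forecast_closures(predicted: Dict[str, List[float]],
                      thresholds: Dict[str, float]) -> Dict[str, List[int]]:
    """For each area, list the forecast-day indices whose predicted
    concentration is above that area's threshold."""
    closures = {}
    for area, daily in predicted.items():
        limit = thresholds[area]
        closures[area] = [day for day, conc in enumerate(daily) if conc > limit]
    return closures


# Hypothetical three-day predictions and thresholds, for illustration only.
predicted = {"Area A": [5.0, 40.0, 12.0], "Area B": [3.0, 4.0, 6.0]}
thresholds = {"Area A": 14.0, "Area B": 14.0}

closed_days = forecast_closures(predicted, thresholds)
print(closed_days)
```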
In Florida, the Florida Department of Agriculture & Consumer Services is the shellfish sanitation authority. It measures water quality at more than 1,300 individual sampling stations covering more than 5,000 sq. km of shellfish growing areas. What makes Florida an ideal study system for this project is its use of adverse pollution condition sampling. Without getting into the nitty gritty, this means the agency samples not only when shellfish growing areas are open, but also, intentionally, when they are closed. So the decades-long datasets capture water quality not just when conditions are good but also when they are bad, which is exactly what is needed for predicting closures.
We created separate models for each of Florida’s major watersheds as defined by the USGS. We separated the models at a larger-scale hydrological resolution (HUC4) but used a smaller-scale resolution (HUC12) within each dataset to capture the finer variations in watershed characteristics that affect fecal coliform concentrations. We chose predictors that are known to affect fecal coliform dynamics through the source of the bacteria and their transport from land to sea, as well as predictors that affect how long fecal coliforms persist in a system, such as air and water temperature.
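Because hydrologic unit codes are nested, the first four digits of a 12-digit HUC12 code give its HUC4. That makes it straightforward to keep HUC12-level records while splitting them into one dataset per HUC4 model. The sketch below illustrates the idea; the record layout, field names, and code values are placeholders, not the study's actual data.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_huc4(records: List[dict]) -> Dict[str, List[dict]]:
    """Split HUC12-level records into one list per HUC4 (first four digits)."""
    groups = defaultdict(list)
    for rec in records:
        huc12 = rec["huc12"]
        if len(huc12) != 12 or not huc12.isdigit():
            raise ValueError(f"not a 12-digit HUC code: {huc12!r}")
        groups[huc12[:4]].append(rec)
    return dict(groups)


# Placeholder records; codes and values are illustrative only.
records = [
    {"huc12": "031001010101", "rain_24h": 1.2},
    {"huc12": "031001010102", "rain_24h": 0.4},
    {"huc12": "030902040301", "rain_24h": 2.7},
]

datasets = group_by_huc4(records)
for huc4, rows in sorted(datasets.items()):
    print(huc4, len(rows))
```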
We created eight models in total, one for each of the major watershed areas, using 56 predictor variables. Using the caret package in R, we fit supervised regression models with the Random Forest algorithm, holding out 20% of the data for testing. We chose random forest because it is generally robust when predicting nonlinear responses from data of this size and handles a wide range of input variables. We evaluated performance using R^2 and RMSE. For variable importance, we used two metrics: how much the MSE increases when a variable is randomly permuted, and the increase in node purity, which for regression forests is measured by the reduction in residual sum of squares (the Gini index plays this role for classification). Under both metrics, and for generally all of the models, the most important predictors of fecal coliform concentrations were rainfall, wind speed and direction, and river stage. This aligns with what we already know about the drivers of elevated fecal coliforms in coastal systems, since rainfall and river stage are already used to manage these waters on a day-to-day basis.
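To make the evaluation steps concrete, here is a small standard-library sketch of a 20% holdout split, RMSE, R^2, and permutation importance (the mean increase in MSE when one predictor is shuffled). It is not the caret workflow used in the study: a fixed toy function stands in for the fitted random forest, the data are synthetic, and R^2 is computed as 1 - SS_res/SS_tot, which is one common definition among several.

```python
import math
import random


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(mse(y_true, y_pred))


def r_squared(y_true, y_pred):
    """R^2 as 1 - SS_res / SS_tot (assumed definition)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def holdout_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and return (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def permutation_importance(predict, X, y, column, n_repeats=10, seed=0):
    """Mean increase in MSE after shuffling one predictor column."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    increases = []
    for _ in range(n_repeats):
        values = [row[column] for row in X]
        rng.shuffle(values)
        permuted = [row[:column] + [v] + row[column + 1:]
                    for row, v in zip(X, values)]
        increases.append(mse(y, [predict(r) for r in permuted]) - baseline)
    return sum(increases) / n_repeats


# Synthetic data: columns are rainfall, river stage, and an irrelevant variable.
rng = random.Random(42)
X = [[rng.uniform(0, 10), rng.uniform(0, 10), rng.uniform(0, 10)]
     for _ in range(100)]
y = [2.0 * r[0] + 0.5 * r[1] + rng.gauss(0, 0.1) for r in X]


def toy_predict(row):
    """Stand-in for a fitted model; ignores the third column."""
    return 2.0 * row[0] + 0.5 * row[1]


train, test = holdout_split(list(zip(X, y)))
X_test = [x for x, _ in test]
y_test = [t for _, t in test]
preds = [toy_predict(x) for x in X_test]

print("RMSE:", rmse(y_test, preds))
print("R^2:", r_squared(y_test, preds))
for col, name in enumerate(["rainfall", "river_stage", "irrelevant"]):
    print(name, permutation_importance(toy_predict, X_test, y_test, col))
```

A predictor the model never uses gets an importance of zero, because shuffling it leaves the predictions unchanged. Predictors the model relies on heavily show a large increase in MSE.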
So to summarize: we can reasonably predict fecal coliform concentrations in Florida’s coastal waters given meteorological and watershed-specific data. The most important variables for predicting these values are rainfall, wind speed and direction, and river stage. Our next steps are, first, to use model selection to reduce the number of predictor variables. Then we will replace the measured river stage and precipitation values with forecasted data, so that we can see how the models perform with forecasted inputs rather than real-time observations. If all goes well, we will use the resulting predictions to determine whether concentrations will be above or below shellfish growing area closure thresholds, and finally wrap all of that in a GUI for regulators and shellfish harvesters to use.