6.2 Cluster Analysis




In the previous section, the trajectories from multiple simulations were counted as they passed over an arbitrary grid. Although simple and computationally efficient, this method tends to obscure the individual flow patterns that could dominate during the simulation period. Another approach is to merge trajectories that are near each other and represent those groups, called clusters, by their mean trajectory. Differences between trajectories within a cluster are minimized while differences between clusters are maximized. Computationally, trajectories are combined until the total variance of the individual trajectories about their cluster mean starts to increase sharply, which occurs when disparate clusters are combined. The clustering computation is described in more detail in the next section.
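The merging idea described above can be illustrated with a short Python sketch. It is not the HYSPLIT clustering code: it assumes every trajectory has the same number of endpoints, treats positions as plain x, y pairs without any map-distance correction, and uses hypothetical function names (spatial_variance, cluster_tsv). At each pass it merges the pair of clusters whose combination adds the least variance about the cluster mean, and records the total after every merge.

```python
import numpy as np


def spatial_variance(members):
    """Sum of squared distances between each trajectory's endpoints
    and the mean trajectory of the cluster.

    members: array of shape (n_trajectories, n_endpoints, 2).
    """
    mean_traj = members.mean(axis=0)
    return float(((members - mean_traj) ** 2).sum())


def cluster_tsv(trajs):
    """Merge trajectories pairwise until one cluster remains.

    Returns a list of (number_of_clusters, total_spatial_variance)
    recorded before any merge and after every merge.
    Illustrative only: brute-force search, suitable for small inputs.
    """
    trajs = np.asarray(trajs, dtype=float)
    clusters = [[i] for i in range(len(trajs))]  # start: one cluster per trajectory
    history = [(len(clusters), 0.0)]

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                merged = spatial_variance(trajs[clusters[a] + clusters[b]])
                increase = (merged
                            - spatial_variance(trajs[clusters[a]])
                            - spatial_variance(trajs[clusters[b]]))
                if best is None or increase < best[0]:
                    best = (increase, a, b)

        # Combine the pair that raises the total variance the least.
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

        total = sum(spatial_variance(trajs[c]) for c in clusters)
        history.append((len(clusters), total))

    return history
```

Calling cluster_tsv on an array of shape (trajectories, endpoints, 2) returns the variance total for each cluster count. A large jump between two consecutive entries marks the point where dissimilar groups were forced together.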

  1. Clustering requires a large number of single trajectory files, which should have been created in the previous section. All trajectory input files need to be complete through the required duration. If these files are not already available, go back and complete that step. With the trajectory files completed for September 1983, select the Trajectory / Special Runs / Clustering Standard menu tab to open the main clustering menu.

  2. Several changes need to be made to the default settings. First, change the hours to cluster from 36 to 48, which permits all or part of the trajectories to be compared to each other. Second, change the endpoints folder, using the browse button, from ../cluster/endpts to c:/hysplit4/working, the endpoints location from the previous section. Third, change the base file name from tdump to fdump to be consistent with the previous calculation. The time interval and skip options are used to select a subset of endpoints or trajectories for analysis to speed up the clustering of a large number of trajectories.

  3. As with the frequency analysis, a file of file names for analysis must be created first by pressing the Make INFILE button. The INFILE contents can be manually edited to remove unwanted files before pressing the Run Cluster Analysis button, which then shows the progress of the clustering.

  4. When the clustering is complete, press the Display Total Spatial Variance button to display a graphic showing the change in the Total Spatial Variance (TSV) as the trajectories are merged into one cluster. Only the last 30 clusters are shown. In this particular case there is a large jump in TSV when going from 3 to 2 clusters, suggesting that perhaps 3, 4, or 5 clusters would be an appropriate solution. Another possibility is to use an objective percentage change criterion of either 20% or 30%.

  5. The third and last step of this process is to decide the final cluster number and enter that number, for this example 4, into the Number of Clusters text field. Then press the Run button to Assign Trajectories to Clusters. The final step is to press Display Means to show the cluster-mean graphic. The individual trajectories in each cluster can also be displayed. As an exercise, rerun this step using three clusters to see which two clusters get merged into one, illustrating why three clusters is not the correct answer.

  6. All the intermediate clustering files and graphical results are saved in the \cluster\working directory. For instance, the output of the step that assigns the cluster number to individual trajectories is written to a file called CLUSLIST_{x}, where x is the final cluster number. This file can be used as input for other post-processing trajectory applications.
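The objective percentage change criterion mentioned in step 4 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the calculation performed by the GUI: the percent change is assumed here to be the increase in TSV when n clusters are merged into n-1, relative to the TSV at n clusters, and the function names and the dictionary input are hypothetical.

```python
def tsv_percent_change(tsv_by_n):
    """Percent increase in TSV when n clusters are merged into n-1.

    tsv_by_n: dict mapping number of clusters -> total spatial variance.
    Assumed definition: 100 * (TSV[n-1] - TSV[n]) / TSV[n].
    """
    changes = {}
    for n in sorted(tsv_by_n):
        if n - 1 in tsv_by_n and tsv_by_n[n] > 0:
            changes[n] = 100.0 * (tsv_by_n[n - 1] - tsv_by_n[n]) / tsv_by_n[n]
    return changes


def candidate_cluster_numbers(tsv_by_n, threshold=30.0):
    """Cluster counts n for which merging to n-1 raises TSV by at
    least `threshold` percent, i.e. counts worth keeping as a solution."""
    changes = tsv_percent_change(tsv_by_n)
    return [n for n in sorted(changes) if changes[n] >= threshold]
```

Passing threshold=20.0 or threshold=30.0 corresponds to the two criteria named in step 4. The lower threshold generally flags more candidate cluster numbers, so the final choice still benefits from inspecting the cluster-mean graphic.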

In summary, the clustering results shown here for September 1983 present a more nuanced view of the probability of conducting a successful experiment: a 38% chance of simple west-to-east transport (Cluster #3) through the center of the sampling network.