Work by Kyosuke Imai

Data Visualization: The Evolution of Baseball

Software Engineer

HTML, d3.js, Observable

The Question: We want to find out how Major League Baseball has changed over the long history of the game. A sliding timescale across four graphics lets the viewer see short- and long-term trends in the league. This gives viewers a window into the correlations between each team's historical reliance on different stats and measures, and into whether less performance-related details such as game attendance have any influence on game outcomes. We also included turning points in league history to see whether and how major events affected individual team performance. Interactivity lets users answer these questions quickly and explore metrics both across teams and within one team over time. The graphic also includes modern analytics that derive a data point from a formula, colloquially referred to as "sabermetrics". One example is the three true outcomes percentage: home runs, strikeouts, and walks combined, expressed as a percentage of total plate appearances.
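As a rough illustration of the three true outcomes percentage described above, the calculation could be sketched as follows. The field names (`homeRuns`, `strikeouts`, `walks`, `plateAppearances`) and the sample season are hypothetical and are not taken from the project's CSV.

```javascript
// Three true outcomes share: (HR + SO + BB) / PA.
// Field names are hypothetical, not the dataset's actual column names.
function threeTrueOutcomesPct({ homeRuns, strikeouts, walks, plateAppearances }) {
  // Guard against missing or zero plate appearances.
  if (!(plateAppearances > 0)) return NaN;
  return (homeRuns + strikeouts + walks) / plateAppearances;
}

// Made-up team season, for illustration only.
const sampleSeason = {
  homeRuns: 200,
  strikeouts: 1400,
  walks: 500,
  plateAppearances: 6000,
};

// Format the share as a percentage with one decimal place.
console.log((100 * threeTrueOutcomesPct(sampleSeason)).toFixed(1) + "%");
```

Returning `NaN` for a row without plate appearances is one possible choice; it keeps incomplete early-era rows from being plotted as zero.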

The Data: The data we are using is from openintro.org. It is an expansive dataset of individual team data running from 1880 through the 2020 MLB season. We manually added data from the past two years using baseball-reference.com; both datasets were intentionally made public by the league. Each row of our CSV is one team's stats for a single season, with 41 distinct variables including team name, attendance, year, record, and performance statistics such as home runs and walks. Using this data, we also calculate other modern statistical metrics to answer specific questions about the evolution of baseball. We chose this data, and turned it into an interactive graphic, because of its size (around 2800 data points) and its span of more than a century, which documents both the historic era of baseball and the modern iteration of the game. The information is also already well organized. Note: We adapted team names from the early 1900s to their modern names for ease of understanding in a contemporary setting.

Design Justification: Our design includes four charts, each a different way to aggregate the large body of historical baseball data. We grouped these particular charts because each has distinct components that would become too messy if combined in one graph; with thirty teams in the modern game, baseball is too complicated for a single timeline. The first two charts are scatterplots of individual team performance. The first is color-encoded by team name and the second by divisional rank, showing how wins and losses relate to a team's metrics, with the stats selectable from a dropdown menu. Users who don't care about rank can isolate one or more teams in the first scatterplot and compare them to all other teams. The third chart is a regression graph measuring the same advanced statistics against winning percentage. We wanted to show how tightly correlated different stats are with both team winning percentage and Pythagorean winning percentage (the latter being an expected winning percentage based on peripheral factors rather than just success in high-variance games). The regression line's main visual channel is orientation: the steeper the slope, the stronger the relationship it suggests. The year slider shows how this slope changes over time. We included the slider after multiple suggestions that watching the evolution manually would be helpful and informative. The last graph shows all basic statistics over time, with optional markers for major events in league history so users can see their direct and long-term effects. This graph shows the most about how the game has changed over the last 150 years, because highlighting a single line reveals the whole evolution at once.
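The regression chart described above can be sketched in plain JavaScript. This is a minimal sketch under stated assumptions, not the project's actual code: it uses the classic Pythagorean expectation with an exponent of 2 (other exponents are in common use), an ordinary least squares fit, and made-up team-seasons with hypothetical field names.

```javascript
// Pythagorean winning percentage from runs scored and runs allowed.
// Assumption: exponent 2, the classic form; the project may use another value.
function pythagoreanWinPct(runsScored, runsAllowed, exponent = 2) {
  const rs = Math.pow(runsScored, exponent);
  const ra = Math.pow(runsAllowed, exponent);
  return rs / (rs + ra);
}

// Ordinary least squares fit of y = intercept + slope * x.
function linearRegression(points) {
  const n = points.length;
  const meanX = points.reduce((sum, p) => sum + p.x, 0) / n;
  const meanY = points.reduce((sum, p) => sum + p.y, 0) / n;
  let sxy = 0; // sum of (x - meanX) * (y - meanY)
  let sxx = 0; // sum of (x - meanX)^2
  for (const { x, y } of points) {
    sxy += (x - meanX) * (y - meanY);
    sxx += (x - meanX) ** 2;
  }
  const slope = sxy / sxx;
  return { slope, intercept: meanY - slope * meanX };
}

// Hypothetical team-seasons for a single year (illustrative values only).
const teamSeasons = [
  { runsCreated: 820, runsScored: 810, runsAllowed: 650 },
  { runsCreated: 740, runsScored: 735, runsAllowed: 720 },
  { runsCreated: 690, runsScored: 680, runsAllowed: 790 },
  { runsCreated: 770, runsScored: 760, runsAllowed: 700 },
];

// x = the selected stat, y = expected winning percentage.
const regressionPoints = teamSeasons.map((t) => ({
  x: t.runsCreated,
  y: pythagoreanWinPct(t.runsScored, t.runsAllowed),
}));

const fit = linearRegression(regressionPoints);
console.log(fit.slope, fit.intercept);
```

Refitting on each year's rows as the slider moves is what would make the line's slope change over time. The slope depends on the units of the chosen stat, so it is best compared across years for the same stat rather than across different stats.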

Discussion Takeaways: The main goal is for less informed users to learn general trends as they hop from chart to chart, and for more avid fans to draw more complex conclusions from relationships across graphs. For both groups, the timeline is an easy place to start. Some major events seem to directly cause a few metrics to change, such as modern substitutions being allowed and the beginning and end of the steroid era, while others seem only to mildly exacerbate existing trends. The key takeaway from the three other graphs is which advanced stats matter for winning percentage and which do not seem to. Statistics such as runs created and fielding independent pitching, which measure a team's total offensive or pitching contribution, were highly correlated with both winning percentage and Pythagorean winning percentage. Unfortunately, statistics like three true outcomes percentage were not at all correlated with winning percentage, which is curious because these statistics have some explanatory power over stats like runs created. It also seems fans do not care whether their team is any good: the regression line tying home attendance to winning percentage might have the worst correlation we have ever seen. This is probably because we do not account for ticket prices, which are probably an equal if not larger factor in explaining team attendance. On another note, we can see a strong relationship between new league regulations and rules and changes in various baseball metrics.
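Comparisons such as attendance against winning percentage come down to a correlation coefficient. A minimal sketch of Pearson's r is shown here, assuming two equal-length numeric arrays; the attendance and winning-percentage values are invented for illustration and are not results from the dataset.

```javascript
// Pearson correlation coefficient between two equal-length numeric arrays.
function pearsonCorrelation(xs, ys) {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error("need two arrays of equal length with at least 2 values");
  }
  const n = xs.length;
  const meanX = xs.reduce((sum, v) => sum + v, 0) / n;
  const meanY = ys.reduce((sum, v) => sum + v, 0) / n;
  let sxy = 0; // co-variation of x and y
  let sxx = 0; // variation of x
  let syy = 0; // variation of y
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    sxy += dx * dy;
    sxx += dx * dx;
    syy += dy * dy;
  }
  return sxy / Math.sqrt(sxx * syy);
}

// Invented sample values: home attendance and winning percentage.
const homeAttendance = [1800000, 2500000, 3100000, 2200000, 2900000];
const winningPct = [0.52, 0.47, 0.55, 0.6, 0.44];

console.log(pearsonCorrelation(homeAttendance, winningPct).toFixed(2));
```

A value near 0 means the stat tells you little about winning percentage, while values near 1 or -1 indicate a tight linear relationship. Unlike the regression slope, r does not depend on the units of either variable, which makes it easier to compare across different stats.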