A scatterplot displays the relationship between two numeric variables.
Each dot on the graph represents one observation from your dataset: its horizontal position is the X value and its vertical position is the Y value.
You construct a scatterplot using the exact same plot(x, y) function we learned previously.
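Instead of passing a sequence of 5 numbers, you pass two numeric vectors of equal length, which can hold anything from a handful of values to hundreds of data points.
R pairs the values by position, so the first element of x and the first element of y become one dot, and it scales both axes to fit the data.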
    # Vector representing human ages
    age <- c(25, 30, 45, 50, 65, 70, 22, 35)
    # Vector representing their income in thousands
    income <- c(40, 50, 80, 85, 120, 130, 35, 60)
    plot(age, income, main="Age vs Income", col="blue")
Data scientists rely heavily on scatterplots to rapidly identify data correlations.
If the dots form a pattern sloping tightly upwards from left to right, it indicates a strong "Positive Correlation".
If the dots slope downwards from left to right instead, it indicates a "Negative Correlation".
If the dots are scattered across the canvas with no visible trend, it suggests little or no linear correlation.
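To back up what the eye sees, the pattern can also be expressed as a number. The following is a minimal sketch using base R's cor() function, which returns Pearson's correlation coefficient (a value between -1 and 1). It reuses the age and income vectors from the example above; the noise_x and noise_y vectors are made-up random data, added here only to illustrate a plot with no trend.

```r
# Same illustrative vectors as the age/income example
age <- c(25, 30, 45, 50, 65, 70, 22, 35)
income <- c(40, 50, 80, 85, 120, 130, 35, 60)

# cor() returns Pearson's correlation coefficient, between -1 and 1.
# Values near 1 match a tight upward slope in the scatterplot.
r <- cor(age, income)
print(r)

# Hypothetical unrelated data: two independent sets of random numbers
set.seed(42)            # fixed seed so the example is repeatable
noise_x <- runif(200)
noise_y <- runif(200)

# With no relationship between the variables, the coefficient sits near 0
r_noise <- cor(noise_x, noise_y)
print(r_noise)

# The scatterplot of the random data shows no visible trend
plot(noise_x, noise_y, main = "No visible trend", col = "gray40")
```

For the age and income vectors the coefficient comes out close to 1, which agrees with the tight upward slope in the plot. Keep in mind that cor() only measures straight-line relationships, so it is best read alongside the scatterplot rather than instead of it.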
When plotting thousands of dots, they can easily overlap and become completely unreadable.
You can adjust the physical size of the dots using the cex (character expansion) parameter.
The default is cex = 1, so setting cex = 0.5 draws the dots at half their default size, which reduces overlap and visual clutter.
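The effect is easiest to see side by side. The sketch below assumes some made-up data (5000 random points generated with rnorm(), not taken from any real dataset) and draws the same points twice: once at the default size and once with cex = 0.5. The par(mfrow = ...) call is just one way to place two plots next to each other.

```r
# Hypothetical data: 5000 points with an upward trend plus random noise
set.seed(1)             # fixed seed so the example is repeatable
n <- 5000
x <- rnorm(n)
y <- x + rnorm(n)

# Split the canvas into 1 row and 2 columns, remembering the old settings
old_par <- par(mfrow = c(1, 2))

# Left panel: default dot size (cex = 1)
plot(x, y, main = "Default size (cex = 1)")

# Right panel: dots drawn at half the default size
plot(x, y, cex = 0.5, main = "Half size (cex = 0.5)")

# Restore the original canvas layout
par(old_par)
```

In the right-hand panel the dense middle of the cloud should be easier to read, because smaller dots cover less of each other.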
What is the primary purpose of rendering a scatterplot in data science?
Which parameter explicitly shrinks or expands the physical size of the plotted dots?