Can cricket chirp frequency predict temperature?
Final Project
Intro to Data Science (DS 210)
In this project, I will demonstrate knowledge and understanding of the data life cycle by exploring the following question: Can the outside temperature be estimated by the frequency of cricket chirps?
The dataset consists of 59 rows, each recording the number of cricket chirps in a 15-second interval and the temperature in degrees Fahrenheit. For the Chirps15s feature as imported from the original source, the mean was 34.5 chirps, with a minimum of 12.5 and a maximum of 361. For the TempFarenheight feature, the mean temperature was 64.8 degrees, with a minimum of 6 and a maximum of 80.5.
Exploratory Data Analysis
Within the Chirps15s feature of the dataframe, there is one obvious outlier of 361 chirps. df.describe() gives us Q1, Q2 and Q3 (22.75, 29.75 and 35 respectively), which brings us to an IQR of 12.25. Using the standard 1.5 × IQR rule, anything above 53.375 or below 4.375 counts as an outlier. 361 could be an input error: 36.1 missing a decimal point. The temperature associated with this value, 73 degrees, is not repeated in the data, but 73.5 degrees had a count of 35 chirps, and 72.5 degrees had counts of 36.2 and 37.1 chirps. 36.1 falls within that range, so the value will be adjusted using df["Chirps15s"] = df["Chirps15s"].replace([361.000],36.1). In a larger dataset I would prefer to remove the row, but given the scant 59 rows here, this seems the best option.
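The outlier bounds can be checked with a few lines of Python. This is a minimal sketch that assumes the conventional Tukey rule (1.5 × IQR beyond each quartile); the helper name iqr_fences is made up for illustration, and the quartiles are the ones reported by df.describe() above.

```python
def iqr_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) outlier fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles of Chirps15s as reported in the text.
lower, upper = iqr_fences(22.75, 35.0)
print(lower, upper)  # 4.375 53.375

# The suspicious value is far above the upper fence; the corrected one is not.
print(361 > upper, 36.1 > upper)  # True False
```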
There is one missing value in the Chirps15s feature, corresponding to the temperature of 67 degrees. The average number of chirps for that temperature is 29.5, and I will be replacing the missing value with that using df["Chirps15s"] = df["Chirps15s"].fillna(29.5).
Within the TempFarenheight feature the standout outlier is a temperature of 6 degrees, with a corresponding 25 chirps. This is an unreasonable temperature for an active cricket, being 26 degrees below freezing. It could be another input error with a misplaced decimal point (60 entered as 6), but the only other entry at 60 degrees has 21.3 chirps, which is a wider gap than we saw for the corrected value in the Chirps15s feature. Because the correction is less defensible here, the row was removed using df = df[df.TempFarenheight > 32].
There is one missing value within the TempFarenheight feature, with a chirp count of 30. Given there are no other chirp counts of 30 to compare it to, this row will also be removed using df = df.dropna().
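The four cleaning steps described above can be sketched end to end as follows. The rows in this small dataframe are hypothetical stand-ins that only mimic the four problems (they are not the project's 59 rows); the column names and pandas calls are the ones quoted in the text.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, NOT the real dataset: one bad chirp count, one missing
# chirp count, one implausible temperature, one missing temperature.
df = pd.DataFrame({
    "Chirps15s": [36.2, 361.0, np.nan, 25.0, 30.0, 21.3],
    "TempFarenheight": [72.5, 73.0, 67.0, 6.0, np.nan, 60.0],
})

# 1. Treat 361 as a misplaced decimal point and correct it to 36.1.
df["Chirps15s"] = df["Chirps15s"].replace(361.0, 36.1)

# 2. Fill the missing chirp count with the value chosen in the text.
df["Chirps15s"] = df["Chirps15s"].fillna(29.5)

# 3. Keep only rows above freezing. A comparison against NaN is False,
#    so this filter also drops the row whose temperature is missing.
df = df[df["TempFarenheight"] > 32]

# 4. Drop any rows that still contain missing values (a safety net here).
df = df.dropna()

print(df)
```

One thing this sketch makes visible: because of how NaN comparisons work, the filter in step 3 already removes the row with the missing temperature, so the order of steps 3 and 4 does not change the result.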
Within the newly cleaned data, we can look at both the mean and median for the Chirps15s and TempFarenheight features. The means were found using df.describe(), giving us 28.95 and 65.86 respectively. The median for chirps was 29.5 (using df["Chirps15s"].median()) and the median for temperature was 67.0 (using df["TempFarenheight"].median()).
The cleaned dataset was then exported with df.to_csv("tidy_crickets.csv") for further examination in KNIME, where it was used to create the following scatter plot:
Refining the Question
At this point in the Data Science Life Cycle, it is time to step back and look at the results so far, asking if there is any reason to change the original question. Looking at the scatter plot, there is no reason to change it: there appears to be a strong positive relationship between the temperature and the number of chirps per 15 second interval. If there had been a weak relationship, or none at all, we may have needed to change the question and reexamine the data for different possible findings, but it is with confidence that I can proceed to the next step of the life cycle, model building.
Model Building
I used KNIME to build a linear regression model to predict the temperature based on the number of cricket chirps heard in 15 seconds. Linear regression is a type of supervised learning that fits a straight line to known pairs of values so that one value can be predicted from another. This model is a simple linear regression, as there is a single predictor (chirps) and a single response (temperature).
The correlation coefficient from the Linear Correlation module in KNIME (shown above) gave a value of 0.98. The closer that number is to 1, the stronger the positive linear relationship between the two variables being considered. As demonstrated previously with the scatter plot, there is a strong correlation between the number of cricket chirps per 15 second interval and the temperature.
A regression line is next created from the data above, and can be used to predict the value of one variable from another. This is a line of best fit, i.e., the line that minimizes the sum of the squared vertical distances between itself and the points on the plot. The equation of the regression line is:
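To make the correlation coefficient less of a black box, here is a minimal sketch of the Pearson correlation computed by hand. It is not KNIME's implementation, and the chirp and temperature values are hypothetical illustrative numbers, not the project's data; the function name pearson_r is also made up for this example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for illustration only.
chirps = [20.0, 25.0, 30.0, 35.0, 40.0]
temps = [58.0, 62.5, 66.0, 71.5, 76.0]

r = pearson_r(chirps, temps)
print(round(r, 3))
```

A value near 1 means the points lie close to a rising straight line, a value near -1 means a falling line, and a value near 0 means no linear relationship.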
y = mx + b
In this formula, y represents the value being predicted, what we are solving for (here, the temperature). m is the slope of the line: how much y changes for each one-unit increase in x. x is the value we feed into the formula to make a prediction (in this case, the number of chirps in a 15 second period of time), and b is the intercept: the value of y when x is 0, which is the point at which the line crosses the y axis of the chart.
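For a single predictor, the slope and intercept of the least-squares line have a closed form, which can be sketched as below. This is an illustration of the general technique, not the code KNIME runs, and the function name fit_line and the sample points are hypothetical.

```python
def fit_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    # Intercept: the fitted line always passes through (mean_x, mean_y).
    b = mean_y - m * mean_x
    return m, b


# Hypothetical points that lie exactly on y = 2x + 1.
m, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(m, b)  # 2.0 1.0
```

With real, noisy data the points will not sit exactly on the line, but the same two formulas give the m and b that minimize the squared vertical distances.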
The intercept (as output by the Linear Regression Learner function, shown above) is 40.0105.
As a test of the model, I fed 40 chirps into the equation, coming to a result of 75.7 degrees.
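The test prediction can be reproduced as a short sketch. The slope is not quoted in the text, so the value used below is an assumption: it is back-calculated from the reported intercept and the reported result for 40 chirps, and may differ slightly from what the Linear Regression Learner actually output. The function name predict_temp is hypothetical.

```python
INTERCEPT = 40.0105         # reported by the Linear Regression Learner
REPORTED_PREDICTION = 75.7  # reported result for the test input
TEST_CHIRPS = 40            # chirps per 15 seconds used in the test

# Assumption: slope implied by the two reported numbers, not read from KNIME.
implied_slope = (REPORTED_PREDICTION - INTERCEPT) / TEST_CHIRPS


def predict_temp(chirps, slope=implied_slope, intercept=INTERCEPT):
    """Predict temperature (degrees Fahrenheit) from chirps per 15 seconds."""
    return slope * chirps + intercept


print(round(implied_slope, 3))                # 0.892
print(round(predict_temp(TEST_CHIRPS), 1))    # 75.7
```

Read this way, the model says that each additional chirp per 15 seconds corresponds to a rise of a little under one degree Fahrenheit.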
Interpretation/Summary
Within the scope of this project, I have followed the Data Science Life Cycle through to its conclusion. I imported data using Python and pandas, cleaned the data by correcting or removing outliers and handling missing values, plotted the cleaned data, and built a predictive model with linear regression using KNIME. Given the correlation coefficient of 0.98, I can conclude that, within this dataset, the temperature can be predicted from the number of cricket chirps.