Analyzing Acoustic Signals To Predict Earthquakes 🫨

A.I Hub
8 min read · Jul 14, 2024
Image owned by US daily

In this article, we focus on analyzing acoustic signals to predict the next laboratory earthquake. Along the way, we will also draw out the patterns the signal exhibits as a failure approaches. We will first introduce the LANL Earthquake Prediction competition, then review the different formats used to store signal data, and finally explore the competition data to bring hidden insights to light.

Table of Contents

  • Introducing LANL earthquake prediction competition
  • Formats for signal data
  • Exploring our competition data

Introducing LANL Earthquake Prediction Competition

Image owned by wired

The LANL Earthquake Prediction competition centers on utilizing seismic
signals to determine the precise timing of laboratory-induced earthquakes.
Currently, predicting natural earthquakes remains beyond the reach of our
scientific knowledge and technological capabilities. The ideal scenario would
be for scientists to predict the timing, location, and magnitude of such an
event.

Simulated earthquakes, however, created in highly controlled artificial
environments, mimic real-world seismic activities. These simulations enable
attempts to forecast lab-generated quakes using the same types of signals
observed in natural settings. In this competition, participants use an
acoustic input signal to estimate the time until the next artificial
earthquake occurs. The challenge is to predict the timing of the earthquake,
addressing one of the three critical unknowns in earthquake forecasting:
when it will happen, where it will occur, and how powerful it will be.

The training data is a single file with two columns: the acoustic signal
amplitude and the time to failure. The test data consists of multiple files
(2,624 in total) with acoustic signal amplitude segments for which we will
have to predict the time to failure. A sample submission file has two
columns: the segment ID (seg_id) and the value to predict
(time_to_failure).

The competitors are tasked with training their models with the acoustic
signal and time-to-failure data in the training file and predicting the
time to failure for each segment file in the test folder. This competition's
data is in a very convenient format, comma-separated values (CSV), but this
is not a requirement. Other competitions and datasets on Kaggle with signal
data use different, less common formats. Because this section is about
analyzing signal data, this is the right place to review them. Let's first
look into some of these formats.

Formats For Signal Data

Image owned by IBT Times India

Several competitions on Kaggle used sound data in addition to regular
tabular features. Three competitions organized with the Cornell Lab of
Ornithology, BirdCLEF (LifeCLEF Bird Recognition Challenge) in 2021, 2022,
and 2023, asked participants to predict a bird species from samples of bird
songs. The format used in these competitions was .ogg. The .ogg format
stores audio data with less bandwidth and is considered technically superior
to the .mp3 format. We can read files in this format using the librosa
library; the code below can be used to load such a file and display the
sound wave.

The librosa library, when loading the audio, returns the values as a time
series of floating-point numbers. It isn't just the .ogg format that is
supported; it will work with any codec supported by soundfile or audioread.
The default sampling rate is 22,050 Hz, but this can also be set upon load
using the sr parameter. Other parameters that can be used when loading an
audio wave are offset and duration, both given in seconds; together, they
allow you to select the time interval of the sound wave you will load.

In an earlier edition of the BirdCLEF competition, Cornell Birdcall
Identification, the audio in the dataset was given in .mp3 format. For this
format, we can also use librosa to load, transform, or visualize the sound
waves. The Waveform Audio File format (WAV), another frequently used format,
can also be loaded using librosa.

For the .wav format, we can alternatively use the wavfile module from
scipy.io to load the data. The following code loads and displays a file in
.wav format. In this case, the amplitude is not scaled down to the [-1, 1]
interval; for 16-bit audio, the values range up to 32,767.
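A small self-contained sketch, assuming scipy and matplotlib are available: it first writes a 16-bit .wav file with the standard-library wave module, then reads it back with scipy.io.wavfile, which keeps the raw integer amplitudes.

```python
import wave

import numpy as np
from scipy.io import wavfile
import matplotlib

matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

# Write a short 16-bit PCM .wav file with the standard-library wave module.
sr_out = 8000
t = np.linspace(0, 1.0, sr_out, endpoint=False)
tone = (0.8 * np.sin(2 * np.pi * 220 * t) * 32767).astype(np.int16)
with wave.open("tone16.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)      # 2 bytes per sample = 16-bit audio
    f.setframerate(sr_out)
    f.writeframes(tone.tobytes())

# scipy.io.wavfile returns the raw integer samples: for 16-bit audio the
# amplitude is NOT rescaled to [-1, 1] but stays in [-32768, 32767].
rate, data = wavfile.read("tone16.wav")
print(rate, data.dtype, data.min(), data.max())

plt.figure(figsize=(10, 3))
plt.plot(np.arange(len(data)) / rate, data)
plt.xlabel("Time (s)")
plt.ylabel("Amplitude (int16)")
plt.savefig("waveform16.png")
```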

Signal data, not only audio signal data, can also be stored in .npy or
.npz format, which are both NumPy formats for storing array data. These
formats can be loaded using NumPy functions, as you can see in the code
snippets below. For the .npy format, the code loads a multi-column array;
for the .npz format, it loads a similar structure that was previously saved,
compressed, in a single file.
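The round trip for both formats can be sketched as follows; the array shape and names here are illustrative, not from the competition data.

```python
import numpy as np

# Save a multi-column array in .npy format, and a compressed archive of
# named arrays in .npz format.
signal = np.random.default_rng(42).normal(size=(1000, 2))
np.save("signal.npy", signal)                       # single array, binary
np.savez_compressed("signal.npz", signal=signal)    # one or more named arrays

# .npy: np.load returns the array directly.
loaded = np.load("signal.npy")
print(loaded.shape)

# .npz: np.load returns an archive; each array is accessed by the keyword
# name it was saved under.
archive = np.load("signal.npz")
restored = archive["signal"]
print(restored.shape)
```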

For data stored in .rds format, an R-specific format for saving data, we
can load the data from Python using code like the following.

To store multi-dimensional array data, the NetCDF-4 format (Network Common
Data Form, version 4) is used. We have an example of such multi-dimensional
signal data from NASA Earthdata satellite measurements, in the EarthData
MERRA-2 CO dataset. The following code snippet reads a subset of
measurements for CO, focusing on the COCL variable (CO Column Burden,
kg m-2), and includes values for latitude, longitude, and time.

For more details, you can consult the NetCDF documentation. For now, let's
get back to our competition data, which is in CSV format, although it
represents an audio signal (sound waves), as we already clarified.

Exploring Our Competition Data

The LANL Earthquake Prediction dataset consists of:

  1. A train.csv file, with two columns only:
  • acoustic_data — the amplitude of the acoustic signal.
  • time_to_failure — the time to failure corresponding to the current data
    segment.

2. A test folder with 2,624 files containing small segments of acoustic
data.

3. A sample_submission.csv file; for each test file, those competing will
need to give an estimate for the time to failure.

The training data (9.56 GB) contains 692 million rows. The actual time
scale of the samples in the training data can be derived from the
continuous variation of the time_to_failure values. The acoustic data
consists of integer values from -5,515 to 5,444, with an average of 4.52
and a standard deviation of 10.7, the values oscillating around 0. The
time_to_failure values are real numbers, ranging from 0 to 16, with a mean
of 5.68 and a standard deviation of 3.67. To reduce the memory footprint
for the training data, we read the data with a reduced data type for both
acoustic_data and time_to_failure.

Let’s check the first values in the training data. We will not use all the
time_to_failure data, only the values associated with the end of each time
interval over which we aggregate the acoustic data; therefore, the
precision lost by rounding time_to_failure from double to float is not
important here.
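The reduced-memory read can be sketched as follows, assuming pandas is available. A tiny synthetic CSV stands in for the real 9.56 GB train.csv; only the file name and sample values are invented, the dtypes match the ranges reported above.

```python
import numpy as np
import pandas as pd

# Write a tiny CSV with the same schema as train.csv (the real file is
# ~9.56 GB, so reducing dtypes matters there).
with open("train_sample.csv", "w") as f:
    f.write("acoustic_data,time_to_failure\n")
    f.write("12,1.4690999832\n")
    f.write("6,1.4690999821\n")
    f.write("8,1.4690999810\n")

# Reading with reduced dtypes roughly quarters the memory footprint:
# int16 covers the observed amplitude range (-5,515 to 5,444), and float32
# is enough precision for time_to_failure in our analysis.
train = pd.read_csv(
    "train_sample.csv",
    dtype={"acoustic_data": np.int16, "time_to_failure": np.float32},
)
print(train.dtypes)
print(train.head())
```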

Figure 1.1 - First rows of data in the training data

Let’s visualize, on the same graph, the acoustic signal values and the
time to failure. We will use a subsampling rate of 1/100 (one sample every
100th row) to represent the full training data (see Figure 1.2). The
following code renders these graphs.
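A sketch of the dual-axis plot, assuming matplotlib is available. Synthetic arrays stand in for the two training columns; with the real data, you would plot train["acoustic_data"][::100] and train["time_to_failure"][::100] instead.

```python
import numpy as np
import matplotlib

matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

# Synthetic stand-ins for the two training columns (same summary statistics
# as reported for the real data).
rng = np.random.default_rng(1)
n = 5000
acoustic = rng.normal(4.5, 10.7, size=n)
time_to_failure = np.linspace(16, 0, n)

# Subsample every 100th row and plot both series on twin y-axes, so the
# signal and the countdown to failure share the same time axis.
step = 100
fig, ax1 = plt.subplots(figsize=(12, 4))
ax1.plot(acoustic[::step], color="tab:blue", label="acoustic_data")
ax1.set_xlabel("Sample index (1/100 subsampling)")
ax1.set_ylabel("Acoustic signal", color="tab:blue")
ax2 = ax1.twinx()
ax2.plot(time_to_failure[::step], color="tab:red", label="time_to_failure")
ax2.set_ylabel("Time to failure (s)", color="tab:red")
plt.title("Acoustic signal and time to failure (subsampled)")
fig.savefig("train_overview.png")
```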

Figure 1.2 - Acoustic signal data and time to failure data over the entire training set, subsampled at 1/100

Let’s zoom into the first part of the time interval. We will show the
first 1% of the data, with no subsampling. In Figure 1.3, we are showing,
on the same graph, the acoustic signal and time to failure for the first
6.29 million rows of data. We can observe that before the failure, but not
very close to it in time, there is a large oscillation, with both negative
and positive peaks. This oscillation is also preceded by a few smaller
ones, at irregular time intervals.

Figure 1.3 - Acoustic signal data and time to failure data for the first 1% of the data

Let’s also look at the next 1% of the training data, without subsampling.
In Figure 1.4, we show this time series for acoustic signal values and time
to failure. There is no failure during this time interval. We observe many
irregular small oscillations, with both negative and positive peaks.

Figure 1.4 - Acoustic signal data and time to failure for the second 1% of the data in the training set

Let’s also look at the last few percent of the data (the last 5% of time
in the training set). In Figure 1.5, we observe the same pattern of several
larger oscillations superposed on smaller irregular oscillations, with a
major oscillation just before the failure.

Figure 1.5 - Acoustic signal data and time to failure for the last 5% of the data in the training set

Let’s now also look at a few examples of variations of the acoustic signal
in the test data samples. There are 2,624 data segment files in the test
data. We will select a few of them to visualize. We will use a modified
visualization function since, in the test data, we only have the acoustic
signal.
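The modified function can be sketched like this, assuming matplotlib: it plots the acoustic signal alone, with the segment ID in the title. A synthetic array stands in for a real segment; with the real data, you would pass pd.read_csv("test/seg_00030f.csv")["acoustic_data"].

```python
import numpy as np
import matplotlib

matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

def plot_acoustic_signal(signal, seg_id):
    """Plot only the acoustic signal: test segments have no
    time_to_failure column, so there is no second axis to draw."""
    fig, ax = plt.subplots(figsize=(12, 4))
    ax.plot(signal, color="tab:blue", linewidth=0.5)
    ax.set_xlabel("Sample index")
    ax.set_ylabel("Acoustic signal")
    ax.set_title(f"Acoustic signal data for test segment {seg_id}")
    fig.savefig(f"{seg_id}.png")
    return fig

# Synthetic stand-in for one test segment (the real segments each hold a
# small slice of acoustic_data values read from the test folder).
segment = np.random.default_rng(3).normal(4.5, 10.7, size=150_000)
fig = plot_acoustic_signal(segment, "seg_00030f")
print(len(fig.axes))
```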

In Figure 1.6, we are showing the acoustic signal graph for the segment
seg_00030f.

Figure 1.6 - Acoustic signal data for test segment seg_00030f

In the next figure, we are showing the acoustic signal graph for segment
seg_0012b5.

Figure 1.7 - Acoustic signal data for test segment seg_0012b5

In the notebooks associated with this section, you can see more examples
of such test acoustic signals. The test segments show quite a large variety
of signal profiles, depicting the same sequence of small oscillations with
intercalated peaks of variable amplitude, similar to what we saw in the
subsampled training data earlier.

Conclusion

In this article, we introduced the LANL Earthquake Prediction competition, reviewed the most common formats for storing signal data, and explored the competition data. During the analysis, we observed that large oscillations in the acoustic signal tend to precede a failure; these spike patterns are precisely what a model can exploit to predict the time to the next laboratory earthquake.
