Automatically Updating a Dataset in Kaggle

Updating a Dataset

A.I Hub
6 min read · Jun 18, 2024

In this article, we will learn how to update a dataset on Kaggle. Whether we build a project for a company or for personal use, a regularly updated dataset is an essential element for keeping a machine learning or data science project flexible and efficient.

Image by Vecteezy

Table of Contents

  • Using a notebook to automatically update a dataset
  • Using the Kaggle API to create, update, download and monitor your notebooks

Using a Notebook to automatically update a Dataset

Image by Freepik

You can automate the generation of a Dataset using Kaggle Notebooks by combining two features: a scheduled re-run of the notebook and an update of a Dataset upon each notebook run. First, create the Notebook that will collect the data. It can be, for example, a Notebook that crawls the pages of a certain site to retrieve RSS news feeds, or one that connects to the Twitter API, as in the previous example, to download tweets.

Set the collected data as the Notebook output. After the notebook runs for the first time, initialize a Dataset with the output of the notebook by selecting Output, then Create Dataset, and set the option for the Dataset to be updated every time the notebook runs. Then, edit the notebook again and schedule it to run with the frequency at which you want your data to be refreshed, as you can see in the following screenshot. Once you set it up like that, the notebook will run automatically, and because the Dataset is configured to be updated when the notebook runs, the Dataset will be refreshed automatically going forward.

Figure 1.1 - Scheduling a Notebook to run daily, starting from August 7, 2023
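A collection notebook of the kind described above can be sketched in a few lines. The example below parses an RSS feed and writes the items to a CSV file in the working directory, so the file becomes the notebook output that feeds the Dataset. To keep the sketch self-contained, the feed is an inline sample; in a real notebook you would fetch it, for example with urllib.request.urlopen() pointed at the feed URL, and the file and field names here are placeholders.

```python
import csv
import xml.etree.ElementTree as ET

# Inline stand-in for a real RSS feed; in a scheduled notebook this string
# would come from fetching the feed URL on every run.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <item><title>First post</title><link>https://example.com/1</link></item>
  <item><title>Second post</title><link>https://example.com/2</link></item>
</channel></rss>"""

# Extract one (title, link) row per <item> element.
root = ET.fromstring(SAMPLE_FEED)
rows = [(item.findtext("title"), item.findtext("link"))
        for item in root.iter("item")]

# Write the rows to the working directory; this CSV is what the
# "Output, Create Dataset" step turns into a Dataset.
with open("news_items.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "link"])
    writer.writerows(rows)
```

Each scheduled run overwrites news_items.csv, and the Dataset linked to the notebook output picks up the new version automatically.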

The mechanism described here allows you to perform the entire automation process using only the Kaggle tools available from the user interface. For more complex processes, you can always use the Kaggle API to define and automatically perform your tasks. In the next subsection, we will describe the basic functionality available with the Kaggle API, with a focus on manipulating notebooks.

Using the Kaggle API to Create, Update, Download and Monitor your Notebooks

Image by iStock

The Kaggle API is a powerful tool that extends the functionality available in the Kaggle user interface. You can use it for various tasks: define, update, and download datasets, submit to competitions, define new notebooks, push or pull versions of notebooks, or verify a run status.

There are just two simple steps for you to start using the Kaggle API.

  1. First, you will need to create an authentication token. Navigate to your
    account and, from the right-side icon, select the menu item Account.
    Then go to the API section. Here, click on the Create new API token
    button to download your authentication token, a file named
    kaggle.json. If you will be using the Kaggle API from a Windows
    machine, place it at C:\Users\<your_name>\.kaggle\kaggle.json.
    On a Mac or Linux machine, the path to the file should be
    ~/.kaggle/kaggle.json.
  2. Next, you will have to install the Kaggle API Python module. Run
    pip install kaggle in your selected Python or conda environment
    (prefix it with ! if you run it from inside a notebook).

With these two steps, you are ready to start using the Kaggle API.
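The two steps above can be sketched as shell commands. This assumes the token was saved to ~/Downloads; adjust that path to wherever your browser stored kaggle.json:

```shell
# Step 2: install the Kaggle API client into the active Python environment.
pip install kaggle

# Step 1: create the configuration directory and move the token there.
mkdir -p "$HOME/.kaggle"
if [ -f "$HOME/Downloads/kaggle.json" ]; then
    mv "$HOME/Downloads/kaggle.json" "$HOME/.kaggle/kaggle.json"
fi

# Restrict the token to your user; the client warns if it is readable by others.
chmod 600 "$HOME/.kaggle/kaggle.json" 2>/dev/null || true
```
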

The API also provides multiple options to list notebooks in your account,
check notebook status, download a copy, create the first version of a
notebook, run it and more. Let’s look at each of these options.

  • To list all notebooks whose names match a certain pattern, run the following
    command: kaggle kernels list -s <name-pattern>

The command will return a table with the {username}/{kernel-slug} entries matching the name pattern, along with the last run time, the number of votes, the notebook title and the author's readable name.

  • To verify the status of a certain notebook in your environment, run the
    following command: kaggle kernels status {username}/{kernel-slug}

Here, {username}/{kernel-slug} is not the entire path to the notebook on Kaggle but the part of the path that follows the platform URL, https://www.kaggle.com.

  • The preceding command will return the kernel status. For example, if
    the kernel execution was completed, it will return: {username}/{kernel-slug} has status "complete"
  • You can download a notebook by running the following command: kaggle kernels pull {username}/{kernel-slug} -p /path/to/download

In this case, a Jupyter Notebook with the name {kernel-slug}.ipynb will be downloaded into the folder specified by /path/to/download.

  • To create the first version of a notebook and run it, first define a Kaggle
    metadata file with the command: kaggle kernels init -p /path/to/kernel

Your generated Kaggle metadata file will look like this.
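As a reference, the template generated by kaggle kernels init looks roughly like the following; the INSERT_…_HERE values are placeholders the command itself emits for you to fill in, and the exact set of fields may vary with the client version:

```json
{
  "id": "{username}/INSERT_KERNEL_SLUG_HERE",
  "title": "INSERT_TITLE_HERE",
  "code_file": "INSERT_CODE_FILE_PATH_HERE",
  "language": "Pick one of: {python,r,rmarkdown}",
  "kernel_type": "Pick one of: {script,notebook}",
  "is_private": "true",
  "enable_gpu": "false",
  "enable_internet": "true",
  "dataset_sources": [],
  "competition_sources": [],
  "kernel_sources": []
}
```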

For the purpose of this demonstration, I edited the metadata file to generate a notebook called Test Kaggle API, which uses Python. For your convenience, I replaced my own username with {username}. You need to take care to correlate the {kernel-slug} with the real title, since the {kernel-slug} is normally generated as the lowercase version of the title, with special characters removed and spaces replaced by dashes. Here is the result.
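Based on that description, the edited metadata file would look something like this (a sketch: test-kaggle-api is the lowercased, dash-separated form of the title Test Kaggle API, and the remaining fields are left at plausible defaults):

```json
{
  "id": "{username}/test-kaggle-api",
  "title": "Test Kaggle API",
  "code_file": "test_kaggle_api.ipynb",
  "language": "python",
  "kernel_type": "notebook",
  "is_private": "true",
  "enable_gpu": "false",
  "enable_internet": "true",
  "dataset_sources": [],
  "competition_sources": [],
  "kernel_sources": []
}
```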

  • After you edit the metadata file, you can initiate the notebook with the
    following command: kaggle kernels push -p /path/to/kernel
  • If you also created the prototype of your notebook in the
    /path/to/kernel folder and it is named test_kaggle_api.ipynb, you
    will receive the following answer to your command: Kernel version 1 successfully pushed. Please check progress.
  • You can also use the API to download the output of an existing
    notebook. For this, use the following command: kaggle kernels output {username}/{kernel-slug}

This will download a file called {kernel-slug}.log into the current folder. Alternatively, you can specify a destination path: kaggle kernels output {username}/{kernel-slug} -p /path/to/dest

The file contains the execution logs of the kernel’s last run.

We have learned how to create an authentication token and install the Kaggle
API. Then, we saw how to use the Kaggle API to create a notebook, update
it and download it.

Conclusion

In this section, we learned what Kaggle Notebooks are, what types we can use and with what programming languages. We also learned how to create, run and update notebooks. We then visited some of the basic features for using notebooks, which will allow you to start using notebooks in an effective way: to ingest and analyze data from datasets or competitions, to start training models and to prepare submissions for competitions.

Additionally, we also reviewed some of the advanced features and even introduced the use of the Kaggle API to further extend your usage of notebooks, allowing you to build external data and ML pipelines that integrate with your Kaggle environment.
