8 Tips For Data Science Competitions

  • Benoite Yver
  • August 29, 2019
Data Science Competition

Getting your hands dirty working with real data and predictive modeling problems is one of the best ways to boost your data science skills, but finding quality data sets and interesting learning tasks on your own isn’t easy. Data science competition platforms like Kaggle, DrivenData, and CrowdAnalytix give you the chance to work on real-world machine learning problems in a structured environment and offer cash prizes for the best submissions. Whether you’re new to data competitions or you’re aiming to make a top model, consider these tips to help optimize your time and your score.

1. Read the rules carefully

Data science competition platforms typically host machine learning problems sponsored by other companies and organizations whose needs vary, so the rules and submission process can change from one competition to the next. It is essential to read the instructions and guidelines for each event in detail before you begin so that you get a clear understanding of the prediction task and avoid violating the rules. It is not uncommon for competitions to prohibit private code sharing and the use of outside data. If you break the rules, you may not be eligible to win a prize even if you make a top submission.

2. Do due diligence before downloading the data

The nature of the prediction task and the size and format of the data will vary for each competition, so it is important to get a sense of how difficult the problem is and whether you have the necessary skills and computational resources to tackle it before downloading the data. The competition should provide information about the data, such as its general structure, how large it is, and whether it is split up into multiple files.

Pay special attention to whether the training and test data sets are small enough for you to load into your computer’s memory. If the data fits in memory, you probably have adequate resources to tackle the problem; if not, it will require more work to load and process the data in chunks. Competitions involving image data usually require deep learning with neural networks to get a good score, which can be very computationally expensive, so beware of working on such competitions without access to a powerful graphics card.
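If the data is too big for memory, libraries like pandas let you stream a file in pieces. Here's a minimal sketch of chunked processing; the in-memory `StringIO` stands in for a hypothetical large competition file such as `train.csv`:

```python
import io
import pandas as pd

# Stand-in for a large CSV; in a real competition this would be a file
# path like "train.csv" that is too big to load all at once.
csv_data = io.StringIO("value\n" + "\n".join(str(i) for i in range(10)))

total = 0
rows = 0
# chunksize controls how many rows are held in memory at a time
for chunk in pd.read_csv(csv_data, chunksize=4):
    total += chunk["value"].sum()
    rows += len(chunk)

# Running statistics computed without ever loading the full dataset
mean_value = total / rows
```

The same pattern works for any aggregate you can update incrementally; anything that needs the whole dataset at once requires more careful out-of-core tooling.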

3. Check the forums

Communicating with other competitors is vital to troubleshoot problems and learn about different ways of tackling machine learning tasks. Options for communication will depend on the platform you are using, but many popular sites like Kaggle and DrivenData have discussion forums for each competition that allow users to ask questions and post information about common problems or approaches. Users sometimes share key insights that can help get you on the path to making a good baseline model quickly or avoid major pitfalls lurking in the data. Regularly checking the forums can also help keep you informed of any significant changes in the competition.

4. Start with a simple model

Winning submissions in machine learning competitions are usually more complicated than a simple regression, but that doesn’t mean you should throw the data into a neural network or boosted decision tree right away. Your first goal after loading, cleaning and doing some initial exploration of the data should be to create a simple model and submit it to get a baseline score. Linear regression and logistic regression are good first models for regression and classification problems respectively. Creating a simple model will help you build a pipeline that you can adapt to make more complicated solutions and submitting the predictions it creates will let you confirm that you are making predictions in the proper format while providing a baseline score that you can use to assess the quality of other models.

5. Focus on the features

After building your first pipeline and submitting a simple model, it can be tempting to start building complex models right away. It's okay to try a couple of different models to start, but don't fixate on choosing the right model or model parameters early in the competition. Modeling is an important part of machine learning, but feature engineering, i.e., drawing insights from the data and creating new variables to use as predictors in your models, is usually the most important part of getting a good score. Sometimes features as simple as the product, sum, or quotient of two other variables in the data set can yield significant improvements. Exploring the data and deriving new features from it presents an opportunity to leverage creativity and domain knowledge to get an edge over the competition.
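Derived features of the kind described above take only a few lines with pandas. This sketch uses made-up `price` and `sqft` columns purely for illustration:

```python
import pandas as pd

# Hypothetical columns used only to illustrate derived features
df = pd.DataFrame({"price": [100.0, 250.0, 80.0],
                   "sqft": [50.0, 100.0, 20.0]})

# Product, sum, and quotient of two existing variables
df["price_times_sqft"] = df["price"] * df["sqft"]
df["price_plus_sqft"] = df["price"] + df["sqft"]
df["price_per_sqft"] = df["price"] / df["sqft"]
```

A ratio like `price_per_sqft` is a classic example of a derived feature that encodes domain knowledge a model may not discover on its own.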

6. Always validate

Competitions usually limit the number of submissions you can make per day and, depending on the size of the data, uploading and checking a solution can be a lengthy process. Therefore, it is important to be able to assess the quality of a model without submitting it. To evaluate model quality, you should always use some form of validation. Validation describes splitting the training data into parts, using one to train the model and the other, known as the validation set, to check its performance.

The simplest form of validation is holdout validation where you split the training dataset into two parts, one for training and one for validation. It is typical to use about 80 percent of the data in the training set and 20 percent in the validation set. Another option is to use cross-validation, where you build several different models with several different splits of the training data and then aggregate their performance. Cross-validation tends to provide more reliable results but also takes longer as it involves building several models. Either way, using some form of validation will help you avoid the pitfall of using the training data itself to assess your models, which can lead to extreme overfitting and poor submission scores.
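Both approaches are a few lines in scikit-learn. This sketch shows an 80/20 holdout split alongside 5-fold cross-validation on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic classification data as a stand-in for competition training data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Holdout validation: 80 percent train, 20 percent validation
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2,
                                            random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)
holdout_acc = accuracy_score(y_val, model.predict(X_val))

# Cross-validation: build five models on five different splits
# and aggregate their scores
cv_scores = cross_val_score(LogisticRegression(), X, y, cv=5)
cv_acc = cv_scores.mean()
```

Note that the model is always scored on data it was not trained on; scoring on the training set itself is exactly the overfitting trap described above.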

7. Try top-performing models

Although cleverly constructed features are often the key to getting a top submission, your final score could depend on choosing the right model. There are a few top-performing models that consistently find their way into winning solutions that you should consider applying to your data in any competition. For general regression and classification tasks on tabular data, boosted decision tree models such as XGBoost, LightGBM, and CatBoost are usually sufficient. It is not a bad idea to try a vanilla random forest or regression model as well. For any task involving image data, some form of convolutional neural network will probably be required to get a top score. For sequential data, recurrent neural network architectures tend to perform well.
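Because these tree-based libraries all expose a scikit-learn-style `fit`/`predict` interface, trying several of them is cheap. The sketch below uses scikit-learn's own boosted trees and random forest so it runs without extra dependencies; XGBoost, LightGBM, and CatBoost can be swapped in the same way:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic tabular data with a nonlinear target, where tree models shine
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Compare a boosted tree model and a vanilla random forest
# via cross-validation; both share the same fit/predict interface.
scores = {}
for name, model in [("boosted", GradientBoostingClassifier(random_state=0)),
                    ("forest", RandomForestClassifier(random_state=0))]:
    scores[name] = cross_val_score(model, X, y, cv=3).mean()
```

Keeping model choice behind a uniform interface like this makes it easy to compare candidates against your validation score before spending a daily submission.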

8. Ensemble everything

When it comes to predictive modeling, two models are often better than one. One of the biggest obstacles to making accurate predictions is that complex models tend to overfit to peculiarities in the training data, causing them to generalize poorly to new data that doesn't share those same quirks. Combining the predictions of two or more models, a process known as ensembling, can improve your prediction quality. Ensembling can be as simple as taking the average of the predictions of a few different models. It lets your solution take advantage of the strengths of several models that may detect different patterns in the data, while also lessening the impact of overfitting by any single model.

It is not uncommon for top solutions to consist of ensembles of several or even dozens of models trained with different subsets of features and with various parameters. Sometimes simple models that don’t achieve excellent performance on their own can yield improvements in an ensemble, so trying many different combinations of models may improve your score.
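The simplest ensemble, averaging predicted probabilities from a couple of different models, can be sketched as follows:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data with both a linear and a nonlinear component,
# so the two models below pick up different patterns
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3,
                                            random_state=0)

# Two dissimilar models: a linear model and a tree-based model
models = [LogisticRegression(), RandomForestClassifier(random_state=0)]
probs = [m.fit(X_tr, y_tr).predict_proba(X_val)[:, 1] for m in models]

# The simplest ensemble: average each model's predicted probabilities,
# then threshold to get final class predictions
ensemble = np.mean(probs, axis=0)
preds = (ensemble > 0.5).astype(int)
```

Weighted averages, stacking, and blending are natural next steps once a plain average shows an improvement over the individual models.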

No-code & Low-code Data Transformation

Effective data science starts and ends with clean, organized, and processed data. How you transform your data is critical, in terms of both process and tooling. Data science datasets are assembled by analyzing complex and diverse sources that need to be cleansed, blended, and shaped into final form. Why write Python code to do all this?

Datameer SaaS Data Transformation is the industry’s first collaborative, multi-persona data transformation platform integrated into Snowflake.  The multi-persona UI, with no-code, low-code, and code (SQL) tools, brings together your entire team – data engineers, analytics engineers, analysts, and data scientists – on a single platform to collaboratively transform and model data.  Catalog-like data documentation and knowledge sharing facilitate trust in the data and crowd-sourced data governance.  Direct integration into Snowflake keeps data secure and lowers costs by leveraging Snowflake’s scalable compute and storage.

Learn more about our innovative SaaS data transformation solution, and sign up for your free trial today!