Fit vs. Transform

 

Fit vs. Transform in SciKit libraries for Machine Learning

We have seen methods such as fit(), transform(), and fit_transform() in a lot of scikit-learn's classes. And almost all tutorials, including the ones I've written, only tell you to use one of these methods without explaining why. The obvious question that arises here is, what do those methods mean? What does it mean to fit something, and what does it mean to transform something? The transform() method makes some sense: it just transforms the data. But what about fit()? In this post, we'll try to understand the difference between the two.

To better understand the meaning of these methods, we’ll take the Imputer class as an example, because the Imputer class has these methods. But before we get started, keep in mind that fitting something like an imputer is different from fitting a whole model.

You use an Imputer to handle missing values in a dataset. The Imputer gives you easy methods to replace NaNs and blanks with something like the mean or the median of the column. But before it can replace these values, it has to calculate the value that will be used as the replacement. If you tell the Imputer that you want the mean of all the values in a column to replace all the NaNs in that column, the Imputer has to calculate that mean first. This calculation step is what the fit() method does.

Next, the transform() method just replaces the NaNs in the column with the newly calculated value and returns the new dataset. That's pretty simple. The fit_transform() method does both steps internally and makes it easy for us by exposing a single method. But there are cases where you want to call fit() and transform() separately.
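To make the three methods concrete, here is a rough sketch of a toy, single-column mean imputer in plain Python. The ToyMeanImputer class and its mean_ attribute are made up for this post; this is not how scikit-learn implements its imputer, just an illustration of which work belongs to which method.

```python
import math


class ToyMeanImputer:
    """Toy single-column mean imputer (hypothetical, not scikit-learn's code)."""

    def fit(self, column):
        # fit(): learn the replacement value from the non-missing entries.
        observed = [x for x in column if not math.isnan(x)]
        self.mean_ = sum(observed) / len(observed)
        return self  # returning self allows fit(...).transform(...)

    def transform(self, column):
        # transform(): replace each NaN with the value learned in fit().
        return [self.mean_ if math.isnan(x) else x for x in column]

    def fit_transform(self, column):
        # fit_transform(): both steps on the same data in one call.
        return self.fit(column).transform(column)


nan = float("nan")
imputer = ToyMeanImputer()
filled = imputer.fit_transform([1.0, nan, 3.0])
print(imputer.mean_)  # 2.0
print(filled)         # [1.0, 2.0, 3.0]
```

Notice that fit() changes nothing in the data; it only stores a value on the object. That stored value is what makes it possible to call transform() later on a different dataset.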

When you are training a model, you will use the training dataset. On this dataset, you'll use the Imputer, calculate the value, and replace the blanks. But when you evaluate the trained model on the test dataset, you don't calculate the mean or median again. You use the same value that you used on your training dataset. For this, you call the fit() method on your training dataset to calculate the value and keep it internally in the Imputer. Then, you call the transform() method on the test dataset with the same Imputer object. This way, the value calculated for the training set, which was saved internally in the object, is used on the test dataset as well.
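Here is a minimal sketch of that train/test workflow. One caveat: the Imputer class discussed in this post comes from older scikit-learn releases; newer releases provide SimpleImputer in sklearn.impute instead, so that is what the sketch uses. The tiny X_train and X_test arrays are made-up data for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data: one feature column with missing values.
X_train = np.array([[1.0], [np.nan], [3.0]])
X_test = np.array([[np.nan], [10.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Learn the column mean from the training set and fill its gaps.
X_train_filled = imputer.fit_transform(X_train)

# Reuse the stored training mean on the test set; no refitting here.
X_test_filled = imputer.transform(X_test)

print(imputer.statistics_)    # the learned training mean, one entry per column
print(X_test_filled.ravel())  # the test NaN is filled with the training mean
```

The missing test value is filled with 2.0, the mean of the training column. Had we called fit() on the test set instead, it would have been filled with 10.0, a value the model never saw during training.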

To put it simply, you can use the fit_transform() method on the training set, since there you need to both fit and transform the data. The value calculated during fitting stays inside the object, so for the test data you call only transform() with that same object. Calling fit() on the training set first and transform() afterwards gives the same result.





