Snack's 1967

Easy Data Transform 1 1 0 3




Introduction

When dealing with continuous numeric data, it is often helpful to bin the data into multiple buckets for further analysis. There are several different terms for binning including bucketing, discrete binning, discretization or quantization. Pandas supports these approaches using the cut and qcut functions. This article will briefly describe why you may want to bin your data and how to use the pandas functions to convert continuous data to a set of discrete buckets. Like many pandas functions, cut and qcut may seem simple but there is a lot of capability packed into those functions. Even for more experienced users, I think you will learn a couple of tricks that will be useful for your own analysis.

Binning

One of the most common instances of binning is done behind the scenes for you when creating a histogram. The histogram below of customer sales data shows how a continuous set of sales numbers can be divided into discrete bins (for example: $60,000 - $70,000) and then used to group and count account instances.

Here is the code that shows how we summarize 2018 sales information for a group of customers. This representation illustrates the number of customers that have sales within certain ranges. Sample code is included in this notebook if you would like to follow along.
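As a rough sketch of that kind of summary: the column names account number, name and ext price come from the tables later in this article, but the transaction rows and the histogram edges below are made up for illustration, and np.histogram stands in for the plotted histogram.

```python
import numpy as np
import pandas as pd

# Hypothetical transaction rows; the amounts are invented for this sketch.
raw = pd.DataFrame({
    "account number": [141962, 141962, 146832, 163416, 163416],
    "name": ["Herman LLC", "Herman LLC", "Kiehn-Spinka", "Purdy-Kunde", "Purdy-Kunde"],
    "ext price": [30000.0, 33000.0, 99000.0, 40000.0, 38000.0],
})

# One row per customer with total sales.
df = raw.groupby(["account number", "name"])["ext price"].sum().reset_index()

# Count customers per $10,000 sales range, the same grouping a histogram draws.
counts, edges = np.histogram(df["ext price"], bins=[60000, 70000, 80000, 90000, 100000])
print(df)
print(counts)
```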

There are many other scenarios where you may want to define your own bins. In the example above, there are 8 bins with data. What if we wanted to divide our customers into 3, 4 or 5 groupings? That’s where pandas qcut and cut come into play. These functions sound similar and perform similar binning functions but have differences that might be confusing to new users. They also have several options that can make them very useful for day to day analysis. The rest of the article will show what their differences are and how to use them.

qcut

The pandas documentation describes qcut as a “Quantile-based discretization function.” This basically means that qcut tries to divide up the underlying data into equal sized bins. The function defines the bins using percentiles based on the distribution of the data, not the actual numeric edges of the bins.

If you have used the pandas describe function, you have already seen an example of the underlying concepts represented by qcut:

Keep in mind the values for the 25%, 50% and 75% percentiles as we look at using qcut directly.
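To make the describe connection concrete, here is a minimal sketch. The 20 sales totals are made-up values (the article's real data is not reproduced here); only the ext price column name is taken from the article.

```python
import pandas as pd

# Made-up sales totals for 20 customers (illustrative, not the article's data).
prices = pd.Series(
    [56000, 62000, 64000, 71000, 78000, 80000, 87000, 90000, 91000, 92000,
     96000, 100000, 101000, 102000, 104000, 105000, 108000, 110000, 125000, 185000],
    dtype="float64", name="ext price",
)

# describe reports min, the 25%/50%/75% percentiles and max by default.
summary = prices.describe()
print(summary)
```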

The simplest use of qcut is to define the number of quantiles and let pandas figure out how to divide up the data. In the example below, we tell pandas to create 4 equal sized groupings of the data.

The result is a categorical series representing the sales bins. Because we asked for quantiles with q=4 the bins match the percentiles from the describe function.
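A small sketch of that call, again on made-up sales totals rather than the article's data. It also pulls the bin edges with retbins=True so they can be compared with the describe percentiles.

```python
import pandas as pd

# Made-up sales totals for 20 customers (illustrative values only).
prices = pd.Series(
    [56000, 62000, 64000, 71000, 78000, 80000, 87000, 90000, 91000, 92000,
     96000, 100000, 101000, 102000, 104000, 105000, 108000, 110000, 125000, 185000],
    dtype="float64", name="ext price",
)

# Four quantile-based bins; the result is a categorical series of intervals.
quartiles = pd.qcut(prices, q=4)

# The same call with retbins=True also returns the numeric bin edges.
_, edges = pd.qcut(prices, q=4, retbins=True)
summary = prices.describe()
print(quartiles.head())
print(edges)
```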

A common use case is to store the bin results back in the original dataframe for future analysis. For this example, we will create 4 bins (aka quartiles) and 10 bins (aka deciles) and store the results back in the original dataframe:

   account number  name          ext price  quantile_ex_1                    quantile_ex_2
0  141962          Herman LLC    63626.03   (55733.049000000006, 89137.708]  (55732.0, 76471.0]
1  146832          Kiehn-Spinka  99608.77   (89137.708, 100271.535]          (95908.0, 100272.0]
2  163416          Purdy-Kunde   77898.21   (55733.049000000006, 89137.708]  (76471.0, 87168.0]
3  218895          Kulas Inc     137351.96  (110132.552, 184793.7]           (124778.0, 184794.0]
4  239344          Stokes LLC    91535.92   (89137.708, 100271.535]          (90686.0, 95908.0]

You can see how the bins are very different between quantile_ex_1 and quantile_ex_2. I also introduced the use of precision to define how many decimal points to use for calculating the bin precision.
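The two columns can be built along these lines. This is a sketch on made-up sales totals, so the interval edges will not match the table; the column names follow the article.

```python
import pandas as pd

# Made-up sales totals for 20 customers (illustrative values only).
df = pd.DataFrame({"ext price": [
    56000.0, 62000.0, 64000.0, 71000.0, 78000.0, 80000.0, 87000.0, 90000.0, 91000.0, 92000.0,
    96000.0, 100000.0, 101000.0, 102000.0, 104000.0, 105000.0, 108000.0, 110000.0, 125000.0, 185000.0,
]})

# Quartiles with the default label precision.
df["quantile_ex_1"] = pd.qcut(df["ext price"], q=4)

# Deciles, with interval labels rounded to whole numbers.
df["quantile_ex_2"] = pd.qcut(df["ext price"], q=10, precision=0)
print(df.head())
```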

The other interesting view is to see how the values are distributed across the bins using value_counts:

Now, for the second column:

This illustrates a key concept. In each case, there are an equal number of observations in each bin. Pandas does the math behind the scenes to figure out how wide to make each bin. For instance, in quantile_ex_1 the range of the last bin is 74,661.15 (184793.7 - 110132.552) while the third bin is only 9,861.02 (110132.552 - 100271.535).
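Here is one way to see both halves of that statement at once, sketched on made-up sales totals: value_counts gives the number of observations per bin, and the length attribute of the interval categories gives each bin's width.

```python
import pandas as pd

# Made-up sales totals for 20 customers (illustrative values only).
prices = pd.Series(
    [56000, 62000, 64000, 71000, 78000, 80000, 87000, 90000, 91000, 92000,
     96000, 100000, 101000, 102000, 104000, 105000, 108000, 110000, 125000, 185000],
    dtype="float64", name="ext price",
)

quartiles = pd.qcut(prices, q=4)

# Observations per bin, listed in bin order rather than by count.
counts = quartiles.value_counts(sort=False)

# Width of each interval: equal counts, very different widths.
widths = quartiles.cat.categories.length
print(counts)
print(widths)
```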

One of the challenges with this approach is that the bin labels are not very easy to explain to an end user. For instance, if we wanted to divide our customers into 5 groups (aka quintiles) like an airline frequent flier approach, we can explicitly label the bins to make them easier to interpret.

   account number  name          ext price  quantile_ex_1                    quantile_ex_2         quantile_ex_3
0  141962          Herman LLC    63626.03   (55733.049000000006, 89137.708]  (55732.0, 76471.0]    Bronze
1  146832          Kiehn-Spinka  99608.77   (89137.708, 100271.535]          (95908.0, 100272.0]   Gold
2  163416          Purdy-Kunde   77898.21   (55733.049000000006, 89137.708]  (76471.0, 87168.0]    Bronze
3  218895          Kulas Inc     137351.96  (110132.552, 184793.7]           (124778.0, 184794.0]  Diamond
4  239344          Stokes LLC    91535.92   (89137.708, 100271.535]          (90686.0, 95908.0]    Silver

In the example above, I did some things a little differently. First, I explicitly defined the range of quantiles to use: q=[0, .2, .4, .6, .8, 1]. I also passed labels=bin_labels_5 to define the labels used when representing the bins.

Let’s check the distribution:

As expected, we now have an equal distribution of customers across the 5 bins and the results are displayed in an easy to understand manner.
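The labeled quintile example and its distribution check can be sketched as follows. The sales totals are made up; the tier names match the ones shown in the tables, and bin_labels_5 is the list name mentioned above.

```python
import pandas as pd

# Made-up sales totals for 20 customers (illustrative values only).
df = pd.DataFrame({"ext price": [
    56000.0, 62000.0, 64000.0, 71000.0, 78000.0, 80000.0, 87000.0, 90000.0, 91000.0, 92000.0,
    96000.0, 100000.0, 101000.0, 102000.0, 104000.0, 105000.0, 108000.0, 110000.0, 125000.0, 185000.0,
]})

bin_labels_5 = ["Bronze", "Silver", "Gold", "Platinum", "Diamond"]

# Explicit quantile boundaries plus one label per resulting bin.
df["quantile_ex_3"] = pd.qcut(df["ext price"],
                              q=[0, .2, .4, .6, .8, 1],
                              labels=bin_labels_5)

# Customers per tier, in tier order.
tier_counts = df["quantile_ex_3"].value_counts(sort=False)
print(tier_counts)
```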

One important item to keep in mind when using qcut is that the quantiles must all be between 0 and 1, inclusive. Here are some examples of distributions. In most cases it’s simpler to just define q as an integer:

  • terciles: q=[0, 1/3, 2/3, 1] or q=3
  • quintiles: q=[0, .2, .4, .6, .8, 1] or q=5
  • sextiles: q=[0, 1/6, 1/3, .5, 2/3, 5/6, 1] or q=6
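As a quick check that the two spellings in the list above are interchangeable, this sketch bins made-up sales totals into terciles both ways and compares the integer bin codes assigned to each value.

```python
import pandas as pd

# Made-up sales totals for 20 customers (illustrative values only).
prices = pd.Series(
    [56000, 62000, 64000, 71000, 78000, 80000, 87000, 90000, 91000, 92000,
     96000, 100000, 101000, 102000, 104000, 105000, 108000, 110000, 125000, 185000],
    dtype="float64", name="ext price",
)

by_count = pd.qcut(prices, q=3)                    # integer form
by_edges = pd.qcut(prices, q=[0, 1/3, 2/3, 1])     # explicit quantile list

# Compare the bin number each value lands in.
same_bins = (by_count.cat.codes == by_edges.cat.codes).all()
print(same_bins)
```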

One question you might have is, how do I know what ranges are used to identify the different bins? You can use retbins=True to return the bin edges. Here’s a handy snippet of code to build a quick reference table:

   Threshold   Tier
0  55733.050   Bronze
1  87167.958   Silver
2  95908.156   Gold
3  103606.970  Platinum
4  112290.054  Diamond
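A reference table like the one above could be built roughly like this. It is a sketch on made-up sales totals, so the thresholds differ from the table; results_table is a name chosen for this example.

```python
import pandas as pd

# Made-up sales totals for 20 customers (illustrative values only).
prices = pd.Series(
    [56000, 62000, 64000, 71000, 78000, 80000, 87000, 90000, 91000, 92000,
     96000, 100000, 101000, 102000, 104000, 105000, 108000, 110000, 125000, 185000],
    dtype="float64", name="ext price",
)
bin_labels_5 = ["Bronze", "Silver", "Gold", "Platinum", "Diamond"]

# retbins=True returns the binned data and the array of bin edges.
results, bin_edges = pd.qcut(prices,
                             q=[0, .2, .4, .6, .8, 1],
                             labels=bin_labels_5,
                             retbins=True)

# zip stops at the shorter input, pairing each label with its lower edge.
results_table = pd.DataFrame(list(zip(bin_edges, bin_labels_5)),
                             columns=["Threshold", "Tier"])
print(results_table)
```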

Here is another trick that I learned while doing this article. If you try df.describe on categorical values, you get different summary results:

        quantile_ex_1           quantile_ex_2         quantile_ex_3
count   20                      20                    20
unique  4                       10                    5
top     (110132.552, 184793.7]  (124778.0, 184794.0]  Diamond
freq    5                       2                     4

I think this is useful and also a good summary of how qcut works.
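A sketch of that summary on made-up sales totals. Passing include="category" is one explicit way to restrict describe to the categorical columns; I am assuming that is close to what produced the table above.

```python
import pandas as pd

# Made-up sales totals for 20 customers (illustrative values only).
df = pd.DataFrame({"ext price": [
    56000.0, 62000.0, 64000.0, 71000.0, 78000.0, 80000.0, 87000.0, 90000.0, 91000.0, 92000.0,
    96000.0, 100000.0, 101000.0, 102000.0, 104000.0, 105000.0, 108000.0, 110000.0, 125000.0, 185000.0,
]})
bin_labels_5 = ["Bronze", "Silver", "Gold", "Platinum", "Diamond"]
df["quantile_ex_1"] = pd.qcut(df["ext price"], q=4)
df["quantile_ex_3"] = pd.qcut(df["ext price"], q=5, labels=bin_labels_5)

# For categorical columns describe reports count, unique, top and freq.
cat_summary = df.describe(include="category")
print(cat_summary)
```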

While we are discussing describe, we can use the percentiles argument to define our percentiles using the same format we used for qcut:

       account number  ext price
count  20.000000       20.000000
mean   476998.750000   101711.287500
std    231499.208970   27037.449673
min    141962.000000   55733.050000
0%     141962.000000   55733.050000
33.3%  332759.333333   91241.493333
50%    476006.500000   100271.535000
66.7%  662511.000000   104178.580000
100%   786968.000000   184793.700000
max    786968.000000   184793.700000

There is one minor note about this functionality. Passing 0 or 1 just means that the 0% will be the same as the min and 100% will be the same as the max. I also learned that the 50th percentile will always be included, regardless of the values passed.
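A minimal sketch of the percentiles argument on made-up sales totals. It only checks the min/max relationship described above; whether the 50% row is added automatically may depend on the pandas version you have installed.

```python
import pandas as pd

# Made-up sales totals for 20 customers (illustrative values only).
prices = pd.Series(
    [56000, 62000, 64000, 71000, 78000, 80000, 87000, 90000, 91000, 92000,
     96000, 100000, 101000, 102000, 104000, 105000, 108000, 110000, 125000, 185000],
    dtype="float64", name="ext price",
)

# Same tercile boundaries we would pass to qcut as q.
summary = prices.describe(percentiles=[0, 1/3, 2/3, 1])
print(summary)
```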

Before we move on to describing cut, there is one more potential way that we can label our bins. Instead of the bin ranges or custom labels, we can return integers by passing labels=False.

   account number  name          ext price  quantile_ex_1                    quantile_ex_2         quantile_ex_3  quantile_ex_4
0  141962          Herman LLC    63626.03   (55733.049000000006, 89137.708]  (55732.0, 76471.0]    Bronze         0
1  146832          Kiehn-Spinka  99608.77   (89137.708, 100271.535]          (95908.0, 100272.0]   Gold           2
2  163416          Purdy-Kunde   77898.21   (55733.049000000006, 89137.708]  (76471.0, 87168.0]    Bronze         0
3  218895          Kulas Inc     137351.96  (110132.552, 184793.7]           (124778.0, 184794.0]  Diamond        4
4  239344          Stokes LLC    91535.92   (89137.708, 100271.535]          (90686.0, 95908.0]    Silver         1

Personally, I think using bin_labels is the most useful scenario but there could be cases where the integer response might be helpful so I wanted to explicitly point it out.
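The integer-label variant looks roughly like this, sketched on made-up sales totals that happen to be sorted from lowest to highest.

```python
import pandas as pd

# Made-up sales totals for 20 customers, sorted ascending (illustrative only).
df = pd.DataFrame({"ext price": [
    56000.0, 62000.0, 64000.0, 71000.0, 78000.0, 80000.0, 87000.0, 90000.0, 91000.0, 92000.0,
    96000.0, 100000.0, 101000.0, 102000.0, 104000.0, 105000.0, 108000.0, 110000.0, 125000.0, 185000.0,
]})

# labels=False returns the position of each bin (0 = lowest quintile).
df["quantile_ex_4"] = pd.qcut(df["ext price"],
                              q=[0, .2, .4, .6, .8, 1],
                              labels=False)
print(df.head())
```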

cut

Now that we have discussed how to use qcut, we can show how cut is different. Many of the concepts we discussed above apply but there are a couple of differences with the usage of cut.

The major distinction is that qcut will calculate the size of each bin in order to make sure the distribution of data in the bins is equal. In other words, all bins will have (roughly) the same number of observations but the bin range will vary.

On the other hand, cut is used to specifically define the bin edges. There is no guarantee about the distribution of items in each bin. In fact, you can define bins in such a way that no items are included in a bin or nearly all items are in a single bin.

In real world examples, bins may be defined by business rules. For a frequent flier program, 25,000 miles is the silver level and that does not vary based on year to year variation of the data. If we want to define the bin edges (25,000 - 50,000, etc) we would use cut. We can also use cut to define bins that are of constant size and let pandas figure out how to define those bin edges.

Some examples should make this distinction clear.

For the sake of simplicity, I am removing the previous columns to keep the examples short:

For the first example, we can cut the data into 4 equal bin sizes. Pandas will perform the math behind the scenes to determine how to divide the data set into these 4 groups:

Let’s look at the distribution:

The first thing you’ll notice is that the bin ranges are all about 32,265 but that the distribution of bin elements is not equal. The bins have a distribution of 12, 5, 2 and 1 item(s) in each bin. In a nutshell, that is the essential difference between cut and qcut.

If you want equal distribution of the items in your bins, use qcut. If you want to define your own numeric bin ranges, then use cut.
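The contrast is easy to see side by side. This sketch uses made-up sales totals, so the counts differ from the 12, 5, 2 and 1 quoted above, but the pattern is the same: cut gives equal-width bins with uneven counts, qcut gives even counts with uneven widths.

```python
import pandas as pd

# Made-up sales totals for 20 customers (illustrative values only).
prices = pd.Series(
    [56000, 62000, 64000, 71000, 78000, 80000, 87000, 90000, 91000, 92000,
     96000, 100000, 101000, 102000, 104000, 105000, 108000, 110000, 125000, 185000],
    dtype="float64", name="ext price",
)

# cut: four bins of (nearly) equal width across the min-max range.
cut_counts = pd.cut(prices, bins=4).value_counts(sort=False)

# qcut: four bins holding the same number of observations.
qcut_counts = pd.qcut(prices, q=4).value_counts(sort=False)
print(cut_counts)
print(qcut_counts)
```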

Before going any further, I wanted to give a quick refresher on interval notation. In the examples above, there has been liberal use of ()’s and []’s to denote how the bin edges are defined. For those of you (like me) that might need a refresher on interval notation, I found this simple site very easy to understand.

To bring this home to our example, here is a diagram based off the example above:

When using cut, you may be defining the exact edges of your bins so it is important to understand if the edges include the values or not. Depending on the data set and specific use case, this may or may not be a big issue. It can certainly be a subtle issue you do need to consider.

To bring it into perspective, when you present the results of your analysis to others, you will need to be clear whether an account with 70,000 in sales is a silver or gold customer.

Here is an example where we want to specifically define the boundaries of our 4 bins by defining the bins parameter.

   account number  name          ext price  cut_ex1
0  141962          Herman LLC    63626.03   silver
1  146832          Kiehn-Spinka  99608.77   gold
2  163416          Purdy-Kunde   77898.21   gold
3  218895          Kulas Inc     137351.96  diamond
4  239344          Stokes LLC    91535.92   gold
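Here is a sketch that reproduces the cut_ex1 column for the five rows shown above. The ext price values come from the table; the bin edges and the platinum label are my assumptions, chosen so the output agrees with the table. The last lines check where a customer with exactly 70,000 in sales lands.

```python
import pandas as pd

# The five ext price values from the table above.
df = pd.DataFrame({"ext price": [63626.03, 99608.77, 77898.21, 137351.96, 91535.92]})

# Assumed edges and labels (not confirmed by the article text).
cut_bins = [0, 70000, 100000, 130000, 200000]
cut_labels_4 = ["silver", "gold", "platinum", "diamond"]

df["cut_ex1"] = pd.cut(df["ext price"], bins=cut_bins, labels=cut_labels_4)
print(df)

# Bins are closed on the right by default, so 70,000 falls in (0, 70000].
boundary = pd.cut(pd.Series([70000.0]), bins=cut_bins, labels=cut_labels_4)
print(boundary.iloc[0])
```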

One of the challenges with defining the bin ranges with cut is that it can be cumbersome to create the list of all the bin ranges. There are a couple of shortcuts we can use to compactly create the ranges we need.

First, we can use numpy.linspace to create an equally spaced range:

Numpy’s linspace is a simple function that provides an array of evenly spaced numbers over a user defined range. In this example, we want 9 evenly spaced cut points between 0 and 200,000. Astute readers may notice that we have 9 numbers but only 8 categories. If you map out the actual categories, it should make sense why we ended up with 8 categories between 0 and 200,000. In all instances, there is one less category than the number of cut points.

The other option is to use numpy.arange which offers similar functionality. I found this article a helpful guide in understanding both functions. I recommend trying both approaches and seeing which one works best for your needs.
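Both helpers can produce the same nine cut points. In this sketch the ext price values are the five shown in the tables above; the variable names are just for illustration.

```python
import numpy as np
import pandas as pd

# linspace: ask for the number of cut points (9 points, 25,000 apart).
linspace_edges = np.linspace(0, 200000, 9)

# arange: ask for the step size; stop is exclusive, so go just past 200,000.
arange_edges = np.arange(0, 200001, 25000)

# The five ext price values shown in the article's tables.
prices = pd.Series([63626.03, 99608.77, 77898.21, 137351.96, 91535.92],
                   name="ext price")

# 9 cut points give 8 bins.
binned = pd.cut(prices, bins=linspace_edges)
print(binned.cat.categories)
```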

There is one additional option for defining your bins and that is using pandas interval_range. I had to look at the pandas documentation to figure out this one. It is a bit esoteric but I think it is good to include it.

The interval_range offers a lot of flexibility. For instance, it can be used on date ranges as well as numerical values. Here is a numeric example:


There is a downside to using interval_range. You cannot define custom labels.

   account number  name          ext price  cut_ex1  cut_ex2
0  141962          Herman LLC    63626.03   gold     (60000, 70000]
1  146832          Kiehn-Spinka  99608.77   silver   (90000, 100000]
2  163416          Purdy-Kunde   77898.21   silver   (70000, 80000]
3  218895          Kulas Inc     137351.96  diamond  (130000, 140000]
4  239344          Stokes LLC    91535.92   silver   (90000, 100000]

As shown above, the labels parameter is ignored when using the interval_range.
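A sketch of the interval_range approach that reproduces the cut_ex2 column for the five rows above. The start, freq and end values are my assumptions, picked to give the 10,000-wide intervals shown in the table.

```python
import pandas as pd

# The five ext price values from the table above.
df = pd.DataFrame({"ext price": [63626.03, 99608.77, 77898.21, 137351.96, 91535.92]})

# Assumed parameters: 20 right-closed intervals, each 10,000 wide.
interval_bins = pd.interval_range(start=0, freq=10000, end=200000)

# The intervals themselves become the categories.
df["cut_ex2"] = pd.cut(df["ext price"], bins=interval_bins)
print(df)
```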

In my experience, I use a custom list of bin ranges or linspace if I have a large number of bins.

One of the differences between cut and qcut is that you can also use the include_lowest parameter to define whether or not the first bin should include its lowest edge value. Finally, passing right=False will make each bin closed on the left and open on the right, so the right edge of each bin is excluded. Because cut allows much more specificity of the bins, these parameters can be useful to make sure the intervals are defined in the manner you expect.

The rest of the cut functionality is similar to qcut. We can return the bins using retbins=True or adjust the precision using the precision argument.
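The edge-handling options are easiest to see on a tiny made-up series where some values sit exactly on a bin edge. The bin codes show which bin each value lands in, with -1 meaning the value fell outside every bin.

```python
import pandas as pd

# Tiny made-up example with values exactly on the bin edges.
values = pd.Series([0, 25, 50, 100])
bins = [0, 50, 100]

# Default: bins are (0, 50] and (50, 100], so 0 is left out.
default = pd.cut(values, bins=bins)

# include_lowest=True: the first bin also takes the lowest edge, 0.
with_lowest = pd.cut(values, bins=bins, include_lowest=True)

# right=False: bins are [0, 50) and [50, 100), so 100 is left out.
left_closed = pd.cut(values, bins=bins, right=False)

print(default.cat.codes.tolist())
print(with_lowest.cat.codes.tolist())
print(left_closed.cat.codes.tolist())
```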


One final trick I want to cover is that value_counts includes a shortcut for binning and counting the data. It is somewhat analogous to the way describe can be a shortcut for qcut.

If we want to bin a value into 4 bins and count the number of occurrences:

By default value_counts will sort with the highest count first. By passing sort=False the bins will be sorted by numeric order which can be a helpful view.
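A short sketch of the shortcut on made-up sales totals, showing both sort orders.

```python
import pandas as pd

# Made-up sales totals for 20 customers (illustrative values only).
prices = pd.Series(
    [56000, 62000, 64000, 71000, 78000, 80000, 87000, 90000, 91000, 92000,
     96000, 100000, 101000, 102000, 104000, 105000, 108000, 110000, 125000, 185000],
    dtype="float64", name="ext price",
)

# Bin into 4 equal-width bins and count; the largest count is listed first.
by_count = prices.value_counts(bins=4)

# sort=False keeps the bins in numeric order instead.
by_bin = prices.value_counts(bins=4, sort=False)
print(by_count)
print(by_bin)
```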


Summary

The concept of breaking continuous values into discrete bins is relatively straightforward to understand and is a useful concept in real world analysis. Fortunately, pandas provides the cut and qcut functions to make this as simple or complex as you need it to be. I hope this article proves useful in understanding these pandas functions. Please feel free to comment below if you have any questions.


Updates

  • 29-October-2019: Modified to include value_counts shortcut for binning and counting the data.
  • 17-December-2019: Published article on natural breaks which leverages these concepts and provides another useful method for binning numbers.
