Scikit-learn to "learn them all"
Video in TIB AV-Portal: Scikit-learn to "learn them all"
Formal Metadata
Title 
Scikit-learn to "learn them all"

Alternative Title 
Why SCIKIT-LEARN is so cool

Title of Series  
Part Number 
49

Number of Parts 
120

Author 

License 
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. 
Identifiers 

Publisher 

Release Date 
2014

Language 
English

Production Place 
Berlin

Content Metadata
Subject Area  
Abstract 
Valerio Maggio - Scikit-learn to "learn them all"

Scikit-learn is a powerful library, providing implementations for many of the most popular machine learning algorithms. This talk will provide an overview of the "batteries" included in Scikit-learn, along with working code examples and internal insights, in order to get the best for our machine learning code.

**Machine Learning** is about *using the right features, to build the right models, to achieve the right tasks*. However, to come up with a definition of what **right** actually means for the problem at hand, it is required to analyse huge amounts of data, and to evaluate the performance of different algorithms on these data. Moreover, deriving a working machine learning solution for a given problem is far from being a *waterfall* process. It is an iterative process where continuous refinements are required for the data to be used (i.e., the *right features*), and the algorithms to apply (i.e., the *right models*).

In this scenario, Python has been found very useful by practitioners and researchers: its high-level nature, in combination with available tools and libraries, allows one to rapidly implement working machine learning code without *reinventing the wheel*.

**Scikit-learn** is an actively developed Python library, built on top of the solid `numpy` and `scipy` packages. Scikit-learn (`sklearn`) is an *all-in-one* software solution, providing implementations for several machine learning methods, along with datasets and (performance) evaluation algorithms. These "batteries" included in the library, in combination with a nice and intuitive software API, have made scikit-learn one of the most popular Python packages for writing machine learning code.

In this talk, a general overview of scikit-learn will be presented, along with brief explanations of the techniques provided out-of-the-box by the library. These explanations will be supported by working code examples, and insights on the algorithms' implementations aimed at providing hints on how to extend the library code. Moreover, advantages and limitations of the `sklearn` package will be discussed in comparison with other existing machine learning Python libraries (e.g., "Shogun Toolbox", "PyML", "MLPy"). In conclusion, (examples of) applications of scikit-learn to big data and computationally intensive tasks will also be presented.

The general outline of the talk is reported as follows (the order of the topics may vary):

* Intro to Machine Learning
* Machine Learning in Python
* Intro to Scikit-Learn
* Overview of Scikit-Learn
* Comparison with other existing ML Python libraries
* Supervised Learning with `sklearn`
* Text Classification with SVM and Kernel Methods
* Unsupervised Learning with `sklearn`
* Partitional and Model-based Clustering (i.e., k-means and Mixture Models)
* Scaling up Machine Learning
* Parallel and Large Scale ML with `sklearn`

The talk is intended for an intermediate level audience (i.e., Advanced). It requires basic math skills and a good knowledge of the Python language.

Keywords  EuroPython Conference EP 2014 EuroPython 2014 
00:16
So the talk today is about scikit-learn, or in other
00:19
words, why I think scikit-learn is so cool. First of all, I would like to ask you
00:25
three questions. First, do you already know what machine learning is? All right, perfect. The second one is: have you ever used scikit-learn? OK. And the third one is:
00:46
did any of you also attend the great training on scikit-learn yesterday? OK. OK, that was just to raise a few questions.
00:58
So what does machine learning actually mean? There are many definitions of machine learning, and one is this: machine learning teaches machines how to carry out tasks by themselves. OK, it is a very simple definition, and it is that simple: the complexity comes with the details. OK, it is a very general definition, just to give you the intuition behind
01:25
machine learning. At a glance, machine learning is about algorithms that are able to analyze, to crunch the data, and in particular to learn from the data. They basically exploit statistical approaches, so that's why
01:43
almost every word here about machine learning is
01:48
mostly related to data analysis techniques. There are many buzzwords around machine learning: you may have heard about data analysis, data mining, big data and data science. OK. Data science is actually the study of the generalizable extraction of knowledge from data, and machine learning is related to data science
02:12
according to Drew Conway's Venn diagram. Machine learning is in the middle, and data science is tied to machine learning because it exploits machine learning: machine learning is a fundamental part of the data science steps. OK.
02:28
But what is actually the relation of data mining, and data analysis in general, with machine learning? Machine learning is
02:38
about making predictions. OK, so instead of only analyzing the data we have, machine learning is also able to generalize from this data. We have a bunch of data, we may want to crunch this data to run some statistical analysis on it, and that's it. OK, this is also called data mining. Machine learning is a bit different, because machine learning performs this analysis but the goal is slightly different: the goal is to analyze this data and generalize, to try to learn from this data a general model for future data, for data that is still unseen at this point. OK, so the idea is: a pattern exists in the data; we cannot pin this pattern down manually, but we have data on it, so we may learn from this data. In other words, this is also known as learning by examples. OK. Machine learning comes in two different settings. There is the
03:47
supervised setting. This is the general pipeline of a machine learning algorithm. You have all the data in the upper left corner; you translate the data into feature vectors, which is a very common step in preprocessing the data; then you feed these feature vectors to your machine learning algorithm. The supervised learning setting also includes the labels, which are the set of expected results on this data. Then we generate the model from the feature vectors and the labels, and we generalize: we get the model to predict on future data, in the bottom left corner of the figure. OK.
04:34
A classical example of supervised learning is classification. You have two different groups of data, and you want to find a general rule to separate this data. In this case you find a function that separates the data, and for future data you will be able to know which is its class. So this is classification: we have two classes, and in the future, when you get new data, you will be able to predict which is the class associated with this data. Another example is clustering. In this case the setting is called unsupervised learning. The processing pipeline is this
05:20
one: you have the same processing, but what you miss is the label part. OK, that's why this is called unsupervised: you have no supervision on the data, you have no labels to predict. OK. And as for clustering, the problem is:
05:38
get a bunch of data and try to clusterize it, in other words to separate the data into different groups. OK, so you have a bunch of data and you want to identify the groups inside this data. OK, that was just a brief introduction. So what
05:54
about Python? Python and data science are very related nowadays; actually, Python is getting more and more packages for computational science. According to this graph, Python is a cutting-edge technology for this kind of computation: it is in the upper right corner, and actually it is now replacing
06:26
and substituting other technologies such as R or Matlab. One of the advantages of Python is that it provides a unique programming language across different applications. It has a very huge set of libraries to exploit, and this is the reason why Python is the language of choice for data science and is displacing Matlab. By the way, there will also be a PyData conference at the end of the week; it will be on Friday, so if you can, please come.
07:10
Data science and Python: actually, Matlab can be easily substituted by all of these technologies, such as NumPy, SciPy, and matplotlib for plotting, though there are many other possibilities, especially for plotting. R can be easily substituted by pandas, a great package. And in the Python ecosystem we also have some efficient Python interpreters that have been compiled for this kind of computation, such as [inaudible], and projects like Cython. Cython is a very great project that allows you to speed up the computation of Python code.
07:59
The packages for machine learning are manifold.
08:02
Actually, I am trying to describe briefly the set of well-known packages for machine learning code, and I would like to make some considerations about why scikit-learn is a very great one. OK. So far we have mlpy, PyML, the Natural Language Toolkit (NLTK), Shogun (the Shogun machine learning toolbox; this morning there was a talk about it), scikit-learn of course, and PyBrain. OK. There is also a guy who started a list on GitHub, where everybody can offer a contribution, in order to distribute the knowledge about available packages in different languages, and the Python list is very long. Then we have Spark MLlib: Spark is actually implemented in Scala, it is not Python; there is a wrapper in Python, which is called PySpark, but the library for machine learning is at a very early stage. Shogun is written in C++ and offers a lot of interfaces; one of these interfaces is in Python. The other packages are Python-powered, so we are going to talk about those. NLTK is implemented in pure Python, so no NumPy or SciPy is involved, while the other packages are implemented on top of NumPy and SciPy, so the code is far more efficient for large-scale computation. NLTK supports Python 2, and Python 3 support is also at an alpha stage; PyML supports Python 2, and whether it supports Python 3 is not so clear; PyBrain supports only Python 2; and the other two support both Python 2 and Python 3. OK. What about the purpose of these packages? NLTK is for natural language processing. It has some algorithms for machine learning, but it is not supposed to be used as a complete machine learning environment: it is mostly related to text analysis and natural language processing in general. PyML mostly focuses on supervised learning, in particular on SVM, the technique which is
the support vector machine; it does not have many other algorithms, and they are especially related to supervised learning. PyBrain is for neural networks, which is another set of techniques in the machine learning ecosystem. The other two are for more general problems: scikit-learn and mlpy contain algorithms for supervised and unsupervised learning, and some others for the different settings of machine learning. OK, so we will not consider PyML and PyBrain any more, and we end up with these three libraries written in Python for machine learning code. So why
11:43
choose scikit-learn?
11:46
Ben Lorica, who is a big data guy, recommends scikit-learn for six reasons. The first one is the commitment to documentation and usability: scikit-learn has brilliant documentation, and it is very useful for newcomers and for people without any background in machine learning. The second reason is that models are chosen and implemented by a dedicated team of experts. Then, the set of models supported by the library covers most machine learning tasks. OK. Python and PyData improve the support for data science tools and data science problems. Actually, I don't know if you know Kaggle: Kaggle is a site where you may apply to competitions for data science, and scikit-learn is one of the most used packages for this kind of competition. Another reason is the focus: scikit-learn is a machine learning library, and its goal is to provide a set of common algorithms to Python users through a consistent interface. These two features are the features that I like the most; I will be more precise about this in a few slides. And finally, last but by no means least, scikit-learn scales to most data problems. OK, so scalability is another feature that scikit-learn supports out of the box. If you want to
13:27
install scikit-learn, you have to type very few commands: you need to install NumPy, SciPy, matplotlib and
13:37
IPython, which actually is not needed, it is just for convenience, and then you install scikit-learn. The other packages, NumPy and SciPy, are required because scikit-learn is based on NumPy and SciPy. OK, but anyway, if you install a version of the Python interpreter such as [inaudible], scikit-learn is already provided out of the box.
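A quick way to confirm that the stack described above is installed is to import it and print the versions. This is only a minimal sketch; matplotlib and IPython are left out because the talk describes them as conveniences rather than requirements.

```python
# Sanity check after installation: import the core stack and report versions.
import numpy
import scipy
import sklearn

versions = {
    "numpy": numpy.__version__,
    "scipy": scipy.__version__,
    "scikit-learn": sklearn.__version__,
}
for name, version in versions.items():
    print(f"{name}: {version}")
```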
14:01
The design philosophy of scikit-learn is one of the greatest features of this package, in my opinion. It includes all the batteries necessary for general-purpose machine learning code: it supports features and functionalities for data and datasets, feature selection and feature extraction algorithms, machine learning algorithms in general in different settings, so classification, regression, clustering and stuff like that, and finally evaluation functions for cross-validation and confusion matrices. We will see some examples in the next slides. The algorithm selection philosophy for this package is to keep the core as light as possible and to include only the well-known and largely used machine learning algorithms. So the focus here is to be as general-purpose as possible, OK, in order to include a broad audience of users. At a
15:06
glance: this is a great picture depicting all the features provided by scikit-learn, and this figure has been gathered from the documentation. It is a sort of map you may follow to choose the particular machine learning technique you want to use in your machine learning code. There are some clusters in this picture: there is regression over there, classification, clustering and dimensionality reduction, and you may follow these paths to decide which is the setting best suited for your problem. OK. The API of scikit-learn is very intuitive and mostly consistent across every machine learning technique. There are four different objects: the estimator, the predictor, the transformer and the model. OK. These interfaces are implemented by almost all the learning algorithms included
16:20
in the library. For instance, let's make an example. The API for the estimator is one of the main interfaces. OK, an estimator is an object that fits a model based on some training data and is capable of inferring some properties on new data. For example, take the algorithm called KNN, the k-neighbors classifier: the KNN algorithm is a classifier, so it is for classification problems, and thus supervised learning, and it has the fit method.
16:56
But also, sorry, also for unsupervised learning algorithms such as k-means: KMeans is an estimator as well, and it implements the fit method too. For feature selection it is almost the same. OK, then the predictor:
17:14
the predictor provides the predict and the predict_proba
17:20
methods. And then the transformer: the transformer has the transform method, and sometimes there is also the fit_transform method, which applies the fit and then the transformation of the data. Transformers are used to make a transformation of the data, so that the data can be processed by the algorithms.
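The three interfaces described so far can be sketched on a tiny made-up dataset. The data values and the choice of `StandardScaler` as the transformer are assumptions for illustration only; the talk does not name a specific transformer here.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 6 samples, 2 features, 2 classes.
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 9.0], [8.5, 9.5], [9.0, 8.8]])
y = np.array([0, 0, 0, 1, 1, 1])

# Transformer: fit learns the scaling parameters, transform applies them;
# fit_transform does both in a single call.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Estimator: fit learns the model from the training data.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_scaled, y)

# Predictor: predict returns class labels, predict_proba class probabilities.
X_new = scaler.transform(np.array([[1.1, 2.1], [8.7, 9.1]]))
labels = knn.predict(X_new)
probabilities = knn.predict_proba(X_new)
print(labels)
print(probabilities)
```

Note that the new samples go through `transform` only, so they are scaled with the parameters learned from the training data.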
17:50
Finally, the last one is the model, the general model you might create in your machine learning algorithm. The model is for supervised and for unsupervised algorithms. Another great feature of scikit-learn is reflected in these points, because scikit-learn provides a great way to create a processing pipeline. In this case you may create a pipeline of different processing steps, just out of the box. You might apply SelectKBest, which is a feature selection step; then, after the feature selection, you might apply PCA, which is an algorithm for dimensionality reduction; and then you may apply a logistic regression, which is a classifier. OK, so you can assemble a processing pipeline very easily, and then you call the fit method on the pipeline, and then predict. The only constraint here is that the last step of the pipeline should be a class that implements the predict method, so a predictor. OK, so far so good? OK, great. So let's see some examples in action.
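The pipeline just described (feature selection, then PCA, then logistic regression) could look roughly like this. The Iris data, the step names and the parameter values (`k=3`, `n_components=2`, `max_iter=1000`) are assumptions chosen so the sketch runs on its own; they are not taken from the talk's slides.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

iris = load_iris()
X, y = iris.data, iris.target

# Each step is a (name, object) pair; the step names are arbitrary labels.
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=3)),  # feature selection
    ("reduce", PCA(n_components=2)),                     # dimensionality reduction
    ("classify", LogisticRegression(max_iter=1000)),     # final predictor
])

# fit runs fit_transform on every intermediate step, then fit on the last one.
pipeline.fit(X, y)

# predict pushes new data through the transformers, then calls predict
# on the final step.
predictions = pipeline.predict(X[:5])
print(predictions)
```

Because the pipeline itself exposes `fit` and `predict`, it can be used wherever a single predictor is expected.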
19:17
It is a very introductory example. The first thing to consider is the data representation. Actually, scikit-learn is based on NumPy and SciPy, so all the data are usually represented as matrices and vectors. In general, in machine learning, we have the X matrix, which is usually identified by a capital letter because it is a matrix, with N different rows and D different columns. In this case,
19:56
N is the number of samples we have in our dataset and D is the number of features, so the number of relevant pieces of information on the data we have. OK, so the training data comes in this flavor, and under the hood it is implemented by scipy.sparse matrices; usually, if I am not mistaken, the implementation should be CSR, compressed sparse row. OK. And finally we have the labels, because we know the values for each of these data points for the problem we have. The problem
20:40
we are going to consider is about the Iris dataset, and we want to design an algorithm that is able to automatically recognize Iris species. OK, so we have three different species: Iris versicolor on the left, Iris virginica here, and Iris setosa. The features we are going to consider are four: the length of the sepal, the width of the sepal, the length of the petal and the width of the petal. OK, so every sample in this dataset comes as a vector of four different features. OK. scikit-learn already has a great package to handle datasets; the Iris dataset in particular, because it is very well known in many fields, is already embedded in the scikit-learn library. So you only need to import the datasets package and call the load_iris function, and the iris object is a Bunch object that
21:52
contains different keys: it has the target names, the data, the target, a description of the dataset and the feature names. OK. The description is a wordy description of the dataset; the feature names are the four different features I already mentioned in the previous slide; the target names are the targets we expect on this dataset, in particular setosa, versicolor and virginica, the three different Iris species we want to predict. Then we have the data.
22:22
So iris.data comes as a NumPy matrix, and the shape of this matrix is 150, so 150 rows, times 4, which are the four different columns. The target's shape is 150, because we have a value of the target for each sample in the dataset. So N, the number of samples, in this case is 150; D, the number of features, in this case is 4; and the target here is
23:02
the expected result. OK, so we have a value that ranges from 0 to 2, corresponding to the three different classes we want to predict.
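The dataset inspection walked through above can be reproduced with a few lines. A minimal sketch: the exact set of keys in the returned `Bunch` object depends on the installed scikit-learn version, so it is printed rather than assumed.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# The Bunch behaves like a dictionary; its keys vary slightly across versions.
print(sorted(iris.keys()))

print(iris.data.shape)       # (150, 4): N samples times D features
print(iris.target.shape)     # (150,): one target value per sample
print(iris.feature_names)    # the four sepal/petal measurements
print(iris.target_names)     # the three species names

# The targets are integer class indices.
class_ids = sorted(set(iris.target.tolist()))
print(class_ids)             # [0, 1, 2]
```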
23:12
We might try to apply a classification algorithm on
23:15
this data. We want to exploit the KNN algorithm. The idea of the KNN classifier is pretty simple. For
23:24
example, if we consider K equal to 6, we are going to check the classes: this is the new data, we compare it with the training data, and if we want to predict the class of this new data, we look at the classes of the six nearest neighbors of this data, which in this case
23:50
should be the red one, because of the red dots. OK. The example is a
23:58
few lines of code: we import the dataset, we call the KNeighborsClassifier algorithm, in this case we select n_neighbors equal to 1, then we call the fit method and we train our model. That's it.
24:13
This is what we get, actually, if you want to plot the data; these are called the decision boundaries of the classifier. And if you want to know, for new data, which is the species of an iris that has 3 centimeters times 5 centimeters of sepal and 4 times 2 centimeters of petal, well, OK, right, let's check: iris.target_names of knn.predict, because KNN is a classifier, so you may fit the data and also predict after the training. And it says: OK, it's virginica. OK, right.
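The steps just described (load the data, build a 1-nearest-neighbor classifier, fit, then query one new flower) can be sketched as follows. The variable names are illustrative, and the predicted species is printed rather than stated, since it depends on the fitted model.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# n_neighbors=1: each query takes the class of its single closest sample.
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)

# New flower: sepal 3 cm x 5 cm, petal 4 cm x 2 cm, in the dataset's
# feature order (one row per sample, hence the nested list).
sample = [[3, 5, 4, 2]]
predicted_class = knn.predict(sample)

# Map the integer class index back to the species name.
species = iris.target_names[predicted_class][0]
print(species)
```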
24:56
Then, instead of facing this problem as a classification, we might also face it in an unsupervised setting, so as a clustering problem. In this case we are going to use the k-means algorithm. The idea of the k-means algorithm is pretty simple: we want to create clusters of objects in which each object is close to the center of its cluster, and
25:28
that's it. In scikit-learn this is very simple: we have KMeans, and we specify the number of clusters we want to have in the k-means, in this case three clusters, because we are going to predict three different Iris species. Then this is the ground truth, so this is the value we expect, and this is what we get after calling k-means. As you might have already noticed, the interface for the two algorithms is exactly the same, even if the machine learning settings are completely different: in the former case it was supervised, in the latter case it is unsupervised. OK, so classification versus clustering.
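A minimal sketch of the clustering variant, assuming the same Iris data. `n_init` and `random_state` are set explicitly only to make the run repeatable; they are not mentioned in the talk.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data

# Three clusters, one per expected species.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)

# Same fit call as the classifier, but no labels are passed: unsupervised.
kmeans.fit(X)
cluster_ids = kmeans.labels_

# Cluster ids are arbitrary: cluster 0 need not correspond to class 0,
# so compare groupings rather than raw numbers.
print(cluster_ids[:10])
print(iris.target[:10])   # ground truth, used only for comparison
```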
26:12
Finally, to conclude: another great battery included in scikit-learn, and I am not aware of many other machine learning libraries in Python that have such complete internal batteries, is model evaluation. Model evaluation is necessary to know whether our predictor, our prediction model, is good. So
26:40
we apply model validation techniques. We might simply try to verify that every prediction corresponds to the actual target. OK, but this is meaningless, because we are verifying the model on the same data we trained on. This kind of evaluation is very poor, because it is based only on the training data: we are just checking whether we are able to fit the data, but we are not able to test whether the final model is able to generalize. OK, because a key feature of this kind of technique is generalization: do not fit too much to the training data, because you will end up with a problem which is called overfitting. You need to generalize, to be able to predict even new data that is not identical to the training data. One technique usually used in machine learning is the so-called confusion matrix. scikit-learn, in the metrics package, provides different kinds of metrics to evaluate your performance. In this case we are going to use the confusion matrix. The confusion matrix is very simple: it is a square matrix where the number of rows and columns corresponds to the number of classes you want to predict, OK, and on the diagonal you have the classes that you expect matched with the classes that you predict correctly, so you have all the possible matchings. If you have all the data on the diagonal itself, you predicted all the classes perfectly. OK, is that clear? OK, right, thank you.
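The naive check described above, predicting on the very data used for training and summarising the result as a confusion matrix, can be sketched like this. It reuses the 1-nearest-neighbor classifier from the earlier example as an assumption; in scikit-learn's convention rows are the true classes and columns the predicted ones.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)

# Predicting on the training data itself: this only shows how well the model
# fits the data it has already seen, not how well it generalizes.
y_pred = knn.predict(X)

# Entry [i, j] counts samples of true class i predicted as class j,
# so correct predictions accumulate on the diagonal.
cm = confusion_matrix(y, y_pred)
print(cm)
```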
28:41
Another very well-known word, for those of you who are already aware of machine learning, is the cross-validation technique. Cross-validation is a model validation technique for assessing how the results of a statistical analysis generalize to independent datasets, not only to the set we used for training. OK, and scikit-learn already provides all the features to handle this kind of stuff, so scikit-learn requires us to write very little code, just the few lines necessary to import the functions already provided in the library; in other cases we would be required to implement this kind of function over and over, every time, in our Python code. OK, so this is very useful, even for lazy programmers like me. OK. In this case we exploit train_test_split: the idea of cross-validation here is to split the training data into different sets, the training set and the test set, so we fit on the training set and we predict on the test set. OK, in this case we see that there are some errors coming from this prediction. OK, this is a more robust way to evaluate our prediction model. OK, so, the last couple of things.
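The hold-out evaluation described above can be sketched as follows. In current scikit-learn releases `train_test_split` lives in `sklearn.model_selection`; at the time of the talk it was in `sklearn.cross_validation`. The `test_size` and `random_state` values are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Hold out a quarter of the samples; the model never sees them during fit.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# Evaluate only on the held-out samples.
y_pred = knn.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(cm)
```

Any off-diagonal counts in this matrix are errors on unseen data, which is the quantity the training-set check could not reveal.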
30:22
First: large scale, out of the box. Another great battery included here is the support for large-scale computation out of the box. You may combine scikit-learn code with every library you want to use for multiprocessing, parallel computation or distributed computation. But if you want to exploit the features already provided for this kind of stuff, many techniques in the library accept a parameter called n_jobs. If you set this to a value different from 1, which is the default, it performs the computation on the different CPUs you have in your machine. If you put -1 here, it is going to exploit all the CPUs you have on your single machine. This is available in different settings, for different kinds of applications in machine learning: you may apply multiprocessing to clustering (the k-means example we saw a few slides ago), to cross-validation, and to grid search. Grid search is another of the great features included in scikit-learn: it is able to identify the best parameters for a predictor, those that maximize the cross-validation score. So we want to get the parameters for our model that maximize the cross-validation score, so that it generalizes best. OK, just to give the intuition. This is made possible thanks to the joblib library, which scikit-learn uses in the background: under the hood, the n_jobs parameter here corresponds to a call to joblib. joblib is well documented as well, so you may read the documentation for any additional details.
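A minimal sketch of grid search with `n_jobs`, assuming the iris dataset, an `SVC` estimator and a small hypothetical parameter grid (none of these come from the slides). `GridSearchCV` is imported from `sklearn.model_selection`, its current location.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hypothetical grid: every combination is scored by 5-fold cross-validation.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# n_jobs=-1 asks joblib to use all the CPUs available on this machine.
search = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=-1)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validation score:", search.best_score_)
```

The same `n_jobs` argument is accepted by other utilities such as `cross_val_score`.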
32:37
Last but by no means least, scikit-learn meets other libraries, OK?
32:43
So, scikit-learn can be integrated with NLTK, that is, the Natural Language Toolkit, and with scikit-image, just to make a couple of examples. In detail: NLTK, by design, includes an additional module, nltk.classify.scikitlearn, which is actually a wrapper in the NLTK library that allows us to translate the API of NLTK into the API of scikit-learn and use them together, OK? So if you have code written with NLTK and you want to apply a classifier exploiting the scikit-learn library, you import the classifier from scikit-learn, then you use the SklearnClassifier class from the NLTK package, and you wrap the interface of this classifier to the one of scikit-learn. In this case we use the SVC, which stands for support vector classifier, OK? And then you may also include this kind of stuff in the pipeline processing of scikit-learn.
33:57
So, in conclusion, scikit-learn is not the only machine learning library available in Python, but it is powerful and, in my opinion, easy to use, with very efficient implementations provided. It is based on NumPy, SciPy and Cython under the hood, and it is highly integrated, for example with NLTK or scikit-image, just as an example.
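The NLTK integration can be sketched as below, assuming NLTK is installed. `SklearnClassifier` is the wrapper class in `nltk.classify.scikitlearn`; the tiny feature dictionaries and the `pos`/`neg` labels are invented for illustration, and the linear kernel is an arbitrary choice.

```python
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import SVC

# NLTK-style training data: (feature dictionary, label) pairs.
# These toy word counts are hypothetical.
train_set = [
    ({"good": 2, "bad": 0}, "pos"),
    ({"good": 1, "bad": 0}, "pos"),
    ({"good": 0, "bad": 1}, "neg"),
    ({"good": 0, "bad": 2}, "neg"),
]

# Wrap a scikit-learn estimator so it exposes the NLTK classifier API.
classifier = SklearnClassifier(SVC(kernel="linear"))
classifier.train(train_set)

label = classifier.classify({"good": 1, "bad": 0})
print(label)
```

The wrapper converts the feature dictionaries to a matrix internally, so the scikit-learn estimator never has to know about the NLTK data format.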
34:18
So I really hope that you are looking forward to using it, and thanks for the
34:25
kind attention. Thank you. Thank you very much. We have a few minutes left for your
34:35
questions. Please raise your hand and we will bring the microphone over.

First of all, thanks for the talk. A short question: does scikit-learn provide any online learning methods?

Yeah, actually this is a point I wasn't able to include in the slides. Online learning is already provided: there are many classifiers or techniques that offer a method called partial_fit, OK? So you have this method to provide the model with a batch of data, one at a time. The interface has been extended with a partial_fit method, so some techniques allow for online learning. Another great usage of partial_fit is the so-called out-of-core learning. In the out-of-core learning setting your data are too big to fit in memory, so you provide the data one batch at a time. You call the partial_fit method to fit the model one batch at a time, OK?

Thanks. Second question: is there any support for missing values or missing labels, apart from just deleting them?

In the case of online learning, or in general for machine learning? Missing labels or missing data, what do you mean?

So, like, you get a feature vector that misses a value at the third component.

[inaudible] Yes, thank you. So, we have a very simple imputer that can impute by median or mean in the different directions. If you have very few missing values, that works well. If you have a lot, then you might want to look at matrix completion, which we do not have. We had a Google Summer of Code project on this, but it didn't finish. We welcome contributions, of course.

Hi, thanks for the talk. I have some experience with scikit-learn and scikit-image, but I am not a mathematician, and I have no idea about all that stuff under the hood. To use it you have to go too deep inside the world of algorithms, statistics and mathematics, and this is the biggest problem for me: to realize what to use. So if you have some kind of big dataset with features and labels, supervised learning, what would you advise to someone who doesn't know how it works? Which steps, which small, easy solutions should I consider to improve the results of a specific classification?

Actually, machine learning is about finding the right model with the right parameters, so there are many steps you may want to apply in training the different algorithms. In general you apply a data normalization step. So the first step I suggest is preprocessing of the data: analyze the data, make some statistical tests on the data, some preprocessing, some visualization of your data, in order to know what kind of data you are dealing with. That is the first step. The second one is trying the simplest model you want to apply, and then improving one step at a time, OK? Once you find the right model you want to use, you are required to find the best settings for that model. In that case you might end up using the grid search method, which is a method provided out of the box, just to find the best combination of parameters that maximizes the cross-validation score. And of course [inaudible], right? So you may find the right model for your predictions, or you may find it works badly, and then you start over again with different models. [inaudible] But yes.

Thanks again. [name unclear] is here, and he is going to give a talk at PyData as well, I think on Saturday, isn't it? Yes, on Saturday. So if you attend PyData, don't miss that talk. And, well, thanks again.
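Two points from the Q&A, mean or median imputation and fitting in batches with `partial_fit`, can be sketched as follows. The synthetic data, the batch size and the choice of `SGDClassifier` are assumptions made for illustration. `SimpleImputer` is the current scikit-learn class for this; at the time of the talk the equivalent was `sklearn.preprocessing.Imputer`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import SGDClassifier

# --- Missing values: fill each column with its median ---
X_missing = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [5.0, np.nan],
    [7.0, 8.0],
])
imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X_missing)
print(X_filled)

# --- Online / out-of-core learning: one batch at a time ---
# Synthetic data standing in for a dataset too large for memory.
rng = np.random.RandomState(0)
X = rng.randn(1000, 5)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # all labels must be declared up front

batch_size = 100
for start in range(0, len(X), batch_size):
    batch = slice(start, start + batch_size)
    # Each call updates the model with only this batch in memory.
    clf.partial_fit(X[batch], y[batch], classes=classes)

accuracy = clf.score(X, y)
print("accuracy after one pass over the batches:", accuracy)
```

In a real out-of-core setting the batches would be read from disk or a stream instead of sliced from an in-memory array.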