CMU Machine Learning Project

http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-3/www/ml-projects.html
Following are potential course projects for 15-781, Machine Learning. If you would like to pursue one of these ideas, contact the corresponding faculty by email. You may also propose your own project, subject to approval by the instructor.
In either case, please turn in a one-page project proposal to Tom.Mitchell@cmu.edu by the beginning of class on Oct 29. Your proposal should explain (1) the problem you will look at, (2) the approach and algorithm(s) you will use, and (3) how you will evaluate the results.
============================================================================
TITLE: Text Classification with Bayesian Methods
PROJECT SUPERVISOR: Andrew McCallum (andrew.mccallum@cs.cmu.edu)
DESCRIPTION: Given the growing volume of online text, automatic document classification is of great practical value, and an increasingly important area for research. Naive Bayes has been applied to this problem with considerable success; however, naive Bayes makes many assumptions about data distribution that are clearly not true of real-world text. This project will aim to improve upon naive Bayes by selectively removing some of these assumptions. I imagine beginning the project by removing the assumption that document length is independent of class, thus designing a new version of naive Bayes that uses document length in order to help classify more accurately. If we finish this, we'll move on to other assumptions, such as the word independence assumption, and experiment with methods that capture some dependencies between words. The paper at http://www.cs.cmu.edu/~mccallum/papers/multinomial-aaai98w.ps is a good place to start some reading. You should be highly proficient in C programming, since you'll be modifying rainbow (http://www.cs.cmu.edu/~mccallum/bow/rainbow).
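To make the document-length idea concrete, here is a minimal Python sketch of a multinomial naive Bayes classifier with an added class-conditional length term. It is only an illustration and is not rainbow code (the project itself would be done in C inside rainbow). The Poisson length model, the Laplace smoothing, and all function names are assumptions chosen for brevity; the project may well settle on a different length model.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs: list of (tokens, label) pairs. Returns the fitted counts."""
    class_docs = Counter()              # number of documents per class
    word_counts = defaultdict(Counter)  # word counts per class
    total_len = Counter()               # total number of tokens per class
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        total_len[label] += len(tokens)
        vocab.update(tokens)
    return {
        "class_docs": class_docs,
        "word_counts": word_counts,
        "total_len": total_len,
        "vocab": vocab,
        "n_docs": len(docs),
    }


def log_poisson(k, mean):
    """Log of the Poisson pmf; math.lgamma(k + 1) equals log(k!)."""
    return k * math.log(mean) - mean - math.lgamma(k + 1)


def classify(model, tokens, use_length=True):
    """Return the class with the highest log posterior score."""
    v = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_c in model["class_docs"].items():
        score = math.log(n_c / model["n_docs"])           # log P(class)
        if use_length:
            # Assumed length model: Poisson with the class's mean length.
            mean_len = max(model["total_len"][label] / n_c, 1e-9)
            score += log_poisson(len(tokens), mean_len)   # log P(length | class)
        denom = model["total_len"][label] + v             # Laplace smoothing
        for w in tokens:
            score += math.log((model["word_counts"][label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Calling classify with use_length=False gives the plain multinomial baseline, so the two variants can be compared on the same held-out documents.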
============================================================================
TITLE: Support vector machines for face recognition
PROJECT SUPERVISOR: Rahul Sukthankar (rahuls@cs), Adjunct Faculty, Robotics Institute
DESCRIPTION: Face recognition is a learning problem that has recently received a lot of attention. One standard approach involves reducing the dimensionality of the problem using PCA and then selecting the nearest class (eigenfaces). Support Vector Machines (SVM) are becoming very popular in the machine learning community as a technique for tackling high-dimensional problems. No one has yet (to my knowledge) applied SVMs to face recognition. Can SVMs outperform standard face recognition algorithms?
Issues that the student should address:
- How best to apply SVM to the n-class problem of face recognition;
- Figure out training and/or image preprocessing strategies (wavelets?);
- Evaluate how SVMs compare to other techniques (see notes)
Notes:
- A good implementation of SVMs is available (Thorsten's SVMlight);
- We can give the student access to two datasets used widely in the community, ORL and FERET, for training & testing;
- We have results for eigenfaces, fisherfaces and JPRC's face recognition system on these datasets, as well as implementations, so comparing SVM to other algorithms will be straightforward.
- I can recommend tutorials and papers on SVMs to supplement what was covered in class, if needed.
============================================================================
TITLE: Predictive Exponential Models
PROJECT SUPERVISOR: Roni Rosenfeld
DESCRIPTION: A new predictive model recently introduced by Chen & Rosenfeld can incorporate arbitrary features using exponential distributions and sampling (see www.cs.cmu.edu/~roni/wsme.ps). Although the model was originally developed for language modeling, it can be used for prediction or classification in any domain. In this project you will be expected to read and understand this paper, and to apply the model to a Machine Learning problem of your choice. For example, you could choose one of the ML problem cases used in the course, and try to improve on the existing "baseline" solution.
COMMENT: This project is open to more than one student. Each student could work on their own ML problem, or (subject to Tom's approval) we can choose a larger problem for joint work.
============================================================================
TITLE: Natural Language Feature Selection for Exponential Models
PROJECT SUPERVISOR: Roni Rosenfeld
DESCRIPTION: A new predictive model recently introduced by Chen & Rosenfeld can incorporate arbitrary features using exponential distributions and sampling (see www.cs.cmu.edu/~roni/wsme.ps). The model was originally developed for modeling of natural language, and has highlighted feature selection as the main challenge in that domain. In this project you will be expected to read and understand this paper. Then, you will be given two corpora. The first one consists of transcribed over-the-phone conversations. The second corpus is artificial, and was generated from the best existing language model (which was trained on the first corpus). Your job is to use machine learning and statistical methods of your choice (and other methods if you wish) to find systematic differences between the two corpora. These differences translate directly into new features, which will be added to the model in an attempt to improve on it. (An improvement in language modeling can increase the quality of language technologies such as speech recognition, machine translation, text classification, spellchecking, etc.)
COMMENT: This project is open to several students, who would be working separately.
============================================================================
Learning of strategies for energy trading
Prof. Sarosh Talukdar, ECE (talukdar@cmu.edu)
In my design course, 39-405, students write software agents for energy trading. Each agent is given an energy quota and money to spend to fill this quota, for each of about 50 consecutive periods. There are penalties for not filling the quota, including death (elimination) if the quotas are not filled in 5 consecutive periods. Agents obtain energy through a double auction. They submit bids and the highest bids win. After each round, all the bids are made public, so each agent knows what its competitors did in the past. The purpose is to spend as little money as possible and still meet one's quotas, that is, to anticipate what the other agents will bid in the next period, and then bid just enough to get the energy one needs.
The students in 39-405 know little or nothing about automatic learning. They rely strictly on their intuition to devise their bidding algorithms. It would be interesting to see if some of your students could use automatic learning techniques to build winning agents. (We provide an auction simulator with a very easy to use interface; it is simple to connect agents in a variety of languages to the simulator.) If necessary, we can have practice sessions to give agents that can learn enough data to do their learning.
Sarosh
Sarosh Talukdar
ECE Dept., Carnegie Mellon University
Pittsburgh, PA 15213
Tel: 412 268 8778, fax: 412 268 2860
e-mail: talukdar@cmu.edu
www.ece.cmu.edu/~talukdar/
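The bidding setting described above (public bid histories, highest bids win) can be illustrated with a small Python sketch. This does not describe the 39-405 simulator or its interface: the fixed supply, the one-unit-per-agent bids, the seller side being left out, and the names clear_round and next_bid are all simplifying assumptions. The baseline agent shown is a hand-written heuristic of the kind a learned strategy would be compared against.

```python
def clear_round(bids, supply):
    """Resolve one round. bids maps agent name -> price offered for one unit.

    The highest bids win. Returns (winners, clearing_price), where
    clearing_price is the lowest winning bid, or None if nothing was sold.
    """
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winning = ranked[:supply]
    winners = [agent for agent, _ in winning]
    clearing = winning[-1][1] if winning else None
    return winners, clearing


def next_bid(history, supply, margin=0.01, default=1.0):
    """Bid slightly above the average clearing price of the last three rounds.

    history is the list of past rounds, each a dict agent -> bid; this is
    the information made public after every round.
    """
    recent = [clear_round(past, supply)[1] for past in history[-3:]]
    recent = [price for price in recent if price is not None]
    if not recent:
        return default  # no information yet, fall back to an assumed default
    return (sum(recent) / len(recent)) * (1 + margin)
```

A learning agent would replace next_bid with a model that predicts the next clearing price from the full bid history, for instance a regression over each competitor's recent bids.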
============================================================================
TITLE: Learning from labeled and unlabeled data
PROJECT SUPERVISOR: Prof. Tom Mitchell, tom.mitchell@cs.cmu.edu
DESCRIPTION: The recent paper by Blum & Mitchell on co-training proposes an algorithm for learning from unlabeled as well as labeled data in certain problem settings (see www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb/colt98_final.ps). In this project you will be expected to read and understand this paper, and to extend the experimental results in this paper. In particular, I have some ideas for creating synthetic data sets that test the robustness of the algorithm to changes in the problem setting discussed in the paper.
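For readers who have not yet seen the paper, the general shape of co-training can be sketched in a few lines of Python. This is a deliberately simplified variant and not the algorithm as specified in the paper: each example is assumed to have two numeric views, each view uses a toy nearest-mean classifier, and each classifier moves only its single most confident unlabeled example per round. The paper's procedure differs in details such as how many examples are added and how the unlabeled pool is sampled. All names here are hypothetical.

```python
def fit_means(examples, view):
    """Nearest-mean classifier on one view: maps label -> mean feature value."""
    sums, counts = {}, {}
    for views, label in examples:
        sums[label] = sums.get(label, 0.0) + views[view]
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(means, x):
    """Return (label, confidence); confidence is the gap between the
    distances to the two nearest class means."""
    dists = sorted((abs(x - m), label) for label, m in means.items())
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else 0.0
    return dists[0][1], margin


def co_train(labeled, unlabeled, rounds=10):
    """labeled: list of ((view0, view1), label); unlabeled: list of (view0, view1).

    In each round, the classifier for each view labels the pool example it is
    most confident about, and that example joins the shared labeled set.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):
            if not pool:
                return labeled
            means = fit_means(labeled, view)
            best = max(pool, key=lambda views: predict(means, views[view])[1])
            label, _ = predict(means, best[view])
            pool.remove(best)
            labeled.append((best, label))
    return labeled
```

Synthetic data sets for robustness tests could be generated by controlling how strongly each view predicts the label, then checking how the labels assigned by co_train degrade as one view becomes noisier.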
============================================================================
TITLE: Similarity matching in high-dimensional space on discrete data
PROJECT SUPERVISOR: Prof. Latanya Sweeney, Heinz School, latanya.sweeney@cmu.edu
DESCRIPTION: Given a database with hundreds of attributes (or fields) and thousands of tuples (or records), finding similar tuples (records) is very difficult and we do not have any efficient algorithms to accomplish this task. I have some ideas for new algorithms that may prove to be effective. In this project, you will implement these algorithms and explore variants to determine their effectiveness.
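As a point of reference for the task above, a brute-force baseline is easy to state in Python. This sketch is not one of the new algorithms the project refers to, which are not described here. It assumes that similarity means the number of attributes on which two tuples agree exactly, and the function names are hypothetical. Its cost grows with the number of records times the number of attributes per query, which is the inefficiency a better algorithm would need to beat.

```python
def matches(a, b):
    """Number of attribute positions where two tuples hold the same value."""
    return sum(1 for x, y in zip(a, b) if x == y)


def most_similar(records, query_index, k=1):
    """Indices of the k records that agree with the query on the most attributes.

    Ties are broken by the lower record index so results are deterministic.
    """
    query = records[query_index]
    scored = [
        (matches(query, rec), i)
        for i, rec in enumerate(records)
        if i != query_index
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]
```

For example, with records [("red", "S", "cotton"), ("red", "M", "cotton"), ("blue", "L", "wool"), ("red", "S", "wool")], the call most_similar(records, 0, k=2) returns [1, 3], since records 1 and 3 each agree with record 0 on two attributes.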
============================================================================
TITLE: Using a repository of old text to answer new questions
PROJECT SUPERVISOR: Latanya Sweeney
DESCRIPTION: Consider a repository of email messages in which discussions center around living with a disease, such as celiac, heart disease or diabetes. Frequently new people become diagnosed and join the list, resulting in a good number of questions being asked repeatedly. Unfortunately, messages do not adhere to a restricted vocabulary and so traditional web-based keyword searching is often ineffective. In this project, you will use and evaluate algorithms to generate responses to new email messages based on the repository of old email messages. You can begin with a Bayesian text classifier [as discussed in class: Lewis, 1991; Lang, 1995; Joachims, 1996] and a semantic generalization algorithm I have constructed and, based on your analysis, explore interesting variants to determine the effectiveness of this new approach.
============================================================================
Datamining of consumer purchase data
Professor Kannan Srinivasan, GSIA, kannans@smtp2.andrew.cmu.edu
Tom: The easiest thing would be to ask anyone who is interested in looking at consumer purchase data (frequently bought or frequently transacted) to see me. I can explain to them my research work in this area. I have several ideas and expect that at least some of them would be interested in that. Cheers. Kannan
PS If they want to see me as a group, that would be fine too.