Jean-Philippe Fauconnier

I am a Natural Language Processing and Machine Learning Researcher at Apple.

Previously, I obtained my PhD in Computer Science at the Université Paul Sabatier (Toulouse, France) and completed my Master's degree in Natural Language Processing at the Catholic University of Louvain (Belgium).

About

I obtained my PhD in Computer Science at the Toulouse Institute of Computer Science Research, Université Paul Sabatier (France). My work was supervised by Drs Mouna Kamel and Nathalie Aussenac-Gilles and focused on Natural Language Processing, Document Analysis and Machine Learning. More specifically, I was interested in acquiring lexical relations using the layout and formatting of documents. For this work, I received the ATALA Best PhD Thesis Award.

Previously, I graduated in Natural Language Processing at the Catholic University of Louvain (Belgium) in 2012. This Master's degree programme was delivered by the CENTAL laboratory.

Recent (2024)

Research Interests

Natural Language Processing

Document Analysis

Statistics

Works

National journal papers

International conference papers

National conference papers

PhD Thesis

Talks

Teaching

2015-2016

2014-2015

2013-2014

2012-2013

Software

Most resources are hosted on my GitHub account. The fastest way to download a given resource is to use git:

mkdir resource
cd resource
git clone https://github.com/fauconnier/resource

Personal software

AMI

AMI (Another Maxent Implementation) is an R implementation of multinomial logistic regression, also known as the Maximum Entropy classifier. This implementation handles binary and real-valued features and uses standard R functions to optimize the objective. As a result, several iterative methods can be used: LM-BFGS, Conjugate Gradient, Gradient Descent and Generalized Iterative Scaling.
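To make the model concrete, here is a minimal sketch of multinomial logistic regression trained with plain batch gradient descent. It is written in Python for illustration only and is not AMI's R code: the function names (`softmax`, `train_maxent`, `predict`), the learning rate and the number of epochs are all assumptions chosen for this example.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_maxent(X, y, n_classes, lr=0.5, epochs=200):
    """Fit one weight vector per class by batch gradient descent.

    X is a list of feature vectors (binary or real-valued),
    y is a list of integer class labels in [0, n_classes).
    lr and epochs are illustrative defaults, not tuned values.
    """
    n_features = len(X[0])
    W = [[0.0] * n_features for _ in range(n_classes)]
    for _ in range(epochs):
        grad = [[0.0] * n_features for _ in range(n_classes)]
        for x, label in zip(X, y):
            scores = [sum(w * v for w, v in zip(W[k], x)) for k in range(n_classes)]
            probs = softmax(scores)
            for k in range(n_classes):
                # Gradient of the negative log-likelihood: (predicted - observed) * feature
                err = probs[k] - (1.0 if k == label else 0.0)
                for j, v in enumerate(x):
                    grad[k][j] += err * v
        for k in range(n_classes):
            for j in range(n_features):
                W[k][j] -= lr * grad[k][j] / len(X)
    return W

def predict(W, x):
    """Return the index of the highest-scoring class for feature vector x."""
    scores = [sum(w * v for w, v in zip(row, x)) for row in W]
    return max(range(len(W)), key=lambda k: scores[k])
```

Swapping the update rule for a quasi-Newton or conjugate-gradient step changes only how the same gradient is used, which is why a generic optimizer can offer several methods for the same objective.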

LARAt

LARAt (Layout Annotation for Relation Acquisition tool), pronounced /laʁa/, is an annotation tool which supports the layout and the formatting of HTML documents. LARAt was used during an annotation campaign in 2013 and, in its current state, is dedicated to the annotation of enumerative structures. The typology implemented is the one described in the TIA 2013 paper.

LaToe

LaToe (Layout Annotation for Textual Object Extraction) is a tool which extracts the text layout from HTML, MediaWiki, or PDF documents in order to identify specific textual objects (such as enumerative structures). Currently, the CRF model used for the PDF analyzer is trained on a small corpus (LING_GEOP). As a consequence, LaToe may not perform well on unseen PDF documents with specific formatting.

Source code reviews

code_review_tsuruoka

Code review of a C++ library for maximum entropy classification. On his website, Tsuruoka provides a fast implementation of multinomial logistic regression. To gain a deeper understanding of the implementation details, I wrote a simple code review. The code base is relatively small (around 2500 lines of code). These notes are primarily intended for my personal use and reflect my current understanding. I share them here in case they help someone. Note that this document is currently a work in progress.

Open source contributions

Some open source contributions:

Data

French word embeddings models

I provide here some pre-trained word2vec models for French. They are stored in the original binary format of word2vec v0.1c. Depending on your needs, you may want to convert those models. One simple way to convert them to text is:

git clone https://github.com/marekrei/convertvec
cd convertvec/
make
./convertvec bin2txt frWiki_no_phrase_no_postag_700_cbow_cut100.bin output.txt
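The same conversion can be sketched in pure Python. This assumes the layout written by the original word2vec tool: a text header line with the vocabulary size and the dimension, then for each word its bytes followed by a space, the vector as 4-byte floats, and a newline. The little-endian byte order and the helper names `read_word` and `bin_to_txt` are assumptions made for this sketch, not part of convertvec.

```python
import struct

def read_word(f):
    """Read one space-terminated token, skipping the newline left after the previous vector."""
    chars = bytearray()
    while True:
        ch = f.read(1)
        if ch in (b" ", b""):
            break
        if ch != b"\n":
            chars += ch
    return chars.decode("utf-8", errors="ignore")

def bin_to_txt(src, dst):
    """Stream a word2vec binary file to a text file, one word and its vector per line."""
    with open(src, "rb") as fin, open(dst, "w", encoding="utf-8") as fout:
        vocab_size, dim = (int(x) for x in fin.readline().split())
        fout.write("%d %d\n" % (vocab_size, dim))
        for _ in range(vocab_size):
            word = read_word(fin)
            # Assumption: vectors are stored as little-endian 32-bit floats
            vector = struct.unpack("<%df" % dim, fin.read(4 * dim))
            fout.write(word + " " + " ".join("%.6f" % v for v in vector) + "\n")
    return vocab_size, dim
```

For instance, `bin_to_txt("frWiki_no_phrase_no_postag_700_cbow_cut100.bin", "output.txt")` mirrors the command above. The file is processed word by word, so the whole model never has to fit in memory.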

Alternatively, you can load a binary model directly into a few Python libraries. Below I give a minimal usage example with Gensim:

pip install gensim
python
>>> from gensim.models import KeyedVectors
>>> model = KeyedVectors.load_word2vec_format("frWac_postag_no_phrase_700_skip_cut50.bin", binary=True, unicode_errors="ignore")
>>> model.most_similar("intéressant_a")
[('très_adv'        , 0.5967904925346375),
('intéresser_v'     , 0.5439727902412415),
('peu_adv'          , 0.5426771640777588),
('assez_adv'        , 0.5398581027984619),
('certainement_adv' , 0.5246292352676392),
('plutôt_adv'       , 0.5234975814819336),
('instructif_a'     , 0.5230029225349426),
('trouver_v'        , 0.5131329894065857),
('aussi_adv'        , 0.505642294883728),
('beaucoup_adv'     , 0.5034803152084351)]

For this model, we can see that the adjective 'intéressant' shares many contexts with adverbs. Note that the color code and the layout are mine. Please check (Mikolov et al., 2013) to gain insight into the model hyper-parameters.
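The scores returned above are cosine similarities between word vectors. As a minimal sketch of that ranking, assuming the vectors are already held in a plain dictionary (the `vectors` mapping and both function names are hypothetical, and this is not Gensim's implementation):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(vectors, query, topn=10):
    """Rank every other word by cosine similarity to the query word."""
    q = vectors[query]
    scored = [(w, cosine(q, v)) for w, v in vectors.items() if w != query]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]
```

This brute-force loop is only meant to show what is being computed; it would be slow on a vocabulary of realistic size.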

Thanks to Tim V. C., Adam B., Claude C., Sascha R., Philipp D., Nirina R., Ian W. and Antoine V. who all helped in retrieving some of the original models.

frWac2Vec

FrWac corpus, 1.6 billion words.

lem pos phrase train dim cutoff md5
bin (2.7Gb) - - - cbow 200 0 7e49
bin (120Mb) - - - cbow 200 100 5b5f
bin (120Mb) - - - skip 200 100 6b86
bin (298Mb) - - - skip 500 100 af38
bin (202Mb) - - - skip 500 200 e2c6
bin (229Mb) - - cbow 500 100 1c85
bin (229Mb) - - skip 500 100 54fc
bin (494Mb) - - skip 700 50 e235
bin (577Mb) - skip 700 50 0695
bin (520Mb) - skip 1000 100 8d09
bin (2Gb) - cbow 500 10 14da
bin (289Mb) - cbow 500 100 f500

frWiki2Vec

FrWiki dump (raw file), 600 million words.

lem pos phrase train dim cutoff md5
bin (253Mb) - - - cbow 1000 100 087c
bin (195Mb) - - - cbow 1000 200 0a19
bin (253Mb) - - - skip 1000 100 7d5c
bin (195Mb) - - - skip 1000 200 48a0
bin (128Mb) - - cbow 500 10 052f
bin (106Mb) - - cbow 700 100 8ff0
bin (151Mb) - - skip 1000 100 5ac9
bin (121Mb) - - skip 1000 200 bc16
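
After downloading a model, its checksum can be compared with the md5 column of the tables above. The sketch below assumes that the four characters listed are the leading characters of the file's MD5 hex digest, which the tables do not state explicitly. The function names are hypothetical.

```python
import hashlib

def md5_hex(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_listed_md5(path, listed):
    """Check a file against a (possibly truncated) listed digest.

    Assumption: `listed` is a prefix of the full hex digest.
    """
    return md5_hex(path).startswith(listed.lower())
```

A matching four-character prefix is only a weak check: it catches truncated or corrupted downloads but does not authenticate the file.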

How to cite these models?

Under the CC-BY 3.0 licence, and provided that attribution is given, you are free to copy, distribute, remix and tweak these models for any purpose. Attribution must be made by quoting my name with a link to this page, or by using the bibtex entry below. These models were trained during my PhD thesis and are in no way linked to my current or any future activities. Note also that they are shared without any guarantee or support.

@misc{fauconnier_2015,
	author = {Fauconnier, Jean-Philippe},
	title = {French Word Embeddings},
	url = {http://fauconnier.github.io},
	year = {2015}}

Below, public projects and papers using those models:


Annotated corpora

Annotated corpora built during my PhD Thesis:

Experience

Laboratory life

Review

Jobs & internships

Links