Re: [code-switching] Code-switching corpora

  • Gustav Eje Henter
    Message 1 of 4 , Apr 5 4:06 AM
      Hello Anna,

      This was a very interesting link indeed. I am right now experimenting with the
      BilingBank Blum Snow corpus, and have obtained an interesting Markov chain model
      which regularly mixes English and Hebrew.

      Thanks to everyone who has shared their advice and/or data so far!

      Best regards,
      Gustav


      PS. Just for fun, here is a (lightly edited) random sample from a
      character-level 5-gram Markov chain based on the Blum Snow data:

      eha'avir et ze anashim your fork. al ha# matay maspik. bananot ve# pit'om here's
      a twenty-five about he's probably where write lots of them tembelit. lo ta'avir
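
For readers unfamiliar with the technique, a character-level 5-gram Markov chain can be sketched as below. This is a minimal illustration and not the code used to produce the sample above. It assumes that "5-gram" means each character is drawn conditioned on the previous four, and the function names (train_char_ngram, sample_text) and the toy training string are hypothetical.

```python
import random
from collections import Counter, defaultdict


def train_char_ngram(text, n=5):
    """Count next-character frequencies for every (n-1)-character context."""
    counts = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        context = text[i:i + n - 1]
        next_char = text[i + n - 1]
        counts[context][next_char] += 1
    return counts


def sample_text(counts, length, seed=None):
    """Generate `length` characters by repeatedly sampling the next character."""
    if not counts:
        raise ValueError("training text is shorter than n characters")
    rng = random.Random(seed)
    contexts = sorted(counts)
    context = rng.choice(contexts)
    out = list(context)
    while len(out) < length:
        options = counts.get(context)
        if options is None:
            # Dead end (context only seen at the very end of the corpus):
            # restart from a randomly chosen known context.
            context = rng.choice(contexts)
            out.extend(context)
            continue
        chars = sorted(options)
        weights = [options[c] for c in chars]
        next_char = rng.choices(chars, weights=weights, k=1)[0]
        out.append(next_char)
        # Slide the window: drop the oldest character, append the new one.
        context = context[1:] + next_char
    return "".join(out[:length])


if __name__ == "__main__":
    # Hypothetical toy corpus; a real experiment would read a transcript file.
    toy = "where is your fork. ani lo yodea. here's your fork. lo ta'avir et ze. " * 20
    model = train_char_ngram(toy, n=5)
    print(sample_text(model, 120, seed=1))
```

With only four characters of context, the model has no notion of which language it is in, so on mixed-language training text it can drift from one language to the other wherever the two share a character sequence.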

      On 2012-04-04 18:14, Anna S wrote:
      > Dear Gustav,
      >
      > I recommend that you have a look at www.talkbank.org, especially http://talkbank.org/data/BilingBank/
      >
      > On that website you will find monolingual and bilingual spoken data, sometimes even with the original audio files available.
      >
      > I used the Eppler corpus for a study on the grammatical side of code-switching; it contains German-English code-switching. On the website you will also find other language pairs, some even with more than two languages being switched.
      >
      > Good luck!
      >
      > Best regards from Bremen, Germany,
      >
      > Anna (M.A. linguistics)
      >
      > To: code-switching@yahoogroups.com
      > From: ghe@...
      > Date: Tue, 3 Apr 2012 18:34:48 +0200
      > Subject: [code-switching] Code-switching corpora
      >
      > Dear code-switching experts,
      >
      > My name is Gustav Henter, and I am a Ph.D. student in signal processing and
      > machine learning at KTH - The Royal Institute of Technology in Stockholm,
      > Sweden. Currently, I am looking for a large body of roman-alphabet text (>2
      > million characters, say) with significant code-switching. I intend to use the
      > data to build a character-level Markov chain model for some experiments with a
      > computer algorithm I am researching.
      >
      > Do you know where such a corpus or corpora can be obtained in digital form?
      >
      > I have already experimented with Vivian de Klerk's Xhosa English corpus (thanks,
      > Vivian!), but it had too low a rate of code-switching---about 1 switch for every
      > 500 words---to show any significant difference in my experiments.
      >
      > Best regards,
      >
      > Gustav Henter
      >
      > ======================================================================
      > Gustav Eje Henter, Ph.D. student              E-mail: gustav.henter@...
      > Sound and Image Processing Lab, EES,          Web: http://www.ee.kth.se/sip/
      > KTH - Royal Institute of Technology           Phone: (+46) 8 790 7420
      > Osquldas väg 10, SE-100 44 Stockholm, SWEDEN  Office: A:327, floor 3
      > ======================================================================