
Re: Code-switching corpora

  • leonikotze
    Message 1 of 4 , Apr 4 7:01 AM
      Dear Gustav,

      I suppose literature would be your best bet. I have a play by Athol Fugard entitled 'Boesman and Lena'. It is FULL of code-switching between English and Afrikaans. I think the play might be ideal for you, as the book in which it appears has a glossary that explains the meaning of the Afrikaans terms. I will scan the play for you if you wish and send it to your private email address. Kindly let me know whether you would like me to do this.

      Kindest Regards
      Leoni Kotze, South Africa

      --- In code-switching@yahoogroups.com, Gustav Eje Henter <ghe@...> wrote:
      >
      > Dear code-switching experts,
      >
      > My name is Gustav Henter, and I am a Ph.D. student in signal processing and
      > machine learning at KTH - The Royal Institute of Technology in Stockholm,
      > Sweden. Currently, I am looking for a large body of roman-alphabet text (>2
      > million characters, say) with significant code-switching. I intend to use the
      > data to build a character-level Markov chain model for some experiments with a
      > computer algorithm I am researching.
      >
      > Do you know where such a corpus or corpora can be obtained in digital form?
      >
      > I have already experimented with Vivian de Klerk's Xhosa English corpus (thanks,
      > Vivian!), but it had too low a rate of code-switching---about 1 switch for every
      > 500 words---to show any significant difference in my experiments.
      >
      > Best regards,
      > Gustav Henter
      >
      > ======================================================================
      > Gustav Eje Henter, Ph.D. student E-mail: gustav.henter@...
      > Sound and Image Processing Lab, EES, Web: http://www.ee.kth.se/sip/
      > KTH - Royal Institute of Technology Phone: (+46) 8 790 7420
      > Osquldas väg 10, SE-100 44 Stockholm, SWEDEN Office: A:327, floor 3
      > ======================================================================
      >
    • Anna S
      Message 2 of 4 , Apr 4 9:14 AM
        Dear Gustav,

        I recommend that you have a look at www.talkbank.org, especially http://talkbank.org/data/BilingBank/

        On that website you will find monolingual and bilingual spoken data, sometimes even with the original audio files available.

        I used the Eppler corpus for a study on the grammatical side of code-switching; it contains German-English code-switching. On the website you will also find other language pairs, some even with switching between more than two languages.

        Good luck!

        Best regards from Bremen, Germany,

        Anna (M.A. linguistics)



        To: code-switching@yahoogroups.com
        From: ghe@...
        Date: Tue, 3 Apr 2012 18:34:48 +0200
        Subject: [code-switching] Code-switching corpora
      • Gustav Eje Henter
        Message 3 of 4 , Apr 5 4:06 AM
          Hello Anna,

          This was a very interesting link indeed. I am currently experimenting with the
          BilingBank Blum Snow corpus, and have obtained an interesting Markov chain model
          which regularly mixes English and Hebrew.

          Thanks to everyone who has shared their advice and/or data so far!

          Best regards,
          Gustav


          PS. Just for fun, here is a (lightly edited) random sample from a
          character-level 5-gram Markov chain based on the Blum Snow data:

          eha'avir et ze anashim your fork. al ha# matay maspik. bananot ve# pit'om here's
          a twenty-five about he's probably where write lots of them tembelit. lo ta'avir
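
For readers who want to try something similar, here is a minimal Python sketch of a character-level n-gram Markov chain of the kind described above. It is not the code used for the sample; the function names `train_char_ngram` and `sample_text` are made up for illustration, and `toy_corpus` is a placeholder string, not BilingBank data. With n=5, each character is drawn conditioned on the four characters before it.

```python
import random
from collections import Counter, defaultdict


def train_char_ngram(text, n=5):
    """For each (n-1)-character context, count which characters follow it."""
    counts = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        context = text[i:i + n - 1]
        next_char = text[i + n - 1]
        counts[context][next_char] += 1
    return counts


def sample_text(counts, length, n=5, seed=None):
    """Generate `length` characters by repeatedly sampling the next character."""
    rng = random.Random(seed)
    contexts = sorted(counts)          # sorted for reproducibility with a seed
    context = rng.choice(contexts)     # start from a random observed context
    out = list(context)
    while len(out) < length:
        followers = counts.get(context)
        if not followers:
            # Context never seen with a continuation: restart from a known one.
            context = rng.choice(contexts)
            out.extend(context)
            continue
        chars = list(followers)
        weights = [followers[c] for c in chars]
        out.append(rng.choices(chars, weights=weights, k=1)[0])
        context = "".join(out[-(n - 1):])
    return "".join(out)[:length]


# Placeholder text (an assumption for this sketch); substitute a real corpus.
toy_corpus = "the cat sat on the mat. the dog sat on the log. " * 20
model = train_char_ngram(toy_corpus, n=5)
print(sample_text(model, 80, n=5, seed=1))
```

Trained on a transcript with frequent switches, a model like this can drift from one language to the other whenever a four-character context occurs in both, which is presumably why the sample above mixes English and Hebrew within a single line.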
