CyTOF data: public availability of datasets

<<

mleipold

Guru

Posts: 2256

Joined: Fri Nov 01, 2013 5:30 pm

Location: Stanford HIMC, CA, USA

Post Wed Feb 20, 2019 4:46 pm

CyTOF data: public availability of datasets

Hi all,

Between writing the recent review article with Lars Olsen and marking the 5-year anniversary of Cytoforum, I've spent a lot of the last several months reviewing the CyTOF literature.

This also goes back to previous discussion threads about CyTOF datasets (viewtopic.php?f=3&t=874&p=2529&hilit=cytof+dataset#p2529 ; viewtopic.php?f=3&t=1047&p=3650&hilit=Google+Form#p3650) and CyTOF analysis algorithms (viewtopic.php?f=3&t=1087&p=3230&hilit=analysis+tools#p3230)


I have attached a PPT of some Cytoforum-based analysis I completed recently on the subject.


As some of you may have noticed, I've recently started putting explicit statements about dataset availability in my Literature posts when I find a new CyTOF paper. This is because, as the review and the threads linked above showed, most people were *not* posting their new datasets.


Additionally, we found that of those datasets which *were* posted, only a handful were *actually* being reused (for comparisons with newly generated data, or for algorithm development in computational-only articles). Specifically, 5-7 datasets accounted for roughly half of all dataset reuse.

To be clear: I'm not saying that these aren't great datasets, but I do think there's a potential issue when they're about the only ones being used to generate new analysis algorithms. Not every dataset is the same, and not every assay type (surface vs phospho vs ICS, for example) is the same. How rare the cell populations of interest are is one big difference, for instance.


There's a range of places where people do post datasets. Immport, FlowRepository, and Cytobank are probably the big 3, but there are also datasets on personal websites, institutional websites (like https://library.stanford.edu/research/s ... repository, used for the recent 2019-Good et al), Zenodo, Mendeley, Figshare, and a few others.

In some cases, this raises other issues:
1. Link rot: for example, Dana Pe'er's original Columbia website no longer exists, and the links have not been forwarded (viewtopic.php?f=3&t=1007&p=3560&hilit=analysis+tools#p3560). Therefore, datasets like Bendall 2014 Cell (Wanderlust paper) aren't readily available.
- institutional websites would hopefully be slightly more resistant to this, as they're often managed by the institution's library system (who know a thing or two about archiving....)

2. Actual availability: the paper may give a Cytobank or FlowRepository accession number, but the dataset is not *actually* public.

In the case of FlowRepository, by design, once a dataset is made Public, the owner *cannot* switch it back to Private without the direct involvement of the FlowRepository team.

However, Cytobank allows users to switch back and forth (search.php?keywords=Problems+accessing+Cytobank-stored+datasets%3F), which to me defeats the purpose of posting the link in the first place.

Immport does release permanently accessible data, but only quarterly for some reason. I have heard from them that they *can* make a specific dataset available faster than that if necessary. However, there's an issue when a paper is released in October and the dataset doesn't become accessible until January (as happened with Kronstad et al).


In talking with some researchers here at Stanford, the reasons for not posting datasets fall into a few major categories (more than one may apply in a given case):
1. I didn't know that we *could* upload....people do that?
2. It's a pain to upload (particularly Immport).
3. We're not *required* to upload, so we won't bother.
4. We're not *required* to upload, and we're not done analyzing our data, so we won't.

In my opinion, none of these are persuasive. Various (free) repositories have existed since at least 2014. Additionally, it's now basically required for RNAseq data to be released in order to get a paper published: CyTOF data (or, really *any* data) should be the same. I know some funding agencies require this (Gates, Parker, Chan-Zuckerberg, NIH at the end of the grant), but in my experience enforcement is lacking.

Immport is often a pain, but from talking with their staff, they're switching more and more from the original "users are required to do everything themselves" to giving significant help with the uploads.....I've even heard of cases where a user sent them the Zipped data and Immport staff did most of the upload.

FlowRepository is far easier to upload to: there are fewer required fields. Annotation is always a good thing of course, but the annotation section can be a pain. In most of my uploads, I've ignored that in the "FCS files" section and just uploaded a Word or Excel or Text document for annotation. The worst that happens is that you get a bad MiFlowCyt score.

Zenodo, Mendeley, and Figshare will take just about any kind of file you want to upload. They also usually generate DOIs, which should be stable against link rot.
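
To illustrate why a DOI is more resistant to link rot than a direct URL: the DOI is resolved through https://doi.org/, so the citation stays the same even if the hosting site reorganizes its pages. Here is a minimal Python sketch that turns a DOI string into its resolver link. The function name and the example DOI in the comment are hypothetical, and the pattern check is only a rough sanity test, not a full DOI validator.

```python
import re

# Rough shape of a DOI: "10.", a registrant code, "/", then a suffix.
# This is a loose sanity check, not an authoritative DOI grammar.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes people commonly paste along with the DOI itself.
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_to_url(doi: str) -> str:
    """Return the https://doi.org/ resolver link for a DOI string."""
    cleaned = doi.strip()
    for prefix in _PREFIXES:
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):].strip()
            break
    if not DOI_PATTERN.match(cleaned):
        raise ValueError(f"does not look like a DOI: {doi!r}")
    return "https://doi.org/" + cleaned


if __name__ == "__main__":
    # Hypothetical placeholder DOI, used only to show the call.
    print(doi_to_url("doi:10.5281/zenodo.0000000"))
```

Citing the doi.org form in a paper, rather than the repository's landing-page URL, is the part that protects against the kind of link rot described above.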


In summary, in my personal opinion: every CyTOF paper published with new data should have that data deposited in a freely accessible online repository at the time of publication. My preference would be a repository resistant to Link Rot, and ideally one where a Public dataset cannot be re-set to Private.

Here are the links for many of the repositories mentioned earlier:

Immport: https://www.immport.org/shared/home
FlowRepository: http://flowrepository.org/
Mendeley: https://data.mendeley.com/
Zenodo: http://help.zenodo.org/features/
Figshare: https://figshare.com/features/


Mike
Attachments
Datasets and reuse-for Cytoforum.pptx
(711.82 KiB) Downloaded 150 times
<<

adriansmith

Participant

Posts: 5

Joined: Thu Dec 19, 2013 12:23 am

Post Wed Feb 20, 2019 8:52 pm

Re: CyTOF data: public availability of datasets

Hi Mike,

Thanks for doing this - I think it is really important work and as you say has implications far wider than just CyTOF data.

I would strongly encourage you to distribute this much more widely (eg Purdue mailing list) and also try to present it at CYTO if at all possible.

Regards,

Adrian
<<

mleipold

Post Wed Feb 20, 2019 9:19 pm

Re: CyTOF data: public availability of datasets

Hi Adrian,

I actually never signed up for the Purdue list (I don't do any regular fluorescence flow), so in the interest of time, feel free to cross-post a link to this thread there. I did cross-post it to my LinkedIn, where some of my earlier CyTOF posts have gotten responses.

I do agree that with Auroras, Symphonies, and other high-D fluorescence flow cytometers, the same issues apply around dataset availability. And with online free repositories (with various stringencies of reporting the metadata), the barrier to posting the data seems low.


As I said earlier, funding agency/journal enforcement is lacking. I haven't done an exhaustive search on journal guidelines and requirements, so I can't directly comment on which ones require and which ones only "suggest" data deposition.

Cytometry A's Author Information states: "Data Repository. We require that data suitable for independent reproduction of the result be made available prior to publication, and we recommend that all authors submit their data files to http://flowrepository.org prior to submission."


However, I've posted at least one recent CyTOF paper from Cytometry A which did not include any statement about data availability (no accession number, etc). Reading the statement above, it doesn't formally make "Public" data deposition a *requirement* of publication; I guess it could be interpreted as being relevant only during the review process (FlowRepository can give a reviewer's link, for example).

Please note: I don't mean to pick on Cytometry A alone, but only as one example of a much broader topic.


Mike
<<

kunicki

Contributor

Posts: 36

Joined: Thu Apr 13, 2017 8:46 pm

Post Wed Feb 20, 2019 11:22 pm

Re: CyTOF data: public availability of datasets

I second Adrian's remark. It would be great to see you present this at CYTO!

Thank you for this Mike!

Matthew
<<

PaulNL

Participant

Posts: 5

Joined: Thu Sep 13, 2018 9:54 am

Post Tue Feb 26, 2019 8:26 am

Re: CyTOF data: public availability of datasets

It seems that searching for accession IDs in FlowRepository is broken. For some IDs, a search returns no results even though the ID does exist.

Example:

When I search for FR‐FCM‐ZYPY I get "no results found"; however, it exists here:

https://flowrepository.org/id/FR-FCM-ZYPY
<<

mleipold

Post Tue Feb 26, 2019 2:27 pm

Re: CyTOF data: public availability of datasets

I've run into that before....sometimes the search doesn't work properly. I always check by typing in the URL with the accession number at the end, rather than searching by the accession number.

Usually that will tell you whether the dataset actually exists (regardless of whether you have access to it).

I did that URL direct check for the 2019-Palgen et al dataset, so I'm not sure why it didn't come up if it was created months ago.
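
The URL check described above can be sketched in Python. This assumes the https://flowrepository.org/id/<accession> pattern from the link posted earlier in this thread; the function names are hypothetical. It also replaces Unicode dashes with plain hyphens, since accession IDs copied out of a PDF sometimes carry them. That is only one possible reason a search could miss, not a confirmed cause of the problem reported here. What status code FlowRepository returns for a missing or private dataset is not something I have verified, so `fetch_status` just reports the code, and whether the server accepts HEAD requests is also an assumption.

```python
import urllib.error
import urllib.request

# Unicode dash characters that can replace "-" when text is copied from a PDF.
_DASHES = "\u2010\u2011\u2012\u2013\u2014\u2212"
_TO_ASCII_HYPHEN = str.maketrans({ch: "-" for ch in _DASHES})

# URL pattern taken from the link posted earlier in this thread.
BASE_URL = "https://flowrepository.org/id/"


def normalize_accession(raw: str) -> str:
    """Trim whitespace, convert Unicode dashes to '-', and upper-case the ID."""
    return raw.strip().translate(_TO_ASCII_HYPHEN).upper()


def accession_url(raw: str) -> str:
    """Build the direct URL for an accession number."""
    return BASE_URL + normalize_accession(raw)


def fetch_status(url: str, timeout: float = 10.0):
    """Return the HTTP status code for url, or None if the site is unreachable.

    Interpreting the code (exists, private, missing) is left to the caller.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except OSError:
        return None


if __name__ == "__main__":
    url = accession_url("FR\u2010FCM\u2010ZYPY")
    print(url, fetch_status(url))
```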
<<

mleipold

Post Mon Mar 04, 2019 6:33 pm

Re: CyTOF data: public availability of datasets

Hi all,

I've broken this analysis down a bit more, by some major cytometry journals or journal families:
Cytometry A
Science family
Nature family
Cell Press family

Please see attached PDF for the breakdown. I looked at all years (2008-end of Feb 2019). However, since some of those years are before Immport, FlowRepository, and other repositories existed, I also analyzed just 2017 and just 2018.


Mike
Attachments
Literature-Cytoforum-new dataset link working-by journal family.pdf
(143.23 KiB) Downloaded 142 times
<<

mleipold

Post Fri Mar 08, 2019 4:38 pm

Re: CyTOF data: public availability of datasets

Hi all,

Since RNAseq data deposition is *required* by most major journals (Nature, etc), I took a look at CyTOF papers which *also* contain RNAseq data.

1. CyTOF papers
2. 2017-2019
3. Major journals only: Nature family, Science family, Cell Press family, and Cytometry A.
4. RNAseq, scRNAseq. Also counted qPCR, TCRseq, etc, since they also usually get accession numbers.

Please see attached PDF:
CyTOF papers in total: N=113
CyTOF papers also performing RNAseq: total N=47

The second N is smaller mainly because Cytometry A had no RNAseq-containing CyTOF papers. For the other journals, a bit over 50% of Nature journal and Cell Press journal articles had both CyTOF and RNAseq assays. For Science journals, only about 25% of CyTOF papers also performed RNAseq.

However: of those 47 articles which perform *both* assays:
- Over 50% of them (N=27) only had RNAseq accessions (CyTOF accession=no, RNAseq accession=yes).
- Only ~1/3 (N=16) had *both* CyTOF accessions and RNAseq accessions
- Only 1 had the CyTOF accession but no RNAseq accession
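
For anyone who wants to check the fractions, here is the arithmetic as a short Python sketch using only the counts given in this post. The three listed categories add up to 44 of the 47 papers; the remainder is computed below, and what those papers represent (presumably neither accession) is an assumption, since the post does not say.

```python
# Counts taken directly from the post above.
total_cytof_papers = 113
papers_with_both_assays = 47

accession_counts = {
    "RNAseq accession only": 27,
    "both CyTOF and RNAseq accessions": 16,
    "CyTOF accession only": 1,
}

# Papers not covered by the three listed categories (interpretation is an assumption).
unlisted = papers_with_both_assays - sum(accession_counts.values())


def pct(part: int, whole: int) -> float:
    """Percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Share of the 47 dual-assay papers falling in each category.
shares = {label: pct(n, papers_with_both_assays) for label, n in accession_counts.items()}

# Share of all 113 CyTOF papers that also performed RNAseq.
rnaseq_share_of_all = pct(papers_with_both_assays, total_cytof_papers)

if __name__ == "__main__":
    for label, share in shares.items():
        print(f"{label}: {share}%")
    print(f"not in any listed category: {unlisted} ({pct(unlisted, papers_with_both_assays)}%)")
    print(f"CyTOF papers also performing RNAseq: {rnaseq_share_of_all}%")
```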


Granted, this is only a subsample of all CyTOF papers, but it's consistent with my general observation about dataset availability across years and journals. I'd be interested in people's thoughts about why this is.


Mike
Attachments
Literature-CyTOF papers-RNAseq question-major journals.pdf
(87.97 KiB) Downloaded 115 times
